Codex Inside Claude Code. Subagents Inside Codex.
Two gaps, one tool
Claude Code has Task subagents. Opus 4.6 is a natural coordinator — it knows how to delegate, how to prompt, how to orchestrate multi-step pipelines. But it can only dispatch Claude. You can’t hand a job to Codex. You can’t reach OpenCode. The best prompt master in the game, locked inside its own ecosystem.
Codex has the opposite problem. It’s a precise executor — give it a strict task with high reasoning and it delivers surgical code changes. But it has no subagent system at all. No Task tool, no nested agents, no orchestration primitives. A brilliant worker with no way to delegate.
Two of the most powerful AI coding engines on the planet. Neither can talk to the other.
agent-mux fixes both. One CLI. One JSON contract. Any engine.
Why this matters
Each engine has a personality.
Codex 5.3 at high reasoning is the programmer in a suit — precise, by-the-book, will follow your spec to the letter. Codex 5.3 at xhigh is your top-tier auditor — reads code like a lawyer reads contracts. Opus 4.6 is the prompt master — it doesn’t just execute, it manages. It knows how to break a complex task into subtasks, pick the right worker for each, craft the prompt, and synthesize the result. Codex 5.3 Spark is a perfect Haiku replacement: blazingly fast, reliable, and it’s fun to launch swarms of them.
But the real reason you want all three in one pipeline: Claude and OpenAI models mode-collapse in roughly orthogonal directions. The blind spots don’t overlap. What Opus misses in a code review, Codex catches. What Codex over-optimizes, Opus questions. Run both — not for redundancy, but for coverage.
This isn’t a nice-to-have. Once you’ve seen a Codex audit catch a bug that three rounds of Claude review missed, you don’t go back to single-engine workflows.
The pipeline
Here’s what my actual workflow looks like.
My main Claude Code session is a thin coordinator. It doesn’t write code. It doesn’t grep through files. It plans, delegates, and synthesizes. When a complex task arrives — “take this private repo and turn it into a polished open-source artifact” — it spawns a Get Shit Done coordinator as a Task subagent. GSD lives in .claude/agents in the Claude Code setup and as a skill reference in the Codex setup. And yes! It’s Claude inside Claude inside Claude! Or Codex inside Claude inside Claude. And oh man it works.
GSD reads its own operational playbook, breaks the task into steps, and starts dispatching workers:
1. Opus plans the migration — what to extract, what to redact, what to restructure
2. Codex 5.3 high swarm executes — 3-4 workers in parallel, each handling a file group
3. Codex xhigh audits the result — reads every line like it's going to production
4. Fixes go back through Codex high
5. Opus does a final synthesis — checks coherence, writes the README, verifies links
The whole thing runs 30-60 minutes autonomously. You kick it off, go make coffee, come back to a working result. Not a draft. Not a “here’s what I’d suggest.” A committed, tested, audited artifact.
The key insight: with proper internal documentation and clear project structure, this lands on the first attempt more often than you’d expect. The skills carry the institutional knowledge — the workers don’t need a 500-word prompt because the playbook is injected at dispatch time.
agent-mux — the glue
The architecture is deliberately simple. One thin core handles everything engine-agnostic: CLI parsing, timeout enforcement, heartbeat loop, activity tracking, JSON assembly. Each engine lives behind an adapter — codex.ts, claude.ts, opencode.ts — implementing a single run() interface. The core never knows or cares what ran underneath.
It’s SDK-native. The Codex adapter uses @openai/codex-sdk directly — thread creation, streamed execution, sandbox control. The Claude adapter uses @anthropic-ai/claude-agent-sdk with the query() async generator. No shell wrappers, no screen-scraping CLI output. This means auth works the way each engine expects: Codex reads your OAuth tokens from ~/.codex/auth.json (the same device auth you already set up), and the Claude SDK handles its own device OAuth automatically. If you have API keys in your environment, those work too. Zero auth configuration on agent-mux’s side.
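To make the seam concrete, here is a minimal sketch of what that contract could look like (the type names are mine, not necessarily the actual agent-mux source):

// The uniform result every adapter must produce; mirrors the JSON contract below
interface ActivityLog {
  files_changed: string[];
  commands_run: string[];
  files_read: string[];
  mcp_calls: string[];
}

interface RunResult {
  success: boolean;
  engine: "codex" | "claude" | "opencode";
  response: string;
  timed_out: boolean;
  duration_ms: number;
  activity: ActivityLog;
}

interface RunOptions {
  effort?: "low" | "medium" | "high" | "xhigh";
  model?: string;
  signal: AbortSignal; // the core's timeout kill-switch
}

// codex.ts, claude.ts, opencode.ts each export one of these; the core never looks deeper
interface EngineAdapter {
  run(prompt: string, opts: RunOptions): Promise<RunResult>;
}

With the result shape shared, adding a fourth engine means writing one adapter, not touching the core.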
The invocation is one command:
# Codex — precise code changes, high reasoning
agent-mux --engine codex --reasoning high --effort high \
"Refactor auth module in src/auth/"
# Claude — architecture, open-ended synthesis
agent-mux --engine claude --effort high \
"Design the rollback strategy for the payments migration"
# OpenCode — third opinion, different model family entirely
agent-mux --engine opencode --model kimi \
"Review this patch and challenge the assumptions"
Three engines. Same interface. --engine is the only flag that has to change.
Every run — success, failure, timeout — returns the same JSON on stdout:
{
  "success": true,
  "engine": "codex",
  "response": "Refactored auth module. Split monolith into...",
  "timed_out": false,
  "duration_ms": 84231,
  "activity": {
    "files_changed": ["src/auth/client.ts", "src/auth/tokens.ts"],
    "commands_run": ["bun test"],
    "files_read": ["src/auth/types.ts"],
    "mcp_calls": []
  }
}
The activity field is quietly powerful. The calling coordinator doesn’t have to parse the response text to understand what happened — it gets a structured log of files changed, commands run, files read, and MCP calls made. When you’re running five workers in parallel and deciding what to do next, this is the difference between orchestration and guesswork.
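To make that concrete, here is a sketch of the coordinator side: fan out workers in parallel, parse the stdout contract, branch on the structured fields. It assumes agent-mux is on PATH; the second prompt is illustrative.

import { execFile } from "node:child_process";
import { promisify } from "node:util";

const exec = promisify(execFile);

// Dispatch two workers in parallel; each prints the same JSON contract on stdout
const raw = await Promise.all([
  exec("agent-mux", ["--engine", "codex", "--effort", "high", "Refactor auth module in src/auth/"]),
  exec("agent-mux", ["--engine", "claude", "--effort", "high", "Review the refactor plan"]),
]);
const results = raw.map((r) => JSON.parse(r.stdout)); // stdout carries only the final JSON

for (const result of results) {
  if (!result.success || result.timed_out) {
    // escalate: re-dispatch at higher effort, or route to an xhigh audit worker
  }
  console.log(result.activity.files_changed); // structured; no response-text parsing
}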
stdout is sacred — only the final JSON. Heartbeats go to stderr every 15 seconds, so they never enter the caller’s context window. Why heartbeats at all? Because when a Codex worker is refactoring a large module at --effort high, it can run for 20 minutes. Without a progress signal, you can’t tell the difference between “working” and “hung.” The heartbeat carries the last activity — [heartbeat] 45s — processing file changes — so the coordinator (or the human watching) knows the worker is alive. Timeouts are effort-scaled by default: low gets 2 minutes, high gets 20, xhigh gets 40. Hard process-level kills via AbortController — no silent hangs.
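Inside the core, that loop is a handful of primitives. A sketch of the shape, reusing the EngineAdapter from the earlier sketch (the real internals may differ):

// Effort-scaled timeout defaults, per the numbers above
const TIMEOUTS_MS = { low: 2 * 60_000, high: 20 * 60_000, xhigh: 40 * 60_000 } as const;

async function dispatch(adapter: EngineAdapter, prompt: string, effort: keyof typeof TIMEOUTS_MS) {
  const controller = new AbortController();
  const killer = setTimeout(() => controller.abort(), TIMEOUTS_MS[effort]); // hard kill, no silent hangs

  let lastActivity = "starting"; // adapters would update this as events stream in
  const started = Date.now();
  const heartbeat = setInterval(() => {
    const secs = Math.round((Date.now() - started) / 1000);
    process.stderr.write(`[heartbeat] ${secs}s — ${lastActivity}\n`); // stderr only, never stdout
  }, 15_000);

  try {
    const result = await adapter.run(prompt, { effort, signal: controller.signal });
    process.stdout.write(JSON.stringify(result)); // the final JSON, nothing else
  } finally {
    clearTimeout(killer);
    clearInterval(heartbeat);
  }
}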
Coordinators — subagents for Codex
In Claude Code, orchestration is native. You spawn a Task subagent, give it a complex goal, and it breaks it down, dispatches agent-mux workers, synthesizes results. The 10x pattern from the pipeline section — that’s Claude Code’s home turf.
But what about Codex? What if you want the same multi-step orchestration — plan, dispatch, audit, fix — running on OpenAI’s engine? Codex doesn’t just lack nested agents — it ships with no subagents at all. No Task tool, no delegation primitives, nothing.
The --coordinator flag fixes this. A Codex main session spawns Opus 4.6 as the GSD coordinator via agent-mux — and now Opus is running inside Codex, with full orchestration powers. From there, Opus dispatches whatever workers it wants: Codex 5.3 high for execution, Codex Spark swarms for parallel grunt work, another Claude for a second opinion. Codex gets a brain. The brain gets an army.
# Codex running a full coordinator pipeline
agent-mux --engine codex --coordinator get-shit-done-agent \
--effort xhigh --full \
"Migrate the auth module to the new API, test everything, audit the result"
The GSD coordinator is the reference implementation. It reads its own playbook, decides which engine fits each subtask, and — this is where the multiplier kicks in — selects which skills and MCP servers to inject per worker. A browser automation task gets --skill browser-ops --browser. A research task gets --skill web-search. A code refactor gets --skill react --skill test-writer. The coordinator doesn’t just pick the right engine — it assembles the right toolkit for each dispatch. Engine selection is 10x. Engine + skill + MCP selection per task is 69x.
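As a sketch, that per-task toolkit assembly can be as small as a lookup table (the task keys and the mapping are illustrative):

// Each task type gets its own flag bundle, using the combos above
const toolkits: Record<string, string[]> = {
  browser: ["--skill", "browser-ops", "--browser"],
  research: ["--skill", "web-search"],
  refactor: ["--skill", "react", "--skill", "test-writer"],
};

// GSD picks the engine, then bolts the toolkit onto the dispatch
const argv = ["--engine", "codex", "--effort", "high", ...toolkits.refactor, "Refactor the table components"];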
The coordinator’s frontmatter is the configuration layer:
---
skills: [web-search, browser-ops, pratchett-read]
model: claude-opus-4-6
allowedTools: [Bash, Read, Write, Edit, Glob, Grep]
---
Skills in frontmatter auto-merge with --skill flags from the CLI. The model is a default — overridable at invocation. One persona definition, multiple engines. The same GSD playbook runs on Claude or Codex, adapting to each engine’s strengths while keeping the orchestration logic identical. Your main session stays a holy coordinator — thin, context-preserved, decision-making only. GSD does the sweating.
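The merge itself is presumably just an order-preserving union; a sketch:

// Frontmatter skills + CLI --skill flags, deduped (values from the example above)
const fromFrontmatter = ["web-search", "browser-ops", "pratchett-read"];
const fromCli = ["browser-ops", "test-writer"];
const skills = [...new Set([...fromFrontmatter, ...fromCli])];
// → ["web-search", "browser-ops", "pratchett-read", "test-writer"]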
Skills > Prompts
The usual way to brief an AI worker: write a wall of text explaining your project conventions, your file structure, your naming rules, your testing expectations. Every dispatch, you repeat yourself. The context budget bleeds. The prompts drift.
Skills flip this. --skill browser-ops injects a full operational playbook — not a prompt, but a decision tree with failure recovery, anti-bot handling, and session management patterns. The worker reads its own briefing. The coordinator just says what to do.
agent-mux --engine codex --skill browser-ops --skill web-search \
"Find the pricing page for Acme Corp, extract the enterprise tier details"
The --skill flag is repeatable. Stack as many as the task needs. Each skill resolves to a SKILL.md in your skills directory — works the same whether the caller is Claude Code or Codex. And here’s the thing that makes skills fundamentally different from prompts: a skill can be a self-contained toolbox with batteries included. The SKILL.md carries the operational knowledge — decision trees, failure recovery, edge case handling. The references/ directory carries supporting docs the worker might need. The scripts/ folder carries executable tools that are auto-added to PATH at dispatch time. The worker gets the knowledge, the context, and the tools in one atomic injection.
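Mechanically, that injection can be tiny. A sketch, assuming skills resolve by directory name (the function and paths are illustrative, not the agent-mux source):

import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

function injectSkill(name: string, skillsDir: string, env: Record<string, string | undefined>) {
  const dir = join(skillsDir, name);
  const playbook = readFileSync(join(dir, "SKILL.md"), "utf8"); // the operational knowledge
  const scripts = join(dir, "scripts");
  if (existsSync(scripts)) {
    env.PATH = env.PATH ? `${scripts}:${env.PATH}` : scripts; // bundled tools become plain shell commands
  }
  return playbook; // prepended to the worker's briefing at dispatch time
}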
A prompt says “search the web.” A skill says “search the web, and when Cloudflare blocks you fall back to Jina reader, and when Jina times out try duckduckgo-search with WebFetch, and here’s the exact extraction command for each tier, and here’s a CLI script that handles all three fallbacks so you just call web-fetch and it figures it out.”
This is the architecture opinion baked into agent-mux: prompts are one-shot. Skills encode judgment. A skill with bundled scripts and references is more powerful than an MCP server — it gives the worker not just tools, but the operational knowledge of when and how to use them.
Here’s the thing about MCP: every server you connect adds its tool schemas to the model’s context window. Five MCP servers and you’ve burned thousands of tokens just describing what tools exist — before the worker has even started thinking about the task. Skills don’t have this problem. The SKILL.md is injected as focused operational knowledge — not a list of function signatures, but a decision tree of what to do and when. The bundled CLI scripts sit on PATH — the worker calls them like any shell command, no tool schema overhead. As the OpenClaw founder put it: CLI-first is the trend. The agent ecosystem is converging on composable CLI tools over heavyweight server protocols. Skills with bundled scripts fit this trajectory naturally — they’re just markdown and executables, no daemon, no socket, no schema registry.
The coordinator decides WHAT needs to happen and selects the right skills. The skills tell the worker HOW — with all the institutional knowledge and tooling it needs to execute without asking follow-up questions.
But skills aren’t just for execution. You can inject thinking protocols — first principles reasoning à la Elon Musk, Karpathy-style assumptions checks, pre-mortem inversion logic. A --skill think-protocol doesn’t make the worker do a task — it changes how the worker thinks before it does the task. Stack a thinking skill with an execution skill and the worker doesn’t just code — it grounds, simplifies, verifies, then codes. The GSD coordinator does this by default: planning workers get thinking skills, execution workers get domain skills, audit workers get both. It’s not just a coding pipeline — it’s a full reasoning pipeline end to end.
I keep publishing my humble collection at fieldwork-skills — browser automation, web search, Google Workspace ops, vault secret management, and more. Each one is extracted from real daily usage and encodes the friction I’ve already walked through so the next worker doesn’t have to.
So
Unlike the X clickbait telling you it took 500 hours and $10k to set up the ultimate Claude Code / Codex / OpenClaw / whatever workflow, this setup converged through two months of daily trial and error. Shell wrappers, MCP bridges, custom SDK scripts, three rewrites of the dispatch layer. I’m not claiming it’s ideal — it works for me now. But times are changing fast. Let’s see what the Claude Code and Codex teams ship next. In the meantime I’ll keep updating and improving both the agent-swarm engine and my humble skills collection.
One of my agents actually managed to sign up on Reddit end to end today — created an account, verified email, the whole flow. He’ll help me distribute this post over there. All orchestrated through GSD. Proper inception.
P.S. The repos: agent-mux for the dispatch layer, fieldwork-skills for the skills and the GSD coordinator. Both Apache 2.0. Both extracted from daily usage.
P.P.S. I’ve just realized that agent-mux doesn’t just give you agents inside agents inside a session — you can go deeper if you want to. Let’s see who cooks up something insane here. Agents inside agents inside agents inside agents inside agents… (claude for more claude vibes)