Codex crossed 1 million weekly active users last week. OpenAI was loud about it — blog post, tweet thread, the works. And look, the milestone is real: GPT-5.4 driving it, Figma MCP integration, cloud-native parallel execution. It’s a legitimately capable tool.
I’ve been using Claude Code as my primary coding assistant for about four months. It’s baked into my workflow. When Codex’s numbers dropped, my reaction was not “which one should I use?” but “let me actually test this properly.”
So I spent a week running the same tasks through both. Here’s what I found.
Table of Contents
- The Setup
- Speed: Codex Wins, Mostly
- Code Quality: Context Is Everything
- Context Understanding: Claude Code’s Real Advantage
- GitHub Integration
- Pricing (The Actual Numbers)
- Figma MCP: Codex’s Killer Feature
- Where Each Tool Wins
- What I’m Actually Using
The Setup
Tasks I ran through both tools:
- Add pagination to an existing REST API endpoint
- Write a migration script for a schema change
- Review a PR with 12 files changed
- Debug a failing test suite (9 failures, mixed causes)
- Refactor a 400-line class into smaller components
- Generate OpenAPI documentation from existing code
Same codebase, same context. I measured wall-clock time and did my own quality evaluation on the output.
This is not a scientific benchmark. It’s one engineer’s week of real work. Treat it accordingly.
Speed: Codex Wins, Mostly
Codex is fast. The cloud-native parallel execution is not marketing — when you give it a task that can be broken into sub-tasks, it actually runs them concurrently. The pagination endpoint took Codex about 90 seconds wall-clock. Claude Code took about 4 minutes with a few back-and-forth exchanges.
For the PR review, Codex gave me a structured summary in under 2 minutes. Useful, well-organized.
But here’s the caveat: speed on a well-scoped task is not the same as speed to a good outcome. The migration script Codex generated in 45 seconds had three subtle issues that would have caused data loss on large tables. I caught them in review. Claude Code took longer but asked clarifying questions about batch size and rollback behavior before generating the script. The output needed fewer corrections.
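To make the batch-size and rollback questions concrete, here's a minimal sketch of the shape Claude Code landed on after its clarifying questions. The table name, column, and batch size are hypothetical stand-ins (the real script isn't reproduced here), and I'm using SQLite so the sketch is self-contained; the point is committing per batch instead of one giant transaction.

```python
import sqlite3

BATCH_SIZE = 500  # hypothetical value; this is the number Claude Code asked about


def backfill_status(conn: sqlite3.Connection) -> int:
    """Backfill a new `status` column in batches.

    Committing per batch keeps lock time bounded on large tables and makes
    the script resumable: a failure mid-run loses only the current batch,
    not everything, which is the data-loss class of bug I caught in review.
    """
    total = 0
    while True:
        cur = conn.execute(
            "SELECT id FROM orders WHERE status IS NULL LIMIT ?", (BATCH_SIZE,)
        )
        ids = [row[0] for row in cur.fetchall()]
        if not ids:
            break  # nothing left to backfill
        try:
            conn.executemany(
                "UPDATE orders SET status = 'pending' WHERE id = ?",
                [(i,) for i in ids],
            )
            conn.commit()  # commit per batch: bounded, resumable progress
        except sqlite3.Error:
            conn.rollback()  # roll back only the failed batch, then surface it
            raise
        total += len(ids)
    return total
```

The 45-second Codex version skipped the batching entirely, which is fine until the table is big enough that the single transaction times out halfway through.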
Raw speed: Codex. Time to a correct output: more mixed.
Code Quality: Context Is Everything
Both tools generate competent code. The difference shows up at the edges.
For the refactoring task — 400 lines into smaller components — Codex produced a clean split with good naming. But it didn’t notice that one of the methods had a side effect that made it dangerous to extract without a specific pattern change. Claude Code flagged it unprompted.
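The class itself isn't mine to share, but here's a condensed, hypothetical version of the pattern that made the extraction dangerous: a method that reads like a pure computation but also writes instance state on the side.

```python
class ReportBuilder:
    """Hypothetical stand-in for the 400-line class, reduced to the trap."""

    def __init__(self) -> None:
        self._cache: dict[str, int] = {}

    def summarize(self, items: list[int]) -> int:
        # Looks like a pure computation, so it seems safe to extract
        # into a free function or a new component...
        total = sum(items)
        # ...but it also mutates self._cache, and other methods read
        # from that cache. Extract the computation alone and the cache
        # write silently disappears.
        self._cache["last_total"] = total
        return total
```

A naive extraction keeps the return value and drops the cache write, so everything still type-checks while callers of `_cache` quietly break. That's the flag Claude Code raised unprompted: split the computation from the mutation first, then extract.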
For the test suite debugging task, Claude Code was notably better. It held multiple failing tests in context simultaneously and identified that three of the nine failures had a common root cause (a shared fixture being mutated). Codex diagnosed each failure independently, which was less efficient.
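For the curious, here's a stripped-down reconstruction of that root cause. The names are made up, but the mechanism is exactly this: a fixture that hands every test the same dict, so one test's mutation shows up as three unrelated-looking failures later in the run.

```python
# Hypothetical reconstruction of the shared-fixture bug.
_DEFAULTS = {"retries": 3, "timeout": 10}

# One module-level dict that the buggy fixture hands to every test.
SHARED = {"retries": 3, "timeout": 10}


def buggy_fixture() -> dict:
    return SHARED  # BUG: every test gets the same mutable object


def fixed_fixture() -> dict:
    return dict(_DEFAULTS)  # fresh copy per test; mutations stay local
```

Claude Code connected the three failures to this one fixture; Codex correctly diagnosed each failure in isolation but never joined them, so its fix list had three entries where one would do.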
For simple, bounded tasks with clear specs: quality is close. For complex, interdependent problems: Claude Code’s reasoning depth shows.
Context Understanding: Claude Code’s Real Advantage
This is the clearest gap I found.
Claude Code’s local repo context is genuinely different from Codex’s cloud approach. When I asked Claude Code to add pagination, it looked at how we’d done pagination elsewhere in the codebase and matched the pattern — including a non-obvious custom cursor implementation we use instead of offset-based paging. I didn’t tell it to do this. It just did.
Codex generated offset-based pagination. Correct in the abstract, wrong for our codebase. I had to tell it about the cursor pattern and have it redo the work.
For greenfield projects or isolated tasks, this doesn’t matter. For existing codebases with established patterns, it matters a lot. I’ve written about this kind of context problem before — switching costs in AI tools often come from exactly this kind of context mismatch.
GitHub Integration
Codex’s automatic PR creation is genuinely useful. You describe a task, it executes, it opens a PR. The PR description is decent — not amazing, but workable. For a certain class of well-defined tasks (backfill a field, update a dependency, fix a linting issue), this flow is actually smooth.
Claude Code’s GitHub integration is more manual. You’re running it locally, you’re committing, you’re pushing. It’ll help you write a commit message and PR description if you ask. But it’s not automated.
If your team has a lot of small, well-defined tasks and you want them handled asynchronously: Codex. If you want a thinking partner for complex work that then helps you commit: Claude Code.
I’ve found the AI PR review bot comparison useful context here — the question of automated PR creation is closely tied to how you think about AI in your review workflow.
Pricing (The Actual Numbers)
Let’s be concrete because “expensive” and “cheap” are meaningless without context.
Codex:
- Included in ChatGPT Pro ($20/month) with usage limits
- ChatGPT Team/Enterprise tiers get higher limits
- API access via GPT-5.4: ~$2.50 per million input tokens, ~$10 per million output tokens
- Parallel execution can multiply your token usage
Claude Code:
- Standalone CLI, billed through Anthropic API
- Claude Sonnet 4: ~$3 per million input tokens, ~$15 per million output tokens
- Claude Opus: ~$15 per million input tokens, ~$75 per million output tokens (for heavy reasoning tasks)
- No built-in usage cap — your bill scales with usage
My actual spend for the test week: roughly $18 in Claude Code API calls, roughly $8 in Codex (above my existing ChatGPT Pro subscription). Codex was cheaper for that week because I was doing a lot of parallel smaller tasks where its speed efficiency offset the token cost.
For a small team doing sustained daily use: estimate $150-300/month per developer depending on task volume and complexity. Codex has a cost ceiling if you’re on a team plan. Claude Code does not — you can spend as much as your work requires.
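If you want to sanity-check your own bill, the per-task math is simple. The token counts below are illustrative assumptions (a mid-sized task with a lot of repo context), not measurements from my week; the rates are the per-million-token numbers listed above.

```python
def task_cost(input_tokens: int, output_tokens: int,
              in_rate: float, out_rate: float) -> float:
    """Cost in USD for one task. Rates are USD per million tokens."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000


# Assumed mid-sized task: 60k tokens of context in, 8k tokens out.
codex = task_cost(60_000, 8_000, 2.50, 10.00)   # GPT-5.4 rates  -> $0.23
sonnet = task_cost(60_000, 8_000, 3.00, 15.00)  # Sonnet 4 rates -> $0.30
opus = task_cost(60_000, 8_000, 15.00, 75.00)   # Opus rates     -> $1.50
```

Two things fall out of this: per-task costs look tiny until you multiply by dozens of tasks a day, and reaching for Opus-class models by default roughly 5-6x's your spend. That's the gap between an $18 week and a $300 month.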
The credit pricing trap is real. I’ve covered this before in AI coding tool credit pricing — make sure you understand what you’re actually paying per task before committing to either tool at scale.
Figma MCP: Codex’s Killer Feature
I’d be leaving something out if I didn’t talk about this.
The Figma MCP integration in Codex is the most interesting thing either tool has shipped recently. You connect your Figma workspace, point Codex at a design, and it generates UI components. Not “here’s some HTML that vaguely resembles this” — it reads the design tokens, the component structure, the spacing.
I tested it on a straightforward card component design. The output was about 85% correct with correct Tailwind classes and accurate spacing. I needed to fix a couple of hover states and one responsive breakpoint. For a design-to-code workflow, this is legitimately faster than anything I’ve used.
Claude Code has no equivalent. If your team works from Figma designs and that’s a significant chunk of your frontend work, Codex has a real advantage here.
Where Each Tool Wins
Use Codex when:
- You have well-defined, parallelizable tasks
- You want automated PR creation for smaller changes
- Your team works heavily from Figma designs
- You’re on ChatGPT Pro/Team and want to maximize existing spend
- You need fast turnaround and can verify output quickly
Use Claude Code when:
- You’re working in an existing codebase with established patterns
- The problem is complex, interdependent, or requires multi-step reasoning
- You want a code review partner that holds context across the review
- You’re debugging something non-obvious
- You need the AI to push back and ask good clarifying questions
The honest version of this: they’re optimized for different parts of the workflow. Codex is better at pipeline tasks — defined input, defined output, do it fast. Claude Code is better at exploration tasks — the ones where the question itself is still being figured out.
What I’m Actually Using
Both. Which is annoying to say, but it’s true.
Codex handles my dependency updates, small refactors, and linting fixes. It creates the PRs, I review them, I merge them. Fast and low-friction.
Claude Code is what I reach for when I’m actually thinking through a problem — architecture questions, debugging weird behavior, reviewing a PR where something feels off but I can’t articulate why yet.
The switching cost between them is real — I’ve written about tool switching costs and this is exactly the scenario. But for now, the two-tool approach is giving me better outcomes than either alone.
If I had to pick one: Claude Code, for my current mix of work (complex codebase, heavy reasoning tasks, lots of code review). But if I were leading a frontend team shipping lots of greenfield features from Figma designs, I’d probably flip that.
Neither tool is winning permanently. Both are moving fast. Check back in six months.
For a broader comparison that includes Copilot, see Copilot vs Cursor vs Claude Code. For tips on getting the most out of Claude Code specifically, I wrote up Claude Code tips and workflows last month.