I was mid-refactor on a Thursday afternoon when my team’s Slack exploded. “GPT-5.4 just dropped.” Three engineers immediately abandoned their current Claude Code sessions to try it. By Friday morning, two of them had switched back. The third is still undecided.

That pretty much captures the GPT-5.4 experience so far — genuinely impressive in specific areas, surprisingly uneven in others, and nowhere near the “Claude killer” that Twitter proclaimed it to be within the first hour.

What Actually Changed

GPT-5.4 is OpenAI’s first model that merges the coding specialization of GPT-5.3-Codex with general reasoning into a single system. Before this, you had to pick: coding model or thinking model. Now you don’t.

The headline numbers:

  • SWE-Bench Pro: 57.7% (vs Claude Opus 4.6’s ~45.9%) — this is the harder, contamination-resistant benchmark
  • SWE-Bench Verified: ~80.0% (vs Opus 4.6’s 80.8%) — essentially tied on the standard version
  • OSWorld-Verified: 75.0% — surpassing the human baseline of 72.4% on desktop navigation
  • 83% match rate against human professionals across 44 occupations
  • 33% fewer hallucinated claims compared to GPT-5.2

That SWE-Bench Pro gap is the most interesting number. The benchmark was specifically designed to resist benchmark gaming, so a nearly 12-point lead there actually means something.

Where It Genuinely Impressed Me

Computer Use That Actually Works

This is the real story, not the coding benchmarks. GPT-5.4 is the first general-purpose model with native computer use — it can operate browsers and desktop apps through both Playwright code and direct mouse/keyboard commands from screenshots.

I tested this on a real scenario: debugging a React component where the visual output didn’t match the expected layout. The model could look at the rendered page, identify the CSS issue, write the fix, and verify the result visually — all in one loop. No separate tool, no screenshot-paste-analyze workflow.

It’s not magic. It failed on complex drag-and-drop interactions and got confused by deeply nested iframes. But for the 80% case of “look at this UI and tell me what’s wrong,” it’s a real time-saver.
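The loop described above is simple to sketch. The following is a hypothetical outline of that observe-diagnose-patch-verify cycle, not OpenAI’s actual agent code — `take_screenshot`, `diagnose_layout`, and `apply_css_fix` are stand-in names for illustration only:

```python
from typing import Optional

def take_screenshot(url: str) -> bytes:
    """Stub: capture the rendered page (in practice, via Playwright)."""
    return b"fake-png-bytes"

def diagnose_layout(screenshot: bytes, expected: str) -> Optional[str]:
    """Stub: compare the render against the expected layout and
    return a proposed CSS fix, or None if it already matches."""
    return None  # pretend the layout matches on the first look

def apply_css_fix(fix: str) -> None:
    """Stub: write the proposed fix into the component's stylesheet."""

def debug_ui(url: str, expected: str, max_iters: int = 3) -> bool:
    """One loop: screenshot -> diagnose -> patch -> re-screenshot."""
    for _ in range(max_iters):
        shot = take_screenshot(url)
        fix = diagnose_layout(shot, expected)
        if fix is None:
            return True        # visual output matches expectations
        apply_css_fix(fix)     # otherwise patch and check again
    return False               # gave up after max_iters attempts
```

The point of the sketch is the structure: the verification step is visual, not textual, which is what removes the screenshot-paste-analyze round trip.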

Tool Search Cuts Token Costs

A new API feature called Tool Search lets GPT-5.4 receive a lightweight tool list upfront, then look up full definitions on demand. In Scale’s MCP Atlas benchmark, this reduced token usage by 47% with no accuracy loss.

For teams running heavy MCP-based workflows, that’s not a minor optimization — it’s the difference between “this agent costs $3 per run” and “this agent costs $1.60 per run.” At hundreds of runs per day, the math matters.
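The per-run math is worth making explicit. This is a back-of-envelope sketch using the figures above — the $3.00 baseline is the illustrative number from this post, not a published price, and 0.47 is Scale’s reported token reduction:

```python
# Back-of-envelope cost math for the Tool Search savings above.
baseline_cost = 3.00       # illustrative $/run from the text
token_reduction = 0.47     # Scale's reported MCP Atlas reduction

reduced_cost = baseline_cost * (1 - token_reduction)
print(f"per run: ${reduced_cost:.2f}")   # ~ $1.59

runs_per_day = 300         # assumed workload for illustration
daily_savings = runs_per_day * (baseline_cost - reduced_cost)
print(f"daily savings at {runs_per_day} runs: ${daily_savings:.2f}")
```

At a few hundred runs a day, the savings land in the hundreds of dollars daily — which is why this matters more for agent fleets than for individual chat use.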

Where It Falls Short

The Naming Situation Is a Red Flag

OpenAI released GPT-5.3 Instant on Tuesday and GPT-5.4 on Thursday. In the same week. The version numbering has become genuinely confusing — GPT-5.3-Codex, GPT-5.3-chat-latest, GPT-5.4, GPT-5.4 Pro — and it’s not just an aesthetic complaint. When you’re selecting models for production pipelines, unclear naming creates real operational risk.

Claude’s naming is boring (Opus 4.6, Sonnet 4.6). Boring is good when you’re writing infrastructure code at 2am.

Context Window Parity, Not Advantage

GPT-5.4 has a 1M token context window, matching Claude Opus 4.6’s beta offering. In practice, both models degrade on tasks that actually use the full window. I tested both on a 600K-token codebase analysis task — Claude held coherence slightly better on cross-file references, while GPT-5.4 was faster at generating the initial summary. Neither was great past 800K tokens.

The “1M context” marketing from both companies is aspirational. Real-world useful context is still closer to 200-300K for complex reasoning tasks.

Reasoning on Novel Problems

On ARC-AGI-2, which measures novel abstract reasoning, Opus 4.6 scores 68.8% compared to GPT-5.4’s estimated 52.9%. That’s not a gap — that’s a canyon. For coding specifically, this shows up when you’re asking the model to design a new architecture pattern it hasn’t seen in training data. GPT-5.4 defaults to familiar patterns more aggressively; Claude is more willing to reason from first principles.

Pricing Sticker Shock

GPT-5.4 Pro costs $200/month. Standard GPT-5.4 requires a ChatGPT Plus or Team subscription. API pricing hasn’t been fully disclosed, but early reports suggest it’s roughly 40-45% more expensive per token than Opus 4.6 for equivalent tasks. For individual developers, the math is questionable. For teams already paying for Cursor or similar tools, it might be bundled in — but check the fine print.
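To see what that 40-45% per-token premium means in dollars, here is a hypothetical comparison. Both the $15/M-token Opus rate and the 50M-token monthly volume are assumed placeholders for illustration, not confirmed prices:

```python
# Hypothetical API cost comparison assuming the reported 40-45%
# per-token premium over Opus 4.6. Rates and volume are assumptions.
opus_rate = 15.00                    # $ per million tokens (assumed)
premium_low, premium_high = 0.40, 0.45

gpt54_low = opus_rate * (1 + premium_low)    # low end of the range
gpt54_high = opus_rate * (1 + premium_high)  # high end of the range

monthly_tokens = 50_000_000          # 50M tokens/month (assumed)
extra_low = (gpt54_low - opus_rate) * monthly_tokens / 1_000_000
extra_high = (gpt54_high - opus_rate) * monthly_tokens / 1_000_000
print(f"extra monthly spend: ${extra_low:.0f} to ${extra_high:.0f}")
```

Under those assumptions the premium works out to a few hundred extra dollars per month at moderate volume — real money, but not prohibitive if the computer-use capability replaces a separate tool.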

The Honest Verdict

GPT-5.4 is the first OpenAI model that genuinely challenges Claude for coding work. The computer use capability is a legitimate differentiator that no other general model matches. The SWE-Bench Pro lead suggests real engineering capability, not just benchmark optimization.

But “challenges” isn’t “replaces.” Opus 4.6 still leads on:

  • Standard SWE-Bench (the one with the most real-world validation)
  • Novel reasoning tasks (ARC-AGI-2 gap is enormous)
  • Code review quality for nuanced architectural feedback
  • Consistency on long-context tasks

My current setup: Claude Opus 4.6 for code review, architecture decisions, and complex refactors. GPT-5.4 for computer use tasks, UI debugging, and any workflow where I need the model to interact with actual software.

Running both isn’t cheap, but it’s March 2026 and no single model is the best at everything. Anyone telling you otherwise is selling something.

What to Watch

The real test comes in the next 2-3 weeks as more developers hit production workloads with GPT-5.4. Benchmarks are controlled environments. The question is whether that SWE-Bench Pro advantage holds up when the model is debugging your specific spaghetti codebase at midnight.

I’ll update this post as my experience deepens. For now, it’s a strong addition to the toolkit — not a reason to cancel your Claude subscription.


Running a team and trying to make sense of all these models? Check out my AI tools overview for tech leads for a broader perspective on what’s worth your time.