Two months ago, one of my engineers dropped a link in our team Slack: Simon Willison’s guide on “Agentic Engineering Patterns.” His message was simple: “We should be doing this.”
He was right. And also wrong. Let me explain.
What Is Agentic Engineering, Actually?
Simon Willison defined it as the practice of professional software engineers using AI coding agents — tools like Claude Code, OpenAI Codex, and Gemini CLI — to amplify their existing expertise. The key distinction from vibe coding is that you’re not ignoring the code. You’re directing the agent, reviewing its output, writing tests first, and maintaining engineering standards.
Think of it this way: vibe coding is letting the AI drive while you scroll your phone. Agentic engineering is letting the AI drive while you actively navigate, check mirrors, and occasionally grab the wheel.
Willison’s first two published patterns are compelling:
- “Writing code is cheap now” — The cost to produce initial working code has collapsed. This changes how you think about prototyping, throwaway code, and exploration.
- “Red/green TDD” — Write the test first, let the agent write the implementation. The agent iterates until tests pass. Minimal prompting needed.
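The red/green loop can be sketched in a few lines. Everything here is illustrative: `create_user`, its fields, and the validation rule are stand-ins, not code from my repo. A human writes step 1; the agent iterates until step 2 passes.

```python
# Step 1 (red): a human writes the test before any implementation exists.
# The prompt to the agent is simply: "make test_create_user pass".
def test_create_user():
    user = create_user(email="a@example.com", name="Ada")
    assert user["email"] == "a@example.com"
    assert user["id"] > 0

    rejected = False
    try:
        create_user(email="", name="Ada")
    except ValueError:
        rejected = True
    assert rejected


# Step 2 (green): the kind of implementation the agent converges on.
_next_id = 0

def create_user(email: str, name: str) -> dict:
    global _next_id
    if not email:
        raise ValueError("email is required")
    _next_id += 1
    return {"id": _next_id, "email": email, "name": name}
```

The useful property is reviewability: the burden shifts from auditing the implementation line by line to auditing whether the test list is complete.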
Both sound great in theory. So I ran an experiment.
The Experiment: Two Sprints, Two Approaches
I split our team’s work across two sprints:
Sprint A (control): Normal workflow. Engineers write code themselves, using Copilot for autocomplete but no agentic tools for full implementations.
Sprint B (agentic): Engineers use Claude Code or Codex as primary implementation tools. They write test files first (red/green TDD pattern), provide AGENTS.md context files per repo, and let the agents implement. Engineers review and iterate.
Both sprints had comparable scope: API endpoints, data pipeline modifications, and UI components.
The Numbers (Weeks 1-2)

Sprint B’s initial velocity was roughly 35-40% higher. Not the “10x” you see on Twitter, but genuinely significant for the team.
Breakdown:
- API endpoints: Agent-written code shipped about 45% faster. Boilerplate-heavy work is where agents shine.
- Data pipeline changes: About 25% faster. The agents needed more guidance here — they’d write technically correct transformations that missed domain-specific edge cases.
- UI components: Mixed. Simple components were 50% faster. Complex stateful components with animation were actually slower because the agent kept suggesting approaches that didn’t match our design system, and the back-and-forth cost time.
The engineers were enthusiastic. One of them told me, “I feel like I got promoted to architect overnight. I’m designing systems instead of writing CRUD.”
The Numbers (Weeks 3-4)
Then the bug reports started.
Over the two weeks following Sprint B, we logged 11 bugs traceable to agent-written code, compared to 4 from Sprint A. The agent-written bugs had a specific pattern:
Correct but fragile: The code did what the tests specified but made brittle assumptions about input formats. One endpoint handled null values in 6 of its 7 fields; the agent never wrote the null check for the seventh because no test covered it.
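The fix we adopted was mechanical: enumerate the fields once and assert the null case for every one of them, so no field can be quietly skipped. A minimal sketch with made-up field names:

```python
# Hypothetical payload fields for an endpoint; the real names differ.
FIELDS = ["email", "name", "phone", "company", "title", "country", "referrer"]

def normalize(payload: dict) -> dict:
    """Null-safe normalization: every missing or None field becomes ''."""
    return {f: (payload.get(f) or "") for f in FIELDS}

# One loop covers all seven null cases, unlike a hand-written list of
# six assertions where the seventh can be forgotten.
for field in FIELDS:
    payload = {f: "x" for f in FIELDS}
    payload[field] = None
    assert normalize(payload)[field] == ""
```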
Subtly wrong error handling: The agents loved wrapping things in try-catch blocks that swallowed errors or returned generic 500s. In three cases, we had silent failures in production that took hours to diagnose because the error messages were meaningless.
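Schematically, the difference looks like this. Both handlers are hypothetical sketches, not our production code; the point is that in the second version every anticipated failure path gets its own status code and a log line with context.

```python
import logging

logger = logging.getLogger("orders")

# What the agents tended to produce: every failure becomes a generic 500
# and the cause is gone by the time anyone reads the logs.
def handle_order_v1(fetch, order_id):
    try:
        return 200, fetch(order_id)
    except Exception:
        return 500, {"error": "Internal server error"}

# What we now require: each anticipated error path is named, logged with
# context, and mapped to a status a caller can act on.
def handle_order_v2(fetch, order_id):
    try:
        return 200, fetch(order_id)
    except KeyError:
        return 404, {"error": f"order {order_id} not found"}
    except TimeoutError:
        logger.warning("upstream timeout for order %s", order_id)
        return 504, {"error": "upstream timeout"}
    except Exception:
        # Genuinely unexpected: log the traceback, then fail loudly.
        logger.exception("unhandled error for order %s", order_id)
        return 500, {"error": "internal error"}
```

Specifying the second version in the prompt takes a few extra minutes, which is exactly the work the agent will not volunteer.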
Test-shaped blind spots: When you write the test first and the agent implements to pass it, you get exactly what you tested — nothing more. The agents didn’t volunteer edge cases. They didn’t say “hey, what about rate limiting on this endpoint?” They just passed the tests and stopped.
The net result? Sprint B delivered faster but created roughly 18-22 hours of follow-up debugging work across the team over the next three weeks. When you factor that in, the real productivity gain drops to maybe 12-15%.
Still positive. But not the revolution the discourse suggests.
What Willison Gets Right
The TDD pattern is legitimately the most important insight. Before I read his guide, we were using agents in “just write it” mode — dump a prompt, accept the output, maybe review it. Switching to test-first changed everything:
- Agents write tighter code when they have a test target. Less unnecessary abstraction, fewer “just in case” utilities nobody asked for.
- Review becomes tractable. Instead of reading 400 lines of generated code and trying to spot issues, you read the tests first and check if they’re comprehensive. Then you spot-check the implementation.
- Iteration actually works. “Tests 3 and 7 are failing, fix them” is a much better prompt than “this doesn’t work right, try again.”
If you take one thing from this article: write tests first, always. It’s the single highest-leverage change for agentic workflows.
What the Evangelists Skip
1. Context Engineering Is the Real Bottleneck
Willison mentions this, but it deserves more emphasis. The quality of agent output is directly proportional to the quality of your context: AGENTS.md files, example code, architecture docs, test fixtures.
My team spent roughly 6-8 hours in the first week just writing context files. That’s an investment that pays off over time, but nobody mentions it in the “I built an app in 20 minutes” tweets. If your repo doesn’t have good documentation, the agent will produce confidently wrong code that matches incorrect assumptions.
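For concreteness, here is the shape our context files settled into. AGENTS.md is free-form markdown that tools like Codex and Claude Code read before touching a repo; every detail below is an illustrative placeholder, not our actual file.

```markdown
# AGENTS.md (illustrative sketch)

## Stack
- Python 3.12, FastAPI, SQLAlchemy, pytest

## Conventions
- Every endpoint gets a test file FIRST; run `pytest -x` before proposing code.
- Error responses follow the shape in `app/errors.py`; never return bare 500s.
- No new service classes without asking; prefer one module per feature.

## Domain gotchas
- Every field on a lead record may be null; handle all of them.
- Timestamps are UTC everywhere; never call datetime.now() without a tz.
```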
2. Not All Engineers Benefit Equally
My senior engineers got the most out of agentic workflows. They could spot when the agent was heading in the wrong direction within seconds and redirect. They knew which tests to write because they’d already built similar systems.
My mid-level engineer struggled more. He’d accept agent output that technically worked but was architecturally questionable — because he didn’t yet have the pattern recognition to catch it. Agentic engineering amplifies expertise, which means it also amplifies the gap between senior and less-experienced developers.
3. The “Code Is Cheap” Mindset Has a Dark Side
Willison’s “writing code is cheap now” observation is accurate. But cheap code is still code you have to maintain. In Sprint B, we generated about 30% more code than Sprint A for equivalent functionality. More files, more abstractions, more surface area.
When I asked one engineer why a simple feature had four service classes, he said: “The agent suggested separating concerns.” Technically defensible. Practically, it’s three more files to maintain for a feature that could’ve been 80 lines in one file.
Cheap to write ≠ cheap to own.
My Updated Playbook
After two sprints, here’s how I’ve adjusted our agentic workflow:
DO:
- Write tests first. Always. No exceptions.
- Invest in context files (AGENTS.md, architecture docs, example patterns)
- Use agents for boilerplate-heavy work: API endpoints, CRUD, data transformations, test generation
- Senior engineers direct the agents; mid-level engineers review with senior guidance
DON’T:
- Let agents write error handling without explicit specification of every error path
- Accept agent-generated abstractions without asking “could this be simpler?”
- Use agentic workflows for complex stateful logic, security-sensitive code, or anything involving concurrency
- Skip the post-sprint bug review — track which bugs came from agent code specifically
MEASURE:
- Track velocity AND follow-up bug hours. Velocity alone is a misleading metric.
- Review code-to-test ratio. If agents are writing code without corresponding test coverage, something’s wrong.
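A crude way to watch that ratio in CI, sketched in Python under the assumption that test files follow the `test_*.py` / `*_test.py` naming convention; the alert threshold is our judgment call, not a standard.

```python
import os

def line_count(path):
    # Count non-blank lines; blank lines inflate the ratio for no reason.
    with open(path, encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())

def code_to_test_ratio(root="."):
    """Non-blank lines of implementation code per line of test code."""
    code = test = 0
    for dirpath, _, files in os.walk(root):
        for name in files:
            if not name.endswith(".py"):
                continue
            n = line_count(os.path.join(dirpath, name))
            if name.startswith("test_") or name.endswith("_test.py"):
                test += n
            else:
                code += n
    # No tests at all is the worst signal, so report it as infinite.
    return code / test if test else float("inf")

# We flag agent-heavy repos that drift past roughly 3 lines of code
# per line of test; tune the threshold to your own codebase.
```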
Is Agentic Engineering Real?
Yes. It’s a meaningful step beyond autocomplete and a more disciplined approach than vibe coding. The TDD pattern alone justifies the terminology.
But the current discourse has a survivorship bias problem. You hear about the features shipped in 20 minutes, not the three-hour debugging sessions that follow. You see the velocity charts, not the maintenance burden.
Agentic engineering with proper guardrails — TDD, good context, senior oversight, bug tracking — delivers a real 12-15% net productivity improvement for my team. That’s significant. That’s worth doing.
Just don’t let anyone tell you it’s 10x. It’s 1.15x with a lot of asterisks.
If you tried vibe coding and it went sideways, I wrote about the fallout from our first attempt. Agentic engineering is the grown-up version — but it comes with its own growing pains. I also covered building an AI-powered development workflow from scratch if you want the full integration picture.