Three weeks ago, one of my engineers shipped a feature in two hours that would’ve taken a day. I praised him in the standup. Last week we spent four days unraveling the cascading issues it created.

The feature worked. It passed code review. It passed QA. Then it hit production and quietly started corrupting a lookup cache under specific concurrency conditions — a bug that only manifested under our real traffic patterns, never in tests. The root cause: Cursor had generated a threading approach that was subtly wrong in a way none of us caught, because we were all too impressed that the feature had shipped so fast.

That’s vibe coding’s dirty secret. The invoice arrives later.

What “Vibe Coding” Actually Means on a Real Team

For the uninitiated: vibe coding is what happens when you give an AI agent a rough goal, accept whatever it generates without deeply reading it, and ship. The name comes from Andrej Karpathy’s description of just… vibing with the AI, trusting it to figure out the details.

This works great for prototypes. It works fine for personal projects where you’re the only one who suffers. It starts breaking down when:

  • Multiple engineers are working in the same codebase
  • The code has to stay maintainable for more than a month
  • You have non-trivial concurrency, state management, or security requirements
  • Someone else has to debug it at 2am

My team has a mix of engineers and data scientists. We were shipping 40-45% faster after adopting AI-assisted coding last year. What we didn’t account for was the shape of the code coming out the other end.

The Specific Ways It Goes Wrong

After the cache corruption incident, I went back and audited the last two months of AI-heavy PRs. Here’s what I found:

Pattern 1: Inconsistent abstractions. When two engineers use AI to independently write code that interacts, you often get two different patterns solving the same problem. One uses a custom retry decorator; the other uses exponential backoff inline. They look fine separately. Together they create weird retry storms.
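Here's a minimal sketch of how two "reasonable" retry patterns compound. The names and numbers are hypothetical, not from our actual codebase: Engineer A's decorator retries three times, Engineer B's function already retries three times inline, and composed they turn one failing call into nine attempts against a struggling upstream.

```python
import time

# Hypothetical sketch: two independently reasonable retry patterns stacking.
# Composed, a single failing call is attempted 3 x 3 = 9 times.

def retry(times=3):
    """Engineer A's pattern: a generic retry decorator."""
    def wrap(fn):
        def inner(*args, **kwargs):
            for attempt in range(times):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == times - 1:
                        raise
        return inner
    return wrap

calls = {"count": 0}  # instrumentation so we can see the storm

@retry(times=3)
def fetch_with_backoff():
    """Engineer B's pattern: inline exponential backoff, 3 attempts."""
    for attempt in range(3):
        calls["count"] += 1
        try:
            raise ConnectionError("upstream down")  # simulated outage
        except ConnectionError:
            if attempt == 2:
                raise
            time.sleep(0)  # real backoff elided for the sketch

try:
    fetch_with_backoff()
except ConnectionError:
    pass

print(calls["count"])  # 9 attempts for one logical call — a retry storm
```

Each layer looks fine in isolation; the multiplication only shows up when you trace both layers together, which is exactly what nobody does in a per-PR review.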

Pattern 2: Overconfident error handling. AI tools love to write comprehensive try/catch blocks that swallow exceptions or log them into the void. The code looks defensive. It’s actually hiding failures. We had a data pipeline silently failing for 11 days because a broad exception catch was eating a critical error and logging “unexpected input, skipping” to a log file nobody watched.
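The shape of that bug, reduced to a toy example (the field names and logger are hypothetical): a broad `except Exception` converts a real schema bug into a quiet "skipping" log line, while the narrower version surfaces it.

```python
import logging

logger = logging.getLogger("pipeline")

# Hypothetical pipeline step showing the failure-hiding pattern.
def process_record_swallowing(record):
    try:
        return record["amount"] * record["rate"]  # KeyError if schema drifts
    except Exception:
        # The bug disappears here: a real error becomes a log line nobody reads.
        logger.warning("unexpected input, skipping")
        return None

# Safer shape: catch only what you expect, let the rest fail loudly.
def process_record_loud(record):
    try:
        return record["amount"] * record["rate"]
    except KeyError as exc:
        # Re-raise with context so the failure reaches monitoring.
        raise ValueError(f"record missing field: {exc}") from exc

bad = {"amount": 10.0}                 # "rate" missing after a schema change
print(process_record_swallowing(bad))  # None — silent data loss
```

The swallowing version reads as "defensive" in review; the tell is that it handles every error the same way regardless of whether the error is expected.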

Pattern 3: Tests that confirm intent, not behavior. AI-generated tests tend to test what the code does, not what it should do. If the AI wrote buggy code and then wrote the tests, the tests will pass — they’re testing the wrong thing. This is subtle and hard to catch in review.

Pattern 4: Copy-paste architecture. AI has absorbed every StackOverflow answer and tutorial ever written. Ask it to “add caching” and it’ll add something that looks reasonable but may not fit your specific consistency requirements. It pattern-matches to common solutions, not necessarily correct ones for your context.
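A small illustration of "looks reasonable, wrong consistency model" (the price lookup is hypothetical): `lru_cache` is the tutorial answer to "add caching," but it has no invalidation, so if the underlying data can change, readers see stale values indefinitely.

```python
from functools import lru_cache

PRICES = {"sku-1": 100}  # hypothetical mutable upstream data

@lru_cache(maxsize=128)  # the pattern-matched "add caching" answer
def get_price(sku):
    return PRICES[sku]

print(get_price("sku-1"))  # 100 — first call populates the cache
PRICES["sku-1"] = 150      # upstream update
print(get_price("sku-1"))  # still 100 — read-after-write consistency violated
```

Nothing about this code is wrong in a tutorial context. It's wrong for a system that needs fresh reads, and only someone who knows that requirement will flag it.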

None of these are catastrophic on their own. Together, they compound. And they’re hard to catch because the code is syntactically correct and often reasonably structured. You have to understand what it’s supposed to do at a systems level to spot the problem.

What I Changed (And What Actually Helped)

I didn’t ban AI — that would be counterproductive. But I changed three things:

1. “Author must explain the approach in plain English first.”

Before anyone submits a PR for a non-trivial feature, they have to write a two-sentence explanation of the approach in the PR description. Not what the code does — why this approach, and what the main risk is.

This forces the engineer to actually understand what they’re shipping, not just that the tests pass. It catches about 30% of vibe-coded PRs at the source because the engineer can’t actually explain the approach.

2. AI-flagged sections require extra scrutiny.

We use Claude Code’s review checklist as a starting point. I added one question: “Was this section AI-generated without significant modification?” If yes, a second reviewer specifically focuses on that section. It adds roughly 12 minutes to review time for affected PRs. Worth it.

3. Mandatory behavioral tests for concurrency and error paths.

This was the direct lesson from the cache corruption. Any PR touching shared state or async code now needs a test that specifically probes the failure mode. It’s annoying and takes time. It’s caught four real bugs since we introduced it six weeks ago.
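The kind of test we now require looks roughly like this sketch (class and helper names are hypothetical, not our real code): instead of asserting the happy path once, it hammers the shared structure from many threads and asserts an invariant that only holds if the code is actually thread-safe.

```python
import threading

class NaiveCounterCache:
    """Unsafe read-modify-write: two threads can read the same value."""
    def __init__(self):
        self.counts = {}

    def increment(self, key):
        current = self.counts.get(key, 0)
        self.counts[key] = current + 1  # lost update if interleaved

class LockedCounterCache(NaiveCounterCache):
    """Same logic, but the read-modify-write is atomic under a lock."""
    def __init__(self):
        super().__init__()
        self._lock = threading.Lock()

    def increment(self, key):
        with self._lock:
            super().increment(key)

def hammer(cache, n_threads=8, n_ops=10_000):
    """Drive concurrent increments and return the final count."""
    def worker():
        for _ in range(n_ops):
            cache.increment("hits")
    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cache.counts["hits"]

def test_counter_is_thread_safe():
    # The invariant: no lost updates under contention.
    assert hammer(LockedCounterCache()) == 8 * 10_000
```

It's not a proof of correctness, and races can hide from any finite test run. But it probes the specific failure mode instead of assuming it away, which is the whole point of the rule.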

This approach pairs well with what I described in AI Testing Strategies That Actually Work — the key is using AI to write better tests, not just to generate code that happens to have tests.

What I Didn’t Change (And Why)

I didn’t add a blanket “all AI code needs two reviewers” policy because it would slow everything down with little marginal benefit on simple changes. I also didn’t make this a culture of shame. When I found the cache bug, I said in standup: “We found a bug, here’s what it taught us, here’s the process change.” Not “who did this.” That matters. If engineers feel judged for using AI tools, they’ll hide it, and you’ll lose the signal entirely.

The Honest Tradeoff

Vibe coding is real and it’s not going away. The productivity gains are genuine — I’m not going back to pre-AI velocity. But the assumption that “AI code is basically the same quality as human code, just faster” is wrong. It has specific failure modes that cluster around exactly the things that are hard to test: concurrency, error handling, system-level consistency.

Daniel Stenberg shut down cURL’s bug bounty because roughly 20% of submissions were AI-generated garbage. Mitchell Hashimoto banned AI code from Ghostty. These are experienced maintainers making rational decisions about their specific contexts. For a product team, the tradeoff calculus is different — but the failure modes they’re reacting to are real.

My team is faster with AI. My team also needs more deliberate process around what “done” means. Both things are true.

If you’re a tech lead trying to figure out where to draw the line: the answer isn’t a rule, it’s a detection system. Know what vibe-coded code looks like. Know where it fails. Build process around those failure modes specifically.

The two-sentence PR description is the cheapest, highest-return change I’ve made this quarter.


Related: How I Use AI to Review Code 3x Faster (Without Missing Bugs) · AI Testing Strategies That Actually Work