Four months ago I started logging code review outcomes. Not scientifically — but carefully enough to notice patterns. By 100 PRs I had something worth writing about.

The headline: AI code review is better at one specific thing than humans are, and humans are better at everything else. The nuance: that one specific thing matters more than I expected.

How I Ran This

Every PR over four months went through two processes: standard human review (at least one engineer) and AI review via CodeRabbit plus manual Claude prompts for complex changes.

I tracked: comments that identified actual bugs, comments that were noise (wrong, irrelevant, pedantic), and architectural feedback. One person making judgment calls — imperfect, but better than vibes.


The Numbers

Across 100 PRs:

Bugs caught by AI only (humans missed): 23
Bugs caught by humans only (AI missed): 31
Bugs caught by both: 18
Total unique bugs found: 72

AI review comments (total): over 800
Useful AI comments: ~60%
Noise comments: ~40%

Human architectural/design concerns raised: 47
AI architectural/design concerns raised: 4


What AI Caught That Humans Missed

The 23 AI-only bugs clustered into categories where AI is consistently better than tired humans:

Off-by-one errors and boundary conditions. < vs <= in loop conditions. Humans miss this when reviewing quickly. AI doesn’t.
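A minimal sketch of this bug class (the function names and the inclusive-range framing are mine, not from any reviewed PR):

```typescript
// Hypothetical example of the boundary-condition class AI flags reliably.
// Goal: sum an inclusive range [start, end].

function sumInclusiveBuggy(start: number, end: number): number {
  let total = 0;
  // Bug: `<` silently drops the final value; an inclusive range needs `<=`.
  for (let i = start; i < end; i++) {
    total += i;
  }
  return total;
}

function sumInclusive(start: number, end: number): number {
  let total = 0;
  for (let i = start; i <= end; i++) {
    total += i;
  }
  return total;
}

console.log(sumInclusiveBuggy(1, 3)); // 3 — the final value was dropped
console.log(sumInclusive(1, 3));      // 6
```

A human skimming the diff sees a plausible loop; a checker that asks "is the endpoint inclusive or exclusive?" on every loop catches the mismatch.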

Null/undefined edge cases. AI asks “what happens if this is null?” on every relevant line. Human reviewers are inconsistent.
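A sketch of what "what happens if this is null?" looks like in practice — the `User` shape and field names here are invented for illustration:

```typescript
// Hypothetical sketch of the null/undefined class: code that works on the
// happy path and throws only when an optional field is absent.

interface User {
  name: string;
  address?: { city: string };
}

// Buggy: dereferencing `address` throws a TypeError when it is undefined.
function cityOfBuggy(user: User): string {
  return (user.address as { city: string }).city.toUpperCase();
}

// Fixed: optional chaining plus an explicit fallback.
function cityOf(user: User): string {
  return user.address?.city?.toUpperCase() ?? "UNKNOWN";
}

console.log(cityOf({ name: "Ada" })); // "UNKNOWN"
console.log(cityOf({ name: "Ada", address: { city: "London" } })); // "LONDON"
```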

Security-adjacent patterns. User input flowing into queries. Tokens being logged. Not finding vulnerabilities exactly, but flagging the right patterns.

Missing error handling. “This can throw, what happens?” — AI catches this reliably, humans often assume the author thought about it.
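To make the "this can throw" pattern concrete, here's an invented example using `JSON.parse`, which throws a SyntaxError on malformed input:

```typescript
// Hypothetical sketch of the "this can throw, what happens?" class.
// The buggy version lets a SyntaxError propagate to the caller unannounced.

function parseConfigBuggy(raw: string): { retries: number } {
  return JSON.parse(raw); // throws on malformed input
}

function parseConfig(raw: string): { retries: number } {
  try {
    const parsed = JSON.parse(raw);
    return { retries: typeof parsed.retries === "number" ? parsed.retries : 3 };
  } catch {
    // Fall back to a safe default instead of crashing the caller.
    return { retries: 3 };
  }
}

console.log(parseConfig("not json"));     // { retries: 3 }
console.log(parseConfig('{"retries":5}')); // { retries: 5 }
```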

These are checklist-style checks that humans apply inconsistently. The consistency is the value.


The Noise Problem: When ~40% of Comments Are Noise

3.4 noise comments per PR sounds manageable until you’re triaging 15 AI comments on a 500-line PR, trying to identify which 10 matter.

The noise categories:

  • Wrong codebase conventions. AI suggests adding JSDoc to internal utilities your team has agreed don’t need it.
  • Out-of-scope refactors. “This adjacent function could be simplified…” — maybe, but not what this PR is about.
  • False security positives. Parameterized queries it reads as string interpolation. Custom validation it doesn’t recognize.
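To illustrate the false-positive pattern (the `buildLookup` helper and placeholder style are hypothetical, modeled on drivers that bind parameters with `$1`): a template string that builds a parameterized query can be misread as interpolating user input.

```typescript
// Safe: the template only interpolates a column name chosen from an
// allowlist; the user-supplied value travels as a bound parameter ($1).
const ALLOWED_COLUMNS = new Set(["email", "username"]);

function buildLookup(
  column: string,
  value: string
): { text: string; params: string[] } {
  if (!ALLOWED_COLUMNS.has(column)) {
    throw new Error(`unexpected column: ${column}`);
  }
  return {
    text: `SELECT id FROM users WHERE ${column} = $1`,
    params: [value], // user input never touches the SQL text
  };
}

const q = buildLookup("email", "a@example.com'; DROP TABLE users;--");
console.log(q.text); // SELECT id FROM users WHERE email = $1
console.log(q.params); // the hostile string stays inert as a parameter
```

A reviewer who only sees the `${column}` interpolation flags injection; the allowlist check that makes it safe sits a few lines up, outside the pattern-match.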

The noise is fixable with configuration, but reviewer fatigue is real. If engineers learn to skim AI comments, the 23 bugs stop getting caught.


What Humans Caught That AI Missed

The 31 human-only bugs were different in character:

Logic errors requiring domain context. “This calculation is wrong because we handle X differently than you’re assuming” — you need to know the business logic. AI doesn’t.

Integration problems. “This will break when deployed with the current version of [other service]” — requires knowing the deployment environment. AI doesn’t.

“I remember when we tried this before” catches. Senior reviewers pattern-match on institutional memory. There’s no substitute for that.


The Architecture Gap

47 human architectural comments vs. 4 from AI — and the 4 AI architectural comments were all generic (“consider breaking this into smaller functions”). The 47 human comments were specific: “this is the wrong layer for this logic,” “this couples two modules we’ve been trying to decouple.”

Architecture review is the hardest thing a reviewer does and the most valuable. AI has no model of your system’s past, present, or intended future. It can flag that a function is long. It can’t tell you the function is in the wrong place.

This matches what I’ve seen in other contexts — see AI RFC Writing Experiment for a similar pattern where AI handles structure but not judgment.


Where AI Review Actually Fits

AI review should run before human review, not instead of it.

If those 23 AI-only bugs are caught before human review, the human can focus on higher-level concerns. They don't have to ask “did you handle the null case?” because the AI already checked.

The noise problem is more manageable as a first-pass filter than as the whole review. A human can skim AI comments and discard the noise quickly; nobody gets back the time lost to a bug that both reviews missed.

What I’d avoid: using AI review as a replacement for human review on anything complex. What I’d recommend: mandatory AI gate on every PR, human review required for anything touching core logic.

The data says AI and human review are complementary, not competitive. That’s not a satisfying headline, but it’s what 100 PRs showed me. For more on AI tooling in the review process, see Claude Code Review Checklist.