I added CodeRabbit to our GitHub org on a Tuesday. By Friday, it had flagged 23 issues across 7 PRs, and my team thought I had secretly reviewed everything myself.
I hadn’t. The bot had. And roughly 14 of those 23 issues were legitimately worth fixing.
That roughly 60% signal rate is better than I expected and worse than vendors claim. Over the next 30 days I ran CodeRabbit, Qodo, Ellipsis, and BugBot in parallel on the same PRs to get a real comparison. Here’s what I found.
Why I Did This
My context: I’m a tech lead managing a mixed team — a few engineers, a few data scientists. Everyone uses AI coding tools. We’re shipping more PRs per week than we were a year ago. I’m also the primary reviewer for most of them, and the review backlog was killing me.
I’ve written before about how I use AI to speed up code review. That post was about using AI assistants interactively. This is different — automated bots that post comments on every PR without me doing anything.
The appeal: catch things automatically before I even open the diff. The risk: noise that trains engineers to ignore bot comments.
The Contenders
CodeRabbit — the most mature of the bunch. Integrates with GitHub and GitLab, posts a PR summary plus line-level comments. Configurable via .coderabbit.yaml. Paid ($19/seat/month for pro, free tier available).
Qodo (formerly CodiumAI) — focuses on code integrity and test generation. Will suggest missing tests and flag logic bugs. More opinionated than CodeRabbit. Paid tier starts at $19/seat/month.
Ellipsis — different model: it doesn’t just comment, it can fix issues and push commits. This is either great or alarming depending on your trust level. $20/seat/month.
BugBot — Cursor’s PR review bot, focuses on catching bugs specifically rather than style or documentation. Designed for teams doing heavy AI-assisted coding. Pricing bundled with Cursor for Business.
All four were running on the same repos for the full 30 days.
What They Actually Caught
I tracked every comment across 94 PRs in the period. Here’s the breakdown by category:
Style/formatting issues — all four caught these. This is table stakes and honestly not that useful if you already have a linter. Lots of noise here.
Missing error handling — CodeRabbit and Qodo were the best here. CodeRabbit caught 7 cases where we were ignoring error returns from functions. 6 of those 7 were genuine issues. Qodo caught 5, with 4 genuine. Ellipsis and BugBot caught fewer but with better precision.
Security issues — CodeRabbit surprised me here. It flagged two cases of SQL query string interpolation (not parameterized queries) that our linter had missed. Both real. It also flagged three false positives where we were doing something unusual-but-correct. Qodo was similar. BugBot skipped most security pattern matching — it’s focused on runtime bugs, not vulnerability patterns.
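To make the interpolation pattern concrete, here is a minimal sketch of the kind of query CodeRabbit flags, using Python's sqlite3 module. The table and values are invented for the demo; they're not our actual queries.

```python
import sqlite3

# Throwaway in-memory database for the demo.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

name = "alice"  # imagine this arrived as a request parameter

# The flagged pattern: interpolating input straight into SQL — injectable,
# and invisible to most linters:
#   conn.execute(f"SELECT * FROM users WHERE name = '{name}'")

# Parameterized version: the driver escapes the value for you.
rows = conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()
print(rows)  # [('alice',)]
```

The fix is mechanical, which is exactly why it's a good target for a bot: the pattern is detectable from the diff alone, no system context required.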
Logic bugs — this is where BugBot earned its keep. It uses a more sophisticated analysis than the others and caught two bugs that I would genuinely not have caught in review: one was a subtle off-by-one in a pagination cursor, one was a race condition in an async function. Neither of the other bots flagged these.
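For a sense of what that pagination bug looked like, here's a hypothetical reconstruction in Python — not the actual code, just the shape of the failure. The buggy version computed the next cursor from the last item served, so every page boundary re-served one item.

```python
def paginate(items, cursor, page_size):
    """Return one page and the cursor for the next request.

    Illustrative sketch, not our real code: the original bug set the
    next cursor to the index of the LAST item in the page, so each
    subsequent request re-fetched that item.
    """
    page = items[cursor:cursor + page_size]
    # buggy: next_cursor = cursor + page_size - 1   # duplicates the last item
    next_cursor = cursor + len(page)  # first index not yet served
    return page, next_cursor

page1, c1 = paginate(list(range(7)), 0, 3)   # [0, 1, 2], cursor 3
page2, c2 = paginate(list(range(7)), c1, 3)  # [3, 4, 5], cursor 6
```

Bugs like this pass casual review because both versions look plausible in isolation; you only see the duplicate when you trace two consecutive requests, which is the kind of cross-call reasoning a human reviewer rarely does on a routine PR.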
Documentation gaps — Ellipsis and CodeRabbit both generate PR summaries and flag missing docstrings. Mildly useful. Ellipsis’s auto-generated PR descriptions are actually pretty good if the author didn’t write one.
The False Positive Problem
This is the real issue. After two weeks, two of my engineers had started automatically dismissing bot comments without reading them. The bots had cried wolf too many times.
CodeRabbit in particular has a verbosity problem. On a 200-line PR, it might post 12-15 comments. Even if 8 are valid, the sheer volume trains people to ignore them. I ended up configuring .coderabbit.yaml to suppress style comments and low-severity suggestions, which helped a lot. With tuning, it went from 12-15 comments per PR to 4-6, and the signal rate went up to roughly 75%.
Qodo was more precise but missed more. Ellipsis was somewhere in the middle. BugBot was the lowest volume — sometimes zero comments on a PR — but its hit rate was the highest.
If I had to rank by signal quality: BugBot > Qodo > CodeRabbit (tuned) > Ellipsis > CodeRabbit (defaults).
If I had to rank by overall value including ease of setup and breadth: CodeRabbit > Qodo > BugBot > Ellipsis.
What Doesn’t Work
Auto-fix is not ready. I turned off Ellipsis’s auto-commit feature after three days. It pushed a “fix” that introduced a new bug, and it took me 20 minutes to untangle what happened. Maybe fine for trivial changes; not for anything non-trivial.
Bot comments in code review threads create confusion. Engineers don’t always know what’s from a human versus a bot. One of my engineers spent 10 minutes writing a detailed reply to a CodeRabbit comment before realizing it was a bot. I now require all bots to have a distinct prefix in their comments.
These bots have no system context. They don’t know our architectural decisions, our naming conventions beyond what’s in the code, or why we made certain tradeoffs. Some of the most annoying false positives are technically correct suggestions that violate our actual design principles. No amount of configuration fully solves this.
Pricing adds up. Four bots × $20/seat × 7 people = ~$560/month. That’s not nothing. I’m keeping CodeRabbit and BugBot; canceling the others.
My Actual Recommendation
If you’re a tech lead drowning in PR review load: start with CodeRabbit, spend an hour configuring it to suppress the noise, and give it three weeks. It will catch real bugs. Not all of them, not even most of them — but enough to pay for itself in caught bugs alone.
Add BugBot if your team does heavy AI-assisted coding (the vibe coding debt problem is real and BugBot is the best at catching those specific failure modes).
Don’t bother with Ellipsis unless you have a very simple codebase and high confidence in auto-fixes. Don’t run all four simultaneously — the noise from multiple bots commenting on the same PR is genuinely worse than one well-configured bot.
The deeper issue these tools expose: if you need a bot to catch bugs in your PRs, you either have a review culture problem or a complexity problem. These bots are good band-aids. They’re not a substitute for engineers who actually understand the system they’re changing.
Related: How I Use AI to Review Code 3x Faster (Without Missing Bugs) · Claude Code Review Checklist for 2026 · GitHub Agentic Workflows: A Week-Long Experiment