Last month I merged a PR that looked clean on first pass: tests green, linter happy, thoughtful-seeming Claude review comments. Two days later we discovered duplicate billing events caused by a retry path that wasn’t idempotent.
That was a useful reminder: AI code review is great at coverage, but still weak at risk weighting. It can tell you 20 things; only 2 might matter.
So I stopped treating Claude Code as a “smart approver” and started using it like a structured risk scanner.
This is the checklist I now run before approval.
How I Use This Checklist
I ask Claude Code to review the diff with this exact prompt:
```
Review this PR diff for:
1) correctness bugs
2) security/privacy risk
3) reliability issues (timeouts/retries/idempotency)
4) maintainability issues that will hurt in 3+ months
Return only concrete findings with file + line context and severity.
Skip style nits.
```
Then I run the 12 checks below manually, with Claude as assistant—not decision-maker.
The 12 Checks
1) Idempotency on money/state-changing paths
If retries happen, does the same request create duplicate side effects?
Look for:
- webhook handlers
- billing writes
- status transitions
- job workers without dedup keys
If this fails, it’s usually a release blocker.
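A minimal sketch of the dedup-key pattern I look for. The names (`processed_events`, `handle_billing_webhook`) and the in-memory set are illustrative; a real system would back the key with a durable store or a unique constraint:

```python
# Illustrative in-memory stores; production code would use a database
# unique constraint on the event id instead.
processed_events: set[str] = set()
charges: list[dict] = []

def handle_billing_webhook(event: dict) -> bool:
    """Apply the event exactly once, keyed on the provider's event id.

    Returns True if applied, False if it was a duplicate delivery.
    """
    event_id = event["id"]
    if event_id in processed_events:
        return False  # retry of an already-applied event: no new side effects
    charges.append({"customer": event["customer"], "amount": event["amount"]})
    processed_events.add(event_id)
    return True

# A provider retry delivers the same event twice; only one charge is written.
evt = {"id": "evt_123", "customer": "cus_9", "amount": 4200}
handle_billing_webhook(evt)
handle_billing_webhook(evt)  # duplicate delivery, ignored
```

If a handler in the diff has no equivalent of that dedup key, that is the question I raise first.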
2) Timeout propagation across call chains
Most outages aren’t from one failure—they’re from hanging dependencies.
Check that timeout values are:
- explicit
- passed through downstream clients
- aligned with retry budgets
“Default timeout” is often “no timeout.”
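One way to make propagation visible is to compute a single absolute deadline at the edge and pass it down, so nested calls share one budget instead of stacking defaults. A sketch, with `fetch_profile` and `handle_request` as hypothetical names:

```python
import time

def remaining(deadline: float) -> float:
    """Time budget left before the caller's deadline, in seconds."""
    return deadline - time.monotonic()

def fetch_profile(deadline: float) -> str:
    budget = remaining(deadline)
    if budget <= 0:
        raise TimeoutError("deadline exceeded before downstream call")
    # A real HTTP client would receive this as an explicit timeout,
    # e.g. requests.get(url, timeout=budget) -- never the library default.
    return f"profile fetched with {budget:.1f}s budget"

def handle_request(total_timeout: float = 2.0) -> str:
    # One absolute deadline at the edge, threaded through every downstream call.
    deadline = time.monotonic() + total_timeout
    return fetch_profile(deadline)
```

In review I check that every downstream call in the chain receives this budget explicitly.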
3) Retry safety (not just retry existence)
A retry without jitter/backoff can create thundering herds.
Check:
- exponential backoff + jitter
- bounded retry count
- no retry on non-retryable errors
And confirm retry won’t replay unsafe writes.
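The retry shape I want to see, sketched with full jitter and an illustrative `NonRetryableError` class (the sleep is commented out so the sketch runs instantly):

```python
import random

class NonRetryableError(Exception):
    """Errors like 4xx validation failures: retrying cannot help."""

def backoff_delays(max_attempts: int = 5, base: float = 0.1, cap: float = 5.0) -> list[float]:
    """Exponential backoff with full jitter, bounded attempt count."""
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(max_attempts)]

def call_with_retries(op, max_attempts: int = 5):
    delays = backoff_delays(max_attempts)
    for attempt, delay in enumerate(delays, start=1):
        try:
            return op()
        except NonRetryableError:
            raise  # do not retry errors a retry cannot fix
        except Exception:
            if attempt == max_attempts:
                raise  # bounded: give up after the last attempt
            # time.sleep(delay) here in real code
    raise RuntimeError("max_attempts must be >= 1")
```

A bare `while True: try again` loop in a diff fails this check even if it "has retries."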
4) Auth boundary leaks
Claude catches obvious auth checks, but subtle context leaks still slip through.
Verify:
- tenant/org scoping on every query path
- no fallback to global scope
- permission checks happen before side effects
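A sketch of the shape I verify, using an illustrative in-memory `db` keyed by `(org_id, invoice_id)`. The point is structural: there is no unscoped overload to fall back to, and the permission check runs before any write:

```python
# Illustrative store keyed by (org_id, invoice_id): scoping by tenant is
# built into the lookup, so cross-tenant reads are impossible even if an
# invoice id leaks.
db = {
    ("org_a", "inv_1"): {"status": "open"},
    ("org_b", "inv_2"): {"status": "open"},
}

def get_invoice(org_id: str, invoice_id: str) -> dict:
    row = db.get((org_id, invoice_id))
    if row is None:
        raise LookupError("invoice not found in this org")
    return row

def close_invoice(org_id: str, invoice_id: str, actor_can_close: bool) -> None:
    if not actor_can_close:
        raise PermissionError("actor may not close invoices")  # check BEFORE side effect
    invoice = get_invoice(org_id, invoice_id)
    invoice["status"] = "closed"
```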
5) Error handling that preserves signal
Many PRs “handle” errors by swallowing detail.
Check:
- logs preserve actionable context (request id, key identifiers)
- user-facing messages are safe but informative
- no broad `catch` that hides the root cause
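A sketch of error handling that keeps the signal: a narrow except, a log line carrying the request id, and a safe user-facing message. `charge` and `PaymentDeclined` are hypothetical stand-ins:

```python
import logging

logger = logging.getLogger("billing")

class PaymentDeclined(Exception):
    pass

def charge(request_id: str, customer_id: str) -> str:
    # Stand-in for a real payment call; always declines in this sketch.
    raise PaymentDeclined("card_declined")

def handle_charge(request_id: str, customer_id: str) -> str:
    try:
        return charge(request_id, customer_id)
    except PaymentDeclined as exc:
        # Narrow except: the log preserves request id + customer for on-call,
        # while the user message stays safe but still informative.
        logger.warning("charge failed request_id=%s customer=%s reason=%s",
                       request_id, customer_id, exc)
        return "Payment was declined; please try another card."
    # No bare `except Exception` here -- unexpected errors should surface.
```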
6) Null/optional assumptions in changed paths
AI tools are decent here, but still miss cross-file assumptions.
Check:
- nullable fields from external APIs
- schema mismatch between model and runtime object
- optional chaining that masks broken data contracts
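The pattern I want at the boundary: validate nullable fields from the external API once, instead of optional-chaining past them deeper in the code. `parse_customer` and its fields are illustrative:

```python
def parse_customer(payload: dict) -> dict:
    """Normalize an external API response at the boundary.

    `email` may legitimately be null, so it stays an explicit Optional.
    A missing `id` is a broken data contract and should fail loudly here,
    not three files later.
    """
    if payload.get("id") is None:
        raise ValueError("customer payload missing id")
    return {
        "id": payload["id"],
        "email": payload.get("email"),          # explicit, documented Optional
        "plan": payload.get("plan") or "free",  # default only where the product says so
    }
```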
7) Data race / ordering risk in async flows
If two workers process the same entity concurrently, what happens?
Check:
- lock or CAS strategy
- transaction boundaries
- event ordering assumptions
If ordering matters and isn’t enforced, note it explicitly before merge.
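One CAS strategy, sketched as optimistic concurrency on a version counter. In SQL this is the `UPDATE ... WHERE id = ? AND version = ?` pattern; the in-memory `jobs` dict is illustrative:

```python
# Illustrative store; in SQL the equivalent is:
#   UPDATE jobs SET status = ?, version = version + 1
#   WHERE id = ? AND version = ?
jobs = {"job_1": {"status": "queued", "version": 1}}

def transition(job_id: str, expected_version: int, new_status: str) -> bool:
    """Apply the transition only if no one else updated the row first."""
    job = jobs[job_id]
    if job["version"] != expected_version:
        return False  # lost the race; caller must re-read and decide
    job["status"] = new_status
    job["version"] += 1
    return True

# Two workers both read version 1; only the first transition wins.
```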
8) Test intent vs real risk
“More tests” ≠ “better protection.”
Check that tests cover:
- failure branches
- replay/retry scenarios
- boundary conditions tied to business impact
A perfect happy-path suite can still hide production-grade bugs.
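What "boundary conditions tied to business impact" looks like in practice, with a hypothetical `apply_discount` as the unit under test: the valuable assertions are the edges and the failure branch, not another mid-range happy path:

```python
def apply_discount(total_cents: int, percent: int) -> int:
    """Hypothetical unit under test."""
    if not (0 <= percent <= 100):
        raise ValueError("percent out of range")
    return total_cents - (total_cents * percent) // 100

def test_boundaries_and_failure_branch():
    assert apply_discount(1000, 0) == 1000   # boundary: no discount
    assert apply_discount(1000, 100) == 0    # boundary: full discount, never negative
    try:
        apply_discount(1000, 101)            # failure branch: invalid input rejected
        assert False, "expected ValueError"
    except ValueError:
        pass
```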
9) Config drift and default changes
Small config edits can create large operational changes.
Check:
- changed defaults
- environment-specific behavior
- feature flags with safe rollback
Ask: if this flag is wrong in prod, can we recover quickly?
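A small sketch of the flag-reading shape that makes that recovery answer "yes": unexpected values degrade to a safe default instead of silently flipping behavior. `flag_enabled` and the env-var names are illustrative:

```python
import os

def flag_enabled(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment.

    Anything unexpected (missing, typo, garbage) falls back to the safe
    default, so a bad prod config degrades to known behavior.
    """
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "on", "yes"}
```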
10) Observability readiness
If this breaks at 2am, can on-call diagnose in 10 minutes?
Check:
- metrics added/updated
- dashboards/alerts still valid
- structured logging around new critical paths
No observability = blind deploy.
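The minimum I ask for on a new critical path is one structured line per event, so on-call can filter by field instead of parsing prose. A sketch (logger name and fields are illustrative):

```python
import json
import logging

logger = logging.getLogger("checkout")

def log_event(event: str, **fields) -> str:
    """Emit one structured JSON line per critical-path event."""
    line = json.dumps({"event": event, **fields}, sort_keys=True)
    logger.info(line)
    return line
```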
11) Migration/backfill safety
Any schema or data migration should be reversible or staged.
Check:
- backward compatibility window
- dual-write/read strategy where needed
- explicit rollback plan in PR description
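A sketch of what a dual-read compatibility window looks like during a column rename (hypothetical rename of `name` to `full_name`): readers prefer the new column but fall back to the old one until the backfill finishes:

```python
def read_full_name(row: dict) -> str:
    """Dual-read during a rename: prefer the new column, fall back to the
    old one so pre-backfill rows keep working. Remove the fallback only
    after the backfill is verified complete."""
    value = row.get("full_name") or row.get("name")
    if value is None:
        raise ValueError("row has neither full_name nor name")
    return value
```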
12) Scope creep disguised as “cleanup”
Claude sometimes suggests broad refactors during review.
Check:
- does the PR still do one thing?
- are unrelated refactors increasing merge risk?
- can the cleanup be split into a follow-up PR?
Small, focused PRs ship safer and faster.
What I Track After Merge
The checklist is only useful if you measure outcomes.
I track three metrics monthly:
- false-positive rate from AI review comments
- escaped defect count in AI-reviewed PRs
- median review time for medium/high-risk PRs
If false positives rise, I tighten the prompt and reduce the tool surface. If escaped defects rise, I add explicit checklist guards.
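The first two metrics reduce to simple ratios. A sketch of the monthly rollup, assuming "dismissed by a human reviewer" is what counts as a false positive (all names and counters are illustrative):

```python
def review_metrics(comments_total: int, comments_dismissed: int,
                   ai_reviewed_prs: int, escaped_defects: int) -> dict:
    """Monthly AI-review rollup from raw counters."""
    return {
        # share of AI comments a human reviewer dismissed as noise
        "false_positive_rate": (comments_dismissed / comments_total
                                if comments_total else 0.0),
        # production defects that slipped through AI-reviewed PRs
        "escaped_defects_per_pr": (escaped_defects / ai_reviewed_prs
                                   if ai_reviewed_prs else 0.0),
    }
```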
Common Failure Pattern (and Fix)
Failure pattern:
- team enables AI review
- comment volume increases
- reviewers feel “more covered”
- critical risk still passes because signal-to-noise worsens
Fix:
- use one primary AI reviewer + one fallback
- enforce severity labeling (P0/P1/P2)
- require a risk summary in the PR comment before approval
Use this PR risk summary template to standardize that step:
```
PR Risk Summary
- Scope: [what changed]
- Highest risk: [P0/P1/P2 + one-line scenario]
- Evidence: [file/line + test/log]
- Rollback plan: [how to revert safely]
- Decision: [approve / request changes]
```
AI should reduce reviewer fatigue, not cosmetically increase review output.
Final Take
Claude Code is excellent at accelerating review prep. It is not a substitute for risk ownership.
If you adopt one thing from this post: standardize a checklist that prioritizes idempotency, timeout chains, and auth boundaries over style and micro-refactors.
That alone will prevent more incidents than adding another review bot.
If this checklist style is useful, you might also like *I Ran 5 AI Review Agents on the Same PR. Only 2 Were Useful* and *How I Use AI to Review Code 3x Faster*.