Last month I merged a PR that looked clean on first pass: tests green, linter happy, Claude review comments looked thoughtful. Two days later we discovered duplicate billing events caused by a retry path that wasn’t idempotent.

That was a useful reminder: AI code review is great at coverage, but still weak at risk weighting. It can tell you 20 things; only 2 might matter.

So I stopped treating Claude Code as a “smart approver” and started using it like a structured risk scanner.

This is the checklist I now run before approval.

How I Use This Checklist

I ask Claude Code to review the diff with this exact prompt:

Review this PR diff for:
1) correctness bugs
2) security/privacy risk
3) reliability issues (timeouts/retries/idempotency)
4) maintainability issues that will hurt in 3+ months

Return only concrete findings with file + line context and severity.
Skip style nits.

Then I run the 12 checks below manually, with Claude as an assistant, not the decision-maker.

The 12 Checks

1) Idempotency on money/state-changing paths

If retries happen, does the same request create duplicate side effects?

Look for:

  • webhook handlers
  • billing writes
  • status transitions
  • job workers without dedup keys

If this fails, it’s usually a release blocker.
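A minimal sketch of the dedup-key pattern, using an in-memory set as a stand-in for a durable store (in production this would be a DB table with a unique constraint, so the check-and-record step is atomic). All names here are illustrative:

```python
# Dedup store: stands in for a DB table with a unique constraint on event_id.
processed_events = set()

def handle_billing_webhook(event_id: str, amount_cents: int, charges: list) -> bool:
    """Apply the charge exactly once, even if the webhook is redelivered."""
    if event_id in processed_events:
        return False  # duplicate delivery: skip the side effect entirely
    charges.append(amount_cents)   # the state-changing side effect
    processed_events.add(event_id) # record the key only after success
    return True
```

The review question is whether the key check and the side effect are atomic; if a crash can land between them, retries still duplicate.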

2) Timeout propagation across call chains

Most outages aren’t from one failure—they’re from hanging dependencies.

Check that timeout values are:

  • explicit
  • passed through downstream clients
  • aligned with retry budgets

“Default timeout” is often “no timeout.”
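One way to check this is to look for a deadline that flows through the call chain, so each downstream call gets the remaining budget rather than a fresh timeout. A sketch, with `fetch` as a hypothetical downstream client:

```python
import time

def remaining_budget(deadline: float) -> float:
    """Convert an absolute deadline into a per-call timeout."""
    budget = deadline - time.monotonic()
    if budget <= 0:
        raise TimeoutError("deadline already exceeded")
    return budget

def handle_request(fetch, total_timeout: float = 2.0):
    """Each downstream call gets the *remaining* budget, not a fresh one."""
    deadline = time.monotonic() + total_timeout
    a = fetch("serviceA", timeout=remaining_budget(deadline))
    b = fetch("serviceB", timeout=remaining_budget(deadline))
    return a, b
```

If every hop instead starts its own 2-second clock, three hops can hold a request for 6 seconds while the caller gave up at 2.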

3) Retry safety (not just retry existence)

A retry without jitter/backoff can create thundering herds.

Check:

  • exponential backoff + jitter
  • bounded retry count
  • no retry on non-retryable errors

And confirm retry won’t replay unsafe writes.
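The shape I look for is roughly this: bounded attempts, exponential backoff with full jitter, and a whitelist of retryable errors so validation failures propagate immediately. A sketch with illustrative constants:

```python
import random
import time

# Only transient failures are worth retrying; anything else propagates.
RETRYABLE = (TimeoutError, ConnectionError)

def call_with_retry(fn, max_attempts: int = 4, base: float = 0.05, cap: float = 1.0):
    for attempt in range(max_attempts):
        try:
            return fn()
        except RETRYABLE:
            if attempt == max_attempts - 1:
                raise  # bounded: give up after the retry budget
            # exponential backoff with full jitter, capped
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```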

4) Auth boundary leaks

Claude catches obviously missing auth checks, but subtle context leaks still slip through.

Verify:

  • tenant/org scoping on every query path
  • no fallback to global scope
  • permission checks happen before side effects

5) Error handling that preserves signal

Many PRs “handle” errors by swallowing detail.

Check:

  • logs preserve actionable context (request id, key identifiers)
  • user-facing messages are safe but informative
  • no broad catch that hides root cause

6) Null/optional assumptions in changed paths

AI tools are decent here, but still miss cross-file assumptions.

Check:

  • nullable fields from external APIs
  • schema mismatch between model and runtime object
  • optional chaining that masks broken data contracts

7) Data race / ordering risk in async flows

If two workers process the same entity at the same time, what happens?

Check:

  • lock or CAS strategy
  • transaction boundaries
  • event ordering assumptions

If ordering matters and isn’t enforced, note it explicitly before merge.

8) Test intent vs real risk

“More tests” ≠ “better protection.”

Check that tests cover:

  • failure branches
  • replay/retry scenarios
  • boundary conditions tied to business impact

A perfect happy-path suite can still hide production-grade bugs.

9) Config drift and default changes

Small config edits can create large operational changes.

Check:

  • changed defaults
  • environment-specific behavior
  • feature flags with safe rollback

Ask: if this flag is wrong in prod, can we recover quickly?

10) Observability readiness

If this breaks at 2am, can on-call diagnose in 10 minutes?

Check:

  • metrics added/updated
  • dashboards/alerts still valid
  • structured logging around new critical paths

No observability = blind deploy.

11) Migration/backfill safety

Any schema or data migration should be reversible or staged.

Check:

  • backward compatibility window
  • dual-write/read strategy where needed
  • explicit rollback plan in PR description

12) Scope creep disguised as “cleanup”

Claude sometimes suggests broad refactors during review.

Check:

  • does the PR still do one thing?
  • are unrelated refactors increasing merge risk?
  • can cleanup be split into follow-up PR?

Small, focused PRs ship safer and faster.

What I Track After Merge

The checklist is only useful if you measure outcomes.

I track three metrics monthly:

  • false-positive rate from AI review comments
  • escaped defect count in AI-reviewed PRs
  • median review time for medium/high-risk PRs

If false positives rise, I tighten the prompt and reduce the tool surface. If escaped defects rise, I add explicit checklist guards.

Common Failure Pattern (and Fix)

Failure pattern:

  • team enables AI review
  • comment volume increases
  • reviewers feel “more covered”
  • critical risk still passes because signal-to-noise worsens

Fix:

  • use one primary AI reviewer + one fallback
  • enforce severity labeling (P0/P1/P2)
  • require risk summary in PR comment before approval

Use this PR risk summary template to standardize that step:

PR Risk Summary
- Scope: [what changed]
- Highest risk: [P0/P1/P2 + one-line scenario]
- Evidence: [file/line + test/log]
- Rollback plan: [how to revert safely]
- Decision: [approve / request changes]

AI should reduce reviewer fatigue, not cosmetically increase review output.

Final Take

Claude Code is excellent at accelerating review prep. It is not a substitute for risk ownership.

If you adopt one thing from this post: standardize a checklist that prioritizes idempotency, timeout chains, and auth boundaries over style and micro-refactors.

That alone will prevent more incidents than adding another review bot.

If this checklist style is useful, you might also like "I Ran 5 AI Review Agents on the Same PR. Only 2 Were Useful" and "How I Use AI to Review Code 3x Faster."