At 2:47 AM on a Tuesday, our event processing pipeline stopped processing. Not slowly — completely. The on-call engineer was staring at a wall of logs trying to figure out what happened, while the events queue grew at 50,000 messages per minute.
This incident, and the postmortem that followed, is what pushed me to seriously integrate AI into our incident response process. Not as a replacement for experienced engineers making judgment calls under pressure, but as a tool to reduce the time between “something is broken” and “I understand what’s broken.”
Here’s the playbook we’ve developed over the last few months.
Phase 1: Detection and Triage (First 5 Minutes)
When an alert fires, the first challenge is understanding what’s actually happening. Our monitoring generates a lot of data — metrics, logs, traces — and under the stress of an active incident, it’s easy to fixate on symptoms rather than causes.
We built a simple triage script that collects relevant data and feeds it to an LLM:
#!/bin/bash
# incident-triage.sh — Run this when an alert fires
SERVICE="$1"
TIMEFRAME="${2:-15m}"        # kubectl-style duration, e.g. 15m
MINUTES="${TIMEFRAME%m}"     # numeric part, for GNU date below
echo "=== Collecting incident data for $SERVICE (last $TIMEFRAME) ==="
# Collect metrics (-g stops curl from globbing the {} in the query;
# GNU date syntax — on macOS use gdate)
echo "## Metrics" > /tmp/incident-data.txt
curl -sg "http://prometheus:9090/api/v1/query_range?query=rate(http_requests_total{service=\"$SERVICE\"}[5m])&start=$(date -d "-$MINUTES minutes" +%s)&end=$(date +%s)&step=60" \
| jq '.data.result[] | {metric: .metric, values: .values[-5:]}' >> /tmp/incident-data.txt
# Collect error logs
echo -e "\n## Recent Error Logs" >> /tmp/incident-data.txt
kubectl logs -l app="$SERVICE" --since="$TIMEFRAME" | grep -iE "error|exception|fatal" | tail -50 >> /tmp/incident-data.txt
# Collect recent deployments
echo -e "\n## Recent Deployments" >> /tmp/incident-data.txt
kubectl rollout history "deployment/$SERVICE" | tail -5 >> /tmp/incident-data.txt
# Feed to LLM for initial triage
cat /tmp/incident-data.txt | claude "You are an SRE triaging a production incident.
Analyze this data and provide:
1. What appears to be failing (1-2 sentences)
2. Most likely root cause category: deployment, infrastructure, dependency, traffic, data
3. Severity estimate: SEV1 (total outage), SEV2 (partial), SEV3 (degraded)
4. Top 3 things to check next, in priority order
5. Any patterns in the error logs that suggest a specific root cause
Be concise. This is an active incident."
In the 2:47 AM incident I mentioned, this script identified within about 90 seconds that the errors were all ConnectionRefusedError targeting our Redis cluster, that Redis had been unreachable for 3 minutes, and that no deployments had happened in the last 12 hours (ruling out a code change). It suggested checking Redis cluster health as the first priority.
The on-call engineer would have gotten there eventually, but “eventually” during a production outage is expensive. Shaving 5-10 minutes off triage time when you’re losing data or serving errors is meaningful.
What Doesn’t Work in Triage
The LLM is good at pattern-matching on logs and suggesting categories. It’s bad at understanding your specific system’s failure modes. It won’t know that your Redis cluster has a known issue with memory fragmentation under certain write patterns, or that the last time you saw ConnectionRefusedError it was actually a network partition, not a Redis crash.
We keep a known-issues.md file that we include in the triage prompt:
# Known Issues and Failure Patterns
## Redis
- Memory fragmentation under high write load → OOM kill → restart cycle
- Symptom: ConnectionRefusedError followed by intermittent reconnects
- Fix: `redis-cli memory purge` or rolling restart with memory limit increase
## Kafka
- Consumer group rebalancing during deployments → temporary lag spike
- Symptom: Consumer lag increases for 2-3 minutes then recovers
- Fix: Wait. If lag doesn't recover after 5 min, check consumer health
## Database
- Connection pool exhaustion under load → timeout errors
- Symptom: "connection pool exhausted" in logs, increasing p99 latency
- Fix: Check for long-running queries, increase pool size if needed
This file is gold. It gives the LLM our institutional knowledge about failure modes, which dramatically improves the quality of its triage suggestions.
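For that to work, the file has to actually reach the model alongside the incident data. A minimal sketch of how the prompt input can be assembled (the known-issues.md path is an assumption; point it at wherever yours lives):

```shell
# Hypothetical helper: prepend institutional knowledge to the triage input,
# so the LLM sees known failure patterns before the raw incident data.
KNOWN_ISSUES="${KNOWN_ISSUES:-/opt/runbooks/known-issues.md}"

build_triage_prompt() {
  if [ -f "$KNOWN_ISSUES" ]; then
    echo "## Known failure patterns for this system"
    cat "$KNOWN_ISSUES"
    echo
  fi
  echo "## Incident data"
  cat /tmp/incident-data.txt
}

# build_triage_prompt | claude "You are an SRE triaging a production incident. ..."
```

The ordering is deliberate: putting the failure patterns before the logs nudges the model to match against them rather than reason from scratch.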
Phase 2: Investigation (5-30 Minutes)
Once you know roughly what area to investigate, AI can help you dig through logs and metrics faster. This is where I’ve seen the biggest time savings.
Log Analysis
Production logs during an incident are overwhelming. Thousands of lines per minute, most of them noise. We use the LLM to filter and summarize:
# Grab the last 1000 error lines and analyze
kubectl logs -l app=event-processor --since=30m | grep -i "error" | tail -1000 | \
claude "Analyze these error logs from a production incident:
1. Group the errors by type/pattern
2. Show the timeline (when did each error type start/stop?)
3. Are there any error chains (error A causing error B)?
4. What's the most upstream error (likely root cause)?
Format as a timeline."
The timeline format is particularly useful. In our 2 AM incident, the LLM identified that ConnectionRefusedError started at 2:44:12, followed by QueueFullError at 2:44:47 (because the pipeline couldn’t write to Redis and backed up), followed by MemoryError at 2:46:33 (because the backed-up queue exhausted memory). That causal chain — Redis down → queue backup → OOM — was clear in the LLM’s analysis but hard to see scrolling through raw logs at 3 AM.
Configuration Diff Analysis
“What changed?” is the most important question during most incidents. For configuration-driven issues:
# Compare current config with last known good
diff <(kubectl get configmap event-processor-config -o yaml) \
<(git show HEAD~1:k8s/configmap.yaml) | \
claude "This is a diff between current production config and the
previous version. Identify any changes that could cause:
- Connection failures
- Performance degradation
- Data processing errors
For each suspicious change, explain the potential impact."
This caught a real issue for us once: someone had changed a connection timeout from 30 seconds to 3 seconds “to fail faster,” not realizing that our upstream service had p99 latency of 2.8 seconds. The LLM flagged it immediately. The human who reviewed the config change had missed it because 3 seconds seems like a reasonable timeout if you don’t know the upstream latency characteristics.
Phase 3: Resolution
For the actual fix, AI is less useful. Incident resolution often requires judgment calls that depend on context the LLM doesn’t have:
- “Should we roll back or forward-fix?”
- “Can we take this service down for 5 minutes to repair data?”
- “Should we page the database team at 3 AM?”
These are human decisions. The AI can present options, but the decision requires understanding risk tolerance, organizational context, and the specific impact of the incident.
Where AI does help during resolution: generating the actual fix. Once you know what needs to change, the LLM can write the patch, the migration script, or the configuration update. During an active incident, speed matters, and having the LLM draft a fix while you think through the implications saves time.
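As an illustration, here's the shape of a fix-drafting prompt for the Redis scenario above. The statefulset name and the specific safeguards are hypothetical; the point is spelling out the safety requirements explicitly:

```shell
# Hypothetical sketch: ask the LLM to draft a remediation script with
# safety rails baked into the request itself.
draft_fix_prompt() {
  cat <<'EOF'
Write a bash script that performs a rolling restart of the Redis
statefulset "redis-main". Make it safe:
- require an explicit --execute flag; default to dry-run output
- log every command before running it
- abort if any pod fails its readiness check after restart
EOF
}

# draft_fix_prompt | claude   # review the draft before running anything
```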
Including an explicit "make it safe" instruction when you ask for the fix matters: the LLM will add confirmation prompts, dry-run capabilities, and logging that you might skip if you were writing the script yourself under pressure.
Phase 4: Postmortem
This is where AI adds the most value with the least risk. Postmortems are important but tedious, and they often get deprioritized — which means the lessons from incidents get lost.
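Our draft step is a small script that stitches the incident timeline and the incident-channel transcript into one input for the LLM. A sketch, assuming you export the Slack channel to plain text (the file names are placeholders):

```shell
# Hypothetical: combine the incident timeline and channel transcript
# into a single document the LLM can draft a postmortem from.
build_postmortem_input() {
  local timeline="$1" transcript="$2"
  echo "## Incident timeline"
  cat "$timeline"
  echo
  echo "## Incident channel transcript"
  cat "$transcript"
}

# build_postmortem_input timeline.md slack-export.txt | claude "Draft a
# blameless postmortem: summary, impact, timeline, root cause, action items."
```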
The AI-generated postmortem isn’t the final version — the team reviews and adjusts it. But having a solid first draft means the postmortem actually gets done within 48 hours of the incident instead of languishing for weeks. We went from completing about 40% of postmortems to completing close to 90% after we started using AI for the first draft.
The Incident Response Runbook
Here’s the condensed version of our current process:
# Incident Response with AI Assist
## When alert fires:
1. Run `incident-triage.sh <service>` → get initial assessment
2. Confirm severity, open incident channel
3. Start investigation based on triage suggestions
## During investigation:
4. Use LLM for log analysis: pipe logs through analysis prompt
5. Check for config/deployment changes with diff analysis
6. Consult known-issues.md (AI has this in context)
## During resolution:
7. Human decides on approach (rollback/fix/mitigate)
8. AI can draft fix scripts — always use dry-run first
9. Human reviews and executes
## After resolution:
10. AI generates postmortem draft from timeline + Slack
11. Team reviews and finalizes within 48 hours
12. Update known-issues.md with new failure pattern
What I Wish I’d Known Earlier
The LLM is most useful at the start and end of an incident. During triage, it reduces the time to understand what’s happening. During postmortem, it reduces the friction of documentation. During the actual investigation and resolution — the middle, high-pressure part — it’s a modest helper at best.
Include your known-issues file. Generic LLM responses about “check your connection settings” are useless during an incident. Your institutional knowledge about failure modes is what makes the AI’s suggestions actionable.
Never let AI execute production commands automatically. I’ve seen proposals for “AI-driven auto-remediation” and I think it’s premature for most teams. The blast radius of a wrong automated fix during an incident is terrifying. AI suggests, humans execute.
Speed of AI response matters less than you think. During an incident, the bottleneck is almost never “waiting for the LLM to respond.” It’s decision-making, coordination, and understanding complex system interactions. The AI saves time on information processing, which is valuable, but it doesn’t eliminate the hard parts.
Practice in non-incident conditions. The worst time to learn your AI-assisted workflow is during a real incident at 3 AM. We run incident simulations quarterly and include the AI tools. This helps the team build muscle memory and understand the tools’ limitations before they’re under pressure.