Sprint planning is one of those ceremonies that every team does and almost nobody loves. You sit in a room (or a Zoom call) for an hour or two, argue about story points, discover halfway through that two tickets depend on the same external team, and leave with a plan that’s already outdated.

About six months ago, I started experimenting with using LLMs to improve different parts of our sprint planning process. Some of these experiments worked surprisingly well. Others were somewhere between unhelpful and actively counterproductive. Here’s the honest breakdown.

What We Tried

1. AI-Assisted Ticket Decomposition

This was the first experiment and the biggest win. One of the most common problems in sprint planning is tickets that are too large or too vague. “Implement the new data pipeline” is not a sprint-sized ticket — it’s a project. But engineers often write tickets like this because decomposition is tedious and feels like overhead.

I started using an LLM to decompose large tickets into sprint-sized work:

```
Here's a ticket: "Implement the new data pipeline."

Context: We need to consume events from Kafka (~50K events/day),
route them by event type, and store them for downstream consumers.

Decompose this into sprint-sized tickets (2-3 days each). For each one, give:
- Title and description
- Acceptance criteria
- Rough size (S/M/L)
- Risks and dependencies
```

The output was remarkably good. The LLM identified work items I would have come up with — setting up Kafka topics, building the consumer, implementing the routing logic — but also ones I might have forgotten in a quick planning session, like schema migration scripts, monitoring dashboards, and rollback procedures.

The key insight: the LLM’s decomposition wasn’t perfect, but it was a much better starting point than a blank whiteboard. Instead of spending 30 minutes decomposing a ticket as a group, we’d spend 10 minutes reviewing and adjusting the AI’s decomposition. That’s a meaningful time savings when you’re planning 15-20 tickets per sprint.

Where it fell short: The LLM consistently underestimated work that involved coordination — anything that required talking to another team, getting access to a system, or waiting for a dependency. It would estimate “configure Kafka ACLs” as a small task when in reality it takes 3 days because you need to file a ticket with the platform team and wait. You always need to add your own organizational knowledge.
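That organizational-knowledge step can be made mechanical. Here's a minimal sketch of the adjustment we apply to LLM-suggested subtasks — the keyword list, day counts, and data shape are all illustrative assumptions, not from any real tool:

```python
# Assumed shape: each subtask is a dict with "title" and "estimate_days".
# Signals and day padding below are hypothetical examples of org knowledge
# the LLM can't have (e.g. ACL changes wait on the platform team's queue).
COORDINATION_SIGNALS = {
    "acl": 3,           # access changes go through the platform team
    "access": 3,
    "provision": 2,     # infra requests sit in another team's backlog
    "another team": 2,
}

def adjust_estimate(subtask: dict) -> dict:
    """Pad an LLM-suggested estimate when a subtask smells like coordination work."""
    text = (subtask["title"] + " " + subtask.get("description", "")).lower()
    extra = max(
        (days for kw, days in COORDINATION_SIGNALS.items() if kw in text),
        default=0,
    )
    return {
        **subtask,
        "estimate_days": subtask["estimate_days"] + extra,
        "flag": "coordination" if extra else None,
    }

# The LLM calls this a 1-day task; organizationally it's a 4-day wait.
print(adjust_estimate({"title": "Configure Kafka ACLs", "estimate_days": 1}))
```

The point isn't the specific keywords — it's that the review pass over the AI's decomposition is where your team's context gets injected, and that pass can be partially codified.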

2. Historical Velocity Analysis

I fed our last 6 sprints’ worth of completed tickets and their actual effort into an LLM and asked it to identify patterns:

```
Here's the last 6 sprints' worth of tickets with estimated vs. actual effort:
[pasted data]

Analyze:
1. What patterns do you see in estimated vs. actual story points?
2. What types of work do we tend to underestimate?
3. What's our velocity trend?
4. What finishes early, and what slips past planned?
```

Some interesting findings:

  • We consistently underestimated tickets tagged “infrastructure” by about 40%
  • Tickets involving database migrations always took longer than planned, with a pattern: the migration itself was fast, but testing it against production-size data was what blew the estimate
  • Our velocity had been declining slightly over the last 3 sprints — about 6-8% — which correlated with onboarding two new team members (makes sense, mentoring costs velocity)
  • Tickets estimated at 8+ story points had a 70% chance of not being completed in the sprint

Honestly, a good engineering manager should already know most of this from experience. But having it quantified and presented clearly was useful for the team discussion. When I shared these patterns, the team immediately started adjusting their estimates for infrastructure work.

The catch: The LLM was working with limited data (6 sprints ≈ 100 tickets). Its “patterns” sometimes reflected noise rather than signal. It flagged “frontend tickets are overestimated” based on a sample of 7 tickets, which isn’t statistically meaningful. You have to apply judgment to the analysis, not just accept it.
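You can build that judgment into the analysis itself by refusing to report a pattern below a minimum sample size. A sketch, assuming a flat list of tickets with a tag, an estimate, and an actual (the threshold of 15 is an arbitrary illustration):

```python
from collections import defaultdict

# Hypothetical cutoff: below this many tickets per tag, treat the "pattern" as noise.
MIN_SAMPLES = 15

def estimation_bias(tickets: list[dict]) -> dict:
    """Per-tag ratio of actual to estimated points, with small samples flagged."""
    by_tag = defaultdict(list)
    for t in tickets:
        by_tag[t["tag"]].append(t["actual"] / t["estimate"])
    return {
        tag: {"ratio": sum(r) / len(r), "n": len(r), "trust": len(r) >= MIN_SAMPLES}
        for tag, r in by_tag.items()
    }

# Toy data mirroring the findings above: infra underestimated ~40% across
# 20 tickets (a real signal), frontend "overestimated" across only 7 (noise).
data = (
    [{"tag": "infra", "estimate": 5, "actual": 7}] * 20
    + [{"tag": "frontend", "estimate": 5, "actual": 4}] * 7
)
report = estimation_bias(data)
```

With this guard, the infra finding survives and the 7-ticket frontend finding gets flagged before it reaches the team discussion.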

3. Dependency Detection

This is the experiment I was most excited about, and it turned out to be mediocre. I fed the LLM our upcoming ticket list and asked it to identify dependencies:

```
Here's our upcoming sprint ticket list:
[pasted tickets]

Identify:
1. Hard dependencies (X depends on Y being done first)
2. Soft dependencies (X and Y should be coordinated)
3. Potential conflicts (shared code/systems)
4. A suggested ordering
```

The LLM found obvious dependencies — things like “the API endpoint ticket depends on the database schema ticket.” But it missed the non-obvious ones that actually cause problems in practice. It couldn’t know that tickets A and C would both need to modify the same Terraform module, or that ticket B required a library upgrade that would affect ticket D’s test suite.

The dependencies that cause sprint problems are almost always the implicit ones — shared resources, unspoken assumptions, organizational bottlenecks. These aren’t in the ticket descriptions because nobody thinks to write them down. An LLM can only work with what you give it, and the important dependencies are usually the ones nobody thought to mention.

Verdict: Marginally useful as a first pass, but nowhere near replacing the “does anyone see any conflicts?” conversation in planning.
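For comparison, the explicit-dependency pass the LLM did well at doesn't even need an LLM — it's roughly equivalent to scanning ticket text for references to other tickets. A sketch with made-up ticket IDs and descriptions:

```python
import re

def explicit_deps(tickets: dict[str, str]) -> list[tuple[str, str]]:
    """Return (ticket, depends_on) pairs for ticket IDs mentioned in other tickets' text."""
    ids = set(tickets)
    deps = []
    for tid, text in tickets.items():
        for ref in re.findall(r"[A-Z]+-\d+", text):
            if ref in ids and ref != tid:
                deps.append((tid, ref))
    return deps

tickets = {
    "PIPE-12": "Add API endpoint; needs the schema from PIPE-11",
    "PIPE-11": "Create database schema for events",
    "PIPE-13": "Update Terraform module for the new service",
}
print(explicit_deps(tickets))  # finds PIPE-12 -> PIPE-11, nothing else
```

Note what it misses: the shared Terraform module between two tickets is invisible because neither description mentions the other — exactly the failure mode described above.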

4. Sprint Retrospective Summarization

At the end of each sprint, I started feeding the retro notes into an LLM:

```
Here are our retro notes from the past 4 sprints:
[pasted notes]

Identify:
1. Recurring themes
2. Action items that keep appearing without being completed
3. Trends in team sentiment
4. Suggested topics for the next retro
```

This was surprisingly useful — not because the insights were profound, but because it kept us honest about follow-through. The LLM flagged that “improve CI pipeline speed” had appeared as an action item in 3 consecutive retros without being completed. That’s the kind of thing you know intuitively but don’t confront directly. Seeing it spelled out made us actually prioritize it.

It also identified a sentiment trend I hadn’t consciously noticed: the team’s comments about code review turnaround had been getting progressively more negative over 4 sprints. That prompted me to look into it, and I found that our review queue had grown from an average of 4 hours to almost 2 days — a problem that was brewing but hadn’t hit crisis level yet.
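The follow-through check, at least, is simple enough to approximate deterministically. A sketch, assuming each retro's action items have been normalized into a set of strings (the streak threshold of 3 matches the CI-pipeline example above):

```python
def stale_action_items(retros: list[set[str]], min_streak: int = 3) -> set[str]:
    """Action items present in each of the last `min_streak` retros, i.e. never completed."""
    if len(retros) < min_streak:
        return set()
    return set.intersection(*retros[-min_streak:])

# Toy data: one item survives three consecutive retros untouched.
retros = [
    {"improve CI pipeline speed", "rotate on-call fairly"},
    {"improve CI pipeline speed", "document deploy process"},
    {"improve CI pipeline speed"},
]
print(stale_action_items(retros))  # {'improve CI pipeline speed'}
```

The LLM's real value here was fuzzier matching — "speed up CI" and "improve CI pipeline speed" count as the same item — plus the sentiment read, which a set intersection can't do.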

The Process We Landed On

After six months of experimentation, here’s what our sprint planning now looks like:

Before planning (async, ~30 minutes for me):

  1. Feed upcoming epics/tickets to the LLM for decomposition
  2. Run the velocity analysis on last sprint’s data
  3. Generate a draft sprint backlog with AI-suggested priorities

During planning (1 hour, down from 1.5-2 hours):

  1. Review AI-decomposed tickets (10 min) — adjust, merge, split as needed
  2. Group estimation with historical data visible (20 min) — “last time we estimated infra work at M, it took L”
  3. Dependency discussion (15 min) — AI-identified deps as starting point
  4. Commitment and assignment (15 min)

After sprint (10 minutes for me):

  1. Feed completion data to the LLM for trend analysis
  2. Prep retro summary

The time savings are real but modest. We probably save 30-45 minutes per sprint cycle, and more importantly, the quality of our planning has improved. Our sprint completion rate went from about 72% of planned points to about 83% over the six months — partly because of better decomposition, partly because the historical data made us more realistic about estimates.

What Didn’t Work At All

Story Point Estimation by AI

I tried having the LLM estimate story points for tickets. It was bad. Not bad in a “needs calibration” way — bad in a “fundamentally doesn’t understand what story points measure” way.

Story points aren’t a measure of effort or complexity in the abstract. They’re relative to your team’s capabilities, experience, and codebase knowledge. A ticket that’s a 2 for a senior engineer who wrote the module is an 8 for someone who’s never seen it. The LLM has no way to factor in team-specific context, and its estimates were essentially random when compared to our actual team estimates.

Don’t do this. Estimation is a team activity because the conversation is the point, not the number.

Automated Stand-up Summaries

I briefly tried having people post their stand-up updates in Slack and then using an LLM to summarize them and flag blockers. The summaries were fine, but nobody read them. The value of stand-up is the synchronous communication, not the information transfer. When we replaced the meeting with async updates plus an AI summary, blockers went unresolved for days because nobody felt urgency from a bot-generated summary.

We went back to real stand-ups within two sprints.

Priority Scoring

I asked the LLM to score ticket priority based on factors like user impact, technical debt, and strategic alignment. The scores were plausible but useless — they didn’t account for organizational politics, quarterly goals, or the fact that the VP of Product really cares about feature X regardless of its “objective” priority score.

Prioritization is a human decision. It involves tradeoffs that can’t be reduced to a formula.

Honest Assessment

AI has made our sprint planning incrementally better. The biggest wins are in areas that are tedious but not difficult — decomposition, pattern recognition in historical data, and keeping retro action items honest. These are tasks that benefit from thoroughness and consistency, which LLMs are good at.

The experiments that failed were the ones where I tried to use AI for tasks that require judgment — estimation, prioritization, and dependency detection that depends on organizational knowledge. Those are irreducibly human activities, and I’ve stopped trying to automate them.

If you’re thinking about trying this, start with ticket decomposition. It’s the highest-value, lowest-risk application, and it requires almost no setup — just paste the ticket into your LLM of choice. If that saves you time, try the historical analysis. Skip the estimation and prioritization experiments; I’ve already learned that lesson for you.

