Eight months ago, a senior engineer on my team spent a week with a 12,000-line file nobody fully understood. Three authors, five years, no tests, inconsistent naming, and a comment at the top that said “here be dragons.”
He tried using AI to help understand it. The AI was confidently, fluently, badly wrong about what half the code was doing.
That failure taught me more about AI and legacy code than any success since.
Table of Contents
- Why Legacy Code Is Hard for AI
- Where AI Actually Helps
- Prompts That Work
- Where AI Makes Things Worse
- The Refactoring Trap
- A Framework for Deciding
- Failure Cases I’ve Seen
Why Legacy Code Is Hard for AI
Legacy code has a gap between what it says and what it actually does. Comments lie. Variable names mislead. Functions handle seven cases based on implicit state set elsewhere. Hacks that were “temporary” in 2019 are now load-bearing.
Working with this code requires building a mental model that reflects actual behavior, not surface-level structure. That takes forensic work through git history, careful reading, and usually some runtime verification.
AI tools are trained to be helpful and plausible. When they don’t understand something, they often generate a confident explanation that’s superficially reasonable but subtly wrong. On legacy code, subtly wrong can send you down bad paths for days.
Where AI Actually Helps
Understanding unfamiliar patterns or languages. General pattern questions, not “what does this specific function do.” “What is this C callback convention doing?” — fine. “Explain this 500-line function” — risky.
Generating documentation for code you already understand. High-ROI use case. Once you’ve read and understood a function, AI generates the docstring faster than you write it. Key constraint: you already understand it. AI reduces typing, not comprehension.
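Concretely, this is the kind of docstring worth having AI draft once you've already read the code. The helper and its quirk below are made up for illustration, but the shape is typical: the docstring records verified behavior, including the ugly part.

```python
def normalize_ids(raw_ids):
    """Deduplicate and sort numeric IDs.

    Accepts any iterable of ints or numeric strings. Non-numeric
    entries are silently dropped -- verified behavior that existing
    callers rely on, so the docstring states it explicitly instead
    of pretending the function validates its input.
    """
    out = set()
    for item in raw_ids:
        try:
            out.add(int(item))
        except (TypeError, ValueError):
            continue  # legacy behavior: skip junk rather than raise
    return sorted(out)
```

The AI can produce the prose in seconds once you hand it the facts; the part it can't do is know that the silent drop is intentional.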
Writing tests for behavior you’ve verified. You’ve run the code, you know what it does, you want coverage. AI generates test structure quickly. You verify assertions match actual behavior.
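The workflow in miniature: run the code, write down what it actually did, then let AI scaffold tests that assert only those observations. The function here is a hypothetical stand-in for a legacy helper; the point is that every assertion traces to a verified run.

```python
# Stand-in for a legacy function. The behaviors asserted below were
# confirmed by running the code, not inferred from its name.
def parse_timeout(value, default=30):
    # Verified quirk: empty string means "use the default", not an error.
    if value == "":
        return default
    return int(value)

# Tests pin only verified behavior. No test for a case I haven't run --
# otherwise I risk enshrining a guess as the spec.
def test_empty_string_uses_default():
    assert parse_timeout("") == 30

def test_numeric_string_parses():
    assert parse_timeout("15") == 15
```

Notice what's missing: no test for `None`, floats, or whitespace, because I haven't verified those paths. That restraint is the whole technique.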
Translating between formats. Configs, data schemas, API shapes — mechanical transformations. “Here’s a JSON schema, generate TypeScript types” — fine. “Here’s a function that modifies this schema, explain the output” — verify carefully.
Prompts That Work
For understanding without assuming:
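The exact wording matters less than the constraint; a version I use looks like this:

```
Walk through this function and explain only what you can determine
from the code alone. For anything that depends on state set elsewhere,
external calls, or data you can't see, say "cannot determine from this
code" instead of guessing. Finish with a list of every assumption
you made.
```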
This forces the AI to be honest about uncertainty instead of generating a plausible-but-wrong explanation.
For documentation, tell it exactly what the function does and ask for a docstring — don’t let it infer. For tests, list the exact behaviors you’ve verified and tell it not to add more. The common thread: constrain what AI is allowed to infer. Narrower scope = more reliable output.
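As an example of that constrained shape, a documentation prompt might read like this (the function and the listed behaviors are illustrative):

```
Here is a function. I have verified it does exactly this:
- Returns the cached value when `refresh` is falsy.
- Raises KeyError on a missing key; it never returns None.
Write a docstring that describes only these behaviors.
Do not infer or describe anything I haven't listed.
```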
Where AI Makes Things Worse
Explaining complex logic without context. AI reads code, forms a model, and applies it confidently — even where the model is wrong. Fluent explanations are persuasive. The engineer following a false trail for two days doesn’t know it’s false until they’re deep in it.
Writing tests based on what code says, not what it does. AI writes tests that match surface behavior — which may not be intended behavior. A test that confirms a bug blocks the bug from being fixed.
Generating summaries of large modules. “This module handles X, Y, and Z” — when it actually only handles X and Y, with Z triggered as a side effect elsewhere. Partial correctness is more dangerous than being obviously wrong.
The Refactoring Trap
The pitch: “Let’s use AI to modernize this legacy module.” Feed it old code, ask it to rewrite in current patterns.
The output looks great. Clean. Modern conventions. Docstrings. Sometimes even tests.
The problem surfaces two weeks later in production. Something that worked in the old code — some edge case, some implicit assumption, some accidental correctness — doesn’t work in the AI rewrite. Because the rewrite looks so clean, it takes a while to believe the bug is in the new code.
Rule: don’t use AI to rewrite legacy logic you haven’t fully understood and tested.
Use AI to help you understand the logic (carefully). Use AI to document it after you understand it. Use AI to write tests for verified behavior. Do the actual rewrite yourself, with AI as a pair programmer where you’re driving.
This connects to AI Database Migration Cautionary Tale — AI-assisted migrations fail when the human stops being the one who understands what’s happening.
A Framework for Deciding
Before using AI on legacy code, ask three questions:
Do I already understand this, or am I using AI to avoid understanding it? If the latter, slow down.
Is there test coverage that would catch behavior changes? If no, don’t use AI to modify the code.
Is this mechanical or logical? Format conversions, documentation, boilerplate — mechanical, AI is useful. Business logic, state management, complex conditionals — logical, AI is risky.
All three green: use AI, save time. Any red: proceed carefully or not at all.
Failure Cases I’ve Seen
The confident wrong explanation. AI explained a function based on its name and structure. The name was misleading. Two days of debugging later.
The clean rewrite that broke production. Modern, readable code. Missing a 7-year-old workaround for a downstream service quirk that nobody knew was there until it wasn’t.
The test that tested the wrong thing. AI wrote tests for a legacy auth function. Tests passed. Function had a flaw the tests didn’t cover — they tested the obvious path, not the edge cases that mattered.
The right mental model: AI is a very fast reader who summarizes what they’ve read, not an expert who understands what the code is doing. A fast reader can tell you what the function says. They can’t tell you what it means in context, what assumptions it relies on, or what happens when those assumptions break.
For more on where AI fits in broader engineering workflow, see AI Development Workflow and AI Pair Programming Guide.