Every engineering team has That Module. The one nobody wants to touch. The one where the original author left two years ago, the comments lie, and there’s a function called processData2_final_v3 that somehow handles both payment processing and email notifications.

We had one of those — a 3,400-line Python file that processed incoming data streams. It had been written under deadline pressure by someone who clearly knew what they were doing at the time, but the context had been lost. Tests were sparse. Documentation was a README that said “see Confluence” and the Confluence page was deleted.

I spent about three weeks using AI to help refactor this module. Here’s what worked, what went wrong, and what I’d do differently.

Step 0: Understand Before You Touch

The biggest mistake with legacy refactoring — with or without AI — is jumping straight to “let me clean this up.” You have to understand what the code does first, and legacy code resists understanding by design.

I started by feeding the file to an LLM in chunks:

Here's chunk 1 of a legacy Python module (lines 1-200). For each function:

1. What does it do?
2. What inputs does it expect and what does it return?
3. What side effects does it have?
4. What assumptions does it make?

Don't suggest improvements yet. Just help me understand it.

This was genuinely valuable. The LLM identified several things I’d missed on my initial read-through:

  • A global dictionary that was being used as a cache but never cleared (memory leak)
  • Two different error handling strategies used in the same file (one returned None, the other raised exceptions)
  • A subtle dependency on insertion order in a dict that only worked because Python 3.7+ guarantees it
  • Three functions that were defined but never called (dead code)
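The insertion-order dependency deserves a closer look, since it's the subtlest of the four. Here's a minimal illustration of the pattern (hypothetical names, not the actual module's code):

```python
# Hypothetical sketch of the insertion-order dependency the LLM flagged.
# Since Python 3.7, dicts preserve insertion order, so iterating over
# `handlers` runs validate before transform. Reorder the literal (or run
# on an ancient interpreter) and transform sees unvalidated events.

def validate(events):
    # Drop events missing an id.
    return [e for e in events if "id" in e]

def transform(events):
    # Relies on every event having an id -- crashes on unvalidated input.
    return [{"id": e["id"], "ok": True} for e in events]

handlers = {
    "validate": validate,    # must run first
    "transform": transform,  # depends on validate having run
}

def run_pipeline(events):
    for name, handler in handlers.items():  # order = insertion order
        events = handler(events)
    return events

print(run_pipeline([{"id": 1}, {"name": "no-id"}]))  # [{'id': 1, 'ok': True}]
```

Nothing in the code says "order matters here" — it just happens to work, which is exactly the kind of thing you want surfaced before refactoring.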

It also got some things wrong. It flagged a thread-local variable as “potentially shared state” when it was actually safe. And it misidentified a performance optimization as a bug — the code was intentionally batching writes in a non-obvious way.

The lesson: AI is good at pattern-matching and surface-level analysis, but it can’t tell you why code was written a certain way. The “why” requires domain context that isn’t in the code.

Step 1: Generate Characterization Tests

Before changing anything, I needed tests that captured the current behavior — right or wrong. Martin Fowler calls these “characterization tests.” The goal isn’t to test correctness; it’s to detect if your refactoring changes the behavior.

Here's a function from a legacy codebase: [function code]

Generate pytest tests that capture its CURRENT behavior, including:

- Normal inputs
- Edge cases (empty input, None, unusual types)
- Inputs that trigger errors
- Anything that looks unintentional

Name each test test_[function]_[scenario]_[expected_behavior]. Capture what the code actually does, not what it should do.

The LLM generated about 25 tests per major function. I ran them against the actual code, and about 18-20 would pass immediately. The ones that failed usually fell into two categories:

  1. The LLM misunderstood the code — it assumed a function would return a list when it actually returned a generator, or it got the error handling wrong.
  2. The LLM found actual bugs — the code’s behavior on certain inputs was clearly unintentional (returning None instead of raising, for example).

For category 2, I made a note of the bugs but wrote the tests to match the current behavior. The point of characterization tests is to preserve existing behavior during refactoring. You fix bugs separately, after the refactoring is stable.
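To make that concrete, a category-2 characterization test might look like this (the function and its bug are illustrative, not from the actual module):

```python
# Sketch of a characterization test that pins down current behavior,
# even where that behavior is probably a bug. Names are hypothetical.

def parse_amount(raw):
    # Stand-in for a legacy function: returns None on bad input
    # instead of raising, which is likely unintentional.
    try:
        return int(raw)
    except (TypeError, ValueError):
        return None

def test_parse_amount_non_numeric_returns_none():
    # Characterization: the CURRENT code returns None rather than
    # raising ValueError. Pin that down; fix it in a later PR.
    assert parse_amount("abc") is None  # TODO(bug): should this raise?

def test_parse_amount_numeric_string_returns_int():
    assert parse_amount("42") == 42
```

The TODO comment is the important part: the test documents that the behavior is suspect without changing it mid-refactor.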

After about a day of generating and fixing tests, I had 73 characterization tests covering the main code paths. Not perfect, but enough to refactor with some confidence.

Step 2: Extract and Name

The file had 47 functions. Many were doing multiple things. The refactoring started with extraction — pulling coherent blocks of code into well-named functions.

This is where AI was most helpful. I’d paste a long function and ask:

This function is 120 lines and does too many things. Identify:

- The distinct responsibilities in it
- Where you would draw the boundaries to split it
- A descriptive name for each piece
- What inputs and outputs each piece would need

Don't rewrite anything yet. Just suggest the decomposition.

The LLM’s suggestions were usually about 70% right. It was good at identifying logical blocks but sometimes drew the boundaries wrong — putting two related operations in separate functions when they really should have stayed together, or grouping unrelated operations because they happened to share a variable.

Here’s a concrete example. The original function:

def process_batch(events, config):
    # 120 lines of:
    # - input validation
    # - deduplication
    # - enrichment (adding metadata from external source)  
    # - transformation (converting formats)
    # - batching for output
    # - error handling throughout
    ...

The LLM suggested splitting into 6 functions. I ended up with 4:

def validate_events(events: list[dict], config: Config) -> list[dict]:
    """Remove malformed events and log warnings."""
    ...

def deduplicate_events(events: list[dict]) -> list[dict]:
    """Remove duplicate events based on event_id, keeping earliest."""
    ...

def enrich_and_transform(events: list[dict], config: Config) -> list[dict]:
    """Add metadata and convert to output format."""
    ...

def process_batch(events: list[dict], config: Config) -> BatchResult:
    """Main entry point: validate, dedup, transform, and batch events."""
    validated = validate_events(events, config)
    deduped = deduplicate_events(validated)
    transformed = enrich_and_transform(deduped, config)
    return batch_for_output(transformed, config.batch_size)

I kept enrich_and_transform as one function instead of the LLM’s suggested split because enrichment and transformation were tightly coupled — the transformation logic depended on the enrichment data, and splitting them would have required passing around intermediate state.
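The batch_for_output helper isn't shown above; here's a minimal sketch of what it might look like, assuming BatchResult simply wraps fixed-size chunks of the transformed events (both the dataclass shape and the chunking behavior are my guesses, not the module's actual code):

```python
# Hypothetical sketch of batch_for_output, assuming BatchResult just
# wraps consecutive fixed-size chunks of the event list.
from dataclasses import dataclass

@dataclass
class BatchResult:
    batches: list  # list of lists of event dicts

def batch_for_output(events: list, batch_size: int) -> BatchResult:
    # Slice the event list into consecutive chunks of batch_size;
    # the final chunk may be smaller.
    chunks = [events[i:i + batch_size]
              for i in range(0, len(events), batch_size)]
    return BatchResult(batches=chunks)

print(batch_for_output([1, 2, 3, 4, 5], 2).batches)  # [[1, 2], [3, 4], [5]]
```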

Step 3: Incremental Migration

This is the part where discipline matters more than tools. I refactored one function at a time, running the characterization tests after each change. The cycle was:

  1. Extract a function
  2. Run tests — all green? Continue
  3. If tests break, check if the behavior change is intentional
  4. Commit

I used the LLM to help write the extracted functions, but I never let it rewrite large sections at once. Every change was small, tested, and committed separately. My git log for this refactoring has 34 commits over three weeks.

The one time I broke this rule — letting the LLM rewrite a 200-line function in one shot — I spent an entire afternoon debugging a subtle change in behavior where the refactored code handled empty lists differently than the original. The tests caught it (thank god for characterization tests), but fixing it was painful because the rewrite changed too many things at once to pinpoint the issue.

When AI Rewrites Go Wrong

AI-generated rewrites of legacy code have a specific failure mode: they make the code look cleaner while subtly changing its behavior. Some examples I ran into:

  • Changing exception types — the original raised ValueError, the rewrite raised TypeError. Upstream code caught ValueError specifically.
  • Reordering operations — the original validated before deduplicating, the rewrite did the opposite. For our data, the order mattered because validation removed malformed records that would crash the dedup logic.
  • “Improving” None handling — the original returned None in certain error cases, the rewrite returned an empty list. Three callers checked if result is None specifically.
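The None-handling case is worth spelling out, because it's exactly the kind of change that sails through review. A minimal illustration with hypothetical names:

```python
# Illustration of the None-handling failure mode (hypothetical names).
# Original behavior: return None when the batch is invalid.
def load_batch_original(raw):
    if not raw:
        return None          # callers check `is None` explicitly
    return list(raw)

# "Cleaner" rewrite: returns [] instead. Looks nicer, silently breaks
# every caller that distinguishes "invalid" (None) from "valid but empty".
def load_batch_rewrite(raw):
    return list(raw or [])

def caller(raw, load):
    result = load(raw)
    if result is None:       # this branch never fires with the rewrite
        return "rejected"
    return f"processed {len(result)} events"

print(caller(None, load_batch_original))  # rejected
print(caller(None, load_batch_rewrite))   # processed 0 events
```

Both versions type-check, both look reasonable in isolation, and the diff reads like an improvement. Only a test that exercises the None path catches the difference.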

These are exactly the kind of bugs that are hard to catch in code review because the rewritten code looks better. Only the tests saved me.

Step 4: Documentation Generation

Once the refactoring was done and all tests passed, I used the LLM to generate documentation:

Here's the refactored module: [code]

Generate:

1. A docstring for each public function
2. A module-level docstring explaining what the module does
3. An architecture doc describing the data flow
4. A README with usage examples

Be accurate. Describe what the code actually does, not what it should do.

This was the most straightforward AI win. The generated documentation needed light editing — maybe 20% of it needed tweaking for accuracy — but it was infinitely better than writing it from scratch. The “architecture doc” was particularly useful; the LLM drew a data flow diagram (in text/ASCII) that became the README for the module.

Results

The refactored module went from 3,400 lines in one file to about 2,800 lines across 6 files (plus 1,200 lines of tests). The reduction in lines was modest — refactoring often adds code because you’re making things more explicit — but the improvement in readability was dramatic.

More importantly:

  • The memory leak was fixed (the global cache now has TTL expiry)
  • Test coverage went from ~15% to ~82%
  • Two engineers have since modified the module without asking me for help — the old version required a guided tour
  • A bug that had caused intermittent data inconsistencies was found during characterization testing
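For reference, the TTL fix can be sketched roughly like this (a minimal illustration; the real cache's keys, values, and eviction policy aren't shown in this post):

```python
# Minimal sketch of a TTL cache replacing the never-cleared global dict.
# Expired entries are evicted lazily on read. Illustrative only.
import time

class TTLCache:
    def __init__(self, ttl_seconds: float):
        self._ttl = ttl_seconds
        self._store: dict = {}  # key -> (value, expiry timestamp)

    def set(self, key, value):
        self._store[key] = (value, time.monotonic() + self._ttl)

    def get(self, key, default=None):
        entry = self._store.get(key)
        if entry is None:
            return default
        value, expires = entry
        if time.monotonic() >= expires:
            del self._store[key]  # evict stale entry instead of leaking it
            return default
        return value

cache = TTLCache(ttl_seconds=60.0)
cache.set("user:42", {"name": "example"})
print(cache.get("user:42"))  # value while fresh; None after 60s
```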

The whole thing took about 15 working days. Without AI assistance, I estimate it would have taken 22-25 days — most of the time savings came from characterization test generation and documentation. The actual refactoring was about the same speed because you can’t safely rush that part.

What I’d Do Differently

  1. Start with characterization tests, not comprehension. I spent two days understanding the code before writing tests. Next time I’d write the tests first — they force you to understand the code anyway, and you end up with something useful.

  2. Use smaller context windows. I initially tried to feed the entire 3,400-line file at once. Breaking it into focused chunks gave much better results from the LLM.

  3. Keep a “behavior log.” I should have maintained a document listing every intentional behavior I discovered, especially the non-obvious ones. This would have saved time when the LLM’s rewrites changed behavior and I had to figure out whether the original or the rewrite was correct.

  4. Don’t refactor and fix bugs simultaneously. I fixed two bugs during the refactoring and it made the test results ambiguous. Refactor first, fix bugs in a separate PR.

