I used to think of testing as the part of development you did because you had to. Write the feature, write the tests, move on. The tests were mostly there to catch regressions and keep CI green — a safety net, not a thinking tool.

That started changing about six months ago when I began experimenting with LLMs as part of my testing workflow. Not replacing tests with AI, to be clear — that would be insane. But using AI to think differently about what to test and how to test it.

Here’s what actually worked, what flopped, and where I’ve landed.

The Problem: We Were Testing the Wrong Things

Our team had solid test coverage — hovering around 78% line coverage, which looks respectable on paper. But we kept having production incidents caused by edge cases that none of our tests covered. The classic scenario: someone writes a feature, writes tests for the happy path and one or two obvious failure modes, and ships it. Three weeks later, a customer hits a weird combination of inputs nobody thought of, and things blow up.

The issue wasn’t that we weren’t testing. It was that we were testing predictably. Every engineer on the team (myself included) had the same blind spots — we’d test for the cases we could imagine, which tended to be the same cases every time.

Experiment 1: AI-Generated Edge Cases

The first thing I tried was asking an LLM to generate edge cases for functions I’d already written tests for. Here’s the kind of prompt I used:

Here's a function that processes event payloads, along with its existing tests: [function and tests]. What edge cases am I missing? For each suggestion, explain WHY it could fail and what makes it a useful test. Think about: malformed input, boundary conditions, timezone and timestamp handling, and concurrency.

The first time I ran this, I was genuinely surprised. The LLM suggested 11 additional test cases, and about 7 of them were things I hadn’t thought of. The best ones:

  • Empty string timestamps — our parser didn’t handle "" vs null differently, but downstream code assumed they were distinct
  • Timestamps exactly at midnight UTC — we had a >= vs > boundary bug that had been there for months
  • Payloads with Unicode characters in field names — our validation regex only handled ASCII
  • Concurrent writes to the same entity within the same millisecond — our dedup logic used timestamp comparison, which… yeah

That midnight UTC bug? It had caused a subtle data inconsistency we’d been investigating for weeks. A single AI-suggested test found it in five minutes.

The Downsides

Not everything the LLM suggested was useful. About 30-40% of its suggestions were:

  • Technically valid but practically impossible — like testing what happens when you pass a 50GB string as a timestamp. Sure, it could happen, but we have input validation layers upstream.
  • Already covered by integration tests — the LLM didn’t have context about our broader test suite.
  • Just wrong — occasionally the suggested test had incorrect expected behavior because the LLM misunderstood the business logic.

You have to filter. This isn’t a “run it and ship it” workflow — it’s more like brainstorming with a colleague who’s read a lot of code but doesn’t know your system.

Experiment 2: Property-Based Test Generation

This one was more interesting. Instead of asking for specific edge cases, I started asking the LLM to help me write property-based tests — tests that define invariants rather than specific input/output pairs.

# Traditional test
def test_process_event_normal():
    event = {"type": "click", "timestamp": "2025-11-01T10:00:00Z", "user_id": "u123"}
    result = process_event(event)
    assert result.status == "processed"
    assert result.user_id == "u123"

# Property-based test (AI helped me think of these properties)
from datetime import datetime
from hypothesis import given, strategies as st

@given(
    event_type=st.sampled_from(["click", "view", "purchase", "scroll"]),
    timestamp=st.datetimes(min_value=datetime(2020, 1, 1), max_value=datetime(2030, 1, 1)),
    user_id=st.text(min_size=1, max_size=50, alphabet=st.characters(whitelist_categories=("L", "N")))
)
def test_process_event_properties(event_type, timestamp, user_id):
    event = {
        "type": event_type,
        "timestamp": timestamp.isoformat() + "Z",
        "user_id": user_id
    }
    result = process_event(event)
    
    # Property 1: Processing should always return a result, never raise
    assert result is not None
    
    # Property 2: Output user_id should always match input
    assert result.user_id == user_id
    
    # Property 3: Processed timestamp should be within 1 second of input
    assert abs((result.processed_at - timestamp).total_seconds()) < 1
    
    # Property 4: Status should always be one of known states
    assert result.status in {"processed", "filtered", "error"}

The prompt I used was something like:

I have this function: [paste function]. What invariants should always hold, no matter the input? Think about mathematical properties, ordering guarantees, and idempotency.

This was genuinely valuable. The LLM helped me articulate properties I intuitively knew should hold but had never formalized. “The output user_id should always match the input” sounds obvious, but we’d never explicitly tested it — and it turns out there was a code path where it silently fell through to a default value.
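The failing code path looked roughly like this (a simplified sketch with hypothetical names, not our production code): a truthiness check treated an empty-but-present `user_id` as missing, so it silently became the default.

```python
DEFAULT_USER_ID = "anonymous"

def resolve_user_id(event: dict) -> str:
    # Bug sketch: `or` treats every falsy value as "missing", so an
    # empty-string user_id quietly becomes the default instead of
    # failing validation -- which violates the round-trip property.
    return event.get("user_id") or DEFAULT_USER_ID
```

The "output matches input" property fails the moment a generated input hits this branch, which is exactly the kind of silent fallback example-based tests rarely probe.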

Where This Broke Down

Property-based testing with AI help works best for pure-ish functions with clear inputs and outputs. For anything involving state, external services, or complex business logic, the LLM’s suggested properties were often either too weak (trivially true) or too strong (violated by valid edge cases it didn’t know about).

I also found that Hypothesis (the Python property-based testing library) has a learning curve that the LLM sometimes papered over. It would generate strategies that looked right but subtly didn’t match our data model, leading to flaky tests that took time to debug.

Experiment 3: Test-First AI Development

This is the one that stuck the most. Instead of writing code and then tests, I started asking the LLM to write tests first, based on a natural-language spec of what I wanted to build.

I need a function that deduplicates a list of events. Two events are duplicates if they share the same `(user_id, timestamp, event_type)`. Requirements:

  • Keep the first occurrence of each duplicate
  • Preserve input order
  • Handle up to 100k events without excessive memory use

Write a complete pytest test suite for this function. Don't write the implementation yet. Include edge cases.

The LLM would generate 15-20 tests covering normal cases, empty input, single element, all duplicates, mixed duplicates, large inputs, events with identical timestamps, and so on.

Then I’d implement the function to make the tests pass. The tests became the spec.
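To give a flavor of the output, here are a few of those generated tests, paraphrased, alongside a minimal reference implementation of my own (a sketch — the real function and its module aren't shown here):

```python
def deduplicate_events(events):
    # Minimal reference sketch: keep the first occurrence of each
    # (user_id, timestamp, type) key, preserving input order.
    seen = set()
    out = []
    for e in events:
        key = (e["user_id"], e["timestamp"], e["type"])
        if key not in seen:
            seen.add(key)
            out.append(e)
    return out

# A few of the AI-generated tests, paraphrased:

def test_empty_input():
    assert deduplicate_events([]) == []

def test_exact_duplicates_collapsed():
    e = {"user_id": "u1", "timestamp": "2025-11-01T10:00:00Z", "type": "click"}
    assert deduplicate_events([e, dict(e)]) == [e]

def test_order_preserved_across_users():
    a = {"user_id": "u1", "timestamp": "t1", "type": "click"}
    b = {"user_id": "u2", "timestamp": "t1", "type": "view"}
    assert deduplicate_events([b, a, b]) == [b, a]
```

Writing the tests first meant questions like "first occurrence or last?" got answered before any implementation existed, not discovered in review.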

What I liked about this approach: it forced me to think about the contract before the implementation, the AI-generated tests often included cases I’d have skipped writing myself, and having the tests first made the implementation faster because I knew exactly what “done” looked like.

What I didn’t like: sometimes the LLM’s understanding of the spec didn’t match mine, leading to tests that encoded the wrong behavior. You have to review the tests carefully, which partly defeats the time-saving argument. I’d say about 20% of AI-generated tests needed modification before they accurately represented what I wanted.

Experiment 4: Mutation Testing with AI Analysis

This was the most recent experiment and probably the most powerful combination. We run mutation testing with mutmut on critical code paths — it introduces small changes to your code and checks if your tests catch them. Surviving mutants mean your tests have gaps.
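A toy example of what a surviving mutant looks like (hypothetical code, not from our codebase): mutmut flips a `>=` to `>`, and if no test exercises the threshold exactly, the mutant survives — revealing a gap.

```python
FREE_SHIPPING_MIN = 50.0

def shipping_cost(order_total: float) -> float:
    # A mutant flipping `>=` to `>` here only changes behavior when
    # order_total is exactly at the threshold...
    if order_total >= FREE_SHIPPING_MIN:
        return 0.0
    return 4.99

def test_free_shipping_exactly_at_threshold():
    # ...so this boundary assertion is the test that kills it.
    assert shipping_cost(50.0) == 0.0
```

Each surviving mutant is effectively a machine-generated claim that "your tests can't tell these two programs apart" — the hard part is deciding which of those claims matter.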

The problem with mutation testing has always been the output. You get a big list of surviving mutants and it’s tedious to figure out which ones matter and what tests to write. So I started feeding the mutation reports to an LLM:

Here are 23 surviving mutants from our mutation testing run: [paste mutmut output]. For each one:

1. Is it a false positive (a mutation of logging, cosmetic strings, etc.)?
2. How would you prioritize it: HIGH (could cause a production bug), MEDIUM (code quality), LOW (cosmetic)?
3. What test would catch this mutant?

The triage alone saved me probably 2-3 hours. The LLM correctly identified that about 8 of the 23 were false positives (mutations in logging statements, cosmetic string changes, etc.) and prioritized the rest in a way that mostly matched my judgment. I disagreed with maybe 2-3 of its priority assignments, which is a reasonable hit rate.

What Changed in Our Numbers

After six months, here’s roughly where we are compared to before:

Metric                       Before             After              Notes
Line coverage                ~78%               ~84%               Modest improvement
Branch coverage              ~61%               ~73%               Bigger improvement here
Production edge-case bugs    3-4/month          1-2/month          Hard to attribute solely to testing
Time writing tests           ~30% of dev time   ~25% of dev time   Slightly faster
Test quality (subjective)    Mediocre           Noticeably better  Tests catch more real issues
The coverage numbers aren’t dramatic, but coverage was never the real problem. The improvement I actually care about is test quality — our tests now cover weirder, more realistic scenarios, and they’ve caught several bugs that would have made it to production under the old approach.

My Current Testing Stack

Here’s what I actually do now for any new feature:

  1. Write a natural-language spec of what the feature should do
  2. Ask an LLM to generate initial tests from the spec — review and fix them
  3. Implement the feature to pass the tests
  4. Ask the LLM for edge cases I might have missed
  5. Run mutation testing on the critical paths
  6. Feed surviving mutants to the LLM for triage and test suggestions

Steps 2 and 4 are where AI adds the most value. Steps 1 and 3 are still fully manual. Steps 5-6 I only do for high-stakes code.

Honest Assessment

AI hasn’t revolutionized my testing — it’s made it incrementally but meaningfully better. The biggest shift is psychological: I now think about testing as a conversation with a brainstorming partner rather than a solo checklist exercise. That change in mindset has been more valuable than any specific technique.

The tools still have real limitations. They don’t understand your system architecture, they sometimes suggest tests for impossible scenarios, and they can make you overconfident if you’re not careful about reviewing what they produce. The worst outcome would be a team that thinks “AI wrote our tests, so they must be thorough.” That’s exactly backwards — AI-generated tests need more scrutiny, not less, because they encode assumptions about your code that may not match reality.

But as a way to break out of your own testing blind spots? I’ve found it genuinely useful. Your mileage will vary, but it’s worth experimenting with.

