The Rise of the “AI Testing Shortcut”
Over the past year, I’ve observed a significant shift in the way software development teams operate. Developers who once dreaded writing tests now have a shortcut: just ask GitHub Copilot, Cursor, Claude Code, or another trendy AI tool to “generate unit tests”, and watch as your screen lights up with more green checkmarks than you know what to do with.
Living the good life, aren’t we? Managers breathe easier. Coverage reports look beautiful. Pull requests merge faster. All is well with the world.
Except, often, all is NOT well. Beneath the surface, something subtle but dangerous is happening: we are replacing validation with transcription. And the more we automate testing without understanding it, the more we risk turning our existing bugs into features we’ll proudly defend in meetings.
Self-Fulfilling Prophecies
The problem begins with how these LLM-based tools actually “test” your code.
When you ask an AI model to generate unit tests, it almost always starts by analyzing the code you’re asking it to test. It reads your function signatures, interprets variable names, infers likely branches, and then produces test cases whose expected outputs match whatever your current code does. This sounds reasonable until you realize what it means: your tests now validate the implementation, not the intention. Unlike some of you super coders out there, one of the primary reasons I write tests is that I’m certain my code is almost always partially incorrect. The idea of molding tests around my code makes me shiver with dread. The point I’m trying to make is this: if your code is wrong, then tests derived from it will faithfully reflect that wrongness.
Here’s a simple example:
def divide(a, b):
    if b == 0:
        return 0  # bug: should raise an error
    return a / b
An LLM may generate this “unit test”:
def test_divide_by_zero():
    assert divide(10, 0) == 0
This looks like a legitimate test. It even passes! 😎
However, the “test” does not verify correctness. It simply characterizes a buggy implementation. The model didn’t know that the intent was to raise an error; it just mirrored what it saw and transcribed it into a different form. It’s like asking a student to grade their own essay with a rubric that they wrote themselves. You’ll always get an A. But you’ll never learn anything new.
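For contrast, here is what an intent-driven test might look like, assuming the requirement is that dividing by zero should raise an error. (The exact exception type, ZeroDivisionError, is my assumption; the spec might call for ValueError instead.)

import pytest

def test_divide_by_zero_raises():
    # Intent: dividing by zero is an error, no matter how divide()
    # is implemented. divide here is the function defined above.
    with pytest.raises(ZeroDivisionError):
        divide(10, 0)

Run against the buggy implementation above, this test fails immediately, which is exactly the point: it encodes intent, so it exposes the bug instead of enshrining it.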
Why Lying Tests Feel So Good
Part of the danger lies in how satisfied AI-generated tests make you feel. You see 90% coverage in seconds. Your manager sees metrics moving in the right direction. The team feels productive. Best software development team ever! 💃🏾
But it’s all an illusion. The tests aren’t exploring edge cases, enforcing contracts, or asserting business rules. They’re simply replaying what the code already does. This creates a feedback failure: the critical loop between intent (what should happen) and observation (what actually happens) quietly breaks amid the blinding green glow of checkmarks and the dizzying ecstasy of high code coverage. We get the form of testing without most of its substance.
When all of this happens at scale, across thousands of lines of AI-generated tests, we will have successfully built the perfect machine for confirming our own mistakes faster than ever before.
AI-Generated Tests Are Useful Sometimes
Now, to be fair, there is one legitimate use for AI-generated tests based on implementation: characterization testing.
Characterization testing is a term coined by Michael Feathers in Working Effectively with Legacy Code. Characterization tests are designed not to verify correctness, but to capture current behavior, especially when dealing with legacy systems that lack tests.
Suppose you have a thousand-line function that nobody dares to touch. Letting an LLM auto-generate tests for it can be incredibly helpful: those tests freeze the current behavior in place, allowing you to refactor with some safety. Just remember that these tests don’t tell you whether the system is right. They are snapshots of current behavior and safety nets for refactoring or other code changes.
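To make the distinction concrete, here is the earlier divide-by-zero assertion again, this time labeled for what it actually is. The code is identical; only the claim we attach to it changes:

def test_divide_by_zero_characterization():
    # Characterization test: records what divide() currently does (returns 0),
    # not what it should do. It serves as a safety net while refactoring;
    # it makes no claim that returning 0 is correct.
    assert divide(10, 0) == 0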
Unfortunately 😥, this is not how most teams use AI-generated tests today. They’re not using them to document old code; they’re using them to replace the thinking required for testing new code.
And therein lies the danger!
The Better Way: Using AI for Intent
AI can absolutely be a force multiplier in testing when we anchor it in intent rather than implementation.
Here are a few practical shifts that make all the difference:
- Generate tests before you write the code. Treat the LLM like a collaborator in Test Driven Development (TDD). Give it your requirements, your assumptions, or your acceptance criteria. Ask: “Given this description, what test cases would you write?”
- Ask it for failure modes, not success paths. Don’t say “write tests for this function”. Say “how could this function break?” or “what edge cases might this logic miss?”.
- Use it for creativity, not confirmation. LLMs can help brainstorm boundary conditions, equivalence partitions, and fuzz inputs you might not think of, all of which expand coverage beyond what your code currently does (see the property-based sketch after this list).
- Measure test quality with mutation testing. Mutation testing frameworks (like MutPy or PIT) intentionally inject small bugs into your code to see whether your tests catch them. These buggy versions of your code are referred to as “mutants.” If the mutants survive, your tests aren’t strong enough. Use AI to propose new tests that kill those mutants (a hand-rolled illustration of the idea follows below).
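To illustrate the creativity point, here is a minimal property-based sketch using the hypothesis library (my choice of tool and property, not a prescription). Instead of hard-coding expected outputs, it states a relationship that should hold for any input and lets the framework hunt for counterexamples against the divide function from earlier:

import pytest
from hypothesis import given, strategies as st

@given(a=st.integers(-10**6, 10**6), b=st.integers(-10**6, 10**6))
def test_divide_properties(a, b):
    if b == 0:
        # Intent, not current behavior: dividing by zero should raise.
        # (ZeroDivisionError is an assumed choice of exception.)
        with pytest.raises(ZeroDivisionError):
            divide(a, b)
    else:
        # Property: the quotient times the divisor recovers the dividend,
        # within floating-point tolerance.
        assert divide(a, b) * b == pytest.approx(a)

Against the buggy divide above, the zero-divisor case fails as soon as hypothesis tries b = 0, which it typically does early. That is precisely the feedback the implementation-mirroring test never gave us.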
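And to show what mutation testing buys you, here is a hand-rolled sketch of the idea (not how MutPy or PIT are actually driven, just the concept in plain Python). We create a “mutant” of divide by swapping one operator and check whether the test suite notices:

def divide_mutant(a, b):
    # A "mutant": identical to divide() except the division has been
    # swapped for multiplication (the kind of tiny, deliberate bug
    # a mutation-testing tool injects automatically).
    if b == 0:
        return 0
    return a * b

def run_suite_against(impl):
    # Stand-in for what a mutation tool does: run the existing tests
    # against the mutated code and see whether anything fails.
    failures = 0
    try:
        assert impl(10, 0) == 0   # implementation-mirroring test: still passes
    except AssertionError:
        failures += 1
    try:
        assert impl(10, 2) == 5   # intent-based test: fails, killing the mutant
    except AssertionError:
        failures += 1
    return "mutant killed" if failures else "mutant survived"

print(run_suite_against(divide_mutant))  # -> mutant killed

Drop the intent-based assertion and the mutant survives. A surviving mutant is the signal that your suite is weaker than your coverage number suggests.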
When you reframe AI as an assistant for thinking, not a substitute for it, you restore what testing was always meant to be: a conversation between intent and implementation.
Don’t Automate Judgment
There’s a larger pattern here beyond testing.
Across software engineering, we’re learning that AI excels at generating artifacts, but not at making judgments. It can fill in boilerplate, write code, or even mimic style. But it cannot understand why something is correct, safe, or aligned with user intent.
And yet, people are increasingly outsourcing judgment, the one thing that can’t be automated yet. When we let AI write tests that only mirror our code, we’re effectively saying:
“I no longer need to think about what correctness means.” 🤡
That’s not productivity. It’s abdication of responsibility.
Rediscovering the Purpose of Tests
At its core, testing was never about coverage metrics or automation frameworks. It’s about expressing intent in a clear and sometimes executable form.
A good test says: “Here’s what should happen, no matter how it’s implemented.”
A bad test says: “Here’s what happens, so let’s call that correct.”
LLMs can help us write more of the good kind, but only if we remember the difference. So the next time someone proudly says, “We’ve started using AI to generate all our tests,” pause before celebrating. Ask: “Generating them from what?” If the answer is “from our code,” then what you have is not testing. It’s transcription. And transcription doesn’t protect you; it only presents your bugs back to you in a form you’re unlikely to recognize.
Closing Thought
Ok. It’s been quite the rant. And I’m getting sleepy. Let’s wrap this up. 😴
The future of software engineering does not belong to those who automate everything. Instead, it belongs to those who automate the right things and keep thinking about the rest. LLMs can amplify good engineering discipline, but they cannot replace it. Our job as engineers is to understand what “correct” means before we ask anyone, or anything, to check it for us.