If you’ve spent any time online lately, you’ve seen the videos: an engineer using Cursor or Claude Code to scaffold a full-stack application in 20 minutes. Authentication, database migrations, API endpoints, all generated with a few prompts. The demos are genuinely impressive.
But here’s a scenario you won’t see in those videos: an engineer spending two days tracing why a feature flag behaves differently in production than in staging. The root cause? A subtle interaction between middleware, a config override introduced three years ago, and an undocumented database index that affects query timing. Very few demos showcase this kind of work because it’s slow, context-heavy, and difficult to compress into a flashy 2-minute video.
Both scenarios represent real engineering work, but they’re fundamentally different problem spaces. As organizations rush to adopt AI-assisted software development, expectations are being shaped by the work that’s easiest to demonstrate rather than the work that’s most common in mature systems. Tools like GitHub Copilot, Cursor, and Claude Code can turn vague ideas into working prototypes in minutes. These tools are spectacular when the canvas is blank: new project, clean abstractions, modern tech. But for most of us, the canvas isn’t blank. It’s layered, brittle, and full of half-forgotten history.
To be clear: AI coding tools are genuinely transformative in the right contexts. I use them daily. They’ve made me significantly more productive. I’m now able to churn out personal projects that I’ve dreamed of for almost a decade. The key is knowing where and how to use AI coding tools effectively.
Two Worlds of Software: Greenfield vs. Legacy
Every experienced software engineer knows the difference between greenfield projects and legacy systems. Unfortunately, this difference is rarely explicitly acknowledged in how we talk about AI coding tools and productivity.
Greenfield Work: Where Generation Shines
Greenfield projects start fresh with no historical baggage. You design clean abstractions, choose modern frameworks, and build coherent patterns. The kinds of boilerplate-heavy tasks required by greenfield projects — REST endpoints, schema definitions, basic authentication — benefit enormously from automated generation.
Ask an AI coding tool to:
- Scaffold a new backend service
- Translate Python to TypeScript
- Write tests for a predictable flow
- Stub out a REST API
And it will likely do it instantly, with idiomatic code, recognizable patterns, and high confidence. Startups and R&D teams see these productivity gains first because their systems are new, cohesive, and built with modern patterns. They live in the same world that the AI model was trained on.
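For instance, here’s roughly the kind of stub these tools produce effortlessly on a blank canvas. This is a minimal sketch using FastAPI; the framework and names are illustrative, not prescriptive:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class TaskIn(BaseModel):
    title: str
    done: bool = False

# In-memory store, purely for the sketch.
tasks: list[dict] = []

@app.post("/tasks")
def create_task(task: TaskIn) -> dict:
    record = {"id": len(tasks) + 1, "title": task.title, "done": task.done}
    tasks.append(record)
    return record

@app.get("/tasks")
def list_tasks() -> list[dict]:
    return tasks
```

Idiomatic, well-trodden, and exactly the kind of pattern the model has seen thousands of times.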
Legacy Systems: Where Comprehension is Critical
Legacy systems are an entirely different world. They’re thick with overlapping layers of semi-dysfunction: no automated tests, a half-finished migration, a function rewritten twelve times, a workaround that unintentionally became a feature, code that may have been written before you were born, and a codebase no single person fully understands. To make a safe change, you have to spend hours digging, tracing, and testing. Working in such a codebase is more about comprehension than creation.
Imagine asking an AI model to:
- Add a feature to a 10-year-old monolith with inconsistent naming conventions across different modules.
- Debug a race condition that only surfaces under specific production load patterns.
- Refactor a core utility that 47 other modules explicitly and implicitly depend on.
- Investigate why the same API call returns different results when invoked from a background job versus the main request handler.
Most of us spend significant engineering time on problems like these. AI struggles here for several interconnected reasons. Solving these problems requires tracing through chains of function calls, decorators, configuration files, and runtime behaviors.
Here’s a concrete example from my own work. I was investigating why a null error occurred only when certain payments failed. My favorite AI coding tool suggested adding a null check 🙄, a surface-level fix that clearly didn’t address the actual problem. The real problem was that the external payment API my code interacts with returns a different response structure when a payment fails in a specific way than when it succeeds. Because my code assumed the response structure would always be the same, it tried to access attributes that don’t exist in the failure response.
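Here’s a simplified sketch of that failure pattern, with hypothetical names (the real code talks to the payment API through an abstraction layer):

```python
# Hypothetical, simplified reconstruction of the bug; names are made up.

def charge(order_id: str) -> dict:
    # Success shape: {"transaction": {"id": "...", "status": "succeeded"}}
    # One failure mode returns a different shape entirely:
    return {"error": {"code": "card_declined"}}

def record_payment(order_id: str) -> str:
    resp = charge(order_id)
    # The buggy assumption: every response contains a "transaction" object.
    # On this failure path resp.get("transaction") is None, so the access
    # below blows up. A null check here would silence the symptom without
    # addressing the real issue: the response contract changes on failure.
    return resp.get("transaction")["id"]

record_payment("order-123")  # TypeError: 'NoneType' object is not subscriptable
```

In a case like this, the useful change is to handle the failure shape explicitly, not to guard a single attribute access.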
This is not a particularly complex example. But notice how finding the real problem requires understanding the external API’s behavior in different situations, the abstraction that encapsulates interactions with the API, and the code that handles the API’s response? No single file contains this knowledge. The AI can read each file individually, but it can’t reconstruct the system’s behavior, especially when crucial context exists outside the code entirely.
Built to Generate, Not to Comprehend
I don’t yet know enough about AI to draw any definite conclusions. However, I suspect that the behavior of AI coding tools in legacy contexts is closely tied to how the underlying models are built. Here’s something worth pausing on: we call these systems Generative AI for a reason.
Large Language Models (LLMs) are designed to produce text, code, or images that fit familiar patterns. They take patterns from training data and recombine them into new, plausible outputs. That’s what “generative” means. The illusion of “understanding” they sometimes project is a byproduct of generation, and a means to it, rather than the goal of their design. So they learn to speak code fluently, but not to reason about it as deeply as a human would, especially about the consequences of their actions. This is why LLMs flourish in creation and stumble in comprehension. They were never trained to ask the questions that legacy systems demand:
Why was this built this way? What are the first- and second-order effects of this change? What hidden contract does this function rely on?
For LLMs to work effectively with legacy code 😉, they would need some combination of the following capabilities:
- Trace execution paths through a call stack the way a debugger does (a minimal sketch of what that means follows this list).
- Maintain some model of mutable state across files.
- Reconstruct the intent behind architectural decisions from scattered historical context.
- Precisely simulate runtime behavior under specific edge cases.
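To make that first capability concrete, here’s a minimal sketch of debugger-style execution tracing in Python using the standard sys.settrace hook. This is runtime signal; a model reading source files sees none of it:

```python
import sys

def tracer(frame, event, arg):
    # Print every Python function call with its location: the actual runtime
    # path, roughly what a debugger's step-into view shows, not a guess made
    # from reading the source.
    if event == "call":
        code = frame.f_code
        print(f"call {code.co_name} ({code.co_filename}:{frame.f_lineno})")
    return tracer

def helper(x):
    return x * 2

def entry(x):
    return helper(x) + 1

sys.settrace(tracer)
entry(3)            # prints the call to entry, then the call to helper
sys.settrace(None)  # stop tracing
```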
Do you know what LLMs do instead? They pattern-match.
“I’ve seen code like this before, and based on statistical patterns in my training data, the next token is usually X.” 🤡
In well-defined and familiar problem spaces like building a REST API or setting up basic authentication, pattern-matching works beautifully. But legacy systems present a different challenge entirely. You need causal reasoning (“If I change this, what breaks?”), historical knowledge (“why was this done?”), and runtime simulation (“what happens when…?”).
When we use AI coding tools in legacy contexts, we’re asking a generation engine to do comprehension-heavy work. Sometimes it succeeds through increasingly sophisticated pattern matching and reasoning. Most of the time, especially in the most complex legacy contexts, it struggles. It turns out that most software engineering work is interpretive, not generative. In other words, most software engineering work comprises understanding, maintaining, and safely evolving what already exists.
Thankfully, AI models are getting better every day. Research continues to push the boundaries of what’s possible. But as far as I can tell today, in practice, AI models excel at generation but struggle with comprehension. Greenfield work rewards generation; legacy work demands comprehension.
The Different Levels of AI Effectiveness
Not all legacy work is equally challenging for AI. The spectrum ranges from tasks where AI provides moderate assistance to scenarios where it actively struggles.
Greenfield features in legacy systems sit at the moderate end of the effectiveness scale. AI can generate new code well, but integration points often require human navigation of existing patterns and implicit contracts. AI can generate a new API endpoint easily enough, but you need human judgment to understand where to wire the endpoint into the authentication flow, how to handle the three different error-reporting mechanisms already in use, and which caching layer to respect.
Bug hunting across modules is where AI’s magic starts to fade. Static analysis is not sufficient; the work demands building a mental model of runtime behavior across files and layers. As I alluded to earlier with the example from my own work, the gnarliest bugs aren’t in any one place; they live at intersection points where components interact under specific conditions.
Refactoring with implicit dependencies is even harder. AI can’t reliably predict what will break until tests fail, or worse, until production breaks. The challenge is that legacy systems often lack tests. Codebases are rife with implicit dependencies: functions that appear unused but are actually called via reflection, modules that have side effects when imported, or seemingly independent components that must be initialized in a specific order.
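Here’s a small, hypothetical illustration of why “unused” is a dangerous conclusion in such a codebase: the function below has no static call site, yet it’s reachable at runtime via reflection.

```python
import importlib

# Hypothetical: the handler name comes from configuration, not from code,
# so nothing in the repo calls export_legacy_report directly.
HANDLER_NAME = "export_legacy_report"

def export_legacy_report(rows):
    # Looks dead to static analysis (and to a model reading this file).
    return "\n".join(",".join(map(str, r)) for r in rows)

def run_handler(handler_name, rows):
    module = importlib.import_module(__name__)  # resolve this very module
    handler = getattr(module, handler_name)     # name resolved only at runtime
    return handler(rows)

if __name__ == "__main__":
    print(run_handler(HANDLER_NAME, [(1, "a"), (2, "b")]))
```

Rename or delete export_legacy_report during a refactor and nothing complains until the configured job runs.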
Migrating old patterns to new ones falls somewhere in between. AI is quite good at mechanically transforming code, but it often misses edge cases, business-logic nuances, and the “why” behind existing implementations. Old patterns might look outdated, but they usually persist for reasons that aren’t always evident in the code itself.
Reading and documenting legacy code is an area where AI provides moderate value. It can summarize what code does, but often misses why it exists or why it was implemented a particular way. It turns out that the reasons why something was done a specific way are frequently the most important things to document.
Why the Industry Narrative Misses Key Context
The Visibility Gap
The public conversation around AI coding tools is shaped by what’s easiest to demonstrate. Impressive demos show AI building apps from scratch in minutes. Startup success stories feature companies with entirely greenfield codebases. Benchmark results measure code-generation speed and correctness on isolated problems. Blog posts and case studies highlight dramatic productivity gains in specific, bounded contexts.
What’s much less visible is the reality that engineers at mature companies spend the majority of their time understanding existing systems before writing new code. There’s cognitive overhead in validating AI suggestions against years of accumulated edge cases and implicit business rules that exist only in institutional memory. There are hidden costs when AI-generated code passes initial review but creates maintenance burdens or subtle bugs that surface weeks, months, or years later. Engineers end up spending time fighting with AI tools that confidently suggest solutions that won’t work in the specific context of their legacy system.
This limited visibility of legacy challenges relative to greenfield success isn’t anyone’s fault. Greenfield work is easier to demonstrate and faster to quantify. Legacy work is messy and hard to showcase in a 2-minute demo or a benchmark dataset.
The Benchmark Mismatch
Most public benchmarks for AI coding tools emphasize code generation (e.g., HumanEval, MBPP). Real-world bug-fixing benchmarks do exist (e.g., SWE-bench and its Verified and Live variants, Defects4J, and BugsInPy), but none fully capture production complexity such as environment-specific race conditions or undocumented coupling. So far, there don’t seem to be benchmarks that adequately assess AI models on tasks like “trace why this race condition occurs in production but not in staging,” “refactor this module without breaking 42 undocumented dependencies,” or “explain why this architectural decision was made and whether it still makes sense.”
Learning as We Go
We’re still in the early stages of understanding how AI tools fit into different software engineering contexts. Organizations adopting these tools are experimenting in real-time, and initial expectations still appear to be shaped predominantly by the most visible success stories. The opportunity now is to refine our mental models and adopt AI more strategically by matching tools to the actual work, and not the idealized work we see in demos. This means being honest about the work breakdown (greenfield vs. legacy) in our organizations and adjusting expectations accordingly.
The Shift We Actually Need (And Should Demand)
The current generation of AI coding tools is impressive, but it’s just the beginning. As the technology matures, we have an opportunity to build tools designed explicitly for comprehension-heavy legacy work.

What we need are tools that help engineers deeply understand existing code, not just write new code. Imagine tools that can:
- Visualize dependencies: not just suggest code changes, but show you “this function is called from 23 places across 8 files, here’s the control flow for each invocation context, and here’s what would break if you change the function signature” (a toy sketch of this follows the list).
- Integrate with git history, issue trackers, and documentation to answer questions like “why was this pattern used? What GitHub issue or design doc motivated this decision? What constraints was the original author working under?”
- Explain runtime behavior by simulating code paths with specific inputs, showing system state at each step, and answering questions like “What database queries ran? Which external services were called? Where was the data transformed?”
- Analyze a codebase and surface findings such as “you have three different approaches to authentication, here’s where each is used, and here are the subtle differences in behavior that look unintentional.”
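As a toy illustration of the first capability, here’s a sketch of call-site discovery using Python’s ast module. A real tool would work repo-wide, resolve imports, aliases, and decorators, and layer runtime information on top; this only scans one source string for direct calls:

```python
import ast

def find_call_sites(source: str, func_name: str) -> list[int]:
    """Return line numbers where func_name is called (direct calls only)."""
    sites = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            callee = node.func
            name = callee.id if isinstance(callee, ast.Name) else getattr(callee, "attr", None)
            if name == func_name:
                sites.append(node.lineno)
    return sorted(sites)

code = """
def charge(order): ...
def retry(order): charge(order)
def checkout(order):
    charge(order)
"""
print(find_call_sites(code, "charge"))  # [3, 5]
```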
Some of these capabilities are already emerging. The key is shifting the optimization from “generate plausible code” to “help engineers build accurate mental models of complex systems”.
Matching Tools to the Work
AI coding tools are a genuine leap forward in software engineering. However, technology alone won’t solve all our problems. Organizations need to adjust expectations and practices. We need to be thoughtful about where and how we deploy AI coding tools.
For Engineering Leaders
Understanding the actual breakdown of work in your organization is the first step. Survey your teams: how much time is spent on greenfield work versus legacy maintenance? Set expectations accordingly. If your team spends 70% of their time on legacy systems and 30% on greenfield projects, you probably shouldn’t expect AI to provide 10x productivity gains across the board. A more realistic expectation might be significant gains on that 30% of work that is greenfield, with more modest but still valuable assistance on the other 70%. Setting realistic expectations starts with stopping the practice of measuring AI effectiveness solely by greenfield benchmarks or startup success stories. We need to set different productivity expectations for different types of work.
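To put rough, purely hypothetical numbers on that, here’s an Amdahl’s-law-style estimate of the blended gain if AI makes the 30% greenfield share 3x faster but the 70% legacy share only 1.2x faster:

```python
# Hypothetical numbers for illustration only.
greenfield_share, legacy_share = 0.30, 0.70
greenfield_speedup, legacy_speedup = 3.0, 1.2

new_time = greenfield_share / greenfield_speedup + legacy_share / legacy_speedup
print(f"Overall speedup: {1 / new_time:.2f}x")  # ≈ 1.46x, nowhere near 10x
```

Even generous per-task gains translate into modest overall numbers when most of the work is comprehension-heavy.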
Invest in documenting, refactoring, and improving legacy code. Consider documentation not just as a “nice to have” but as infrastructure that multiplies the effectiveness of both human engineers and AI tools by providing much-needed context and coherence.
Experiment and measure carefully. This means piloting AI tools in greenfield side projects first, then evaluating their applicability to legacy systems. Measure not just “how fast can we ship features” but also “how often does AI-generated code cause bugs, tech debt, or unsustainable review burden for engineers.” Track not just velocity, but quality, maintainability, and engineer experience. Share learnings across teams: what works, what doesn’t, and in what contexts.
Finally, value the work of understanding, maintaining, and safely changing systems, not just building new ones. The goal is to maximize engineering effectiveness, not AI usage.
For Individual Contributors
Use AI where it genuinely helps: scaffolding new features, generating boilerplate, exploring API options, and writing tests for new code TDD-style. Don’t force it where it struggles: deep debugging, refactoring critical paths, and making architectural decisions in complex legacy contexts.
When AI suggestions feel wrong, trust your judgment and experience. You have the unique ability to reason across systems, time, and context, a skill humans learn painfully and AI models haven’t yet learned sufficiently. Document the “why” behind your decisions, as such documentation helps future humans (and future AI) understand your code.
Share learnings with your team: “Here’s where AI saved me three hours,” and “Here’s where AI sent me down a rabbit hole.” This kind of knowledge sharing helps everyone calibrate their expectations and use the tools more effectively.
Rethinking Productivity
People often talk about speed as if it’s synonymous with progress. But in mature systems, speed without understanding just moves risk downstream. You get code faster, but you don’t actually get safer or more reliable outcomes. Problems surface later—in QA, production, or future maintenance—and the cost shifts from engineering time now to firefighting later.
Writing code has never been the primary bottleneck in software engineering. Reading code and truly grasping its intent and boundaries has always been the hard part. What we really need is a reliable way to use AI to close the gap between making a change to a system and feeling sure it won’t break anything.
The Bigger Picture
The research community is actively developing models with stronger reasoning capabilities, longer context windows, and deeper causal understanding. Tool builders are experimenting with agents, multi-step workflows, and better integration with development environments. The future is bright.
In the meantime, the most successful teams will be those that understand the current limitations, set realistic expectations, and deploy AI strategically rather than universally. The promise of AI-assisted development is real. Realizing that promise requires matching the tool to the work, rather than twisting our understanding of the work to fit the tool.