I recently hunted down a database deadlock. It was the kind that only showed up in production, about once every day on average, and caused us to breach some of our most critical Service Level Objectives (SLOs).
I fed the problematic code into my favorite AI coding tool, and after a lengthy back-and-forth with it and a few database tweaks, I identified the issue. A web request thread and a background job were contending to update two specific records. This created a classic AB-BA deadlock. My trusty AI assistant suggested adding a pessimistic lock to one of the transactions, as shown below.
record_1.lock!
# database write operations
# other database write operations
I almost accepted the suggested code change. The syntax looked right. The explanation made sense. But something nagged at me.
“What happens to the lock after the database writes are complete?” I asked. The AI cheerfully assured me that the lock would “automatically release.”

I thought that was pretty sus.
“But what if an exception is thrown mid-transaction?” I probed further.
Pause. Flibbertigibbeting. “You’re right. We need to use with_lock to wrap this in a transaction with proper error handling to ensure the lock is released.”
Here’s the thing: I didn’t need to know the exact Rails syntax for with_lock. The AI handled that. What I needed to know was what a lock lifecycle looks like, especially that locks must be acquired and released, that exceptions can interrupt that flow, and that database transactions provide guarantees about cleanup. I also needed to know why we were using a lock in the first place. Not just “to prevent concurrent updates”, but why preventing concurrent updates mattered for this specific data, why a lock was better than an optimistic retry strategy, and whether the performance cost of serializing these updates was acceptable given our traffic patterns.
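That lifecycle is easy to see in miniature. Here is a small Python sketch of the same idea, with `threading.Lock` standing in for the database row lock and every name invented for illustration: a bare acquire/release pair leaks the lock on an exception, while a block-scoped construct (the shape behind Rails' `with_lock`) guarantees cleanup.

```python
import threading

lock = threading.Lock()  # stand-in for a database row lock
balance = 0

def update_naive(value):
    # The risky shape: if anything between acquire() and release() raises,
    # release() never runs and every later caller blocks forever.
    global balance
    lock.acquire()
    balance = value
    lock.release()

def update_safe(value):
    # Block-scoped locking, the same shape as Rails' with_lock: the lock is
    # released on exit from the block, even when an exception interrupts it.
    global balance
    with lock:
        balance = value
        if value < 0:
            raise ValueError("negative balance")  # simulated mid-transaction failure

try:
    update_safe(-1)
except ValueError:
    pass

assert not lock.locked()  # the exception did not leak the lock
update_safe(42)
print(balance)  # → 42
```

The `with` block is doing exactly what the AI's corrected answer does: tying the lock's release to block exit rather than to a line of code that may never run.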
I knew what to do. I knew why it mattered. The AI knew how to do it.
This is a story about a fundamental shift in what software engineering skills mean. I am writing this as a guide for myself and for anyone else trying to figure out what software development skills to prioritize in this new and exciting age of AI.
Knowing What and Why
What does it mean to know “what” and “why”? Let’s break this down into three concrete categories: recognizing problem shapes, understanding trade-offs, and predicting failure modes.
Recognizing Problem Shapes (“What”)
Consider these scenarios:
Scenario 1: Your checkout endpoint occasionally processes the same order twice.
Scenario 2: Your checkout endpoint sometimes returns a 500 error with no useful message.
Scenario 3: Your checkout endpoint slows down throughout the day, then speeds up overnight.
An AI coding tool, given any of these symptoms, might suggest adding a database lock. And for Scenario 1, that might even work. But these are three fundamentally different problems:
Scenario 1 is a race condition. You need idempotency, probably via a unique constraint on order ID or a distributed lock.
Scenario 2 is an exception handling issue. You need better error propagation and logging, not concurrency control.
Scenario 3 is a resource leak. Something isn’t getting cleaned up (database connections, memory, file handles), and it accumulates until a nightly restart clears it.
The skill here is pattern recognition. You need to know the taxonomy of common problems so you can correctly name what you’re looking at. Once you can name it, you have the information you need to search for established solutions. If you can’t name it, you’re just guessing. I’ve previously alluded to the importance of identifying things by name in a much older post.
But naming the pattern is only half the battle. You also need to understand why each pattern occurs:
- Race conditions happen because modern systems are concurrent, and operations aren’t atomic by default.
- Exception handling issues arise because failure is the default state in distributed systems.
- Resource leaks occur because computers have finite resources, and manual cleanup is error-prone.
Understanding the “why” lets you predict where else you may encounter the same problem. If you know race conditions arise from non-atomic concurrent operations, you’ll start seeing potential race conditions everywhere multiple threads or processes touch shared state.
Understanding Trade-offs (“What” and “Why”)
Let’s say you’ve correctly identified a race condition (the “what”) in your checkout flow. The next step is to figure out “what” a potential solution could be and “why” it may or may not be suitable in your context.
You could use:
Pessimistic locking: Acquire a row lock on the database before manipulating the data. This prevents concurrent modifications but blocks other transactions from proceeding, potentially affecting throughput.
Optimistic locking: Read the data, track its version, and only commit if the version hasn’t changed. This allows high concurrency but may require transactions to retry. If your checkout flow is complex and expensive to recompute, those retries could waste significant resources.
Unique constraints: Let the database enforce idempotency via a unique index. This works beautifully for simple cases, but requires careful schema design and doesn’t help with multi-step operations that need to be atomic.
Distributed locks: Use Redis or a similar system to coordinate across multiple application servers. This works across process boundaries but introduces new failure modes. What happens when the lock service is unreachable? You’ve just made your checkout flow dependent on an additional piece of infrastructure.
AI can implement any of these (the “how”). But YOU need to provide information about your specific context and the trade-offs that apply to it, including:
- How much concurrency do you expect?
- How expensive is a retry?
- How complex is your transaction?
- What’s your tolerance for additional infrastructure dependencies?
- What failure modes can your product tolerate?
But more importantly, you need to understand why these trade-offs exist in the first place:
- Pessimistic locking trades throughput for correctness because preventing concurrent access is fundamentally exclusive.
- Optimistic locking trades retry cost for concurrency because detecting conflicts is cheaper than preventing them.
- Unique constraints trade flexibility for simplicity because the database can enforce invariants more reliably than application code.
- Distributed locks trade availability for coordination because consensus across unreliable networks requires sacrifice.
When you understand why the trade-offs exist, you can reason about which one applies to your situation. You’re not just memorizing “use optimistic locking for high-concurrency scenarios.” You’re thinking, “We have high concurrency and cheap retries, so the cost of detection is lower than the cost of prevention.”
Every architectural decision is a choice between competing downsides. You can’t avoid trade-offs. The skill is knowing which trade-offs you can afford to make and why those trade-offs are fundamental rather than incidental.
Predicting Failure Modes (“Why”)
Here’s a deceptively simple endpoint:
@app.post("/users/{user_id}/avatar")
def update_avatar(user_id: int, file: UploadFile):
    user = db.query(User).filter(User.id == user_id).with_for_update().first()
    # Process and upload the image (takes ~5 seconds)
    avatar_url = upload_to_s3(file)
    user.avatar_url = avatar_url
    db.commit()
Let’s say AI helped you write this. It even included the with_for_update() lock, like a good defensive programmer. Time to yeet it to prod, right? Not so fast!
At 1,000 concurrent users: Each request holds a database lock for ~5 seconds while uploading to S3. Your database can handle maybe 100 concurrent connections. After the 100th concurrent avatar upload, new requests start queuing or failing.
When S3 is slow: Upload time increases from 5 seconds to 30 seconds. Lock hold time increases proportionally. Your database connection pool gets exhausted. Every other query in your application starts failing.
When a user uploads a malicious 5GB file: Your server tries to hold that file in memory while processing it. Your application server runs out of memory and crashes, taking down other requests with it.
When S3 returns a 500 error: The transaction rolls back. The lock is released. But you’ve already spent 5 seconds of lock time for nothing. A retry will spend another 5 seconds.
The skill here is mental simulation, i.e., the ability to run the system forward in your mind to see where it breaks. But seeing where things could break via mental simulation requires you first to understand why systems fail:
- Concurrency failures occur because locks are exclusive, and holding them during I/O serializes independent operations.
- The S3 slowness cascades because database connections are a finite resource, and exhaustion affects unrelated operations.
- The memory issue occurs because holding large objects in memory while doing I/O blocks that memory from other uses.
- The retry waste happens because we’re holding expensive resources during operations that might fail.
When you understand the “why,” you can generalize the failure mode. You can go from knowing that you should not hold locks during S3 uploads (specific) to understanding that you shouldn’t hold exclusive resources during unreliable I/O operations (general). Now you can apply this principle to any similar situation: don’t hold database transactions during HTTP requests to third-party services, don’t hold mutex locks during file operations, don’t hold connection pool entries during expensive computations.
Experienced engineers do this instinctively. They look at code and think: “What happens at 10x load? What happens when this external service is down? What happens when an attacker tries to exploit this?” In addition to pattern matching, they reason from first principles about why resources are constrained, why external services are unreliable, and why attackers can exploit specific patterns. You develop this skill by watching things break, and more importantly, by understanding why they broke and why the fix works.
Why This Matters Now More Than Ever
AI has made it easier to write code, but harder to know whether the code is correct.
In the olden days of yore, you’d write every line of a checkout endpoint by hand. You’d forget to handle the race condition. A customer would call support, saying they got charged twice. You’d spend a day debugging, eventually learn about database transactions, and hopefully never make that mistake again. The friction of writing code by hand created learning opportunities. Every keystroke and bug increased your likelihood of learning something new and gaining a better understanding of what you were doing.
Now, AI writes the checkout endpoint for you. The code looks professional, handles errors, and might even pass code review. But when that double-charge bug surfaces in production, you’re not as equipped to fix it because you never built a sufficient mental model of the code to realize what transactional integrity means or why it matters.
Like it or not, we are moving toward a world where AI will generate most new code. It is now more critical for Software Engineers to have good judgment in order to assess and fix AI output properly. Not only do you need to know how to recognize problem shapes, you also need to deeply understand why specific solutions work, why they fail, and why one approach is better than another in a specific context.
The good news is that with AI tools generating code for you, you can now spend your time learning these things (systems thinking, causal reasoning, and first-principles understanding) rather than memorizing boilerplate that changes with every new framework version.
How to Build “What and Why” Judgment
Most software engineering advice tells you to “build projects” or “read documentation.” That’s not enough anymore. AI can build projects and read documentation. What you need is deliberate practice aimed at developing specific skills AI can’t replicate: recognizing patterns and understanding the underlying reasons those patterns exist.
Here are four overlapping practices that will help you build engineering judgment:
Study Systems and First Principles
When you read documentation, you primarily learn how to call a function. When you study systems, you learn what that function does, how data moves through the layers, where performance bottlenecks hide, how side-effects propagate, and why changing one small piece can ripple through the rest of the system.
Once you understand the system, study its first principles. Ask:
- Why does this abstraction exist at all?
- Why is it designed this way? What constraints shaped its design?
- What trade-offs were made between simplicity, flexibility, and performance?
You build judgment by seeing not just how something works, but also why it works that way, and why other ways wouldn’t work.
Learn Through Failure and Root Cause Analysis
The fastest way to learn what breaks is to watch things break. But the quickest way to learn why things break is to perform thorough root cause analysis.
Read post-mortems obsessively and focus on the “why”. Companies often publish detailed incident reports explaining exactly how their systems failed. These reports are treasure troves of “what” and “why” knowledge.
Learn from your own bugs even more aggressively. When AI fixes something for you, don’t just accept the fix and move on. Spend some time asking:
- What broke, and what was the observable symptom?
- What category of problem is this? Race condition? N + 1 query? Memory leak?
- Why did this break? What property of the system allowed this failure?
- Why does the fix work? What property does it leverage, or what constraint does it enforce?
- What would have prevented it from breaking in the first place?
- Why wasn’t that prevention already in place? Cost? Complexity? Lack of foresight?
- What else in my codebase could break the same way?
- Why is this failure mode common? Is there a deeper principle at work?
Practice Diagnosis and Reasoning, Not Just Feature-Building
Most software developers today spend more time building new features than debugging code. As AI continues to get better at writing code and building features, you are increasingly likely to spend more time debugging code than writing it. But don’t just debug. Practice reasoning about why the bug exists and why the fix works.
Deliberate debugging practice
One way to deliberately practice debugging is to ask AI to write buggy code for you. Seriously. Ask AI to “Write a multi-threaded cache in Python with a race condition.” Then find the bug without asking the AI where it is. Do this across different categories:
- Generate code with an N + 1 query. Understand why N + 1 queries are slow (because network round-trips are expensive and database connection setup has overhead).
- Generate code with a memory leak. Understand why it leaks (because references prevent garbage collection, or manual allocation wasn’t matched with deallocation).
- Generate code with incorrect error handling. Understand why the error wasn’t caught (because exceptions propagate up the call stack until handled, and this code doesn’t handle that exception type).
- Generate code that’s vulnerable to SQL injection. Understand why injection is possible (because string concatenation doesn’t distinguish between code and data, allowing data to be interpreted as code).
Review AI-generated code like a scientist
Every time AI generates a solution for you, ask, “Why does this work?”
- If it uses a lock, why does the lock prevent the race condition? What property of locks guarantees mutual exclusion?
- If it uses caching, why is caching safe here? What invariants ensure cache consistency?
- If it adds an index, why will this index help? What does the query optimizer need in order to use this index?
- If it uses async/await, why is this operation safe to make asynchronous? What dependencies does it have?
Train yourself to see not just the pattern but the reasoning behind it.
Use performance profiling and ask “why” at every level
When something is slow, use a profiler (e.g., py-spy, rbspy, database query analysis tools) to find out what is slow. But don’t stop there. For each bottleneck, ask why:
- CPU bottleneck: Why is this computation expensive? Is it algorithmic complexity? Inefficient implementation? Missing optimization?
- Memory bottleneck: Why is memory usage high? Large objects? Memory leaks? Retention of unnecessary references?
- I/O bottleneck: Why is I/O slow? Network latency? Disk seeks? Lock contention?
- Database bottleneck: Why is this query slow? Missing index? Sequential scan? Lock waits?
For each answer, dig deeper. Why does algorithmic complexity matter? Why are sequential scans slow? Why does lock contention hurt performance?
Engage with Architecture Decisions and Their Justifications
Architecture isn’t something that happens once at the beginning of a project. It’s the accumulation of hundreds of small decisions. Should this be synchronous or async? Should this data be cached? Should this be a separate service? Each decision has a “what” and “why”. Even as a junior engineer, you can start practicing architectural reasoning by adopting these two habits:
- Listen in design reviews and pay attention to the reasoning behind decisions. If your team does design reviews, attend them even when they’re not about your code. Pay attention to the questions senior engineers ask and the reasoning behind those questions.
- For any feature you build, write a short design doc that captures both “what” and “why”. Over time, you’ll automatically start thinking through these questions before you build anything. Here’s a helpful template:
- Context: What problem are you solving? Why does this problem exist?
- Decision: What approach are you taking?
- Reasoning: Why is this the right approach given your constraints?
- Alternatives: What else did you consider?
- Why not: Why did you reject those alternatives? What trade-offs did you prefer?
- Consequences: What are the downsides of your choice?
- Why acceptable: Why are those downsides acceptable in your context?
A Practical Reading List
I have compiled a set of books that teach “what” and “why” knowledge that AI can’t replicate. I am still working my way through most of these. Because the goal is first-principles knowledge, it doesn’t matter that some of these books were written over a decade ago. They are organized by skill area as follows.
Foundation: How Computers Actually Work
Code by Charles Petzold: This book is a gentle introduction to how bits become programs. It helps you understand what’s actually happening when you write x = 5 and why that operation requires the hardware primitives it does.
Computer Systems: A Programmer’s Perspective by Bryant and O’Hallaron: This is a deeper dive into memory, concurrency, and I/O. It is dense but valuable for understanding why certain things are slow and why “thread-safe” is hard to achieve.
The Elements of Computing Systems by Nisan and Schocken: As you work through this book, you’ll build a simulated computer from first principles. You’ll understand why computers work the way they do.
Databases and Data
Designing Data-Intensive Applications by Martin Kleppmann: You probably already guessed that this book will be on the list, and it is arguably the most important one for you to read. The book covers storage engines, replication, partitioning, transactions, and consistency models. Every chapter explains why the applicable trade-offs exist.
Database Internals by Alex Petrov: This book goes deeper into how databases actually work. You’ll learn why different storage engines make different trade-offs and why you can’t optimize for everything simultaneously.
SQL Performance Explained by Markus Winand: This is a short, practical guide to indexing. It teaches you to predict which SQL queries will be slow and why.
Concurrency and Parallelism
The Little Book of Semaphores by Allen Downey: This book teaches the core concepts of locks, semaphores, and coordination through puzzles. It explains why these primitives are necessary and why they have the properties they do. It is available as a free PDF.
Java Concurrency in Practice by Brian Goetz: This book does a good job of explaining why concurrent code is hard. Despite the name, the principles described in this book are language-agnostic.
The Art of Multiprocessor Programming by Herlihy and Shavit: This book provides deeper theoretical foundations. It explains why certain concurrent operations are impossible without coordination, why wait-free algorithms are better than lock-free algorithms, and why both of these are better than blocking.
Distributed Systems
Understanding Distributed Systems by Roberto Vitillo: This is a concise and practical book. It covers networking, coordination, consistency, and failure modes. Each section explains why distributed systems are fundamentally different from single-machine systems.
Designing Data-Intensive Applications (Part II) by Martin Kleppmann: Yes, this book is so good it appears twice on this list. The second part of the book focuses on distributed systems and explains why consensus is hard, why exactly-once delivery is impossible in general, and why different consistency models exist.
Designing Distributed Systems by Brendan Burns: This book describes patterns and anti-patterns for building systems that span multiple machines. It explains why each pattern works and why the anti-patterns fail.
Debugging and Operations
The Art of Debugging by Norman Matloff and Peter Salzman: This book describes systematic approaches to debugging and teaches you to reason about programs. It emphasizes understanding why bugs exist, not just finding them.
Release It! by Michael Nygard: This book catalogs production failure modes and stability patterns. More importantly, it explains why each failure mode occurs and why each stability pattern prevents it.
Thinking in Systems by Donella H. Meadows: This book is not specifically about software, but it teaches you to reason about complex systems and why they behave the way they do.
Site Reliability Engineering (Google): This book describes what it takes to run production systems at scale. It covers principles (SLOs, error budgets, eliminating toil) and practices (monitoring, incident management, on-call). It explains why learning from failure matters, why Site Reliability Engineering (SRE) practices exist, and why they work at scale.
Architecture and Design
A Philosophy of Software Design by John Ousterhout: This is a short book with an opinionated take on software design. It focuses on complexity management and interface design. It explains why certain designs lead to complexity and why others reduce it.
Fundamentals of Software Architecture by Mark Richards and Neal Ford: This is a comprehensive overview of architectural patterns, trade-offs, and decision-making frameworks. It is heavy on the why: why microservices vs. monoliths, why event-driven vs. request-driven, and why each pattern has the trade-offs it does.
Domain-Driven Design by Eric Evans: This book teaches you to think about why you’re modeling things the way you are and how domain understanding affects architecture.
Performance and Scalability
Systems Performance (2nd Edition) by Brendan Gregg: This is a comprehensive guide to performance analysis. It explains why systems are slow at every level: CPU, memory, disk, network, etc.
Security in Modern Systems
Alice and Bob Learn Application Security / Alice and Bob Learn Secure Coding by Tanya Janca: These two books provide an accessible introduction to modern application security. They explain why specific vulnerabilities exist and why certain defenses work.
Systems Thinking and First Principles
Thinking in Systems by Donella H. Meadows: This book appears twice on this list because it applies across multiple skill areas. It teaches you to see feedback loops, emergent behavior, and system dynamics. It will help you understand why complex systems behave in counterintuitive ways.
The Goal by Eliyahu M. Goldratt: This is a novel about manufacturing that teaches the Theory of Constraints. The ideas in this book apply to software systems and help you understand why bottlenecks matter and why local optimization doesn’t improve global throughput.
How to Solve It by George Pólya: This is a classic book on problem-solving. It teaches you to reason from first principles and understand why certain problem-solving strategies work.
Gödel, Escher, Bach by Douglas Hofstadter: This is a deep dive into recursion, self-reference, and emergence. It helps you understand why certain problems are hard and why systems exhibit unexpected behavior.
Learning How to Learn and Reason
Make It Stick by Peter Brown, Henry Roediger, and Mark McDaniel: This book presents evidence-based study techniques that will help you learn things more efficiently.
Thinking, Fast and Slow by Daniel Kahneman: This book will help you understand how your brain makes decisions. You will learn why intuition often fails and why systematic thinking is important.
The Scout Mindset by Julia Galef: This book is about reasoning accurately rather than defensively. It will help you understand why engineers sometimes defend bad decisions and how to reason more objectively.
So, What Now?
If you’re reading this and feeling overwhelmed, that’s actually a good sign. It means you’re recognizing the gap between “can write code” and “understands systems deeply.” These are the two things I want you to remember:
AI hasn’t made you obsolete. Yes, it has made syntax and boilerplate cheap. But it has made other skills significantly more valuable. The skills that matter now are your ability to recognize problem patterns, understand why those patterns exist, evaluate trade-offs based on first principles, and reason about system behavior. These skills don’t expire when a new framework is released or when a new programming language becomes popular. They are based on fundamental constraints that don’t change: networks are unreliable, storage is slower than memory, concurrent access requires coordination, and complexity grows with system size.
You’re not competing with AI. Your job is to know what to build, why it solves the right problem, and whether the implementation is correct. AI’s job is to generate the boilerplate and handle the syntax. You get to focus on the interesting parts: reasoning about trade-offs, understanding failure modes, and making decisions based on constraints and requirements.
Code is easy now. Knowing if it’s right and knowing why it’s right is the hard part. It is the part that makes you valuable in this new age of AI. It is the part that’s worth learning deeply.