End-to-end (E2E) tests naturally trend toward uselessness.
Early in my career, I worked at a company where we considered a 50% pass rate on our E2E test suite a green light for our biannual release. In fact, I was hired to help find a way to reduce reliance on these horrid E2E tests using AI. But that’s a story for another day.
We had thousands of E2E tests and almost no unit tests. Releases happened twice a year and were a company-wide ordeal. People worked late triaging failures, trying to distinguish real bugs from flaky tests, and re-running test suites in hopes of a better outcome. When we saw test failures, we asked, “Is this failure real?” and those test failures triggered re-runs rather than investigations. When half the tests passed, there was great rejoicing and sighs of relief. It was time to ship.
The crux of this story is that our E2E tests were untrustworthy. Untrustworthy tests are worse than no tests at all because they consume time without providing confidence. This article is about how to avoid ending up with untrustworthy E2E tests. It’s about governance: who owns end-to-end tests, when to add them, when to remove them, and how to prevent the slow accumulation that turns a useful verification tool into a release-blocking pain in the shin. However, governance only makes sense when you understand why E2E tests are fundamentally different from other tests; different enough that the instinct to “just treat them like other tests” will seal your doom.
Let’s Talk about Trust
Trust in a test suite is a function of two things: signal clarity (when a test fails, how confident are you that something is actually broken?) and response cost (how much effort does it take to investigate a failure?). People trust the test suite when signal clarity is high and response cost is low. This means that failures are investigated, bugs are caught, and tests do their job.
When signal clarity drops (because of flakiness) or response cost rises (because failures are hard to diagnose), most rational people stop trusting the test suite. They rerun tests instead of investigating. They resort to other, often less effective, means of assessing quality because the automated tests no longer provide confidence.
E2E tests have structural properties that naturally push signal clarity and response cost in the wrong direction. When you understand these properties, you’ll also understand why E2E test suites trend toward uselessness without active intervention and relentless management.
Why It’s Hard to Trust E2E Tests
Flakiness destroys signal clarity.
Most end-to-end tests are unavoidably flaky because they cut across many system layers, within which any number of components can act in non-deterministic ways. Given the inherent flakiness of E2E tests, let’s do the math. If each test has a 2% independent flake rate, the probability that your test suite runs cleanly drops exponentially with the size of the suite. At 200 tests, 98% of runs will have at least one random failure. By 500 tests, clean runs are essentially extinct, and you find yourself celebrating 50% pass rates. This suggests it is probably in your best interest to keep your E2E test suite as small as possible.
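To make the compounding concrete, here is the arithmetic as a small sketch. The 2% flake rate is the illustrative figure from above, not a measured value:

```typescript
// Probability that an entire suite run is clean, assuming each test
// flakes independently at the same rate. Numbers are illustrative.
function cleanRunProbability(flakeRate: number, testCount: number): number {
  return Math.pow(1 - flakeRate, testCount);
}

// With the 2% per-test flake rate from above:
cleanRunProbability(0.02, 50);  // ≈ 0.36 — most runs already contain a flake
cleanRunProbability(0.02, 200); // ≈ 0.018 — ~98% of runs have a spurious failure
cleanRunProbability(0.02, 500); // ≈ 0.00004 — a clean run is essentially impossible
```

The point is not the exact numbers but the shape: each added test multiplies the flake opportunities, so trust decays exponentially with suite size, not linearly.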
Slow feedback increases response cost.
When your E2E suite takes 30 minutes to run, investigating a failure means waiting 30 minutes to verify every potential fix. Most good engineers are rational. When response cost is high, they do less investigation. After all, a quick rerun is cheaper than a real investigation. If the rerun passes, they’ll ship it and move on. If it fails again, maybe rerun it. The toil required to investigate actively discourages treating failures as meaningful.
Cross-cutting failures obscure causation.
When a properly written unit test fails, you’ll usually know exactly why it failed because of the small scope of functionality it operates on. When an E2E test fails, you may have to open an investigation that might span five teams’ code. Unclear causation increases response costs, which reduces investigation, which in turn reduces trust.
The Shapes of Testing Strategy
The testing community has developed visual metaphors for distributing testing effort across different levels. Each metaphor encodes assumptions about cost, confidence, and where you should build trust.
The Test Pyramid
The Test Pyramid strategy consists of a wide base of unit tests, narrowing through integration tests, with a small layer of E2E tests. The pyramid shape reflects cost economics: unit tests are fast, cheap, and stable; E2E tests are slow, expensive, and flaky. This means it is in your best interest to invest most of your effort at the bottom of the pyramid, where costs are low and trust is easy to maintain.
Of course, the pyramid’s implicit assumption is that your code is testable at the unit level. If so, you can verify most behavior cheaply and reserve E2E for tasks only E2E can handle.
The Test Trophy
The Test Trophy strategy emerged from the JavaScript testing community. It recommends a small base of static analysis, a modest unit-test layer, a large bulge of integration tests, and a small E2E layer. The argument is that unit tests can pass while the system is broken because they often test implementation details rather than behavior. Integration tests verify that components work together, providing greater confidence per test.
The trophy’s implicit assumption is that you have good tooling for integration testing. Tools like React Testing Library are incredibly helpful here since they let you test components in realistic conditions without the cost of full E2E testing.
One thing you may have noticed is that although the test pyramid and test trophy disagree about where the bulk of testing effort should go (unit vs. integration), they agree that E2E tests are expensive and should be used sparingly.
The Ice Cream Cone
The Ice Cream Cone is what you end up with when you don’t heed the time-tested advice to keep your E2E testing layer small: an inverted pyramid in which most tests are E2E or manual, with few integration tests and almost no unit tests.
The cone isn’t usually a deliberate choice. It emerges when code isn’t testable at lower levels, causing all verification to be pushed to the outer edge of the system under test. The ice cream cone is a symptom of a lack of testability and leads directly to E2E test uselessness.
These metaphors matter because they determine where trust lives. In a pyramid or trophy, most trust comes from fast, stable, low-level (whitebox) unit and integration tests. E2E tests provide a final sanity check but aren’t load-bearing. Conversely, in an ice cream cone, all trust depends on E2E tests. As we’ve seen, E2E tests are too slow, too flaky, and too expensive to bear the weight of all that trust.
One E2E Test Per Feature: The Road to Hell
On the surface, it is hard to argue against the suggestion that developers should treat E2E tests like all other tests. After all, every feature should have tests. Unit tests verify functions. Integration tests verify components. E2E tests verify features. Each layer maps to a level of abstraction, and coverage should be comprehensive at every level. This thinking works for unit and integration tests. It fails catastrophically for E2E tests.
The difference has a lot to do with how costs scale. When you add a unit test, you add milliseconds of execution time and a small localizable maintenance burden. The hundredth unit test costs about the same as the tenth. When written properly, you can keep adding unit tests indefinitely without significantly degrading the trustworthiness of the test suite.
E2E tests don’t scale in the same way as unit tests or even integration tests. Each test adds tens of seconds or minutes of execution time. Each test inherently introduces another opportunity for flakiness and adds a cross-cutting maintenance burden that touches multiple layers of code and multiple teams. The hundredth E2E test costs much more than the tenth because it pushes the test suite closer to the flakiness threshold where trust collapses. Remember the math, and remember that E2E tests are inherently flaky because they cut across so many layers. If your tests have a 2% individual flake rate, a 200-test suite will fail in 98% of runs. At that point, “Is this failure real?” becomes the default response to any failure.
The aforementioned test pyramid and test trophy are, among other things, warnings about test accumulation at the wrong layer. They strongly recommend a small E2E layer because E2E tests are expensive in ways that compound. Treating E2E test coverage like unit or integration coverage ignores those economics and accelerates the trend toward uselessness.
Treat E2E Testing Like a Budget
If trust erodes as E2E test suites grow, the only way to maintain trust is to limit growth. This is why the right mental model is to treat E2E tests like a budget rather than an infinite backlog that you add to on a whim.
Unit tests can be accumulated because, relative to E2E tests, it is much easier to prevent their costs from compounding in ways that destroy trust. Signal clarity stays high because unit test failures typically have clear causes. Response cost stays low because unit tests run fast.
E2E tests can’t be accumulated in the same way. Each additional E2E test significantly increases execution time, compounds flakiness probability (again, remember the math), and adds cross-cutting maintenance burden. Unlike unit tests that typically have linear marginal costs, the marginal cost of an E2E test is superlinear.
Managing E2E tests within a budget means accepting a fixed capacity, constrained by execution time, tolerance for flakiness, and maintenance bandwidth, and making deliberate allocation decisions within that capacity. Adding a test means either expanding capacity or replacing a less valuable test. Fixed capacity forces you to ask questions like: Is this test worth the trust cost? Is it more important than any other test we have to remove to accommodate this new one? Can this be verified at a level that doesn’t erode E2E trust?
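As a sketch of what “fixed capacity” can mean in practice, here is a minimal admission check. The types, field names, and thresholds are illustrative assumptions, not a prescription for any particular pipeline:

```typescript
// Minimal sketch of treating the E2E suite as a fixed budget.
interface SuiteBudget {
  maxTests: number;          // capped by your tolerance for compounding flakiness
  maxRuntimeSeconds: number; // capped by acceptable CI wall-clock time
}

interface SuiteState {
  testCount: number;
  runtimeSeconds: number;
}

// A new test is admitted only if the suite stays within budget.
// Otherwise, a less valuable test must be removed first, or the
// budget must be deliberately renegotiated.
function canAddTest(
  suite: SuiteState,
  budget: SuiteBudget,
  newTestSeconds: number
): boolean {
  return (
    suite.testCount + 1 <= budget.maxTests &&
    suite.runtimeSeconds + newTestSeconds <= budget.maxRuntimeSeconds
  );
}
```

Even a check this crude changes the conversation: adding a test is no longer free, so the allocation questions above get asked by default.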
Push Testing Downward
When you identify a gap in test coverage, the instinct is often to add an E2E test. And yes, E2E tests are easy to reason about because they mirror what users do, they don’t require understanding internal architecture, and they provide clear confidence that “the thing works.”
Resist this instinct as if your life depends on it. Instead, ask: what’s the lowest level at which I can close this coverage gap? If a bug slipped through because of incorrect business logic, that’s a unit test gap. If components aren’t communicating correctly, that’s an integration test gap. E2E tests should only cover gaps that genuinely can’t be caught at lower levels, e.g., interactions with real external services, full-stack user journeys where the integration points themselves are tricky, or behavior that only emerges when the entire system runs together.
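For instance, a pricing bug that escaped to production is usually a unit test gap, not an E2E gap. A sketch, where applyDiscount and the rule it encodes are hypothetical:

```typescript
// Hypothetical business logic that once shipped a bug: the fix belongs
// at the unit level, not behind a browser.
function applyDiscount(total: number, percent: number): number {
  if (percent < 0 || percent > 100) throw new RangeError("invalid percent");
  return total * (1 - percent / 100);
}

// This verification needs no browser, no network, and no running system.
// It executes in microseconds and fails with an exact location.
console.assert(applyDiscount(200, 10) === 180);
```

The E2E equivalent would spin up the whole stack to check the same multiplication, paying minutes of runtime and a flake opportunity for no additional signal.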
Pushing testing downward has compounding benefits:
- Faster feedback: A unit test runs in milliseconds; an E2E test runs in seconds or minutes. When you push a test down, you make the feedback loop tighter for that specific verification.
- Clearer causation: Lower-level tests have smaller blast radii. When they fail, you know where to look. E2E failures require investigation across the stack.
- Better trust economics: Every test you push down is a test that doesn’t consume E2E budget. The E2E suite stays small, fast, and trustworthy.
- Testability pressure: If you can’t push a test down because the code isn’t testable at lower levels, that’s valuable information. The inability to write a unit test often indicates a design problem. The inability to write an integration or contract test indicates a tooling and/or infrastructure problem. Pushing testing downward creates pressure to improve testability, which pays dividends beyond the immediate test.
When a production incident occurs, the reflexive response is often “add an E2E test so this doesn’t happen again.” Before you do, ask: why didn’t a lower-level test catch this? Sometimes the answer is “because only E2E could have caught it” because a third-party integration failed, or when a failure resulted from the interaction of correctly functioning components. In those cases, an E2E test is appropriate. But often, the answer is “because we didn’t have adequate unit or integration coverage.” The right response then is to add a lower-level test, not to paper over the gap with an E2E test.
The goal is a test suite where each layer does what it does best. Unit tests catch logic errors. Integration and contract tests catch interface mismatches. E2E tests catch system-level failures that only manifest in production-like conditions. When you push testing downward, not only will you save E2E budget, you’ll also be more likely to put each test where it provides the clearest signal at the lowest cost, thus maintaining much-needed trust.
What to Spend Your E2E Budget On
Critical user capabilities. Spend here on the flows whose breakage should page the on-call engineer. These are often worth the trust cost because catching failures matters enough to justify the expense. Some important examples include revenue-generating journeys, authentication, and the happy path through core features.
Integrations you can’t test otherwise. There are certain integrations that only E2E tests can verify in production-like conditions. Some examples include third-party payment processors, identity providers, and external APIs. In these cases, the trust cost is justified because there’s no cheaper alternative.
Things you’d otherwise verify manually. If QA is doing manual regression on a flow before every release, that’s a candidate for E2E automation. In this case, you’re converting human effort to machine effort and replacing existing manual cost, not adding more coverage that supposedly didn’t exist before.
What are some things not worth spending your E2E testing budget on? Edge cases, implementation details, flows that are already well-covered by integration tests, and anything where failure wouldn’t actually warrant immediate investigation.
Who Should Own E2E Tests?
Some organizations put E2E tests under the ownership of a dedicated Quality Engineering (QE) team. Others distribute test ownership across feature teams while a platform team owns infrastructure and standards. Ultimately, the right answer depends on your team structure.
I’ll focus on a variant that fits engineering organizations with separate frontend and backend teams and a web platform team that maintains test infrastructure. I’m focusing on this variant because that’s the structure that’s most relevant to my day-to-day work, and it presents a specific challenge: E2E tests exercise the full stack, but teams are split by layer.
Frontend Ownership with Backend Accountability
The core tension in a frontend/backend split is that E2E tests verify user-facing behavior that spans both layers. A checkout test touches frontend code (UI components, client-side validation, state management) and backend code (APIs, business logic, database). When the test breaks, the cause could be in either layer. Someone has to own the test, but both teams affect whether it passes.
In my opinion, the ownership model that preserves trust is: frontend teams own E2E tests, backend teams are accountable for backend-caused failures, and web platform enforces the budget and adjudicates disputes.
Why Frontend Should Own E2E Tests
E2E tests verify user-facing behavior. Users interact with the frontend. When you write an E2E test, you’re scripting what a user does: click this button, fill this form, see this result. That’s frontend’s domain.
Frontend teams also feel the pain of E2E failures most directly. A CSS change that moves a button or a state management refactor could break E2E tests. Frontend lives with the consequences of test brittleness in a way backend doesn’t, which creates incentive to write maintainable tests.
Backend teams, by contrast, can hide behind API contracts. If the API returns the right response, backend’s job is done. The fact that the E2E test failed because of how that response gets rendered is “a frontend problem.” This isn’t malicious; it’s just how layer-based ownership works. But it means backend teams don’t feel E2E test pain viscerally, and that makes them a poor choice for ownership.
What Each Team Should Own
Frontend teams should own:
- Writing E2E tests for critical journeys that touch their features.
- Maintaining tests when UI changes break them.
- Initial triage of all failures to determine whether the cause is frontend or backend.
- Test data setup that involves UI state.
Backend teams should own:
- Fixing backend bugs that cause E2E failures once frontend identifies the cause.
- Providing test data endpoints or seeding mechanisms.
- Ensuring API stability so that non-breaking backend changes don’t break E2E tests.
- Communicating breaking API changes to frontend before they land.
Web platform should own:
- Test framework, infrastructure, and CI integration.
- Suite-level health metrics: execution time, flakiness rate, and pass rate.
- Budget allocation and enforcement.
- Standards for test quality, including review criteria.
- Quarantine automation and deletion policies.
- Escalation resolution when frontend and backend disagree about failure ownership.
Backend’s Stability Obligation
For this model to work, backend teams have specific obligations beyond fixing backend bugs.
Test data endpoints. E2E tests need to set up state. Backend should provide endpoints (behind a test-only flag or in non-production environments) that let tests create users, seed orders, set up inventory, and so on. Without these, frontend will write fragile tests that depend on shared test data or complex UI flows just to reach the right preconditions.
API stability guarantees. Backend should distinguish between breaking and non-breaking changes. Non-breaking changes such as new optional API fields, new endpoints, and performance improvements should never break E2E tests. If they do, then the test is too brittle and frontend should fix it. Breaking changes require coordination: backend notifies frontend before merging, and together they update the affected tests.
If backend ships a change that breaks E2E tests without warning, backend owns the fix or rollback. The E2E failure is treated as a backend bug, not a test maintenance burden for frontend.
Contract documentation. Frontend needs to know what API behavior is contractual (guaranteed, safe to test against) versus incidental (might change without notice). Backend should document this, formally or informally. A test that asserts on incidental behavior will break eventually; a test that asserts on contractual behavior should be stable.
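The distinction looks like this in a hypothetical order-confirmation response. The field names and values are made up for illustration:

```typescript
// A hypothetical API response. Which fields are contractual vs
// incidental is exactly what backend's documentation should pin down.
const response = {
  status: "confirmed",        // contractual: documented, always present on success
  orderId: "ord_123",         // contractual: guaranteed identifier
  serverRegion: "us-east-1a", // incidental: an infrastructure detail
};

// Stable assertions target only contractual behavior:
console.assert(response.status === "confirmed");
console.assert(response.orderId.length > 0);

// A brittle test would also pin down incidental behavior, e.g.
// asserting serverRegion === "us-east-1a", and break on the next deploy.
```

Without the documentation, frontend has no way to tell which assertions are safe, and every backend change becomes a potential test-maintenance fire drill.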
Platform’s Enforcement Role
Web platform makes the model work by providing infrastructure, visibility, and enforcement.
Fast and reliable infrastructure. If the test framework is slow or flaky independent of test quality, trust erodes regardless of governance. Platform owns making the framework trustworthy before anyone else can be held accountable.
Clear standards with examples. Don’t just say “tests must not be flaky.” Provide patterns: how to wait for elements, how to handle async operations, how to isolate data. Provide anti-patterns: what causes flakiness, and what coupling to avoid. Make it easy for frontend teams to write good tests.
Budget allocation. Platform sets an overall execution time budget (e.g., 1 hour for the full suite) and allocates portions to frontend teams based on the criticality of the journeys they own. For example, a team that owns checkout probably should get more budget than a team that owns admin settings. Teams can negotiate budget transfers with platform approval.
Visibility dashboards. Teams need to see suite execution time over time, flakiness rate by test and by team, failure rate trends, and quarantine status. Public visibility creates accountability. When a team’s tests are consistently flaky, everyone knows.
Automated enforcement. Quarantine policies and deletion rules should be automated. Tests that flake above a threshold (e.g., three flaky failures in a week) get automatically moved to a separate suite that doesn’t block CI. The owning team has a fixed window (e.g., two weeks) to fix or delete. If they do nothing, the test is deleted automatically.
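A policy like this is simple enough to express as a pure function, which is also what makes it automatable. The record shape and the thresholds (three flakes per week, a 14-day fix window) are illustrative assumptions:

```typescript
// Sketch of an automated quarantine/deletion policy.
interface TestRecord {
  name: string;
  flakesThisWeek: number;
  daysInQuarantine: number; // 0 if the test is not quarantined
}

type Action = "keep" | "quarantine" | "delete";

function quarantinePolicy(
  t: TestRecord,
  flakeThreshold = 3,
  fixWindowDays = 14
): Action {
  if (t.daysInQuarantine > fixWindowDays) return "delete"; // owning team did nothing
  if (t.daysInQuarantine > 0) return "quarantine";         // still inside the fix window
  if (t.flakesThisWeek >= flakeThreshold) return "quarantine"; // newly flaky: pull from CI
  return "keep";
}
```

Because the rule is mechanical, no one has to argue about individual tests; the negotiation happens once, over the thresholds.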
Escalation Paths
Frontend says it’s a backend problem, backend disagrees. Platform should adjudicate disagreements like these: review the failure evidence and make a binding call. If these sorts of disagreements repeat with the same teams over the same set of tests, platform should escalate to engineering leadership as an organizational issue.
Backend is slow to fix failures. Frontend can’t be blocked indefinitely by failures in the E2E suite. If backend misses the investigation SLO twice for the same test, frontend can quarantine the test by removing it from CI, or delete it with platform approval and document it as “removed due to backend instability.” Both options create visibility in the suite’s health dashboard.
Persistent flakiness that no one can fix. Platform needs to make the call here. If frontend and backend have both attempted fixes and failed, platform can unilaterally delete the test. Trust matters more than coverage.
Failure Modes to Watch For
Frontend becomes the dumping ground. If backend teams don’t take their accountability seriously, frontend teams will end up doing all the investigation work, filing tickets that languish, and eventually give up on E2E tests. Watch for rising frontend frustration, increasing test deletions, and declining coverage of backend-heavy journeys.
Platform becomes a bottleneck. If every test addition leads to lengthy platform review, it causes teams to route around the process either by not writing E2E tests or by writing them outside the official suite. Watch for declining test submissions, shadow test suites, and complaints about review latency.
Budget becomes political. If budget allocation is perceived as unfair, teams will game the system or disengage. Watch for teams hoarding budget they don’t use, disputes over journey criticality, and tests being reclassified to fit under budget.
Lifecycle Rules for Trust
Regardless of which ownership and governance model you choose, E2E tests require unrelenting management to prevent the slow accumulation that causes suites to bloat and eventually become useless.
When to Add an E2E Test
Add an E2E test when:
- A new user journey is introduced that meets your criticality bar.
- You add an integration with an external system that can’t be verified at a lower level.
- A production incident reveals a failure mode that only E2E testing could have caught.
- You’re replacing a manual QA check with automation.
Before adding the E2E test, ask: What’s the predicted flakiness? Could this be caught at a level that doesn’t erode E2E trust? Will failures from this test get investigated or dismissed?
If you can’t answer the last question, you don’t have enough visibility into how your team responds to failures. Understanding whether failures from the new test will get investigated or dismissed is a prerequisite for adding the test.
When to Update an E2E Test
Update E2E tests when:
- The user-facing flow changes, e.g., a new step in checkout, a different authentication mechanism, or a redesigned onboarding sequence.
- You can fix the test’s flakiness without heroic effort.
- Test data or environment assumptions shift.
Be wary of updates triggered by implementation changes that don’t affect user behavior. If a refactor breaks an E2E test even when users wouldn’t notice any difference, the test is probably coupled too tightly to implementation details.
When to Remove an E2E Test
Remove an E2E test when:
- The capability or user journey it verifies no longer exists.
- The flow has been restructured enough that the test no longer matches real user behavior.
- Lower-level tests have been added that provide equivalent confidence.
- The test is persistently flaky and the cost of fixing it exceeds the value of keeping it.
- You’re over budget and this test protects something less critical than others.
- The test’s failures are routinely dismissed without investigation.
Removal is trust maintenance. A smaller, trustworthy test suite beats a larger, ignored one. Deleting a flaky test improves overall trust even if it reduces coverage.
Quarterly reviews
Schedule explicit review sessions quarterly to evaluate suite health. In these review sessions, look at execution time trends, flakiness rates, failure frequency by test, and coverage relative to current product priorities. Ask: Are these tests still covering critical paths? Has anything become redundant? Are there new journeys that lack coverage?
These reviews force you to consider removing troublesome tests. It’s psychologically easier to delete a test as part of a scheduled review than to do it ad hoc.
It All Comes Down to Testability
Pause and take a breather when you find yourself reaching for an E2E test because “there’s no other way to test this” or because “every feature or code change must have a test.” This usually indicates a testability problem or a misunderstanding of the role E2E tests play in a healthy test strategy. If you accumulate E2E tests in an undisciplined way and without clear rules, you are choosing to verify behavior at the most expensive layer, where every test degrades trust.
The right response is often not to write an E2E test but to make the code testable at a lower level. Yes, figuring out whether or how to write a test at a lower level is harder than just writing yet another E2E test. It may require changing production code before adding test code. It may require architectural changes that span multiple layers and teams. But you cannot govern your way out of an untestable codebase and the consequences thereof.
E2E test proliferation is a signal. First, it may be a signal that there is some misunderstanding about the role of E2E tests and how to manage them. Second, if your suite keeps growing despite good governance, the tests are telling you something about your architecture. They are telling you that verification is being pushed to the edges because there’s nowhere else to do it. A healthy codebase needs very few E2E tests because most behavior can be verified cheaply and reliably closer to the code. If your E2E suite only grows, and never shrinks, it is trying to tell you something important. Listen.