
The Compounding Problem: Why Your AI-Generated Codebase Is Quietly Deteriorating


There's a seductive narrative in AI-assisted development right now: as the models keep getting better, so does the code they generate. Claude Opus 4.6 writes cleaner code than 3.5 did. GPT-5.2 outperforms GPT-4 on SWE-bench. Gemini 3.0 Pro is more intelligent and produces better output than its predecessors. The benchmarks go up. The demos get more impressive. The assumption is that the trajectory is monotonically upward.

The assumption is wrong. Or rather, it's correct in isolation and misleading in practice.

Yes, today's models are remarkably good at generating applications from a clean prompt. But most professional developers aren't starting from a clean prompt. They're maintaining and extending codebases that were often generated by models meaningfully less capable than what's available today, and that's before accounting for legacy codebases with decades of cruft. The decisions those older models made, the architectural shortcuts, the misunderstood requirements, the subtly incorrect implementations that passed casual review, don't disappear when you upgrade your model. They compound.

AI-generated code degrades over time in ways that are structurally different from how human-written code degrades, and the standard playbook for managing technical debt doesn't fully account for it.

How errors compound differently

Each time you ask an AI agent to extend or modify code that contains these embedded mistakes, the new code adapts to the existing mistakes rather than correcting them. The model reads your codebase, infers patterns from it, and replicates those patterns in new code. If your data access layer has an inconsistent error handling approach because the original model hallucinated a pattern that doesn't match your ORM's conventions, every new feature touching that layer will inherit and propagate that inconsistency. The model isn't being lazy. It's being consistent with the codebase it was given. That's exactly what you'd want it to do, until the thing it's being consistent with is wrong.
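To make the dynamic concrete, here is a hypothetical sketch of the kind of inherited convention described above. The names (`FakeSession`, `get_account`, `get_invoice`) are illustrative, not from any real ORM: an early model invented a "swallow everything and return None" convention for the data access layer, and later generated code faithfully copies it.

```python
# Hypothetical illustration: the ORM's contract is to RAISE on real
# failures, but an early model established a "return None on any error"
# convention, and each new feature replicates it.

class ConnectionLost(Exception):
    pass

class FakeSession:
    """Stand-in for an ORM session that raises on infrastructure errors."""
    def get(self, model: str, key: int):
        raise ConnectionLost("db unreachable")

def get_account(session, account_id):
    try:
        return session.get("Account", account_id)
    except Exception:   # inherited anti-pattern: swallow everything
        return None     # caller can't tell "missing" from "broken"

def get_invoice(session, invoice_id):
    try:                # new feature faithfully copies the pattern
        return session.get("Invoice", invoice_id)
    except Exception:
        return None
```

A dropped database connection now looks identical to a missing row, and every new accessor written "consistently" widens the blast radius of that one early decision.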

This is where compounding kicks in.

Bad patterns multiply

This is perhaps the most insidious dynamic in AI-assisted codebases: bad patterns don't just persist, they actively reproduce.

When a model generates code, it treats the existing codebase as a style guide. If there's a convoluted way of handling authentication scattered across three modules, the model will faithfully replicate that convolution in the fourth. If there's an unnecessary abstraction layer that the original model introduced because it misunderstood the requirements, every subsequent feature will route through that abstraction, because that's the established pattern. The model is doing what any reasonable developer would do by following the conventions of the codebase it's working in. The problem is that the convention is wrong.

The multiplication effect is what makes this qualitatively different from traditional technical debt. In a human-written codebase, a bad pattern might exist in two or three places before someone on the team notices, raises it in code review, and puts a stop to it. AI agents will gladly propagate it to thirty instances across the codebase without hesitation, each instance making the pattern feel more entrenched and harder to reverse. What started as a single questionable decision by an older model becomes a load-bearing architectural choice that's woven through the entire system.

Left unchecked, this multiplication reaches a tipping point where the codebase becomes increasingly difficult to maintain. Each new feature interacts with more instances of the bad pattern, each refactoring attempt has a larger blast radius, and the cost of correction grows superlinearly with the number of propagations. There's a window in which a bad pattern is cheap to fix. That window closes faster in AI-assisted codebases than it does in human-written ones, because the propagation velocity is so much higher.

When this happens, the initial velocity that made AI-assisted development so appealing begins to collapse. The codebase becomes progressively harder to maintain, each new feature takes longer than the last, and the team hits a plateau where the cost of working around accumulated bad patterns rivals the cost of building new functionality. The speed that felt like a superpower in month one becomes a liability by month six.

The model versioning problem

Many teams have codebases where the foundational modules were generated by an early model, middle layers were built with an intermediate version, and recent features were written with the latest release. Each generation of model has different strengths, different failure modes, and different stylistic tendencies. The code written by GPT-4 in early 2024 has a recognizably different character from code written by Claude Sonnet 4.5 in 2026. They favor different patterns, handle edge cases differently, and make different implicit assumptions about error handling, null safety, and abstraction boundaries.

In a human team, you'd call this "inconsistent coding standards" and address it through code review and style guides. But the inconsistency runs deeper than style. It's inconsistency in reasoning about the problem. The older model may have misunderstood a requirement in a way that's baked into the data model. The newer model, working with that data model as a given, builds correct logic on top of an incorrect foundation. Everything downstream is technically sound and fundamentally wrong.

The review gap

In theory, every line should be reviewed with the same rigor as human-written code. In practice, the volume and speed of AI-generated output creates enormous pressure to review superficially. When a model produces 500 lines of well-structured, syntactically correct code in 30 seconds, the cognitive effort required for thorough review is wildly mismatched with the apparent effort of generation.

The result is a review gap: code gets merged containing issues the developer would have caught if they had written it line by line themselves. Not because the reviewer is negligent, but because the mode of engagement is different. Reading code for correctness is harder than writing it, and AI has dramatically increased the ratio of code-to-read to time-available-to-read-it.

Each piece of insufficiently reviewed code becomes part of the foundation that the next round of AI-generated code builds upon. The compound interest accrues quietly.

Breaking the cycle

None of this means AI-assisted development is a mistake. It means we need specific, practical strategies for managing a class of technical debt that compounds faster, and differently, than what we're used to. The overarching principle is curation: actively tend the codebase, simplifying and correcting as you go, because the cost of deferred maintenance grows faster than it does in traditionally written software. The same capabilities that introduced the problem in the first place can also be deployed to address it, if we approach them intentionally.

Here's how it can work.

Invest in end-to-end tests before you refactor anything. This is the foundation. Before you attempt to fix compounded errors in an AI-generated codebase, you need a safety net that validates behavior at the boundaries, not at the unit level. Unit tests on AI-generated code often test the implementation rather than the intent, because the model wrote both the code and the tests to match each other. End-to-end tests that verify actual user-facing behavior give you the confidence to refactor aggressively without worrying that you're breaking something that only appeared to work. Write these tests yourself, or at minimum, write the test specifications yourself and review the generated tests with extreme care. The test suite is the one artifact where you cannot afford compounded errors.
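As a minimal sketch of testing intent at the boundary rather than the implementation, consider a hypothetical `UserStore` (all names here are invented for illustration). The tests below exercise only the public surface, so a refactor that rewrites the internals entirely still passes as long as the user-facing behavior holds.

```python
# Hypothetical boundary-level tests: assert on observable behavior
# through the public API, never on internal structure.
from dataclasses import dataclass, field

@dataclass
class UserStore:
    _users: dict = field(default_factory=dict)

    def create_user(self, email: str) -> int:
        if email in self._users.values():
            raise ValueError("duplicate email")
        uid = len(self._users) + 1
        self._users[uid] = email
        return uid

    def get_user(self, uid: int) -> str:
        return self._users[uid]

def test_user_roundtrip():
    store = UserStore()
    uid = store.create_user("a@example.com")
    assert store.get_user(uid) == "a@example.com"

def test_duplicate_rejected():
    store = UserStore()
    store.create_user("a@example.com")
    try:
        store.create_user("a@example.com")
        assert False, "expected ValueError"
    except ValueError:
        pass
```

Notice what the tests don't mention: the dict, the ID scheme, any helper functions. That omission is what buys you the freedom to regenerate the internals aggressively.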

Identify bad patterns early and kill them completely. This is the most important habit for long-term maintainability. When you spot a bad pattern, don't just fix the instance you're looking at. Search for every propagation of that pattern across the codebase and remove them all. Then document the anti-pattern explicitly so that it doesn't get reintroduced. Add it to your project's coding guidelines, your AI prompt context, or your rules file. If you're using a tool like Claude Code, put it in CLAUDE.md. If you're using Cursor, put it in your rules. The model will follow whatever conventions you give it, so give it the right ones.
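The "find every propagation" step can be mechanized. Here's a hedged sketch of a sweep script, using a bare `except:` clause as the stand-in anti-pattern; the pattern and function names are assumptions for illustration, and in practice a linter rule or `rg` invocation serves the same purpose.

```python
# Sketch: enumerate every propagation of a known anti-pattern so it
# can be eradicated across the codebase in one pass, not piecemeal.
import re
from pathlib import Path

ANTI_PATTERN = re.compile(r"except\s*:")  # bare except swallows errors

def find_propagations(root: str) -> list[tuple[str, int, str]]:
    """Return (file, line number, line text) for each match under root."""
    hits = []
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), 1):
            if ANTI_PATTERN.search(line):
                hits.append((str(path), lineno, line.strip()))
    return hits
```

Once the sweep comes back empty, encode the rule in your lint config and your agent's rules file so the pattern stays dead.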

Use current models to audit code from older models. Take your codebase, or specific modules you suspect have accumulated issues, and ask a current-generation model to review it with explicit instructions to identify architectural inconsistencies, anti-patterns, and assumptions that don't hold. Don't ask it to rewrite. Ask it to diagnose. The newer model's improved reasoning capabilities can often spot issues that the original model introduced and that subsequent models faithfully preserved. Treat the output as a triage list, not a fix-it list. Human judgment is still required to determine which issues matter and in what order they should be addressed.

Refactor with specifications, not prompts. When you do refactor, don't just ask the model to "clean up this module." Write a clear specification for what the module should do, what its interfaces should look like, and what invariants it should maintain. Then generate fresh code from that specification, validated against your end-to-end test suite. This breaks the inheritance chain where new code adapts to old mistakes. You're not asking the model to improve existing code; you're asking it to implement a spec, which is a fundamentally different task that doesn't carry forward the accumulated context of prior errors.
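One way to make a specification concrete enough to generate against is to write it as an interface plus stated invariants, then have the model produce a fresh implementation validated against them. The sketch below uses a hypothetical rate limiter; the spec (the `Protocol` and its docstring) is the human-written part, the class beneath is the kind of clean-slate implementation you'd regenerate.

```python
# Spec-first sketch: the Protocol and invariants are the specification;
# the implementation is generated fresh against it, not adapted from
# whatever code existed before. All names are hypothetical.
from typing import Protocol

class RateLimiter(Protocol):
    """Spec: permit at most `limit` calls per sliding `window_s` seconds.

    Invariants:
      - allow() never raises
      - the (limit+1)-th call inside one window returns False
      - calls outside the window no longer count against the limit
    """
    def allow(self, now: float) -> bool: ...

class SlidingWindowLimiter:
    def __init__(self, limit: int, window_s: float):
        self.limit = limit
        self.window_s = window_s
        self._calls: list[float] = []

    def allow(self, now: float) -> bool:
        # Drop timestamps that have aged out of the window.
        self._calls = [t for t in self._calls if now - t < self.window_s]
        if len(self._calls) < self.limit:
            self._calls.append(now)
            return True
        return False
```

Because the spec stands apart from any prior implementation, regenerating the class carries forward none of the old code's accumulated context.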

Establish architectural boundaries with human-written scaffolding. The highest-leverage code a human developer can write in an AI-assisted codebase isn't feature code. It's the architectural scaffolding: interface definitions, type contracts, module boundaries, error handling conventions. These are the constraints that keep AI-generated code from drifting. When the model generates an implementation, it should be filling in a clearly defined shape, not inventing the shape as it goes. If the skeleton is sound, the flesh can be regenerated as models improve. If the skeleton is also AI-generated and subtly wrong, everything built around it inherits that wrongness.
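A minimal sketch of what such human-written scaffolding might look like, using a hypothetical payments boundary (the names are invented, not a real provider's API): the types and the abstract interface define the shape, and generated code only fills in concrete implementations behind it.

```python
# Hypothetical human-written scaffolding: frozen types, an explicit
# error contract, and an abstract boundary the generated code must
# implement rather than invent.
from abc import ABC, abstractmethod
from dataclasses import dataclass

@dataclass(frozen=True)
class Order:
    order_id: str
    amount_cents: int  # integer cents only: no float currency

class PaymentError(Exception):
    """The only failure mode callers are expected to handle."""

class PaymentGateway(ABC):
    @abstractmethod
    def charge(self, order: Order) -> str:
        """Return a provider transaction id, or raise PaymentError."""
```

Implementations can be regenerated wholesale as models improve; the frozen dataclass, the single exception type, and the abstract method signature are the parts a human keeps sound by hand.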

Simplify continuously, not periodically. The natural tendency is to defer cleanup to a dedicated "refactoring sprint" that never actually gets prioritized. In AI-assisted codebases, this is more dangerous than usual because of how fast bad patterns multiply. Every feature that ships on top of a compromised foundation makes the eventual cleanup more expensive. The discipline of simplifying as you go, removing dead code, consolidating duplicate patterns, reducing unnecessary abstractions with every change, is what keeps the codebase in a state where it can absorb the next round of AI-generated code without further degradation. Treat it as ongoing curation, not periodic renovation.

The maintainability horizon

There's a threshold that every AI-assisted codebase approaches, a point where the accumulated complexity and compounded errors make the system harder to modify than it would be to rebuild significant portions of it. The codebases that cross this threshold aren't the ones that used AI most aggressively. They're the ones that used AI without actively curating the output.

The developers who thrive in this environment won't be the ones who generate the most code fastest. They'll be the ones who understand that a codebase is a living system that requires constant tending, who invest in the specifications and test suites that provide guardrails, and who know when to stop layering new code on a compromised foundation and instead regenerate from solid ground.

The models will keep getting better. But better models writing new code on top of a flawed foundation still produce a flawed system. Long-term maintainability isn't a property of the model that writes your code. It's a property of the discipline with which you manage what gets written.

Anuradha Weeraman

CTO of Verdentra, software architect, and contributor to Linux & Debian. Building intelligent products and systems that scale.