An LLM Is Not a Compiler: Why Skipping AI Code Review Is Risk Transfer

May 13, 2026 · 11 min read · Sebastian Sigl

“Nobody reviews compiler output. Why would you review AI output?”

The argument surfaced on May 6, 2026 in a post by John Crickett that hit 33,500 views and 837 likes in a day. Uncle Bob Martin jumped in to reinforce the rebuttal. The developer community split into two camps. One camp felt the analogy was obvious and correct. The other camp felt it was dangerously wrong.

The second camp is right. And understanding why the analogy is wrong might be the most important mental model shift for developers using AI coding tools in 2026.

The compiler analogy is seductive because it feels true at a surface level. You write something. A tool transforms it into something else. You use the result without examining every instruction. But this surface similarity hides a fundamental difference in how trust was earned, what guarantees exist, and what happens when things go wrong.

If your team is using this analogy to justify skipping review of AI-generated code, you are not saving time. You are transferring risk. And the bill comes due at 2am when something fails in a way nobody can explain.

How Compilers Earned Your Trust

You do not review compiler output. That is true. But you do not review it for specific reasons that took decades to establish.

A compiler takes a formally specified language as input. C has an ISO standard. Java has the Java Language Specification. JavaScript has the ECMAScript specification. Every construct, every keyword, every operator has a defined behavior, and where behavior is left undefined, the spec says so explicitly. When you write int x = 5;, the compiler knows exactly what that means. There is no interpretation. There is a spec, and the compiler follows it.

Given the same source, the same compiler version, and the same flags, a compiler produces the same output. Every time. With every run. This property is called determinism, and it is not a nice-to-have. It is the foundation of why you can trust the result without checking it. If you compiled your code yesterday and it worked, compiling the same code today with the same toolchain will produce a binary that behaves the same way. You can build on that guarantee.

When a compiler encounters something it cannot handle, it fails loudly. You get an error. A line number. A description of what went wrong. The compiler does not guess. It does not produce something that looks right but might be wrong. It stops and tells you the problem.

These three properties together (formal specification, determinism, and loud failure) are what make compiler output trustworthy. But none of them were free. The C specification took years to stabilize. Early compilers had bugs. The trust was earned by millions of engineers using the same tools over decades, finding and fixing edge cases, building a shared understanding of what “correct compilation” means.

You trust gcc because nearly 40 years of engineers have validated it. You trust the TypeScript compiler because its type rules are documented, heavily exercised, and its behavior is predictable. That trust was not given. It was earned.

Why LLMs Have None of These Properties

Now take the same framework and apply it to an LLM generating code from a prompt.

Figure: two trust chains.
Compiler: Formal spec input → Deterministic output → Loud failure on error → Trustworthy result
LLM: Ambiguous natural-language input → Non-deterministic output → Silent failure on confusion → Unknown risk

No formal specification for input. When you write “create a function that handles user input safely,” what does “safely” mean? Input validation? SQL injection prevention? XSS filtering? Rate limiting? Payload size limits? All of the above? The LLM picks an interpretation. You do not know which one until you read the code. Natural language is ambiguous by nature. Better prompting can narrow that ambiguity, but it cannot remove it. It is fundamental to how language works.

Non-deterministic output. Give the same prompt to the same model twice and you may get different code. Temperature settings, batch sizes, hardware variance, floating point operations during inference, even the order of parallel computations push the output in different directions. This is not a bug in the model. It is how statistical text generation works. The model samples from a probability distribution. Sampling is inherently variable.
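
To make the sampling step concrete, here is a minimal sketch in TypeScript. It is not how any particular model is implemented. The names softmax and sampleIndex, the toy scores, and the three candidate tokens are assumptions made up for illustration. The sketch shows only the general idea: scores become probabilities, and a random draw picks the next token.

```typescript
// Illustrative sketch only: toy scores and hypothetical function names.

// Convert raw scores into probabilities. A lower temperature sharpens the
// distribution; a higher temperature flattens it. Temperature must be > 0.
function softmax(scores: number[], temperature: number): number[] {
  const scaled = scores.map((s) => s / temperature);
  const max = Math.max(...scaled); // subtract the max for numerical stability
  const exps = scaled.map((s) => Math.exp(s - max));
  const total = exps.reduce((sum, e) => sum + e, 0);
  return exps.map((e) => e / total);
}

// Pick an index according to the probabilities. `random` must return a
// number in [0, 1); Math.random is the default.
function sampleIndex(
  probs: number[],
  random: () => number = Math.random,
): number {
  const r = random();
  let cumulative = 0;
  for (let i = 0; i < probs.length; i++) {
    cumulative += probs[i];
    if (r < cumulative) return i;
  }
  return probs.length - 1; // guard against floating point rounding
}

const candidates = ["throw", "return", "await"]; // toy "next tokens"
const probs = softmax([2, 1, 0], 1);
console.log(candidates[sampleIndex(probs)]); // may differ from run to run
```

Injecting the random source is a deliberate choice here: it makes the draw testable, and it makes visible that the variability comes from the draw, not from the scores.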

Silent failure. When an LLM cannot handle a request well, it does not throw an error. It produces something. Always. It generates code that looks syntactically correct, uses reasonable variable names, and might even include helpful comments. But that code might misunderstand your requirements, introduce a subtle logic error, or silently ignore an edge case you assumed would be handled. There is no red squiggly line. There is no “ERROR: ambiguous input on line 7.” There is just code that looks right.
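
A small, hypothetical illustration of what this looks like in code. It is not output from any model, and parsePortLenient and parsePortStrict are made-up names. The first version reads cleanly and handles the obvious cases, yet it quietly accepts malformed input because of how JavaScript's parseInt works.

```typescript
// Hypothetical example: plausible-looking code with a silent assumption.
function parsePortLenient(input: string): number {
  const port = parseInt(input, 10);
  if (Number.isNaN(port)) {
    throw new Error(`Invalid port: ${input}`);
  }
  // parseInt stops at the first non-digit, so "8080abc" yields 8080.
  return port;
}

// A stricter variant: digits only, and within the valid port range.
function parsePortStrict(input: string): number {
  if (!/^\d+$/.test(input)) {
    throw new Error(`Invalid port: ${input}`);
  }
  const port = Number(input);
  if (port < 1 || port > 65535) {
    throw new Error(`Port out of range: ${input}`);
  }
  return port;
}

console.log(parsePortLenient("8080abc")); // 8080, no error raised
```

Both functions pass a test that checks "8080" and "not-a-port". Only a reader who asks what happens with "8080abc" or "70000" finds the difference.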

These are not temporary limitations that better models will fix. They are fundamental properties of the technology. Natural language will always be ambiguous. Statistical sampling will always be variable. And generative models will always produce output rather than refusing to act on unclear input.

When AI code passes tests but fails in production, it is often because these properties were not accounted for. The code compiled. The tests passed. But the silent assumptions baked into the generated code did not match the production environment.

We Are at the Hand-Written Assembly Stage

Here is a historical parallel that makes the situation clearer.

In the 1950s and 1960s, early compilers existed but engineers still reviewed their output. Not because the engineers were paranoid. Because the compilers had not earned trust yet. The specs were incomplete. Edge cases were uncharted. The only way to know if the compiled output was correct was to read it.

Over time, as specifications stabilized, as millions of engineer-hours validated the tools, as the community built shared mental models of what “correct” looked like, the need for reviewing compiler output diminished. Trust was earned incrementally.

With LLMs, we are at the hand-written assembly stage. The tools exist. They produce output. But the trust has not been earned. And unlike compilers, there is a strong argument that it cannot be earned in the same way, because the fundamental properties (ambiguous input, non-deterministic output, silent failure) are not temporary bugs. They are inherent to the architecture.

This does not mean AI coding tools are useless. It means the relationship developers have with them must be different from the relationship they have with compilers. A compiler is a tool whose output you trust without verifying it. An LLM is a tool you trust to generate, but whose output you verify yourself. The distinction between vibe coding and agentic coding maps directly onto this: vibe coding treats the LLM like a compiler. Agentic coding treats it like a junior developer whose output needs review.

The developers building production systems understand this intuitively. A UC San Diego study found that zero out of thirteen professional developers practiced vibe coding. Not because they are slow adopters. Because they understand the trust model is different.

The Silent Failure Experiment

To make this concrete, I ran a reproducible experiment. Same model (Claude 3.5 Sonnet), same prompt, five consecutive runs with default temperature. The task: “Write a TypeScript function that validates and sanitizes user email input for a registration form. Include appropriate error handling.”

Five runs. Five different implementations:

Run 1: Regex validation only, throws on invalid input
Run 2: Regex plus length check, returns a result object with success/failure
Run 3: Regex, length check, DNS MX record lookup (async), throws custom error class
Run 4: Simple regex, trims whitespace, lowercase normalization, returns boolean
Run 5: Regex, checks for disposable email domains against a hardcoded list, returns validation result with specific error codes

Five different interpretations of “validates and sanitizes.” Five different error handling strategies. Five different return types. All syntactically correct. All would pass a basic test suite for valid versus invalid email. None explicitly wrong.
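
To show how far apart such variants sit, here is a sketch of three of them. These are my own simplified reconstructions in the spirit of Runs 1, 2, and 4, not the code the model actually produced. The regex, the 254-character limit, and all function names are assumptions for illustration.

```typescript
// Simplified reconstructions, not the actual generated code.
const EMAIL_PATTERN = /^[^\s@]+@[^\s@]+\.[^\s@]+$/;
const MAX_LENGTH = 254; // illustrative limit, chosen for this sketch

// In the spirit of Run 1: regex only, throws on invalid input.
function validateEmailOrThrow(input: string): string {
  if (!EMAIL_PATTERN.test(input)) {
    throw new Error("Invalid email address");
  }
  return input;
}

// In the spirit of Run 2: regex plus length check, returns a result object.
type ValidationResult =
  | { success: true; email: string }
  | { success: false; error: string };

function validateEmailResult(input: string): ValidationResult {
  if (input.length > MAX_LENGTH) {
    return { success: false, error: "Email is too long" };
  }
  if (!EMAIL_PATTERN.test(input)) {
    return { success: false, error: "Email format is invalid" };
  }
  return { success: true, email: input };
}

// In the spirit of Run 4: trims, lowercases, returns a boolean.
function isValidEmail(input: string): boolean {
  return EMAIL_PATTERN.test(input.trim().toLowerCase());
}

// All three agree on the obvious cases...
console.log(isValidEmail("user@example.com")); // true
// ...but not on input with surrounding whitespace.
console.log(isValidEmail(" User@Example.com ")); // true
console.log(validateEmailResult(" User@Example.com ").success); // false
```

A caller written against one of these contracts does not work with the others: one needs a try/catch, one needs a check on success, one needs an if. The whitespace case shows that the behavior differs too, not just the signature.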

But only one of them matches what your system actually needs. And which one depends on context the LLM does not have: your error handling conventions, your API contract, whether you need async validation, whether you care about disposable emails, what your downstream systems expect as input.

A compiler given the same implementation of validateEmail(input: string): boolean will always produce the same output. An LLM given “validate email” may produce something different on every run, and each time it silently assumes context it was never given.

This is why AI code review is not optional. It is not about catching bugs in the traditional sense. It is about verifying that the AI’s silent assumptions match your system’s actual requirements. You can reproduce this experiment yourself with any model and any task description. The variance is not an edge case. It is the default behavior.
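
If you do rerun the experiment, a small helper makes the variance easy to count. The sketch below assumes you have already collected the generated outputs as strings, however you call your model; groupOutputs is a hypothetical name, and collapsing whitespace is my own choice so that formatting-only differences are not counted as distinct implementations.

```typescript
// Hypothetical helper: group collected outputs and count each distinct one.
// Whitespace is collapsed so that pure formatting differences are ignored.
function groupOutputs(outputs: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const output of outputs) {
    const key = output.replace(/\s+/g, " ").trim();
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Placeholder strings standing in for real generated code.
const collected = [
  "function a() { return 1; }",
  "function a() {\n  return 1;\n}",
  "function a() { throw new Error(); }",
];
console.log(groupOutputs(collected).size); // 2
```

Text equality is a crude measure. Two outputs that differ in text may still behave identically, so treat the count as a prompt to read the variants, not as a verdict.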

What Risk Transfer Actually Costs Your Team

When a team decides that AI-generated code does not need review “because it is like compiler output,” they are making a specific bet. They are betting that the AI’s interpretation of their prompt matches their actual requirements. That the silent assumptions are correct. That the non-deterministic generation happened to produce the variant their system needs.

Sometimes that bet pays off. For simple, well-constrained tasks with clear specifications, the AI output is often correct. But for anything involving architectural decisions, system integration, error handling strategy, or domain-specific logic, the bet fails often enough to matter.

The cost does not show up immediately. It shows up three months later when a developer opens a file and cannot understand why the email validation function makes an async DNS call. When an on-call engineer gets paged because the error handling strategy in one module contradicts the pattern used everywhere else. When a security audit discovers that “sanitizes user input” was interpreted as “trims whitespace” rather than “prevents injection attacks.”

This is risk transfer. The time you saved by not reviewing is not gone. It is borrowed, and the interest compounds in debugging sessions, incident investigations, and architectural confusion. Every piece of unreviewed AI code is a small bet that the AI’s silent assumptions are correct. Stack enough of those bets and the probability of a painful surprise approaches certainty.
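
The stacking effect can be put into numbers with a back-of-the-envelope model. Assume, purely for illustration, that each unreviewed change carries the same small probability p of a wrong silent assumption and that changes are independent. Neither assumption holds exactly in a real codebase, and the 5% figure below is hypothetical, not a measurement. Under those assumptions the chance of at least one miss across n changes is 1 - (1 - p)^n.

```typescript
// Probability that at least one of n independent bets fails, when each
// fails with probability p. Independence is a simplifying assumption.
function probabilityOfAtLeastOneMiss(p: number, n: number): number {
  if (p < 0 || p > 1) {
    throw new RangeError("p must be between 0 and 1");
  }
  if (!Number.isInteger(n) || n < 0) {
    throw new RangeError("n must be a non-negative integer");
  }
  return 1 - Math.pow(1 - p, n);
}

// Hypothetical per-change miss rate, chosen only to show the curve's shape.
const perChangeMissRate = 0.05;
for (const n of [1, 10, 50, 100]) {
  const risk = probabilityOfAtLeastOneMiss(perChangeMissRate, n);
  console.log(`${n} changes: ${(risk * 100).toFixed(1)}%`);
}
```

The exact numbers matter less than the shape: a per-change risk that looks negligible stops being negligible once enough changes accumulate.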

The practical question is not “should I review AI code.” It is “what happens to my team if I do not.” And the answer is: you accumulate invisible technical debt that nobody can explain because nobody read the code when it was written.

If you want a concrete framework for what effective review actually looks like in practice, the guide on reviewing AI-generated code breaks down the specific skills involved.

The Trust Spectrum in Practice

Not all AI-generated code carries the same risk. The practical answer is not “review everything with equal intensity.” It is “understand where on the trust spectrum each piece of generated code falls.”

Low-risk, high-trust territory: boilerplate code with clear specifications. A React component that renders a list of items. A database migration that adds a nullable column. A test file that follows an established pattern. These are cases where the AI interpretation space is small, the correctness criteria are clear, and a quick glance confirms the output is reasonable.

High-risk, low-trust territory: anything involving system integration, error handling strategy, security boundaries, performance-critical paths, or domain logic. These are cases where the AI interpretation space is large, where “looks correct” and “is correct for your system” are different questions, and where silent failures have real consequences.
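
One way a team might encode this spectrum is as a simple lookup that maps the kind of change to a review depth. The sketch below is an assumption-laden illustration, not a recommended policy: the category names are taken from the two lists above, while the type names, the reviewDepth function, and the rule that unclassified changes get a full review are my own choices.

```typescript
// Hypothetical categories, derived from the low-risk and high-risk lists.
type ChangeKind =
  | "boilerplate"
  | "pattern-following-test"
  | "additive-migration"
  | "system-integration"
  | "error-handling"
  | "security-boundary"
  | "performance-critical"
  | "domain-logic";

type ReviewDepth = "quick-glance" | "full-review";

const HIGH_RISK = new Set<ChangeKind>([
  "system-integration",
  "error-handling",
  "security-boundary",
  "performance-critical",
  "domain-logic",
]);

// A change gets a full review if any part of it is high-risk. Changes
// nobody classified also get a full review, as the cautious default.
function reviewDepth(kinds: ChangeKind[]): ReviewDepth {
  if (kinds.length === 0) return "full-review";
  return kinds.some((kind) => HIGH_RISK.has(kind))
    ? "full-review"
    : "quick-glance";
}

console.log(reviewDepth(["boilerplate"])); // quick-glance
console.log(reviewDepth(["boilerplate", "security-boundary"])); // full-review
```

The lookup does not replace judgment. Someone still has to decide which kind a change is, and that decision is exactly the skill the spectrum depends on.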

The skills that AI cannot replace are precisely the ones needed to make this distinction. Knowing where trust is warranted and where it is not. Knowing which of the five email validation variants your system actually needs. Knowing why the previous engineer avoided that pattern.

The Compounding Risk Nobody Measures

There is a second-order effect that is harder to see. When a team stops reviewing AI code, they also stop building shared understanding of their codebase.

Code review was never just a quality gate. It was how knowledge moved between team members. When you review a piece of AI-generated code and ask “why is it done this way,” you discover that nobody decided it should be done this way. The AI decided. And the AI cannot explain its reasoning because it did not reason. It predicted tokens.

Six months of unreviewed AI code produces a codebase where nobody can explain why decisions were made. Not because the decisions were bad. Because there were no decisions. There were predictions. And predictions do not come with rationale.

This is the risk nobody measures. Not individual bugs. Not single failures. The gradual erosion of shared understanding about how and why the system works the way it does. The transformation of a codebase from “a system we built and understand” to “a system that was generated and we navigate.”

What This Means for Your Practice

The compiler analogy is wrong, but the instinct behind it is understandable. Developers want to move fast. They want to trust their tools. They want to focus on the interesting problems and let automation handle the rest. Those are reasonable desires.

But “trust” is not binary. It exists on a spectrum, and it is earned through evidence, not analogy. The right response to AI coding tools is neither to review nothing (the compiler fallacy) nor to review everything with paranoid intensity. It is to build the judgment that lets you know which code needs scrutiny and which does not.

That judgment is a skill. It develops through practice. Through reading code critically. Through noticing when AI output looks correct but feels wrong. Through building the pattern recognition that flags silent assumptions before they become production incidents.

The developers who will thrive in the AI-first era are not the ones who generate code fastest. They are the ones who can look at generated code and tell you what is trustworthy and what is not. Who can read the five different email validation implementations and immediately identify which one fits their system. Who treat the LLM as what it is: a powerful generation tool that requires human verification.

Not because AI is bad. Because the technology’s fundamental properties require it. An LLM is not a compiler. It never will be. And the developers who understand this distinction will be the ones trusted with the codebases that matter.
