AI coding verification: why agents that grade their own homework always fail

June 26, 2026  ·  6 min read  ·  Concertor Engineering

The model that wrote the buggy code is the model being asked whether the code has bugs. Its blind spots do not turn off during review. This is not a product quality problem — it is a structural property of how LLMs fail, and no amount of prompting fixes it.

The self-grading problem

When a student writes an exam and then marks their own paper, the same reasoning errors that produced the wrong answers also produce misplaced confidence in those answers. A student who misunderstood the question does not suddenly understand it on picking up the red pen.

LLM coding agents have this problem at scale, and most tools ignore it entirely. The standard loop looks like this:

Typical agent loop (broken)
1. Write code
2. Ask the same model: "is this correct?"
3. Model says yes → ship it
4. (maybe) run tests after the model already approved

Steps 2 and 3 are not review. They are the model ratifying its own output using the same weights, the same training, and the same knowledge gaps that produced the output in the first place. Calling it "self-verification" does not make it verification.

What the research says about LLM self-evaluation

JudgeBench, a benchmark designed to test how reliably LLM judges pick the objectively correct response on hard problems, found that many strong models score in the range of 50–55%, barely better than flipping a coin. The harder the problem, the worse the calibration. These are the exact cases where you most need a reliable verifier.
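A rough back-of-the-envelope calculation shows how little an approval from such a judge is worth. The sketch below rests on two assumptions that do not come from JudgeBench: the judge is symmetric (equally accurate on correct and incorrect code), and the prior probability that the code is correct is a hypothetical 50%.

```python
def posterior_correct_given_approval(prior_correct: float, judge_accuracy: float) -> float:
    """P(code is correct | judge approved), via Bayes' rule.

    Assumes a symmetric judge: it approves correct code with probability
    `judge_accuracy` and rejects incorrect code with the same probability.
    """
    true_approvals = judge_accuracy * prior_correct
    false_approvals = (1.0 - judge_accuracy) * (1.0 - prior_correct)
    return true_approvals / (true_approvals + false_approvals)


# Hypothetical prior: on hard problems the code is correct half the time.
weak = posterior_correct_given_approval(prior_correct=0.5, judge_accuracy=0.55)
strong = posterior_correct_given_approval(prior_correct=0.5, judge_accuracy=0.95)

print(f"approval from a 55% judge: {weak:.2f}")
print(f"approval from a 95% judge: {strong:.2f}")
```

Under these assumptions, an approval from the 55% judge moves a 50% prior to only 55%. The approval carries almost no information about whether the code works.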

The author-judge correlation problem compounds this. When the judge model comes from the same model family as the author model, approximately 24% fewer bugs are caught compared to using a genuinely decorrelated judge. Shared training produces shared blind spots. Review within the same family becomes a sophisticated form of confirmation bias — the model finds plausible-sounding reasons to approve its own plausible-sounding mistakes.

This effect holds even when the judge is prompted to be adversarial, skeptical, or thorough. The issue is not the prompt. It is the weights.
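A toy model can illustrate the shared-blind-spot mechanism. The bug categories and blind-spot sets below are invented for illustration; they are not measured data, and the 24% figure above is not derived from them.

```python
# Toy model: every name and set below is hypothetical, chosen for illustration.

# Bugs the author shipped, which by construction fall in its own blind spots.
bugs_written = {"off_by_one", "race_condition", "unchecked_null"}

# A same-family judge shares most of the author's blind spots.
same_family_blind_spots = {"off_by_one", "race_condition", "timezone_handling"}

# A different-family judge has blind spots too, but they overlap less.
other_family_blind_spots = {"unchecked_null", "unicode_normalization"}


def caught(bugs: set[str], judge_blind_spots: set[str]) -> set[str]:
    """A judge can only catch bugs that fall outside its own blind spots."""
    return bugs - judge_blind_spots


print(sorted(caught(bugs_written, same_family_blind_spots)))
print(sorted(caught(bugs_written, other_family_blind_spots)))
```

In this toy setup the same-family judge catches one of the three bugs and the other-family judge catches two. Neither catches everything, and no change of prompt alters what sits inside a judge's blind-spot set.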

What real AI coding verification requires

Proper AI coding verification has three non-negotiable components:

1. Execution as the floor. Code either passes its tests or it does not. There is no "well, arguably" in a failing assertion. A test that errors does not care whether the LLM is confident the code is correct. Execution tests establish a ground truth that no opinion — human or machine — can override. An AI agent that actually runs your tests as part of its verification loop is doing something categorically different from one that reads the code and declares it fine.

2. A decorrelated judge. If you use an LLM as part of your verification pipeline (and you should, because tests do not catch every category of bug), it must come from a different model family than the author. Claude judging GPT's output, or GPT judging Claude's output, catches a meaningfully different set of problems because the two families' failure modes overlap far less than a model's failure modes overlap with its own. That decorrelation is what turns the judgment into a signal rather than noise.

3. Separation of roles across process boundaries. The author and the verifier must not share a context window. A model asked to write code and then immediately verify it in the same conversation has already anchored to its own output. The verification needs to happen in a fresh context, with the code presented as something to evaluate rather than something already approved.

The failure mode in practice

Here is what self-grading failure looks like in a real session. An agent writes a function that handles pagination incorrectly, with an off-by-one error on the page index. It is a subtle bug. The model's internal representation of the problem did not include the edge case, so the code does not include the edge case, and when the model reviews the code it does not notice the missing edge case. The same gap in reasoning spans all three steps: understanding the problem, writing the code, and reviewing it.

The agent reports success. Tests were not run because the model indicated confidence. The bug ships.
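The source scenario does not show the exact code, so the following is one hypothetical form such a bug could take, assuming pages are numbered from 1. It shows how a single executed assertion exposes what a confident review missed.

```python
def paginate_buggy(items, page, page_size):
    """Treats the 1-based page number as if it were 0-based."""
    start = page * page_size
    return items[start:start + page_size]


def paginate(items, page, page_size):
    """Return the requested 1-based page of items."""
    start = (page - 1) * page_size
    return items[start:start + page_size]


items = list(range(1, 11))  # the integers 1 through 10

# The correct version returns the first three items for page 1.
assert paginate(items, page=1, page_size=3) == [1, 2, 3]

# The buggy version silently skips the first page.
assert paginate_buggy(items, page=1, page_size=3) == [4, 5, 6]
```

Both functions read plausibly on inspection. Only running a test for page 1 makes the difference impossible to argue with.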

This is not an AI quality problem in the sense of "the model was not smart enough." It is an architectural problem. The smartest model in the world cannot reliably catch its own systematic errors. Humans cannot either — which is why code review exists, why tests exist, why pair programming exists. All of these practices exist because self-evaluation is structurally limited.

How Concertor approaches AI coding verification

Concertor inverts the standard loop. Execution tests run first, before any LLM judgment call. The LLM cannot override a failing test — its role is to catch the categories of problem that tests do not cover (design problems, logical errors in untested paths, security issues that do not manifest as assertion failures).

Concertor's verification pipeline
1. Claude proposes         GPT proposes
2. Execution tests run against both proposals
3. Claude judges GPT's output    GPT judges Claude's output
4. Cross-model synthesis → final output

The key constraint is structural: the agent that wrote the code is not the agent that approves it. The LLM judge is from a different family than the author, so their blind spots differ. Execution tests provide the ground truth that neither model can talk its way around.
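The ordering of these rules can be expressed as a small decision function. This is an illustrative sketch only, not Concertor's actual implementation: `Review`, `final_verdict`, and the family labels are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Review:
    author_family: str    # hypothetical labels such as "claude" or "gpt"
    judge_family: str
    tests_passed: bool
    judge_approved: bool


def final_verdict(review: Review) -> str:
    # Structural rule: the judge never comes from the author's model family.
    if review.judge_family == review.author_family:
        raise ValueError("judge must come from a different model family than the author")
    # Execution is checked first; a judge's approval cannot override a failing test.
    if not review.tests_passed:
        return "rejected"
    # Only code that passes its tests is subject to the judge's opinion.
    return "accepted" if review.judge_approved else "rejected"


print(final_verdict(Review("claude", "gpt", tests_passed=False, judge_approved=True)))
print(final_verdict(Review("gpt", "claude", tests_passed=True, judge_approved=True)))
```

Raising an error on a same-family pairing, rather than returning a verdict, reflects the idea that the constraint is enforced by the architecture and is not something a prompt or a model's confidence can waive.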

No agent grades its own homework. This is not a principle. It is a hard constraint that follows from what we know about how LLMs fail — and it has to be enforced architecturally, not by prompting.

If you are building AI coding tooling and your verification loop runs through the same model that authored the code, you do not have verification. You have an expensive confidence display. The difference matters, and your users will eventually notice.

Concertor enforces separation between authors and verifiers by design. Execution tests run first. The cross-model judge is never the author. No agent grades its own homework.
