What is a silent failure in a Claude agent?

An output that is schema-valid and wrong at the same time. No exception fires, no alert triggers, the JSON validates, and the answer is still incorrect. The user trusts it; the dashboard stays green.

How do you detect a silent failure if the schema passes?

Three mechanisms that operate on meaning, not shape: assertion-based Pydantic validators for business rules, a pinned golden set of 8 to 15 cases run on every change, and an LLM-as-judge that scores output with a threshold of 0.7.

Why is Domain 5 the domain that fails close candidates?

Domain 5 is 15% of the exam, the smallest weight, so candidates underprepare it. The 4-question margin that decides close pass/fail outcomes lives there, and silent-failure stems read easier than they are.

Claude Agent Silent Failure Detection: 3 Mechanisms

Summary:

A silent failure is output that passes schema validation and is still wrong. No exception. No alert. The gap between “schema-passed” and “actually-correct.”

Three mechanisms close the gap: assertion validators, a pinned golden set, and an LLM-as-judge.

The schema cannot catch this because it only enforces shape. Detection has to operate on meaning.

Bonus deliverable: a drop-in detector with a Pydantic validator, a golden-set runner, and a Haiku judge at threshold 0.7.

Claude agent silent failure detection is the skill the Claude Certified Architect exam grades in its most punishing domain. A silent failure is an output that is schema-valid and wrong at the same time. A supplier contract PDF flows in, output_config.format enforces the JSON schema, every required field is present, no exception is thrown, and the extracted answer is still incorrect. The dashboard stays green. The customer finds out three days later. This lives in Domain 5, which is 15% of the exam, the smallest weight and the one where the 4-question margin between pass and fail is decided.

What is a silent failure?

A silent failure is an output that is valid and wrong. The system reports success. No exception fires. No log line goes red. The schema validates, the Pydantic model is happy, the user reads the answer and assumes it is correct, and the architect watching the dashboard sees nothing. It is the most dangerous bug in an agentic system because every monitoring signal you have says everything is fine.

It lives in the gap between “the schema is satisfied” and “the answer is correct.” The schema layer cannot close that gap. output_config.format guarantees the response is valid JSON matching your schema; it guarantees nothing about whether the values are right. A supplier contract where the model routed the governing-law value into the payment-terms field is schema-valid and useless. Closing the gap needs mechanisms that operate on meaning, not shape. There are exactly three.

What broke: the green dashboard

A nightly extraction ran over a few thousand documents. The schema was tight. Every output validated. The dashboard was green all weekend. On Monday a customer flagged that one document layout had its fields silently swapped. The schema did not catch it because both values were valid strings. There was no business rule saying which value belonged in which field. The fix that would have caught it before the bill was a single assertion validator. That is mechanism one.

How do you detect output that passes schema but is wrong?

Three mechanisms, each catching what the previous one cannot. Build all three; one alone fails the exam stem that grades the other two.

Mechanism 1: Assertion-based validation. A Pydantic post-validator that codifies a business rule the schema cannot express. The schema says payment_terms is a string. The validator says payment_terms must name a recognized currency. Pydantic’s canonical @field_validator decorator is the primitive, from docs.pydantic.dev/latest/concepts/validators:

from pydantic import BaseModel, field_validator

class Model(BaseModel):
    f1: str
    f2: str

    @field_validator('f1', 'f2', mode='before')
    @classmethod
    def capitalize(cls, value: str) -> str:
        return value.capitalize()

Applied to the contract case, the rule is a real assertion:

from pydantic import BaseModel, field_validator
from datetime import date, timedelta

class ContractInfo(BaseModel):
    payment_terms: str
    effective_date: date

    @field_validator("payment_terms")
    @classmethod
    def must_name_currency(cls, v: str) -> str:
        assert any(c in v for c in ["USD", "EUR", "GBP", "JPY"]), \
            f"payment_terms missing currency: {v}"
        return v

    @field_validator("effective_date")
    @classmethod
    def within_a_decade(cls, v: date) -> date:
        assert v <= date.today() + timedelta(days=365 * 10), \
            f"effective_date implausibly far out: {v}"
        return v

Mechanism 1 is the cheapest layer and the easiest to over-trust. It catches rule violations. It does not catch a wrong answer that satisfies every rule.

Mechanism 2: Golden-set regression. A pinned set of 8 to 15 inputs whose correct outputs are known and checked into the repo. On every prompt, model, or tool change, the agent runs the golden set and the outputs are diffed against the pinned answers. A non-empty diff means the change broke something a previous version got right.

import json
from pathlib import Path

def golden_set_regression(agent, samples: Path, expected: Path):
    s = json.loads(samples.read_text())       # 8-15 pinned cases
    e = json.loads(expected.read_text())
    failures = []
    for case_id, payload in s.items():
        actual = agent.run(payload)
        if actual != e[case_id]:
            failures.append({"id": case_id, "expected": e[case_id], "actual": actual})
    return failures

Mechanism 2 is the most architecturally important of the three. The golden set is your test set; it grows as you discover new failure modes; it runs in CI on every commit. Naming it as architecture, not as QA cleanup, is what the exam grades highest.

Mechanism 3: LLM-as-judge. A second Claude call, on Haiku 4.5 for cost, that critiques the first output against the question and returns a structured verdict. Below a confidence of 0.7, flag for human review; well below, refuse and fall back.

import anthropic
from pydantic import BaseModel
from typing import Literal

class JudgeVerdict(BaseModel):
    quality: Literal["good", "questionable", "bad"]
    reason: str
    confidence: float

client = anthropic.Anthropic()

def judge(question: str, answer: str) -> JudgeVerdict:
    r = client.messages.parse(
        model="claude-haiku-4-5",
        max_tokens=512,
        messages=[{"role": "user", "content":
            f"Judge whether the answer is correct and grounded.\n"
            f"Question: {question}\nAnswer: {answer}\nReturn the schema."}],
        output_format=JudgeVerdict,
    )
    return r.parsed_output  # confidence < 0.7 -> flag for review

The judge is not infallible. It catches a meaningful fraction of the silent failures the assertion layer cannot, which is the point: each mechanism covers the previous one’s blind spot.

Why this is the domain that fails close candidates

Because Domain 5 is the smallest weight, candidates spend the least time on it, and the 4-question margin lives there. Silent-failure stems read like edge cases you can guess on. They are the most discriminating questions on the test. A candidate who failed by three or four questions almost certainly lost them in a domain exactly like this one: small, underprepared, and full of stems that look easier than they are.

The stem cues are consistent: “passes validation,” “no exception thrown,” “the output looks fine,” “user reports the wrong answer days later,” “schema validates but downstream complains.” See those phrases and the answer is this three-mechanism stack, never the schema or fallback layers from the structured-output chapter.

What should you actually do?

If your outputs pass schema and you still get wrong answers in production → you have a silent failure. Add an assertion validator for the one business rule that would have caught the last incident.
If you change prompts or models without a regression check → pin 8 to 15 golden cases and diff on every commit before anything else.
If assertions and the golden set still miss semantic errors → add the Haiku judge at a 0.7 threshold. Flag below it; refuse well below it.
If you are prepping for the exam → when a stem says “schema validates but the answer is wrong,” do not reach for output_config.format. That is the trap. Reach for this stack.

bottom_line

A silent failure is schema-valid and wrong. Every monitoring signal lies. Detection has to operate on meaning.
Build all three mechanisms. Assertions catch rule breaks, the golden set catches regressions, the judge catches semantic errors. One alone is not enough.
Domain 5 is small and decisive. The 4-question margin lives where candidates prepare least. Drill it like it is the whole exam.

Claude Agent Silent Failure Detection: 3 Mechanisms

Claude Certified Architect Foundations Playbook

What is a silent failure?

What broke: the green dashboard

How do you detect output that passes schema but is wrong?

Why this is the domain that fails close candidates

What should you actually do?

bottom_line

Frequently Asked Questions

Claude Agent Silent Failure Detection: 3 Mechanisms

Claude Certified Architect Foundations Playbook

What is a silent failure?

What broke: the green dashboard

How do you detect output that passes schema but is wrong?

Why this is the domain that fails close candidates

What should you actually do?

bottom_line

Frequently Asked Questions

More from this book

Claude Batch API Cost Savings: It Saves 50%, Not 30%

The 5 Judgment Patterns the CCA Exam Actually Tests

Claude Chatbot Hallucinating Fix: The Hub-and-Spoke

MCP Works on My Machine, Breaks on Teammate's: 5 Fixes