Claude Batch API Cost Savings: It Saves 50%, Not 30%
>Cost-engineering is one Domain 1 lever of many. Claude Certified Architect Foundations Playbook ships the full orchestration build with schema enforcement and fallback loops wired in.

Claude Certified Architect Foundations Playbook
Pass the CCA-F Exam in 30 Days with 60 Scenario Drills and 8 Build Projects
Summary:
- Claude Batch API cost savings are 50% off input and output, uniformly across every active model. Not 30%.
- The routing decision is a 2-question tree: needs a response in seconds? then synchronous. Stable system prompt across many calls? then cache. Otherwise Batch.
- Caching reads are about 90% cheaper than base; Batch and caching compound on enrichment workloads.
- Bonus deliverable: a copy-paste batch submission snippet and the verified limits table.
Claude Batch API cost savings are the question the Claude Certified Architect exam loves to get you to answer wrong. The discount is 50% off both input and output tokens, applied uniformly across every active model. It is not 30%. The exam puts the 30% figure in the answer set on purpose, as a deliberate distractor, because a study guide author who guessed wrote 30% and a careful reader who checked the docs wrote 50%. This article is the second kind.

How much does the Claude Batch API save?
50% off input tokens and 50% off output tokens, uniformly across every active model. That includes Opus, Sonnet, and Haiku at every current version. This is the verified figure from Anthropic’s own Batch documentation, not a community estimate:
| Attribute | Documented value |
|---|---|
| Cost discount | All usage charged at 50% of standard API prices |
| Max requests per batch | 100,000 Message requests |
| Max batch size | 256 MB (whichever limit is reached first) |
| Typical processing time | Most batches complete within 1 hour |
| Processing / expiry window | Results when all complete or after 24h, whichever first; expires if not done in 24h |
| Result retention | Available for 29 days after creation |
Source: docs.claude.com/en/docs/build-with-claude/batch-processing.
Cost-engineering lives in Domain 1 (Agentic Architecture and Orchestration), the heaviest exam domain at 27% of the questions, where orchestration overlaps Domain 4. The math is plain. A $1,000/month synchronous Opus bill is $500/month on Batch for the same volume. If half your workload is latency-critical chat and half is overnight enrichment, routing the enrichment half to Batch saves $250/month with no code change other than the endpoint and a custom_id per request. The exam grades architects who make this routing decision per workload, not per system.
What broke: paying full price for work nobody was waiting on
The failure is routing batchable work through the synchronous endpoint. A nightly extraction job over a few thousand documents ran synchronously, hit the rate limit, failed one batch in twenty, and cost full price. No user was waiting on it. Every dollar of that bill above 50% was avoidable. The architectural cause is treating “which endpoint” as a system-wide default instead of a per-workload decision.
When should you batch, cache, or stay synchronous?
Run the two-question decision tree on every call path.
Question 1: Does the call need a response within seconds? If yes, it is synchronous. A user is waiting. Real-time chat, live triage, anything with a p95 latency SLA under five seconds. Synchronous hits the standard Messages API at full price. You optimize it with caching and smaller models, not by sending it to Batch, because Batch latency is minutes to hours.
Question 2 (only if no): Is the system prompt stable across many calls? If yes, the lever is caching. Anthropic’s prompt caching gives roughly 90% off cache reads versus base input price. Break-even is two reads; after the third read on the same cached block you save on every call. The syntax is one field on the block you want cached:
response = client.messages.create(
model="claude-opus-4-7",
max_tokens=2048,
system=[
{
"type": "text",
"text": LONG_SYSTEM_PROMPT, # 4096+ tokens
"cache_control": {"type": "ephemeral"},
}
],
messages=[{"role": "user", "content": user_query}],
)
Otherwise: the Batch API. Asynchronous, no shared stable preamble, tolerates a sub-24-hour window. Overnight extraction, eval runs, bulk classification. Submit the work as a batch and poll the job:
import anthropic
from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.messages.batch_create_params import Request
client = anthropic.Anthropic()
batch = client.messages.batches.create(
requests=[
Request(
custom_id=f"extract-{i:03d}",
params=MessageCreateParamsNonStreaming(
model="claude-opus-4-7",
max_tokens=1024,
messages=[{"role": "user", "content": doc}],
),
)
for i, doc in enumerate(documents)
]
)
print(f"batch id: {batch.id}") # poll batch.id until status is ended
The custom_id is how you match each response back to its input after the batch completes. Poll the job, then stream the results:
import time
while True:
job = client.messages.batches.retrieve(batch.id)
if job.processing_status == "ended":
break
time.sleep(30) # most batches finish within 1 hour
for result in client.messages.batches.results(batch.id):
cid = result.custom_id
if result.result.type == "succeeded":
print(cid, result.result.message.content)
else:
print(cid, "FAILED", result.result.type) # retry these synchronously
The few that fail are the ones you fall back to the synchronous path. Wrap that path so a transient rate-limit does not stampede:
import asyncio, anthropic
async def with_fallback(call, max_attempts=3, base_delay=1.0):
last_exc = None
for attempt in range(max_attempts):
try:
return await call()
except (anthropic.RateLimitError, anthropic.APITimeoutError) as exc:
last_exc = exc
await asyncio.sleep(base_delay * (2 ** attempt))
raise last_exc
Why caching and Batch compound
Because they are levers on different axes, not alternatives. Batch is 50% off every token. Caching is about 90% off the read of a stable preamble. A nightly enrichment job that runs Batch over thousands of documents, with the long extraction system prompt cached across the run, pays 50% of synchronous on the rotating content and about 90% off the cached preamble portion. The combined effect lands well under either lever alone. The exam writers ask you to identify which combination a scenario calls for, and the compounding case is the one architects miss because they learned only one lever.
One gotcha the exam loves: max_tokens on each batched request must be at least one. You cannot use the Batch endpoint for max_tokens=0 cache pre-warming. Know that before the question asks you.
What should you actually do?
- If a job is not user-facing and tolerates a sub-24-hour window → route it to the Batch API and take the 50%. Do not negotiate a higher rate-limit quota first; that is the expensive wrong move.
- If many calls reuse a long stable system prompt → cache it with
cache_controland stop paying full input price after the second read. - If the workload is both asynchronous and shares a preamble → do both. Batch the job, cache the preamble, and let the compounding land.
- If a study guide or answer option says the Batch discount is 30% → it is the planted distractor. The verified figure is 50%.
bottom_line
- The Claude Batch API saves 50%, not 30%. Memorize the real number; the wrong one is in the answer set on purpose.
- “Which endpoint” is a per-workload decision, not a system default. Run the two-question tree on every call path.
- Caching and Batch compound. The architects who know both beat the architects who know one.
Frequently Asked Questions
How much does the Claude Batch API actually save?+
50% off both input and output tokens, applied uniformly across every active model. It is not 30%. The 30% figure is a common wrong answer; the verified Anthropic figure is 50%.
When should I use the Batch API instead of synchronous calls?+
When the workload can tolerate a sub-24-hour completion window and no user is waiting in real time: overnight extraction, eval runs, bulk classification. If a user is waiting on the response, stay synchronous and cache instead.
Can I combine the Batch API with prompt caching?+
Yes, and they compound. Batch is 50% off all tokens; cache reads are about 90% cheaper than base input. A batched job with a cached shared preamble lands well below the naive synchronous cost.
More from this Book
Claude Agent Silent Failure Detection: 3 Mechanisms
Claude agent silent failure detection for output that passes schema but is wrong: assertion validators, a pinned golden set, and an LLM-as-judge stack.
from: Claude Certified Architect Foundations Playbook
The 5 Judgment Patterns the CCA Exam Actually Tests
The Claude Certified Architect exam grades pattern recognition, not trivia. The 5 scenario archetypes, their stem cues, and the one-line fix for each.
from: Claude Certified Architect Foundations Playbook
Claude Chatbot Hallucinating Fix: The Hub-and-Spoke
Claude chatbot hallucinating fix: a Coordinator, Researcher, and Synthesizer with scoped tools and a logged refusal path stop confident invention at the source.
from: Claude Certified Architect Foundations Playbook
MCP Works on My Machine, Breaks on Teammate's: 5 Fixes
MCP works on my machine breaks on teammate clones for 5 reasons. The portability gotchas, the .mcp.json project-scope rule, and the tool-naming convention fix.
from: Claude Certified Architect Foundations Playbook