> youcanbuildthings.com
tutorials books topics about

Claude Batch API Cost Savings: It Saves 50%, Not 30%

by J Cook · 8 min read·

Summary:

  1. Claude Batch API cost savings are 50% off input and output, uniformly across every active model. Not 30%.
  2. The routing decision is a 2-question tree: needs a response in seconds? then synchronous. Stable system prompt across many calls? then cache. Otherwise Batch.
  3. Caching reads are about 90% cheaper than base; Batch and caching compound on enrichment workloads.
  4. Bonus deliverable: a copy-paste batch submission snippet and the verified limits table.

Claude Batch API cost savings are the question the Claude Certified Architect exam loves to get you to answer wrong. The discount is 50% off both input and output tokens, applied uniformly across every active model. It is not 30%. The exam puts the 30% figure in the answer set on purpose, as a deliberate distractor, because a study guide author who guessed wrote 30% and a careful reader who checked the docs wrote 50%. This article is the second kind.

A two-question orchestration decision tree: question one, does the call need a response within seconds? Yes goes to Synchronous (real-time UX). No goes to question two, is the system prompt stable across many calls? Yes goes to Caching with cache_control type ephemeral, about 90% off cache reads. No goes to the Batch API, about 50% off input and output, 24-hour SLA. A callout reads "The Batch API saves 50%. Not 30%. The wrong number is distractor C on purpose."

How much does the Claude Batch API save?

50% off input tokens and 50% off output tokens, uniformly across every active model. That includes Opus, Sonnet, and Haiku at every current version. This is the verified figure from Anthropic’s own Batch documentation, not a community estimate:

AttributeDocumented value
Cost discountAll usage charged at 50% of standard API prices
Max requests per batch100,000 Message requests
Max batch size256 MB (whichever limit is reached first)
Typical processing timeMost batches complete within 1 hour
Processing / expiry windowResults when all complete or after 24h, whichever first; expires if not done in 24h
Result retentionAvailable for 29 days after creation

Source: docs.claude.com/en/docs/build-with-claude/batch-processing.

Cost-engineering lives in Domain 1 (Agentic Architecture and Orchestration), the heaviest exam domain at 27% of the questions, where orchestration overlaps Domain 4. The math is plain. A $1,000/month synchronous Opus bill is $500/month on Batch for the same volume. If half your workload is latency-critical chat and half is overnight enrichment, routing the enrichment half to Batch saves $250/month with no code change other than the endpoint and a custom_id per request. The exam grades architects who make this routing decision per workload, not per system.

What broke: paying full price for work nobody was waiting on

The failure is routing batchable work through the synchronous endpoint. A nightly extraction job over a few thousand documents ran synchronously, hit the rate limit, failed one batch in twenty, and cost full price. No user was waiting on it. Every dollar of that bill above 50% was avoidable. The architectural cause is treating “which endpoint” as a system-wide default instead of a per-workload decision.

When should you batch, cache, or stay synchronous?

Run the two-question decision tree on every call path.

Question 1: Does the call need a response within seconds? If yes, it is synchronous. A user is waiting. Real-time chat, live triage, anything with a p95 latency SLA under five seconds. Synchronous hits the standard Messages API at full price. You optimize it with caching and smaller models, not by sending it to Batch, because Batch latency is minutes to hours.

Question 2 (only if no): Is the system prompt stable across many calls? If yes, the lever is caching. Anthropic’s prompt caching gives roughly 90% off cache reads versus base input price. Break-even is two reads; after the third read on the same cached block you save on every call. The syntax is one field on the block you want cached:

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=2048,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,  # 4096+ tokens
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": user_query}],
)

Otherwise: the Batch API. Asynchronous, no shared stable preamble, tolerates a sub-24-hour window. Overnight extraction, eval runs, bulk classification. Submit the work as a batch and poll the job:

import anthropic
from anthropic.types.message_create_params import MessageCreateParamsNonStreaming
from anthropic.types.messages.batch_create_params import Request

client = anthropic.Anthropic()

batch = client.messages.batches.create(
    requests=[
        Request(
            custom_id=f"extract-{i:03d}",
            params=MessageCreateParamsNonStreaming(
                model="claude-opus-4-7",
                max_tokens=1024,
                messages=[{"role": "user", "content": doc}],
            ),
        )
        for i, doc in enumerate(documents)
    ]
)
print(f"batch id: {batch.id}")  # poll batch.id until status is ended

The custom_id is how you match each response back to its input after the batch completes. Poll the job, then stream the results:

import time

while True:
    job = client.messages.batches.retrieve(batch.id)
    if job.processing_status == "ended":
        break
    time.sleep(30)  # most batches finish within 1 hour

for result in client.messages.batches.results(batch.id):
    cid = result.custom_id
    if result.result.type == "succeeded":
        print(cid, result.result.message.content)
    else:
        print(cid, "FAILED", result.result.type)  # retry these synchronously

The few that fail are the ones you fall back to the synchronous path. Wrap that path so a transient rate-limit does not stampede:

import asyncio, anthropic

async def with_fallback(call, max_attempts=3, base_delay=1.0):
    last_exc = None
    for attempt in range(max_attempts):
        try:
            return await call()
        except (anthropic.RateLimitError, anthropic.APITimeoutError) as exc:
            last_exc = exc
            await asyncio.sleep(base_delay * (2 ** attempt))
    raise last_exc

Why caching and Batch compound

Because they are levers on different axes, not alternatives. Batch is 50% off every token. Caching is about 90% off the read of a stable preamble. A nightly enrichment job that runs Batch over thousands of documents, with the long extraction system prompt cached across the run, pays 50% of synchronous on the rotating content and about 90% off the cached preamble portion. The combined effect lands well under either lever alone. The exam writers ask you to identify which combination a scenario calls for, and the compounding case is the one architects miss because they learned only one lever.

One gotcha the exam loves: max_tokens on each batched request must be at least one. You cannot use the Batch endpoint for max_tokens=0 cache pre-warming. Know that before the question asks you.

What should you actually do?

  • If a job is not user-facing and tolerates a sub-24-hour window → route it to the Batch API and take the 50%. Do not negotiate a higher rate-limit quota first; that is the expensive wrong move.
  • If many calls reuse a long stable system prompt → cache it with cache_control and stop paying full input price after the second read.
  • If the workload is both asynchronous and shares a preamble → do both. Batch the job, cache the preamble, and let the compounding land.
  • If a study guide or answer option says the Batch discount is 30% → it is the planted distractor. The verified figure is 50%.

bottom_line

  • The Claude Batch API saves 50%, not 30%. Memorize the real number; the wrong one is in the answer set on purpose.
  • “Which endpoint” is a per-workload decision, not a system default. Run the two-question tree on every call path.
  • Caching and Batch compound. The architects who know both beat the architects who know one.

Frequently Asked Questions

How much does the Claude Batch API actually save?+

50% off both input and output tokens, applied uniformly across every active model. It is not 30%. The 30% figure is a common wrong answer; the verified Anthropic figure is 50%.

When should I use the Batch API instead of synchronous calls?+

When the workload can tolerate a sub-24-hour completion window and no user is waiting in real time: overnight extraction, eval runs, bulk classification. If a user is waiting on the response, stay synchronous and cache instead.

Can I combine the Batch API with prompt caching?+

Yes, and they compound. Batch is 50% off all tokens; cache reads are about 90% cheaper than base input. A batched job with a cached shared preamble lands well below the naive synchronous cost.