Why does a local tool call take 52 seconds?

Almost always memory thrashing. The model plus its context didn't fit in RAM, so your machine is swapping to disk and waiting on I/O. Shrink the model or context until everything fits in RAM and the call drops to a few seconds.

Do I set the context window with a flag on ollama pull?

No. There's no --quantize or context flag on pull. Quantization is part of the model tag, and context length is the OLLAMA_CONTEXT_LENGTH environment variable on the server.

Will a bigger model fix slow coding?

Usually the opposite. A 7B that fits in RAM runs circles around a 14B that thrashes. For coding, a fast model that fits beats a smarter one that swaps to disk.

Why Your Local AI Coding Agent Is Slow

Summary:

Why the 52-second tool call is a memory problem, not a hardware verdict.

The three knobs that fix it: model size, quantization, context length.

A hardware tier table matching your RAM to a model that fits.

How to tell thrashing apart from a tool-call problem in ten seconds.

Asking “why is Ollama slow for coding” misses the real culprit: it’s almost never your hardware. The 52-second tool call that makes people quit local coding on day one is a memory problem with a knob, not a verdict on your laptop. The same setup that gives you a 52-second call can give you a 4-second one after twenty minutes of tuning.

What’s actually slow? Diagnose before you touch anything

A tool call that takes 52 seconds is almost never the model thinking hard. It’s one of two things, and the fix is different for each, so spend ten seconds finding out which before you change a setting.

Open your memory monitor (Activity Monitor on macOS, your system monitor elsewhere) and run the slow task again while you watch. You’re looking for one thing: is memory pegged near your total with the disk busy, or is memory comfortable while the model just churns? You can also check what’s loaded:

ollama ps

If memory is in the red, it’s thrashing: the model plus context overran your RAM, so the machine is swapping to disk and waiting on I/O instead of computing. That’s your 52 seconds. If memory is fine but the model keeps stumbling on tool calls specifically, it’s a different problem with the opposite fix. Diagnosing first is the whole game; changing knobs blind is how people “fix” the wrong problem and conclude nothing works.

The three knobs that fix it

Every slowness fix is one of three knobs, and they all point at the same goal: keep the working set in RAM.

Knob 1, model size. The instinct is to run the biggest model that loads. For coding, that instinct costs you. A model that’s slightly smarter but noticeably slower is a worse daily driver than a fast one that’s plenty smart for everyday work. Size is in the tag: qwen2.5-coder:3b, :7b, :14b, :32b.

Knob 2, quantization. This is how much the model is compressed, trading a little quality for much less memory and more speed. You choose it in the tag, the part after the colon, not a flag:

ollama pull qwen2.5-coder:7b-instruct-q4_K_M

That q4_K_M is a popular sweet spot: roughly four-bit precision, which cuts memory dramatically while keeping quality high enough that most people can’t tell the difference on coding tasks. There is no ollama pull --quantize flag. The tag is the knob.

Knob 3, context length. From the Ollama FAQ, the verbatim mechanism:

By default, Ollama uses a context window size of 4096 tokens. This can be overridden with the OLLAMA_CONTEXT_LENGTH environment variable. For example, to set the default context window to 8K, use:

OLLAMA_CONTEXT_LENGTH=8192 ollama serve

Source: docs.ollama.com/faq. To change it for a single session instead of the whole server, the same FAQ gives you a runtime override inside ollama run:

/set parameter num_ctx 8192

Context costs memory: a bigger window means more RAM goes to holding the conversation. You raise it when the model forgets earlier files; you lower it when the machine is swapping.

Match the model to the machine you already own

Here’s the whole tuning conversation in one table. Find your row and start from a configuration that fits, instead of guessing.

Your machine	Recommended model	Size	Usable context	Why this works
16GB no-GPU laptop	qwen2.5-coder:7B-instruct-q4_K_M	7B	~4k tokens	Fits easily. 7B in RAM. Snappy tool calls.
32GB workstation	qwen2.5-coder:14B-instruct-q4_K_M	14B	~32k tokens	Fits well. Room for context. Fast and stable.
48GB+ GPU box	qwen2.5-coder:32B-instruct-q4_K_M	32B	up to 256k tokens	Room to grow. Handles long context and big edits.

The numbers are recommended targets you set, not magic. On 16GB, target around 4k of context; on 32GB the 32k tier is comfortable; on a 48GB+ box you can push context toward 256k. Set it explicitly when you need it, with the variable above.

Hardware tier table matching RAM to a local coding model: a 16GB no-GPU laptop runs qwen2.5-coder 7B at ~4k context, a 32GB workstation runs 14B at ~32k, and a 48GB+ GPU box runs 32B up to 256k, alongside the fits-in-RAM versus thrashes-to-disk cliff that causes a 52-second tool call

What broke: the fits-versus-thrashes cliff

Here’s the failure mode behind almost every “local is too slow” complaint, and why the three knobs work as a system.

When the model plus its context plus everything else fits in RAM, the model runs at the speed your hardware can actually do. When it doesn’t fit, your operating system starts shuffling data between RAM and your much slower disk. That shuffling is swapping, and constant swapping under load is thrashing. A thrashing machine isn’t doing math slowly; it’s spending most of its time moving data instead of computing, which is why it doesn’t feel a little slow, it feels broken. That’s your 52-second tool call. The model isn’t thinking for 52 seconds. It’s waiting on the disk for most of them.

This is why all three knobs aim at the same target. A smaller model, a lighter quantization, a tighter context: each is a way to get the whole working set into fast memory. The cliff between “fits” and “doesn’t fit” is enormous, far bigger than the gentle slope of “slightly bigger model.” A 7B that fits runs circles around a 14B that thrashes, even though the 14B is “better,” because the 14B spends its life waiting on disk.

Two slow cases, two opposite fixes

The diagnosis from the start pays off here, because the two causes need opposite moves.

Memory red while it hangs? Thrashing. Drop the model size one notch (:14b to :7b), and if it’s still tight, lower the context length. Get comfortably back inside RAM with headroom, and the freezes stop completely.

Memory fine but the model keeps stumbling on tool calls? Counterintuitively, raise the context window. Both Codex’s and OpenCode’s docs recommend more context specifically to make tool-calling reliable, because the model needs room to keep the tool definitions and the conversation in view at once. That’s the opposite of the thrashing fix, which is exactly why you diagnose first.

What should you actually do?

If a tool call hangs, check your memory monitor before changing anything. Red means thrash; fine means a tool-call problem.
If you’re thrashing, drop the model size first, the context length second, until the working set fits in RAM.
Don’t buy a GPU to fix this yet. Tune what you own first. A GPU buys speed on work you may not even be doing.

The bottom line

The 52-second tool call was never a verdict on local coding. It was an untuned setup on capable hardware.
Keep the working set in RAM. That single rule explains all three knobs: smaller model, lighter quant, tighter context.
A fast model that fits beats a smart model that swaps. For coding, that trade is almost always right.

Why Your Local AI Coding Agent Is Slow

Run Claude Code Locally

What’s actually slow? Diagnose before you touch anything

The three knobs that fix it

Match the model to the machine you already own

What broke: the fits-versus-thrashes cliff

Two slow cases, two opposite fixes

What should you actually do?

The bottom line

Frequently Asked Questions

Why Your Local AI Coding Agent Is Slow

Run Claude Code Locally

What’s actually slow? Diagnose before you touch anything

The three knobs that fix it

Match the model to the machine you already own

What broke: the fits-versus-thrashes cliff

Two slow cases, two opposite fixes

What should you actually do?

The bottom line

Frequently Asked Questions

More from this book

Local vs Cloud AI Coding: When Local Loses

How to Pick a Local Model for Coding

Run a Local AI Coding Agent Free in 15 Minutes

Run Claude Code Locally with Ollama