Where do AI agent guardrails belong?

On the tool-call boundary, in code, not in the prompt. The dangerous moment is the one line where a proposed tool call becomes a real action. A deterministic gate on that line classifies the call and blocks or escalates it before it runs.

Can't I just tell the model not to run dangerous commands?

No. A prompt instruction or a model-consulted allowlist is a suggestion, not a guardrail. The moment enforcement depends on the model or on output it can route around, it is bypassable. One real allowlist bypass scored a CVSS 10.0.

What is the actual attack surface for an AI agent?

The tool-call manifest: the list of things the agent's tools can touch, not the words going into the model. A model that can only read files cannot delete your repo no matter how it is prompted, because deletion is not on its manifest.

AI Agent Guardrails: Gate Every Tool Call

Summary:

AI agent guardrails belong on one line of code: where a tool call becomes an action.

Build a gate that classifies every call as allow or ask, and defaults to ask.

The real attack surface is the tool-call manifest, not the prompt.

You get a gate that fails safe, escalates the dangerous stuff, and lets the agent keep working.

AI agent guardrails belong in exactly one place, and it is not the prompt. Your agent gets dangerous at the moment it does something, and doing something is never the prompt. It is a tool call. An agent that can run commands can run the wrong one. An agent that can hit the network can send your secrets somewhere you did not intend. The good news: you do not need a security team to close most of this gap. You need a gate in the right place, and the right place is more specific than you would guess.

Where does suggest become act?

Look at any agent loop and find the exact line where the danger lives. The model returns tool calls, you parse them, and then:

result = TOOLS[name](**args)

That line is the trust boundary. To the left of it, a tool call is just a suggestion: text the model produced, sitting harmlessly in a variable. The model “wants” to run a command, and wanting is free. The instant that line executes, the suggestion becomes an action with real consequences on your real machine. Everything before it was reversible. Nothing after it is.

The trust boundary where the gate intercepts a tool call: a reversible suggestion on the left, the gate that classifies and decides in the middle, the irreversible action on the right, with guardrails on the gate not the prompt

Every guardrail you will ever build sits on that one seam. In the bare loop the seam has nothing on it, so a confident-but-wrong rm -rf ./build runs with zero friction. The thing we put on the seam is the gate.

The attack surface is the manifest, not the prompt

Most people think about agent safety as a prompt problem: injection, jailbreaks. That is real, and it is not where your agent gets dangerous. A model that can only read files cannot delete your repo no matter how badly it is prompted, because deletion is not on its manifest. The real attack surface is the tool-call manifest: the list of things the tools can actually touch.

The security world is writing this down. OWASP, the people behind the famous web-app Top 10, now publish a Top 10 for agentic applications, and the top two entries are exactly this: tool misuse, and identity and privilege abuse. Both are about the manifest. Neither is about the prompt. Watch a delete request bounce off a missing capability:

def attempt(tools, name, args, human_says_yes):
    if name not in tools:
        return f"IMPOSSIBLE: {name!r} is not on this manifest"
    if not human_says_yes:
        return "DENIED at the gate"
    return tools[name](**args)

READ_ONLY = {"read_file": lambda path: "(contents)"}
FULL = {"read_file": lambda path: "(contents)", "run": lambda command: "(ran)"}

print(attempt(READ_ONLY, "run", {"command": "rm -rf ."}, human_says_yes=True))
print(attempt(FULL,      "run", {"command": "rm -rf ."}, human_says_yes=True))

IMPOSSIBLE: 'run' is not on this manifest
DENIED at the gate

In the first line the operator said yes to a delete and nothing deleted, because run was never on the manifest. The same call reaches the shell only when the shell tool is on it. The only variable that changed was the manifest.

How to build the gate

Step 1: classify every call. Allow the safe, ask about everything else, and default to ask.

SAFE_PREFIXES = ("ls", "cat", "head", "grep", "rg", "find", "pytest",
                 "git status", "git diff", "git log")
DANGER_SIGNS = ("rm ", "sudo", "mv ", "dd ", "mkfs", "chmod", ">", ">>",
                "curl", "wget", "git push", "pip install", ":(){")

def classify(name, args):
    if name in ("read_file", "update_progress"):
        return "allow"                       # reads are free
    if name == "write_file":
        return "ask"                         # every write goes to a human
    if name == "run":
        command = args.get("command", "")
        if any(sign in command for sign in DANGER_SIGNS):
            return "ask"                     # danger sign anywhere: escalate
        if command.strip().startswith(SAFE_PREFIXES):
            return "allow"
    return "ask"                             # default-deny

Two ordering choices matter more than they look. The danger check runs before the allowlist, so pytest && rm -rf build does not sneak through on the strength of starting with pytest. And every branch ends in ask, not allow. A default-allow gate is not a gate. It is a speed bump you forgot to install.

Step 2: build the gate that asks. When a call is not auto-safe, show a human exactly what is about to happen and wait. If no human is attached, deny.

def gate(name, args):
    if classify(name, args) == "allow":
        return True
    print(f"\n[gate] agent wants to run: {name}({args})")
    try:
        answer = input("[gate] allow this action? [y/N] ").strip().lower()
    except EOFError:
        answer = "n"          # no human present: deny by default (fail safe)
    return answer == "y"

When the safety mechanism is unsure, it fails toward “no.” That is the only direction a guardrail is ever allowed to fail.

Step 3: wire it into the loop. Ask the gate before you run the tool, and feed a denial back to the model as the tool result so it adapts instead of crashing.

if not gate(name, args):
    result = ("DENIED by the guardrail: a human declined this action. "
              "Do not retry it. Find another approach.")
else:
    result = TOOLS[name](**args)

That denial message does real work. The agent does what a good colleague does when told no: it explains why, finds another way, or asks. The gate keeps the conversation going on the safe side of the line.

What broke: the model can’t enforce its own rules

The most tempting shortcut in agent safety is to skip the code and just tell the model to behave. Put “never run destructive commands” in the system prompt and trust it. It feels equivalent. It is a security hole you can drive a truck through.

The proof is a real advisory against a shipping coding CLI. Its auto-approve mode had an allowlist meant to block dangerous commands, and researchers found a command that the allowlist waved through but that did the forbidden thing anyway. It earned a CVSS score of 10.0, the maximum. The lesson is structural: the moment your enforcement depends on the model, or on a check the model’s output can slip around, you do not have a guardrail. You have a suggestion. Enforcement has to live in the engine, at the boundary, in code the model cannot route around.

This is also why I will not tell you a gated agent is “secure.” A cleverly malformed command could slip past a danger-sign list. What the gate gives you is defense in depth at the trust boundary: a real check, in the right place, that stops the obvious disasters and puts a human on the rest. That is enormously better than the bare loop, and it is genuinely not the same thing as secure. Anyone who tells you their framework is “secure” is selling the exact false confidence this layer exists to take away.

Shrink the blast radius too

Gating the action is half the job. The other half is making sure an action that gets through cannot reach far. Refuse any path outside the working directory, in code:

import os
RIG_ROOT = os.path.realpath(os.getcwd())

def within_root(path):
    real = os.path.realpath(path)
    if real != RIG_ROOT and not real.startswith(RIG_ROOT + os.sep):
        raise PermissionError(f"path escapes root: {path!r}")
    return real

The realpath call collapses the ../../ before the comparison, so a ../../etc/passwd write dies before the gate even weighs in. Assume the gate will eventually be wrong, and arrange the world so that when it is, the damage stays small.

What should you actually do?

If your agent has a shell tool and no gate → stop pointing it at anything you care about until you add one. An unguarded run is a loaded gun that aims well.
If your “guardrail” is a line in the system prompt → move it into code at the tool-call boundary. A prompt cannot enforce anything the model decides to ignore.
If the prompts get annoying and you want to auto-approve everything → don’t. The pull to switch the gate off is strongest exactly when a confident-but-wrong action is most likely to slide past you unread.
If you want real safety → pair the gate with least privilege. Shorten the manifest, scope the working directory, hand it a low-privilege token. An agent that physically cannot touch your home directory needs no gate to keep it from deleting your home directory.

The bottom line

Put guardrails on the gate, not in the prompt. The attack surface is what the tools can touch, not the words going in.
Default to deny, fail to deny, and feed denials back so the agent adapts instead of hammering the wall.
No gate makes an agent “secure.” A gate plus a small blast radius makes it safe enough to point at a repo you actually care about.

AI Agent Guardrails: Gate Every Tool Call

Agentic Coding: Build the Harness

Where does suggest become act?

The attack surface is the manifest, not the prompt

How to build the gate

What broke: the model can’t enforce its own rules

Shrink the blast radius too

What should you actually do?

The bottom line

Frequently Asked Questions

AI Agent Guardrails: Gate Every Tool Call

Agentic Coding: Build the Harness

Where does suggest become act?

The attack surface is the manifest, not the prompt

How to build the gate

What broke: the model can’t enforce its own rules

Shrink the blast radius too

What should you actually do?

The bottom line

Frequently Asked Questions

More from this book

Build an AI Coding Agent From Scratch in Python

Context Engineering for AI Agents

What Is an Agent Harness? The 7 Layers

Why AI Agents Report False Success