Free playbooks in your inbox
Hands-on tutorials for people who want to build with AI.
tutorial · The LLM Council

AI Code Review with a Multi-Model Council

Make Codex, Claude, and Gemini argue over your diff and block on any single real bug. Build a multi-model AI code review council that catches what one misses.

From the youcanbuildthings catalog ▸ Build-tested 8 min read

Summary:

  1. Build a /review-council that runs your diff past Codex, Claude, and Gemini independently.
  2. The consensus gate blocks on any single BLOCKER, because a bug seen by one model is still a bug.
  3. Three different blind spots stacked together catch what one rubber-stamping reviewer waves through.
  4. Bonus: the real codex review flags, the full skill, and a worked money-bug the council catches.

Most AI code review is one model glancing at your diff and telling you it looks good. That is the yes-man problem wearing a hoodie. A single model reviewing your pull request finds reasons it is fine, flags a cosmetic nit to look thorough, and waves through the subtle correctness bug it never thought to hunt for. This builds the fix: three different models arguing over your diff before it merges.

How do you run AI code review across multiple models?

You shell out to three different coding CLIs, hand each the same diff, and block the merge if any one of them finds a real bug. The mechanism is a clean three-CLI symmetry: the major coding tools each expose a headless mode that takes a prompt and prints an answer.

claude -p "<prompt>"
codex exec "<prompt>"
gemini -p "<prompt>"

Three different companies, three different models, three commands you can call from one skill. That symmetry is the whole reason a vendor-spanning panel is buildable: each one is a command, so you can fan one diff out to all three and collect three independent reads.

Why does one reviewer rubber-stamp?

One reviewer rubber-stamps because you implicitly asked it to. When you tell a model “review this code,” a model trained to be helpful reads the subtext the same way it reads “is my business idea good.” It finds reasons to approve. It will flag an unused variable to look diligent, but it rarely goes hunting for the correctness bug, because hunting for the thing that proves you wrong is not the helpful-assistant default.

Then there is the knowledge gap. One model has one model’s blind spots about your language and your edge cases. If its training underweighted a failure mode, it will not see that failure mode in your code no matter how hard it looks. The fix is not one reviewer being more careful. It is several reviewers with different blind spots, so the bug one misses, another catches.

Build the /review-council skill

One of your three reviewers is not a model you prompted, it is a purpose-built tool. Codex ships an actual review subcommand. Here are its real flags, verbatim from the openai/codex source:

codex review [OPTIONS] [PROMPT]

--uncommitted          Review staged, unstaged, and untracked changes.
--base <BRANCH>        Review changes against the given base branch.
--commit <SHA>         Review the changes introduced by a commit.
--title <TITLE>        Optional commit title to display in the review summary.
[PROMPT]               Custom review instructions.

--base is the one you will use most, because it reviews your feature branch against the base, which is exactly what a pull request is. (Note: --uncommitted, --base, --commit, and a bare prompt are mutually exclusive, and --title requires --commit.) Now build the skill at .claude/skills/review-council/SKILL.md:

---
description: Convene a three-model code-review council on the current branch.
  Codex, Claude, and Gemini review the diff independently; any single BLOCKER blocks the merge.
---

# Review Council

Review the diff three ways with the Bash tool. (Swap main for your PR base.)

REVIEWER 1 (Codex, purpose-built):
  codex review --base main --title "review-council run"

REVIEWER 2 (Claude, correctness focus):
  Capture the diff: git diff main...HEAD
  claude -p "Review this diff ONLY for correctness bugs: logic errors, error
  paths, race conditions, security. For each: file, line, what breaks, and
  BLOCKER or MINOR. Ignore style. DIFF: paste git diff"

REVIEWER 3 (Gemini, second family):
  gemini -p "Find correctness and security bugs in this diff. List each as
  BLOCKER or MINOR with file, line, and what breaks. DIFF: paste git diff"

CONSENSUS GATE: any BLOCKER from ANY reviewer => verdict BLOCK, and that issue
must be resolved before merge. Single-reviewer findings are still surfaced. No
blocker => SHIP. Print the verdict, then the three reports.

Read the gate carefully, because it encodes a deliberate asymmetry: it is easier to block than to ship. Any single reviewer flagging a real bug blocks the merge, even if the other two missed it. That is the opposite of voting, and it is correct for code, because three models can talk themselves into the same wrong answer, but a bug only has to be caught once.

A code diff on the left labeled payments/service.py, captioned that the refactor removes await so the function returns success before the charge completes. It passes through three reviewer agents: Codex (missing await, reports success on a failed charge, BLOCKER), Claude (same, unhandled rejection in a payments path, BLOCKER), Gemini (failure reason no longer surfaced, MINOR). Their findings converge into a consensus gate showing a red BLOCK state, with the rule any single BLOCKER forces BLOCK, not a vote. Command label: /review-council.

Watch it catch a real money bug

Here is a diff that looks like a clean refactor and is not. The change swaps a thrown error for a returned result object:

# payments/service.py  (refactor that looks clean and isn't)
async def charge(customer_id, amount):
    customer = await get_customer(customer_id)
    total = calculate_total(amount)
    try:
        payments.charge(amount=total, currency="usd")   # the await was dropped here
        return {"ok": True}
    except Exception as e:
        return {"ok": False, "error": str(e)}

The await got dropped. The function now returns {"ok": True} before the charge actually completes, and if the charge fails, the rejection is lost. You would report success on a payment that never went through. Here is what the council does with it:

ReviewerFindingSeverity
Codexmissing await: reports success on a failed chargeBLOCKER
Claudesame, unhandled rejection in a payments pathBLOCKER
Geminifailure reason no longer surfaced to the callerMINOR

Any single BLOCKER trips the gate, so the verdict is BLOCK. Two reviewers caught the money bug; the third caught a smaller real issue the others missed. That is the council working as designed: three blind spots, and the union of what they see is far larger than what any one saw alone. A single reviewer gave you a thumbs up. The council gave you a blocked merge and a punch list.

Notice why the Gemini finding matters even at MINOR severity. The refactor no longer surfaces why a charge failed at the payment layer, only at the balance check, so a downstream caller cannot tell a declined card from a system error. Lower stakes than the lost await, but real, and now it is on your list instead of in production. One reviewer would have given you none of this.

One discipline: verify the finding, do not just trust the BLOCK. Go look at the line, confirm the await is genuinely gone, confirm it matters. Occasionally a reviewer misreads control flow and flags a bug that is not one, so you spend thirty seconds confirming instead of merging a phantom fix. The council narrows where the bug is; you still confirm it is a bug. That thirty-second check is the cheapest habit in the whole workflow.

When to convene, and when to skip it

Not every diff deserves three models. Skip the council for the small and the obvious: a one-line config change, a typo fix, a dependency bump. Running three models on a README edit is the code version of convening a board to pick lunch. A single codex review, or no AI review at all, is fine there.

Convene the full council for the diffs where a hidden bug is expensive: anything touching money, auth, data deletion, concurrency, or a public API other people depend on. Those are the payments-charge situations, where the diff looks innocent and the cost of shipping the bug is real. A good instinct: if you would want a careful human teammate to review it before merge, convene the council.

What should you actually do?

  • If you only have two CLIs installed → run a two-model panel. Any two independent vendors still beats one rubber-stamp.
  • If a reviewer flags a bug → confirm it on the line before you fix it. Treat BLOCK as a high-quality “look here,” not a verdict.
  • If your gate cries wolf on style → tighten it to block on correctness only, or teammates learn to click past it, and an ignored gate is worse than none.
  • If the diff is trivial → skip the council. Reserve it for the changes that would scare a senior reviewer.

The bottom line

  • Block on any single BLOCKER, never on a majority vote. A bug caught once is still a bug, and false confidence is the expensive failure.
  • One of your reviewers should be codex review, a real review tool, not a third roleplay prompt.
  • Three stacked blind spots are the product. The bug reaches production only if it slips past all of them, which is a far higher bar than one model smiling at your diff.
Why trust this? Every youcanbuildthings guide is pulled from a build-tested book: code that ran in production before it was written down.

Frequently Asked Questions

Can AI do code review with multiple models?+

Yes. You run your diff past Codex, Claude, and Gemini independently and combine their findings. Each vendor's model has different blind spots, so a bug one misses, another catches, and the union covers far more than any single reviewer.

What is the codex review command?+

codex review is a non-interactive code-review subcommand in OpenAI's Codex CLI. It reviews uncommitted changes, a branch against a base, or a single commit, and reports problems, so one of your three reviewers is a purpose-built tool rather than a prompt.

Why block on a single reviewer instead of a majority vote?+

Because a bug seen by one model and missed by two is still a bug. Shipping it costs far more than one extra review cycle, so the gate blocks on any single credible BLOCKER instead of voting, which is the right asymmetry for code.