AI & LLMs · June 11, 2026 · 7 min read

Routing Between Claude, GPT, and Gemini: A Production Playbook

Picking one frontier model and praying is not a strategy. Here's how we route requests across Claude, GPT, and Gemini in production — by task shape, cost, and failure mode.

Picking a single frontier model and betting your product on it used to be defensible. In 2026 it isn't. Claude, GPT, and Gemini each win different task classes, and the price/quality curves shift every few months. If you're still hardcoding one provider, you're either overpaying or under-shipping — usually both.

This is the routing playbook we use when we build LLM features for clients. It covers how to classify traffic, when each model actually pulls ahead, how to wire fallbacks without melting your latency budget, and how to keep the whole thing under cost control.

Why single-vendor lock-in stopped making sense

For a while, the answer was "just use GPT-4" or "just use Claude 3.5 Sonnet." That worked because the quality gaps between providers were huge and the alternatives' APIs were too unstable to build on. Today the three major providers, Anthropic, OpenAI, and Google, all ship comparable frontier tiers, comparable smaller tiers, and overlapping feature sets (tool use, structured outputs, prompt caching, long context).

What differs is the shape of strengths:

Claude (Anthropic) tends to be our default for long-form reasoning, code review, and anything where instruction adherence on nuanced policy matters. Prompt caching with up to 1-hour TTL is documented in Anthropic's docs and changes the economics of agent loops.
GPT (OpenAI) is still our pick for strict structured outputs (the response_format JSON schema mode is genuinely reliable), broad tool-calling ecosystems, and the Realtime API for voice.
Gemini (Google) wins on raw context window (multi-million token tiers per Google's docs), native multimodal ingestion of PDFs and video, and pricing on the Flash tier for high-volume classification.

None of this is a benchmark claim — it's how the tradeoffs shake out in projects we've shipped. Your mileage will vary by domain, and that's exactly why you need a router instead of a religion.

Classify traffic before you classify models

The mistake teams make is starting with "which model is best?" The right starting question is "what kinds of requests do we actually serve?"

In a typical SaaS product with AI features, we usually find four or five buckets:

Cheap classification / extraction — tagging, routing, intent detection. Latency-sensitive, accuracy tolerance is moderate.
Structured generation — filling a JSON schema from messy input. Accuracy on the schema matters more than prose quality.
Long-context synthesis — summarizing a 200-page document, reviewing a codebase. Quality and context window dominate.
Agentic loops — multi-step tool use, planning. Instruction adherence and tool-calling reliability matter most.
User-facing chat — streaming, conversational. Latency and tone matter as much as raw IQ.

Once you have buckets, you can map each to a primary model and a fallback. Don't skip this. A routing layer without task taxonomy is just a more expensive load balancer.

A simple router interface

Keep the routing decision out of your business logic. We typically wrap providers behind a single internal complete() function that takes a task type and a payload:

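// Placeholder shapes; substitute your own message, schema, and completion types.
type Message = { role: 'system' | 'user' | 'assistant'; content: string };
type JSONSchema = Record<string, unknown>;
type Completion = { text: string; model: string };

type TaskType =
  | 'classify'
  | 'extract_structured'
  | 'long_synthesis'
  | 'agent_step'
  | 'user_chat';

interface CompleteRequest {
  task: TaskType;
  messages: Message[];
  schema?: JSONSchema;
  maxTokens?: number;
  tenantId: string;
}

// One adapter per provider, and one primary/fallback pair per task bucket.
interface Provider {
  invoke(req: CompleteRequest): Promise<Completion>;
}
interface Route {
  primary: Provider;
  fallback: Provider;
}

// Defined elsewhere: the config-table lookup and the transport-error check.
declare function pickRoute(task: TaskType, tenantId: string): Route;
declare function isRetryable(err: unknown): boolean;

async function complete(req: CompleteRequest): Promise<Completion> {
  const route = pickRoute(req.task, req.tenantId);
  try {
    // `return await` keeps a rejected promise inside this try block.
    return await route.primary.invoke(req);
  } catch (err) {
    // Fall back only on transport-level failures (e.g. timeouts, 429s, 5xx).
    if (isRetryable(err)) {
      return await route.fallback.invoke(req);
    }
    throw err;
  }
}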

The pickRoute function reads from a config table, not from code. That way you can shift traffic without a deploy when a provider has an outage — and they all have outages.
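One way to sketch that config-table lookup, written in Python for brevity. Everything here is an assumption for illustration: the `Route` shape, the table contents, the tenant ID, and the model labels (which are placeholders, not real model identifiers). In a real system the two dictionaries would be rows loaded from a database or config service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    primary: str
    fallback: str


# Hypothetical default routing table, one row per task bucket.
DEFAULT_ROUTES = {
    "classify": Route("gemini-flash", "gpt-mini"),
    "extract_structured": Route("gpt", "claude"),
    "long_synthesis": Route("gemini-pro", "claude"),
    "agent_step": Route("claude-sonnet", "gpt"),
    "user_chat": Route("claude", "gpt"),
}

# Hypothetical per-tenant overrides, keyed by (tenant_id, task).
TENANT_OVERRIDES = {
    ("tenant-eu-1", "user_chat"): Route("gpt", "claude"),
}


def pick_route(task: str, tenant_id: str) -> Route:
    """Return the tenant override if one exists, else the default route."""
    override = TENANT_OVERRIDES.get((tenant_id, task))
    if override is not None:
        return override
    try:
        return DEFAULT_ROUTES[task]
    except KeyError:
        raise ValueError(f"no route configured for task {task!r}") from None
```

Because the lookup is pure data, flipping a provider during an outage is a row update rather than a deploy.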

Mapping tasks to models

Here's a starting matrix we've used as a baseline on greenfield projects. Treat it as a hypothesis to validate with evals, not gospel.

Task bucket	Primary	Fallback	Why
Classify / extract (high volume)	Gemini Flash	GPT-4.1 mini	Flash pricing on bulk classification is hard to beat
Structured generation	GPT (JSON schema mode)	Claude with tool-use coercion	OpenAI's schema enforcement is the most reliable
Long-context synthesis	Gemini Pro or Claude	The other one	Both handle 200k+ tokens well; pick on quality per domain
Agent step (tool calling)	Claude Sonnet tier	GPT	Claude's tool-use adherence in long loops has been steadier for us
User-facing chat	Claude or GPT	The other one	Pick on tone/latency for your audience

Note what's missing: no "best overall model" column. That's the point.

Fallbacks that don't make latency worse

A naive fallback — try primary, on error try secondary — doubles your worst-case latency. Worse, it can stack timeouts. A few patterns we use:

Tight primary timeout, generous fallback timeout. If your SLA is 8s, set the primary timeout to ~3.5s and the fallback to ~4s. You'd rather burn a fallback call than blow the SLA.
Hedged requests for critical paths. For low-volume, high-stakes calls (a checkout assistant, say), send the request to the primary, and if it hasn't answered after a short delay, fire the fallback in parallel and take whichever returns first. This costs more but caps p99.
Circuit breakers per provider. If error rate from a provider spikes above a threshold over a 60-second window, flip the primary to the fallback automatically. Anthropic, OpenAI, and Google all have multi-hour incidents a few times a year. Don't be the team that finds out via support tickets.
Don't fall back on content errors. If the primary returned a 200 with bad JSON, that's an eval problem, not a routing problem. Falling back here hides bugs.
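The first three patterns can be sketched in a few dozen lines of Python with asyncio. This is a minimal illustration, not our production code: `primary` and `fallback` are assumed to be async callables that take a request and return a completion, and the function names, default delays, and breaker thresholds are all hypothetical values you would tune.

```python
import asyncio
import time
from collections import deque


async def with_fallback(primary, fallback, request,
                        primary_timeout_s=3.5, fallback_timeout_s=4.0):
    """Tight primary deadline, then the fallback. Worst case here is 7.5s."""
    try:
        return await asyncio.wait_for(primary(request), timeout=primary_timeout_s)
    except (asyncio.TimeoutError, ConnectionError):
        return await asyncio.wait_for(fallback(request), timeout=fallback_timeout_s)


async def hedged(primary, fallback, request, hedge_delay_s=0.5, deadline_s=8.0):
    """Start the primary; if it is slow or fails, run the fallback alongside
    it and return whichever call succeeds first."""
    loop = asyncio.get_running_loop()
    deadline = loop.time() + deadline_s
    pending = {asyncio.create_task(primary(request))}
    hedge_started = False
    last_error = None
    try:
        while pending:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            wait_s = remaining if hedge_started else min(hedge_delay_s, remaining)
            done, pending = await asyncio.wait(
                pending, timeout=wait_s, return_when=asyncio.FIRST_COMPLETED)
            for task in done:
                if task.exception() is None:
                    return task.result()
                last_error = task.exception()
            if not hedge_started:
                # Primary is slow or has failed: start the fallback as well.
                pending.add(asyncio.create_task(fallback(request)))
                hedge_started = True
        raise last_error or asyncio.TimeoutError("hedged call exceeded deadline")
    finally:
        for task in pending:
            task.cancel()  # do not leave the losing call running


class CircuitBreaker:
    """Opens when the error rate over a sliding window crosses a threshold."""

    def __init__(self, window_s=60.0, threshold=0.5, min_calls=20,
                 clock=time.monotonic):
        self.window_s = window_s
        self.threshold = threshold
        self.min_calls = min_calls
        self.clock = clock
        self.calls = deque()  # (timestamp, succeeded) pairs, oldest first

    def record(self, succeeded: bool) -> None:
        self.calls.append((self.clock(), succeeded))

    def is_open(self) -> bool:
        cutoff = self.clock() - self.window_s
        while self.calls and self.calls[0][0] < cutoff:
            self.calls.popleft()
        if len(self.calls) < self.min_calls:
            return False  # too little traffic in the window to judge
        failures = sum(1 for _, ok in self.calls if not ok)
        return failures / len(self.calls) > self.threshold
```

In a router, you would keep one breaker per provider, call `record()` after every request, and swap primary and fallback while `is_open()` is true. Note that the sketch only catches timeouts and connection errors; a 200 with bad JSON never reaches the fallback path.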

Cost guardrails that actually fire

Routing only saves money if you measure it. Three things we instrument on day one:

Per-tenant token spend with a daily and monthly cap. The soft cap warns; the hard cap degrades to a smaller model. Surfacing spend to customers is also a reasonable B2B feature in its own right.
Per-task cost-per-success. Tokens-per-call is a vanity metric. Cost-per-successful-completion (where "successful" comes from your eval suite) tells you if a cheaper model is actually cheaper after retries.
Cache hit ratio per route. Prompt caching only pays off above a certain hit rate — usually 30–40% in our experience, depending on the cache discount the provider offers. Below that, the bookkeeping isn't worth it.

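# Sketch of a per-tenant guardrail check. `redis` is a connected redis-py client;
# RouteDecision, get_tenant_cap, and emit_warning are your own helpers.
def check_budget(tenant_id: str, estimated_cost_usd: float) -> RouteDecision:
    # Redis returns bytes or str (None if the key is missing), so coerce to float.
    spent = float(redis.get(f"spend:{tenant_id}:today") or 0)
    cap = get_tenant_cap(tenant_id)
    projected = spent + estimated_cost_usd

    if projected > cap.hard:
        # Hard cap: degrade to the smaller model instead of rejecting the call.
        return RouteDecision(model="flash_tier", reason="hard_cap")
    if projected > cap.soft:
        # Soft cap: warn, but keep serving from the primary route.
        emit_warning(tenant_id)
    return RouteDecision(model="primary", reason="ok")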
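The cache hit-rate threshold is easy to sanity-check with back-of-envelope arithmetic. The sketch below assumes a simple pricing model in which a cache miss is billed at a write premium and a cache hit at a read discount, both expressed as multiples of the normal input price. The function names and the example multipliers are illustrative assumptions; plug in your provider's current pricing.

```python
def cached_cost_ratio(hit_rate: float, write_mult: float, read_mult: float) -> float:
    """Cost of the cacheable prefix relative to no caching (1.0 = break-even).

    Misses pay the cache-write multiplier, hits pay the cache-read multiplier.
    """
    return (1.0 - hit_rate) * write_mult + hit_rate * read_mult


def break_even_hit_rate(write_mult: float, read_mult: float) -> float:
    """Hit rate at which caching costs the same as not caching.

    Solves (1 - h) * write_mult + h * read_mult = 1 for h.
    """
    if write_mult <= 1.0:
        return 0.0  # writes carry no premium, so caching never costs extra
    if read_mult >= write_mult:
        raise ValueError("reads must be cheaper than writes for caching to pay off")
    return (write_mult - 1.0) / (write_mult - read_mult)


# Example with assumed multipliers: writes at 1.25x, reads at 0.1x.
threshold = break_even_hit_rate(write_mult=1.25, read_mult=0.1)
```

This covers token cost only. The engineering and bookkeeping overhead mentioned above sits on top of it, which is why a practical threshold lands higher than the raw break-even.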

Evals are the routing layer's source of truth

You cannot route on vibes. Every task bucket needs an offline eval set — 50 to 500 examples is usually enough to start — that you re-run whenever you consider switching a route.

What we look at when comparing models for a route:

Pass rate on the eval set (exact match, LLM-as-judge, or schema validity depending on the task).
p50 and p95 latency from your actual region, not the provider's status page.
Cost per pass, not cost per call.
Failure mode distribution. A model that's 2% worse on average but never produces catastrophic outputs may still be the right call for user-facing flows.
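A minimal sketch of how those numbers might be computed from one eval run, using only the Python standard library. The `EvalResult` record and `summarize` function are hypothetical names, and the latency and cost fields are assumed to be captured by your own harness.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class EvalResult:
    passed: bool       # verdict from exact match, a judge model, or schema validation
    latency_s: float   # measured from your own region
    cost_usd: float    # billed cost of this call, including any retries


def summarize(results: list) -> dict:
    """Pass rate, p50/p95 latency, and cost per pass for one candidate model."""
    if len(results) < 2:
        raise ValueError("need at least two results to compute percentiles")
    passes = sum(1 for r in results if r.passed)
    total_cost = sum(r.cost_usd for r in results)
    # n=100 yields 99 cut points; index 49 is p50 and index 94 is p95.
    cuts = quantiles([r.latency_s for r in results], n=100, method="inclusive")
    return {
        "pass_rate": passes / len(results),
        "p50_s": cuts[49],
        "p95_s": cuts[94],
        "cost_per_pass_usd": total_cost / passes if passes else float("inf"),
    }
```

Cost per pass is where cheap models get caught out: a model at $0.001 per call that passes half the time costs $0.002 per pass, which is more than a model at $0.0015 per call that passes every time.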

When a new model version ships — and they ship constantly — you don't migrate. You add it as a candidate, re-run evals, and shift traffic gradually with a feature flag. We've written more on this approach in our AI services notes.
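Gradual traffic shifting doesn't need a feature-flag vendor to get started. Here is a minimal sketch of a deterministic percentage rollout, assuming you key on tenant ID. The function names and flag naming scheme are hypothetical.

```python
import hashlib


def in_rollout(tenant_id: str, flag: str, percent: float) -> bool:
    """Deterministically place a tenant in or out of a percentage rollout.

    Hashing flag + tenant gives each tenant a stable bucket in 0..9999, so
    the same tenant always gets the same answer, and raising `percent` only
    ever adds tenants to the rollout.
    """
    digest = hashlib.sha256(f"{flag}:{tenant_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100


def choose_model(tenant_id: str, current: str, candidate: str, percent: float) -> str:
    """Send `percent` of tenants to the candidate model, the rest to current."""
    if in_rollout(tenant_id, f"route:{candidate}", percent):
        return candidate
    return current
```

Keying on the tenant rather than the request keeps each customer's experience consistent while you compare cost per pass between the two cohorts.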

The operational stuff nobody warns you about

A few things that bit us and will probably bite you:

Tokenizers differ. A prompt that fits in Claude's window may not fit in GPT's, and your token counts for billing reconciliation will not match across providers. Track each separately.
Tool-calling schemas aren't portable. Translating between OpenAI's tools format, Anthropic's tool blocks, and Gemini's function declarations is mechanical but tedious. Build a single internal tool schema and adapt at the edge.
Streaming formats differ. SSE shapes vary. If you stream to the browser, normalize server-side before clients see it.
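Rate limits are per-org, not per-route. A noisy classification job can starve your user-facing chat. Separate API keys only help if the provider scopes limits to the key's project or workspace, so isolate each workload class in its own project or account where that's supported, or use a gateway that enforces fairness.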
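For the tool-schema point, a sketch of "one internal schema, adapt at the edge" might look like the following. `ToolSpec` and the adapter names are hypothetical. The output shapes reflect our understanding of each provider's public tool format at the time of writing (Chat Completions-style for OpenAI, snake_case keys as accepted by Google's Python SDK for Gemini), so verify them against current docs before relying on them.

```python
from dataclasses import dataclass, field


@dataclass
class ToolSpec:
    """Internal, provider-neutral description of one tool."""
    name: str
    description: str
    parameters: dict = field(default_factory=dict)  # a JSON Schema object


def to_openai(tool: ToolSpec) -> dict:
    # Assumed shape: Chat Completions "tools" entry.
    return {
        "type": "function",
        "function": {
            "name": tool.name,
            "description": tool.description,
            "parameters": tool.parameters,
        },
    }


def to_anthropic(tool: ToolSpec) -> dict:
    # Assumed shape: Messages API "tools" entry.
    return {
        "name": tool.name,
        "description": tool.description,
        "input_schema": tool.parameters,
    }


def to_gemini(tools: list) -> dict:
    # Assumed shape: one tool object wrapping all function declarations.
    return {
        "function_declarations": [
            {"name": t.name, "description": t.description, "parameters": t.parameters}
            for t in tools
        ]
    }


lookup_order = ToolSpec(
    name="lookup_order",
    description="Fetch an order by ID.",
    parameters={
        "type": "object",
        "properties": {"order_id": {"type": "string"}},
        "required": ["order_id"],
    },
)
```

The adapters are deliberately dumb. Provider quirks, such as which JSON Schema keywords each one accepts, belong in these functions and nowhere else.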

Where we'd start

If you're retrofitting this onto an existing product, don't try to route everything on day one. Pick the single highest-volume LLM call in your system, define its task bucket, write a 50-example eval set, and put it behind a router with one primary and one fallback. Measure cost-per-pass for a week. Then move to the next call.

The teams that win with LLMs in 2026 aren't the ones who picked the right vendor. They're the ones who built the muscle to switch vendors per task, per tenant, and per quarter — without their product noticing.
