Model Routing: How to Use Frontier Models at 24% of Their Cost

The idea is simple. The execution is not.

Model routing is the practice of sending different queries to different models based on predicted difficulty. Easy queries go to cheap models (GPT-4o mini, Claude Haiku, Gemini Flash). Hard queries go to frontier models (GPT-4o, Claude Sonnet, Gemini Pro).

The premise: not every query needs the most capable model. A question like "Summarize this paragraph in one sentence" does not require the same model as "Identify the logical flaw in this statistical argument."

If you can route accurately, you pay frontier prices only for queries that need frontier models. Everything else runs cheap.

arXiv:2603.04445 ("MixLLM: Dynamic LLM Selection via Multi-Armed Bandit") is the most rigorous recent benchmark of this approach. The result: 97.25% of GPT-4o quality at 24.18% of GPT-4o cost.

How MixLLM routing works

MixLLM treats model selection as a multi-armed bandit problem. Each query is a trial. Each model is an arm. The bandit learns which model produces acceptable quality on which query types by tracking reward signals (quality scores from human feedback or automated evaluation) and by balancing exploration of under-tested models against exploitation of the ones that have worked so far.

The key insight from the paper: routing decisions should update online. A static rule that says "questions with fewer than 50 words go to the cheap model" degrades as the query distribution shifts. A bandit-based router improves over time because it incorporates feedback from actual query-answer pairs.
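To make the online-update idea concrete, here is a minimal epsilon-greedy sketch. It is not the algorithm from the paper: the class name, the per-query-type bookkeeping, and the reward scale (a single number per answer, higher is better) are all assumptions made for illustration.

```python
import random
from collections import defaultdict


class EpsilonGreedyRouter:
    """Toy bandit router: one arm per model, tracked separately per query type."""

    def __init__(self, models, epsilon=0.1, seed=None):
        self.models = list(models)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        # (query_type, model) -> [number of pulls, running mean reward]
        self.stats = defaultdict(lambda: [0, 0.0])

    def choose(self, query_type):
        # Try every arm once before trusting the estimates.
        for model in self.models:
            if self.stats[(query_type, model)][0] == 0:
                return model
        # Explore with probability epsilon, otherwise exploit the best arm.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.models)
        return max(self.models, key=lambda m: self.stats[(query_type, m)][1])

    def update(self, query_type, model, reward):
        # Incremental mean: new_mean = old_mean + (reward - old_mean) / n
        entry = self.stats[(query_type, model)]
        entry[0] += 1
        entry[1] += (reward - entry[1]) / entry[0]
```

In practice the reward would probably combine quality and cost (for example, quality score minus a cost penalty), so that the cheap model wins whenever its answers are good enough. How to weight the two is an application decision.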

The 97.25% quality figure uses GPT-4o as the quality ceiling. Queries where the cheap model (GPT-4o mini in the benchmark) produces answers within a defined quality margin of GPT-4o are counted as successful routing decisions. The 2.75% quality gap is the acceptable degradation from routing, calibrated by the application builder.
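The margin-based counting described above can be sketched as follows. The score scale, the margin value, and the function name are assumptions; the paper's exact metric is not reproduced here.

```python
def within_margin_rate(scored_pairs, margin):
    """Share of queries where the cheap score is within `margin` of the frontier score.

    `scored_pairs` is a list of (cheap_score, frontier_score) tuples on any
    common scale. A cheap answer that beats the frontier answer also counts.
    """
    if not scored_pairs:
        raise ValueError("need at least one scored query")
    ok = sum(1 for cheap, frontier in scored_pairs if frontier - cheap <= margin)
    return ok / len(scored_pairs)


# Made-up scores for four queries, purely to show the calculation.
pairs = [(0.90, 0.92), (0.60, 0.95), (0.88, 0.85), (0.70, 0.80)]
rate = within_margin_rate(pairs, margin=0.05)
```

Here two of the four cheap answers fall within the margin, so `rate` is 0.5. Tightening or loosening `margin` is how an application builder would trade quality against cost.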

The three routing architectures

arXiv:2603.04445 and related work describe three distinct approaches:

Cascade routing: Run the cheap model first. If its answer confidence is at or above a threshold, return that answer. Otherwise, run the expensive model and return its answer instead.

Latency: potentially 2x (two model calls on hard queries)

Quality: good, because the expensive model sees the actual query

Best for: applications where latency is acceptable and the query mix is unpredictable
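A cascade can be sketched in a few lines. The version below takes the two models as plain callables so it runs without any API; how the confidence value is obtained (token log-probabilities, a self-rating prompt, a separate verifier) is left open, and the 0.8 threshold is an arbitrary placeholder.

```python
from typing import Callable, Tuple

# A model is any callable that returns (answer, confidence in [0, 1]).
Model = Callable[[str], Tuple[str, float]]


def cascade(query: str, cheap: Model, expensive: Model,
            threshold: float = 0.8) -> Tuple[str, str]:
    """Return (answer, tier), where tier records which model produced the answer."""
    answer, confidence = cheap(query)
    if confidence >= threshold:
        return answer, "cheap"
    # Low confidence: pay for a second call and return the expensive answer.
    answer, _ = expensive(query)
    return answer, "expensive"
```

Logging the returned tier gives the escalation rate, which is the number that determines both the extra latency and the realized savings.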

Selector routing: Classify the query first (using a small, fast classifier model). Based on classification, route to cheap or expensive. Never call both.

Latency: classifier latency + one model call (low)

Quality: depends on classifier accuracy

Best for: latency-sensitive applications with predictable query categories

Bandit routing (MixLLM): Track per-model, per-query-type quality online. Exploit successful routing patterns; explore occasionally to catch distribution shifts.

Latency: one model call (no cascade)

Quality: improves over time

Best for: high-volume applications with sufficient feedback signal

What the other papers found

Two additional results from the routing literature:

R2-Reasoner (RL-based routing): Reported 84.46% API cost savings versus always using the frontier model, using reinforcement learning to train the routing policy. The savings figure is larger than MixLLM's, but the quality threshold is different: R2-Reasoner accepted larger quality gaps on some query types.

GreenServ (energy-aware routing): Reported 31% API cost reduction by routing queries based on both model capability and provider energy pricing. The routing signal included grid carbon intensity data alongside query difficulty. The 31% figure is conservative because GreenServ optimized for energy as a co-objective with cost, not cost alone.

None of these papers used the same quality threshold or query distribution, which makes direct comparison misleading. The MixLLM 97.25%/24.18% figure is cited here because it uses the most clearly documented quality threshold (GPT-4o as ceiling) and the most transparent experimental setup.

Where routing fails

Routing degrades in three documented conditions:

Low query volume. Bandit-based routing needs enough queries to converge on reliable routing policies. Below ~1,000 queries, the bandit is still exploring and routing decisions are close to random. Cascade routing is safer at low volume because it does not depend on learned routing policies.

Distribution shift. If your query mix changes (new user segments, seasonal patterns, product changes), routing policies trained on the old distribution become stale. Online learning routers handle this better than static rules, but there is always a lag.
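One common way to shorten that lag is to weight recent feedback more heavily than old feedback. The sketch below is a generic exponentially discounted mean, not something taken from the papers above; the class name and the decay value are assumptions.

```python
class DiscountedMean:
    """Exponentially weighted mean: recent rewards count more than old ones."""

    def __init__(self, decay: float = 0.9):
        if not 0.0 < decay < 1.0:
            raise ValueError("decay must be between 0 and 1")
        self.decay = decay
        self.weighted_sum = 0.0
        self.weight = 0.0

    def update(self, reward: float) -> float:
        # Shrink the influence of everything seen so far, then add the new reward.
        self.weighted_sum = self.decay * self.weighted_sum + reward
        self.weight = self.decay * self.weight + 1.0
        return self.value

    @property
    def value(self) -> float:
        return self.weighted_sum / self.weight if self.weight else 0.0
```

A router that keeps one such estimate per model and query type forgets stale evidence at a rate set by `decay`. A lower decay adapts faster but makes the estimates noisier.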

Queries that require full frontier capability. Some query types have no cheap-model equivalent. Complex multi-step reasoning, subtle argument evaluation, nuanced creative tasks: the quality gap between GPT-4o and GPT-4o mini on these is large enough that routing them to the cheap model produces visible degradation. Know which query types these are for your application before deploying routing.
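A simple guard is to pin known-hard query types to the frontier model and let the router decide everything else. The category names below are hypothetical placeholders; the real list should come from your own evaluations.

```python
# Hypothetical labels; replace with the categories your own evals flag.
ALWAYS_FRONTIER = {"multi_step_reasoning", "argument_evaluation", "creative_writing"}


def pick_model(query_type: str, routed_model: str,
               frontier_model: str = "gpt-4o") -> str:
    """Use the router's choice unless the query type is pinned to the frontier model."""
    if query_type in ALWAYS_FRONTIER:
        return frontier_model
    return routed_model
```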

Building a router: the minimum viable implementation

```python
from openai import OpenAI

client = OpenAI()


def route_query(query: str, context: str) -> str:
    # Step 1: Estimate difficulty with a cheap classifier call
    classification = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                "Rate this query difficulty 1-5 "
                "(1=simple factual, 5=complex reasoning). "
                f"Query: {query}. Respond with only a number."
            ),
        }],
        max_tokens=1,
    )
    raw = (classification.choices[0].message.content or "").strip()
    try:
        difficulty = int(raw)
    except ValueError:
        # Unparseable rating: fail safe by treating the query as hard
        difficulty = 5

    # Step 2: Route based on difficulty
    model = "gpt-4o" if difficulty >= 4 else "gpt-4o-mini"

    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": context},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content
```

This is selector routing with a cheap model as the classifier. If 70% of queries route to GPT-4o mini and 30% to GPT-4o, the blended input price, using list prices of $0.15 and $2.50 per million input tokens and ignoring output tokens, is roughly:

0.70 × $0.15/M (mini) + 0.30 × $2.50/M (4o) = ~$0.86/M input tokens

Versus $2.50/M for always-on GPT-4o. The classifier call adds a small overhead on top: it reads each query once more on GPT-4o mini, which for a 200-token query costs a fraction of a cent. The savings are real even with a simplistic routing signal.
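The blended figure is a weighted average, which is easy to recompute for other traffic splits. The helper below is an illustrative sketch: the function name is made up, output tokens are ignored, and the optional `overhead` term stands in for the classifier cost expressed in the same unit as the prices.

```python
def blended_price(cheap_share, cheap_price, frontier_price, overhead=0.0):
    """Weighted-average input price for a two-model router.

    Prices and overhead must share one unit.
    """
    if not 0.0 <= cheap_share <= 1.0:
        raise ValueError("cheap_share must be between 0 and 1")
    return overhead + cheap_share * cheap_price + (1.0 - cheap_share) * frontier_price


price = blended_price(0.70, 0.15, 2.50)
savings = 1.0 - price / 2.50
```

With the 70/30 split this gives about 0.855 in the same unit as the inputs, roughly a 66% saving against always using the frontier model. Dropping the cheap share to 50% raises the blended price to about 1.33, so the savings depend heavily on how much traffic the router can safely send to the cheap model.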

Compounding with context compression

Model routing reduces cost by changing *which model* runs. Context compression reduces cost by changing *how many tokens* the model sees. Both apply to the expensive-model path.

When routing sends a hard query to the frontier model, that frontier model call still processes whatever context you pass it. Compressing the context before the frontier model call reduces the token cost of the expensive path without touching the routing logic.

gotcontext compresses via `ingest_context` before your model call, regardless of which model the router selected. The two optimizations are fully composable.


Cite this

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX

@misc{model-routing-the-architecture-that-makes-frontier-models-affordable-2026,
  title  = {Model Routing: How to Use Frontier Models at 24\% of Their Cost},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://www.gotcontext.ai/blog/model-routing-the-architecture-that-makes-frontier-models-affordable},
  note   = {gotcontext.ai engineering blog.},
}

APA

Hollingsworth, J. (2026, May 8). Model Routing: How to Use Frontier Models at 24% of Their Cost. gotcontext.ai. https://www.gotcontext.ai/blog/model-routing-the-architecture-that-makes-frontier-models-affordable
