Model Routing: How to Use Frontier Models at 24% of Their Cost
arXiv:2603.04445 benchmarked bandit-based LLM routing and found 97.25% GPT-4o quality at 24.18% of the cost. Here is how model routing works and what the research says about where it does and does not pay off.
The idea is simple. The execution is not.
Model routing is the practice of sending different queries to different models based on predicted difficulty. Easy queries go to cheap models (GPT-4o mini, Claude Haiku, Gemini Flash). Hard queries go to frontier models (GPT-4o, Claude Sonnet, Gemini Pro).
The premise: not every query needs the most capable model. A question like "Summarize this paragraph in one sentence" does not require the same model as "Identify the logical flaw in this statistical argument."
If you can route accurately, you pay frontier prices only for queries that need frontier models. Everything else runs cheap.
arXiv:2603.04445 ("MixLLM: Dynamic LLM Selection via Multi-Armed Bandit") is the most rigorous recent benchmark of this approach. The result: 97.25% of GPT-4o quality at 24.18% of GPT-4o cost.
How MixLLM routing works
MixLLM treats model selection as a multi-armed bandit problem. Each query is a trial. Each model is an arm. The bandit learns which model produces acceptable quality on which query types by tracking reward signals (quality scores from human feedback or automated evaluation) while balancing exploration against exploitation.
The key insight from the paper: routing decisions should update online. A static rule that says "questions with fewer than 50 words go to the cheap model" degrades as the query distribution shifts. A bandit-based router improves over time because it incorporates feedback from actual query-answer pairs.
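To make the online-update idea concrete, here is a minimal epsilon-greedy sketch. It is not the algorithm from the paper: the class name `EpsilonGreedyRouter`, the `query_type` labels, and the reward scale (a quality score between 0 and 1) are all assumptions chosen for illustration. It only shows the general shape of a router that keeps a running reward estimate per query type and model, exploits the best estimate, and explores occasionally.

```python
import random
from collections import defaultdict


class EpsilonGreedyRouter:
    """Toy bandit router: one running mean reward per (query_type, model)."""

    def __init__(self, models, epsilon=0.1, seed=0):
        self.models = list(models)
        self.epsilon = epsilon             # probability of exploring
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)     # (query_type, model) -> observations
        self.means = defaultdict(float)    # (query_type, model) -> mean reward

    def choose(self, query_type):
        # Try every model once for this query type before trusting estimates.
        for model in self.models:
            if self.counts[(query_type, model)] == 0:
                return model
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.models)  # explore
        # Exploit: pick the model with the best estimated reward.
        return max(self.models, key=lambda m: self.means[(query_type, m)])

    def update(self, query_type, model, reward):
        key = (query_type, model)
        self.counts[key] += 1
        # Incremental mean: new = old + (reward - old) / n
        self.means[key] += (reward - self.means[key]) / self.counts[key]


router = EpsilonGreedyRouter(["gpt-4o-mini", "gpt-4o"], epsilon=0.1)
model = router.choose("summarize")
# ... call the chosen model, score the answer, then feed the score back:
router.update("summarize", model, reward=0.9)
```

In practice the reward would usually combine quality with a cost penalty, so that the cheap model wins whenever its quality is close enough. How to weight the two is an application decision that this sketch leaves open.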
The 97.25% quality figure uses GPT-4o as the quality ceiling. Queries where the cheap model (GPT-4o mini in the benchmark) produces answers within a defined quality margin of GPT-4o are counted as successful routing decisions. The 2.75% quality gap is the acceptable degradation from routing, calibrated by the application builder.
The three routing architectures
arXiv:2603.04445 and related work describe three distinct approaches:
Cascade routing: Run the cheap model first. If its answer confidence is below a threshold, run the expensive model as well. Return the cheap answer when it clears the confidence bar, otherwise the expensive model's answer.
Selector routing: Classify the query first (using a small, fast classifier model). Based on classification, route to cheap or expensive. Never call both.
Bandit routing (MixLLM): Track per-model, per-query-type quality online. Exploit successful routing patterns; explore occasionally to catch distribution shifts.
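Of the three, cascade routing is the easiest to sketch without any learned component. The example below is a minimal illustration under stated assumptions: `ModelFn` is a hypothetical interface in which each model returns an answer together with a confidence score in [0, 1]. Real APIs do not return such a score directly, so it would have to come from token log-probabilities, a verifier model, or a similar signal. On escalation the sketch simply returns the expensive model's answer.

```python
from typing import Callable, Tuple

# Hypothetical model interface: query -> (answer, confidence in [0, 1]).
ModelFn = Callable[[str], Tuple[str, float]]


def cascade_route(query: str, cheap: ModelFn, frontier: ModelFn,
                  threshold: float = 0.8) -> Tuple[str, str]:
    """Return (answer, model_used); call the frontier model only on low confidence."""
    answer, confidence = cheap(query)
    if confidence >= threshold:
        return answer, "cheap"
    # Escalate. The cheap call is already paid for, so hard queries cost both calls.
    answer, _ = frontier(query)
    return answer, "frontier"
```

The tradeoff against selector routing is visible in the escalation branch: a cascade never misroutes on a bad difficulty guess, but every hard query pays for the cheap call plus the expensive one.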
What the other papers found
Two additional results from the routing literature:
R2-Reasoner (RL-based routing): Reported 84.46% API cost savings versus always using the frontier model, using reinforcement learning to train the routing policy. The savings figure is larger than MixLLM's, but the quality threshold is different: R2-Reasoner accepted larger quality gaps on some query types.
GreenServ (energy-aware routing): Reported 31% API cost reduction by routing queries based on both model capability and provider energy pricing. The routing signal included grid carbon intensity data alongside query difficulty. The 31% figure is conservative because GreenServ optimized for energy as a co-objective with cost, not cost alone.
None of these papers used the same quality threshold or query distribution, which makes direct comparison misleading. The MixLLM 97.25%/24.18% figure is cited here because it uses the most clearly documented quality threshold (GPT-4o as ceiling) and the most transparent experimental setup.
Where routing fails
Routing degrades in three documented conditions:
Low query volume. Bandit-based routing needs enough queries to converge on reliable routing policies. Below ~1,000 queries, the bandit is still exploring and routing decisions are close to random. Cascade routing is safer at low volume because it does not depend on learned routing policies.
Distribution shift. If your query mix changes (new user segments, seasonal patterns, product changes), routing policies trained on the old distribution become stale. Online learning routers handle this better than static rules, but there is always a lag.
Queries that require full frontier capability. Some query types have no cheap-model equivalent. Complex multi-step reasoning, subtle argument evaluation, nuanced creative tasks: the quality gap between GPT-4o and GPT-4o mini on these is large enough that routing them to the cheap model produces visible degradation. Know which query types these are for your application before deploying routing.
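The lag under distribution shift can be illustrated with a small simulation. This is a sketch, not a method from any of the cited papers: the reward values, the step size `alpha`, and both function names are assumptions. It compares a plain running average, which weights all history equally, with an exponentially weighted average, which discounts old feedback and therefore tracks a shift faster.

```python
def sample_mean_update(estimate: float, n: int, reward: float) -> float:
    """Running average over all n rewards seen so far (n includes this one)."""
    return estimate + (reward - estimate) / n


def ewma_update(estimate: float, reward: float, alpha: float = 0.1) -> float:
    """Constant step size: an old reward's weight shrinks by (1 - alpha) per update."""
    return estimate + alpha * (reward - estimate)


# Simulated quality scores for the cheap model on one query type:
# good for 200 queries, then the query mix shifts and quality drops.
rewards = [0.9] * 200 + [0.4] * 50

mean_est, ewma_est = 0.0, 0.0
for n, r in enumerate(rewards, start=1):
    mean_est = sample_mean_update(mean_est, n, r)
    ewma_est = ewma_update(ewma_est, r)

print(f"sample mean: {mean_est:.2f}")  # 0.80, still dominated by old data
print(f"EWMA:        {ewma_est:.2f}")  # 0.40, has tracked the shift
```

A smaller `alpha` gives smoother estimates but a longer lag; a larger one reacts faster but is noisier. That is the same lag described above, expressed as a single parameter.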
Building a router: the minimum viable implementation
```python
from openai import OpenAI

client = OpenAI()


def route_query(query: str, context: str) -> str:
    # Step 1: Estimate difficulty with a cheap classifier call
    classification = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": f"Rate this query difficulty 1-5 (1=simple factual, 5=complex reasoning). Query: {query}. Respond with only a number."
        }],
        max_tokens=1
    )
    raw = classification.choices[0].message.content or ""
    try:
        difficulty = int(raw.strip())
    except ValueError:
        # Unparseable rating: fall back to the frontier model rather than risk quality
        difficulty = 5

    # Step 2: Route based on difficulty
    model = "gpt-4o" if difficulty >= 4 else "gpt-4o-mini"
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": context},
            {"role": "user", "content": query}
        ]
    )
    return response.choices[0].message.content
```
This is selector routing with a cheap model as the classifier. The classifier call adds a small fixed overhead per query (estimated here at ~$0.002 for a 200-token query). If 70% of queries route to GPT-4o mini and 30% to GPT-4o, the blended input price, before classifier overhead, is roughly:
0.70 × $0.15/M (mini) + 0.30 × $2.50/M (4o) ≈ $0.86 per 1M input tokens
Versus $2.50 per 1M input tokens for always-on GPT-4o. The savings are real even with a simplistic routing signal.
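The blended-price arithmetic above can be checked, and varied, with a few lines of Python. This is a sketch under stated assumptions: the function name `blended_price` is made up for illustration, the prices are the $0.15 and $2.50 figures used above read as USD per 1M input tokens, and output tokens and classifier overhead are left out.

```python
def blended_price(share_cheap: float, price_cheap: float, price_frontier: float) -> float:
    """Average input price when `share_cheap` of traffic goes to the cheap model.

    Both prices must use the same unit (here: USD per 1M input tokens).
    """
    if not 0.0 <= share_cheap <= 1.0:
        raise ValueError("share_cheap must be between 0 and 1")
    return share_cheap * price_cheap + (1.0 - share_cheap) * price_frontier


MINI, FRONTIER = 0.15, 2.50  # assumed prices, USD per 1M input tokens

for share in (0.5, 0.7, 0.9):
    price = blended_price(share, MINI, FRONTIER)
    saving = 1.0 - price / FRONTIER
    print(f"{share:.0%} cheap -> ${price:.3f} per 1M tokens ({saving:.0%} saved)")
```

At a 70% cheap share this reproduces the roughly $0.86 figure. The savings depend almost linearly on how much traffic the router can safely keep on the cheap model, so the accuracy of the routing signal matters more than the exact prices.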
Compounding with context compression
Model routing reduces cost by changing *which model* runs. Context compression reduces cost by changing *how many tokens* the model sees. Both apply to the expensive-model path.
When routing sends a hard query to the frontier model, that frontier model call still processes whatever context you pass it. Compressing the context before the frontier model call reduces the token cost of the expensive path without touching the routing logic.
gotcontext compresses via `ingest_context` before your model call, regardless of which model the router selected. The two optimizations are fully composable.
Cite this
Researchers, analysts, or journalists referencing this post can use either format below — both are copyable.
@misc{model-routing-the-architecture-that-makes-frontier-models-affordable-2026,
title = {Model Routing: How to Use Frontier Models at 24% of Their Cost},
author = {James Hollingsworth},
year = {2026},
month = {May},
url = {https://www.gotcontext.ai/blog/model-routing-the-architecture-that-makes-frontier-models-affordable},
note = {gotcontext.ai engineering blog.},
}
James Hollingsworth. (2026, May 8). Model Routing: How to Use Frontier Models at 24% of Their Cost. gotcontext.ai. Retrieved from https://www.gotcontext.ai/blog/model-routing-the-architecture-that-makes-frontier-models-affordable.