You Can Cut Chain-of-Thought Token Costs ~66% With One Prompt Change

Token-budget-aware prompting cuts chain-of-thought reasoning length by ~66% with negligible accuracy loss. Because output tokens cost 4–8× more than input tokens, this is the highest-leverage prompt change available.

James Hollingsworth (Contributor) · 5 min read · ~588 words

The problem with "think step by step"

Chain-of-thought prompting works. It consistently improves reasoning quality on math, logic, and multi-step tasks. The problem is that it's expensive, and the expense scales poorly: the more capable the model, the more verbose its reasoning trace tends to be.

For a model like Claude Sonnet or GPT-4o, where output tokens cost 4–5× more than input tokens, a long reasoning trace isn't just slow. It's a major cost driver in your call. At a 5× premium, a 2,000-token reasoning chain costs as much as a 10,000-token input context.

Researchers at Nanjing University and UMass Amherst identified this problem and proposed a direct fix: tell the model how many tokens it has to reason in (arXiv:2412.18547).

Token-budget-aware reasoning: what it is

The core insight from "Token-Budget-Aware LLM Reasoning" is that LLM reasoning chains are unnecessarily long by default, and that including an explicit token budget in the prompt causes the model to compress its reasoning without meaningfully reducing accuracy.

The framework works by:

  • Estimating the complexity of the incoming question
  • Setting a token budget proportional to that complexity
  • Including that budget as an instruction in the prompt
  • Letting the model self-regulate its reasoning length against the budget

The paper reports that this approach reduces mean reasoning token counts substantially (approximately 66% on their benchmarks), with an accuracy reduction that the authors characterize as slight and within measurement noise for most task categories. The key is the dynamic per-query adjustment: a fixed budget across all queries would hurt accuracy on hard problems. The framework estimates complexity first, so difficult questions get more reasoning room while easy questions get tight budgets.
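
The four steps above can be sketched as a small budgeting function. This is a minimal illustration, not the paper's estimator: the `estimate_complexity` heuristic, its weights, and the `MIN_BUDGET` / `MAX_BUDGET` bounds are all assumptions made up for this example. In practice you would likely swap the heuristic for a lightweight classifier or a cheap model call.

```python
import re

# Hypothetical bounds for the reasoning budget, in tokens (assumed values).
MIN_BUDGET = 50
MAX_BUDGET = 800

# Words that loosely suggest multi-step reasoning (an assumption, not from the paper).
KEYWORD_PATTERN = re.compile(r"\b(?:prove|derive|optimi[sz]e|if|then|each|total)\b")
NUMBER_PATTERN = re.compile(r"\d+(?:\.\d+)?")


def estimate_complexity(question: str) -> float:
    """Return a crude complexity score in [0, 1].

    Assumption: longer questions with more numbers and more reasoning
    keywords tend to need more reasoning. This is only a stand-in for
    whatever estimator you actually use.
    """
    words = question.split()
    numbers = NUMBER_PATTERN.findall(question)
    keywords = KEYWORD_PATTERN.findall(question.lower())
    score = len(words) / 200 + len(numbers) / 20 + len(keywords) / 10
    return min(score, 1.0)


def budget_for(question: str) -> int:
    """Scale the token budget linearly with the estimated complexity."""
    complexity = estimate_complexity(question)
    return round(MIN_BUDGET + complexity * (MAX_BUDGET - MIN_BUDGET))


if __name__ == "__main__":
    for q in ("What is 2 + 2?",
              "If each of 12 machines makes 35 parts per hour, derive the total after 8 hours."):
        print(budget_for(q), "tokens for:", q)
```

The linear mapping is the simplest choice that keeps hard questions from being starved; the returned number is what you would then place in the prompt as the budget instruction.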

The prompt pattern

The practical implementation is a prompt wrapper:

```
You have a budget of {N} tokens to reason through this problem before your final answer.
Use your budget efficiently. Harder problems warrant more reasoning; simpler ones less.

Problem: {user_input}
```

N can be set statically (if your query distribution is uniform) or dynamically (if you have a lightweight classifier that estimates complexity before the main call). The model self-regulates: instead of truncating the output, you instruct the model to be concise.
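
As a rough sketch of the static and dynamic modes, the wrapper can be a single function that takes either a fixed budget or an estimator callback. The names `build_budget_prompt`, `DEFAULT_BUDGET`, and the `estimator` parameter are hypothetical, as is the default of 150 tokens; only the template wording comes from the prompt pattern described here.

```python
from typing import Callable, Optional

# Template wording follows the prompt pattern in this post; line breaks are an assumption.
BUDGET_TEMPLATE = (
    "You have a budget of {N} tokens to reason through this problem "
    "before your final answer. Use your budget efficiently. Harder problems "
    "warrant more reasoning; simpler ones less.\n\n"
    "Problem: {user_input}"
)

DEFAULT_BUDGET = 150  # hypothetical static budget, in tokens


def build_budget_prompt(
    user_input: str,
    static_budget: int = DEFAULT_BUDGET,
    estimator: Optional[Callable[[str], int]] = None,
) -> str:
    """Wrap user_input with a token-budget instruction.

    If an estimator is given, it picks the budget per query (dynamic mode);
    otherwise static_budget is used for every query (static mode).
    """
    budget = estimator(user_input) if estimator is not None else static_budget
    if budget <= 0:
        raise ValueError("budget must be a positive number of tokens")
    return BUDGET_TEMPLATE.format(N=budget, user_input=user_input)


if __name__ == "__main__":
    # Static mode: same budget for every query.
    print(build_budget_prompt("What is 17 * 24?"))
    # Dynamic mode: a toy estimator that grows with question length (assumed).
    print(build_budget_prompt("What is 17 * 24?", estimator=lambda q: 20 * len(q.split())))
```

Passing the estimator as a callback keeps the wrapper independent of how complexity is measured, so a heuristic can later be replaced by a classifier without touching the prompt code.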

The output token premium makes this urgent

Output tokens cost more than input tokens across every major provider:

| Provider | Model | Input (per 1M) | Output (per 1M) | Output premium |
|---|---|---|---|---|
| Anthropic | Claude Sonnet 4 | $3.00 | $15.00 | 5× |
| OpenAI | GPT-4o | $2.50 | $10.00 | 4× |
| Google | Gemini 2.5 Pro | $1.25 | $10.00 | 8× |

A reasoning-heavy workflow generating 500 tokens of CoT per call at 10,000 calls/day produces 5M output tokens daily. At Claude Sonnet pricing, that's $75/day just in reasoning traces. Cut that ~66% and you save about $49.50/day (~$18,000/year) from one prompt change.
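
The arithmetic behind these figures is easy to rerun with your own volumes. The sketch below uses the prices from the table above and the post's approximate 66% reduction; the function names are illustrative, and the prices should be checked against current provider pricing before relying on the result.

```python
# USD per 1M tokens as (input, output), copied from the pricing table in this post.
PRICES_PER_MILLION = {
    "Claude Sonnet 4": (3.00, 15.00),
    "GPT-4o": (2.50, 10.00),
    "Gemini 2.5 Pro": (1.25, 10.00),
}


def output_premium(model: str) -> float:
    """How many times more an output token costs than an input token."""
    input_price, output_price = PRICES_PER_MILLION[model]
    return output_price / input_price


def daily_reasoning_cost(tokens_per_call: int, calls_per_day: int,
                         output_price_per_million: float) -> float:
    """Daily spend on reasoning tokens, in USD."""
    return tokens_per_call * calls_per_day / 1_000_000 * output_price_per_million


if __name__ == "__main__":
    for model in PRICES_PER_MILLION:
        print(f"{model}: output premium {output_premium(model):.0f}x")

    reduction = 0.66  # approximate reduction quoted in this post
    before = daily_reasoning_cost(500, 10_000, PRICES_PER_MILLION["Claude Sonnet 4"][1])
    saved = before * reduction
    print(f"before: ${before:.2f}/day, saved: ${saved:.2f}/day, "
          f"saved per year: ${saved * 365:,.0f}")
```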

Composing with context compression

Token-budget reasoning addresses the *output* side of the cost equation. Context compression addresses the *input* side. They compose cleanly.

A typical agentic call has:

  • Input: tool outputs, conversation history, retrieved docs (often 10K–50K tokens)
  • Reasoning: CoT chain (often 200–500 tokens of output)
  • Final answer: the actual response (50–200 tokens)

gotcontext compresses the input layer (tool outputs, docs, history) before it reaches the model. Token-budget prompting compresses the reasoning layer. Together they attack both the largest token volume (input) and the highest per-token price (output).
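
To see how the two techniques combine on one call, here is a hedged per-call cost breakdown. It uses the Claude Sonnet 4 prices quoted in this post and token counts picked from the ranges listed above. The `input_kept=0.5` value is purely illustrative and is not a measured gotcontext figure; `reasoning_kept=0.34` corresponds to the post's approximate 66% reduction.

```python
# Claude Sonnet 4 prices as quoted in this post (USD per 1M tokens).
INPUT_PRICE = 3.00
OUTPUT_PRICE = 15.00


def call_cost(input_tokens: int, reasoning_tokens: int, answer_tokens: int,
              input_kept: float = 1.0, reasoning_kept: float = 1.0) -> dict:
    """Split one call's cost into input, reasoning, and answer components.

    input_kept and reasoning_kept are the fractions of tokens that remain
    after input compression and budget prompting, respectively.
    """
    return {
        "input": input_tokens * input_kept / 1_000_000 * INPUT_PRICE,
        "reasoning": reasoning_tokens * reasoning_kept / 1_000_000 * OUTPUT_PRICE,
        "answer": answer_tokens / 1_000_000 * OUTPUT_PRICE,
    }


if __name__ == "__main__":
    baseline = call_cost(10_000, 500, 200)
    # Assumed ratios: 50% of input kept (hypothetical), 34% of reasoning kept.
    combined = call_cost(10_000, 500, 200, input_kept=0.5, reasoning_kept=0.34)
    for label, parts in (("baseline", baseline), ("combined", combined)):
        total = sum(parts.values())
        detail = ", ".join(f"{name} ${value:.5f}" for name, value in parts.items())
        print(f"{label}: total ${total:.5f} ({detail})")
```

Running the breakdown with your own token counts shows which layer dominates a given workload, which is a reasonable way to decide where to start.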

The setup for input compression is one config block:

```json
{
  "mcpServers": {
    "gotcontext": {
      "url": "https://api.gotcontext.ai/mcp",
      "headers": {
        "Authorization": "Bearer gc_live_YOUR_KEY"
      }
    }
  }
}
```

Add token-budget instructions to your system prompt. Add gotcontext to your MCP config. Two changes, attacking both sides of the bill.

The research says CoT compression with budget constraints reduces reasoning tokens substantially with negligible accuracy loss. The output token premium means each saved output token is worth 4–8 saved input tokens. This is the highest-leverage prompt change you can make today.


Cite this

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX

```bibtex
@misc{how-to-cut-chain-of-thought-costs-66-percent-with-token-budgets-2026,
  title  = {You Can Cut Chain-of-Thought Token Costs ~66% With One Prompt Change},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://www.gotcontext.ai/blog/how-to-cut-chain-of-thought-costs-66-percent-with-token-budgets},
  note   = {gotcontext.ai engineering blog.},
}
```

APA

```text
James Hollingsworth. (2026, May 8). You Can Cut Chain-of-Thought Token Costs ~66% With One Prompt Change. gotcontext.ai. Retrieved from https://www.gotcontext.ai/blog/how-to-cut-chain-of-thought-costs-66-percent-with-token-budgets.
```
