gotcontext + TokenSpeed: stack input compression with TRT-LLM-class inference for self-hosters

Two layers of the stack, each of which you can squeeze independently

Most LLM token-cost articles assume you're calling someone else's API: Anthropic, OpenAI, Google. If that's you, your inference engine is whatever the provider runs and you can't change it. Compression at the input layer (gotcontext) is the only knob you have.

But if you're self-hosting open models (Llama 4, Qwen3, Mixtral, Kimi K2.5, DeepSeek V4), there are two distinct knobs:

Input layer: how many tokens reach the GPU per request

Inference layer: how fast the GPU processes each token

The two compose independently. If cost per request tracks GPU time, halving input tokens × doubling throughput per token = ~75% real-cost reduction (0.5 × 0.5 = 0.25 of the original cost). They live at different layers of the agent stack and don't conflict.
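Written out in general form, under the simplifying assumption that cost per request scales linearly with input tokens and inversely with throughput (output tokens and fixed overhead are ignored). The symbols c (input compression ratio) and s (throughput speedup) are introduced here for illustration only:

```latex
\frac{\text{cost}_{\text{new}}}{\text{cost}_{\text{base}}}
  = \frac{1}{c} \cdot \frac{1}{s},
\qquad
c = 2,\ s = 2
\;\Rightarrow\;
\frac{1}{2} \cdot \frac{1}{2} = 0.25
\quad (\text{a } 75\% \text{ reduction})
```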

This week the LightSeek Foundation released TokenSpeed, an MIT-licensed open-source LLM inference engine targeting TensorRT-LLM-level performance. According to LightSeek's own announcement, its Multi-head Latent Attention (MLA) kernel has been adopted by vLLM. For self-hosters, we see this as the first credible open-source replacement for TRT-LLM that's also tuned for agentic workload patterns (long input, short output, high concurrency).

The two stack like this:

Layer	Tool	What it optimizes
Application / agent	(your code)	n/a
Input preprocessing	gotcontext	tokens reaching the GPU
Provider boundary	(your endpoint)	n/a
Inference engine	TokenSpeed	tokens/second from the GPU
GPU	(NVIDIA Hopper, Blackwell)	n/a

Where each one operates

	gotcontext	TokenSpeed
Domain	Documentation, KB, conversation history, tool output	GPU-side token generation
Method	Semantic graph + PageRank importance scoring + structural chunking	Per-GPU throughput optimization, FSM-based KV cache safety
Architecture	API + MCP gateway in front of your inference endpoint	Drop-in replacement for TensorRT-LLM behind your endpoint
Where it sits	Between agent and inference endpoint	Between endpoint and GPU
License	Proprietary (free + paid plans)	MIT
Setup	MCP config + API key	Replace TRT-LLM in your serving stack
Scope	Cuts what arrives	Speeds up what's processed

gotcontext rewrites prompt content before any inference engine sees it. TokenSpeed runs the inference engine itself. They literally cannot conflict: gotcontext doesn't run on a GPU; TokenSpeed doesn't read documents.
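A minimal sketch of that separation, assuming each layer can be modeled as a plain function. The names `make_pipeline`, `toy_compress`, and `toy_infer` are hypothetical stand-ins, not real gotcontext or TokenSpeed clients, and the toy compressor only removes duplicate lines; it does not reproduce gotcontext's semantic-graph method.

```python
from typing import Callable

Compressor = Callable[[str], str]  # input layer: text in, shorter text out
Engine = Callable[[str], str]      # inference layer: prompt in, completion out


def make_pipeline(compress: Compressor, infer: Engine) -> Engine:
    """Compose the two layers; either one can be swapped without touching the other."""
    def run(prompt: str) -> str:
        return infer(compress(prompt))
    return run


def toy_compress(text: str) -> str:
    """Hypothetical compressor: drop duplicate lines, keeping first occurrences in order."""
    seen: set[str] = set()
    kept: list[str] = []
    for line in text.splitlines():
        if line not in seen:
            seen.add(line)
            kept.append(line)
    return "\n".join(kept)


def toy_infer(prompt: str) -> str:
    """Hypothetical engine: reports how many words it received."""
    return f"[completion for {len(prompt.split())} words]"


if __name__ == "__main__":
    pipeline = make_pipeline(toy_compress, toy_infer)
    print(pipeline("a b\na b\nc"))
```

The point of the shape is that the compressor never needs to know which engine runs behind it, and the engine never needs to know whether its input was compressed.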

The math: joint impact

The savings compound. Concrete example for a self-hosted agentic workload:

Configuration (cumulative)	Input tokens / req	Inference tok/sec	Relative cost per request
Baseline (vLLM default + uncompressed)	12,000	45	1.0×
+ gotcontext (input compression ~3×)	4,000	45	0.33×
+ TokenSpeed (~2× tok/sec on agentic patterns)	4,000	90	0.17×

That is a ~83% real-cost reduction (1/3 × 1/2 ≈ 0.17 of baseline). The two interventions are independent: gotcontext doesn't care which inference engine consumes its compressed output; TokenSpeed doesn't care whether the input was compressed before it arrived. Stacking them is multiplicative, not additive.
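The table's last column can be reproduced with a few lines of Python. This is a sketch under one assumption that the table implies but does not state: cost per request is proportional to GPU time, modeled as input tokens divided by tokens per second. The `Config` class and function names are illustrative, not part of either product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Config:
    name: str
    input_tokens: int      # tokens reaching the GPU per request
    tokens_per_sec: float  # engine throughput


def gpu_seconds(cfg: Config) -> float:
    """GPU time per request under the simplified model: tokens / throughput."""
    return cfg.input_tokens / cfg.tokens_per_sec


def relative_costs(configs: list[Config]) -> dict[str, float]:
    """Cost of each configuration relative to the first one (the baseline)."""
    base = gpu_seconds(configs[0])
    return {c.name: gpu_seconds(c) / base for c in configs}


# Figures taken from the table above.
CONFIGS = [
    Config("baseline", 12_000, 45),
    Config("+ gotcontext", 4_000, 45),
    Config("+ TokenSpeed", 4_000, 90),
]

if __name__ == "__main__":
    for name, rel in relative_costs(CONFIGS).items():
        print(f"{name:<14} {rel:.2f}x  ({1 - rel:.0%} reduction)")
```

Swap in your own measured token counts and throughput to see where your workload lands; real costs will also depend on output length, batching, and utilization, which this model leaves out.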

Setup: roughly 1 hour for both, end to end

gotcontext

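Add the server to your MCP client config (for Claude Code, typically a project-level .mcp.json or the user-level ~/.claude.json; check your client's docs for the exact path):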

```json
{
  "mcpServers": {
    "gotcontext": {
      "url": "https://api.gotcontext.ai/mcp",
      "headers": {
        "Authorization": "Bearer gc_live_YOUR_KEY"
      }
    }
  }
}
```

Get a key at gotcontext.ai/sign-up. The free tier covers 1,000 compressions/month, no card required. Setup takes about 30 seconds.
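If you'd rather not paste the key into a file by hand, the same config can be generated from an environment variable. This is a sketch only: the variable name `GOTCONTEXT_API_KEY` and the helper `build_mcp_config` are made up for this example, and the output still needs to be merged into whichever config file your MCP client reads.

```python
import json
import os


def build_mcp_config(api_key: str) -> dict:
    """Return the MCP server entry from the snippet above, with the key filled in."""
    if not api_key:
        raise ValueError("api_key must not be empty")
    return {
        "mcpServers": {
            "gotcontext": {
                "url": "https://api.gotcontext.ai/mcp",
                "headers": {"Authorization": f"Bearer {api_key}"},
            }
        }
    }


if __name__ == "__main__":
    # GOTCONTEXT_API_KEY is a hypothetical variable name chosen for this sketch.
    key = os.environ.get("GOTCONTEXT_API_KEY", "gc_live_YOUR_KEY")
    print(json.dumps(build_mcp_config(key), indent=2))
```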

TokenSpeed

For self-hosters running NVIDIA Hopper or Blackwell GPUs, TokenSpeed is a TRT-LLM replacement. The MLA kernel has already landed in vLLM, so depending on your engine choice you may already get partial benefit. But the standalone TokenSpeed runtime is where the agentic-workload tuning lives.

Browse the project, install it via the project's documented path, point your inference endpoint at TokenSpeed instead of TRT-LLM, and restart your serving deployment. Expect about 1 hour for the full migration on a typical setup. The license is MIT, so the vendor relationship is "open-source library," not "managed service."
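Before and after the migration, it's worth measuring throughput on your own prompts rather than relying on headline numbers. The sketch below is engine-agnostic: this post doesn't cover TokenSpeed's client API, so `generate` is a placeholder callable that you would wrap around your own endpoint client, and `fake_engine` is a stand-in used only to make the example runnable. It sends requests sequentially, so it understates what a high-concurrency agentic workload would see.

```python
import time
from typing import Callable


def measure_tokens_per_sec(generate: Callable[[str], int], prompts: list[str]) -> float:
    """Run each prompt through `generate` (which returns the number of tokens
    processed) and return aggregate tokens per second of wall-clock time."""
    start = time.perf_counter()
    total_tokens = sum(generate(p) for p in prompts)
    elapsed = time.perf_counter() - start
    if elapsed <= 0:
        raise RuntimeError("elapsed time too small to measure")
    return total_tokens / elapsed


def fake_engine(prompt: str) -> int:
    """Stand-in for a real client call; pretends each word is one token."""
    time.sleep(0.001)
    return len(prompt.split())


if __name__ == "__main__":
    rate = measure_tokens_per_sec(fake_engine, ["summarize this tool output"] * 20)
    print(f"{rate:.0f} tok/sec (fake engine)")
```

Run it with the same prompt set against the old and new serving stack, and compare the two rates.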

Why we're recommending an inference engine

gotcontext doesn't run inference. We're an API + MCP layer that sits between your agent and your model endpoint. TokenSpeed doesn't run preprocessing. It's a GPU runtime that takes whatever tokens arrive and processes them as fast as possible.

These tools cannot replace each other. A customer who runs only gotcontext gets the input-side win but is leaving inference performance on the table if they're self-hosting. A customer who runs only TokenSpeed gets faster inference on their existing token volume but is paying for tokens they didn't need to send.

The use case where this matters most is enterprise self-hosters: companies running open models on their own NVIDIA hardware, not API customers. Those teams typically have a serving stack (vLLM, TGI, TRT-LLM today) and an application stack. gotcontext slots into the application stack; TokenSpeed slots into the serving stack. No team-boundary friction, no integration work between the two products.

Operational notes

gotcontext is a remote API. Content is sent to our servers for compression. If your KB is sensitive, evaluate the data-flow shape per source. Self-hosting gotcontext is on the enterprise plan.

TokenSpeed is local. Runs on your GPUs. No data leaves your infrastructure.

Both are free to start. gotcontext: free tier of 1,000 compressions/month, no card. TokenSpeed: MIT licensed, no fee.

TokenSpeed is not relevant to API customers. If you're calling Claude or GPT-4 via Anthropic's or OpenAI's API, you can't choose your inference engine; that's the provider's problem. TokenSpeed doesn't help there. gotcontext does.

vLLM users get a partial win. TokenSpeed's MLA kernel landed in vLLM upstream. If you're already running vLLM with MLA-capable models (Kimi, DeepSeek V4 Pro), you're picking up some of the throughput gain without migrating.

TL;DR

gotcontext = input-layer compression (cuts what reaches the GPU)

TokenSpeed = inference-engine optimization (speeds up what the GPU processes)

Different layers, no conflict, multiplicative savings

Joint reduction: ~83% real-cost in the worked example (3× input compression, 2× throughput) on a self-hosted agentic workload

gotcontext: ~30 second setup. TokenSpeed: ~1 hour migration

API customers (Claude, GPT-4): only gotcontext applies

Self-hosters (Llama, Qwen, Kimi, DeepSeek): install both


Cite this

Researchers, analysts, or journalists referencing this post can use either format below.

BibTeX

@misc{tokenspeed-companion-self-hosters-2026,
  title  = {gotcontext + TokenSpeed: stack input compression with TRT-LLM-class inference for self-hosters},
  author = {James Hollingsworth},
  year   = {2026},
  month  = {May},
  url    = {https://www.gotcontext.ai/blog/tokenspeed-companion-self-hosters},
  note   = {gotcontext.ai engineering blog.},
}

APA

James Hollingsworth. (2026, May 9). gotcontext + TokenSpeed: stack input compression with TRT-LLM-class inference for self-hosters. gotcontext.ai. Retrieved from https://www.gotcontext.ai/blog/tokenspeed-companion-self-hosters.
