gotcontext + TokenSpeed: stack input compression with TRT-LLM-class inference for self-hosters
TokenSpeed (Lightseek Foundation, MIT, May 2026) is the first open-source LLM inference engine targeting TensorRT-LLM-level performance for agentic workloads. It sits at a different layer than gotcontext: gotcontext compresses what reaches the GPU, while TokenSpeed speeds up what the GPU processes. For self-hosters running open models, the two stack multiplicatively for roughly 80% real-cost reduction.
Two layers of the stack you can squeeze independently
Most LLM token-cost articles assume you're calling someone else's API: Anthropic, OpenAI, Google. If that's you, your inference engine is whatever the provider runs and you can't change it. Compression at the input layer (gotcontext) is the only knob you have.
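But if you're self-hosting open models (Llama 4, Qwen3, Mixtral, Kimi K2.5, DeepSeek V4), there are two distinct knobs: how many tokens reach the GPU (input compression, gotcontext) and how fast the GPU processes them (the inference engine, TokenSpeed).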
The two compose independently. Halving input tokens × doubling throughput per token = ~75% real-cost reduction. They live at different layers of the agent stack and don't conflict.
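The arithmetic behind that figure can be written out explicitly. This is a simplified model, assuming per-request cost is proportional to GPU time (tokens processed divided by throughput). The symbols $N$ (tokens per request), $R$ (baseline throughput), $c$ (compression ratio), and $s$ (throughput speedup) are introduced here for illustration and do not come from either project.

$$
\frac{\text{cost}_{\text{new}}}{\text{cost}_{\text{base}}}
  = \frac{(N/c)\,/\,(s\,R)}{N/R}
  = \frac{1}{c\,s}
$$

With $c = 2$ (halved input) and $s = 2$ (doubled throughput), the ratio is $1/4$, which is the ~75% reduction quoted above. Because $c$ and $s$ enter as a product, improving either one helps regardless of the other.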
This week the Lightseek Foundation released TokenSpeed, an open-source LLM inference engine targeting TensorRT-LLM-level performance, MIT licensed. According to LightSeek's own announcement, their Multi-head Latent Attention (MLA) kernel has been adopted by vLLM. For self-hosters, this is the first credible open-source replacement for TRT-LLM that's also hand-tuned for agentic workload patterns (long input, short output, high concurrency).
The two stack like this:
| Layer | Tool | What it optimizes |
|---|---|---|
| Application / agent | (your code) | n/a |
| Input preprocessing | gotcontext | tokens reaching the GPU |
| Provider boundary | (your endpoint) | n/a |
| Inference engine | TokenSpeed | tokens/second from the GPU |
| GPU | (NVIDIA Hopper, Blackwell) | n/a |
Where each one operates
| | gotcontext | TokenSpeed |
|---|---|---|
| Domain | Documentation, KB, conversation history, tool output | GPU-side token generation |
| Method | Semantic graph + PageRank importance scoring + structural chunking | Per-GPU throughput optimization, FSM-based KV cache safety |
| Architecture | API + MCP gateway in front of your inference endpoint | Drop-in replacement for TensorRT-LLM behind your endpoint |
| Where it sits | Between agent and ingest | Between endpoint and GPU |
| License | Proprietary (free + paid plans) | MIT |
| Setup | MCP config + API key | Replace TRT-LLM in your serving stack |
| Scope | Cuts what arrives | Speeds up what's processed |
The math, joint impact
The savings compound. Concrete example for a self-hosted agentic workload:
| Workload | Input tokens / req | Inference tok/sec | Cost per request |
|---|---|---|---|
| Baseline (vLLM default + uncompressed) | 12,000 | 45 | 1.0× |
| + gotcontext (input compression ~3×) | 4,000 | 45 | 0.33× |
| + TokenSpeed (~2× tok/sec on agentic patterns) | 4,000 | 90 | 0.17× |
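The cost column in the table can be reproduced with a few lines of Python. This is a minimal sketch under the same simplifying assumption the table implies: cost per request scales with GPU time, taken here as input tokens divided by tokens per second. The function name `relative_cost` and the scenario labels are illustrative, not part of either product.

```python
# Baseline row from the table: 12,000 input tokens at 45 tok/sec.
BASE_TOKENS = 12_000
BASE_TOK_PER_SEC = 45


def relative_cost(input_tokens: float, tok_per_sec: float) -> float:
    """Cost relative to the baseline, assuming cost is proportional to
    GPU time (tokens / throughput). Real deployments also have fixed
    overheads that this simplified model ignores."""
    gpu_seconds = input_tokens / tok_per_sec
    base_gpu_seconds = BASE_TOKENS / BASE_TOK_PER_SEC
    return gpu_seconds / base_gpu_seconds


# (input tokens per request, tokens per second) for each table row.
scenarios = {
    "baseline": (12_000, 45),
    "+ gotcontext": (4_000, 45),
    "+ TokenSpeed": (4_000, 90),
}

for name, (tokens, tps) in scenarios.items():
    print(f"{name}: {relative_cost(tokens, tps):.2f}x")
```

Swapping in your own measured token counts and throughput gives a quick estimate before committing to either change.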
Setup: roughly 1 hour for both, end to end
gotcontext
Add to your Claude Code MCP config (~/.claude/claude_desktop_config.json):
```json
{
  "mcpServers": {
    "gotcontext": {
      "url": "https://api.gotcontext.ai/mcp",
      "headers": { "Authorization": "Bearer gc_live_YOUR_KEY" }
    }
  }
}
```
Get a key at gotcontext.ai/sign-up. The free tier covers 1,000 compressions/month, no card required. Setup takes about 30 seconds.
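If the config file already lists other MCP servers, pasting the JSON entry by hand risks overwriting them. The sketch below merges the gotcontext entry into an existing file using only the Python standard library. The helper name `add_mcp_server` is hypothetical, and the file layout is assumed to match the JSON shown above; check it against your client's actual config location before running it on a real file.

```python
import json
import tempfile
from pathlib import Path


def add_mcp_server(config_path: Path, name: str, url: str, api_key: str) -> dict:
    """Add or update one entry under "mcpServers", keeping existing entries.

    Hypothetical helper: assumes the config is a JSON object shaped like
    the example in this post.
    """
    text = config_path.read_text() if config_path.exists() else ""
    config = json.loads(text) if text.strip() else {}

    # Create the "mcpServers" object if missing, then set only our entry.
    servers = config.setdefault("mcpServers", {})
    servers[name] = {
        "url": url,
        "headers": {"Authorization": f"Bearer {api_key}"},
    }

    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2) + "\n")
    return config


if __name__ == "__main__":
    # Demo against a temporary file so no real config is touched.
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "claude_desktop_config.json"
        updated = add_mcp_server(
            path,
            name="gotcontext",
            url="https://api.gotcontext.ai/mcp",
            api_key="gc_live_YOUR_KEY",  # placeholder, as in the post
        )
        print(json.dumps(updated, indent=2))
```

The merge is deliberately shallow: it replaces only the named server entry and leaves every other key in the file untouched.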
TokenSpeed
For self-hosters running NVIDIA Hopper or Blackwell GPUs, TokenSpeed is a TRT-LLM replacement. The MLA kernel has already landed in vLLM, so depending on your engine choice you may already get partial benefit. But the standalone TokenSpeed runtime is where the agentic-workload tuning lives.
Browse the project, install via its documented path, point your inference endpoint at TokenSpeed instead of TRT-LLM, and restart your serving deployment. Expect about 1 hour for the full migration on a typical setup. The license is MIT, and the vendor relationship is "open-source library," not "managed service."
Why we're recommending an inference engine
gotcontext doesn't run inference. We're an API + MCP layer that sits between your agent and your model endpoint. TokenSpeed doesn't run preprocessing. It's a GPU runtime that takes whatever tokens arrive and processes them as fast as possible.
These tools cannot replace each other. A customer who runs only gotcontext gets the input-side win but is leaving inference performance on the table if they're self-hosting. A customer who runs only TokenSpeed gets faster inference on their existing token volume but is paying for tokens they didn't need to send.
The use case where this matters most is enterprise self-hosters: companies running open models on their own NVIDIA hardware, not API customers. Those teams typically have a serving stack (vLLM, TGI, TRT-LLM today) and an application stack. gotcontext slots into the application stack; TokenSpeed slots into the serving stack. No team-boundary friction, no integration work between the two products.
Operational notes
TL;DR
Cite this
Researchers, analysts, or journalists referencing this post can use either format below.
@misc{tokenspeed-companion-self-hosters-2026,
title = {gotcontext + TokenSpeed: stack input compression with TRT-LLM-class inference for self-hosters},
author = {James Hollingsworth},
year = {2026},
month = {May},
url = {https://www.gotcontext.ai/blog/tokenspeed-companion-self-hosters},
note = {gotcontext.ai engineering blog.},
}

James Hollingsworth. (2026, May 9). gotcontext + TokenSpeed: stack input compression with TRT-LLM-class inference for self-hosters. gotcontext.ai. Retrieved from https://www.gotcontext.ai/blog/tokenspeed-companion-self-hosters.