2026-03-27
5 min read
"Google published TurboQuant at ICLR 2026 for text models. 72 hours later, turboquant-vllm was on PyPI — the first implementation validated on vision-language models and the first vLLM plugin. One flag to enable, 3.76x KV cache compression."
There's a difference between nailing a recipe at home and running it on a restaurant line. At home you control the heat, the timing, the single plate going out. On the line, you need it to work with different stoves, multiple tickets firing at once, and a kitchen that wasn't built around your dish. The first post was the home kitchen version — implementing TurboQuant from the paper, finding what works and what breaks. This post is about getting it on the line.
Google published the TurboQuant paper on March 24, 2026. By March 27, turboquant-vllm was on PyPI serving compressed video inference through vLLM's OpenAI-compatible API. One flag to enable:
```bash
pip install turboquant-vllm[vllm]
vllm serve allenai/Molmo2-8B --attention-backend CUSTOM
```
3.76x KV cache compression. Near-identical output quality. No code changes.
This post is about the production journey — the decisions that turned a research implementation into a pip-installable plugin in 72 hours, and why nobody else has tested TurboQuant on vision-language models.
Every other TurboQuant implementation I could find — and there are several — tests exclusively on text models: Qwen, Gemma, Mistral, Llama. Google's own paper benchmarks on Gemma, Mistral, and Llama-3.1-8B. Text only.
Vision-language models are a harder test case. A 12-second video clip through Molmo2-4B produces ~11,000 visual tokens — 10x longer than typical text prompts. That means 10x more KV cache memory, 10x more opportunities for precision bugs to compound across 36 transformer layers.
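The memory arithmetic behind that claim is easy to sketch. The model dimensions below (`kv_heads`, `head_dim`) are illustrative placeholders, not Molmo2's actual config; the point is that an fp16 KV cache grows linearly with token count, so a 10x longer prompt means a 10x larger cache:

```python
# Back-of-envelope fp16 KV cache sizing. Dimensions are illustrative
# placeholders, not Molmo2's real config.
def kv_cache_mib(tokens, layers, kv_heads, head_dim, bytes_per_el=2):
    # 2x for keys AND values; fp16 = 2 bytes per element
    return 2 * layers * tokens * kv_heads * head_dim * bytes_per_el / 2**20

text_prompt = kv_cache_mib(tokens=1_100, layers=36, kv_heads=8, head_dim=128)
video_prompt = kv_cache_mib(tokens=11_000, layers=36, kv_heads=8, head_dim=128)
print(f"{video_prompt / text_prompt:.1f}x larger cache")  # prints "10.0x larger cache"
```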
The existing VLM KV cache compression literature takes an entirely different approach: token pruning and sparsification (VL-Cache, Dynamic-LLaVA, ZipVL). These methods decide which tokens to discard. TurboQuant compresses the tokens you keep. They're complementary — you could stack TurboQuant on top of pruned caches for even greater savings.
Nobody had validated whether TurboQuant's vector quantization survives the visual token regime. Now someone has.
turboquant-vllm 1.0.0 is a vLLM plugin, not a fork. It registers via `vllm.general_plugins` entry points — the same mechanism vLLM uses for official backends. Install it, pass `--attention-backend CUSTOM`, and the TQ4 backend handles everything.
For HuggingFace users, `CompressedDynamicCache` wraps `DynamicCache` and compresses transparently on every `cache.update()`.
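The wrap-and-compress-on-update pattern is simple to illustrate. This is a self-contained toy, not the library's code: the class name and the crude scale-and-round quantizer are stand-ins for TurboQuant's actual vector quantization.

```python
# Toy sketch of a cache that compresses transparently at insertion
# time, in the spirit of CompressedDynamicCache wrapping DynamicCache.
# The quantizer (per-update scale + round to int8 range) is a crude
# stand-in, NOT TurboQuant's actual scheme.
class ToyCompressedCache:
    def __init__(self):
        self._store = []  # persistent storage holds only compressed data

    def update(self, values):
        # Compress at insertion, like cache.update() in the wrapper.
        scale = max(abs(v) for v in values) or 1.0
        quant = [round(v / scale * 127) for v in values]
        self._store.append((scale, quant))

    def get(self):
        # Decompress on read; callers never see the quantized form.
        out = []
        for scale, quant in self._store:
            out.extend(q / 127 * scale for q in quant)
        return out

cache = ToyCompressedCache()
cache.update([0.5, -1.0, 0.25])
recovered = cache.get()  # close to the inputs, small quantization error
```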
Molmo2-4B on RTX 4090, 11K visual tokens from a Seinfeld video clip:
| Metric | Baseline | TQ4 Compressed |
|---|---|---|
| KV cache | 1,639 MiB | 435 MiB (3.76x) |
| Output quality | Detailed scene description | Near-identical (100+ tokens match word-for-word) |
| Decode overhead | — | 1.78x |
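The compression ratio follows directly from the table's memory figures:

```python
# Sanity-check the headline ratio from the table's MiB figures.
baseline_mib = 1639
compressed_mib = 435
ratio = baseline_mib / compressed_mib
print(f"{ratio:.2f}x compression, {baseline_mib - compressed_mib} MiB freed")
```

The rounded MiB values give roughly 3.77x; the headline 3.76x presumably comes from exact byte counts. Either way the result sits just under the 4x ceiling a pure 4-bit representation allows, with the gap presumably going to quantization metadata such as scales.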
Molmo2-8B: same 3.76x compression ratio, correctly identifies all Seinfeld characters. Full 23-minute episode processed across 4 clips at 24 tok/s.
Other vLLM TurboQuant efforts are forks (brittle, hard to update) or monkey-patches (fragile, version-dependent). turboquant-vllm uses vLLM's official plugin entry point:
```toml
[project.entry-points."vllm.general_plugins"]
tq4_backend = "turboquant_vllm.vllm:register_tq4_backend"
```
`pip install` registers the backend; `--attention-backend CUSTOM` activates it. No patching, no forking, no maintenance burden when vLLM updates.
The naive approach decompresses the entire KV cache at every layer at every decode step. For 11K tokens across 36 layers, that's 3.36x overhead.
The fix: decompress only the 1 new token per step, append it to a running buffer, let standard Flash Attention handle the rest. Overhead drops to 1.78x. This optimization isn't in the Google paper — it's what makes TQ4 practical for production serving.
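The asymptotics are easy to demonstrate with a toy decode loop. The quantizer below is a stand-in, not TQ4; the point is the difference between rebuilding the full-precision cache every step and decompressing only the newly appended token:

```python
# Toy comparison: naive decode (re-decompress the whole cache each
# step) vs. the incremental fix (decompress only the 1 new token and
# append it to a running buffer). Quantization is a crude stand-in.
def compress(vec, scale=0.1):
    return [round(v / scale) for v in vec]

def decompress(qvec, scale=0.1):
    return [q * scale for q in qvec]

def naive_decode(tokens):
    cache, calls = [], 0
    for t in tokens:
        cache.append(compress(t))
        _ = [decompress(q) for q in cache]  # rebuild everything, every step
        calls += len(cache)
    return calls

def incremental_decode(tokens):
    cache, buffer, calls = [], [], 0
    for t in tokens:
        q = compress(t)
        cache.append(q)                # stored compressed
        buffer.append(decompress(q))   # only the 1 new token is decompressed
        calls += 1
    return calls

steps = 100
print(naive_decode([[0.1]] * steps), incremental_decode([[0.1]] * steps))
# prints "5050 100": quadratic total work vs. linear
```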
The fused kernels (compress, decompress, Q@K^T, Flash Attention + TQ4) run on both NVIDIA CUDA and AMD ROCm without code changes. I validated on a Radeon 890M iGPU — 84 of 84 GPU-parametrized tests pass with bit-identical math.
KV cache compression is most useful on memory-constrained hardware — exactly where AMD's consumer GPUs sit.
The v1.0.0 release was validated against the official `vllm/vllm-openai:latest` image, serving Molmo2-8B video inference with zero errors.

The experiment logs document failure modes nobody else has published: the fp16 norms trap at 10K+ tokens, QJL correction being invisible in standard attention, and multi-layer precision drift in fused kernels. These are landmines in every other implementation that hasn't hit 10K+ visual tokens.
The full implementation, 16 experiment logs, and architecture docs are at github.com/Alberto-Codes/turboquant-vllm.
```bash
pip install turboquant-vllm[vllm]
vllm serve your-model --attention-backend CUSTOM
```