Google Dropped TurboQuant Two Weeks Ago. The Community Already Made It Usable.
Source: DEV Community
Google published the TurboQuant paper on March 25. It's April 7. There are already five independent implementations, a llama.cpp fork running 104B parameter models on a MacBook, and an active vLLM integration effort. Google hasn't released a single line of official code. This is the post about what happened in those two weeks.

The Paper, In 30 Seconds

TurboQuant is a KV cache compression method. During inference, large language models store key-value pairs for every token in the context -- this is the KV cache, and it's the single biggest memory bottleneck for long-context inference. The paper demonstrates quality-neutral compression at around 3.5 bits per element, with marginal degradation down to 2.5 bits -- achieving at least 6x memory reduction and up to 8x speedup in attention computation on H100 GPUs, with what the paper claims is zero accuracy loss at the sweet spot.

The critical detail: it's training-free and data-oblivious. You don't retrain the model. You don't need calibration data.
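To make the memory arithmetic concrete, here is a minimal sketch of what low-bit KV cache quantization looks like in general. This is illustrative uniform per-channel quantization, not TurboQuant's actual algorithm (the paper's method and any overhead for scales/offsets are not reproduced here); the shapes and function names are assumptions for the example.

```python
import numpy as np

def quantize_kv(x: np.ndarray, bits: int = 4):
    """Per-channel uniform quantization of a KV cache slab.

    Illustrative only -- NOT TurboQuant's method. Stores each element
    in `bits` bits plus one (scale, offset) pair per channel.
    """
    lo = x.min(axis=0, keepdims=True)
    hi = x.max(axis=0, keepdims=True)
    scale = (hi - lo) / (2**bits - 1)
    scale = np.where(scale == 0, 1.0, scale)  # guard constant channels
    q = np.round((x - lo) / scale).astype(np.uint8)
    return q, scale, lo

def dequantize_kv(q: np.ndarray, scale: np.ndarray, lo: np.ndarray):
    return q.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
# Hypothetical cache slab: 128 tokens x 64-dim keys, stored in fp32 here
k = rng.standard_normal((128, 64)).astype(np.float32)
q, s, lo = quantize_kv(k, bits=4)
k_hat = dequantize_kv(q, s, lo)

# Worst-case round-trip error is bounded by half a quantization step
max_err = np.abs(k - k_hat).max()
```

Against fp16 storage (16 bits per element), 4-bit codes alone give a 4x reduction; getting to the paper's 6x at 2.5 bits, while keeping quality, is where the actual contribution lies.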