Google Research details TurboQuant, a quantization algorithm to enable massive compression of LLMs and vector search engines without sacrificing accuracy
Amir Zandieh, Research Scientist, and Vahab Mirrokni, VP and Google Fellow, Google Research — We introduce a set …
Google Research
Related Coverage
- Google's TurboQuant reduces AI LLM cache memory capacity requirements by at least six times … Tom's Hardware · Luke James
- Google Research outlines algorithms that may ease AI memory squeeze Constellation Research · Larry Dignan
- SanDisk (SNDK) Shares Slide 5% as Google Innovation Threatens Memory Demand Blockonomi · Trader Edge
- Google's TurboQuant compresses AI memory by 6x without losing accuracy Efficienist · Ivan Jenic
- Google's TurboQuant cuts AI memory use without losing accuracy Help Net Security · Anamarija Pogorelec
- Google Introduces TurboQuant: A New Compression Algorithm that Reduces LLM Key-Value Cache Memory by 6x and Delivers Up to 8x Speedup, All with Zero Accuracy Loss MarkTechPost · Asif Razzaq
- TurboQuant: Redefining AI efficiency with extreme compression Hacker News
- TurboQuant: Redefining AI efficiency with extreme compression Lobsters
- 1 Introduction — Vector quantization (VQ) in Euclidean space is crucial for efficiently … arXiv.org e-Print archive
- Google's TurboQuant AI-compression algorithm can reduce LLM memory usage by 6x Ars Technica · Ryan Whitwam
- Google's new TurboQuant algorithm speeds up AI memory 8x, cutting costs by 50% or more VentureBeat · Carl Franzen
- Memory Stocks Slide As Google's New AI Efficiency Breakthrough May Slash Data Storage Needs Benzinga · Kaustubh Bagalkote
- Google unveils TurboQuant, a new AI memory compression algorithm — and yes, the internet is calling it ‘Pied Piper’ TechCrunch · Sarah Perez
- Micron's stock is dropping. Is Google partly to blame? MarketWatch · Britney Nguyen
- Google's new compression algorithm cut memory stocks within hours of publication The Next Web · Alina Maria Stan
Discussion
-
@eastdakota
Matthew Prince
on x
This is Google's DeepSeek. So much more room to optimize AI inference for speed, memory usage, power consumption, and multi-tenant utilization. Lots of teams at @Cloudflare focused on these areas. #staytuned
-
@markosaaig
Markos
on x
I'm laughing here with @jukan05 because this drop shows exactly who understands the memory system and who doesn't. Probably the same people who constantly scream that Micron and SK have a low forward P/E. TurboQuant lowers cost per token and expands the context window so you can
-
@omercheeema
Omer Cheema
on x
In every AI chat, the model keeps your entire conversation in KV cache. On a 70B model, that alone can eat 40GB+ of GPU RAM, more than the model. Google just dropped TurboQuant: a new compression algorithm that shrinks the KV cache by 6x down to just 3 bits per value — with
-
@jenzhuscott
Jen Zhu
on x
When I was consulting for @HBO Silicon Valley, zero-loss compression was the holy grail Richard Hendricks chased: that perfect middle-out algo could shrink everything w/out breaking a single bit. Google just did something even more practical for the AI era: TurboQuant compresses […
-
@stocksavvyshay
Shay Boloor
on x
$MU and $SNDK are getting hit hard at the open from the release of $GOOGL TurboQuant. The market is reading it as a potential headwind for memory names because long-context AI inference may now need far less memory per workload. [image]
-
@benbajarin
Ben Bajarin
on x
Ok, class, listen up. Your homework today is to come up with scenarios where TurboQuant is negative to memory demand and scenarios where it may actually boost memory demand.
-
@googleresearch
@googleresearch
on x
Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: https://research.google/... [im…
-
@anisha_moonka
Anish Moonka
on x
Every time you message an AI chatbot, the model stores your entire conversation in temporary memory called a KV cache (a cheat sheet so it doesn't re-read everything from scratch). On a large model like Llama 70B running a long conversation, that cache alone eats 40GB of GPU
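The cache-size claim in the post above can be checked with quick back-of-the-envelope arithmetic. The architecture numbers below are assumptions for a Llama-3-70B-class model with grouped-query attention (80 layers, 8 KV heads, head dimension 128), not figures from the post:

```python
# Back-of-the-envelope KV cache size for a Llama-70B-class model.
# Layer/head counts are assumptions (Llama-3-70B uses GQA with
# 80 layers, 8 KV heads, head dim 128); adjust for your model.
layers = 80
kv_heads = 8
head_dim = 128
bytes_fp16 = 2
seq_len = 128_000  # a long conversation / full context window

# K and V each store one vector per layer, per KV head, per token.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16
cache_gb = bytes_per_token * seq_len / 1e9
print(f"{bytes_per_token} bytes/token, {cache_gb:.1f} GB at {seq_len} tokens")

# Going from 16 bits to ~3 bits per value is roughly a 16/3 ≈ 5.3x
# reduction before metadata; the reported "at least 6x" presumably
# reflects further techniques from the paper.
print(f"~3-bit cache: {cache_gb * 3 / 16:.1f} GB")
```

At these assumed settings the fp16 cache lands just above 40 GB, consistent with the "40GB of GPU" figure quoted in the post.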
-
@f4micom
@f4micom
on x
i hope this is open and i hope that if it's not it inspires an open implementation of the same base concept
-
@joshkale
Josh Kale
on x
This post got ZERO attention but is BY FAR the biggest AI news this week. Google just published TurboQuant: a compression algorithm that makes AI inference 8x faster while using 6x less memory. No retraining. No accuracy loss. The biggest cost is inference, which happens billions […
-
@onlyxuanwo
@onlyxuanwo
on x
I know Pied Piper is real
-
@firstadopter
Tae Kim
on x
Wat! “Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency.” [image]
-
@kmeanskaran
Karan
on x
Mark my words, inference engineering will be an evergreen in-demand skill for the upcoming years. If you are in applied ML, then it's a top priority skill. People using AI will grow, along with companies integrating AI. Nowadays, speed matters the most and we are losing
-
@0xsero
@0xsero
on x
Testing this tomorrow, will report back if it works on Qwen3.5 Might be able to run much larger models if this works. [image]
-
@alexfinn
Alex Finn
on x
This is potentially the biggest news of the year. Google just released TurboQuant, an algorithm that makes LLMs smaller and faster without losing quality. Meaning that 16GB Mac Mini can now run INCREDIBLE AI models. Completely locally, free, and secure. This also means: •
-
@brianroemmele
Brian Roemmele
on x
We are testing TurboQuant at the Zero-Human Company and are fascinated by the speed up! We are at a consistent 5x increase! More testing...
-
@aminkarbasi
Amin Karbasi
on x
I left @GoogleResearch almost two years ago, so it makes me genuinely happy to see our work on polar quantization (my last project), which eventually led to extreme compression, being recognized there. It is a nice reminder that good fundamental work tends to find its place with
-
@bqbrady
Benedict
on x
All you had to do was pay attention to the polar coordinates lecture in Trig and you could have discovered a 6x reduction in KV cache memory. High school math vindicated
-
@sudoingx
@sudoingx
on x
thank you google. for all your contributions to make the world a better place. we need more of this not more of altman gambles.
-
@kimmonismus
@kimmonismus
on x
That's freaking awesome: Google Research has introduced TurboQuant, a compression algorithm (presenting at ICLR 2026) that shrinks the memory footprint of large language models by at least 6x, without any retraining or drop in accuracy. It works by converting data into a polar [im…
-
@themylesfiles
Myles
on x
I'm an interactive learner, so I turned Google's TurboQuant paper into a @marimo_io notebook. Random rotations → Beta distributions → optimal 3-bit quantization → 6x memory savings on LLM KV caches. Way easier to grok when you can drag a slider and watch the math happen.
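The pipeline that notebook thread describes (random rotations, then optimal low-bit quantization) can be sketched generically in NumPy. This is an illustration of the general rotate-then-quantize technique under simplified assumptions (a uniform 3-bit scalar quantizer), not the paper's actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
x = rng.standard_normal(d).astype(np.float32)

# Random orthogonal rotation: after rotating, the coordinates of any
# fixed vector look i.i.d.-Gaussian-like, so one scalar quantizer
# works reasonably well across all dimensions.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
xr = Q @ x

# Uniform 3-bit quantization of the rotated coordinates (8 levels).
bits = 3
lo, hi = xr.min(), xr.max()
levels = 2**bits
codes = np.clip(np.round((xr - lo) / (hi - lo) * (levels - 1)), 0, levels - 1)
xq = codes / (levels - 1) * (hi - lo) + lo

# Undo the rotation and measure the reconstruction error.
x_hat = Q.T @ xq
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"relative L2 error at {bits} bits: {rel_err:.3f}")
```

Even this naive quantizer keeps the relative error modest at 3 bits; the paper's contribution is in doing much better than a uniform grid.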
-
@8teapi
Prakash
on x
Interesting. By releasing this publicly they can reduce memory demand across the sector, which helps because they were slow to secure memory capacity in Asia
-
@rough__sea
Ryan Dahl
on x
I'm surprised Google is publishing this - seems like good IP
-
@prince_canuma
Prince Canuma
on x
Just implemented Google's TurboQuant in MLX and the results are wild! Needle-in-a-haystack using Qwen3.5-35B-A3B across 8.5K, 32.7K, and 64.2K context lengths: → 6/6 exact match at every quant level → TurboQuant 2.5-bit: 4.9x smaller KV cache → TurboQuant 3.5-bit: 3.8x [image]
-
@jordannanos
Jordan Nanos
on x
Jevons paradox for KV cache
-
@raffi_hotter
Raffi Hotter
on x
This algorithm uses one of my favourite theorems in math, the Johnson-Lindenstrauss Lemma, which says you can drastically reduce the dimensionality of n points to just O(log n) dimensions and still preserve pairwise distances
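The Johnson-Lindenstrauss property mentioned above is easy to verify empirically: project points from a high dimension down to a much smaller one with a random Gaussian matrix and check that pairwise distances are roughly preserved. A generic illustration (the dimensions below are arbitrary, not tied to TurboQuant):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 200, 4096, 256  # n points, original dim d, target dim k

X = rng.standard_normal((n, d))
# Random Gaussian projection, scaled so squared norms are preserved
# in expectation.
P = rng.standard_normal((d, k)) / np.sqrt(k)
Y = X @ P

# Compare a sample of pairwise distances before and after projection.
i, j = rng.integers(0, n, 500), rng.integers(0, n, 500)
mask = i != j
d_orig = np.linalg.norm(X[i[mask]] - X[j[mask]], axis=1)
d_proj = np.linalg.norm(Y[i[mask]] - Y[j[mask]], axis=1)
ratio = d_proj / d_orig
print(f"distance ratio: mean {ratio.mean():.3f}, "
      f"min {ratio.min():.3f}, max {ratio.max():.3f}")
```

The ratios concentrate tightly around 1.0, with distortion shrinking as the target dimension k grows.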
-
@matthewberman
Matthew Berman
on x
this is a big deal. 6x reduction in kv mem and 8x speed up is incredible...let alone ZERO accuracy loss.
-
@mweinbach
Max Weinbach
on x
On TurboQuant, I see 2 possible outcomes 1. we reallocate memory towards larger models and larger context, so freed up memory goes to improving models 2. we keep everything as is and just shrink costs to generate tokens I'd wager we see the shift towards 1, not 2
-
@julientechinvst
Julien
on x
Memory makers bloodbath tomorrow (likely). It's huge progress and a step forward in removing the memory bottleneck. Now, it will put more pressure on the logic side though
-
@timkellogg.me
Tim Kellogg
on bluesky
PolarQuant: 6x memory reduction 8x speed improvement — weirdly, this works for both KV-cache and vector DBs — the gist is they convert from cartesian coordinate vectors into polar coordinates, and since they're always normalized to 1.0, they drop the magnitude too — researc…
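The polar-coordinate trick described in that post can be sketched in the 2-D case: a normalized vector (x, y) has magnitude 1, so it is fully determined by a single angle, and you quantize one number instead of two. A toy illustration only; real KV vectors are high-dimensional and would use spherical coordinates:

```python
import numpy as np

rng = np.random.default_rng(2)
v = rng.standard_normal(2)
v /= np.linalg.norm(v)  # normalized: magnitude is exactly 1, so drop it

# Encode: one angle instead of two Cartesian coordinates.
theta = np.arctan2(v[1], v[0])

# Quantize the angle to 8 bits over [-pi, pi].
levels = 2**8
code = int(np.round((theta + np.pi) / (2 * np.pi) * (levels - 1)))
theta_hat = code / (levels - 1) * 2 * np.pi - np.pi

# Decode back to Cartesian coordinates.
v_hat = np.array([np.cos(theta_hat), np.sin(theta_hat)])
err = np.linalg.norm(v - v_hat)
print(f"reconstruction error from one 8-bit angle: {err:.4f}")
```

Note the decoded vector is exactly unit-norm by construction, which is part of the appeal: quantization error lives only in the angle, never in the magnitude.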
-
r/accelerate
on reddit
Google Research introduces TurboQuant: A new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup …
-
r/LocalLLaMA
on reddit
[google research] TurboQuant: Redefining AI efficiency with extreme compression
-
@emollick
Ethan Mollick
on x
AI slop science posts keep moving markets, this time by misinterpreting or mis-dating papers. Science fiction, but dumb.
-
@jukan05
Jukan
on x
Bro, that shit you guys are hyping dropped in April last year. Why are you acting like it's new now? [image]
-
@friedberg
David Friedberg
on x
Since the first Presidential scientific advisory board, established by FDR in 1933, Presidential science and technology councils have supported policies that advanced research goals, enabled breakthrough scientific discoveries, drove the development of new technologies, and
-
@fundaai
@fundaai
on x
If you consider how paper publication processes typically work at major labs like Gemini and Google, it's reasonable to assume that the most impactful results in the Gemini domain are unlikely to ever be published. Even the papers that do get released are presumably based on
-
@jukan05
Jukan
on x
The fact that memory stocks are crashing because of Google's Turboquant is a pretty good indicator of how many clueless people this market is filled with. It's like saying Aramco should crash because Toyota came out with a next-generation hybrid engine.
-
@dorialexander
Alexander Doria
on x
You only have to read it to realize it's incremental engineering gains; other similar methods exist (KIVI, PolarQuant), and there are trade-offs (random rotation is not free)
-
@dorialexander
Alexander Doria
on x
People going insane on a one year old mid paper makes me very pessimistic over tech literacy. Maybe just as well if the next effective agents take over.