Google Research details TurboQuant, a quantization algorithm to enable massive compression of LLMs and vector search engines without sacrificing accuracy
Amir Zandieh, Research Scientist, and Vahab Mirrokni, VP and Google Fellow, Google Research — We introduce a set …
Google Research
Related Coverage
- Google's TurboQuant reduces AI LLM cache memory capacity requirements by at least six times … Tom's Hardware · Luke James
- Google Research outlines algorithms that may ease AI memory squeeze Constellation Research · Larry Dignan
- SanDisk (SNDK) Shares Slide 5% as Google Innovation Threatens Memory Demand Blockonomi · Trader Edge
- Google's TurboQuant compresses AI memory by 6x without losing accuracy Efficienist · Ivan Jenic
- Google's TurboQuant cuts AI memory use without losing accuracy Help Net Security · Anamarija Pogorelec
- Google Introduces TurboQuant: A New Compression Algorithm that Reduces LLM Key-Value Cache Memory by 6x and Delivers Up to 8x Speedup, All with Zero Accuracy Loss MarkTechPost · Asif Razzaq
- TurboQuant: Redefining AI efficiency with extreme compression Hacker News
- TurboQuant: Redefining AI efficiency with extreme compression Lobsters
- 1 Introduction — Vector quantization (VQ) in Euclidean space is crucial for efficiently … arXiv.org e-Print archive
- Google's TurboQuant AI-compression algorithm can reduce LLM memory usage by 6x Ars Technica · Ryan Whitwam
- Google's new TurboQuant algorithm speeds up AI memory 8x, cutting costs by 50% or more VentureBeat · Carl Franzen
- Memory Stocks Slide As Google's New AI Efficiency Breakthrough May Slash Data Storage Needs Benzinga · Kaustubh Bagalkote
- Google unveils TurboQuant, a new AI memory compression algorithm — and yes, the internet is calling it ‘Pied Piper’ TechCrunch · Sarah Perez
- Micron's stock is dropping. Is Google partly to blame? MarketWatch · Britney Nguyen
- Google's new compression algorithm cut memory stocks within hours of publication The Next Web · Alina Maria Stan
Discussion
-
@eastdakota
Matthew Prince
on x
This is Google's DeepSeek. So much more room to optimize AI inference for speed, memory usage, power consumption, and multi-tenant utilization. Lots of teams at @Cloudflare focused on these areas. #staytuned
-
@markosaaig
Markos
on x
I'm laughing here with @jukan05 because this drop shows exactly who understands the memory system and who doesn't. Probably the same people who constantly scream that Micron and SK have a low forward P/E. TurboQuant lowers cost per token and expands the context window so you can
-
@omercheeema
Omer Cheema
on x
In every AI chat, the model keeps your entire conversation in KV cache. On a 70B model, that alone can eat 40GB+ of GPU RAM, more than the model. Google just dropped TurboQuant: a new compression algorithm that shrinks the KV cache by 6x down to just 3 bits per value — with
-
@jenzhuscott
Jen Zhu
on x
When I was consulting for @HBO Silicon Valley, zero-loss compression was the holy grail Richard Hendricks chased: that perfect middle-out algo could shrink everything w/out breaking a single bit. Google just did something even more practical for the AI era: TurboQuant compresses […
-
@stocksavvyshay
Shay Boloor
on x
$MU and $SNDK are getting hit hard at the open from the release of $GOOGL TurboQuant. The market is reading it as a potential headwind for memory names because long-context AI inference may now need far less memory per workload. [image]
-
@benbajarin
Ben Bajarin
on x
Ok, class, listen up. Your homework today is to come up with scenarios where TurboQuant is negative to memory demand and scenarios where it may actually boost memory demand.
-
@googleresearch
@googleresearch
on x
Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: https://research.google/... [im…
-
@anisha_moonka
Anish Moonka
on x
Every time you message an AI chatbot, the model stores your entire conversation in temporary memory called a KV cache (a cheat sheet so it doesn't re-read everything from scratch). On a large model like Llama 70B running a long conversation, that cache alone eats 40GB of GPU
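The cache-size claim in the post above can be checked with quick back-of-the-envelope arithmetic. The architecture numbers below are assumptions for a Llama-3-70B-class model with grouped-query attention (80 layers, 8 KV heads, head dimension 128), not figures from the post:

```python
# Back-of-the-envelope KV cache size for a Llama-70B-class model.
# Layer/head counts are assumptions (Llama-3-70B uses GQA with
# 80 layers, 8 KV heads, head dim 128); adjust for your model.
layers = 80
kv_heads = 8
head_dim = 128
bytes_fp16 = 2
seq_len = 128_000  # a long conversation / full context window

# K and V each store one vector per layer, per KV head, per token.
bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_fp16
cache_gb = bytes_per_token * seq_len / 1e9
print(f"{bytes_per_token} bytes/token, {cache_gb:.1f} GB at {seq_len} tokens")

# Going from 16 bits to ~3 bits per value is roughly a 16/3 ≈ 5.3x
# reduction before metadata; the reported "at least 6x" presumably
# reflects further techniques from the paper.
print(f"~3-bit cache: {cache_gb * 3 / 16:.1f} GB")
```

At these assumed settings the fp16 cache lands just above 40 GB, consistent with the "40GB of GPU" figure quoted in the post.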
-
@f4micom
@f4micom
on x
i hope this is open and i hope that if it's not it inspires an open implementation of the same base concept
-
@joshkale
Josh Kale
on x
This post got ZERO attention but is BY FAR the biggest AI news this week. Google just published TurboQuant: a compression algorithm that makes AI inference 8x faster while using 6x less memory. No retraining. No accuracy loss. The biggest cost is inference, which happens billions […
-
@onlyxuanwo
@onlyxuanwo
on x
I know Pied Piper is real
-
@firstadopter
Tae Kim
on x
Wat! “Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency.” [image]
-
@kmeanskaran
Karan
on x
Mark my words, inference engineering will be an evergreen in-demand skill for the upcoming years. If you are in applied ML, then it's a top priority skill. People using AI will grow, along with companies integrating AI. Nowadays, speed matters the most and we are losing
-
@0xsero
@0xsero
on x
Testing this tomorrow, will report back if it works on Qwen3.5 Might be able to run much larger models if this works. [image]
-
@alexfinn
Alex Finn
on x
This is potentially the biggest news of the year. Google just released TurboQuant, an algorithm that makes LLMs smaller and faster without losing quality. Meaning that 16GB Mac Mini can now run INCREDIBLE AI models. Completely locally, free, and secure. This also means: •
-
@brianroemmele
Brian Roemmele
on x
We are testing TurboQuant at the Zero-Human Company and are fascinated by the speed up! We are at a consistent 5x increase! More testing...
-
@aminkarbasi
Amin Karbasi
on x
I left @GoogleResearch almost two years ago, so it makes me genuinely happy to see our work on polar quantization (my last project), which eventually led to extreme compression, being recognized there. It is a nice reminder that good fundamental work tends to find its place with
-
@bqbrady
Benedict
on x
All you had to do was pay attention to the polar coordinates lecture in Trig and you could have discovered a 6x reduction in KV cache memory. High school math vindicated
-
@sudoingx
@sudoingx
on x
thank you google. for all your contributions to make the world a better place. we need more of this not more of altman gambles.
-
@kimmonismus
@kimmonismus
on x
That's freaking awesome: Google Research has introduced TurboQuant, a compression algorithm (presenting at ICLR 2026) that shrinks the memory footprint of large language models by at least 6x, without any retraining or drop in accuracy. It works by converting data into a polar [im…
-
@themylesfiles
Myles
on x
I'm an interactive learner, so I turned Google's TurboQuant paper into a @marimo_io notebook. Random rotations → Beta distributions → optimal 3-bit quantization → 6x memory savings on LLM KV caches. Way easier to grok when you can drag a slider and watch the math happen.
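The pipeline that notebook thread describes (random rotations, then optimal low-bit quantization) can be sketched generically in NumPy. This is an illustration of the general rotate-then-quantize technique under simplified assumptions (a uniform 3-bit scalar quantizer), not the paper's actual algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 128
x = rng.standard_normal(d).astype(np.float32)

# Random orthogonal rotation: after rotating, the coordinates of any
# fixed vector look i.i.d.-Gaussian-like, so one scalar quantizer
# works reasonably well across all dimensions.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
xr = Q @ x

# Uniform 3-bit quantization of the rotated coordinates (8 levels).
bits = 3
lo, hi = xr.min(), xr.max()
levels = 2**bits
codes = np.clip(np.round((xr - lo) / (hi - lo) * (levels - 1)), 0, levels - 1)
xq = codes / (levels - 1) * (hi - lo) + lo

# Undo the rotation and measure the reconstruction error.
x_hat = Q.T @ xq
rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
print(f"relative L2 error at {bits} bits: {rel_err:.3f}")
```

Even this naive quantizer keeps the relative error modest at 3 bits; the paper's contribution is in doing much better than a uniform grid.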
-
@8teapi
Prakash
on x
Interesting. By releasing this publicly they can reduce memory demand across the sector, which helps because they were slow to secure memory capacity in Asia
-
@rough__sea
Ryan Dahl
on x
I'm surprised Google is publishing this - seems like good IP
-
@prince_canuma
Prince Canuma
on x
Just implemented Google's TurboQuant in MLX and the results are wild! Needle-in-a-haystack using Qwen3.5-35B-A3B across 8.5K, 32.7K, and 64.2K context lengths: → 6/6 exact match at every quant level → TurboQuant 2.5-bit: 4.9x smaller KV cache → TurboQuant 3.5-bit: 3.8x [image]
-
@jordannanos
Jordan Nanos
on x
Jevons paradox for KV cache
-
@raffi_hotter
Raffi Hotter
on x
This algorithm uses one of my favourite theorems in math, the Johnson-Lindenstrauss Lemma, which says you can drastically reduce the dimensionality of n points to just O(log n) dimensions and still preserve pairwise distances
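The Johnson-Lindenstrauss property mentioned above is easy to verify empirically: project points from a high dimension down to a much smaller one with a random Gaussian matrix and check that pairwise distances are roughly preserved. A generic illustration (the dimensions below are arbitrary, not tied to TurboQuant):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 200, 4096, 256  # n points, original dim d, target dim k

X = rng.standard_normal((n, d))
# Random Gaussian projection, scaled so squared norms are preserved
# in expectation.
P = rng.standard_normal((d, k)) / np.sqrt(k)
Y = X @ P

# Compare a sample of pairwise distances before and after projection.
i, j = rng.integers(0, n, 500), rng.integers(0, n, 500)
mask = i != j
d_orig = np.linalg.norm(X[i[mask]] - X[j[mask]], axis=1)
d_proj = np.linalg.norm(Y[i[mask]] - Y[j[mask]], axis=1)
ratio = d_proj / d_orig
print(f"distance ratio: mean {ratio.mean():.3f}, "
      f"min {ratio.min():.3f}, max {ratio.max():.3f}")
```

The ratios concentrate tightly around 1.0, with distortion shrinking as the target dimension k grows.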
-
@matthewberman
Matthew Berman
on x
this is a big deal. 6x reduction in kv mem and 8x speed up is incredible...let alone ZERO accuracy loss.
-
@mweinbach
Max Weinbach
on x
On TurboQuant, I see 2 possible outcomes 1. we reallocate memory towards larger models and larger context, so freed up memory goes to improving models 2. we keep everything as is and just shrink costs to generate tokens I'd wager we see the shift towards 1, not 2
-
@julientechinvst
Julien
on x
Memory makers bloodbath tomorrow (likely). It's huge progress and a step forward in removing the memory bottleneck. Now, it will put more pressure on the logic side though
-
@timkellogg.me
Tim Kellogg
on bluesky
PolarQuant: 6x memory reduction 8x speed improvement — weirdly, this works for both KV-cache and vector DBs — the gist is they convert from cartesian coordinate vectors into polar coordinates, and since they're always normalized to 1.0, they drop the magnitude too — researc…
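The polar-coordinate trick described in that post can be sketched in the 2-D case: a normalized vector (x, y) has magnitude 1, so it is fully determined by a single angle, and you quantize one number instead of two. A toy illustration only; real KV vectors are high-dimensional and would use spherical coordinates:

```python
import numpy as np

rng = np.random.default_rng(2)
v = rng.standard_normal(2)
v /= np.linalg.norm(v)  # normalized: magnitude is exactly 1, so drop it

# Encode: one angle instead of two Cartesian coordinates.
theta = np.arctan2(v[1], v[0])

# Quantize the angle to 8 bits over [-pi, pi].
levels = 2**8
code = int(np.round((theta + np.pi) / (2 * np.pi) * (levels - 1)))
theta_hat = code / (levels - 1) * 2 * np.pi - np.pi

# Decode back to Cartesian coordinates.
v_hat = np.array([np.cos(theta_hat), np.sin(theta_hat)])
err = np.linalg.norm(v - v_hat)
print(f"reconstruction error from one 8-bit angle: {err:.4f}")
```

Note the decoded vector is exactly unit-norm by construction, which is part of the appeal: quantization error lives only in the angle, never in the magnitude.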
-
r/accelerate
on reddit
Google Research introduces TurboQuant: A new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup …
-
r/LocalLLaMA
on reddit
[google research] TurboQuant: Redefining AI efficiency with extreme compression
-
@emollick
Ethan Mollick
on x
AI slop science posts keep moving markets, this time by misinterpreting or mis-dating papers. Science fiction, but dumb.
-
@jukan05
Jukan
on x
Bro, that shit you guys are hyping dropped in April last year. Why are you acting like it's new now? [image]
-
@friedberg
David Friedberg
on x
Since the first Presidential scientific advisory board, established by FDR in 1933, Presidential science and technology councils have supported policies that advanced research goals, enabled breakthrough scientific discoveries, drove the development of new technologies, and
-
@fundaai
@fundaai
on x
If you consider how paper publication processes typically work at major labs like Gemini and Google, it's reasonable to assume that the most impactful results in the Gemini domain are unlikely to ever be published. Even the papers that do get released are presumably based on
-
@jukan05
Jukan
on x
The fact that memory stocks are crashing because of Google's Turboquant is a pretty good indicator of how many clueless people this market is filled with. It's like saying Aramco should crash because Toyota came out with a next-generation hybrid engine.
-
@dorialexander
Alexander Doria
on x
You only have to read it to realize it's incremental engineering gains; other similar methods exist (KIVI, PolarQuant), and there are trade-offs (random rotation is not free)
-
@dorialexander
Alexander Doria
on x
People going insane on a one year old mid paper makes me very pessimistic over tech literacy. Maybe just as well if the next effective agents take over.