Google Research details TurboQuant, a quantization algorithm to enable massive compression of LLMs and vector search engines without sacrificing accuracy
We introduce a set of advanced, theoretically grounded quantization algorithms that enable massive compression for large language models and vector search engines.
Google Research
Related Coverage
- Google unveils TurboQuant, a new AI memory compression algorithm — and yes, the internet is calling it ‘Pied Piper’ TechCrunch · Sarah Perez
- Google's new TurboQuant algorithm speeds up AI memory 8x, cutting costs by 50% or more VentureBeat · Carl Franzen
- Google's TurboQuant AI-compression algorithm can reduce LLM memory usage by 6x Ars Technica · Ryan Whitwam
- Google's TurboQuant Algorithm Slashes LLM Memory Use by 6x WinBuzzer · Markus Kasanmascheff
- Google develops TurboQuant compression technology for AI models SiliconANGLE · Maria Deutscher
- Google Shrinks AI Memory With No Accuracy Loss—But There's a Catch Decrypt · Jose Antonio Lanz
- Google Breakthrough Spurs Chip Selloff Despite Analyst Doubt Bloomberg · Kurt Schussler
- Google's new compression algorithm cut memory stocks within hours of publication The Next Web · Alina Maria Stan
- Micron's stock is dropping. Is Google partly to blame? MarketWatch · Britney Nguyen
- Google Research outlines algorithms that may ease AI memory squeeze Constellation Research · Larry Dignan
- Google's TurboQuant compresses AI memory by 6x without losing accuracy Efficienist · Ivan Jenic
- TurboQuant: Redefining AI efficiency with extreme compression Hacker News
- A Google AI breakthrough is pressuring memory chip stocks from Samsung to Micron CNBC · Arjun Kharpal
- TurboQuant: Redefining AI efficiency with extreme compression Lobsters
Discussion
-
@eastdakota
Matthew Prince
on x
This is Google's DeepSeek. So much more room to optimize AI inference for speed, memory usage, power consumption, and multi-tenant utilization. Lots of teams at @Cloudflare focused on these areas. #staytuned
-
@googleresearch
@googleresearch
on x
Introducing TurboQuant: Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency. Read the blog to learn how it achieves these results: https://research.google/...
-
@prince_canuma
Prince Canuma
on x
Just implemented Google's TurboQuant in MLX and the results are wild! Needle-in-a-haystack using Qwen3.5-35B-A3B across 8.5K, 32.7K, and 64.2K context lengths: → 6/6 exact match at every quant level → TurboQuant 2.5-bit: 4.9x smaller KV cache → TurboQuant 3.5-bit: 3.8x
-
@onlyxuanwo
@onlyxuanwo
on x
I know Pied Piper is real
-
@emollick
Ethan Mollick
on x
AI slop science posts keep moving markets, this time by misinterpreting or mis-dating papers. Science fiction, but dumb.
-
@dorialexander
Alexander Doria
on x
You only have to read it to realize it's incremental engineering gains, other similar methods exist (KIVI, PolarQuant), there are trade-offs (random rotation is not free)
-
@jukan05
Jukan
on x
Bro, that shit you guys are hyping dropped in April last year. Why are you acting like it's new now?
-
@dorialexander
Alexander Doria
on x
People going insane on a one year old mid paper makes me very pessimistic over tech literacy. Maybe just as well if the next effective agents take over.
-
@friedberg
David Friedberg
on x
Since the first Presidential scientific advisory board, established by FDR in 1933, Presidential science and technology councils have supported policies that advanced research goals, enabled breakthrough scientific discoveries, drove the development of new technologies, and
-
@markosaaig
Markos
on x
I'm laughing here with @jukan05 because this dropping shows exactly who understands the memory system and who don't. Probably the same people who constantly only scream micron, SK have a low forward PE TurboQuant lowers cost per token. and expands the context window so you can
-
@omercheeema
Omer Cheema
on x
In every AI chat, the model keeps your entire conversation in KV cache. On a 70B model, that alone can eat 40GB+ of GPU RAM, more than the model. Google just dropped TurboQuant: a new compression algorithm that shrinks the KV cache by 6x down to just 3 bits per value — with
-
@jukan05
Jukan
on x
The fact that memory stocks are crashing because of Google's Turboquant is a pretty good indicator of how many clueless people this market is filled with. It's like saying Aramco should crash because Toyota came out with a next-generation hybrid engine.
-
@jenzhuscott
Jen Zhu
on x
When I was consulting for @HBO Silicon Valley, zero-loss compression was the holy grail Richard Hendricks chases that perfect middle-out algo could shrink everything w/out breaking a single bit. Google just did something even more practical for the AI era: TurboQuant compresses […
-
@stocksavvyshay
Shay Boloor
on x
$MU and $SNDK are getting hit hard at the open from the release of $GOOGL TurboQuant. The market is reading it as a potential headwind for memory names because long-context AI inference may now need far less memory per workload.
-
@benbajarin
Ben Bajarin
on x
Ok, class, listen up. Your homework today is to come up with scenarios where TurboQuant is negative to memory demand and scenarios where it may actually boost memory demand.
-
@mweinbach
Max Weinbach
on x
On TurboQuant, I see 2 possible outcomes 1. we reallocate memory towards larger models and larger context, so freed up memory goes to improving models 2. we keep everything as is and just shrink costs to generate tokens I'd wager we see the shift towards 1, not 2
-
@kimmonismus
@kimmonismus
on x
Thats freaking awesome: Google Research has introduced TurboQuant, a compression algorithm (presenting at ICLR 2026) that shrinks the memory footprint of large language models by at least 6x, without any retraining or drop in accuracy. It works by converting data into a polar
-
@aminkarbasi
Amin Karbasi
on x
I left @GoogleResearch almost two years ago, so it makes me genuinely happy to see our work on polar quantization (my last project), which eventually led to extreme compression, being recognized there. It is a nice reminder that good fundamental work tends to find its place with
-
@fundaai
@fundaai
on x
If you consider how paper publication processes typically work at major labs like Gemini and Google, it's reasonable to assume that the most impactful results in the Gemini domain are unlikely to ever be published. Even the papers that do get released are presumably based on
-
@julientechinvst
Julien
on x
Memory makers bloodbath tomorrow (likely). It a huge progress and is an step forward in removing the memory bottleneck. Now, it will put more pressure on logic side though
-
@themylesfiles
Myles
on x
I'm an interactive learner, so I turned Google's TurboQuant paper into a @marimo_io notebook. Random rotations → Beta distributions → optimal 3-bit quantization → 6x memory savings on LLM KV caches. Way easier to grok when you can drag a slider and watch the math happen.
-
@jordannanos
Jordan Nanos
on x
Jevons paradox for KV cache
-
@f4micom
@f4micom
on x
i hope this is open and i hope that if it's not it inspires an open implementation of the same base concept
-
@8teapi
Prakash
on x
Interesting. By releasing this publicly they can reduce memory demand across the sector, which helps because they were slow to secure memory capacity in Asia
-
@kmeanskaran
Karan
on x
Mark my words, inference engineering will be an evergreen in-demand skill for the upcoming years. If you are in applied ML, then it's a top priority skill. People using AI will grow, along with companies integrating AI. Nowadays, speed matters the most and we are losing
-
@brianroemmele
Brian Roemmele
on x
We are testing TurboQuant at the Zero-Human Company and are fascinated by the speed up! We are at a consistent 5x increase! More testing...
-
@bqbrady
Benedict
on x
All you had to do was pay attention to the polar coordinates lecture in Trig and you could have discovered a 6x reduction in KV cache memory. High school math vindicated
-
@anisha_moonka
Anish Moonka
on x
Every time you message an AI chatbot, the model stores your entire conversation in temporary memory called a KV cache (a cheat sheet so it doesn't re-read everything from scratch). On a large model like Llama 70B running a long conversation, that cache alone eats 40GB of GPU
-
@sudoingx
@sudoingx
on x
thank you google. for all your contributions to make the world a better place. we need more of this not more of altman gambles.
-
@joshkale
Josh Kale
on x
This post got ZERO attention but is BY FAR the biggest AI news this week Google just published TurboQuant: a compression algorithm that makes AI inference 8x faster while using 6x less memory. No retraining. No accuracy loss. The biggest cost is inference which happens billions […
-
@rough__sea
Ryan Dahl
on x
I'm surprised Google is publishing this - seems like good IP
-
@firstadopter
Tae Kim
on x
Wat! “Our new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup, all with zero accuracy loss, redefining AI efficiency.”
-
@raffi_hotter
Raffi Hotter
on x
This algorithm uses one of my favourite theorems in math, the Johnson-Lindenstrauss Lemma, which says you can drastically reduce the dimensionality of n points to just log(n) dimensions and still preserve pairwise distances
-
@0xsero
@0xsero
on x
Testing this tomorrow, will report back if it works on Qwen3.5 Might be able to run much larger models if this works.
-
@matthewberman
Matthew Berman
on x
this is a big deal. 6x reduction in kv mem and 8x speed up is incredible...let alone ZERO accuracy loss.
-
@alexfinn
Alex Finn
on x
This is potentially the biggest news of the year Google just released TurboQuant. An algorithm that makes LLM's smaller and faster, without losing quality Meaning that 16gb Mac Mini now can run INCREDIBLE AI models. Completely locally, free, and secure This also means: •
-
@timkellogg.me
Tim Kellogg
on bluesky
PolarQuant: 6x memory reduction 8x speed improvement — weirdly, this works for both KV-cache and vector DBs — the gist is they convert from cartesian coordinate vectors into polar coordinates, and since they're always normalized to 1.0, they drop the magnitude too — researc…
-
r/technology
on reddit
Google unveils TurboQuant, a new AI memory compression algorithm — and yes, the internet is calling it ‘Pied Piper’ | TechCrunch
-
r/accelerate
on reddit
Google Research introduces TurboQuant: A new compression algorithm that reduces LLM key-value cache memory by at least 6x and delivers up to 8x speedup …
-
r/LocalLLaMA
on reddit
[google research] TurboQuant: Redefining AI efficiency with extreme compression
-
r/Bard
on reddit
Google Research: TurboQuant achieves 6x KV cache compression with zero accuracy loss
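
Several posts above cite a KV cache of roughly 40GB for a 70B model. The arithmetic behind a figure like that can be sketched as below. This is a back-of-the-envelope estimate, not anything taken from the TurboQuant paper: the configuration (80 layers, 8 key-value heads, head dimension 128, 16-bit baseline values, a 128K-token context) is an assumed Llama-70B-style setup, and the function names are hypothetical. The "ideal" ratio ignores any per-block metadata a real quantizer stores, so measured savings may come out lower.

```python
def kv_cache_bytes(seq_len, n_layers=80, n_kv_heads=8, head_dim=128,
                   bits_per_value=16.0):
    """Approximate KV cache size in bytes for a single sequence.

    Default shape parameters are an ASSUMED Llama-70B-style configuration,
    not figures from the posts above.
    """
    # Each token stores one key and one value vector per layer per KV head.
    values_per_token = 2 * n_layers * n_kv_heads * head_dim
    return seq_len * values_per_token * bits_per_value / 8


def ideal_compression_ratio(bits_per_value, baseline_bits=16.0):
    """Best-case size reduction versus the baseline, ignoring metadata overhead."""
    return baseline_bits / bits_per_value


if __name__ == "__main__":
    seq_len = 128 * 1024  # assumed long-context conversation, in tokens
    for bits in (16.0, 3.5, 3.0, 2.5):
        gib = kv_cache_bytes(seq_len, bits_per_value=bits) / 2**30
        ratio = ideal_compression_ratio(bits)
        print(f"{bits:>4} bits/value: {gib:6.2f} GiB (ideal ratio {ratio:.2f}x)")
```

Under these assumptions the 16-bit cache comes to 40 GiB at 128K tokens, which is in line with the figure quoted in the posts. Changing `seq_len` or `bits_per_value` shows how context length and bit width trade off against memory.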