GOOGLE TURBOQUANT — INFERENCE EFFICIENCY BREAKTHROUGH
6x Smaller KV Cache | Up to 8x Faster Attention Logits | No Measured Accuracy Loss
■ WHAT IT IS
Training-free compression algorithm targeting the KV cache (the per-token keys and values an LLM keeps as working memory during inference).
Compresses each cache value from 16 bits → 3 bits. No retraining required.
Drop-in on production models. PyTorch/MLX ports live within 24h of release.
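
To make the 16-bit → 3-bit claim concrete, here is a back-of-envelope sizing sketch. The model shape (32 layers, 8 KV heads, head dim 128, 128K context) is a hypothetical example, not a figure from the release, and the helper name kv_cache_bytes is made up for illustration. Note that the raw bit-width ratio is 16/3 ≈ 5.3x; the 6x headline figure comes from the reported benchmarks and is not derived by this arithmetic.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bits_per_value, batch=1):
    """Bytes needed to hold keys and values for every layer and token."""
    # Factor of 2: one key vector and one value vector per token, head, and layer.
    n_values = 2 * n_layers * n_kv_heads * head_dim * seq_len * batch
    return n_values * bits_per_value / 8


# Hypothetical model shape, chosen only to give round numbers.
cfg = dict(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=131072)

fp16_bytes = kv_cache_bytes(**cfg, bits_per_value=16)
q3_bytes = kv_cache_bytes(**cfg, bits_per_value=3)

GIB = 2 ** 30
print(f"16-bit cache: {fp16_bytes / GIB:.1f} GiB")   # 16.0 GiB
print(f" 3-bit cache: {q3_bytes / GIB:.1f} GiB")     # 3.0 GiB
print(f"raw ratio:    {fp16_bytes / q3_bytes:.2f}x") # 5.33x
```

Read the other way, a fixed memory budget holds roughly 5.3x more context tokens at 3 bits than at 16 bits under these assumptions.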
■ HOW IT WORKS
Stage 1 – PolarQuant: randomly rotates KV vectors and re-expresses them in polar coordinates, which removes the per-block normalization constants that conventional quantizers must store (a 1-2 bit per-value tax that erodes most compression schemes).
Stage 2 – QJL (Quantized Johnson-Lindenstrauss): projects vectors with a Johnson-Lindenstrauss transform and keeps only the sign bits (+1/-1). Zero memory overhead. The query stays in high precision, and the resulting estimator preserves attention-score accuracy.
Result: approaches the information-theoretic optimum for quantization distortion. Online, data-oblivious, accelerator-friendly.
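
The two stages can be illustrated with a toy NumPy sketch. This is a simplified illustration under stated assumptions, not Google's implementation: the polar step here is single-level (coordinate pairs only) and keeps radii in full precision, the sign-bit step stores one norm per key, and the projection width m is set far larger than any practical setting so the estimate is visibly tight. All function names are hypothetical.

```python
import numpy as np


def random_rotation(d, rng):
    """Random orthogonal matrix from the QR decomposition of a Gaussian matrix."""
    q, _ = np.linalg.qr(rng.standard_normal((d, d)))
    return q


def polar_quantize(x, rot, angle_bits):
    """Rotate x, pair up coordinates, and quantize each pair's angle.

    Assumption: radii stay in full precision; only angles are quantized.
    """
    y = rot @ x
    a, b = y[0::2], y[1::2]
    radius = np.hypot(a, b)
    theta = np.arctan2(b, a)                       # in [-pi, pi]
    levels = 2 ** angle_bits
    idx = np.floor((theta + np.pi) / (2 * np.pi) * levels).astype(np.int64)
    idx = np.clip(idx, 0, levels - 1)              # theta == pi falls in the last bin
    return radius, idx


def polar_dequantize(radius, idx, rot, angle_bits):
    """Rebuild the vector from radii and angle-bin centers, then undo the rotation."""
    levels = 2 ** angle_bits
    theta = (idx + 0.5) * (2 * np.pi / levels) - np.pi
    y = np.empty(2 * radius.size)
    y[0::2] = radius * np.cos(theta)
    y[1::2] = radius * np.sin(theta)
    return rot.T @ y                               # rot is orthogonal, so rot.T inverts it


def qjl_encode(k, proj):
    """Keep only the sign bits of the projected key, plus the key's norm."""
    signs = np.where(proj @ k >= 0, 1.0, -1.0)
    return signs, float(np.linalg.norm(k))


def qjl_inner_product(q, signs, k_norm, proj):
    """Estimate <q, k> from a full-precision query and the key's sign bits."""
    m = proj.shape[0]
    return float(np.sqrt(np.pi / 2) / m * k_norm * ((proj @ q) @ signs))


rng = np.random.default_rng(0)
d, m = 64, 4096                                    # m is deliberately oversized for the demo
k = rng.standard_normal(d)
q = rng.standard_normal(d)

# Stage 1 illustration: 4-bit angles, no per-block scale or zero point stored.
rot = random_rotation(d, rng)
radius, idx = polar_quantize(k, rot, angle_bits=4)
k_hat = polar_dequantize(radius, idx, rot, angle_bits=4)
rel_err = np.linalg.norm(k - k_hat) / np.linalg.norm(k)

# Stage 2 illustration: sign bits only on the key side.
proj = rng.standard_normal((m, d))
signs, k_norm = qjl_encode(k, proj)
estimate = qjl_inner_product(q, signs, k_norm, proj)
exact = float(q @ k)

print(f"polar reconstruction relative error: {rel_err:.3f}")
print(f"exact <q,k> = {exact:.3f}, sign-bit estimate = {estimate:.3f}")
```

The sqrt(pi/2) factor makes the sign-bit estimator unbiased for Gaussian projections, since E|z| = sqrt(2/pi) for a standard normal z. With 4-bit angles the polar reconstruction error is bounded by 2·sin(pi/32) ≈ 0.196 of the vector norm in this sketch.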
■ BENCHMARKS (H100)
- 4-bit implementation: up to 8x speedup on attention-logit computation vs. unquantized 32-bit keys
- 3-bit: 6x KV cache reduction, zero degradation on LongBench / NIAH / RULER / L-Eval
Community test (MLX / Qwen3.5-35B, 8.5K–64K ctx): 100% output match at 2.5-bit
■ MARKET REACTION
Company     Move    Exchange
SK Hynix    -6.0%   KRX
Kioxia      -5.9%   TSE
SanDisk     -5.7%   NASDAQ
Samsung     -4.9%   KRX
WDC         -4.7%   NASDAQ
Micron      -3.0%   NASDAQ
■ ANALYST SPLIT
- Wells Fargo (Rocha): "Directly attacking the cost curve. Calls into question how much memory capacity is needed." — bearish on near-term demand; adoption TBD.
- Morgan Stanley: the technique does not touch model weights or training HBM. Bullish.
- SemiAnalysis (Wang): bottleneck relief → more capable models → more hardware.
- Quilter Cheviot: "Evolutionary, not revolutionary." Cyclical sell-off, not structural.
- Cloudflare CEO Prince: "Google's DeepSeek moment."
■ MY VIEW
Overreaction. TurboQuant is inference-only: training HBM demand (the supercycle driver) is unaffected by it. Jevons dynamics apply: a 6x smaller KV cache makes long-context inference cheaper → longer contexts, more agents, more RAG pipelines deployed.
MU/SK Hynix weakness is technical/positional, not a demand inflection.
Watch Q2 datacenter capex guidance from AMZN (AWS), MSFT (Azure), and GOOGL as the real signal.