PolarQuant Metal
Fused Metal kernels that eliminate the dequantize-on-fetch bottleneck in PolarQuant KV cache on Apple Silicon.
PolarQuant (TurboQuant) achieves ~4.6x KV cache compression via random orthogonal rotation + Lloyd-Max codebook quantization. Existing MLX implementations suffer a 0.5x decode speed penalty because they dequantize the entire cache every attention step. Our fused Metal kernels compute attention directly from packed quantized indices — no dequantization needed.
Result: 75.3 tok/s vs 71.4 tok/s standard — 5% faster than FP16 with 8x KV cache compression on Qwen3.5-35B (M4 Pro).
GitHub | TurboQuant Paper (arXiv:2504.19874) | mlx-lm Issue #1060
3-bit PolarQuant, 32 query heads / 8 KV heads, D=128, M4 Pro. Fused kernels match FP16 speed at 1K tokens and pull ahead at 2K+, while using 4.6x less memory.