PolarQuant Metal

Fused Metal kernels that eliminate the dequantize-on-fetch bottleneck in PolarQuant KV cache on Apple Silicon.

PolarQuant (TurboQuant) achieves ~4.6x KV cache compression via a random orthogonal rotation followed by Lloyd-Max codebook quantization. Existing MLX implementations run decode at about 0.5x speed because they dequantize the entire cache at every attention step. Our fused Metal kernels compute attention directly from the packed quantized indices, with no separate dequantization step.
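To make the idea concrete, here is a minimal pure-Python sketch of why attention scores can be read straight from quantized indices. It is not the Metal kernel and may not match how this repository's kernels are organized. It relies only on the identity q · (norm · Rᵀy) = norm · (Rq) · y, so rotating the query once replaces rebuilding every key. The codebook levels, the per-key norm storage, and every function name below are illustrative assumptions, and indices are left unpacked for readability.

```python
import math
import random

# Placeholder 2-bit codebook (4 levels). NOT a trained Lloyd-Max codebook.
LEVELS = [-1.5, -0.5, 0.5, 1.5]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def random_orthogonal(d, rng):
    """Build a d x d orthogonal matrix by Gram-Schmidt on Gaussian rows."""
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for r in rows:
            proj = dot(v, r)
            v = [x - proj * y for x, y in zip(v, r)]
        n = math.sqrt(dot(v, v))
        if n > 1e-8:  # skip a degenerate draw
            rows.append([x / n for x in v])
    return rows

def rotate(rot, x):
    """Apply the rotation: returns rot @ x."""
    return [dot(row, x) for row in rot]

def quantize_key(k, rot, codebook):
    """Store a key as (norm, indices): rotate the unit vector, snap each
    coordinate to its nearest codebook level."""
    norm = math.sqrt(dot(k, k))
    y = rotate(rot, [x / norm for x in k])
    idx = [min(range(len(codebook)), key=lambda j: abs(codebook[j] - c)) for c in y]
    return norm, idx

def score_via_dequant(q, norm, idx, rot, codebook):
    """Baseline: rebuild the key as norm * rot^T y, then take q . k_hat."""
    y = [codebook[j] for j in idx]
    d = len(y)
    k_hat = [norm * sum(rot[i][c] * y[i] for i in range(d)) for c in range(d)]
    return dot(q, k_hat)

def score_fused(q_rot, norm, idx, codebook):
    """Fused: look up codebook levels against the pre-rotated query;
    no key vector is ever rebuilt."""
    return norm * sum(qr * codebook[j] for qr, j in zip(q_rot, idx))

rng = random.Random(0)
d = 16
rot = random_orthogonal(d, rng)
# Rotated unit-vector coordinates have scale ~1/sqrt(d), so scale the levels.
codebook = [lvl / math.sqrt(d) for lvl in LEVELS]
keys = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(4)]
query = [rng.gauss(0.0, 1.0) for _ in range(d)]

cache = [quantize_key(k, rot, codebook) for k in keys]
q_rot = rotate(rot, query)  # done once per decode step, not once per key

baseline_scores = [score_via_dequant(query, n, idx, rot, codebook) for n, idx in cache]
fused_scores = [score_fused(q_rot, n, idx, codebook) for n, idx in cache]
print(max(abs(a - b) for a, b in zip(baseline_scores, fused_scores)))
```

The two score lists agree up to floating-point rounding. The baseline path does O(d²) work per cached key to undo the rotation, while the fused path does O(d) lookups per key after a single query rotation.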

Result: 75.3 tok/s versus 71.4 tok/s for the standard FP16 KV cache, about 5% faster, with 8x KV cache compression on Qwen3.5-35B (M4 Pro).

GitHub | TurboQuant Paper (arXiv:2504.19874) | mlx-lm Issue #1060