Hey! Really digging the tiered KV cache (SSD offloading) design, it's super handy for long contexts on Mac. Just wondering—does oMLX support 4-bit KV cache quantization yet? Or is it something on the roadmap? I'm trying to push the context limit as far as possible on a memory-constrained machine. Any plans for this? Thanks!
As you can see in ml-explore/mlx-lm#941, the continuous batching that oMLX relies on at its core doesn't support KV cache quantization yet (that PR is still open, and it doesn't appear to implement proper quantization anyway). The first problem is that mlx-lm, oMLX's backend, doesn't support it. But even if I tried to implement it separately, I honestly have a lot of doubts about the effectiveness of KV cache quantization. If you've tried a 4-bit KV cache, you probably know this already: it has a devastating impact, especially on the long-context agentic tasks that oMLX is primarily targeting. 4-bit is practically unusable in my opinion, and I'm skeptical that 8-bit produces acceptable quality either. That's my take on it. Feel free to add any thoughts!
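If it helps to see where the quality loss comes from, here's a minimal round-trip sketch of group-wise affine quantization applied to Gaussian KV-like data. This is purely illustrative: the `fake_quantize` helper, the group size, and the asymmetric min/max scheme are my assumptions for the sketch, not mlx-lm's actual kernels, but they show how fast the reconstruction error grows when you drop from 8 bits to 4.

```python
import numpy as np

def fake_quantize(x, bits=4, group_size=64):
    """Quantize-then-dequantize round trip, group-wise affine (asymmetric).
    Illustrative only; not mlx-lm's actual implementation."""
    orig_shape = x.shape
    g = x.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    # One scale/zero-point per group of `group_size` values.
    scale = (hi - lo) / (2**bits - 1)
    scale = np.where(scale == 0, 1.0, scale)  # guard constant groups
    q = np.round((g - lo) / scale)            # integer codes in [0, 2^bits - 1]
    return (q * scale + lo).reshape(orig_shape).astype(x.dtype)

rng = np.random.default_rng(0)
# Toy KV tensor: (heads, tokens, head_dim) of roughly unit-scale values.
kv = rng.standard_normal((8, 1024, 128)).astype(np.float32)

for bits in (8, 4):
    err = np.abs(fake_quantize(kv, bits=bits) - kv)
    rel = err.mean() / np.abs(kv).mean()
    print(f"{bits}-bit mean relative error: {rel:.2%}")
```

On data like this, the 4-bit error is roughly an order of magnitude larger than the 8-bit error, and in attention that error compounds over every cached token, which is why long-context tasks feel it first.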