As you can see in ml-explore/mlx-lm#941, the continuous batching that oMLX relies on as its core does not yet support KV cache quantization. (That PR is still open, and it doesn't appear to implement proper quantization anyway.)

The first problem is that mlx-lm, oMLX's backend, doesn't support it. But even if I tried to implement it separately, I honestly have serious doubts about the effectiveness of KV cache quantization. If you've tried a 4-bit KV cache, you probably already know that it has a devastating impact, especially on the long-context agentic tasks that oMLX primarily targets. 4-bit is practically unusable in my opinion, and I'm skeptical that even 8-bit produces acceptable quality…
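To make the quality concern concrete, here is a minimal, self-contained sketch (not mlx-lm's actual implementation; the group-wise affine quantizer below is a hypothetical stand-in) that round-trips a cache of key vectors through 8-bit and 4-bit quantization and measures how much the attention logits drift:

```python
# Hypothetical illustration of KV cache quantization error.
# This is NOT mlx-lm's code, just a generic group-wise affine quantizer.
import numpy as np

def quantize_dequantize(x, bits, group_size=32):
    """Round-trip x through a group-wise affine (min/max) quantizer."""
    levels = 2 ** bits - 1
    groups = x.reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    hi = groups.max(axis=1, keepdims=True)
    scale = (hi - lo) / levels
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.round((groups - lo) / scale)       # integer codes in [0, levels]
    return (q * scale + lo).reshape(x.shape)

rng = np.random.default_rng(0)
head_dim = 128
keys = rng.standard_normal((1024, head_dim)).astype(np.float32)  # cached keys
query = rng.standard_normal(head_dim).astype(np.float32)

exact = keys @ query / np.sqrt(head_dim)  # reference attention logits
for bits in (8, 4):
    approx = quantize_dequantize(keys, bits) @ query / np.sqrt(head_dim)
    err = np.abs(approx - exact).max()
    print(f"{bits}-bit KV: max logit error = {err:.4f}")
```

Running this shows the 4-bit error in the logits is roughly an order of magnitude larger than the 8-bit error, and in a real model those errors compound across layers and thousands of cached tokens, which is consistent with the degradation seen on long-context tasks.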

Answer selected by Yif1999