

KV-cache for the MTP-model all the way down to q4_0.
Following the other reply from @brucethemoose@lemmy.world, I did try q8_0/q5_1. It works pretty well.
I’ll try Gemma-4-12B QAT when I get time :)
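
For anyone wondering how much the cache types actually save, here is a rough sketch of the arithmetic. The block sizes (bytes per 32 elements) are the ggml ones for f16, q8_0, q5_1 and q4_0. The model dimensions are hypothetical placeholders, not the real config of any model mentioned here, and the formula assumes every layer keeps a plain K and V cache, which may not hold for hybrid architectures.

```python
# Bytes needed to store 32 cache elements in each ggml type.
# f16: 32 * 2 bytes; q8_0: 2-byte scale + 32 int8; q5_1: scale + min + 4 bytes
# of high bits + 16 bytes of nibbles; q4_0: 2-byte scale + 16 bytes of nibbles.
BLOCK_BYTES = {"f16": 64, "q8_0": 34, "q5_1": 24, "q4_0": 18}
BLOCK_SIZE = 32


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_ctx, type_k, type_v):
    """Estimate total KV-cache size in bytes for the given K and V cache types."""
    # Number of elements stored for K (and the same again for V).
    elems = n_layers * n_kv_heads * head_dim * n_ctx
    blocks = elems // BLOCK_SIZE
    return blocks * BLOCK_BYTES[type_k] + blocks * BLOCK_BYTES[type_v]


if __name__ == "__main__":
    # Hypothetical dimensions, chosen only for illustration.
    dims = dict(n_layers=64, n_kv_heads=8, head_dim=128, n_ctx=32768)
    for type_k, type_v in [("f16", "f16"), ("q8_0", "q5_1"), ("q4_0", "q4_0")]:
        gib = kv_cache_bytes(**dims, type_k=type_k, type_v=type_v) / 2**30
        print(f"K={type_k:5s} V={type_v:5s} -> {gib:.3f} GiB")
```

The takeaway is just the ratio: q8_0/q5_1 is a bit under half the size of an f16 cache, and q4_0/q4_0 is a bit over a quarter.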




You are an absolute legend! You were 100% right.
I built llama.cpp from source just like you suggested, using the -DGGML_CUDA_FA_ALL_QUANTS=ON and AVX-512 flags. The difference is insane! The silent CPU fallback is completely gone. My prompt processing jumped from a slow 87 tok/s to 938.68 tok/s, and my CPU is now at 0% during prefill.
P.S. I was using the Docker image previously…
Thank you so much for the compile flag. You saved me a ton of time. Oh also, avx512 is “1” by default.
Edit: And I am at 32k context now :)


9950X3D. How much RAM you got with that?
…And have you ever considered running an MoE, with experts on the CPU?
They can be shockingly fast, as the attention and dense layers are all still processed on the GPU. You could also run a much less impactful quantization, especially for the dense layers (which are typically at Q6K or Q8_0 for MoEs, while the experts in CPU RAM can take heavier quantization).
I have 32G rn, planning to upgrade it to 64G soon once they get cheaper again. Meanwhile, I did some testing



q8_0 + q5_1
I just tried it, but it is silently falling back to the CPU, giving me 87 tok/s for prompt processing. Maybe q5_1 is not fully supported on AMD? I cannot find anything relevant :(
Plug your monitor into the motherboard
My mobo has just 1 HDMI and a USB4 DP. I would need to buy a new cable… I'll try it once I get one


P.S. 16k context window is not good for agentic coding. I am trying to increase it and see what I can do…
How do you assume that it was only 2?
I love how it just says no, prolly no life left XD
You gotta try hard enough


Yeah, mid characters except a few…
3 considering the zipper
They must rest live in peace.


I am glad that it was not a typical harem trope in the 2nd episode. I’m prolly gonna stick with the season.


Plot twist: Okarun was the main antagonist after all…
Fair enough, servers cost money, you can self-host it too…
No, I actually don’t run a separate draft model on the CPU…
Since I’m using Qwen 3.6 27B, I’m utilizing MTP (Multi-Token Prediction) speculative decoding. This is built directly into the main model itself, so there is no extra “small model” to load or offload to the CPU. Everything runs entirely on the GPU VRAM.
To get it working well in llama.cpp, you just need two things:
That’s pretty much it! Because everything stays in VRAM and uses the main model’s native architecture, the draft acceptance rate is super high (around 64% for me) and it basically doubles the generation speed.
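
To see why an acceptance rate around 64% can roughly double throughput, here is a back-of-the-envelope sketch. It assumes each drafted token is accepted independently with the same probability, which is a simplification (the acceptance rate llama.cpp reports is an aggregate, not a per-token probability), and it ignores the cost of producing the drafts. The function name and the draft lengths are my own, picked for illustration.

```python
def expected_tokens_per_step(accept_prob, n_draft):
    """Expected tokens emitted per verification pass of the main model.

    Assumes each of the n_draft drafted tokens is accepted independently with
    probability accept_prob, and that drafting stops at the first rejection.
    The main model always contributes one token itself, so the result is
    1 + p + p^2 + ... + p^n_draft.
    """
    if accept_prob >= 1.0:
        return float(n_draft + 1)
    return (1.0 - accept_prob ** (n_draft + 1)) / (1.0 - accept_prob)


if __name__ == "__main__":
    for n_draft in (1, 2, 3):
        tokens = expected_tokens_per_step(0.64, n_draft)
        print(f"draft length {n_draft}: {tokens:.2f} tokens per pass")
```

Under those assumptions, two drafted tokens per pass already gives about two tokens per main-model pass, which lines up with the "basically doubles" I am seeing.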