Cake day: July 24th, 2023

  • Did you follow a guide for setting up Speculative Decoding? I haven’t gotten it to work very well personally. Does the smaller model run on your CPU memory and the larger one fully on GPU?

    No, I actually don’t run a separate draft model on the CPU…

    Since I’m using Qwen 3.6 27B, I’m using MTP (Multi-Token Prediction) speculative decoding. It’s built directly into the main model, so there is no extra “small model” to load or offload to the CPU. Everything runs entirely in GPU VRAM.

    To get it working well in llama.cpp, you just need two things:

    • Make sure you are using a model variant specifically trained for it (look for files with -MTP- in the name on Hugging Face, like the Unsloth ones).
    • Add the flag --spec-type draft-mtp to your Docker startup command. That said, I’d now suggest compiling llama.cpp yourself for better KV caching.
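
    For anyone launching the server from a script rather than Docker, here is a rough sketch of how the two points above could be put together. The --spec-type draft-mtp flag is the one from this post; llama-server, -m and -ngl are the usual llama.cpp server binary and options. The helper name and the model path are placeholders I made up for illustration, so swap in your own.

```python
import shlex


def build_llama_server_args(model_path: str, gpu_layers: int = 99) -> list[str]:
    """Assemble a llama-server command line with MTP speculative decoding.

    Hypothetical helper for illustration only; it just builds the argument
    list and does not start the server.
    """
    return [
        "llama-server",
        "-m", model_path,              # path to an MTP-trained GGUF file (placeholder)
        "-ngl", str(gpu_layers),       # offload all layers so everything stays in VRAM
        "--spec-type", "draft-mtp",    # flag quoted from the post above
    ]


if __name__ == "__main__":
    # The file name is a made-up placeholder; use the MTP variant you downloaded.
    args = build_llama_server_args("/models/your-model-MTP.gguf")
    print(shlex.join(args))
```

    Passing the resulting list to subprocess.run would start the server. I left that out so the snippet runs without llama.cpp installed.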

    That’s pretty much it! Because everything stays in VRAM and uses the main model’s native architecture, the draft acceptance rate is super high (around 64% for me) and it basically doubles the generation speed.
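
    If you want a feel for how an acceptance rate like that turns into a speedup, here is a back-of-the-envelope sketch. It uses the standard simplification from the speculative decoding literature: each drafted token is accepted independently with the same probability, and drafting costs nothing. Neither is exactly true in practice, and the acceptance figure llama.cpp reports may be measured differently, so treat this as an estimate and not a benchmark. The function name and the draft lengths are my own choices.

```python
def expected_tokens_per_step(accept_prob: float, draft_len: int) -> float:
    """Expected tokens produced per verification pass of the main model.

    Assumes every drafted token is accepted independently with probability
    accept_prob, and that the pass always yields one token of its own.
    This is the geometric sum 1 + p + p^2 + ... + p^draft_len.
    """
    if not 0.0 <= accept_prob <= 1.0:
        raise ValueError("accept_prob must be between 0 and 1")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if accept_prob == 1.0:
        return float(draft_len + 1)
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


if __name__ == "__main__":
    # 0.64 is the acceptance rate mentioned above; draft lengths are illustrative.
    for draft_len in range(1, 5):
        tokens = expected_tokens_per_step(0.64, draft_len)
        print(f"draft_len={draft_len}: ~{tokens:.2f} tokens per pass")
```

    Under those assumptions, drafting two tokens per pass at 64% acceptance gives about 2.05 tokens per pass, which lines up with the roughly 2x speedup. Real gains will be somewhat lower because the MTP head and the verification of rejected tokens are not free.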