Llama.cpp MTP Support merged - up to 2.5x speed increase

TheCornCollector@piefed.zip · edited 9 days ago

robber@lemmy.ml · 3 days ago

Using MTP combined with tensor parallelism, I went from running Qwen3.6 27b at ~7 t/s to ~30 t/s (roughly 4x) on 3x RTX 2000e Ada, which I think is an insane boost.

Avid Amoeba@lemmy.ca · 7 days ago

This does 18tps on 2x R9700:

[Qwen3.6-27B-Q8_0-Code-256K]
m = /models/Qwen3.6-27B/Qwen3.6-27B-Q8_0.gguf
mmproj = /models/Qwen3.6-27B/mmproj-BF16.gguf
chat-template-kwargs = {"preserve_thinking": true}
ctx-size = 262144
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.0
presence-penalty = 0.0
repeat-penalty = 1.0

This does 39tps on the same hardware:

[Qwen3.6-27B-MTP-Q8_0-Code-256K]
m = /models/Qwen3.6-27B-MTP/Qwen3.6-27B-Q8_0.gguf
mmproj = /models/Qwen3.6-27B-MTP/mmproj-BF16.gguf
spec-type = draft-mtp
spec-draft-n-max = 2
chat-template-kwargs = {"preserve_thinking": true}
ctx-size = 262144
temp = 0.6
top-p = 0.95
top-k = 20
min-p = 0.0
presence-penalty = 0.0
repeat-penalty = 1.0

😱
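
To make it obvious what actually changed between the two presets, here is a minimal sketch that diffs them and works out the speedup from the numbers above. It assumes the presets can be read as INI-style text exactly as quoted. `preset_diff` is a hypothetical helper written for this comment, not part of llama.cpp, and the sampling keys are left out because they are identical in both presets.

```python
import configparser

# Trimmed copies of the two presets quoted above.
# Assumption: they parse as plain INI; identical sampling keys are omitted.
PRESETS = """
[Qwen3.6-27B-Q8_0-Code-256K]
m = /models/Qwen3.6-27B/Qwen3.6-27B-Q8_0.gguf
ctx-size = 262144

[Qwen3.6-27B-MTP-Q8_0-Code-256K]
m = /models/Qwen3.6-27B-MTP/Qwen3.6-27B-Q8_0.gguf
spec-type = draft-mtp
spec-draft-n-max = 2
ctx-size = 262144
"""


def preset_diff(text, base, other):
    """Hypothetical helper: return {key: (base_value, other_value)} for differing keys."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    a, b = dict(parser[base]), dict(parser[other])
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


diff = preset_diff(
    PRESETS,
    "Qwen3.6-27B-Q8_0-Code-256K",
    "Qwen3.6-27B-MTP-Q8_0-Code-256K",
)
for key, (old, new) in diff.items():
    print(f"{key}: {old} -> {new}")

# Speedup from the throughput figures reported above (18 t/s vs 39 t/s).
speedup = 39 / 18
print(f"speedup: {speedup:.2f}x")
```

So apart from pointing at the MTP build of the model, the only additions are `spec-type` and `spec-draft-n-max`, and the reported numbers work out to a bit over 2x on this hardware.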

TheCornCollector@piefed.zip · 9 days ago

https://unsloth.ai/docs/models/qwen3.6#mtp-guide
Unsloth made a guide and has graphs with comparisons

llama + spec: MTP Support by am17an · Pull Request #22673 · ggml-org/llama.cpp