Qwen3.6-27B-MTP-UD-Q5_K_XL on my 7900XTX goes from 32 t/s to 50-72 t/s, depending on how predictable the task is. That is roughly a 1.5x speedup on creative tasks and up to 2.2x on math.
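
For anyone who wants to compare their own numbers the same way, here is a quick sketch of the arithmetic. The `speedup` helper is just an illustrative name, and the values are the throughput figures quoted above:

```python
def speedup(baseline_tps: float, mtp_tps: float) -> float:
    """Ratio of MTP throughput to baseline throughput (tokens per second)."""
    return mtp_tps / baseline_tps


baseline = 32.0  # t/s without MTP, as reported in this post

# Throughput with MTP for the least and most predictable tasks in this post.
for label, tps in [("creative", 50.0), ("math", 72.0)]:
    print(f"{label}: {speedup(baseline, tps):.2f}x")
```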

MTP does not change output quality; the only cost is a few hundred MB of extra VRAM. You will need to download a GGUF model with MTP support to use it.

My parameters:

; Context memory usage  
ctx-size = 65536  
ctk = q8_0  
ctv = q8_0  

; Prompt processing speed  
batch-size = 1024  
ubatch-size = 1024  

; Speculative decoding  
np = 1  
spec-type = draft-mtp  
spec-draft-n-max = 3  

Edit: I did some more testing using Unsloth’s parameters. With spec-draft-n-max = 6 I can get up to 82 t/s, a 2.56x increase, on the same math prompt. But this comes at the cost of the creative writing task, which now falls below 40 t/s.
It seems this should be tuned per prompt, much like the sampling parameters.
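
A rough way to see why a longer draft helps predictable prompts and hurts unpredictable ones is the standard speculative decoding estimate: if each drafted token is accepted independently with probability p, a step with n drafted tokens yields (1 - p^(n+1)) / (1 - p) tokens on average. The sketch below is a toy model, not a description of how llama.cpp schedules MTP. The acceptance rates and the per-token overhead are made-up assumptions chosen only to illustrate the trade-off:

```python
def expected_tokens_per_step(accept_prob: float, draft_len: int) -> float:
    """Average tokens emitted per verification step.

    Assumes each drafted token is accepted independently with the same
    probability, and that one extra token always comes from the main model.
    """
    if accept_prob >= 1.0:
        return draft_len + 1.0
    return (1.0 - accept_prob ** (draft_len + 1)) / (1.0 - accept_prob)


def relative_speed(accept_prob: float, draft_len: int,
                   overhead_per_draft_token: float = 0.08) -> float:
    """Speed relative to plain decoding (1.0 means no gain).

    overhead_per_draft_token is a hypothetical cost (as a fraction of one
    normal decode step) for drafting and verifying each extra token.
    """
    step_cost = 1.0 + draft_len * overhead_per_draft_token
    return expected_tokens_per_step(accept_prob, draft_len) / step_cost


# Hypothetical acceptance rates: high for math-like text, low for creative text.
for label, p in [("predictable", 0.9), ("unpredictable", 0.5)]:
    for n in (3, 6):
        print(f"{label}, draft_len={n}: {relative_speed(p, n):.2f}x")
```

Under these assumed numbers, the longer draft wins when acceptance is high and loses when it is low, which matches the direction of the results above. Real acceptance rates vary from token to token, so treat this only as intuition for choosing spec-draft-n-max.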

  • robber@lemmy.ml
    3 days ago

    Using MTP combined with tensor parallelism, I was able to go from running Qwen3.6 27B at ~7 t/s to ~30 t/s on 3x RTX 2000e Ada, which I think is an insane boost.

  • Avid Amoeba@lemmy.ca
    7 days ago

    This does 18 t/s on 2x R9700:

    [Qwen3.6-27B-Q8_0-Code-256K]
    m = /models/Qwen3.6-27B/Qwen3.6-27B-Q8_0.gguf
    mmproj = /models/Qwen3.6-27B/mmproj-BF16.gguf
    chat-template-kwargs = {"preserve_thinking": true}
    ctx-size = 262144
    temp = 0.6
    top-p = 0.95
    top-k = 20
    min-p = 0.0
    presence-penalty = 0.0
    repeat-penalty = 1.0
    

    This does 39 t/s on the same hardware:

    [Qwen3.6-27B-MTP-Q8_0-Code-256K]
    m = /models/Qwen3.6-27B-MTP/Qwen3.6-27B-Q8_0.gguf
    mmproj = /models/Qwen3.6-27B-MTP/mmproj-BF16.gguf
    spec-type = draft-mtp
    spec-draft-n-max = 2
    chat-template-kwargs = {"preserve_thinking": true}
    ctx-size = 262144
    temp = 0.6
    top-p = 0.95
    top-k = 20
    min-p = 0.0
    presence-penalty = 0.0
    repeat-penalty = 1.0
    

    😱