MTP support?

#5
by Nindaleth - opened

I can see people are starting to ask about MTP-supporting quants in your other models' discussions. Please consider this nice little MoE too :)

I wanted to get some work done this weekend, so I made a Q6_K version patched with the old PR method and uploaded it here. I've been using it for about a day and it seems to work fine, but I'd still switch to a proper official version once AesSedai releases one.

Thanks! I've tried it, and in my heterogeneous-GPU Vulkan environment, prompt processing (PP) speed drops by 90% (~2.5k -> 250 tk/s) as soon as I add --spec-type draft-mtp --spec-draft-n-max 2. Without MTP enabled, I get the same prompt processing speed as with the original.

With Unsloth's quants, PP starts much lower and drops less sharply (1.1k -> 800 tk/s), but I tested the two on different llama.cpp commits, so I need to test more before comparing them.

I'll be updating my quants with MTP soon.

I've updated these quants with MTP support.

Follow-up: I made a mistake with the quants and didn't initially produce them as fused, so the imatrix wasn't being applied correctly. I've updated the quants again; they should now all be correct and include the MTP tensors.

The hero we don't deserve but need
