31gb NVFP4 Model?
I'm curious why this model is 31gb for NVFP4? The full FP8 straight from Qwen is only 37.5gb.
@zenmagnets only the MLP layers are compressed to 4-bit. Sensitive components like the attention and vision blocks are left in 16-bit. Qwen's official FP8 applies compression much more uniformly, but because this 4-bit version leaves so much in BF16, the total size equalizes to be close to the FP8 model. See also Qwen's official Qwen/Qwen3.5-27B-GPTQ-Int4 which is also around the same size.
Gotcha. Thanks for the explanation
@zenmagnets only the MLP layers are compressed to 4-bit. Sensitive components like the attention and vision blocks are left in 16-bit. Qwen's official FP8 applies compression much more uniformly, but because this 4-bit version leaves so much in BF16, the total size equalizes to be close to the FP8 model. See also Qwen's official Qwen/Qwen3.5-27B-GPTQ-Int4 which is also around the same size.
with lots of bf16, how is the decode output performance?
I've got it running on WSL and RTX 6000 Pro with docker compose below. I am getting 22 tok/s which is 5x lower than FP8 version (100-140 tok/s) of same model. I must be doing something wrong here OR that number is right?
services:
vllm-large:
image: vllm/vllm-openai:latest
container_name: vllm-large
restart: unless-stopped
ports:
- "8000:8000"
environment:
- VLLM_API_KEY=some-secure-key-here
- HF_HOME=/root/.cache/huggingface
volumes:
- /mnt/p/models/qwen:/models
- /mnt/p/hf-cache:/root/.cache/huggingface
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
ipc: host
command: >
--model /models/Qwen3.6-27B-NVFP4-unsloth
--served-model-name qwen3.6-27b-nvfp4-unsloth
--trust-remote-code
--max-model-len 196608
--max-num-seqs 256
--max-num-batched-tokens 32768
--gpu-memory-utilization 0.70
--dtype bfloat16
--kv-cache-dtype fp8
--attention-backend flashinfer
--enable-prefix-caching
--no-scheduler-reserve-full-isl
--reasoning-parser qwen3
--enable-auto-tool-choice
--tool-call-parser qwen3_coder
--enable-chunked-prefill
--override-generation-config '{"max_new_tokens": 81920}'
--default-chat-template-kwargs '{"enable_thinking": true}'
--speculative-config '{"method":"mtp","num_speculative_tokens":3}'
healthcheck:
test: ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
interval: 30s
timeout: 10s
retries: 10
start_period: 60s