31 GB NVFP4 model?

#1
by zenmagnets - opened

I'm curious why this model is 31 GB for NVFP4. The full FP8 checkpoint straight from Qwen is only 37.5 GB.

@zenmagnets Only the MLP layers are compressed to 4-bit. Sensitive components such as the attention and vision blocks are left in 16-bit (BF16). Qwen's official FP8 checkpoint applies compression much more uniformly, but because this 4-bit version leaves so much in BF16, its total size ends up close to that of the FP8 model. See also Qwen's official Qwen/Qwen3.5-27B-GPTQ-Int4, which is around the same size.
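As a rough illustration of that size argument, here is a back-of-the-envelope sketch. Everything numeric in it is an assumption rather than something measured from this checkpoint: the 27B parameter count is taken from the model name, the 60% share of weights in 4-bit is purely illustrative, and the 4.5 bits per weight assumes NVFP4 stores one 8-bit scale per block of 16 four-bit values. The helper name `mixed_precision_size_gb` is hypothetical.

```python
def mixed_precision_size_gb(total_params: float,
                            frac_low: float,
                            bits_low: float = 4.5,
                            bits_high: float = 16.0) -> float:
    """Estimate checkpoint size in GB (1e9 bytes) for a two-precision model.

    total_params: total number of weights (assumed, e.g. 27e9).
    frac_low:     fraction of weights stored in the low-bit format (assumed).
    bits_low:     effective bits per low-precision weight; 4.5 assumes
                  4-bit values plus one 8-bit scale per 16-value block.
    bits_high:    bits per weight for everything left unquantized (BF16 = 16).
    """
    if not 0.0 <= frac_low <= 1.0:
        raise ValueError("frac_low must be between 0 and 1")
    total_bits = total_params * (frac_low * bits_low
                                 + (1.0 - frac_low) * bits_high)
    return total_bits / 8 / 1e9


# Illustrative only: 60% of weights in 4-bit, the remaining 40% in BF16.
estimate = mixed_precision_size_gb(27e9, frac_low=0.6)
print(f"estimated size: {estimate:.1f} GB")
```

Under these assumed numbers the estimate lands near 31 GB, which is consistent with the explanation above: the BF16 remainder dominates the file size even though it is the smaller share of the weights.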

Gotcha. Thanks for the explanation

With so much left in BF16, how is the decode performance?

I've got it running on WSL with an RTX 6000 Pro, using the docker compose file below. I'm getting 22 tok/s, which is roughly 5x lower than the FP8 version of the same model (100-140 tok/s). Am I doing something wrong, or is that number expected?

services:
  vllm-large:
    image: vllm/vllm-openai:latest
    container_name: vllm-large
    restart: unless-stopped
    ports:
      - "8000:8000"
    environment:
      - VLLM_API_KEY=some-secure-key-here
      - HF_HOME=/root/.cache/huggingface
    volumes:
      - /mnt/p/models/qwen:/models
      - /mnt/p/hf-cache:/root/.cache/huggingface

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

    ipc: host

    command: >
      --model /models/Qwen3.6-27B-NVFP4-unsloth
      --served-model-name qwen3.6-27b-nvfp4-unsloth
      --trust-remote-code
      --max-model-len 196608
      --max-num-seqs 256
      --max-num-batched-tokens 32768
      --gpu-memory-utilization 0.70
      --dtype bfloat16
      --kv-cache-dtype fp8
      --attention-backend flashinfer
      --enable-prefix-caching
      --no-scheduler-reserve-full-isl
      --reasoning-parser qwen3
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --enable-chunked-prefill
      --override-generation-config '{"max_new_tokens": 81920}'
      --default-chat-template-kwargs '{"enable_thinking": true}'
      --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
    
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 10
      start_period: 60s
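For anyone wanting to reproduce the tok/s figure, a minimal timing sketch against the server's OpenAI-compatible `/v1/completions` endpoint might look like the following. The base URL, API key and served model name are taken from the compose file above; the prompt, `max_tokens` value and the function names `tokens_per_second` and `measure` are placeholders chosen for illustration. It times the whole request, so prefill is included and the result only approximates decode speed when the prompt is short.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens: int, elapsed_s: float) -> float:
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def measure(base_url: str, api_key: str, model: str,
            prompt: str = "Write a short story about a robot.",
            max_tokens: int = 512) -> float:
    """Send one non-streaming completion request and return tok/s.

    Assumes the server reports `usage.completion_tokens` in its response.
    """
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=600) as response:
        payload = json.load(response)
    elapsed = time.perf_counter() - start
    return tokens_per_second(payload["usage"]["completion_tokens"], elapsed)


# Example call (needs the container above to be running):
# print(measure("http://localhost:8000", "some-secure-key-here",
#               "qwen3.6-27b-nvfp4-unsloth"))
```

Running the same call against the FP8 deployment with identical settings would make the two numbers directly comparable.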
