31gb NVFP4 Model?

by zenmagnets - opened Apr 23

Discussion

zenmagnets

Apr 23

I'm curious why this model is 31 GB for NVFP4. The full FP8 checkpoint straight from Qwen is only 37.5 GB.

mmangkad

Owner Apr 23

@zenmagnets Only the MLP layers are compressed to 4-bit. Sensitive components like the attention and vision blocks are left in 16-bit. Qwen's official FP8 checkpoint applies compression much more uniformly, but because this 4-bit version leaves so much in BF16, its total size ends up close to that of the FP8 model. See also Qwen's official Qwen/Qwen3.5-27B-GPTQ-Int4, which is around the same size.
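To make the size argument concrete, here is a rough back-of-the-envelope sketch in Python. It is not taken from the model card. It assumes about 27B total parameters (read off the model name; the real count, including the vision tower, may differ) and treats NVFP4 as roughly 4.5 bits per weight (4-bit values plus one 8-bit scale per 16-value block). The MLP share of the parameters is a free input here, not a published figure.

```python
def checkpoint_size_gb(total_params, mlp_fraction, mlp_bits=4.5, other_bits=16.0):
    """Estimate weight storage in GB (1 GB = 1e9 bytes).

    mlp_fraction: share of parameters quantized to NVFP4 (hypothetical input).
    mlp_bits:     assumed effective bits per NVFP4 weight, including block scales.
    other_bits:   bits per weight for everything left in BF16.
    """
    if not 0.0 <= mlp_fraction <= 1.0:
        raise ValueError("mlp_fraction must be between 0 and 1")
    avg_bits = mlp_fraction * mlp_bits + (1.0 - mlp_fraction) * other_bits
    return total_params * avg_bits / 8 / 1e9


def mlp_fraction_for_size(total_params, size_gb, mlp_bits=4.5, other_bits=16.0):
    """Invert the estimate: which quantized share would yield a given size?"""
    avg_bits = size_gb * 1e9 * 8 / total_params
    return (other_bits - avg_bits) / (other_bits - mlp_bits)


if __name__ == "__main__":
    params = 27e9  # assumption: taken from the "27B" in the model name
    for frac in (0.0, 0.5, 0.7, 1.0):
        size = checkpoint_size_gb(params, frac)
        print(f"quantized share {frac:.1f}: about {size:.1f} GB")
    implied = mlp_fraction_for_size(params, 31.0)
    print(f"share implied by a 31 GB checkpoint: about {implied:.2f}")
```

Under these assumptions, a mixed 4-bit/BF16 checkpoint lands well above the size of a fully 4-bit one, which is the effect described above.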

zenmagnets

Apr 23

Gotcha. Thanks for the explanation

PowAG

Apr 23

With so much left in BF16, how is the decode throughput?

markonimakaroni

May 3

I've got it running on WSL with an RTX 6000 Pro, using the docker compose file below. I am getting 22 tok/s, which is roughly 5x slower than the FP8 version of the same model (100-140 tok/s). Am I doing something wrong here, or is that number right?

services:
  vllm-large:
    image: vllm/vllm-openai:latest
    container_name: vllm-large
    restart: unless-stopped
    ports:
      - "8000:8000"
    environment:
      - VLLM_API_KEY=some-secure-key-here
      - HF_HOME=/root/.cache/huggingface
    volumes:
      - /mnt/p/models/qwen:/models
      - /mnt/p/hf-cache:/root/.cache/huggingface

    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]

    ipc: host

    command: >
      --model /models/Qwen3.6-27B-NVFP4-unsloth
      --served-model-name qwen3.6-27b-nvfp4-unsloth
      --trust-remote-code
      --max-model-len 196608
      --max-num-seqs 256
      --max-num-batched-tokens 32768
      --gpu-memory-utilization 0.70
      --dtype bfloat16
      --kv-cache-dtype fp8
      --attention-backend flashinfer
      --enable-prefix-caching
      --no-scheduler-reserve-full-isl
      --reasoning-parser qwen3
      --enable-auto-tool-choice
      --tool-call-parser qwen3_coder
      --enable-chunked-prefill
      --override-generation-config '{"max_new_tokens": 81920}'
      --default-chat-template-kwargs '{"enable_thinking": true}'
      --speculative-config '{"method":"mtp","num_speculative_tokens":3}'
    
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:8000/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 10
      start_period: 60s
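For anyone who wants to compare numbers the same way, here is a minimal sketch of how a single-request tok/s figure could be measured against the server above. It assumes the OpenAI-compatible `/v1/completions` endpoint on port 8000 and reuses the served model name and API key from the compose file. The helper names `tokens_per_second` and `measure_decode_rate` are made up for this example. The timing includes prefill, so a short prompt keeps it close to pure decode speed.

```python
import json
import time
import urllib.request


def tokens_per_second(completion_tokens, elapsed_s):
    """Generated tokens divided by wall-clock seconds."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return completion_tokens / elapsed_s


def measure_decode_rate(base_url, model, api_key, prompt="Count from 1 to 200.",
                        max_tokens=512):
    """Send one non-streaming completion request and return tok/s."""
    body = json.dumps({
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
    )
    start = time.perf_counter()
    with urllib.request.urlopen(request, timeout=600) as response:
        payload = json.load(response)
    elapsed = time.perf_counter() - start
    # The server reports how many tokens it actually generated.
    return tokens_per_second(payload["usage"]["completion_tokens"], elapsed)


# Example call (values taken from the compose file above):
# rate = measure_decode_rate("http://localhost:8000",
#                            "qwen3.6-27b-nvfp4-unsloth",
#                            "some-secure-key-here")
# print(f"{rate:.1f} tok/s")
```

Running the same script against the FP8 deployment, with identical prompt and `max_tokens`, would make the two figures directly comparable.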
