[ROCm][Perf][Bugfix] DSv4 indexer: use platform FP8 dtype (fnuz) for Q-quant on gfx942 by akii96 · Pull Request #46730 · vllm-project/vllm · GitHub
Skip to content

[ROCm][Perf][Bugfix] DSv4 indexer: use platform FP8 dtype (fnuz) for Q-quant on gfx942#46730

Open
akii96 wants to merge 1 commit into
vllm-project:mainfrom
akii96:aakif/dsv4-indexer-dtype-fix-gfx942
Open

[ROCm][Perf][Bugfix] DSv4 indexer: use platform FP8 dtype (fnuz) for Q-quant on gfx942#46730
akii96 wants to merge 1 commit into
vllm-project:mainfrom
akii96:aakif/dsv4-indexer-dtype-fix-gfx942

Conversation

@akii96

@akii96 akii96 commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Motivation

On gfx942 the DeepSeek-V4 Flash indexer quantizes Q and K to different FP8 types. K already uses the platform type (e4m3fnuz on gfx942, via current_platform.fp8_dtype()), but the fused RoPE+quant kernel in fused_indexer_q.py hardcodes Q to e4m3fn. The FP8 logits kernel then gets fnuz K with fn Q and falls back to a mixed-dtype path on every call. This change derives Q's type from current_platform.fp8_dtype() as well, so on gfx942 both are fnuz and the logits kernel runs its native fnuz/fnuz path.

This is gfx942-specific by design. is_fp8_fnuz() is true only for gfx94x, so on gfx950 the platform type is OCP e4m3fn and Q/K are already fn/fn. Nothing changes there, and the NVIDIA cutedsl and MXFP4 paths are untouched (the two new kernel constexprs are defaulted).

The fnuz quant max is set to 224.0 to match get_fp8_min_max() in quant_utils.py, the value the K cache already uses.

Results

End to end serving of DeepSeek-V4 Flash (TP4, gfx942 / MI300), mean TTFT, prefill-heavy (OSL=27, concurrency 4):

ISL OSL main (mixed fp8 q=fn / k=fnuz) This PR (fnuz q/k) Speedup
8,192 27 3283 ms 1260 ms 2.6x
32,768 27 39744 ms 5350 ms 7.4x

Bonus correctness check: the existing kernel test tests/kernels/test_fused_indexer_q_rope_quant.py matches the unfused reference bit for bit on 9 of 10 gfx942 shapes. The one miss is 3 of 8,380,416 values at float32 / 1023 tokens, from fused vs unfused RoPE rounding at FP8 boundaries, not a dtype or scale error.

Note

The same indexer dtype handling was included in the larger ROCm enablement PRs #41601 and #42033, both stalled on rebase since May. This is a minimal, standalone version of just that fix for the gfx942 path; main as of today still hardcodes Q to e4m3fn.

Repro: serve + bench commands (DeepSeek-V4 Flash, TP4, gfx942)

Serve (4x MI325):

HIP_VISIBLE_DEVICES=0,1,2,3 VLLM_ROCM_USE_AITER=1 \
vllm serve deepseek-ai/DeepSeek-V4-Flash \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.85 \
  --kv-cache-dtype fp8_e4m3 \
  --block-size 256 \
  --max-model-len 132096 \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 156 \
  --async-scheduling \
  --no-enable-prefix-caching \
  --tokenizer-mode deepseek_v4 \
  --reasoning-parser deepseek_v4 \
  --tool-call-parser deepseek_v4 \
  --enable-auto-tool-choice \
  --disable-log-stats \
  --host 0.0.0.0 --port 8000
  
  vllm bench serve --backend vllm \
  --model deepseek-ai/DeepSeek-V4-Flash \
  --host localhost --port 8000 \
  --dataset-name random --ignore-eos --trust-remote-code \
  --seed 5678 \
  --random-input-len 8192 \
  --random-output-len 27 \
  --max-concurrency 4 --num-prompts 12 --num-warmups 4

# 32K ISL: same command with --random-input-len 32768

@akii96 akii96 requested a review from zyongye as a code owner June 25, 2026 13:59

@claude claude Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@mergify mergify Bot added rocm Related to AMD ROCm bug Something isn't working labels Jun 25, 2026
@github-project-automation github-project-automation Bot moved this to Todo in AMD Jun 25, 2026
Signed-off-by: Aakif Nawaz <aakif.nawaz@amd.com>
@akii96 akii96 force-pushed the aakif/dsv4-indexer-dtype-fix-gfx942 branch from db4e561 to 0309e6c Compare June 25, 2026 14:20
@akii96

akii96 commented Jun 25, 2026

Copy link
Copy Markdown
Contributor Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working rocm Related to AMD ROCm

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

1 participant