Update Minimax M3 FP4 vllm #1978
- { tp: 2, ep: 2, conc-start: 32, conc-end: 512 }
- { tp: 4, ep: 4, conc-start: 128, conc-end: 512 }
- { tp: 4, ep: 4, dp-attn: true, conc-start: 256, conc-end: 512 }
- { tp: 2, ep: 4, dp-attn: true, conc-start: 64, conc-end: 256 }
🔴 The new 8k1k entry { tp: 2, ep: 4, dp-attn: true, conc-start: 64, conc-end: 256 } at line 12950 mislabels EP: with dp-attn: true, minimaxm3_fp4_b200.sh launches vLLM with --tensor-parallel-size=1 --data-parallel-size=$TP --enable-expert-parallel and never passes EP_SIZE, so world_size = TP×1 = 2 and vLLM runs with EP=2 — but process_result.py reads EP_SIZE from env and records the result as ep=4, poisoning cross-config comparisons on the leaderboard. It is also the only entry in the file where ep > tp and ep != 1; every other dp-attn row has ep == tp. Likely a typo — either tp: 4, ep: 4, dp-attn: true (matching the sibling row on line 12949) or tp: 2, ep: 2, dp-attn: true.
Extended reasoning...
What's wrong
Line 12950 adds:
- { tp: 2, ep: 4, dp-attn: true, conc-start: 64, conc-end: 256 }
The ep: 4 label on this row is a lie: the actual runtime will use EP=2. Here is the code path.
Trigger path
1. Shell launcher — benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b200.sh:37-43:
if [ "${DP_ATTENTION}" = "true" ]; then
    PARALLEL_ARGS="--tensor-parallel-size=1 --data-parallel-size=$TP --enable-expert-parallel"
elif [ "$EP_SIZE" -gt 1 ]; then
    PARALLEL_ARGS="--tensor-parallel-size=$TP --enable-expert-parallel"
else
    PARALLEL_ARGS="--tensor-parallel-size=$TP"
fi
When DP_ATTENTION=true, the first branch is taken. EP_SIZE is read only in the elif's boolean check, which is never reached in this case, and it is never forwarded to vLLM via --expert-parallel-size. vLLM derives EP from world_size = data-parallel-size × tensor-parallel-size = TP × 1.
2. Slurm allocation — runners/launch_b200-dgxc.sh:435 requests --gres=gpu:$TP = 2 GPUs. Only 2 physical GPUs are allocated, so EP=4 is physically impossible on this node.
3. Result labeling — utils/process_result.py:110-118 reads EP_SIZE directly from env and records it as the ep field. utils/matrix_logic/generate_sweep_configs.py:483-484 propagates the YAML ep into that env var. Nothing corrects for the dp-attn case, so the label survives verbatim.
Step-by-step proof for this entry
- Sweep generator emits an env with TP=2, EP_SIZE=4, DP_ATTENTION=true.
- Slurm allocates 2 GPUs (gpu:$TP).
- Shell hits the first branch → launches vLLM with --tensor-parallel-size=1 --data-parallel-size=2 --enable-expert-parallel. EP_SIZE=4 is never referenced.
- vLLM's world_size = 2×1 = 2, so it uses EP=2.
- Benchmark completes. process_result.py reads EP_SIZE from env → writes ep: 4 into the result JSON.
- Leaderboard ingests an ep=4 row that was actually an EP=2 run.
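The trace above can be reproduced with a minimal Python sketch that mirrors the three launcher branches. It assumes, as this comment describes, that vLLM's expert-parallel size equals tensor-parallel-size × data-parallel-size whenever --enable-expert-parallel is passed. The helper `effective_parallelism` and its dictionary keys are hypothetical names for illustration, not code from the repository.

```python
from typing import Dict


def effective_parallelism(tp: int, ep_size: int, dp_attention: bool) -> Dict[str, int]:
    """Mirror the launcher's branch logic (illustrative sketch, not repo code)."""
    if dp_attention:
        # --tensor-parallel-size=1 --data-parallel-size=$TP --enable-expert-parallel
        tensor_parallel, data_parallel, expert_parallel_on = 1, tp, True
    elif ep_size > 1:
        # --tensor-parallel-size=$TP --enable-expert-parallel
        tensor_parallel, data_parallel, expert_parallel_on = tp, 1, True
    else:
        # --tensor-parallel-size=$TP
        tensor_parallel, data_parallel, expert_parallel_on = tp, 1, False

    world_size = tensor_parallel * data_parallel
    # Assumption: with expert parallelism enabled, EP spans the whole world size.
    effective_ep = world_size if expert_parallel_on else 1
    return {
        "tensor_parallel": tensor_parallel,
        "data_parallel": data_parallel,
        "world_size": world_size,
        "effective_ep": effective_ep,
        "labeled_ep": ep_size,  # what the result JSON would record from EP_SIZE
    }


# The flagged row: tp: 2, ep: 4, dp-attn: true
run = effective_parallelism(tp=2, ep_size=4, dp_attention=True)
print(run)
```

Under these assumptions the flagged row yields an effective EP that differs from its label, while a row such as tp: 4, ep: 4, dp-attn: true yields matching values.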
Why validation doesn't catch this
SingleNodeSearchSpaceEntry in utils/matrix_logic/validation.py permits ep independent of tp: there is no general ep <= tp constraint and no ep == tp constraint for dp-attn rows. So this entry passes generate-sweep unnoticed.
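A guard of the kind that is missing could look roughly like the sketch below. It is a standalone function over a plain dict, since the actual structure of SingleNodeSearchSpaceEntry is not shown here; the name `check_ep_matches_tp` and the error wording are hypothetical.

```python
def check_ep_matches_tp(entry: dict) -> None:
    """Reject dp-attn rows whose ep label cannot match the launched run.

    Hypothetical check: with dp-attn enabled the launcher sets
    data-parallel-size=$TP and tensor-parallel-size=1, so the run's EP
    would equal tp regardless of the ep value written in the YAML.
    """
    tp = entry["tp"]
    ep = entry.get("ep", 1)
    if entry.get("dp-attn", False) and ep != tp:
        raise ValueError(
            f"dp-attn row has ep={ep} but tp={tp}; the launcher would run with EP={tp}"
        )


# A consistent row passes silently.
check_ep_matches_tp({"tp": 4, "ep": 4, "dp-attn": True})

# The flagged row is rejected.
try:
    check_ep_matches_tp({"tp": 2, "ep": 4, "dp-attn": True})
except ValueError as err:
    print(f"rejected: {err}")
```

Whether the rule belongs in the validator or in the result labeling is a design choice for the maintainers; this only shows the constraint itself.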
Uniqueness / typo signal
Across the entire nvidia-master.yaml, this is the only entry with ep > tp and ep != 1. Every other dp-attn: true row has ep == tp (2/2, 4/4, 8/8). Combined with the row immediately above (tp: 4, ep: 4, dp-attn: true), the natural read is a typo where either tp should be 4 or ep should be 2.
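The outlier test described here (ep > tp and ep != 1) can be sketched as a small filter. The rows below are only the four entries visible in this PR's diff, transcribed by hand; a real audit would load nvidia-master.yaml, whose overall layout is not shown in this thread. `find_ep_outliers` is a hypothetical helper name.

```python
from typing import Dict, List

# The four rows from this PR's diff, written as plain dicts.
rows: List[Dict] = [
    {"tp": 2, "ep": 2, "conc-start": 32, "conc-end": 512},
    {"tp": 4, "ep": 4, "conc-start": 128, "conc-end": 512},
    {"tp": 4, "ep": 4, "dp-attn": True, "conc-start": 256, "conc-end": 512},
    {"tp": 2, "ep": 4, "dp-attn": True, "conc-start": 64, "conc-end": 256},
]


def find_ep_outliers(entries: List[Dict]) -> List[Dict]:
    """Return entries where ep exceeds tp (rows with ep == 1 are ignored)."""
    return [
        e for e in entries
        if e.get("ep", 1) != 1 and e.get("ep", 1) > e["tp"]
    ]


outliers = find_ep_outliers(rows)
for entry in outliers:
    print(f"suspicious row: tp={entry['tp']} ep={entry['ep']} dp-attn={entry.get('dp-attn', False)}")
```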
Fix
One-character YAML edit — pick whichever config the author actually intended:
- { tp: 4, ep: 4, dp-attn: true, conc-start: 64, conc-end: 256 } # matches row 12949 pattern
# or
- { tp: 2, ep: 2, dp-attn: true, conc-start: 64, conc-end: 256 } # matches every other dp-attn row
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28565934200
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28570491405
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28574074488
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28604484366
/reuse-sweep-run
As a PR reviewer and CODEOWNER, I have reviewed this and have:
Additional detail section:
Signed:
@Ankur-singh Merge is blocked on Check 3: the linked recipe does not cover the trtllm all-reduce backend this PR forces.
# Conflicts: # perf-changelog.yaml

Update the MiniMax M3 FP4 vLLM configs for B200, primarily the parallel configs, and use an FP8 KV cache.