Update Minimax M3 FP4 vllm by wzhao18 · Pull Request #1978 · SemiAnalysisAI/InferenceX · GitHub

Update Minimax M3 FP4 vllm #1978

Merged
adibarra merged 8 commits into main from wzhao/minimax-m3-fp4-update
Jul 2, 2026

Conversation

@wzhao18

@wzhao18 wzhao18 commented Jul 2, 2026

Collaborator

Update the MiniMax M3 FP4 vLLM configs for B200, primarily the parallel configs, and use FP8 KV cache.

@github-actions

github-actions Bot commented Jul 2, 2026

Contributor

Comment thread .github/configs/nvidia-master.yaml Outdated
- { tp: 2, ep: 2, conc-start: 32, conc-end: 512 }
- { tp: 4, ep: 4, conc-start: 128, conc-end: 512 }
- { tp: 4, ep: 4, dp-attn: true, conc-start: 256, conc-end: 512 }
- { tp: 2, ep: 4, dp-attn: true, conc-start: 64, conc-end: 256 }

Contributor

🔴 The new 8k1k entry { tp: 2, ep: 4, dp-attn: true, conc-start: 64, conc-end: 256 } at line 12950 mislabels EP: with dp-attn: true, minimaxm3_fp4_b200.sh launches vLLM with --tensor-parallel-size=1 --data-parallel-size=$TP --enable-expert-parallel and never passes EP_SIZE, so world_size = TP×1 = 2 and vLLM runs with EP=2 — but process_result.py reads EP_SIZE from env and records the result as ep=4, poisoning cross-config comparisons on the leaderboard. It is also the only entry in the file where ep > tp and ep != 1; every other dp-attn row has ep == tp. Likely a typo — either tp: 4, ep: 4, dp-attn: true (matching the sibling row on line 12949) or tp: 2, ep: 2, dp-attn: true.

Extended reasoning...

What's wrong

Line 12950 adds:

- { tp: 2, ep: 4, dp-attn: true, conc-start: 64, conc-end: 256 }

The ep: 4 label on this row is a lie: the actual runtime will use EP=2. Here is the code path.

Trigger path

1. Shell launcher (benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b200.sh:37-43):

if [ "${DP_ATTENTION}" = "true" ]; then
  PARALLEL_ARGS="--tensor-parallel-size=1 --data-parallel-size=$TP --enable-expert-parallel"
elif [ "$EP_SIZE" -gt 1 ]; then
  PARALLEL_ARGS="--tensor-parallel-size=$TP --enable-expert-parallel"
else
  PARALLEL_ARGS="--tensor-parallel-size=$TP"
fi

When DP_ATTENTION=true, EP_SIZE is used only for the elif's boolean check — it is never forwarded to vLLM via --expert-parallel-size. vLLM derives EP from world_size = data-parallel-size × tensor-parallel-size = TP × 1.

2. Slurm allocation: runners/launch_b200-dgxc.sh:435 requests --gres=gpu:$TP, which is 2 GPUs for this entry. Only 2 physical GPUs are allocated, so EP=4 is physically impossible on this node.

3. Result labeling: utils/process_result.py:110-118 reads EP_SIZE directly from env and records it as the ep field. utils/matrix_logic/generate_sweep_configs.py:483-484 propagates the YAML ep into that env var. Nothing corrects for the dp-attn case, so the label survives verbatim.
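
The trigger path above can be sketched in Python to show where the label and the runtime diverge. This is only an illustration: `launcher_parallel_args` and `effective_ep` are hypothetical helper names, the branch logic is transcribed from the shell snippet quoted above, and the rule that EP equals tensor-parallel size times data-parallel size when expert parallelism is enabled is taken from this review comment rather than verified against vLLM.

```python
def launcher_parallel_args(tp: int, ep_size: int, dp_attention: bool) -> list[str]:
    """Mirror the three branches of the launcher snippet (hypothetical helper)."""
    if dp_attention:
        # EP_SIZE is not consulted at all on this branch.
        return [
            "--tensor-parallel-size=1",
            f"--data-parallel-size={tp}",
            "--enable-expert-parallel",
        ]
    if ep_size > 1:
        # EP_SIZE only acts as an on/off switch here.
        return [f"--tensor-parallel-size={tp}", "--enable-expert-parallel"]
    return [f"--tensor-parallel-size={tp}"]


def effective_ep(tp: int, ep_size: int, dp_attention: bool) -> int:
    """EP degree implied by the flags, assuming EP = TP size x DP size."""
    args = launcher_parallel_args(tp, ep_size, dp_attention)
    if "--enable-expert-parallel" not in args:
        return 1
    sizes = {k: int(v) for k, v in (a.split("=", 1) for a in args if "=" in a)}
    return sizes.get("--tensor-parallel-size", 1) * sizes.get("--data-parallel-size", 1)


# The flagged row: labeled ep=4, but the flags imply a smaller EP degree.
labeled_ep = 4
actual_ep = effective_ep(tp=2, ep_size=labeled_ep, dp_attention=True)
print(f"labeled ep={labeled_ep}, effective ep={actual_ep}")
```

Under these assumptions, the flagged row yields an effective EP of 2 while the recorded label stays at 4, and the sibling row with tp: 4, ep: 4 yields matching values.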

Step-by-step proof for this entry

  1. Sweep generator emits an env with TP=2, EP_SIZE=4, DP_ATTENTION=true.
  2. Slurm allocates 2 GPUs (gpu:$TP).
  3. Shell hits the first branch → launches vLLM with --tensor-parallel-size=1 --data-parallel-size=2 --enable-expert-parallel. EP_SIZE=4 is never referenced.
  4. vLLM's world_size = 2×1 = 2, so it uses EP=2.
  5. Benchmark completes. process_result.py reads EP_SIZE from env → writes ep: 4 into the result JSON.
  6. Leaderboard ingests an ep=4 row that was actually an EP=2 run.

Why validation doesn't catch this

utils/matrix_logic/validation.py SingleNodeSearchSpaceEntry permits ep independent of tp; there is no ep <= tp or ep == tp for dp-attn constraint. So this passes generate-sweep unnoticed.
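
A minimal sketch of the missing constraint, assuming a plain dataclass stand-in rather than the real SingleNodeSearchSpaceEntry (whose fields and base class are not shown in this thread). The class name `SearchSpaceEntrySketch` and the field `dp_attn` are hypothetical; the two rules are the ones named above (ep <= tp, and ep == tp when dp-attn is set).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SearchSpaceEntrySketch:
    """Hypothetical stand-in for a single-node search-space entry."""

    tp: int
    ep: int = 1
    dp_attn: bool = False

    def __post_init__(self) -> None:
        # ep can never exceed the number of GPUs requested (gpu:$TP).
        if self.ep > self.tp:
            raise ValueError(f"ep={self.ep} exceeds tp={self.tp}")
        # With dp-attn the launcher always enables EP across all $TP ranks,
        # so any label other than ep == tp would misdescribe the run.
        if self.dp_attn and self.ep != self.tp:
            raise ValueError(f"dp-attn requires ep == tp, got ep={self.ep}, tp={self.tp}")


def is_valid(tp: int, ep: int, dp_attn: bool) -> bool:
    """Return True if the entry passes both checks."""
    try:
        SearchSpaceEntrySketch(tp=tp, ep=ep, dp_attn=dp_attn)
    except ValueError:
        return False
    return True
```

With a check like this, the flagged row (tp: 2, ep: 4, dp-attn: true) would be rejected at sweep-generation time, while both candidate fixes would pass.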

Uniqueness / typo signal

Across the entire nvidia-master.yaml, this is the only entry with ep > tp and ep != 1. Every other dp-attn: true row has ep == tp (2/2, 4/4, 8/8). Combined with the row immediately above (tp: 4, ep: 4, dp-attn: true), the natural read is a typo where either tp should be 4 or ep should be 2.
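
The outlier scan described above can be sketched as a filter over search-space rows. The rows below are only the four entries quoted in this comment thread, written as Python dicts to avoid assuming a YAML library; `suspicious_rows` is a hypothetical helper, not something in the repository.

```python
rows = [
    {"tp": 2, "ep": 2, "conc-start": 32, "conc-end": 512},
    {"tp": 4, "ep": 4, "conc-start": 128, "conc-end": 512},
    {"tp": 4, "ep": 4, "dp-attn": True, "conc-start": 256, "conc-end": 512},
    {"tp": 2, "ep": 4, "dp-attn": True, "conc-start": 64, "conc-end": 256},
]


def suspicious_rows(entries: list[dict]) -> list[dict]:
    """Return entries where ep > tp and ep != 1 (the pattern flagged above)."""
    flagged = []
    for entry in entries:
        ep = entry.get("ep", 1)  # treat a missing ep as 1 (assumption)
        if ep != 1 and ep > entry["tp"]:
            flagged.append(entry)
    return flagged


for entry in suspicious_rows(rows):
    print("check this row:", entry)
```

On these four rows only the last one is flagged, which matches the uniqueness argument made in the paragraph above.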

Fix

One-character YAML edit — pick whichever config the author actually intended:

- { tp: 4, ep: 4, dp-attn: true, conc-start: 64, conc-end: 256 }   # matches row 12949 pattern
# or
- { tp: 2, ep: 2, dp-attn: true, conc-start: 64, conc-end: 256 }   # matches every other dp-attn row

@Ankur-singh

Collaborator

/reuse-sweep-run

@Ankur-singh

Collaborator

As a PR reviewer and CODEOWNER, I have reviewed this and have:

  • Verified that as of the moment of typing this, this is the latest version of PR_REVIEW_CHECKLIST.md
  • Verified that the general code quality meets the InferenceX standard and does not make the code quality any worse.
  • Verified that this PR has passed PR validation. Please link to GitHub Action workflow that shows this. Link
  • Verified that this PR passes evals. Please link to GitHub Action workflow that shows this. Link
  • Verified that speculative decoding PRs use chat templates to align the AL distribution to real world
  • If a company claims that they support vLLM/SGLang as first class LLM inference engines on their hardware, I have verified that the respective vLLM/SGLang submission has been made before additional frameworks (TRT-LLM, ATOM, etc.). The only exceptions are for new hardware, such as MI455X UALoE72, Vera Rubin NVL72, Rubin NVL8, etc., and for new model architectures where there is an actual reason why vLLM/SGLang does not fundamentally support them yet.
  • Verified that the single-node recipes are similar to the official vLLM recipes and/or the SGLang cookbook:
    • If they are not, I have verified that a PR has been opened in vLLM recipe repo or SGLang repo and linked it below in the additional detail section:
  • If any of the above criteria cannot reasonably be satisfied, I have provided additional reasoning below.

Additional detail section:

  • Single-node vLLM AGG submission (minimaxm3-fp4-b200-vllm, nvidia/MiniMax-M3-NVFP4, B200). The upstream vLLM recipe PR vllm-project/recipes#577 adds the NVFP4 Blackwell (B200/B300) variant to the MiniMax-M3 recipe (MTP + non-MTP); this InferenceX PR updates the image tag + search space and enables FP8 KV cache and the trtllm all-reduce backend, consistent with that recipe variant.

Signed: ankur-singh

@Klaud-Cold

Collaborator

@Ankur-singh Merge is blocked on Check 3: the linked recipe does not cover the trtllm all-reduce backend this PR forces.

  • Check 0 (CODEOWNER): PASS — ankur-singh is a listed owner of .github/configs/nvidia-master.yaml; the other changed paths fall under the catch-all only.
  • Check 1 (sweep on in-PR commit): PASS — head commit 317de36b has all executed single-node */ (33) and eval / (4) check-runs green in run 28604484366.
  • Check 2 (evals): PASS — GSM8K em_strict 0.953–0.958 across 4 configs on nvidia/MiniMax-M3-NVFP4 (fp4, vLLM, B200-DGXC), and that run used this PR's image vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226.
  • Check 3 (recipe): FAIL — the link is present (vllm-project/recipes#577) and model, B200, TP/EP/DP-attention modes, NVFP4 (auto-detected from checkpoint), --kv-cache-dtype fp8, --block-size 128, and --language-model-only all match, but VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm is a kernel-selection backend that appears nowhere in the recipe and is not the default (at image commit 93d8f834, auto resolves to mnnvl on a single node). Add it to the recipe PR or drop it from the launch script. Informational only: --gpu-memory-utilization 0.95, --max-num-batched-tokens, and the sweep search-space changes are InferenceX harness tuning, not blockers.
  • Check 4 (reuse command): PASS — /reuse-sweep-run posted by Ankur-singh (COLLABORATOR).

@wzhao18

wzhao18 commented Jul 2, 2026

Collaborator Author

@wzhao18 wzhao18 changed the title from "[WIP] Update Minimax M3 FP4 vllm" to "Update Minimax M3 FP4 vllm" on Jul 2, 2026
# Conflicts:
#	perf-changelog.yaml
@adibarra adibarra merged commit b4e8176 into main Jul 2, 2026
27 checks passed
4 participants