wzhao18 · 2026-07-02T04:39:32Z

Update the MiniMax M3 FP4 vLLM configs for B200, primarily the parallel configs, and use an FP8 KV cache.

github-actions · 2026-07-02T04:39:42Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please open a PR against the vLLM recipes repo or the SGLang repo first; only then can we merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

claude · 2026-07-02T04:47:10Z

+      - { tp: 2, ep: 2, conc-start: 32, conc-end: 512 }
+      - { tp: 4, ep: 4, conc-start: 128, conc-end: 512 }
+      - { tp: 4, ep: 4, dp-attn: true, conc-start: 256, conc-end: 512 }
+      - { tp: 2, ep: 4, dp-attn: true, conc-start: 64, conc-end: 256 }


🔴 The new 8k1k entry { tp: 2, ep: 4, dp-attn: true, conc-start: 64, conc-end: 256 } at line 12950 mislabels EP: with dp-attn: true, minimaxm3_fp4_b200.sh launches vLLM with --tensor-parallel-size=1 --data-parallel-size=$TP --enable-expert-parallel and never passes EP_SIZE, so world_size = TP×1 = 2 and vLLM runs with EP=2 — but process_result.py reads EP_SIZE from env and records the result as ep=4, poisoning cross-config comparisons on the leaderboard. It is also the only entry in the file where ep > tp and ep != 1; every other dp-attn row has ep == tp. Likely a typo — either tp: 4, ep: 4, dp-attn: true (matching the sibling row on line 12949) or tp: 2, ep: 2, dp-attn: true.

Extended reasoning...

What's wrong

Line 12950 adds:

- { tp: 2, ep: 4, dp-attn: true, conc-start: 64, conc-end: 256 }

The ep: 4 label on this row is a lie: the actual runtime will use EP=2. Here is the code path.

Trigger path

1. Shell launcher — benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b200.sh:37-43:

if [ "${DP_ATTENTION}" = "true" ]; then
  PARALLEL_ARGS="--tensor-parallel-size=1 --data-parallel-size=$TP --enable-expert-parallel"
elif [ "$EP_SIZE" -gt 1 ]; then
  PARALLEL_ARGS="--tensor-parallel-size=$TP --enable-expert-parallel"
else
  PARALLEL_ARGS="--tensor-parallel-size=$TP"
fi

When DP_ATTENTION=true, the first branch is taken and EP_SIZE is never consulted: it appears only in the elif condition and is never forwarded to vLLM via --expert-parallel-size. vLLM derives EP from world_size = data-parallel-size × tensor-parallel-size = TP × 1.

2. Slurm allocation — runners/launch_b200-dgxc.sh:435 requests --gres=gpu:$TP = 2 GPUs. Only 2 physical GPUs are allocated, so EP=4 is physically impossible on this node.

3. Result labeling — utils/process_result.py:110-118 reads EP_SIZE directly from env and records it as the ep field. utils/matrix_logic/generate_sweep_configs.py:483-484 propagates the YAML ep into that env var. Nothing corrects for the dp-attn case, so the label survives verbatim.

Step-by-step proof for this entry

1. Sweep generator emits an env with TP=2, EP_SIZE=4, DP_ATTENTION=true.
2. Slurm allocates 2 GPUs (gpu:$TP).
3. Shell hits the first branch → launches vLLM with --tensor-parallel-size=1 --data-parallel-size=2 --enable-expert-parallel. EP_SIZE=4 is never referenced.
4. vLLM's world_size = 2×1 = 2, so it uses EP=2.
5. Benchmark completes. process_result.py reads EP_SIZE from env → writes ep: 4 into the result JSON.
6. Leaderboard ingests an ep=4 row that was actually an EP=2 run.
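
The path above can be modelled in a few lines of Python. This is only a sketch, not the repository's code: parallel_args mirrors the three branches quoted from minimaxm3_fp4_b200.sh, and effective_ep is a hypothetical helper that encodes the assumption stated above, namely that vLLM derives EP from data-parallel-size × tensor-parallel-size when expert parallelism is enabled.

```python
def parallel_args(tp: int, ep_size: int, dp_attention: bool) -> list[str]:
    """Model of the launcher's three branches (illustration only)."""
    if dp_attention:
        # ep_size is not consulted on this branch.
        return [
            "--tensor-parallel-size=1",
            f"--data-parallel-size={tp}",
            "--enable-expert-parallel",
        ]
    if ep_size > 1:
        return [f"--tensor-parallel-size={tp}", "--enable-expert-parallel"]
    return [f"--tensor-parallel-size={tp}"]


def effective_ep(args: list[str]) -> int:
    """Hypothetical helper: EP implied by the flags, assuming EP = DP x TP."""
    sizes = {"tensor-parallel-size": 1, "data-parallel-size": 1}
    for arg in args:
        name, _, value = arg.lstrip("-").partition("=")
        if name in sizes:
            sizes[name] = int(value)
    if "--enable-expert-parallel" not in args:
        return 1
    return sizes["tensor-parallel-size"] * sizes["data-parallel-size"]


args = parallel_args(tp=2, ep_size=4, dp_attention=True)
labeled_ep = 4  # the value process_result.py would record from EP_SIZE
actual_ep = effective_ep(args)
print(args)
print(f"labeled ep={labeled_ep}, effective ep={actual_ep}")
```

Under these assumptions the flagged row yields an effective EP of 2 while the recorded label stays at 4, which is the mismatch described in the steps.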

Why validation doesn't catch this

utils/matrix_logic/validation.py SingleNodeSearchSpaceEntry permits ep independent of tp; there is no ep <= tp or ep == tp for dp-attn constraint. So this passes generate-sweep unnoticed.
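
A constraint of the kind missing here could look roughly like the following. This is a sketch under assumptions: Entry is a hypothetical stand-in for one search-space row, not the real SingleNodeSearchSpaceEntry, and the two rules (ep must not exceed tp; dp-attn rows need ep == tp) are inferred from the pattern of the other rows rather than from any documented requirement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    """Hypothetical stand-in for one search-space row; not the repo's model."""
    tp: int
    ep: int = 1
    dp_attn: bool = False


def check_entry(entry: Entry) -> list[str]:
    """Return a list of problems; an empty list means the row passes."""
    problems = []
    if entry.ep > entry.tp:
        problems.append(f"ep={entry.ep} exceeds tp={entry.tp}")
    if entry.dp_attn and entry.ep != entry.tp:
        problems.append(f"dp-attn row has ep={entry.ep} but tp={entry.tp}")
    return problems


print(check_entry(Entry(tp=2, ep=4, dp_attn=True)))  # flagged row: two problems
print(check_entry(Entry(tp=4, ep=4, dp_attn=True)))  # sibling row: no problems
```

Returning a list of messages instead of raising on the first failure lets a sweep generator report every bad row in one pass.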

Uniqueness / typo signal

Across the entire nvidia-master.yaml, this is the only entry with ep > tp and ep != 1. Every other dp-attn: true row has ep == tp (2/2, 4/4, 8/8). Combined with the row immediately above (tp: 4, ep: 4, dp-attn: true), the natural read is a typo where either tp should be 4 or ep should be 2.
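
The same scan can be reproduced on the four rows added in this diff. The snippet below is illustrative only: the rows are copied from the hunk above as plain dictionaries, and find_outliers is a hypothetical helper, not something that exists in the repository.

```python
# The four rows from the diff hunk, transcribed as Python dictionaries.
rows = [
    {"tp": 2, "ep": 2, "conc-start": 32, "conc-end": 512},
    {"tp": 4, "ep": 4, "conc-start": 128, "conc-end": 512},
    {"tp": 4, "ep": 4, "dp-attn": True, "conc-start": 256, "conc-end": 512},
    {"tp": 2, "ep": 4, "dp-attn": True, "conc-start": 64, "conc-end": 256},
]


def find_outliers(entries):
    """Hypothetical helper: rows where ep is set, is not 1, and exceeds tp."""
    return [e for e in entries if e.get("ep", 1) != 1 and e.get("ep", 1) > e["tp"]]


outliers = find_outliers(rows)
for row in outliers:
    print(row)
```

Only the tp: 2, ep: 4 row is returned for this hunk.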

Fix

One-character YAML edit — pick whichever config the author actually intended:

- { tp: 4, ep: 4, dp-attn: true, conc-start: 64, conc-end: 256 }  # matches row 12949 pattern
# or
- { tp: 2, ep: 2, dp-attn: true, conc-start: 64, conc-end: 256 }  # matches every other dp-attn row

github-actions · 2026-07-02T06:36:15Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28565934200
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28565934200

github-actions · 2026-07-02T07:36:52Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28570491405
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28570491405

github-actions · 2026-07-02T16:08:15Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28574074488
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28574074488

github-actions · 2026-07-02T18:44:48Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28604484366
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28604484366

github-actions · 2026-07-02T20:26:13Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28604484366
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28604484366

Ankur-singh · 2026-07-02T20:27:50Z

/reuse-sweep-run

Ankur-singh · 2026-07-02T20:28:19Z

As a PR reviewer and CODEOWNER, I have reviewed this and have:

Verified that as of the moment of typing this, this is the latest version of PR_REVIEW_CHECKLIST.md
Verified that the general code quality meets the InferenceX standard and does not make the code quality any worse.
Verified that this PR has passed PR validation. Please link to GitHub Action workflow that shows this. Link
Verified that this PR passes evals. Please link to GitHub Action workflow that shows this. Link
Verified that speculative decoding PRs use chat templates to align the AL distribution with real-world usage
If a company claims that they support vLLM/SGLang as first-class LLM inference engines on their hardware, I have verified that the respective vLLM/SGLang submission has been made before additional frameworks (TRT-LLM, ATOM, etc.). The only exceptions are for new hardware, such as MI455X UALoE72, Vera Rubin NVL72, Rubin NVL8, etc., and for new model architectures where there is an actual reason why vLLM/SGLang does not fundamentally support them yet.
Verified that the single-node recipes are similar to the official vLLM recipes and/or the SGLang cookbook:
- If they are not, I have verified that a PR has been opened in vLLM recipe repo or SGLang repo and linked it below in the additional detail section:
If any of the above criteria cannot reasonably be satisfied, I have provided additional reasoning below.

Additional detail section:

Single-node vLLM AGG submission (minimaxm3-fp4-b200-vllm, nvidia/MiniMax-M3-NVFP4, B200). The upstream vLLM recipe PR vllm-project/recipes#577 adds the NVFP4 Blackwell (B200/B300) variant to the MiniMax-M3 recipe (MTP + non-MTP); this InferenceX PR updates the image tag + search space and enables FP8 KV cache and the trtllm all-reduce backend, consistent with that recipe variant.

Signed: ankur-singh

Klaud-Cold · 2026-07-02T20:32:30Z

@Ankur-singh Merge is blocked on Check 3: the linked recipe does not cover the trtllm all-reduce backend this PR forces.

Check 0 (CODEOWNER): PASS — ankur-singh is a listed owner of .github/configs/nvidia-master.yaml; the other changed paths fall under the catch-all only.
Check 1 (sweep on in-PR commit): PASS — head commit 317de36b has all executed single-node */ (33) and eval / (4) check-runs green in run 28604484366.
Check 2 (evals): PASS — GSM8K em_strict 0.953–0.958 across 4 configs on nvidia/MiniMax-M3-NVFP4 (fp4, vLLM, B200-DGXC), and that run used this PR's image vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226.
Check 3 (recipe): FAIL — the link is present (vllm-project/recipes#577) and model, B200, TP/EP/DP-attention modes, NVFP4 (auto-detected from checkpoint), --kv-cache-dtype fp8, --block-size 128, and --language-model-only all match, but VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm is a kernel-selection backend that appears nowhere in the recipe and is not the default (at image commit 93d8f834, auto resolves to mnnvl on a single node). Add it to the recipe PR or drop it from the launch script. Informational only: --gpu-memory-utilization 0.95, --max-num-batched-tokens, and the sweep search-space changes are InferenceX harness tuning, not blockers.
Check 4 (reuse command): PASS — /reuse-sweep-run posted by Ankur-singh (COLLABORATOR).

wzhao18 · 2026-07-02T20:34:52Z

# Conflicts: # perf-changelog.yaml

wzhao18 added 2 commits July 1, 2026 21:28

Add 2-gpu configs to search space

fe7d1cf

update search space

734b60a

wzhao18 requested a review from a team July 2, 2026 04:39

wzhao18 requested review from Ankur-singh, jgangani and kedarpotdar-nv as code owners July 2, 2026 04:39

github-project-automation Bot added this to InferenceMAX Board Jul 2, 2026

update

b1d1910

wzhao18 added the full-sweep-enabled label Jul 2, 2026

claude Bot reviewed Jul 2, 2026

update

514494e

wzhao18 added 2 commits July 2, 2026 00:46

update

e02cc19

update

4aab647

Merge branch 'main' into wzhao/minimax-m3-fp4-update

317de36

Ankur-singh approved these changes Jul 2, 2026

wzhao18 changed the title ~~[WIP] Update Minimax M3 FP4 vllm~~ Update Minimax M3 FP4 vllm Jul 2, 2026

Merge remote-tracking branch 'origin/main' into HEAD

f9289e4

# Conflicts: # perf-changelog.yaml

adibarra approved these changes Jul 2, 2026

adibarra merged commit b4e8176 into main Jul 2, 2026
27 checks passed

github-project-automation Bot moved this to Done in InferenceMAX Board Jul 2, 2026

adibarra deleted the wzhao/minimax-m3-fp4-update branch July 2, 2026 23:23

This was referenced Jul 3, 2026

[AMD] MiniMax-M3 FP4/FP8 MI355X ATOM: refactor config & add MTP recipes #2001

Merged

[WIP] Update Minimax M3 FP4 B300 Eagle #2006

Open
