[NV] perf: update MiniMax-M3 FP4 B300 vLLM by anish-shanbhag · Pull Request #1990 · SemiAnalysisAI/InferenceX · GitHub

[NV] perf: update MiniMax-M3 FP4 B300 vLLM #1990

Merged
adibarra merged 5 commits into main from codex/minimax-m3-b300-fp4-vllm-update
Jul 2, 2026

Conversation

@anish-shanbhag

Collaborator

No description provided.

@github-actions

github-actions Bot commented Jul 2, 2026

Contributor

Comment thread perf-changelog.yaml Outdated
Comment on lines +4430 to +4435
- config-keys:
    - minimaxm3-fp4-b300-vllm
  description:
    - "Update Minimax M3 b300 vllm image tag"
    - "Update search space to cover more configs"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/tree/codex/minimax-m3-b300-fp4-vllm-update


🟡 The new perf-changelog.yaml entry has two issues: (1) pr-link uses a branch URL (tree/codex/minimax-m3-b300-fp4-vllm-update) instead of the canonical pull/1990 form used by every other entry in the file — this will 404 once the branch is deleted after merge; (2) the description says "Update search space to cover more configs", but the diff actually narrows the sweep (isl 1024: 7→4 entries, isl 8192: 6→4 entries, all TP8/EP8 and TP4/EP4 lanes dropped). Consider phrasing similar to the neighboring dsr1-fp4-b200-sglang entry (e.g. "Refocus/narrow the search space and drop TP8/EP8 and TP4/EP4 sweeps").
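
The first issue lends itself to a mechanical check. The following is a minimal sketch, not part of the repository: it scans changelog text for `pr-link:` lines and reports any that do not match the `pull/<num>` form the review describes as canonical. The function name `find_noncanonical_pr_links`, the regular expressions, and the `example-key` entry in the sample are assumptions made for illustration; the review itself notes that no CI schema check currently enforces the URL form.

```python
import re

# Assumption: the canonical form is the pull/<num> URL named in the review.
CANONICAL = re.compile(r"^https://github\.com/SemiAnalysisAI/InferenceX/pull/\d+$")
PR_LINK_LINE = re.compile(r"^\s*pr-link:\s*(\S+)\s*$")


def find_noncanonical_pr_links(changelog_text):
    """Return (line_number, url) pairs whose pr-link is not a pull/<num> URL."""
    bad = []
    for lineno, line in enumerate(changelog_text.splitlines(), start=1):
        match = PR_LINK_LINE.match(line)
        if match and not CANONICAL.match(match.group(1)):
            bad.append((lineno, match.group(1)))
    return bad


# The first entry mirrors the one under review; the second is a hypothetical
# entry that already uses the canonical form.
sample = """\
- config-keys:
    - minimaxm3-fp4-b300-vllm
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/tree/codex/minimax-m3-b300-fp4-vllm-update
- config-keys:
    - example-key
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1990
"""

for lineno, url in find_noncanonical_pr_links(sample):
    print(f"line {lineno}: non-canonical pr-link {url}")
```

A line-based regex is used instead of a YAML parser so the sketch has no third-party dependency; a real check would more likely parse the file and validate each entry's `pr-link` field.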

Extended reasoning...

Issue 1: branch URL in pr-link

The new entry at perf-changelog.yaml line 4435 sets:

    pr-link: https://github.com/SemiAnalysisAI/InferenceX/tree/codex/minimax-m3-b300-fp4-vllm-update

Of the ~554 pr-link values in this file, this is the only one that points to a branch tree/... URL; all others use the canonical https://github.com/SemiAnalysisAI/InferenceX/pull/<num> form (see the four most recent entries at lines 4402–4428 for examples). Since branches are typically deleted after merge, this link will 404 shortly after this PR lands. The correct value is https://github.com/SemiAnalysisAI/InferenceX/pull/1990.

Issue 2: description contradicts the diff

The entry claims:

    - "Update search space to cover more configs"

But the diff clearly narrows the sweep. Step-by-step count from the diff on .github/configs/nvidia-master.yaml:

isl 1024, osl 1024:
- Before (7 entries): {tp:8, c1-64}, {tp:8/ep:8, c1-512}, {tp:4, c1-64}, {tp:4/ep:4, c64-512}, {tp:4/ep:4/dp-attn, c128-512}, {tp:2/ep:2, c16-128}, {tp:8/ep:8/dp-attn, c256-1024}
- After (4 entries): {tp:8, c1-2}, {tp:4, c2-2}, {tp:2, c4-256}, {tp:2/ep:2/dp-attn, c512-1024}

isl 8192, osl 1024:
- Before (6 entries): {tp:8, c1-64}, {tp:8/ep:8, c1-512}, {tp:4, c1-128}, {tp:4/ep:4, c64-256}, {tp:4/ep:4/dp-attn, c64-128}, {tp:8/ep:8/dp-attn, c128-256}
- After (4 entries): {tp:8, c1-2}, {tp:4, c2-2}, {tp:2, c4-256}, {tp:2/ep:2/dp-attn, c512-1024}

Every TP8/EP8 and TP4/EP4 lane is dropped, and the TP8/TP4 lanes are drastically clipped (tp:8 from c1-64 down to c1-2; tp:4 from c1-64 down to a single c2 point). While tp:2 and tp:2/ep:2/dp-attn are genuinely new, the aggregate coverage is significantly smaller (roughly 38 concurrency points → ~12 at isl 1024). Calling this "cover more configs" is the opposite of what the change actually does.

Impact

Both issues are purely documentation quality: no runtime effect, and no CI schema check enforces the URL form. But the perf-changelog is a user-facing history document that people read to understand why a config changed; a stale link plus a misleading description defeats that purpose.

Fix

    - config-keys:
        - minimaxm3-fp4-b300-vllm
      description:
        - "Update Minimax M3 b300 vllm image tag"
        - "Refocus/narrow the search space and drop TP8/EP8 and TP4/EP4 sweeps."
      pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1990

The neighboring dsr1-fp4-b200-sglang entry at lines 4422–4428 uses similar honest phrasing when reducing coverage ("Drop the TP8/EP8 single-concurrency points.").
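
The "roughly 38 concurrency points → ~12" figure in the review can be reproduced with a short tally. This is a sketch under one loudly stated assumption: each cN-M range is swept in powers of two (N, 2N, 4N, ... up to M). The review does not say how the ranges are expanded, so the stepping rule, the helper names, and the tuple encoding of the lanes are all illustrative rather than taken from the repository.

```python
def pow2_points(lo, hi):
    """Count concurrency values lo, 2*lo, 4*lo, ... that do not exceed hi.

    Assumption: sweeps step by powers of two; the source does not confirm this.
    """
    count = 0
    c = lo
    while c <= hi:
        count += 1
        c *= 2
    return count


def total_points(lanes):
    """Sum the point counts over a list of (conc_start, conc_end) lanes."""
    return sum(pow2_points(lo, hi) for lo, hi in lanes)


# (conc_start, conc_end) per lane, copied from the review's before/after lists.
BEFORE_ISL_1024 = [(1, 64), (1, 512), (1, 64), (64, 512),
                   (128, 512), (16, 128), (256, 1024)]
BEFORE_ISL_8192 = [(1, 64), (1, 512), (1, 128), (64, 256),
                   (64, 128), (128, 256)]
AFTER = [(1, 2), (2, 2), (4, 256), (512, 1024)]  # same for both isl values

print("isl 1024:", total_points(BEFORE_ISL_1024), "->", total_points(AFTER))
print("isl 8192:", total_points(BEFORE_ISL_8192), "->", total_points(AFTER))
```

Under the power-of-two assumption the tally gives 38 points before and 12 after at isl 1024, matching the review's estimate, and 32 before and 12 after at isl 8192.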

@@ -12809,7 +12809,7 @@ minimaxm3-fp8-b300-vllm:
# weights are pre-staged read-only at /scratch/models/MiniMax-M3-NVFP4 (added to
# the STAGED_MODELS allow-list in launch_b300-nv.sh).
minimaxm3-fp4-b300-vllm:
- image: vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41
+ image: vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226


🟡 The block comment just above this line still says the NVFP4 support 'is baked into the perf container image', but the PR swaps the image from the bespoke vllm-minimax-m3-perf-* tag to a mainline nightly-* tag. Update the comment to say the support has landed in vLLM main and is picked up from nightly so future readers don't chase a nonexistent perf image. Note: the same wording appears at lines 12839-12840 for the EAGLE3 variant (whose image is NOT changed in this PR), so that comment remains accurate and should be left as-is.

Extended reasoning...

What the bug is

At .github/configs/nvidia-master.yaml lines 12805-12810, immediately above the minimaxm3-fp4-b300-vllm entry, there is a block comment that states:

MiniMax-M3 modelopt NVFP4 support (vllm-project/vllm PR #46380) is baked into the perf container image, so no runtime patch is needed.

That wording was true when the config used the bespoke perf-tagged image vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41. This PR replaces that tag with vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226 — a mainline nightly image, not a perf-variant image. After the swap, the phrase "perf container image" no longer describes what is actually being pulled.

Why the invariant still holds, but the comment is misleading

The underlying claim — "no runtime patch is needed" — is still true, because vllm-project/vllm PR #46380 has presumably landed in vLLM main and the nightly image picks it up. So there is no runtime effect and nothing breaks at benchmark time. The issue is purely that the comment cites the wrong reason: a future reader inspecting this entry will look for a "perf container image" that no longer exists in the config and be confused.

Step-by-step proof

  1. Before this PR, line 12812 reads image: vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41. The tag literally contains minimax-m3-perf, matching "perf container image".
  2. This PR changes line 12812 to image: vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226. The tag is nightly-<sha>; there is no "perf" in the tag name.
  3. Lines 12805-12810 (comment above the entry) still literally say baked into the perf container image, unchanged by this PR.
  4. Therefore the comment and the code disagree about which image class is in use.
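
The four steps above amount to a string comparison between the comment and the image tag. A minimal sketch of that comparison follows; the function name `comment_matches_image` and the substring heuristic (look for "perf" in the tag whenever the comment cites a "perf container image") are assumptions for illustration, not an existing check in the repository.

```python
def comment_matches_image(comment, image):
    """Return False when the comment cites a 'perf container image'
    but the image tag does not contain 'perf'.

    Hypothetical heuristic: it only covers the specific wording
    discussed in this review.
    """
    # Everything after the first ':' is treated as the tag.
    tag = image.split(":", 1)[1] if ":" in image else ""
    cites_perf_image = "perf container image" in comment
    return (not cites_perf_image) or ("perf" in tag)


COMMENT = ("MiniMax-M3 modelopt NVFP4 support (vllm-project/vllm PR #46380) "
           "is baked into the perf container image, so no runtime patch is needed.")
OLD_IMAGE = "vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41"
NEW_IMAGE = "vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226"

print("old image consistent:", comment_matches_image(COMMENT, OLD_IMAGE))
print("new image consistent:", comment_matches_image(COMMENT, NEW_IMAGE))
```

With the old perf-tagged image the comment and tag agree; with the nightly tag the check flags the mismatch described in step 4.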

Fix

Update the comment to reflect that MiniMax-M3 NVFP4 support has landed upstream and is now picked up from the mainline nightly image, e.g.:

MiniMax-M3 modelopt NVFP4 support (vllm-project/vllm PR #46380) has landed in vLLM main and is picked up from the nightly image below, so no runtime patch is needed.

Scope note

The same wording appears at lines 12839-12840 above the EAGLE3 variant (minimaxm3-fp4-eagle3-b300-vllm), which does not change its image in this PR. That comment remains accurate and should be left as-is; only the copy above minimaxm3-fp4-b300-vllm (lines 12805-12810) needs updating.

Severity

This is a documentation-only inconsistency with no runtime effect on the benchmark. Marking as nit.

@github-actions

github-actions Bot commented Jul 2, 2026

Contributor

@Ankur-singh

Collaborator

/reuse-sweep-run

@Ankur-singh

Collaborator

As a PR reviewer and CODEOWNER, I have reviewed this and have:

  • Verified that as of the moment of typing this, this is the latest version of PR_REVIEW_CHECKLIST.md
  • Verified that the general code quality meets the InferenceX standard and does not make the code quality any worse.
  • Verified that this PR has passed PR validation. Please link to GitHub Action workflow that shows this. Link
  • Verified that this PR passes evals. Please link to GitHub Action workflow that shows this. Link
  • Verified that speculative decoding PRs use chat templates to align the AL distribution with the real world
  • If a company claims that they support vLLM/SGLang as first class LLM inference engines on their hardware, I have verified that the respective vLLM/SGLang submission has been made before additional frameworks (TRT-LLM, ATOM, etc.). The only exceptions are for new hardware, such as MI455X UALoE72, Vera Rubin NVL72, Rubin NVL8, etc., and for new model architectures where there is an actual reason why vLLM/SGLang does not fundamentally support them yet.
  • Verified that the single-node recipes are similar to the official vLLM recipes and/or the SGLang cookbook:
    • If they are not, I have verified that a PR has been opened in vLLM recipe repo or SGLang repo and linked it below in the additional detail section:
  • If any of the above criteria cannot reasonably be satisfied, I have provided additional reasoning below.

Additional detail section:

  • Single-node vLLM AGG submission (minimaxm3-fp4-b300-vllm, nvidia/MiniMax-M3-NVFP4, B300). The upstream vLLM recipe PR vllm-project/recipes#577 adds the NVFP4 Blackwell (B200/B300) variant to the MiniMax-M3 recipe (MTP + non-MTP); this InferenceX PR updates the image tag + search space and enables FP8 KV cache and the trtllm all-reduce backend, consistent with that recipe variant.

Signed: ankur-singh

@Klaud-Cold

Collaborator

anish-shanbhag changed the title from "[WIP] perf: update MiniMax-M3 FP4 B300 vLLM" to "[NV] perf: update MiniMax-M3 FP4 B300 vLLM" on Jul 2, 2026
adibarra merged commit 352d0ce into main on Jul 2, 2026
26 checks passed
adibarra deleted the codex/minimax-m3-b300-fp4-vllm-update branch on July 2, 2026 23:05

4 participants