[WIP] Update Minimax M3 FP4 B300 Eagle by wzhao18 · Pull Request #2006 · SemiAnalysisAI/InferenceX · GitHub
Skip to content

[WIP] Update Minimax M3 FP4 B300 Eagle#2006

Open
wzhao18 wants to merge 5 commits into
mainfrom
wzhao/m3-fp4-b300-eagle
Open

[WIP] Update Minimax M3 FP4 B300 Eagle#2006
wzhao18 wants to merge 5 commits into
mainfrom
wzhao/m3-fp4-b300-eagle

Conversation

@wzhao18

@wzhao18 wzhao18 commented Jul 3, 2026

Copy link
Copy Markdown
Collaborator

No description provided.

@wzhao18 wzhao18 requested a review from a team July 3, 2026 18:14
@wzhao18 wzhao18 requested a review from Ankur-singh as a code owner July 3, 2026 18:14
@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

# launch_b300-nv.sh); the EAGLE3 draft is downloaded to the writable models dir.
minimaxm3-fp4-b300-vllm-mtp:
image: vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41
image: vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Missing perf-changelog.yaml entry — per AGENTS.md line 126, an image-tag bump combined with env-var/config-param changes requires an appended perf-changelog.yaml entry to trigger the full-sweep re-benchmark. Without it, the sweep won't re-select minimaxm3-fp4-b300-vllm-mtp on merge and the new image / spec-token map never actually get measured. Marking as this may just be pending the [WIP] → ready transition, but flagging so it isn't missed at merge time.

Extended reasoning...

What the bug is

AGENTS.md:126 ("Updating Docker images" section) explicitly states:

Update the image tag in the relevant configs/*-master.yaml and/or benchmarks/*.sh, update any related env vars / config params, and append a perf-changelog.yaml entry (required — triggers benchmarks).

This PR does all three of the qualifying trigger conditions but touches only 2 files (configs/nvidia-master.yaml and benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b300_mtp.sh) — no perf-changelog.yaml entry is appended.

How it manifests / step-by-step proof

  1. On merge, the full-sweep driver reads perf-changelog.yaml to decide which configs need re-benchmarking (per AGENTS.md:54).
  2. The last perf-changelog.yaml entry covering this recipe key would be for minimaxm3-fp4-b300-vllm-mtp — nothing new is appended in this diff.
  3. Sweep sees no new changelog entry ⇒ does not re-select this config ⇒ no benchmark is produced for the new image (nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226), the new VLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllm env var, the raised --gpu-memory-utilization 0.95, the restructured search space, or the per-op-point NUM_SPEC_TOKENS_MAP.
  4. The whole purpose of the PR — measuring minimaxm3-fp4-b300-vllm-mtp with the new EAGLE3-tuned config — never runs.

Why existing code doesn't cover it

There is no wildcard changelog entry that catches this recipe key; the closest existing entry is for minimaxm3-fp4-b300-vllm (the non-mtp sibling from #1990) which does not match minimaxm3-fp4-b300-vllm-mtp.

Precedent from sibling PRs

Every recent sibling recipe-update PR that bumped an image and/or reshuffled the search space appended a perf-changelog.yaml entry: #1990 (non-mtp b300), #1978 (b200 variant), #1962, #1966, and #2001. This PR is the only outlier so far.

How to fix

Append a perf-changelog.yaml entry naming minimaxm3-fp4-b300-vllm-mtp with a one-liner describing the changes (new nightly image, TRT-LLM allreduce backend, gpu-mem-util 0.90 → 0.95, per-op-point spec-token map, refined search space). Once that lands the full-sweep will re-select this config on merge and the intended benchmarks will run.

Comment on lines +71 to +72
# Key: ISL:TP:EP_SIZE:DP_ATTENTION:CONC (exactly the env vars the launcher sets;
# for dp-attn configs TP holds the data-parallel size, as in PARALLEL_ARGS below).

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The new comment at line 72 says as in PARALLEL_ARGS below, but the PARALLEL_ARGS if/elif/else block that sets --data-parallel-size=$TP is at lines 57-63, immediately above this comment. Trivial s/below/above/ (or drop the directional word).

Extended reasoning...

What the bug is. In the newly-added block introducing NUM_SPEC_TOKENS_MAP, the comment ends with:

... for dp-attn configs TP holds the data-parallel size, as in PARALLEL_ARGS below).

The referent — the PARALLEL_ARGS if/elif/else block that actually demonstrates the TP-as-data-parallel-size relationship — is not below this comment; it is above it.

Proof by walkthrough. In benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b300_mtp.sh after this PR:

  • Lines 57–63 contain the if [ "${DP_ATTENTION}" = "true" ]; then PARALLEL_ARGS="--tensor-parallel-size=1 --data-parallel-size=$TP --enable-expert-parallel" ... block. Line 58 in particular is what the comment is pointing at (it uses $TP as the data-parallel size when DP_ATTENTION=true).
  • Lines 65–71 are the new explanatory comment.
  • Line 72 is the offending line: ... as in PARALLEL_ARGS below).
  • Line 73 begins declare -A NUM_SPEC_TOKENS_MAP=(.
  • The only occurrence of the token PARALLEL_ARGS after the comment is the bare $PARALLEL_ARGS expansion in the vllm serve invocation (~line 105), which does not itself illustrate the TP-as-data-parallel-size definition.

So the definition being referenced sits ~9 lines above the comment, and the only PARALLEL_ARGS mention below is a bare variable expansion that carries no such demonstration. below is directionally wrong.

Impact. Zero runtime impact — this is a pure documentation error in a freshly-written comment. It only matters because a future reader following the reference (as in PARALLEL_ARGS below) will scan downward and hit $PARALLEL_ARGS in the vllm serve line, which does not explain why TP names the DP size — momentary confusion, then they scroll up. Worth flagging while the author is right there in this region, but not merge-blocking.

Suggested fix. Change below to above, or drop the directional word entirely (as in PARALLEL_ARGS). Speculative cause: the comment was likely drafted expecting the map declaration to sit above the PARALLEL_ARGS block, then the block ordering was swapped during authoring but the wording wasn't updated.

# launch_b300-nv.sh); the EAGLE3 draft is downloaded to the writable models dir.
minimaxm3-fp4-b300-vllm-mtp:
image: vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41
image: vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The rationale comments above and inside this recipe are now stale after this PR: (a) the yaml header at lines 12625-12632 still claims the recipe uses "3 speculative tokens" and that NVFP4 support is "baked into the perf container image", and (b) the shell-script header at benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b300_mtp.sh lines 3-7 repeats the "perf container image" phrasing. Both are stale after this diff swapped the perf-branded image for a generic nightly and replaced the constant NUM_SPEC_TOKENS=3 with a per-operating-point NUM_SPEC_TOKENS_MAP; update both to reference the new image (e.g. note PR #46380 is now merged upstream and included in commit 93d8f83) and the varying token count.

Extended reasoning...

This PR makes two changes whose rationale comments were left behind, and they show up in two adjacent locations.

1. The yaml header at configs/nvidia-master.yaml lines 12625-12632 (immediately above the modified minimaxm3-fp4-b300-vllm-mtp: key at 12633) says the recipe pairs the target with "the Inferact/MiniMax-M3-EAGLE3 draft head (3 speculative tokens)" and that MiniMax-M3 modelopt NVFP4 support is "baked into the perf container image". After this PR both claims are stale:

  • The image on line 12634 is bumped from vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41 to nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226. There is no longer a "perf container image" — this recipe is on a generic upstream nightly.
  • The recipe shell replaces the constant NUM_SPEC_TOKENS=3 with a NUM_SPEC_TOKENS_MAP keyed on ISL:TP:EP_SIZE:DP_ATTENTION:CONC that yields 1, 2, or 4 tokens for many operating points (3 is only the default fallback).

2. The shell-script header at benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b300_mtp.sh lines 3-7 (outside the diff window but still present in the file after this PR) also says MiniMax-M3 modelopt NVFP4 support (vllm-project/vllm PR #46380) is baked into the perf container image, so no runtime patch is needed. This rationale was written to justify why the FP4 recipe skips the inline MSA-contiguity patch that its FP8 sibling uses; the whole justification hinges on "the perf container image" still existing here, which is no longer true.

Impact. Neither comment affects runtime behavior — the recipe still runs. But two failure modes for future readers are worth avoiding:

  • Someone hunts for "the perf container image" (e.g. to reproduce or rebase) and finds nothing, because it's been replaced with a generic upstream nightly.
  • Someone assumes the FP4 recipe still relies on a custom-baked NVFP4 patch (and therefore treats it as fragile / non-portable), when in fact PR #46380 is now available in the upstream nightly and no baking is required.

Fix. Update the yaml header to (a) drop or rephrase "baked into the perf container image" — e.g. "vllm PR #46380 is now merged upstream and included in nightly commit 93d8f83" — and (b) replace "(3 speculative tokens)" with a note that the count varies per operating point (1..4) via NUM_SPEC_TOKENS_MAP in the recipe shell. Apply the same image-rationale update to the .sh header at lines 3-7.

Step-by-step proof.

  1. Read configs/nvidia-master.yaml lines 12625-12634. The comment block above the key still says "3 speculative tokens" and "baked into the perf container image".
  2. Look at line 12634 in the diff — the image tag has changed from vllm-minimax-m3-perf-... to nightly-93d8f83.... No "perf" branding remains.
  3. Read benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b300_mtp.sh in this PR. Lines 3-7 (comment header) are unchanged and still say "baked into the perf container image".
  4. Look at lines ~65-96 of the same file after the diff: NUM_SPEC_TOKENS=3 is gone; the value now comes from NUM_SPEC_TOKENS_MAP with 3 as a default fallback and many keys set to 1, 2, or 4.
  5. Conclude both rationale comments describe a state that no longer exists after this PR.

Note on the duplicate objection. One refutation argued this collapses into a single .sh-header nit. Kept as a merged finding because the yaml header carries a distinct stale claim ("3 speculative tokens") that the .sh header does not, and it lives in a different physical comment block; combining them makes it likelier the author touches both while updating rationale rather than only the .sh.

@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

2 similar comments
@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

@github-actions

github-actions Bot commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

1 participant