[WIP] Update Minimax M3 FP4 B300 Eagle#2006
Conversation
| # launch_b300-nv.sh); the EAGLE3 draft is downloaded to the writable models dir. | ||
| minimaxm3-fp4-b300-vllm-mtp: | ||
| image: vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41 | ||
| image: vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226 |
There was a problem hiding this comment.
🔴 Missing perf-changelog.yaml entry — per AGENTS.md line 126, an image-tag bump combined with env-var/config-param changes requires an appended perf-changelog.yaml entry to trigger the full-sweep re-benchmark. Without it, the sweep won't re-select minimaxm3-fp4-b300-vllm-mtp on merge and the new image / spec-token map never actually get measured. Marking as this may just be pending the [WIP] → ready transition, but flagging so it isn't missed at merge time.
Extended reasoning...
What the bug is
AGENTS.md:126 ("Updating Docker images" section) explicitly states:
Update the image tag in the relevant
configs/*-master.yamland/orbenchmarks/*.sh, update any related env vars / config params, and append aperf-changelog.yamlentry (required — triggers benchmarks).
This PR does all three of the qualifying trigger conditions but touches only 2 files (configs/nvidia-master.yaml and benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b300_mtp.sh) — no perf-changelog.yaml entry is appended.
How it manifests / step-by-step proof
- On merge, the full-sweep driver reads
perf-changelog.yamlto decide which configs need re-benchmarking (perAGENTS.md:54). - The last
perf-changelog.yamlentry covering this recipe key would be forminimaxm3-fp4-b300-vllm-mtp— nothing new is appended in this diff. - Sweep sees no new changelog entry ⇒ does not re-select this config ⇒ no benchmark is produced for the new image (
nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226), the newVLLM_FLASHINFER_ALLREDUCE_BACKEND=trtllmenv var, the raised--gpu-memory-utilization 0.95, the restructured search space, or the per-op-pointNUM_SPEC_TOKENS_MAP. - The whole purpose of the PR — measuring
minimaxm3-fp4-b300-vllm-mtpwith the new EAGLE3-tuned config — never runs.
Why existing code doesn't cover it
There is no wildcard changelog entry that catches this recipe key; the closest existing entry is for minimaxm3-fp4-b300-vllm (the non-mtp sibling from #1990) which does not match minimaxm3-fp4-b300-vllm-mtp.
Precedent from sibling PRs
Every recent sibling recipe-update PR that bumped an image and/or reshuffled the search space appended a perf-changelog.yaml entry: #1990 (non-mtp b300), #1978 (b200 variant), #1962, #1966, and #2001. This PR is the only outlier so far.
How to fix
Append a perf-changelog.yaml entry naming minimaxm3-fp4-b300-vllm-mtp with a one-liner describing the changes (new nightly image, TRT-LLM allreduce backend, gpu-mem-util 0.90 → 0.95, per-op-point spec-token map, refined search space). Once that lands the full-sweep will re-select this config on merge and the intended benchmarks will run.
| # Key: ISL:TP:EP_SIZE:DP_ATTENTION:CONC (exactly the env vars the launcher sets; | ||
| # for dp-attn configs TP holds the data-parallel size, as in PARALLEL_ARGS below). |
There was a problem hiding this comment.
🟡 The new comment at line 72 says as in PARALLEL_ARGS below, but the PARALLEL_ARGS if/elif/else block that sets --data-parallel-size=$TP is at lines 57-63, immediately above this comment. Trivial s/below/above/ (or drop the directional word).
Extended reasoning...
What the bug is. In the newly-added block introducing NUM_SPEC_TOKENS_MAP, the comment ends with:
... for dp-attn configs TP holds the data-parallel size, as in PARALLEL_ARGS below).
The referent — the PARALLEL_ARGS if/elif/else block that actually demonstrates the TP-as-data-parallel-size relationship — is not below this comment; it is above it.
Proof by walkthrough. In benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b300_mtp.sh after this PR:
- Lines 57–63 contain the
if [ "${DP_ATTENTION}" = "true" ]; then PARALLEL_ARGS="--tensor-parallel-size=1 --data-parallel-size=$TP --enable-expert-parallel" ...block. Line 58 in particular is what the comment is pointing at (it uses$TPas the data-parallel size whenDP_ATTENTION=true). - Lines 65–71 are the new explanatory comment.
- Line 72 is the offending line:
... as in PARALLEL_ARGS below). - Line 73 begins
declare -A NUM_SPEC_TOKENS_MAP=(. - The only occurrence of the token
PARALLEL_ARGSafter the comment is the bare$PARALLEL_ARGSexpansion in thevllm serveinvocation (~line 105), which does not itself illustrate theTP-as-data-parallel-size definition.
So the definition being referenced sits ~9 lines above the comment, and the only PARALLEL_ARGS mention below is a bare variable expansion that carries no such demonstration. below is directionally wrong.
Impact. Zero runtime impact — this is a pure documentation error in a freshly-written comment. It only matters because a future reader following the reference (as in PARALLEL_ARGS below) will scan downward and hit $PARALLEL_ARGS in the vllm serve line, which does not explain why TP names the DP size — momentary confusion, then they scroll up. Worth flagging while the author is right there in this region, but not merge-blocking.
Suggested fix. Change below to above, or drop the directional word entirely (as in PARALLEL_ARGS). Speculative cause: the comment was likely drafted expecting the map declaration to sit above the PARALLEL_ARGS block, then the block ordering was swapped during authoring but the wording wasn't updated.
| # launch_b300-nv.sh); the EAGLE3 draft is downloaded to the writable models dir. | ||
| minimaxm3-fp4-b300-vllm-mtp: | ||
| image: vllm/vllm-openai:vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41 | ||
| image: vllm/vllm-openai:nightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226 |
There was a problem hiding this comment.
🟡 The rationale comments above and inside this recipe are now stale after this PR: (a) the yaml header at lines 12625-12632 still claims the recipe uses "3 speculative tokens" and that NVFP4 support is "baked into the perf container image", and (b) the shell-script header at benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b300_mtp.sh lines 3-7 repeats the "perf container image" phrasing. Both are stale after this diff swapped the perf-branded image for a generic nightly and replaced the constant NUM_SPEC_TOKENS=3 with a per-operating-point NUM_SPEC_TOKENS_MAP; update both to reference the new image (e.g. note PR #46380 is now merged upstream and included in commit 93d8f83) and the varying token count.
Extended reasoning...
This PR makes two changes whose rationale comments were left behind, and they show up in two adjacent locations.
1. The yaml header at configs/nvidia-master.yaml lines 12625-12632 (immediately above the modified minimaxm3-fp4-b300-vllm-mtp: key at 12633) says the recipe pairs the target with "the Inferact/MiniMax-M3-EAGLE3 draft head (3 speculative tokens)" and that MiniMax-M3 modelopt NVFP4 support is "baked into the perf container image". After this PR both claims are stale:
- The image on line 12634 is bumped from
vllm-minimax-m3-perf-x86_64-13.0.1-8b00f41tonightly-93d8f834dd8acf33eb0e2a75b2711b628cb6e226. There is no longer a "perf container image" — this recipe is on a generic upstream nightly. - The recipe shell replaces the constant
NUM_SPEC_TOKENS=3with aNUM_SPEC_TOKENS_MAPkeyed onISL:TP:EP_SIZE:DP_ATTENTION:CONCthat yields 1, 2, or 4 tokens for many operating points (3 is only the default fallback).
2. The shell-script header at benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b300_mtp.sh lines 3-7 (outside the diff window but still present in the file after this PR) also says MiniMax-M3 modelopt NVFP4 support (vllm-project/vllm PR #46380) is baked into the perf container image, so no runtime patch is needed. This rationale was written to justify why the FP4 recipe skips the inline MSA-contiguity patch that its FP8 sibling uses; the whole justification hinges on "the perf container image" still existing here, which is no longer true.
Impact. Neither comment affects runtime behavior — the recipe still runs. But two failure modes for future readers are worth avoiding:
- Someone hunts for "the perf container image" (e.g. to reproduce or rebase) and finds nothing, because it's been replaced with a generic upstream nightly.
- Someone assumes the FP4 recipe still relies on a custom-baked NVFP4 patch (and therefore treats it as fragile / non-portable), when in fact PR #46380 is now available in the upstream nightly and no baking is required.
Fix. Update the yaml header to (a) drop or rephrase "baked into the perf container image" — e.g. "vllm PR #46380 is now merged upstream and included in nightly commit 93d8f83" — and (b) replace "(3 speculative tokens)" with a note that the count varies per operating point (1..4) via NUM_SPEC_TOKENS_MAP in the recipe shell. Apply the same image-rationale update to the .sh header at lines 3-7.
Step-by-step proof.
- Read
configs/nvidia-master.yamllines 12625-12634. The comment block above the key still says "3 speculative tokens" and "baked into the perf container image". - Look at line 12634 in the diff — the image tag has changed from
vllm-minimax-m3-perf-...tonightly-93d8f83.... No "perf" branding remains. - Read
benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_b300_mtp.shin this PR. Lines 3-7 (comment header) are unchanged and still say "baked into the perf container image". - Look at lines ~65-96 of the same file after the diff:
NUM_SPEC_TOKENS=3is gone; the value now comes fromNUM_SPEC_TOKENS_MAPwith 3 as a default fallback and many keys set to 1, 2, or 4. - Conclude both rationale comments describe a state that no longer exists after this PR.
Note on the duplicate objection. One refutation argued this collapses into a single .sh-header nit. Kept as a merged finding because the yaml header carries a distinct stale claim ("3 speculative tokens") that the .sh header does not, and it lives in a different physical comment block; combining them makes it likelier the author touches both while updating rationale rather than only the .sh.
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28677170095 |
2 similar comments
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28677170095 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28677170095 |

No description provided.