Gasoonjia · 2026-06-24T09:03:11Z

Summary:
Fuse each gemma4_31b MLP's gate_proj|up_proj into a single [2*intermediate, hidden] coalesced-int4 matmul, applied by default in the CUDA export. This issues one activation-quant + one W4A8 matvec per layer instead of two, cutting per-token launch + activation-quant overhead in the launch-bound decode path. Only Q4_K (CudaCoalescedInt4Tensor) gate/up pairs are fused; any other quant type (e.g. Q6_K) is left as two matmuls (guarded, still correct).

Next step: we will upstream this kind of operator fusion to the gemma4-31b model level when loading GGUF. #20481 is the draft PR.
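
The fusion described in the summary can be illustrated with a minimal pure-Python sketch. It is not the PR's code: the float weights stand in for the int4 tensors, the tanh-approximated GELU is an assumed activation, and `fuse_gate_up`, `mlp_fused`, and the string quant-type tags are hypothetical names used only for illustration. The point it shows is that stacking the gate and up rows into one `[2*intermediate, hidden]` weight turns two matvecs over the same input into one, with the guard falling back to the unfused path for any pair that is not Q4_K. The down projection is omitted.

```python
import math


def matvec(weight, x):
    """Row-major matrix-vector product: one dot product per weight row."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def gelu_tanh(v):
    # Tanh-approximated GELU. The activation choice is an assumption of this sketch.
    return 0.5 * v * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))


def mlp_unfused(gate_w, up_w, x):
    # Baseline: two separate matvecs over the same input.
    gate = matvec(gate_w, x)
    up = matvec(up_w, x)
    return [gelu_tanh(g) * u for g, u in zip(gate, up)]


def fuse_gate_up(gate_w, up_w, gate_kind, up_kind, fusable_kind="Q4_K"):
    """Stack gate rows on top of up rows, or return None if the guard rejects the pair.

    The quant-type strings are hypothetical tags standing in for the tensor
    subclass check mentioned in the summary.
    """
    if gate_kind != fusable_kind or up_kind != fusable_kind:
        return None  # caller keeps the two separate matmuls
    if len(gate_w) != len(up_w):
        return None
    return gate_w + up_w  # shape [2*intermediate, hidden]


def mlp_fused(fused_w, x):
    # One matvec, then split the output back into its gate and up halves.
    out = matvec(fused_w, x)
    intermediate = len(fused_w) // 2
    gate, up = out[:intermediate], out[intermediate:]
    return [gelu_tanh(g) * u for g, u in zip(gate, up)]


# Tiny deterministic example: hidden = 4, intermediate = 3.
x = [0.5, -1.0, 2.0, 0.25]
gate_w = [[0.1 * (i + j) for j in range(4)] for i in range(3)]
up_w = [[0.05 * (i - j) for j in range(4)] for i in range(3)]

fused_w = fuse_gate_up(gate_w, up_w, "Q4_K", "Q4_K")
print(mlp_fused(fused_w, x) == mlp_unfused(gate_w, up_w, x))
print(fuse_gate_up(gate_w, up_w, "Q4_K", "Q6_K"))
```

Because each output row is still the same dot product computed in the same order, the fused and unfused results in this sketch match exactly; the saving is in the number of kernel launches and activation-quant passes, not in arithmetic.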

pytorch-bot · 2026-06-24T09:03:15Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20482

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 3 New Failures, 1 Pending, 4 Unrelated Failures, 1 Unclassified Failure

As of commit 4025660 with merge base 1b726b2:

NEW FAILURES - The following jobs have failed:

pull / test-arm-backend-no-driver (test_pytest_ops_tosa) / linux-job (gh)
RuntimeError: Command docker exec -t 42ab82696ecf3cee778a305f76e6dc75fd6e4cfc4d45659ca886127e49f5bcae /exec failed with exit code 1
pull / unittest / linux / linux-job (gh)
RuntimeError: Command docker exec -t 9b7c35dea08f7d7e172d2ef7d9de93f92814edd3c9b27b2a4512c0cf36f65643 /exec failed with exit code 1
pull / unittest-editable / linux / linux-job (gh)
RuntimeError: Command docker exec -t 2b25cd9f539890778880ab9e9ff8565d1ff57db62a1365829ed1613992a6aafc /exec failed with exit code 1

UNCLASSIFIED FAILURE - DrCI could not classify the following job because the workflow did not run on the merge base. The failure may be pre-existing on trunk or introduced by this PR:

MLX / test-mlx-qwen35-moe / test-mlx-qwen35-moe (gh) (this job did not run on the merge base, so DrCI cannot tell whether the failure is pre-existing)

FLAKY - The following job failed but was likely due to flakiness present on trunk:

pull / test-qnn-testsuite-linux / test-backend-linux (qnn, models) / linux-job (gh) (matched linux rule in flaky-rules.json)
The runner has received a shutdown signal. This can happen when the runner service is stopped, or a manually started runner is canceled.

BROKEN TRUNK - The following jobs failed but were already failing on the merge base:

👉 Rebase onto the `viable/strict` branch to avoid these failures

pull / test-llama-runner-qnn-linux (fp32, qnn_16a16w, qnn) / linux-job (gh) (trunk failure)
pull / unittest / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1
pull / unittest-editable / macos / macos-job (gh) (trunk failure)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 1

This comment was automatically generated by Dr. CI and updates every 15 minutes.

linux-foundation-easycla · 2026-06-24T09:03:19Z

The committers listed above are authorized under a signed CLA.

✅ login: Gasoonjia / name: Songhao Jia (4025660)

github-actions · 2026-06-24T09:03:58Z

Summary: Fuse each gemma4_31b MLP's gate_proj|up_proj into a single [2*intermediate, hidden] coalesced-int4 matmul, applied by default in the CUDA export. This issues one activation-quant + one W4A8 matvec per layer instead of two, cutting per-token launch + activation-quant overhead in the launch-bound decode path. Only Q4_K (CudaCoalescedInt4Tensor) gate/up pairs are fused; any other quant type (e.g. Q6_K) is left as two matmuls (guarded, still correct).

Builds on the already-landed kv_len-bounded tq4_sdpa kernel + gemma4_31b call-site (kv_len + mask_is_causal), which recovered 128k decode from ~2.8 to ~43 tok/s. With both, ET gemma4_31b 128k+TurboQuant decode beats llama.cpp at every measured context (cuda_graph ON):

| ctx  | ET (tok/s) | llama (tok/s) |
|------|------------|---------------|
| 512  | 44.80      | 42.77         |
| 2K   | 43.20      | 41.97         |
| 8K   | 42.23      | 41.23         |
| 32K  | 41.64      | 40.27         |
| 127K | 38.41      | 35.97         |

TurboQuant KV compression kept; prefill restored (6-8x) with no regression; output quality preserved.

Test Plan:
- Fusion numerics: fused vs unfused MLP through the real W4A8 int4_plain_mm kernel = bit-exact (max_abs_diff 0.0, cos 1.000000) for decode (T=1) and prefill (T=4).
- Export + run: fused module exported via CudaPartitioner and executed through executor_runner (RC=0, cos 0.999915 vs eager). Full 31B export logs "Fused gate+up on 60 MLP layers".
- Decode A/B (gemma4_31b 128k+TQ, cuda_graph ON, 5x median): table above; beats llama.cpp at 512 -> 127K. nsys: tq4_sdpa 91.7% -> 2.9% of decode.
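
The test plan reports `max_abs_diff` and cosine similarity between fused and unfused outputs, and the table compares decode throughput. As a rough sketch of how such numbers are typically computed (the helper names are illustrative, not taken from the PR's test code, and the sample vectors are made up), the two metrics and the per-context ET/llama.cpp ratio could look like this:

```python
import math


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length vectors."""
    return max(abs(x - y) for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; close to 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up outputs: a bit-exact pair and a slightly perturbed pair.
reference = [0.5, -1.25, 2.0, 0.75]
bit_exact = list(reference)
perturbed = [0.5, -1.24, 2.01, 0.75]

print(max_abs_diff(reference, bit_exact), cosine_similarity(reference, bit_exact))
print(max_abs_diff(reference, perturbed), cosine_similarity(reference, perturbed))

# Decode throughput from the table above: ctx -> (ET tok/s, llama.cpp tok/s).
decode_tok_s = {
    "512": (44.80, 42.77),
    "2K": (43.20, 41.97),
    "8K": (42.23, 41.23),
    "32K": (41.64, 40.27),
    "127K": (38.41, 35.97),
}
speedup = {ctx: et / llama for ctx, (et, llama) in decode_tok_s.items()}
for ctx, ratio in speedup.items():
    print(f"{ctx:>5}: ET/llama.cpp = {ratio:.3f}")
```

A `max_abs_diff` of exactly 0.0 is the stronger of the two checks; cosine similarity is the more forgiving one and suits the export-vs-eager comparison, where small numeric differences are expected.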

meta-cla Bot added the CLA Signed label Jun 24, 2026

Gasoonjia temporarily deployed to cadence June 24, 2026 09:03 — with GitHub Actions Inactive

Gasoonjia force-pushed the gemma4_31b-cuda-decode-speedup branch from 8b145b5 to 1c371e2 Compare June 24, 2026 10:00

Gasoonjia temporarily deployed to cadence June 24, 2026 10:00 — with GitHub Actions Inactive

Gasoonjia changed the base branch from main to gemma4_31b_export_under_32gb June 24, 2026 10:01

mergennachin approved these changes Jun 24, 2026

Gasoonjia force-pushed the gemma4_31b-cuda-decode-speedup branch from 1c371e2 to 4025660 Compare June 25, 2026 17:23

Gasoonjia had a problem deploying to cadence June 25, 2026 17:25 — with GitHub Actions Error

Gasoonjia temporarily deployed to cadence June 25, 2026 17:25 — with GitHub Actions Inactive

[executorch][cuda] fuse gate/up MLP projections #20482

Gasoonjia wants to merge 1 commit into gemma4_31b_export_under_32gb from gemma4_31b-cuda-decode-speedup

github-actions Bot commented Jun 24, 2026

This PR needs a `release notes:` label
