webgpu: merge batchA into M dimension when batchB==1 by xhcao · Pull Request #28197 · microsoft/onnxruntime · GitHub

webgpu: merge batchA into M dimension when batchB==1 #28197

Open
xhcao wants to merge 1 commit into microsoft:main from xhcao:merge-all-outer-dims


@xhcao commented Apr 23, 2026

When M is small and batchA is large, each tile contains some invalid elements. Merging batchA into the M dimension reduces the workgroup count.

Description

Motivation and Context

When M is small and batchA is large, each tile contains some invalid
elements. Merging batchA into the M dimension reduces the
workgroup count.
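
As a rough illustration of this motivation, the sketch below counts workgroups for a tiled dispatch before and after the merge. The helpers `CeilDiv`, `WorkgroupCount` and `TileRowUtilization`, the one-workgroup-per-output-tile model, and the 32x32 tile size used in the note below are all assumptions made for illustration; they are not taken from the WebGPU provider.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: integer ceiling division for a >= 0, b > 0.
inline std::int64_t CeilDiv(std::int64_t a, std::int64_t b) {
  assert(a >= 0 && b > 0);
  return (a + b - 1) / b;
}

// Assumed dispatch model: one workgroup per (tile_m x tile_n) output tile,
// repeated for every batch.
inline std::int64_t WorkgroupCount(std::int64_t batch, std::int64_t m, std::int64_t n,
                                   std::int64_t tile_m, std::int64_t tile_n) {
  return batch * CeilDiv(m, tile_m) * CeilDiv(n, tile_n);
}

// Fraction of dispatched tile rows that map to real rows of the output.
// Values below 1.0 mean some tile rows fall outside the matrix (invalid elements).
inline double TileRowUtilization(std::int64_t m, std::int64_t tile_m) {
  return static_cast<double>(m) / static_cast<double>(CeilDiv(m, tile_m) * tile_m);
}
```

Under these assumptions, batchA = 64, M = 2, N = 128 gives `WorkgroupCount(64, 2, 128, 32, 32) == 256` with only 2 of every 32 tile rows in use, while the merged form `WorkgroupCount(1, 64 * 2, 128, 32, 32) == 16` uses every tile row.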
@guschmue added the ep:WebGPU (ort-web webgpu provider) label Apr 23, 2026
@guschmue requested a review from Copilot April 23, 2026 16:09

Copilot AI left a comment

Pull request overview

This PR updates the WebGPU MatMul implementation to flatten A’s batch dimensions into the effective M dimension when B has no batching (batchB==1), aiming to reduce workgroup overhead for cases with small M and large batchA. It also adds WebGPU-specific regression tests for additional 3D batched MatMul shapes.

Changes:

  • WebGPU MatMul: reshape A/B and treat output as {1, batchA*M, N} when batchA != 1 && batchB == 1 (applies to both the generic and Intel subgroup paths).
  • Add WebGPU-only MatMul test cases covering 3D inputs with batchA=3, M=2 and N in {3,4}.
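
The reshape described above relies on a row-major tensor of shape [batchA, M, K] having the same memory layout as a matrix of shape [batchA*M, K]. The following CPU-only sketch checks that equivalence. The functions `MatMul2D`, `BatchedMatMul` and `MergedMatMul` are hypothetical names written for this illustration and are unrelated to the provider's shader code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major 2D MatMul: C[m, n] = A[m, k] * B[k, n].
std::vector<float> MatMul2D(const std::vector<float>& a, const std::vector<float>& b,
                            std::size_t m, std::size_t k, std::size_t n) {
  assert(a.size() == m * k && b.size() == k * n);
  std::vector<float> c(m * n, 0.0f);
  for (std::size_t i = 0; i < m; ++i)
    for (std::size_t p = 0; p < k; ++p)
      for (std::size_t j = 0; j < n; ++j)
        c[i * n + j] += a[i * k + p] * b[p * n + j];
  return c;
}

// Reference: A is [batch_a, m, k]; B is [k, n] and shared by every batch.
std::vector<float> BatchedMatMul(const std::vector<float>& a, const std::vector<float>& b,
                                 std::size_t batch_a, std::size_t m, std::size_t k,
                                 std::size_t n) {
  assert(a.size() == batch_a * m * k);
  std::vector<float> c;
  c.reserve(batch_a * m * n);
  for (std::size_t batch = 0; batch < batch_a; ++batch) {
    const auto first = a.begin() + static_cast<std::ptrdiff_t>(batch * m * k);
    const std::vector<float> a_slice(first, first + static_cast<std::ptrdiff_t>(m * k));
    const std::vector<float> c_slice = MatMul2D(a_slice, b, m, k, n);
    c.insert(c.end(), c_slice.begin(), c_slice.end());
  }
  return c;
}

// Merged form: reinterpret A as [batch_a * m, k]. No data is moved because
// both views share the same row-major layout.
std::vector<float> MergedMatMul(const std::vector<float>& a, const std::vector<float>& b,
                                std::size_t batch_a, std::size_t m, std::size_t k,
                                std::size_t n) {
  return MatMul2D(a, b, batch_a * m, k, n);
}
```

Both functions accumulate over K in the same order, so for a shape such as batchA=3, M=2, N=3 the two outputs should match element for element.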

Reviewed changes

Copilot reviewed 2 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| onnxruntime/test/providers/cpu/math/matmul_test.cc | Adds WebGPU-only test cases for 3D batched MatMul with batchA>1 and M>1. |
| onnxruntime/core/providers/webgpu/vendor/intel/math/matmul.cc | Extends the Intel subgroup MatMul reshape optimization from M==1 to all batchA!=1 && batchB==1 cases. |
| onnxruntime/core/providers/webgpu/math/matmul.cc | Extends the generic WebGPU MatMul reshape optimization similarly, flattening batch dims into M when batchB==1. |


Comment on lines +191 to 202
  // when B is a matrix (batch is 1), merge batchA into M dimension to improve performance for small M cases.
  if (batchA != 1 && batchB == 1) {
    // dimensions of A: [1,`batchA`, M, K]
    int64_t batchAndM = a_shape.SizeToDimension(a_shape.NumDimensions() - 1);
    TensorShapeVector dims_a = {1, batchAndM, helper.K()};
    // dimensions of B: [1,K,N]
    TensorShapeVector dims_b = {1, helper.K(), helper.N()};

    a_shape = TensorShape(dims_a);
    b_shape = TensorShape(dims_b);
-   output_shape = {1, batchA, helper.N()};
+   output_shape = {1, batchAndM, helper.N()};
  }
Copilot AI Apr 23, 2026

This reshape path is now enabled for all cases where batchA != 1 && batchB == 1, but the comment/PR motivation suggests it’s intended specifically for small-M scenarios. As written, it will also trigger for large-M workloads (e.g., other callers that reuse ComputeMatMul) and broadly changes dispatch geometry. Consider gating this with an explicit heuristic (e.g., helper.M() below a threshold and/or batchA above a threshold), or update the comment to clarify it’s intentionally unconditional.
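
A gate of the kind suggested here could look like the sketch below. The function `ShouldMergeBatchIntoM` and both threshold constants are hypothetical; the PR does not define them, and suitable values would have to come from benchmarking.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical thresholds, chosen only for illustration.
constexpr std::int64_t kMaxMForMerge = 32;
constexpr std::int64_t kMinBatchAForMerge = 2;

// Returns true when batchA should be folded into M: B must be unbatched,
// A must have enough batches to matter, and M must be small.
inline bool ShouldMergeBatchIntoM(std::int64_t batch_a, std::int64_t batch_b, std::int64_t m) {
  assert(batch_a >= 1 && batch_b >= 1 && m >= 0);
  return batch_b == 1 && batch_a >= kMinBatchAForMerge && m <= kMaxMForMerge;
}
```

With these assumed thresholds the merge applies to batchA=64, M=2 but is skipped for M=1024 or when B is batched. The unconditional `batchA != 1 && batchB == 1` check in the PR corresponds to dropping the bound on M.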

-   TensorShapeVector dims_a = {1, batchA, helper.K()};
+ // when B is a matrix (batch is 1), merge batchA into M dimension to improve performance for small M cases.
+ if (batchA != 1 && batchB == 1) {
+   // dimensions of A: [1,`batchA`, M, K]
Copilot AI Apr 23, 2026

The shape comment is inaccurate: dims_a is set to {1, batchAndM, K} (flattening all leading dims including M), not [1, batchA, M, K]. Please update the comment to reflect the actual reshape so it’s clear which dimensions are being merged.

Suggested change
