[ExecuTorch][WebGPU] 2D compute dispatch tests — prefill golden + fold unit test · Pull Request #20584 · pytorch/executorch · GitHub
[ExecuTorch][WebGPU] 2D compute dispatch tests — prefill golden + fold unit test#20584

Merged
meta-codesync[bot] merged 9 commits into gh/JulianCloudNTH/76/base from gh/JulianCloudNTH/76/head
Jul 4, 2026
@ghost ghost commented Jun 28, 2026

Stack from ghstack (oldest at bottom):

Test coverage for the 2D dispatch fold, stacked above the cap-lift op.

Problem: The 2D fold is load-bearing index math — a wrong {x, y} means out-of-bounds writes or dropped threads — and the prefill shapes that exercise it previously threw at the 1D cap, so they were untested.
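
To make the two failure modes concrete, here is a minimal sketch (not code from this PR) that classifies a candidate `{x, y}` grid against a workgroup count. It assumes the kernel recovers a linear workgroup id as `gy * x + gx`, which the PR description does not spell out. `GridCoverage`, `classify_grid`, and `grid_coverage_demo` are hypothetical names, and the count 131072 is an illustrative value rather than one quoted from the PR.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper for illustration only; not part of ExecuTorch.
struct GridCoverage {
  uint64_t covered;  // linear ids that land inside [0, count)
  uint64_t overrun;  // linear ids >= count; the kernel must guard these off
  uint64_t dropped;  // indices in [0, count) that no workgroup reaches
};

GridCoverage classify_grid(uint32_t x, uint32_t y, uint64_t count) {
  // Assumed mapping: linear = gy * x + gx, a bijection onto [0, x * y).
  const uint64_t launched = static_cast<uint64_t>(x) * y;
  GridCoverage r{};
  r.covered = launched < count ? launched : count;
  r.overrun = launched > count ? launched - count : 0;
  r.dropped = count > launched ? count - launched : 0;
  return r;
}

void grid_coverage_demo() {
  // Undersized grid: two workgroups of work are silently dropped.
  assert(classify_grid(65535, 2, 131072).dropped == 2);
  // Sufficient grid: nothing dropped, but the tail needs a bounds guard.
  assert(classify_grid(65535, 3, 131072).dropped == 0);
  assert(classify_grid(65535, 3, 131072).overrun == 65533);
}
```

Under this mapping, an undersized grid shows up as dropped work, while an oversized grid without an `index < count` guard in the shader shows up as out-of-bounds writes.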

Solution: A device-free unit test for the fold arithmetic, plus two single-shot prefill SDPA golden configs that fold each kernel family.

  • Before: no coverage for >65535-workgroup dispatch; llama1b_prefill_512/_2048 shapes threw at the cap
  • After: fold_workgroup_count_2d unit-tested at the cap boundaries, and the two prefill shapes run as goldens

Implementation:

  • test/native/test_dispatch_2d.cpp — device-free unit test for utils::fold_workgroup_count_2d: the 1D fast path, the 2D fold, the real Llama-1B QK counts at S=512 ({65535, 3}) and S=2048 ({65535, 33}), and the needs-3rd-dimension throw; asserts each {x, y} covers [0, count)
  • llama1b_prefill_512 + llama1b_prefill_2048 configs appended to the byte-mirrored CONFIGS (test_sdpa.py) and kSdpaConfigs (test_webgpu_native.cpp)
  • Registers webgpu_dispatch_2d_test in CMake + the native CI script
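
The cases listed above can be sketched as a standalone function. This is a guess at the shape of the fold, not the actual `utils::fold_workgroup_count_2d`: it assumes `x` is pinned to the 65535 cap once the count exceeds it, that `y` is the ceiling of `count / 65535`, and that `std::length_error` is thrown when a third dimension would be needed. The names `fold_2d_sketch`, `Fold2D`, and `fold_throws` are hypothetical, and the counts 131072 and 2097152 are illustrative values chosen because they yield the `{65535, 3}` and `{65535, 33}` folds quoted above; the PR description does not state the real counts.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Per-dimension workgroup cap referenced in the PR description.
constexpr uint32_t kMaxPerDim = 65535;

struct Fold2D {
  uint32_t x;
  uint32_t y;
};

// Hypothetical sketch of a 1D-to-2D fold; the real signature may differ.
Fold2D fold_2d_sketch(uint64_t count) {
  if (count <= kMaxPerDim) {
    return {static_cast<uint32_t>(count), 1};  // 1D fast path
  }
  const uint64_t y = (count + kMaxPerDim - 1) / kMaxPerDim;  // ceil division
  if (y > kMaxPerDim) {
    throw std::length_error("workgroup count needs a third dimension");
  }
  return {kMaxPerDim, static_cast<uint32_t>(y)};
}

// True when folding `count` throws, i.e. the needs-3rd-dimension case.
bool fold_throws(uint64_t count) {
  try {
    fold_2d_sketch(count);
  } catch (const std::length_error&) {
    return true;
  }
  return false;
}

void fold_demo() {
  const Fold2D f = fold_2d_sketch(131072);
  assert(f.x == 65535 && f.y == 3);
  // Coverage check: the grid must launch at least `count` workgroups.
  assert(static_cast<uint64_t>(f.x) * f.y >= 131072);
}
```

The coverage assertion in `fold_demo` reflects the property the unit test is described as checking: the `x * y` grid reaches every index in `[0, count)`, with any excess guarded off in the kernel.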

Constraints:

  • The Python/C++ config entries byte-mirror each other (kept in sync)
  • add shares the element-form path with QK, so it is covered structurally; a dedicated >16M-element add fold case is omitted as disproportionate

Co-authored-with: Claude Code.
@exported-using-ghexport

Differential Revision: D109517683

[ghstack-poisoned]
@ghost ghost requested review from kirklandsign and larryliu0820 as code owners June 28, 2026 16:22
pytorch-bot Bot commented Jun 28, 2026

@github-actions

This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 28, 2026
ghost commented Jun 29, 2026

Author

@claude review and check for any areas or opportunities for modularization

claude Bot commented Jun 29, 2026

[ghstack-poisoned]
@ghost ghost temporarily deployed to cadence June 29, 2026 22:10 — with GitHub Actions Inactive
@ghost ghost temporarily deployed to cadence June 29, 2026 22:10 — with GitHub Actions Inactive
ghost pushed a commit that referenced this pull request Jun 29, 2026
…d unit test

Pull Request resolved: #20584

**Test coverage for the 2D dispatch fold, stacked above the cap-lift op.**

**Problem**: The 2D fold is load-bearing index math — a wrong `{x, y}` means out-of-bounds writes or dropped threads — and the prefill shapes that exercise it previously threw at the 1D cap, so they were untested.

**Solution**: A device-free unit test for the fold arithmetic, plus two single-shot prefill SDPA golden configs that fold each kernel family.
- **Before**: no coverage for >65535-workgroup dispatch; `llama1b_prefill_512`/`_2048` shapes threw at the cap
- **After**: `fold_workgroup_count_2d` unit-tested at the cap boundaries, and the two prefill shapes run as goldens

**Implementation**:
- `test/native/test_dispatch_2d.cpp` — device-free unit test for `utils::fold_workgroup_count_2d`: the 1D fast path, the 2D fold, the real Llama-1B QK counts at S=512 (`{65535, 3}`) and S=2048 (`{65535, 33}`), and the needs-3rd-dimension throw; asserts each `{x, y}` covers `[0, count)`
- `llama1b_prefill_512` + `llama1b_prefill_2048` configs appended to the byte-mirrored `CONFIGS` (`test_sdpa.py`) and `kSdpaConfigs` (`test_webgpu_native.cpp`)
- Registers `webgpu_dispatch_2d_test` in CMake + the native CI script

**Constraints**:
- The Python/C++ config entries byte-mirror each other (kept in sync)
- `add` shares the element-form path with QK, so it is covered structurally; a dedicated >16M-element `add` fold case is omitted as disproportionate

Co-authored-with: Claude Code.
ghstack-source-id: 398258612
@exported-using-ghexport

Differential Revision: [D109517683](https://our.internmc.facebook.com/intern/diff/D109517683/)
[ghstack-poisoned]
ghost pushed a commit that referenced this pull request Jun 30, 2026
…d unit test

Pull Request resolved: #20584

**Test coverage for the 2D dispatch fold, stacked above the cap-lift op.**

**Problem**: The 2D fold is load-bearing index math — a wrong `{x, y}` means out-of-bounds writes or dropped threads — and the prefill shapes that exercise it previously threw at the 1D cap, so they were untested.

**Solution**: A device-free unit test for the fold arithmetic, plus two single-shot prefill SDPA golden configs that fold each kernel family.
- **Before**: no coverage for >65535-workgroup dispatch; `llama1b_prefill_512`/`_2048` shapes threw at the cap
- **After**: `fold_workgroup_count_2d` unit-tested at the cap boundaries, and the two prefill shapes run as goldens

**Implementation**:
- `test/native/test_dispatch_2d.cpp` — device-free unit test for `utils::fold_workgroup_count_2d`: the 1D fast path, the 2D fold, the real Llama-1B QK counts at S=512 (`{65535, 3}`) and S=2048 (`{65535, 33}`), and the needs-3rd-dimension throw; asserts each `{x, y}` covers `[0, count)`
- `llama1b_prefill_512` + `llama1b_prefill_2048` configs appended to the byte-mirrored `CONFIGS` (`test_sdpa.py`) and `kSdpaConfigs` (`test_webgpu_native.cpp`)
- Registers `webgpu_dispatch_2d_test` in CMake + the native CI script

**Constraints**:
- The Python/C++ config entries byte-mirror each other (kept in sync)
- `add` shares the element-form path with QK, so it is covered structurally; a dedicated >16M-element `add` fold case is omitted as disproportionate

Co-authored-with: Claude Code.
ghstack-source-id: 398355257
@exported-using-ghexport

Differential Revision: [D109517683](https://our.internmc.facebook.com/intern/diff/D109517683/)
@ghost ghost temporarily deployed to cadence June 30, 2026 02:46 — with GitHub Actions Inactive
@ghost ghost temporarily deployed to cadence June 30, 2026 02:46 — with GitHub Actions Inactive
[ghstack-poisoned]
@ghost ghost temporarily deployed to cadence July 3, 2026 20:28 — with GitHub Actions Inactive
[ghstack-poisoned]
@ghost ghost temporarily deployed to cadence July 3, 2026 20:52 — with GitHub Actions Inactive
@ghost ghost temporarily deployed to cadence July 3, 2026 20:52 — with GitHub Actions Inactive
[ghstack-poisoned]
@ghost ghost temporarily deployed to cadence July 3, 2026 21:37 — with GitHub Actions Inactive
@ghost ghost temporarily deployed to cadence July 3, 2026 21:37 — with GitHub Actions Inactive
@ghost ghost temporarily deployed to cadence July 3, 2026 22:06 — with GitHub Actions Inactive
@meta-codesync meta-codesync Bot merged commit 5fc1924 into gh/JulianCloudNTH/76/base Jul 4, 2026
179 of 183 checks passed
@meta-codesync meta-codesync Bot deleted the gh/JulianCloudNTH/76/head branch July 4, 2026 17:06
@meta-codesync meta-codesync Bot temporarily deployed to cherry-pick-bot July 4, 2026 17:06 Inactive
ghost pushed a commit that referenced this pull request Jul 4, 2026
…d unit test

Pull Request resolved: #20584

**Test coverage for the 2D dispatch fold, stacked above the cap-lift op.**

**Problem**: The 2D fold is load-bearing index math — a wrong `{x, y}` means out-of-bounds writes or dropped threads — and the prefill shapes that exercise it previously threw at the 1D cap, so they were untested.

**Solution**: A device-free unit test for the fold arithmetic, plus two single-shot prefill SDPA golden configs that fold each kernel family.
- **Before**: no coverage for >65535-workgroup dispatch; `llama1b_prefill_512`/`_2048` shapes threw at the cap
- **After**: `fold_workgroup_count_2d` unit-tested at the cap boundaries, and the two prefill shapes run as goldens

**Implementation**:
- `test/native/test_dispatch_2d.cpp` — device-free unit test for `utils::fold_workgroup_count_2d`: the 1D fast path, the 2D fold, the real Llama-1B QK counts at S=512 (`{65535, 3}`) and S=2048 (`{65535, 33}`), and the needs-3rd-dimension throw; asserts each `{x, y}` covers `[0, count)`
- `llama1b_prefill_512` + `llama1b_prefill_2048` configs appended to the byte-mirrored `CONFIGS` (`test_sdpa.py`) and `kSdpaConfigs` (`test_webgpu_native.cpp`)
- Registers `webgpu_dispatch_2d_test` in CMake + the native CI script

**Constraints**:
- The Python/C++ config entries byte-mirror each other (kept in sync)
- `add` shares the element-form path with QK, so it is covered structurally; a dedicated >16M-element `add` fold case is omitted as disproportionate

Co-authored-with: Claude Code.
ghstack-source-id: 399812923
@exported-using-ghexport

Differential Revision: [D109517683](https://our.internmc.facebook.com/intern/diff/D109517683/)
Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. meta-exported

3 participants