
JulianCloudNTH · 2026-06-09T21:16:24Z

Stack from ghstack (oldest at bottom):

-> [ExecuTorch][WebGPU] GPU timestamp query profiling for SDPA #20167
[ExecuTorch][WebGPU] SDPA test suite: replay + dynamic input_pos + in-graph KV cache #20087
[ExecuTorch][WebGPU] Add fused SDPA (sdpa_with_kv_cache) with dynamic input_pos #20086
[ExecuTorch][WebGPU] GPU timestamp query profiling (general implementation) #20201

SDPA-specific instrumentation layered on the general GPU-timestamp infrastructure (companion diff below): tag each fused SDPA dispatch with its kernel_name so the WebGPUQueryPool can attribute on-GPU time to the attention stage that produced it. sdpa_with_kv_cache runs four chained dispatches — update_cache -> QK (attn_weights) -> softmax -> AV (compute_out); WebGPUGraph::execute() brackets each compute pass with a timestamp when the pool is active, and this diff labels each dispatch so the per-pass durations map back to the right stage. Opt-in via the WEBGPU_TIMESTAMP_QUERY env var; off by default, so the production execute() path is byte-identical. This is the per-kernel hook a forthcoming SDPA kernel benchmark will read; the benchmark itself (and any comparative numbers) is a separate follow-up.
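The attribution step described above can be sketched without any WebGPU types. This is a minimal illustration, not the backend's code: `Dispatch`, `attribute_durations`, and the flat begin/end timestamp layout are hypothetical, and the label strings are illustrative. It assumes the resolved query buffer holds one (begin, end) pair per compute pass, already in nanoseconds, in dispatch order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for one recorded dispatch; only the label matters here.
struct Dispatch {
  std::string kernel_name;
};

// Sum on-GPU time per label. `timestamps` is assumed to hold one
// (begin, end) pair per dispatch, in nanoseconds, in dispatch order.
std::map<std::string, uint64_t> attribute_durations(
    const std::vector<Dispatch>& dispatches,
    const std::vector<uint64_t>& timestamps) {
  assert(timestamps.size() == 2 * dispatches.size());
  std::map<std::string, uint64_t> ns_by_kernel;
  for (std::size_t i = 0; i < dispatches.size(); ++i) {
    const uint64_t begin = timestamps[2 * i];
    const uint64_t end = timestamps[2 * i + 1];
    // Clamp to zero rather than wrapping if a pair is not monotonic.
    const uint64_t duration = end >= begin ? end - begin : 0;
    ns_by_kernel[dispatches[i].kernel_name] += duration;
  }
  return ns_by_kernel;
}
```

With four dispatches labeled for the cache update, QK, softmax, and AV stages, the returned map has one entry per stage. If every dispatch carried the same label, all four durations would collapse into a single entry, which is why the per-dispatch labels matter.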

Co-authored with Claude.
@exported-using-ghexport

Differential Revision: D107678235

[ghstack-poisoned]

pytorch-bot · 2026-06-09T21:16:27Z

github-actions · 2026-06-09T21:17:12Z

Pull Request resolved: #20167 Add a faithful re-port of Vulkan's `vkapi::QueryPool` (`backends/vulkan/runtime/vk_api/QueryPool.{h,cpp}`) so a bench can read true on-GPU per-kernel time, isolated from submit/readback latency — the basis for comparing the WGSL SDPA kernels against the Vulkan reference. Opt-in via the `WEBGPU_TIMESTAMP_QUERY` env var; off by default, so the production `execute()` path is byte-identical. `WebGPUQueryPool` mirrors the Vulkan `ShaderDuration` data model and the ticks->ns conversion exactly. Three deviations are forced by the WebGPU API (not unforced divergences): per-dispatch bracketing uses a compute-pass `timestampWrites` descriptor (begin/end-of-pass) since WebGPU has no mid-encoder `writeTimestamp`; results are read via `resolveQuerySet` + buffer map (no host-side `vkGetQueryPoolResults`); and the `TimestampQuery` capability is requested as an explicit device feature (fail-open if the adapter lacks it). `WebGPUGraph::execute()` brackets each compute pass when the pool is active; chained `update_cache`/QK/softmax/AV dispatches carry a `kernel_name` label for attribution. Co-authored-with Claude. ghstack-source-id: 391669549 @exported-using-ghexport Differential Revision: [D107678235](https://our.internmc.facebook.com/intern/diff/D107678235/)
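The opt-in gate and the ticks-to-nanoseconds step mentioned in this commit message could look roughly like the following. This is a sketch under stated assumptions, not the `WebGPUQueryPool` implementation: `DurationRecord`, `timestamp_query_enabled`, and `make_record` are hypothetical names, the set of accepted env var values is a guess, and the scale factor is left as a parameter because the description does not say what value the pool uses.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <string>
#include <utility>

// Hypothetical record, loosely modeled on the idea of a per-shader duration:
// one labeled dispatch with its begin/end time in nanoseconds.
struct DurationRecord {
  std::string kernel_name;
  uint64_t start_ns;
  uint64_t end_ns;
  uint64_t duration_ns;
};

// Opt-in check. Assumption: any non-empty value other than "0" enables
// profiling; the real backend may accept a different set of values.
bool timestamp_query_enabled() {
  const char* value = std::getenv("WEBGPU_TIMESTAMP_QUERY");
  return value != nullptr && value[0] != '\0' && std::string(value) != "0";
}

// Convert a raw (begin, end) tick pair to nanoseconds. `ns_per_tick` is an
// assumed scale factor supplied by the caller.
DurationRecord make_record(
    std::string kernel_name,
    uint64_t begin_ticks,
    uint64_t end_ticks,
    double ns_per_tick) {
  assert(end_ticks >= begin_ticks);
  const auto start_ns = static_cast<uint64_t>(begin_ticks * ns_per_tick);
  const auto end_ns = static_cast<uint64_t>(end_ticks * ns_per_tick);
  return DurationRecord{
      std::move(kernel_name), start_ns, end_ns, end_ns - start_ns};
}
```

A caller would check `timestamp_query_enabled()` once at setup and skip all query work when it returns false, which is the property that keeps the default path unchanged.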

[ghstack-poisoned]

Pull Request resolved: #20167 Add a faithful re-port of Vulkan's `vkapi::QueryPool` (`backends/vulkan/runtime/vk_api/QueryPool.{h,cpp}`) so a bench can read true on-GPU per-kernel time, isolated from submit/readback latency — the basis for comparing the WGSL SDPA kernels against the Vulkan reference. Opt-in via the `WEBGPU_TIMESTAMP_QUERY` env var; off by default, so the production `execute()` path is byte-identical. `WebGPUQueryPool` mirrors the Vulkan `ShaderDuration` data model and the ticks->ns conversion exactly. Three deviations are forced by the WebGPU API (not unforced divergences): per-dispatch bracketing uses a compute-pass `timestampWrites` descriptor (begin/end-of-pass) since WebGPU has no mid-encoder `writeTimestamp`; results are read via `resolveQuerySet` + buffer map (no host-side `vkGetQueryPoolResults`); and the `TimestampQuery` capability is requested as an explicit device feature (fail-open if the adapter lacks it). `WebGPUGraph::execute()` brackets each compute pass when the pool is active; chained `update_cache`/QK/softmax/AV dispatches carry a `kernel_name` label for attribution. Co-authored-with Claude. ghstack-source-id: 391741952 @exported-using-ghexport Differential Revision: [D107678235](https://our.internmc.facebook.com/intern/diff/D107678235/)

Pull Request resolved: #20167 Add a faithful re-port of Vulkan's `vkapi::QueryPool` (`backends/vulkan/runtime/vk_api/QueryPool.{h,cpp}`) so a bench can read true on-GPU per-kernel time, isolated from submit/readback latency — the basis for comparing the WGSL SDPA kernels against the Vulkan reference. Opt-in via the `WEBGPU_TIMESTAMP_QUERY` env var; off by default, so the production `execute()` path is byte-identical. `WebGPUQueryPool` mirrors the Vulkan `ShaderDuration` data model and the ticks->ns conversion exactly. Three deviations are forced by the WebGPU API (not unforced divergences): per-dispatch bracketing uses a compute-pass `timestampWrites` descriptor (begin/end-of-pass) since WebGPU has no mid-encoder `writeTimestamp`; results are read via `resolveQuerySet` + buffer map (no host-side `vkGetQueryPoolResults`); and the `TimestampQuery` capability is requested as an explicit device feature (fail-open if the adapter lacks it). `WebGPUGraph::execute()` brackets each compute pass when the pool is active; chained `update_cache`/QK/softmax/AV dispatches carry a `kernel_name` label for attribution. Co-authored-with Claude. ghstack-source-id: 391801048 @exported-using-ghexport Differential Revision: [D107678235](https://our.internmc.facebook.com/intern/diff/D107678235/)

[ghstack-poisoned]

Pull Request resolved: #20167 Add a faithful re-port of Vulkan's `vkapi::QueryPool` (`backends/vulkan/runtime/vk_api/QueryPool.{h,cpp}`) so a bench can read true on-GPU per-kernel time, isolated from submit/readback latency — the basis for comparing the WGSL SDPA kernels against the Vulkan reference. Opt-in via the `WEBGPU_TIMESTAMP_QUERY` env var; off by default, so the production `execute()` path is byte-identical. `WebGPUQueryPool` mirrors the Vulkan `ShaderDuration` data model and the ticks->ns conversion exactly. Three deviations are forced by the WebGPU API (not unforced divergences): per-dispatch bracketing uses a compute-pass `timestampWrites` descriptor (begin/end-of-pass) since WebGPU has no mid-encoder `writeTimestamp`; results are read via `resolveQuerySet` + buffer map (no host-side `vkGetQueryPoolResults`); and the `TimestampQuery` capability is requested as an explicit device feature (fail-open if the adapter lacks it). `WebGPUGraph::execute()` brackets each compute pass when the pool is active; chained `update_cache`/QK/softmax/AV dispatches carry a `kernel_name` label for attribution. Co-authored-with Claude. ghstack-source-id: 392065610 @exported-using-ghexport Differential Revision: [D107678235](https://our.internmc.facebook.com/intern/diff/D107678235/)

[ghstack-poisoned]

Pull Request resolved: #20167 SDPA-specific instrumentation layered on the general GPU-timestamp infrastructure (companion diff below): tag each fused SDPA dispatch with its `kernel_name` so the `WebGPUQueryPool` can attribute on-GPU time to the attention stage that produced it. `sdpa_with_kv_cache` runs four chained dispatches — `update_cache` -> QK (`attn_weights`) -> softmax -> AV (`compute_out`); `WebGPUGraph::execute()` brackets each compute pass with a timestamp when the pool is active, and this diff labels each dispatch so the per-pass durations map back to the right stage. Opt-in via the `WEBGPU_TIMESTAMP_QUERY` env var; off by default, so the production `execute()` path is byte-identical. This is the per-kernel hook a forthcoming SDPA kernel benchmark will read; the benchmark itself (and any comparative numbers) is a separate follow-up. Co-authored with Claude. ghstack-source-id: 392093463 @exported-using-ghexport Differential Revision: [D107678235](https://our.internmc.facebook.com/intern/diff/D107678235/)

[ghstack-poisoned]


The register-tile change (pytorch#20507) rewrote the `update_cache`, QK (`sdpa_compute_attn_weights`), `sdpa_softmax`, and AV (`sdpa_compute_out`) `build_dispatch` call sites and dropped the per-dispatch `kernel_name` labels originally added in pytorch#20167. With the labels gone, `WEBGPU_TIMESTAMP_QUERY` profiling can no longer attribute on-GPU time to the attention stage that produced it (every dispatch reports as the default "dispatch"). This re-threads `kernel_name` through `build_dispatch` (defaulted to `""`, so all other callers are unaffected) into the existing `WebGPUDispatch::kernel_name` field that `WebGPUQueryPool` already reads, and re-applies the four SDPA stage labels. No behavior change when profiling is off; the production `execute()` path is byte-identical.
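The defaulted-parameter approach described above can be illustrated in isolation. The signatures below are hypothetical simplifications (the real `build_dispatch` takes more arguments, which are not shown in this page); the sketch only demonstrates why a trailing `kernel_name = ""` leaves existing callers untouched while unlabeled dispatches still report as "dispatch".

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical, reduced dispatch record: a shader identifier plus the label
// that the profiler reads. An empty label means none was supplied.
struct Dispatch {
  std::string shader;
  std::string kernel_name;
};

// Simplified stand-in for build_dispatch. The trailing parameter defaults to
// "", so call sites that never passed a label compile and behave as before.
Dispatch build_dispatch(std::string shader, std::string kernel_name = "") {
  assert(!shader.empty());
  return Dispatch{std::move(shader), std::move(kernel_name)};
}

// Label used when reporting a duration; unlabeled dispatches fall back to
// the generic "dispatch" name.
std::string report_label(const Dispatch& d) {
  return d.kernel_name.empty() ? std::string("dispatch") : d.kernel_name;
}
```

For example, `build_dispatch("sdpa_softmax")` reports as "dispatch", while `build_dispatch("sdpa_softmax", "softmax")` reports as "softmax". The label strings here are illustrative.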

Update

25c045f

[ghstack-poisoned]

JulianCloudNTH requested review from kirklandsign and larryliu0820 as code owners June 9, 2026 21:16

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 9, 2026

Update

8a99e7a

[ghstack-poisoned]

meta-codesync Bot added the meta-exported label Jun 10, 2026

Update

5beb63e

[ghstack-poisoned]

JulianCloudNTH mentioned this pull request Jun 10, 2026

[ExecuTorch][WebGPU] GPU timestamp query profiling (general implementation) #20201

Merged

JulianCloudNTH changed the title ~~[ExecuTorch][WebGPU] Add GPU timestamp-query profiling (WebGPUQueryPool)~~ [ExecuTorch][WebGPU] GPU timestamp query profiling for SDPA Jun 10, 2026

Update

efb6b7f

[ghstack-poisoned]

Update

0103656

[ghstack-poisoned]

Update

ebf063e

[ghstack-poisoned]

Update

f5f0ffc

[ghstack-poisoned]

This was referenced Jun 11, 2026

[ExecuTorch][WebGPU] Add 4-bit weight-only quantized linear (et_vk.linear_q4gsw) #20226

Merged

[ExecuTorch][WebGPU] linear_q4gsw test suite: Llama-1B shapes + 4k/8k sweep #20227

Merged

psiddh approved these changes Jun 13, 2026

View reviewed changes

meta-codesync Bot merged commit 373a50b into gh/JulianCloudNTH/21/base Jun 13, 2026
178 of 180 checks passed

meta-codesync Bot deleted the gh/JulianCloudNTH/21/head branch June 13, 2026 01:59

meta-codesync Bot had a problem deploying to cherry-pick-bot June 13, 2026 01:59 Failure

meta-codesync Bot had a problem deploying to cherry-pick-bot June 13, 2026 07:17 Failure

JulianCloudNTH restored the gh/JulianCloudNTH/21/head branch June 13, 2026 07:25

JulianCloudNTH deleted the gh/JulianCloudNTH/21/head branch June 13, 2026 07:26

JulianCloudNTH mentioned this pull request Jun 26, 2026

[ExecuTorch][WebGPU] Restore SDPA per-dispatch kernel_name labels #20551

Merged


[ExecuTorch][WebGPU] GPU timestamp query profiling for SDPA #20167
meta-codesync[bot] merged 7 commits into gh/JulianCloudNTH/21/base from gh/JulianCloudNTH/21/head


pytorch-bot Bot commented Jun 9, 2026 (edited)

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20167

❗ 1 Active SEVs

❌ 1 New Failure, 1 Unrelated Failure


github-actions Bot commented Jun 9, 2026

This PR needs a `release notes:` label
