SS-JIA · 2026-06-30T21:25:07Z

Stack from ghstack (oldest at bottom):

-> [ET-VK][conv2d] Cap im2col scratch memory via output-height tiling #20653

The im2col + GEMM conv2d path materializes an im2col scratch tensor of size M * K_total * elem (M = N * H_out * W_out, K_total = K_h * K_w * align_up_4(C_in)) as a single shared tensor, allocated during graph build and resident for the model's lifetime. For full-resolution convolutions this scratch is very large -- a 64-channel 3x3 conv at 256x256 in FP32 materializes ~144 MB, and at 512x512 in FP16 ~288 MB. On memory-constrained mobile GPUs, where GPU allocations come from unified, non-reclaimable system memory, this can nearly double peak process memory and trigger the OS low-memory killer.
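The two figures above can be reproduced with a small compile-time check. This is only a sketch of the size formula from the description; it assumes batch size 1, C_in = 64, and an output resolution equal to the stated resolution. The helper names are illustrative and are not taken from Conv2dGemm.cpp.

```cpp
#include <cstdint>

// Round x up to the next multiple of 4, as in align_up_4(C_in).
constexpr int64_t align_up_4(int64_t x) {
  return (x + 3) / 4 * 4;
}

// Logical size in bytes of the untiled im2col scratch:
//   M * K_total * elem
// with M = N * H_out * W_out and K_total = K_h * K_w * align_up_4(C_in).
constexpr int64_t im2col_scratch_bytes(
    int64_t N, int64_t H_out, int64_t W_out,
    int64_t K_h, int64_t K_w, int64_t C_in, int64_t elem_size) {
  return (N * H_out * W_out) * (K_h * K_w * align_up_4(C_in)) * elem_size;
}

constexpr int64_t kMiB = 1024 * 1024;

// Assumed shapes: batch 1, 64 input channels, 3x3 kernel.
static_assert(
    im2col_scratch_bytes(1, 256, 256, 3, 3, 64, /*elem_size=*/4) == 144 * kMiB,
    "256x256 FP32 scratch is 144 MiB");
static_assert(
    im2col_scratch_bytes(1, 512, 512, 3, 3, 64, /*elem_size=*/2) == 288 * kMiB,
    "512x512 FP16 scratch is 288 MiB");
```

Both cases come out to exact multiples of 1 MiB, so the "~144 MB" and "~288 MB" in the description correspond to binary megabytes under these assumptions.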

This change tiles the im2col + GEMM over output-height rows to a fixed scratch budget (kIm2colScratchBudgetBytes, 16 MB). A single scratch tensor sized to oh_tile output rows is reused across tiles, with an oh_offset selecting the live row window per tile. The GEMM inner loop is byte-identical, so the GEMM-based speedup is preserved; scratch becomes O(budget) instead of O(M * K_total), making it resolution-independent. Tiling along output-height (rather than flattened M) keeps the row -> (oh, ow) decode exact for all three storage variants (buffer, texture2d, texture3d). The fixed per-build tile count is safe because tensors are built at the dynamic upper bound, so runtime shapes only shrink and trailing tiles no-op via the shader's oh < H_out guard. oh_tile reaches the resize callbacks as a raw int packed into the resize_args slot (read via static_cast, not get_int) to avoid materializing a graph Value for a build-time constant. The direct-conv fallback for small shapes is unchanged.
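As a rough illustration of the tiling arithmetic, the sketch below derives oh_tile and the tile count from a byte budget. The clamping and rounding rules are assumptions made for the example (the description does not spell them out), batch size is taken to be 1, and `TilePlan`, `plan_tiles`, and `tile_oh_offset` are hypothetical names rather than the ones used in the PR.

```cpp
#include <cstdint>

// Mirrors the 16 MB value quoted for kIm2colScratchBudgetBytes.
constexpr int64_t kBudgetBytes = 16 * 1024 * 1024;

struct TilePlan {
  int64_t oh_tile;    // output rows covered by one tile
  int64_t num_tiles;  // fixed at graph-build time
};

constexpr int64_t div_up(int64_t a, int64_t b) {
  return (a + b - 1) / b;
}

// Assumed policy: as many whole output rows as fit in the budget,
// clamped to the range [1, H_out].
constexpr TilePlan plan_tiles(
    int64_t H_out, int64_t W_out, int64_t K_total, int64_t elem_size) {
  const int64_t bytes_per_h_row = W_out * K_total * elem_size;
  int64_t oh_tile = kBudgetBytes / bytes_per_h_row;
  if (oh_tile < 1) {
    oh_tile = 1;
  }
  if (oh_tile > H_out) {
    oh_tile = H_out;
  }
  return TilePlan{oh_tile, div_up(H_out, oh_tile)};
}

// First output row handled by tile t.
constexpr int64_t tile_oh_offset(const TilePlan& plan, int64_t t) {
  return t * plan.oh_tile;
}

// 256x256 FP32 output with K_total = 3 * 3 * 64 = 576.
constexpr TilePlan kPlan = plan_tiles(256, 256, 576, 4);
static_assert(kPlan.oh_tile == 28, "28 output rows fit in 16 MiB");
static_assert(kPlan.num_tiles == 10, "ceil(256 / 28) tiles");
static_assert(tile_oh_offset(kPlan, 9) == 252, "last tile starts at row 252");
```

Under this policy the last tile spans rows 252 to 279 but only rows 252 to 255 exist, which is the situation the shader's oh < H_out guard handles.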

Differential Revision: D110231992

[ghstack-poisoned]

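pytorch-bot · 2026-06-30T21:25:11Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20653

✅ No Failures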

github-actions · 2026-06-30T21:27:04Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

metascroy · 2026-06-30T21:33:17Z

@claude review this code

claude · 2026-06-30T21:34:10Z

Claude finished @metascroy's task in 6m 51s

Code Review: Cap im2col scratch via output-height tiling

Reviewed the GLSL shaders, the Conv2dGemm / Conv2dIm2Col impl + headers, and the new dynamic test against origin/gh/SS-JIA/566/base.

Overall this is a clean, well-reasoned change. The tiling math is correct, the output rows written across tiles are disjoint (so bias is applied exactly once per element), and the shader guards (oh < H_out on both the im2col source fetch and the GEMM store) correctly no-op trailing/dead rows. The documentation is unusually thorough and the new resize test exercises the genuinely tricky cases (crossing tile boundaries down to a single active tile and back up). Notes below, mostly minor.

1. Storage selection still uses the full `M` / `H_out` — now over-conservative (perf, not correctness)

In conv2d_gemm_impl, the fits_2d / fits_3d checks (Conv2dGemm.cpp:367-371) run before oh_tile is computed (line 392) and gate on the full untiled extents:

const bool fits_2d = ... && safe_downcast<uint32_t>(M) <= max_2d;          // M = H_out * W_out
const bool fits_3d = ... && safe_downcast<uint32_t>(H_out) <= max_3d && ...;

But the scratch is now only oh_tile * W_out rows (tex2d) / oh_tile deep (tex3d), both bounded by the 16 MB budget. So a high-resolution conv whose full M exceeds max_texture2d_dim will still fall back to texture3d/buffer even though the tiled scratch would comfortably fit texture2d. It's safe (the selected extent is always ≤ the checked extent), just a missed opportunity. It also makes the "Used as a fallback when M would exceed max_texture2d_dim" rationale in the shader/impl comments somewhat stale, since the tiled M rarely exceeds it anymore. Reordering so that oh_tile is computed first and the fits-checks use oh_tile * W_out (texture2d) and oh_tile (texture3d) would let texture2d be chosen in more cases.
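To make the gap concrete, here is a minimal sketch comparing the untiled and tiled row extents against a 2D texture limit. It models only the row-extent part of the check (the other conditions elided in the snippet above are not reproduced), and the limit of 16384 and the oh_tile of 28 are assumed values for illustration, not numbers taken from the PR.

```cpp
#include <cstdint>

// Hypothetical device limit; the real value comes from the device's reported
// maximum 2D image dimension.
constexpr uint32_t kMax2dDim = 16384;

// Row-extent part of a fits_2d style check only.
constexpr bool rows_fit_2d(int64_t rows, uint32_t max_2d) {
  return rows >= 0 && rows <= static_cast<int64_t>(max_2d);
}

// 512x512 output, tiled to 28 output rows per tile (assumed).
constexpr int64_t kHOut = 512;
constexpr int64_t kWOut = 512;
constexpr int64_t kOhTile = 28;

// Untiled: M = H_out * W_out = 262144 rows, far above the limit.
static_assert(!rows_fit_2d(kHOut * kWOut, kMax2dDim), "full M does not fit");
// Tiled: oh_tile * W_out = 14336 rows, which fits.
static_assert(rows_fit_2d(kOhTile * kWOut, kMax2dDim), "tiled extent fits");
```

With these numbers the current ordering rejects texture2d while the tiled extent would have been accepted.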

2. Dead trailing tiles under dynamic down-resize still dispatch full GEMM work (perf)

num_tiles and the per-tile global WG size are fixed at the build-time (max) H_out. When a runtime shape shrinks H_out enough that whole tiles are dead (e.g. build at 128 → run at 64, so 1 of 3 tiles is entirely past H_out), that tile's GEMM still dispatches div_up(oh_tile*W, 4) workgroups and runs the full K-loop matmul — only to early-out at the store guard. The im2col for those tiles also runs (writing zeros). Correctness is fine; it's wasted GPU work proportional to the dead-tile count when shapes shrink substantially. Acceptable given the static-node-count constraint, but worth a one-line note in the dispatch-loop comment so it isn't mistaken for free.
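The dead-tile count for a given runtime shape can be estimated as below. This is a sketch that assumes tile t covers output rows [t * oh_tile, (t + 1) * oh_tile); the oh_tile of 43 is one value consistent with "3 tiles at H_out = 128" and is not taken from the PR.

```cpp
#include <cstdint>

constexpr int64_t div_up(int64_t a, int64_t b) {
  return (a + b - 1) / b;
}

// Tiles that still contain at least one row below the runtime H_out.
constexpr int64_t live_tiles(
    int64_t num_tiles, int64_t oh_tile, int64_t H_out_runtime) {
  const int64_t needed = div_up(H_out_runtime, oh_tile);
  return needed < num_tiles ? needed : num_tiles;
}

// Tiles that are dispatched but lie entirely at or beyond the runtime H_out.
constexpr int64_t dead_tiles(
    int64_t num_tiles, int64_t oh_tile, int64_t H_out_runtime) {
  return num_tiles - live_tiles(num_tiles, oh_tile, H_out_runtime);
}

// Built at H_out = 128 with 3 tiles of 43 rows, run at H_out = 64.
static_assert(dead_tiles(3, 43, 128) == 0, "no dead tiles at the build shape");
static_assert(dead_tiles(3, 43, 64) == 1, "one of three tiles is dead");
```

Each dead tile still costs one im2col dispatch and one full GEMM dispatch, so the wasted work scales with this count.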

3. Cross-tile WAR serialization — the load-bearing assumption (correctness)

The whole scheme hinges on the comment at Conv2dGemm.cpp ("reusing the same TmpTensor across tiles serializes them via the backend's automatic read/write barriers"). Tile t's GEMM reads the scratch and tile t+1's im2col writes it — a write-after-read hazard that requires an actual pipeline barrier, not just command-buffer ordering. This depends on DispatchNode emitting a barrier on the read→write transition of the shared resource. The new device test validates this end-to-end across tile boundaries, which is the right way to cover it — just flagging that this is the single point where a backend barrier-tracking regression would silently corrupt output, so it's worth keeping that test in the gating set.

4. `static_cast<ValueRef>(oh_tile)` into `resize_args` (robustness — well-handled)

Packing a raw int into a ValueRef (= int32_t) slot is clever and the over-trigger analysis is correct: was_value_updated → is_valid_value_idx bounds-guards the lookup, and neither resize fn dereferences slot 5 as a handle (GEMM ignores it; im2col reads it via static_cast). One residual subtlety the comment could call out: if oh_tile's integer value happens to collide with a real value index that is a ValueList, was_value_updated recurses through toConstValueList() — still memory-safe and read-only, just a slightly deeper spurious walk. Benign, but the "treats this slot as a value index" note understates that it may also recurse. Minor.
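The packing pattern can be sketched in isolation as follows. Only ValueRef = int32_t and the slot index 5 come from the discussion above; `ResizeArgs`, `read_oh_tile`, and `looks_like_valid_value_idx` are hypothetical stand-ins written for this example, not the graph runtime's actual functions.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

using ValueRef = int32_t;  // per the note above, a ValueRef is an int32_t index

constexpr std::size_t kOhTileSlot = 5;  // slot index mentioned above
using ResizeArgs = std::array<ValueRef, 6>;

// The resize callback reads the slot back as a plain integer instead of
// looking it up in the graph's value table.
constexpr int32_t read_oh_tile(const ResizeArgs& args) {
  return static_cast<int32_t>(args[kOhTileSlot]);
}

// Stand-in for a bounds guard of the is_valid_value_idx kind: generic code
// that treats every slot as a value index must check the range first.
constexpr bool looks_like_valid_value_idx(ValueRef idx, std::size_t num_values) {
  return idx >= 0 && static_cast<std::size_t>(idx) < num_values;
}

// Slots 0 to 4 hold real value indices (arbitrary here); slot 5 holds oh_tile.
constexpr ResizeArgs kArgs = {10, 11, 12, 13, 14, static_cast<ValueRef>(28)};

static_assert(read_oh_tile(kArgs) == 28, "raw int round-trips through the slot");
// In a graph with 100 values, 28 collides with a real index (spurious lookup).
static_assert(looks_like_valid_value_idx(kArgs[kOhTileSlot], 100), "collision");
// In a graph with 20 values, the guard rejects it before any lookup.
static_assert(!looks_like_valid_value_idx(kArgs[kOhTileSlot], 20), "guarded");
```

The second assertion is the over-trigger case described above: the lookup is harmless only as long as every generic consumer of the slot is read-only.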

5. Budget estimate ignores texture alignment/padding (minor)

bytes_per_h_row = W_out * K_total * elem_size is the logical size. For the texture2d/texture3d scratch variants the physical allocation includes texel (vec4) rounding and image row/layer alignment, so actual device memory can modestly exceed the nominal 16 MB. Not a problem (the budget is a soft tuning knob, not a hard cap), but if the intent is a strict ceiling on mobile, the comment on kIm2colScratchBudgetBytes could note it's a logical-size budget.

Nits

Test io_storage is pinned to kTexture3D (well-justified in the comment) and the test requires a Vulkan device. Confirm it's wired into a gated job so it actually runs; otherwise the cross-tile barrier coverage (note 3) is dormant.
conv2d_gemm.glsl store-guard comment is excellent; the m_local < M_TILE check also enforces the upper oh bound of the tile (oh < oh_offset + oh_tile), which is why no row leaks into a neighboring tile's range. That might be worth stating, since the comment only mentions the oh < H_out guard.
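The interaction of the two guards can be modeled on the CPU as below. This is a sketch of the row decode and store condition as described in this review, assuming M_TILE = oh_tile * W_out and a row-major (oh, ow) layout within a tile; it is not a transcription of conv2d_gemm.glsl.

```cpp
#include <cstdint>

struct OutCoord {
  int32_t oh;
  int32_t ow;
};

// Decode a tile-local row index into output coordinates.
constexpr OutCoord decode_row(int32_t m_local, int32_t oh_offset, int32_t W_out) {
  return OutCoord{oh_offset + m_local / W_out, m_local % W_out};
}

// Store only if the row belongs to this tile (m_local < M_TILE, which implies
// oh < oh_offset + oh_tile) and to the runtime tensor (oh < H_out).
constexpr bool should_store(
    int32_t m_local, int32_t oh_offset, int32_t oh_tile,
    int32_t W_out, int32_t H_out) {
  const int32_t M_TILE = oh_tile * W_out;
  return m_local < M_TILE && decode_row(m_local, oh_offset, W_out).oh < H_out;
}

// W_out = 4, oh_tile = 2, H_out = 3: tile 0 covers rows 0..1, tile 1 rows 2..3.
static_assert(should_store(7, 0, 2, 4, 3), "tile 0, last element of row 1");
static_assert(should_store(0, 2, 2, 4, 3), "tile 1, row 2 is live");
static_assert(!should_store(4, 2, 2, 4, 3), "tile 1, row 3 is past H_out");
static_assert(!should_store(8, 0, 2, 4, 3), "row 2 belongs to tile 1, not tile 0");
```

The last assertion is the case the nit refers to: without the m_local < M_TILE check, tile 0 would write into the rows owned by tile 1.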

None of these are blocking. #1 is the most worthwhile follow-up (it directly extends the memory win to a perf win on high-res texture2d paths).
• branch gh/SS-JIA/566/head

[ghstack-poisoned]

Pull Request resolved: #20653

ghstack-source-id: 398747242
@exported-using-ghexport

Differential Revision: [D110231992](https://our.internmc.facebook.com/intern/diff/D110231992/)

Update

7166324

[ghstack-poisoned]

SS-JIA temporarily deployed to cadence June 30, 2026 21:25 — with GitHub Actions Inactive

meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jun 30, 2026

metascroy approved these changes Jun 30, 2026

meta-codesync Bot added the meta-exported label Jun 30, 2026

Update

c8efc34

[ghstack-poisoned]

SS-JIA temporarily deployed to cadence June 30, 2026 23:09 — with GitHub Actions Inactive

SS-JIA changed the base branch from gh/SS-JIA/566/base to main July 1, 2026 05:04

SS-JIA changed the base branch from main to gh/SS-JIA/566/base July 1, 2026 05:04

meta-codesync Bot merged commit cf4e746 into gh/SS-JIA/566/base Jul 1, 2026
185 of 187 checks passed

meta-codesync Bot deleted the gh/SS-JIA/566/head branch July 1, 2026 05:18

meta-codesync Bot temporarily deployed to cherry-pick-bot July 1, 2026 05:18 Inactive

pytorchbot mentioned this pull request Jul 1, 2026

[ET-VK][conv2d] Cap im2col scratch memory via output-height tiling #20657

Merged
