JulianCloudNTH · 2026-06-28T16:22:47Z

Stack from ghstack (oldest at bottom):

End-to-end validation that one graph built at the upper-bound seq-len serves every smaller live shape, matching the torch golden.

Problem: the dynamic-resize engine (allocate-at-max buffers + per-op resize hooks + output resize) had unit-level reasoning but no single oracle proving a graph built at S=MAX runs correctly at S<MAX without reallocating buffers (which would invalidate bind groups).

Solution: a native test that builds each toy model at S=MAX and runs it at several live S, asserting the output matches a torch-computed golden and that the output EValue is resized to the live shape.
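A minimal sketch of the property this test pins down, assuming a hypothetical `MaxAllocBuffer` class that stands in for the engine's allocate-at-max buffers (it is not the ExecuTorch API): the backing storage is created once at S=MAX, and a resize only changes the live shape.

```python
from array import array


class MaxAllocBuffer:
    """Hypothetical stand-in for a buffer allocated once at the upper-bound shape."""

    def __init__(self, max_seq_len: int, hidden: int) -> None:
        self.max_seq_len = max_seq_len
        self.hidden = hidden
        # Backing storage is sized for S=MAX and never reallocated.
        self.storage = array("f", [0.0] * (max_seq_len * hidden))
        self.live_seq_len = max_seq_len

    def resize(self, seq_len: int) -> None:
        """Change the live shape without touching the storage."""
        if not 1 <= seq_len <= self.max_seq_len:
            raise ValueError(f"live S={seq_len} outside [1, {self.max_seq_len}]")
        self.live_seq_len = seq_len

    def live_numel(self) -> int:
        """Number of elements covered by the current live shape."""
        return self.live_seq_len * self.hidden


buf = MaxAllocBuffer(max_seq_len=8, hidden=4)
storage_id = id(buf.storage)
for s in (2, 8, 5):  # includes a grow after a shrink
    buf.resize(s)
    # Same allocation on every call, so anything bound to it stays valid.
    assert id(buf.storage) == storage_id
    print(s, buf.live_numel())
```

The identity check plays the role of the "buffers never move" assertion: if a resize ever reallocated, handles bound to the old storage would go stale.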

Cases A-D: dynamic + static rms_norm (resize shrinks the dispatch; one reused graph across S proves buffers never move; static path unchanged).
Cases F-H: rms(rms(x)) cascade, rms(x)+x (rms->add cascade), rms(x)*x (mul).
Cases I-L: dynamic linear_q4gsw (GEMM at several M), sdpa_with_kv_cache (GQA prefill at several S), embedding_q4gsw (int64 ids), apply_rotary_emb (two outputs).
Cases M-N: dynamic sigmoid (elementwise) and select_copy(0, -1) (negative index resolved against the live leading dim each call).
Graph-reuse variants: every dynamic op above (rms_norm incl. a grow-first smallest→largest order, the rms(rms(x)) cascade, linear_q4gsw, embedding_q4gsw, apply_rotary_emb, sigmoid, select_copy) also runs ONE loaded graph across multiple live shapes — proving buffers never move so bind groups stay valid across every resize.
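The select_copy(0, -1) case depends on the negative index being resolved against the live leading dim on every call rather than once at build time. A small sketch of that resolution, with `resolve_index` as a hypothetical helper name and not the engine's actual function:

```python
def resolve_index(index: int, live_dim: int) -> int:
    """Map a possibly negative index onto [0, live_dim) for the current call."""
    resolved = index + live_dim if index < 0 else index
    if not 0 <= resolved < live_dim:
        raise IndexError(f"index {index} out of range for dim of size {live_dim}")
    return resolved


# The same graph-level index (-1) lands on a different row at each live S.
for live_s in (1, 7, 16):
    print(live_s, resolve_index(-1, live_s))
```

Resolving against S=MAX instead would read a row past the live data whenever S<MAX, which is the failure the per-call resolution avoids.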

Implementation:

test/ops/dynamic_shape/test_dynamic_shape_export.py exports each toy model through VulkanPartitioner with a dynamic dim and writes per-S torch goldens; reuses the existing op-test helpers for quant/sdpa/embedding/rope.
test/native/test_dynamic_shape.cpp loads each .pte, runs each live S, and compares at the per-op tolerance (rms 1e-3, quant 5e-3, sdpa 2e-3). Reuse tests split each per-op helper into load-once + run-at-shape so a single Module serves the whole shape sweep.
Multi-output ops select their output by full shape, never numel.
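To illustrate why selecting by full shape matters, here is a sketch with two made-up output shapes that share a numel. `select_output_by_shape` is a hypothetical name for illustration, and the shapes are not taken from the test.

```python
from math import prod


def select_output_by_shape(shapes, want):
    """Return the index of the first output whose full shape matches, else -1."""
    want = tuple(want)
    for i, shape in enumerate(shapes):
        if tuple(shape) == want:
            return i
    return -1


# Two outputs with identical numel (64) but different layouts.
outputs = [(1, 2, 4, 8), (1, 4, 2, 8)]
want = (1, 4, 2, 8)

# Matching on numel alone picks the first output, which is the wrong one.
by_numel = next(i for i, s in enumerate(outputs) if prod(s) == prod(want))
by_shape = select_output_by_shape(outputs, want)
print(by_numel, by_shape)  # 0 1
```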

Constraints: numerics computed with torch (no hand-rolled reference); toy models stay within the 65535 1D-dispatch cap; SDPA case is skipped gracefully if sym_size.int/copy_ op coverage is incomplete (does not fail the suite).
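The 65535 cap matches WebGPU's default `maxComputeWorkgroupsPerDimension` limit. A rough way to check whether a toy shape stays under it, assuming one element per invocation and an illustrative workgroup size of 64 (both are assumptions, the real shaders may differ):

```python
MAX_WORKGROUPS_PER_DIM = 65535  # the 1D-dispatch cap cited above
WORKGROUP_SIZE = 64  # assumption: illustrative value only


def workgroups_1d(numel: int, workgroup_size: int = WORKGROUP_SIZE) -> int:
    """Workgroups needed to cover numel elements in a 1D dispatch (ceil division)."""
    return -(-numel // workgroup_size)


def fits_1d_dispatch(numel: int, workgroup_size: int = WORKGROUP_SIZE) -> bool:
    """True when the dispatch stays within the per-dimension workgroup cap."""
    return workgroups_1d(numel, workgroup_size) <= MAX_WORKGROUPS_PER_DIM


# Largest element count that still fits under these assumptions.
limit = MAX_WORKGROUPS_PER_DIM * WORKGROUP_SIZE
print(limit, fits_1d_dispatch(limit), fits_1d_dispatch(limit + 1))
```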

Co-authored-with: Claude Code.
@exported-using-ghexport

Differential Revision: D109906090

[ghstack-poisoned]

pytorch-bot · 2026-06-28T16:22:51Z

github-actions · 2026-06-28T16:23:29Z

This PR needs a `release notes:` label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

JulianCloudNTH · 2026-06-29T20:21:35Z

@claude review and check for any areas or opportunities for modularization

claude · 2026-06-29T20:22:27Z

Claude finished @JulianCloudNTH's task in 1m 28s.

Review: dynamic-shape integration test

Read changed files
Review for correctness & best practices
Identify modularization opportunities

Solid end-to-end test — building one graph at S=MAXS and replaying smaller live shapes against a torch golden is exactly the right oracle for the allocate-at-max + per-op-resize engine, and selecting multi-output tensors by full shape (not numel) is the correct call. The graceful SKIP paths (no WebGPU device, SDPA op-coverage PENDING) keep the suite green where appropriate. Below are the modularization opportunities you asked about, plus a few minor notes.

Modularization — Python (`test_dynamic_shape_export.py`)

1. The lower → assert-delegated → write-buffer block is copy-pasted 5×. _export (lines 97-108) already does this, but _export_dynamic_linear, _export_dynamic_sdpa, _export_dynamic_embedding, and _export_dynamic_rope each re-implement the identical to_edge_transform_and_lower(...).to_executorch() + any(d.id == "VulkanBackend" ...) + open().write(et.buffer) sequence (e.g. lines 210-218, 253-261, 292-299, 335-343). Since the only thing that differs between callers is how the ExportedProgram is built, split _export so everyone shares the tail:

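```python
# Relies on the test module's existing imports of
# to_edge_transform_and_lower and VulkanPartitioner.
def _lower_and_write(ep, path: str) -> None:
    """Lower an ExportedProgram through VulkanPartitioner and write the .pte to path."""
    et = to_edge_transform_and_lower(ep, partitioner=[VulkanPartitioner()]).to_executorch()
    # Fail fast if nothing was delegated to the Vulkan backend.
    assert any(
        d.id == "VulkanBackend"
        for plan in et.executorch_program.execution_plan
        for d in plan.delegates
    ), f"Expected VulkanBackend delegate in {path}"
    with open(path, "wb") as f:
        f.write(et.buffer)
    print(f"Exported {path}")
```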

_export then becomes torch.export.export(...) + _lower_and_write(ep, path), and the four specialized exporters drop ~6 duplicated lines each.

2. The tensor → little-endian f32 → file incantation appears ~10×. t.detach().cpu().numpy().astype("<f4").tofile(os.path.join(out_dir, ...)) is repeated in _write_goldens, _export_dynamic_linear, _export_dynamic_sdpa, _export_dynamic_embedding, and _export_dynamic_rope. A one-liner _write_f32(t, path) (and a sibling _write_i64 for the embedding idx at line 307) would centralize the dtype contract that the native side depends on, so a future dtype change is one edit instead of ten.

3. SDPA and RoPE golden loops are structurally identical (lines 266-270, 347-351): iterate [(name, tensor), ...] and dump each to {prefix}.S{s}.{name}.bin. Once _write_f32 exists, both collapse to a shared _write_named(prefix, s, pairs, out_dir).

Modularization — C++ (`test_dynamic_shape.cpp`)

4. The four op-specific checkers share a fixed skeleton. check_linear, check_sdpa, check_embedding, and check_rope each repeat: construct Module, load_forward() with FAIL/ok=false handling, read input bins, make_tensor_ptr, forward, result .ok()/isTensor guards, slice to numel, max_err, and the printf(... PASS/FAIL); ok = ok && pass tail. The per-op delta is really just (a) input tensor construction, (b) output selection, and (c) tolerance. Worth extracting the invariant pieces:

- forward_and_check(got, golden, label, s, tol, ok): the max_err + PASS/FAIL print + ok &= pass tail (currently duplicated in check_s, check_linear, check_embedding, check_rope, check_sdpa).
- find_output_by_shape(result, predicate): the shape-matching output-selection loop is written twice (SDPA lines 207-217, RoPE lines 309-321) with the same structure.

5. The main()-level "load fresh module per S and check_s" loop is repeated 4× for cases A/B, F, G, H (lines 360-369, 398-407, 410-419, 422-431) — identical except the .pte name and the S list. A helper:

```cpp
// Load a fresh Module for each live S and run check_s against it.
void run_dyn_cases(const std::string& dir, const std::string& prefix,
                   std::initializer_list<int> ss, bool& ok) {
  for (int s : ss) {
    Module m(dir + "/" + prefix + ".pte");
    if (m.load_forward() != Error::Ok) {
      printf("  FAIL load %s.pte\n", prefix.c_str());
      ok = false;
      return;
    }
    check_s(m, dir, prefix, s, ok);
  }
}
```

turns each case into a single call.

6. Templatize the binary reader. read_bin only handles float, so check_embedding (lines 246-255) hand-rolls the int64 read. A read_bin<T>(path) would remove that special case and keep one file-reading code path.
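The idea behind suggestion 6, shown in Python for brevity: one reader parameterized by element type instead of a float-only reader plus a hand-rolled int64 path. The `read_bin(path, dtype)` signature and the `"f32"`/`"i64"` keys are assumptions for illustration. The C++ version would be a `read_bin<T>` template.

```python
import os
import struct
import tempfile

# Struct format codes for the two element types the test files use.
_FORMATS = {"f32": "f", "i64": "q"}


def read_bin(path: str, dtype: str) -> list:
    """Read a whole file as little-endian values of the given element type."""
    code = _FORMATS[dtype]
    item_size = struct.calcsize("<" + code)
    with open(path, "rb") as f:
        data = f.read()
    if len(data) % item_size != 0:
        raise ValueError(f"{path}: {len(data)} bytes is not a multiple of {item_size}")
    count = len(data) // item_size
    return list(struct.unpack(f"<{count}{code}", data))


# Round-trip a few int64 ids through a temporary file.
with tempfile.TemporaryDirectory() as tmp:
    ids_path = os.path.join(tmp, "ids.bin")
    with open(ids_path, "wb") as f:
        f.write(struct.pack("<3q", 0, 5, 41))
    ids = read_bin(ids_path, "i64")
print(ids)
```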

Minor notes (non-blocking)

- Magic tolerances. 1e-3f, 5e-3f, 2e-3f are inlined at each call site (lines 106, 155, 230, 270, 331). Named constants (kTolRms, kTolQuant, kTolSdpa) would document intent and pair naturally with the tol parameter in suggestion #4.
- Header comment lists A–F but cases extend to L (lines 9-21). The block comment stops at case F while the suite runs through L, so it is worth completing or trimming before it drifts further.
- max_err returns 1e30f as a size-mismatch sentinel (line 57). It works because every comparison is < tol, but an explicit got.empty() || size mismatch → FAIL at the call site reads more clearly than relying on a magic large float comparing false.
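A sketch of the first and third notes together, in Python for brevity: named tolerances (values taken from the PR description) and an explicit size check instead of a sentinel. The `check` helper and its return shape are hypothetical.

```python
# Per-op tolerances from the PR description, named once.
TOL_RMS = 1e-3
TOL_QUANT = 5e-3
TOL_SDPA = 2e-3


def max_abs_err(got, golden) -> float:
    """Largest elementwise absolute difference; callers guarantee equal, non-zero length."""
    return max(abs(a - b) for a, b in zip(got, golden))


def check(got, golden, tol: float):
    """Return (passed, message), failing explicitly on empty or mismatched sizes."""
    if not got or len(got) != len(golden):
        return False, f"size mismatch: got {len(got)}, expected {len(golden)}"
    err = max_abs_err(got, golden)
    return err < tol, f"max_err={err:.3e} (tol {tol:.0e})"


print(check([1.0, 2.0], [1.0, 2.0005], TOL_RMS))
print(check([], [1.0], TOL_RMS))
```

With the size check at the call site, a mismatch reports what went wrong instead of surfacing as an implausibly large error value.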

None of these are correctness blockers — the test logic itself looks right. The Python _lower_and_write extraction (#1) is the highest-value change; the C++ helpers (#4/#5) are next.

[ghstack-poisoned]

SS-JIA

Review automatically exported from Phabricator review in Meta.

[ghstack-poisoned]

Update

29ec110

[ghstack-poisoned]

JulianCloudNTH requested review from kirklandsign and larryliu0820 as code owners June 28, 2026 16:22

JulianCloudNTH temporarily deployed to cadence June 28, 2026 16:22 — with GitHub Actions Inactive

meta-cla Bot added the CLA Signed label Jun 28, 2026

Update

0c7b54b

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 29, 2026 22:10 — with GitHub Actions Inactive

meta-codesync Bot added the meta-exported label Jun 29, 2026

Update

9a9981f

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 30, 2026 02:46 — with GitHub Actions Inactive

Update

c3cd768

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence June 30, 2026 21:10 — with GitHub Actions Inactive

This was referenced Jun 30, 2026

[ExecuTorch][WebGPU] 2D-fold mul + permute dispatch (lift 65535 1D cap) #20651

Open

[ExecuTorch][WebGPU] Use requiredFeatures instance API on native + emscripten Dawn #20652

Open

SS-JIA requested changes Jul 2, 2026

Update

c0e55b5

[ghstack-poisoned]

JulianCloudNTH had a problem deploying to cadence July 2, 2026 23:00 — with GitHub Actions Error

Update

766b9c8

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence July 2, 2026 23:14 — with GitHub Actions Inactive

JulianCloudNTH mentioned this pull request Jul 2, 2026

[ExecuTorch][WebGPU] Convert remaining native tests to GTest #20706

Open

Update

7e10497

[ghstack-poisoned]

JulianCloudNTH had a problem deploying to cadence July 3, 2026 20:28 — with GitHub Actions Error

JulianCloudNTH temporarily deployed to cadence July 3, 2026 20:28 — with GitHub Actions Inactive

Update

49f913e

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence July 3, 2026 20:52 — with GitHub Actions Inactive

JulianCloudNTH had a problem deploying to cadence July 3, 2026 21:19 — with GitHub Actions Error

JulianCloudNTH requested a review from psiddh July 3, 2026 21:26

Update

85f0845

[ghstack-poisoned]

JulianCloudNTH temporarily deployed to cadence July 3, 2026 21:37 — with GitHub Actions Inactive

JulianCloudNTH temporarily deployed to cadence July 3, 2026 22:05 — with GitHub Actions Inactive

psiddh approved these changes Jul 4, 2026

[ExecuTorch][WebGPU] Dynamic-shape integration test (allocate-at-max + per-op resize) #20582

JulianCloudNTH wants to merge 9 commits into gh/JulianCloudNTH/74/base from gh/JulianCloudNTH/74/head

4 participants

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/20582

❌ 1 New Failure
