Inline per-tensor SIMD fast path in fusion_g3 op_dequantize (#20499) by zonglinpeng · Pull Request #20499 · pytorch/executorch · GitHub

Inline per-tensor SIMD fast path in fusion_g3 op_dequantize (#20499) #20499

Open
zonglinpeng wants to merge 1 commit into pytorch:main from zonglinpeng:export-D109500113

Conversation

@zonglinpeng

@zonglinpeng zonglinpeng commented Jun 24, 2026

Contributor

Summary:

When the input and output buffers are 16-byte aligned (`dequant_simd_aligned`), the per-tensor path runs an inline PDX SIMD loop (`xb_vecMxf32`/`xb_vecMx32`/`PDX_MUL_MXF32`); otherwise it falls back to the NNLib path (`xa_nn_elm_dequantize_*`). The result is numerically identical to the original op: the same float-domain affine `(x - zero_point) * scale`.

@pytorch-bot

pytorch-bot Bot commented Jun 24, 2026


@meta-cla meta-cla Bot added the CLA Signed label Jun 24, 2026
@linux-foundation-easycla

linux-foundation-easycla Bot commented Jun 24, 2026


CLA Signed
The committers listed above are authorized under a signed CLA.

  • ✅ login: zonglinpeng / name: Zonglin Peng (3772ece)

@meta-codesync

meta-codesync Bot commented Jun 24, 2026

Contributor

…20499)

Summary:

Recreates the optimized Fusion-G3 dequantize from D108798741, but instead of shipping it as a separate devmate kernel wired in through `operator_fallback.bzl`, it places the PDX SIMD fast path directly into the existing executorch operator `dequantize_per_tensor_out` in `executorch/backends/cadence/fusion_g3/operators/op_dequantize.cpp` (per-tensor function only; `per_channel`/`tensor`/`tensor_args` variants are untouched).

When the input and output buffers are 16-byte aligned (`dequant_simd_aligned`), the per-tensor path runs an inline PDX SIMD loop (`xb_vecMxf32`/`xb_vecMx32`/`PDX_MUL_MXF32`); otherwise it falls back to the NNLib path (`xa_nn_elm_dequantize_*`). The result is numerically identical to the original op — the same float-domain affine `(x - zero_point) * scale`.
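The dispatch described above can be sketched in portable C++. This is a minimal illustration, not the code from the PR: `DequantPath`, `is_16_byte_aligned`, `choose_path`, and `dequantize_reference` are hypothetical names, the body of the real `dequant_simd_aligned` check is not shown in this description, and the scalar loop only stands in for the arithmetic that both the PDX SIMD loop and the NNLib call are said to compute.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical labels for the two paths named in the summary.
enum class DequantPath { kInlineSimd, kNnlibFallback };

// True when the pointer sits on a 16-byte boundary.
inline bool is_16_byte_aligned(const void* p) {
  return (reinterpret_cast<std::uintptr_t>(p) & std::uintptr_t{15}) == 0;
}

// Assumed shape of the alignment gate: the fast path is taken only when
// BOTH the input and the output buffer are 16-byte aligned.
inline DequantPath choose_path(const void* in, const void* out) {
  return (is_16_byte_aligned(in) && is_16_byte_aligned(out))
      ? DequantPath::kInlineSimd
      : DequantPath::kNnlibFallback;
}

// Scalar reference for the float-domain affine:
//   out[i] = (in[i] - zero_point) * scale
// The subtraction happens after converting to float, matching the
// "float-domain" wording in the summary.
inline void dequantize_reference(
    const std::int8_t* in,
    float* out,
    std::size_t n,
    float scale,
    std::int32_t zero_point) {
  assert(in != nullptr && out != nullptr);
  const float zp = static_cast<float>(zero_point);
  for (std::size_t i = 0; i < n; ++i) {
    out[i] = (static_cast<float>(in[i]) - zp) * scale;
  }
}
```

A scalar reference like this is also a convenient oracle for checking that a vectorized path and a fallback path agree element for element.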

This intentionally does NOT include the mvartanian integer-subtract change (D109458111, `PDX_SUB_MX32`); it uses the float-domain asymmetric path from D108798741 as requested. The macro fast paths (`ASYM_DEQUANTIZE_IMPL_CHANNEL`/`SYM_DEQUANTIZE_IMPL_CHANNEL`) get the `static_cast<CTYPE_OUT>((x - zp) * scale)` parenthesization needed to build cleanly under the G3 `dev` mode's `-Werror,-Wdouble-promotion`.
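To illustrate why the parenthesization matters, here is a small sketch under stated assumptions: `scale` is taken to be a `double` and `CTYPE_OUT` to be `float`. The pre-fix expression is not shown in this description, so the "warning" form in the comment is only one plausible shape of it, and `dequantize_one` and `self_check` are hypothetical names rather than anything in the macros.

```cpp
#include <cassert>
#include <cstdint>

// One plausible form that trips -Wdouble-promotion (assumption, not quoted
// from the PR):
//
//   static_cast<CTYPE_OUT>(x - zp) * scale
//
// With CTYPE_OUT = float and a double scale, the float operand is implicitly
// promoted to double for the multiply, which -Wdouble-promotion reports.
//
// The parenthesized form named in the summary casts the whole product
// instead. The integer difference is converted to double for the multiply
// (an int-to-double conversion, not a float-to-double promotion), and the
// result is narrowed to CTYPE_OUT explicitly.
template <typename CTYPE_OUT, typename CTYPE_IN>
inline CTYPE_OUT dequantize_one(CTYPE_IN x, std::int32_t zp, double scale) {
  return static_cast<CTYPE_OUT>((x - zp) * scale);
}

// Small usage check: (4 - 2) * 0.5 is exactly 1.0.
inline void self_check() {
  assert(dequantize_one<float>(std::int8_t{4}, 2, 0.5) == 1.0f);
}
```

Compiling a snippet like this with `-Werror -Wdouble-promotion` on a Clang-based toolchain is a quick way to confirm which form builds cleanly before touching the macros.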

For A/B measurement this also adds `op_dequantize_baseline.cpp` under the Jarvis operator test dir: a benchmark-only snapshot of the ORIGINAL executorch op (pre-SIMD, with only the `-Wdouble-promotion` fix). It defines `impl::G3::native::dequantize_per_tensor_out`, so the shared benchmark source from D109441948 is linked into two binaries — `_optimized` (against the real executorch op) and `_stock` (against the snapshot) — and compared on the cycle-accurate G3 ISS. `operators_header` visibility is extended to the Jarvis test package so the snapshot can include `operators.h`.

Reviewed By: mvartani-meta

Differential Revision: D109500113
@meta-codesync meta-codesync Bot changed the title from "Inline per-tensor SIMD fast path in fusion_g3 op_dequantize (recreate D108798741)" to "Inline per-tensor SIMD fast path in fusion_g3 op_dequantize (#20499)" Jun 25, 2026
@zonglinpeng zonglinpeng added the release notes: cadence label Jun 25, 2026

Labels

CLA Signed, meta-exported, release notes: cadence


3 participants