iframe-proxy

thevar1able · 2026-06-22T14:57:10Z

The x86_64 counterpart of #3 (which does the same for AArch64 with NEON tbl). Independent of that PR; the two touch the same lines and whichever merges second will need a trivial conflict resolution.

Vectorize small-offset overlapping copies on x86_64 with SSSE3 `pshufb`

For small match offsets (overlap period < 16 bytes), ZSTD_wildcopy falls back to an 8-byte-per-iteration COPY8 loop. This loop dominates decompression of integer and low-cardinality data, whose matches frequently have offsets of 4 or 8 bytes (e.g. runs of repeated fixed-width values).

This builds the repeating pattern once in an SSE register and stores 16 bytes per iteration, advancing the pattern with a single _mm_shuffle_epi8 (pshufb) table lookup and no load inside the loop. pshufb is the direct analog of NEON vqtbl1q_u8, using the same index tables:

init[diff][j] = j % diff builds the period-diff pattern from a 16-byte load;
adv[diff][j] = (16 + j) % diff shifts the pattern forward by 16 bytes each iteration, which composes correctly because pattern[k] = src[k % diff].

The path is gated behind compile-time SSSE3 detection (ZSTD_ARCH_X86_SSSE3, from __SSSE3__). The scalar COPY8 path is kept unchanged for all other targets, and the non-overlapping (offset >= 16) path is untouched.

Results

On an AMD Ryzen 9 9950X (x86-64-v3), ZSTD level 1 decompression of ClickBench hits columns (1M rows, 64 KiB blocks, best of 15):

The string and incompressible columns are unchanged, as expected: their decompression does not exercise the small-offset path.

Correctness

Round-trip decompression is byte-exact against the original input for all tested columns at block sizes from 4 KB to 1 MB, and clean under AddressSanitizer and UndefinedBehaviorSanitizer. Each iteration stores 16 bytes and stops once op >= oend, overrunning the logical end by at most 15 bytes — well within the existing WILDCOPY_OVERLENGTH (32) slack, asserted with ZSTD_STATIC_ASSERT.

This ports the AArch64 NEON `tbl` optimization for `ZSTD_wildcopy`'s small-offset path to x86_64 using SSSE3 `pshufb`. For small match offsets (overlap period < 16 bytes), `ZSTD_wildcopy` falls back to an 8-byte-per-iteration `COPY8` loop. This dominates decompression of integer and low-cardinality columns, whose matches frequently have offsets of 4 or 8 bytes (e.g. repeated fixed-width values). Build the repeating pattern once in an SSE register and store 16 bytes per iteration, advancing the pattern with a single `_mm_shuffle_epi8` (`pshufb`) table lookup and no load inside the loop. `pshufb` is the direct analog of NEON `vqtbl1q_u8` and reuses the same index-table scheme: init[diff][j] = j % diff : build the pattern from a 16-byte load adv[diff][j] = (16 + j) % diff : shift the pattern forward by 16 bytes The path is gated behind compile-time SSSE3 detection (`ZSTD_ARCH_X86_SSSE3`, from `__SSSE3__`); the scalar `COPY8` path is kept for all other targets, and the non-overlapping (offset >= 16) path is unchanged. Each iteration stores 16 bytes and stops once `op >= oend`, overrunning the logical end by at most 15 bytes, which always fits in the `WILDCOPY_OVERLENGTH` (32) slack callers guarantee; this is asserted with `ZSTD_STATIC_ASSERT`. On an AMD Ryzen 9 9950X (x86-64-v3), ZSTD level 1 decompression of ClickBench `hits` columns (1M rows, 64 KiB blocks, best of 15) speeds up: column baseline MB/s patched MB/s speedup UserID 7373.3 8818.0 1.20x AdvEngineID 11036.8 12401.9 1.12x ClientIP 4000.8 4640.8 1.16x RegionID 3591.1 4080.2 1.14x URL 3052.1 3079.7 1.01x Title 4643.5 4688.6 1.01x WatchID 84246.0 83910.3 1.00x The string columns (`URL`, `Title`) and the incompressible `WatchID` column are unchanged, as expected: their decompression does not exercise the small-offset path. Verified byte-exact across block sizes 4 KiB-1 MiB and clean under ASan and UBSan. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Algunenano · 2026-06-22T18:54:17Z

This was referenced Jun 22, 2026

Vectorize small-offset overlapping copies on AArch64 with NEON tbl #3

Merged

Speed up ZSTD decompression on AArch64 (small-offset overlapping copies) ClickHouse/ClickHouse#108049

Merged

Algunenano approved these changes Jun 22, 2026

View reviewed changes

column	type	baseline MB/s	patched MB/s	speedup
`UserID`	Int64	7373	8818	1.20x
`AdvEngineID`	Int16	11037	12402	1.12x
`ClientIP`	Int32	4001	4641	1.16x
`RegionID`	Int32	3591	4080	1.14x
`URL` / `Title`	String	3052 / 4644	3080 / 4689	~1.00x (unchanged)
`WatchID`	Int64 (incompressible)	84246	83910	~1.00x (unchanged)

Sunbelt Computer Software

PL/B Language Development and Support

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Vectorize small-offset overlapping copies on x86_64 with SSSE3 pshufb#4

Vectorize small-offset overlapping copies on x86_64 with SSSE3 pshufb#4
thevar1able wants to merge 1 commit into
clickhousefrom
x86-ssse3-overlap-copy

thevar1able commented Jun 22, 2026

Uh oh!

Algunenano commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

thevar1able commented Jun 22, 2026

Vectorize small-offset overlapping copies on x86_64 with SSSE3 pshufb

Results

Correctness

Uh oh!

Algunenano commented Jun 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Vectorize small-offset overlapping copies on x86_64 with SSSE3 `pshufb`