Vectorize small-offset overlapping copies on x86_64 with SSSE3 pshufb by thevar1able · Pull Request #4 · ClickHouse/zstd · GitHub
Skip to content

Vectorize small-offset overlapping copies on x86_64 with SSSE3 pshufb#4

Open
thevar1able wants to merge 1 commit into
clickhousefrom
x86-ssse3-overlap-copy
Open

Vectorize small-offset overlapping copies on x86_64 with SSSE3 pshufb#4
thevar1able wants to merge 1 commit into
clickhousefrom
x86-ssse3-overlap-copy

Conversation

@thevar1able

Copy link
Copy Markdown
Member

The x86_64 counterpart of #3 (which does the same for AArch64 with NEON tbl). Independent of that PR; the two touch the same lines and whichever merges second will need a trivial conflict resolution.

Vectorize small-offset overlapping copies on x86_64 with SSSE3 pshufb

For small match offsets (overlap period < 16 bytes), ZSTD_wildcopy falls back to an 8-byte-per-iteration COPY8 loop. This loop dominates decompression of integer and low-cardinality data, whose matches frequently have offsets of 4 or 8 bytes (e.g. runs of repeated fixed-width values).

This builds the repeating pattern once in an SSE register and stores 16 bytes per iteration, advancing the pattern with a single _mm_shuffle_epi8 (pshufb) table lookup and no load inside the loop. pshufb is the direct analog of NEON vqtbl1q_u8, using the same index tables:

  • init[diff][j] = j % diff builds the period-diff pattern from a 16-byte load;
  • adv[diff][j] = (16 + j) % diff shifts the pattern forward by 16 bytes each iteration, which composes correctly because pattern[k] = src[k % diff].

The path is gated behind compile-time SSSE3 detection (ZSTD_ARCH_X86_SSSE3, from __SSSE3__). The scalar COPY8 path is kept unchanged for all other targets, and the non-overlapping (offset >= 16) path is untouched.

Results

On an AMD Ryzen 9 9950X (x86-64-v3), ZSTD level 1 decompression of ClickBench hits columns (1M rows, 64 KiB blocks, best of 15):

column type baseline MB/s patched MB/s speedup
UserID Int64 7373 8818 1.20x
AdvEngineID Int16 11037 12402 1.12x
ClientIP Int32 4001 4641 1.16x
RegionID Int32 3591 4080 1.14x
URL / Title String 3052 / 4644 3080 / 4689 ~1.00x (unchanged)
WatchID Int64 (incompressible) 84246 83910 ~1.00x (unchanged)

The string and incompressible columns are unchanged, as expected: their decompression does not exercise the small-offset path.

Correctness

Round-trip decompression is byte-exact against the original input for all tested columns at block sizes from 4 KB to 1 MB, and clean under AddressSanitizer and UndefinedBehaviorSanitizer. Each iteration stores 16 bytes and stops once op >= oend, overrunning the logical end by at most 15 bytes — well within the existing WILDCOPY_OVERLENGTH (32) slack, asserted with ZSTD_STATIC_ASSERT.

This ports the AArch64 NEON `tbl` optimization for `ZSTD_wildcopy`'s small-offset
path to x86_64 using SSSE3 `pshufb`.

For small match offsets (overlap period < 16 bytes), `ZSTD_wildcopy` falls back
to an 8-byte-per-iteration `COPY8` loop. This dominates decompression of integer
and low-cardinality columns, whose matches frequently have offsets of 4 or 8
bytes (e.g. repeated fixed-width values).

Build the repeating pattern once in an SSE register and store 16 bytes per
iteration, advancing the pattern with a single `_mm_shuffle_epi8` (`pshufb`)
table lookup and no load inside the loop. `pshufb` is the direct analog of NEON
`vqtbl1q_u8` and reuses the same index-table scheme:
  init[diff][j] = j % diff        : build the pattern from a 16-byte load
  adv[diff][j]  = (16 + j) % diff : shift the pattern forward by 16 bytes
The path is gated behind compile-time SSSE3 detection (`ZSTD_ARCH_X86_SSSE3`,
from `__SSSE3__`); the scalar `COPY8` path is kept for all other targets, and the
non-overlapping (offset >= 16) path is unchanged. Each iteration stores 16 bytes
and stops once `op >= oend`, overrunning the logical end by at most 15 bytes,
which always fits in the `WILDCOPY_OVERLENGTH` (32) slack callers guarantee; this
is asserted with `ZSTD_STATIC_ASSERT`.

On an AMD Ryzen 9 9950X (x86-64-v3), ZSTD level 1 decompression of ClickBench
`hits` columns (1M rows, 64 KiB blocks, best of 15) speeds up:

  column        baseline MB/s   patched MB/s   speedup
  UserID               7373.3         8818.0     1.20x
  AdvEngineID         11036.8        12401.9     1.12x
  ClientIP             4000.8         4640.8     1.16x
  RegionID             3591.1         4080.2     1.14x
  URL                  3052.1         3079.7     1.01x
  Title                4643.5         4688.6     1.01x
  WatchID             84246.0        83910.3     1.00x

The string columns (`URL`, `Title`) and the incompressible `WatchID` column are
unchanged, as expected: their decompression does not exercise the small-offset
path. Verified byte-exact across block sizes 4 KiB-1 MiB and clean under ASan and
UBSan.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@Algunenano

Copy link
Copy Markdown
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants