{{ message }}
Vectorize small-offset overlapping copies on x86_64 with SSSE3 pshufb#4
Open
thevar1able wants to merge 1 commit into
Open
Vectorize small-offset overlapping copies on x86_64 with SSSE3 pshufb#4thevar1able wants to merge 1 commit into
thevar1able wants to merge 1 commit into
Conversation
This ports the AArch64 NEON `tbl` optimization for `ZSTD_wildcopy`'s small-offset path to x86_64 using SSSE3 `pshufb`. For small match offsets (overlap period < 16 bytes), `ZSTD_wildcopy` falls back to an 8-byte-per-iteration `COPY8` loop. This dominates decompression of integer and low-cardinality columns, whose matches frequently have offsets of 4 or 8 bytes (e.g. repeated fixed-width values). Build the repeating pattern once in an SSE register and store 16 bytes per iteration, advancing the pattern with a single `_mm_shuffle_epi8` (`pshufb`) table lookup and no load inside the loop. `pshufb` is the direct analog of NEON `vqtbl1q_u8` and reuses the same index-table scheme: init[diff][j] = j % diff : build the pattern from a 16-byte load adv[diff][j] = (16 + j) % diff : shift the pattern forward by 16 bytes The path is gated behind compile-time SSSE3 detection (`ZSTD_ARCH_X86_SSSE3`, from `__SSSE3__`); the scalar `COPY8` path is kept for all other targets, and the non-overlapping (offset >= 16) path is unchanged. Each iteration stores 16 bytes and stops once `op >= oend`, overrunning the logical end by at most 15 bytes, which always fits in the `WILDCOPY_OVERLENGTH` (32) slack callers guarantee; this is asserted with `ZSTD_STATIC_ASSERT`. On an AMD Ryzen 9 9950X (x86-64-v3), ZSTD level 1 decompression of ClickBench `hits` columns (1M rows, 64 KiB blocks, best of 15) speeds up: column baseline MB/s patched MB/s speedup UserID 7373.3 8818.0 1.20x AdvEngineID 11036.8 12401.9 1.12x ClientIP 4000.8 4640.8 1.16x RegionID 3591.1 4080.2 1.14x URL 3052.1 3079.7 1.01x Title 4643.5 4688.6 1.01x WatchID 84246.0 83910.3 1.00x The string columns (`URL`, `Title`) and the incompressible `WatchID` column are unchanged, as expected: their decompression does not exercise the small-offset path. Verified byte-exact across block sizes 4 KiB-1 MiB and clean under ASan and UBSan. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This was referenced Jun 22, 2026
Member
Algunenano
approved these changes
Jun 22, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

The x86_64 counterpart of #3 (which does the same for AArch64 with NEON
tbl). Independent of that PR; the two touch the same lines and whichever merges second will need a trivial conflict resolution.Vectorize small-offset overlapping copies on x86_64 with SSSE3
pshufbFor small match offsets (overlap period < 16 bytes),
ZSTD_wildcopyfalls back to an 8-byte-per-iterationCOPY8loop. This loop dominates decompression of integer and low-cardinality data, whose matches frequently have offsets of 4 or 8 bytes (e.g. runs of repeated fixed-width values).This builds the repeating pattern once in an SSE register and stores 16 bytes per iteration, advancing the pattern with a single
_mm_shuffle_epi8(pshufb) table lookup and no load inside the loop.pshufbis the direct analog of NEONvqtbl1q_u8, using the same index tables:init[diff][j] = j % diffbuilds the period-diffpattern from a 16-byte load;adv[diff][j] = (16 + j) % diffshifts the pattern forward by 16 bytes each iteration, which composes correctly becausepattern[k] = src[k % diff].The path is gated behind compile-time SSSE3 detection (
ZSTD_ARCH_X86_SSSE3, from__SSSE3__). The scalarCOPY8path is kept unchanged for all other targets, and the non-overlapping (offset >= 16) path is untouched.Results
On an AMD Ryzen 9 9950X (
x86-64-v3), ZSTD level 1 decompression of ClickBenchhitscolumns (1M rows, 64 KiB blocks, best of 15):UserIDAdvEngineIDClientIPRegionIDURL/TitleWatchIDThe string and incompressible columns are unchanged, as expected: their decompression does not exercise the small-offset path.
Correctness
Round-trip decompression is byte-exact against the original input for all tested columns at block sizes from 4 KB to 1 MB, and clean under AddressSanitizer and UndefinedBehaviorSanitizer. Each iteration stores 16 bytes and stops once
op >= oend, overrunning the logical end by at most 15 bytes — well within the existingWILDCOPY_OVERLENGTH(32) slack, asserted withZSTD_STATIC_ASSERT.