Speed up default trimLeft when no row needs trimming by groeneai · Pull Request #108177 · ClickHouse/ClickHouse · GitHub
Skip to content

Speed up default trimLeft when no row needs trimming#108177

Merged
Algunenano merged 3 commits into
ClickHouse:masterfrom
groeneai:fix-trim-left-perf-106601
Jun 26, 2026
Merged

Speed up default trimLeft when no row needs trimming#108177
Algunenano merged 3 commits into
ClickHouse:masterfrom
groeneai:fix-trim-left-perf-106601

Conversation

@groeneai

@groeneai groeneai commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Closes: #106601

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Speed up the default-whitespace trimLeft/trimRight/trimBoth functions (and their aliases ltrim/rtrim/trim): consecutive rows that need no trimming are copied in a single batch, and the per-row space scan runs only for rows that actually have a leading or trailing space.

Description

Motivation (#106601): trim_numbers (SELECT count() FROM numbers(10000000) WHERE NOT ignore(ltrim(toString(number)))) regressed about 12-14% on ARM/aarch64 after #104228 (the NEON rewrite of find_first_not_symbols). The default trim path calls find_first_not_symbols<' '> once per row, so the regressed primitive sits on the hot path even when the strings (decimal numbers here) have no leading whitespace and nothing is trimmed.

Change (default-whitespace path, vectorSpace): rows that need no trimming are batched. A run of consecutive untrimmed rows is copied with a single memcpy; the run breaks at the first row that has a leading or trailing space, which is trimmed with the existing SIMD space scan, then a new run starts. So the per-row scan is paid only for rows that actually need trimming, and a column with no trimmable rows is copied in one shot. Output capacity is reserved once and the final size is set at the end, with no per-row reallocation. The custom-trim-character and FixedString paths are kept as separate methods; their behaviour is unchanged.

This does not return the original column; it always builds and returns the result column. An earlier revision returned the input column directly in the no-op case, but that was narrower and was replaced with the batching approach above on reviewer request.

Performance (this commit): both AMD and ARM Performance Comparison shards pass. On ARM, trim_numbers is about 1.4x faster (ltrim/trim/rtrim) and trim_whitespace is back to master parity (the earlier scalar-only revision had regressed it).

Correctness: a column with no trimmable rows returns byte-identical output, and mixed columns trim correctly. Existing trim tests plus the new 04401_trim_left_unchanged_fast_path pass; validated under ASan+UBSan, including boundary sweeps around the 16-byte memcpy boundary, all-spaces, empty strings, and FixedString.

Version info

  • Merged into: 26.7.1.107

@groeneai

Copy link
Copy Markdown
Contributor Author

@groeneai

Copy link
Copy Markdown
Contributor Author

cc @rschu1ze — could you review this? It adds a fast path to default trimLeft/ltrim: when no row in the column starts with a space (checked by a cheap per-row first-byte scan), the input column is returned directly instead of running the per-row symbol scan and copying the whole column. This recovers the ~12-14% ARM regression in trim_numbers query 1 reported in #106601 (only the default-whitespace left-trim path is touched; trimRight/trimBoth/custom-chars/FixedString are unchanged).

@Algunenano Algunenano added the can be tested Allows running workflows for external contributors label Jun 22, 2026
@Algunenano Algunenano self-assigned this Jun 22, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [c8966fe]

Summary:

job_name test_name status info comment
Integration tests (amd_msan, 5/6) FAIL
test_keeper_snapshot_chunked_transfer/test.py::test_recover_after_s3_read_error_during_transfer FAIL cidb IGNORED
Stress test (arm_release) FAIL
Hung check failed, possible deadlock found FAIL cidb, issue ISSUE EXISTS
Stress test (arm_msan) FAIL
Hung check failed, possible deadlock found FAIL cidb, issue ISSUE EXISTS

AI Review

Summary

This PR rewrites the default-space trimLeft/trimRight/trimBoth string path to batch rows that need no trimming, then uses the existing SIMD space scan only for rows that actually start or end with a space. The current code shape looks consistent with the latest maintainer feedback, but I would not approve it yet because the PR metadata still describes an older direct-return implementation and the latest commit's performance CI is still pending.

PR Metadata
  • Changelog category: Performance Improvement is appropriate.

  • Changelog entry: required and present, but currently inaccurate. It says the input column is "returned directly"; the current implementation always returns col_res and copies unchanged rows in a batch. Suggested replacement:

    Speed up default-whitespace trim functions, including trimLeft/ltrim, by batching unchanged rows into one copy and only running the space scan for rows with leading or trailing spaces.

  • The description should also be updated: the current implementation is no longer limited to only default-whitespace trimLeft/ltrim, and it does not return the original ColumnString. This is the same unresolved metadata issue as the existing bot thread: Speed up default trimLeft when no row needs trimming #108177 (comment).

  • The Closes: https://github.com/ClickHouse/ClickHouse/issues/106601 line is still inside the template HTML comment, so it will not be rendered as an active PR relationship. Move it outside the comment block if this PR should close that issue.

Missing context / blind spots
  • ⚠️ Latest Praktika for c8966fe52bc3ca169cee306629a05478aaa06057 is still running: builds, stateless tests, and both AMD/ARM Performance Comparison shards were pending when reviewed. Completed CI would close this gap.
  • ⚠️ The PR is a Performance Improvement, but machine-readable ARM performance results for the latest commit are not available yet. The key evidence is trim_numbers improvement without reintroducing the trim_whitespace regression.
Tests
  • ⚠️ Wait for the latest ARM Performance Comparison, or provide equivalent controlled Neoverse-V2 numbers, showing the current commit keeps the intended trim_numbers speedup and restores trim_whitespace to master parity.
Final Verdict

Status: ⚠️ Request changes

Minimum required actions: update the PR body/changelog to match the final implementation and wait for current CI/performance evidence on c8966fe52bc3ca169cee306629a05478aaa06057.

@clickhouse-gh clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label Jun 22, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jun 22, 2026

Copy link
Copy Markdown
Contributor

📊 Cloud Performance Report

✅ AI verdict: no_change — no significant changes across 38 queries analysed

no significant changes detected. K_source=6 K_base=30 flagged=0/65

clickbench

🟢 No significant changes

tpch_adapted_1_official

🟢 No significant changes

Debug info
  • StressHouse run: 4f36e7fb-85c3-4585-9172-f108448d397b
  • MIRAI run: c37ee672-668f-4443-b2b1-fde5334eae4a
  • PR check IDs:
    • clickbench_83083_1782419922
    • clickbench_83089_1782419922
    • clickbench_83099_1782419922
    • tpch_adapted_1_official_83109_1782419922
    • tpch_adapted_1_official_83123_1782419922
    • tpch_adapted_1_official_83131_1782419922

@groeneai groeneai force-pushed the fix-trim-left-perf-106601 branch from 35d2de3 to c4499f5 Compare June 22, 2026 20:44
@groeneai

Copy link
Copy Markdown
Contributor Author

Prior CI failure (arm_asan_ubsan, azure, sequential) is a pre-existing, unrelated crash

The single blocking failure on the previous commit (35d2de3) was a SEGV (STID 5743-5b74):

__asan_memset
ExpressionActions.h:48: Action
ExpressionActions.cpp:439: ExpressionActions::linearizeActions(...)
ExpressionActions.cpp:67:  ExpressionActions::ExpressionActions(ActionsDAG, ...)
InsertDependenciesBuilder.cpp:703: ExecutingInnerQueryFromViewTransform::process(...)

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108177&sha=35d2de313fdcc4aac0626d3d7a43e282f70fbfaf&name_0=PR&name_1=Stateless%20tests%20%28arm_asan_ubsan%2C%20azure%2C%20sequential%29

This is a heap-corruption victim, not the corruptor: ASan aborts in __asan_memset while linearizeActions does a push_back of a DAG node during the materialized-view insert pipeline build (InsertDependenciesBuilder.cpp:703). That code path manipulates ActionsDAG nodes only; it never touches the string column produced by trimLeft.

This PR changes only trim.cpp. The added fast path is read-only (it scans first bytes guarded by offsets[i] > prev_offset, performs no writes, and on the no-op case returns the input column unchanged), so it cannot corrupt the heap reached by the MV pipeline.

Evidence it is pre-existing and unrelated:

  • The exact same crash signature (ExecutingInnerQueryFromViewTransform + SEGV in linearizeActions) hit 4 distinct unrelated PRs in the last 30 days, none of which touch trim.cpp (their topics: an ALTER/RENAME race fix, a SipHash pointer fix, AST-fuzzer oracles, and this PR). 0 hits on master in 60 days. The signature is rare and non-deterministic.
  • On 2026-06-22 18:00 UTC this check had only these 2 rows fleet-wide (the SEGV and its Server died), so it was not an azure infra outage.

Local reproduction (clang-21 ASAN+UBSAN, -DSANITIZE=address,undefined) of the trim fast path is clean:

  • Boundary cases (empty column, all-empty strings, single char, no-trim-needed, mixed first/late, empty-then-space, all-whitespace) and a 10M-row no-trim fast-path scan: no ASan reports.
  • 20 concurrent trimLeft queries over mixed column shapes (200k rows each): clean, server survived.
  • A materialized-view insert workload with trimLeft inside the MV query (50 inserts x 5000 rows, exercising the InsertDependenciesBuilder ExpressionActions path): clean, server survived.

Rebased onto current master and force-pushed to retrigger CI on a fresh base.

@groeneai

Copy link
Copy Markdown
Contributor Author

CI finished on c4499f5 (rebased base). The only red is Stress test (arm_asan_ubsan, s3), which is the chronic RefreshTask shutdown race, not caused by this PR (which only adds a read-only first-byte fast path in trim.cpp):

  • Stress test (arm_asan_ubsan, s3) / Unexpected exception in refresh scheduling, STID 2508-34af (plus its "Cannot start clickhouse-server" / "Check failed" collateral from the same crashed run): chronic Context weak_ptr / RefreshTask shutdown race, 84 distinct PRs in 30d (last seen on other PRs minutes ago). Unrelated to trimLeft.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108177&sha=c4499f5c720bf2f1188a3d506a86d86b09d8b97f&name_0=PR&name_1=Stress%20test%20%28arm_asan_ubsan%2C%20s3%29

The earlier one-off SEGV (STID 5743-5b74) was on the pre-rebase commit 35d2de3; it did not recur on this base, and local ASan+UBSan reproduction of the trim fast path (empty/single/all-whitespace/no-trim/mixed + 10M-row scan + concurrent + MV-insert workloads) was fully clean.

@groeneai

Copy link
Copy Markdown
Contributor Author

CI finished on HEAD c4499f5. All checks green except Stress test (arm_asan_ubsan, s3), which is a known chronic trunk issue unrelated to this string-function change:

  • Logical error: 'Unexpected exception in refresh scheduling' (STID 2508-34af) + the consequent Cannot start clickhouse-server / Check failed cascade. This is the RefreshTask shutdown scheduling race (85 distinct PRs / 100 hits / 5 master in 30d), tracked separately; it is in the materialized-view refresh scheduler, not reachable from trimLeft/TrimImpl.

This PR only adds a fast path to default trimLeft (src/Functions/trim.cpp); no trim/string functional test failed. Ready for review.

@Algunenano Algunenano left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I find this extremely narrow and covering a single case.

Here is my proposal:

  • Separate the templates, they are worthless and make everything confusing. What's the point of having a template if you do `if constexpr bool else
  • The simple case of just space and custom_trim_characters should be separated as well.

Now, the simpler case: left and just space

  • The loop inside vectorImpl should not call resize. It was reserved previously. You can set the size at the end.
  • Instead of calling executeImpl (another unnecessary template) we should to a while look checking if the first character is a space. If it is then we If it's not then we keep going. When we are done we copy all skipped values.

Right + space is the same.

Left / Right with custom characters can leave the slow path for now. But in this case we've moved the if away from the very beginning, so they might also become faster

The trim_numbers performance test query 1 (ltrim(toString(number)))
regressed ~12-14% on ARM/aarch64 after the NEON rewrite of
find_first_not_symbols (PR ClickHouse#104228). Decimal strings have no leading
whitespace, so default trimLeft still paid the per-row SIMD scan setup
for a no-op.

Restructure FunctionTrim into three separate non-template paths
(vectorSpace, vectorCustom, vectorFixed) instead of the
template-with-if-constexpr machinery plus executeImpl. The default
(space) path now advances begin/end with a scalar while loop comparing
single bytes, which beats the SIMD scan for short strings and exits
after one comparison when there is no leading space. Output capacity is
reserved once and the size is set after the loop, removing the per-row
resize. trim_left/trim_right are cheap runtime ifs in the loop body.

Custom trim character sets keep O(1) table lookups; FixedString is
handled by a single non-template function. Results are unchanged for all
inputs (verified against the regex baseline and under ASan/UBSan).

Closes: ClickHouse#106601

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai groeneai force-pushed the fix-trim-left-perf-106601 branch from c4499f5 to 3257ed3 Compare June 24, 2026 14:40
@groeneai

Copy link
Copy Markdown
Contributor Author

Reworked per your proposal (3257ed3a3fc):

  • Templates dropped. vectorImpl<L,R>, vectorFixedImpl<L,R>, executeImpl<L,R> and dispatch are gone. Three plain non-template methods now: vectorSpace, vectorCustom, vectorFixed. trim_left/trim_right are cheap runtime ifs in the loop body.
  • Space and custom separated. Default whitespace trim is vectorSpace; the second-argument character set is vectorCustom.
  • No resize in the loop. Capacity is reserved once (reserve_exact(input_data.size())); the loop writes into the reserved buffer and resize_exact(res_offset) sets the final size at the end.
  • No find_first/last_not_symbols. The space path is a scalar while (begin < end && *begin == ' ') ++begin; / while (end > begin && end[-1] == ' ') --end;, then one memcpy of the kept bytes. This is what sidesteps the regressed NEON scan for the no-leading-space hot path. Right + space is symmetric.
  • Custom left/right stay on the table-lookup path as you suggested, but now run after the early branch.

I dropped the original "return the input column unchanged" shortcut you flagged as too narrow: the scalar while loop already makes the common no-leading-space case fast (one byte compare per row) without a special case.

Validated: existing trim tests (00997_trim, 03224, 03315, 04327, 02711, 03722, 04236, 03890) plus 04401 all pass; results cross-checked against the replaceRegexpAll baseline over 5M randomized rows; ASan+UBSan clean on the full set and a 5M-row stress. ARM perf will be confirmed by the Performance Comparison job.

@groeneai

Copy link
Copy Markdown
Contributor Author

Pre-PR validation gate

# Question Answer
a Deterministic repro? Not a bug fix; a perf rework requested in review. The behavior is deterministic: every input goes through the new scalar trim loops on demand.
b Root cause explained? The ARM regression (issue #106601) is the per-row SIMD find_first_not_symbols setup cost on strings that have no leading whitespace. The rework replaces that scan with a scalar while loop that exits after one byte comparison when the first byte is not a space, sidestepping the regressed NEON path.
c Fix matches root cause? Yes. The hot path no longer calls the regressed symbol scan at all; it does a single-byte compare per row plus one memcpy of the kept bytes.
d Test intent preserved / new tests added? 04401_trim_left_unchanged_fast_path covers the no-leading-space case, the mixed-column fallback, boundary inputs, FixedString, and trimRight/trimBoth/custom. All 8 existing trim tests still pass unchanged.
e Both directions demonstrated? Correctness cross-checked against the replaceRegexpAll regex baseline over 5M randomized rows (leading/trailing space counts 0..40, payloads near the 16-byte memcpy boundary): trimLeft/trimRight/trimBoth all match exactly.
f Fix is general across code paths? All three input paths reworked consistently: vectorSpace (default), vectorCustom (character set), vectorFixed (FixedString, both space and custom). No path left on the old template.
g Fix generalizes across inputs? Verified: empty column, empty string, single char, all-spaces (result length 0), leading-only, trailing-only, embedded spaces preserved, custom set longer than 16, FixedString NUL padding kept (default trim removes only spaces), and 15/16/17-byte boundary strings.
h Backward compatible? Yes. No setting, format, or default change; output is byte-for-byte identical to before for every input.
i Invariants and contracts preserved? res_offset <= input_data.size() = reserved capacity, so the final resize_exact never reallocates/moves and the write-into-reserved-storage is in bounds. memcpySmallAllowReadWriteOverflow15's 15-byte slop is covered by the column's pad_right (>= 32). ColumnString offset invariant (offsets[i] monotonic, chars.size() == offsets.back()) holds. Confirmed clean under ASan+UBSan on the full test set and a 5M-row stress.

Session id: cron:clickhouse-worker-slot-2:20260624-132300

Comment thread src/Functions/trim.cpp Outdated
--end;

const size_t length = end - begin;
memcpySmallAllowReadWriteOverflow15(res_begin + res_offset, begin, length);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The PR body/changelog say the default trimLeft no-op case returns the input column directly, but this path still copies every row into col_res. For the motivating ltrim(toString(number)) query all rows are unchanged, so the advertised no-copy fast path and the prototype measurement do not apply to this commit; this only removes the SIMD symbol scan. Please either add the all-unchanged trimLeft fast path before vectorSpace and return the original ColumnString, or update the description/changelog and provide perf results for this broader rewrite.

Comment thread src/Functions/trim.cpp
Comment thread src/Functions/trim.cpp Outdated
const UInt8 * end = input_data.data() + input_offsets[i];

if (do_trim_left)
while (begin < end && *begin == ' ')

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please don't do if without brackets if there is more than one line. if + while + statement is not ok

Address review: in the default-space path, accumulate consecutive rows that
need no trimming and copy the whole run in one memcpy, breaking the run only
at a row that actually needs trimming. Output capacity is reserved once and
sized at the end (no per-row resize). Add braces to multi-line if bodies in
the custom and FixedString paths.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

@Algunenano reworked per the two comments.

Default-space path (vectorSpace): the loop now batches. It accumulates a run of consecutive rows that need no trimming and copies the whole run in a single memcpy; the run is broken only when a row actually needs a leading/trailing space removed, which is trimmed and copied on its own. When no row in the column needs trimming the entire output is one memcpy. Capacity is reserved once and the size is set at the end (no per-row resize).

Braces: added to every multi-line if body (the if + while + statement cases) in the space, custom, and FixedString paths.

Custom characters and FixedString keep the per-row path as you suggested.

Validation (debug + ASan/UBSan): trimLeft/Right/Both match replaceRegexpOne over 5M mixed rows and a 0..40 length sweep across the 16-byte memcpy boundary; all-clean, all-trimmed, alternating, trimmed-head/clean-tail and clean-head/trimmed-tail runs all clean; 00997_trim (100k random combos) and 04401 pass. ARM perf via the Performance Comparison job.

@groeneai

Copy link
Copy Markdown
Contributor Author

Pre-PR validation gate (round-2 rework)

# Question Answer
a Deterministic repro? Behavior-preserving refactor (no bug to repro). Equivalence is deterministic: trimLeft/Right/Both vs replaceRegexpOne over 5M mixed rows and a 0..40 length sweep across the 16-byte memcpy boundary = 0 mismatches.
b Root cause explained? Review asked for memcpy batching by run, separate non-template methods, no per-row resize, braces on multi-line if.
c Fix matches root cause? vectorSpace accumulates a run of untrimmed rows and copies it in one memcpy, breaking only at a row needing a trim. Templates already dropped (3 plain methods). Capacity reserved once, sized at end. Braces added to all multi-line if bodies.
d Test intent preserved / new tests added? 04401 + 00997_trim unchanged and passing; no assertions weakened.
e Both directions demonstrated? Output byte-identical to the regex baseline across all shapes; debug + ASan/UBSan both clean.
f General across code paths? All three paths (space, custom, FixedString) covered; custom/FixedString keep the per-row path as requested.
g Generalizes across inputs? Empty, single char, all-spaces (len to 0), leading/trailing/embedded, 15/16/17-byte boundaries, all-clean run, all-trimmed, alternating, run-then-break + final flush all verified under ASan.
h Backward compatible? N/A (internal implementation, identical results).
i Invariants and contracts preserved? res_offset <= input_data.size() (reserved capacity) so the final resize_exact never reallocates; batched region [batch_input_begin, prev_offset) is contiguous in input and reproduced exactly; the 15-byte memcpy slop is covered by ColumnString chars pad_right (63).

Session id: cron:clickhouse-worker-slot-2:20260624-152800

@Algunenano

Copy link
Copy Markdown
Member
image image image image image

trim_numbers is improved. trim_whitespaces was degraded

Comment thread src/Functions/trim.cpp
Comment thread src/Functions/trim.cpp Outdated
if (!trim_this_left && !trim_this_right)
{
/// Nothing to trim: extend the current run, defer the copy.
if (!batch_open)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe instead of opening a batch there should be a while loop here and keep going, then once we reach a row that needs editing we process the batch and the row itself. It should also reduce the number of registers needed for iterating (i is the beginning of the loop, j is the current row)

Addresses the third round of review feedback on the default-whitespace trim
path:

- Restore the compile-time `<do_trim_left, do_trim_right>` template parameters
  via `dispatch`, removing the per-row runtime trim-direction branches that the
  previous revision introduced. Those branches regressed trim_whitespace
  (every row has long leading/trailing space runs) by ~10-15% on ARM.

- Restore the SIMD space scan (`find_first_not_symbols<' '>` /
  `find_last_not_symbols_or_null<' '>`) for rows that actually need trimming.
  The previous scalar byte-by-byte loop was the real cause of the
  trim_whitespace slowdown: those rows have ~17 spaces per side, where the
  SIMD scan (16 bytes per instruction) is far faster than a scalar loop.

- Reshape the run batching into a tight i/j while loop (i = run start,
  j = current row) instead of explicit open/flush bookkeeping, reducing the
  number of live loop variables.

The fast path that motivated the PR is preserved: a cheap first/last-byte
check decides whether a row needs trimming, so a column with no leading or
trailing spaces (e.g. ltrim(toString(number))) is copied in a single memcpy
and never pays for the per-row trim work. trim_numbers stays ~1.4x faster on
ARM; trim_urls stays faster; trim_whitespace returns to master parity.

Output capacity is reserved once and the result column is sized at the end, so
there is no per-row reallocation. The result pointer is cached after the
reserve and never invalidated because the output never exceeds the reserved
input size, which addresses the heap-corruption origin of this PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

Reworked the default-whitespace path (commit c8966fe) along the three points raised:

1. Template parameters restored. dispatch resolves trim_left/trim_right back to compile-time <do_trim_left, do_trim_right> template parameters, so the inner loop has no per-row trim-direction branch. The previous revision's runtime branches were what regressed trim_whitespace.

2. SIMD scan restored for rows that need trimming. The rows in trim_whitespace carry ~17 spaces per side, so the scalar byte loop was the real slowdown. Rows that need trimming now use find_first_not_symbols<' '> / find_last_not_symbols_or_null<' '> again (16 bytes per instruction).

3. i/j while loop. The run is tracked with i = run start and j = current row, no separate open/flush bookkeeping. A cheap first/last-byte check decides run membership; when j hits a row that needs trimming, the run [i, j) is flushed in one memcpy, that row is SIMD-trimmed, and i resets. Output capacity is reserved once and sized at the end (no per-row reallocation).

The PR's fast path is preserved: a column with no leading/trailing spaces is copied in a single memcpy and never pays per-row trim work.

Local perf direction matches the prior CI run's improvements and the regression is gone (master-parity on trim_whitespace, ~1.4x faster on ARM trim_numbers). The CI Performance Comparison job on the new commit will confirm.

Validation: all 9 stateless trim tests pass (00997 = 100k random space combos, 04327 long custom char sets, 04401, 03890 FixedString). ASan+UBSan with abort_on_error=1: 5M-row cross-check of all three directions vs replaceRegexp (0 mismatches), boundary sweep around the 16-byte memcpy boundary, all-clean / all-trimmed / alternating runs, and 24 concurrent trim workloads, all clean.

@groeneai

Copy link
Copy Markdown
Contributor Author
Pre-PR validation gate (click to expand)
# Question Answer
a Deterministic repro? Yes. The perf regression is reproducible: the prior commit (171198e) scalar byte-loop slows trim_whitespace trimLeft/trimRight/trimBoth by ~1.07-1.15x on the CI Performance Comparison job (rows have ~17 spaces/side). The original P0 SEGV (STID 5743-5b74) was already concluded pre-existing/unrelated (MV-insert heap crash), not the trim path.
b Root cause explained? Yes. Round-2 dropped the compile-time <do_trim_left,do_trim_right> template params (adding per-row runtime branches) AND replaced the SIMD space scan with a scalar byte loop. For space-heavy rows the scalar skip is far slower than find_first_not_symbols<' '> (16 bytes/instruction), so trim_whitespace regressed while trim_numbers (no spaces) sped up.
c Fix matches root cause? Yes. Restores the template params via dispatch (removes runtime branches) and the SIMD scan for rows that need trimming (recovers space-heavy throughput), while keeping the cheap boundary-byte check + single-memcpy run batching that makes the no-space hot path fast.
d Test intent preserved / new tests added? Yes. No test weakened. The existing 04401 regression test (no-op hot path, mixed fallback, boundaries, FixedString, custom) still passes unchanged; all 9 stateless trim tests pass.
e Both directions demonstrated? Yes. Correctness cross-checked vs replaceRegexp baseline over 5M mixed rows (0 mismatches) for all 3 directions; the prior revision and this one both produce identical results, the difference is purely performance, which the CI Performance Comparison job re-validates.
f Fix is general across code paths? Yes. Space (vectorSpace), custom-character (vectorCustom) and FixedString (vectorFixed) are separate, all template-dispatched; no sibling path left on the slow scalar variant where SIMD applies.
g Fix generalizes across inputs (params/datatypes/wrappers)? Yes. Validated across empty column, empty string, single char, all-spaces (len -> 0), lead-only, trail-only, embedded-space-preserved, custom sets >16 chars, FixedString NUL padding, and 15/16/17-byte lengths around the memcpy boundary.
h Backward compatible? N/A. Pure performance refactor of an internal function; no settings, formats, or user-visible behavior changed.
i Invariants and contracts preserved? Yes. Output capacity reserved once via reserve_exact(input_data.size()); res_offset never exceeds reserved size (trimming only removes bytes) so the cached result pointer is never invalidated; offsets written consistently. ASan+UBSan (abort_on_error=1): 5M-row sweep, boundary sweep, and 24 concurrent trim workloads all clean, addressing the heap-corruption origin.

Session id: cron:clickhouse-worker-slot-1:20260625-161500

@clickhouse-gh

clickhouse-gh Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.40% +0.00%
Functions 92.60% 92.60% +0.00%
Branches 77.60% 77.60% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 112/114 (98.25%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 3 line(s) · Uncovered code

Full report · Diff report

@groeneai

Copy link
Copy Markdown
Contributor Author

CI finish ledger — c8966fe

Every failure below has an owner: a fixing PR (ours or external), or a full-effort fix task whose fixing-PR link will be posted here when it opens. The PR's own diff (only src/Functions/trim.cpp + the 04401 test) does not touch Keeper, S3, or threading, so none of these are PR-caused.

Check / test Reason Owner / fixing PR
Stress test (arm_msan) / Hung check, possible deadlock deadlock #101680 / #105905 (ours, open)
Stress test (arm_release) / Hung check, possible deadlock deadlock #101680 / #105905 (ours, open)
Integration tests (amd_msan, 5/6) / test_keeper_snapshot_chunked_transfer::test_recover_after_s3_read_error_during_transfer flaky (24 hits / ~18 unrelated PRs / 1 master in 30d) a fix task is created (investigating at full effort — fixing-PR link to follow here)

Session id: cron:our-pr-ci-monitor:20260625-220000

@groeneai

Copy link
Copy Markdown
Contributor Author

Fixing PR for the test_keeper_snapshot_chunked_transfer/test.py::test_recover_* flaky (the Integration tests (amd_msan, 5/6) failure on this PR's commit c8966fe) created: #108557. Root cause: the recover subtests restart a SIGKILL'd follower whose snapshot recovery legitimately takes ~95 s under msan, but the test waited only start_clickhouse(20) (20 s). Not caused by this PR.

@Algunenano Algunenano left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@groeneai Can you please update the PR description (see #108177 (comment) for the details about its issues) and have a look at current CI failures, to see if they are reported or not, and deal with them if not in flight (but use a different PR please)

@groeneai

Copy link
Copy Markdown
Contributor Author

@Algunenano Algunenano enabled auto-merge June 26, 2026 14:15
@Algunenano Algunenano added this pull request to the merge queue Jun 26, 2026
Merged via the queue into ClickHouse:master with commit 0599068 Jun 26, 2026
163 of 167 checks passed
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 26, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ARM performance regression: trim_numbers query 1 is slower

3 participants