iframe-proxy

groeneai · 2026-06-22T16:55:22Z

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Speed up the default-whitespace trimLeft/trimRight/trimBoth functions (and their aliases ltrim/rtrim/trim): consecutive rows that need no trimming are copied in a single batch, and the per-row space scan runs only for rows that actually have a leading or trailing space.

Description

Motivation (#106601): trim_numbers (SELECT count() FROM numbers(10000000) WHERE NOT ignore(ltrim(toString(number)))) regressed about 12-14% on ARM/aarch64 after #104228 (the NEON rewrite of find_first_not_symbols). The default trim path calls find_first_not_symbols<' '> once per row, so the regressed primitive sits on the hot path even when the strings (decimal numbers here) have no leading whitespace and nothing is trimmed.

Change (default-whitespace path, vectorSpace): rows that need no trimming are batched. A run of consecutive untrimmed rows is copied with a single memcpy; the run breaks at the first row that has a leading or trailing space, which is trimmed with the existing SIMD space scan, then a new run starts. So the per-row scan is paid only for rows that actually need trimming, and a column with no trimmable rows is copied in one shot. Output capacity is reserved once and the final size is set at the end, with no per-row reallocation. The custom-trim-character and FixedString paths are kept as separate methods; their behaviour is unchanged.

This does not return the original column; it always builds and returns the result column. An earlier revision returned the input column directly in the no-op case, but that was narrower and was replaced with the batching approach above on reviewer request.

Performance (this commit): both AMD and ARM Performance Comparison shards pass. On ARM, trim_numbers is about 1.4x faster (ltrim/trim/rtrim) and trim_whitespace is back to master parity (the earlier scalar-only revision had regressed it).

Correctness: a column with no trimmable rows returns byte-identical output, and mixed columns trim correctly. Existing trim tests plus the new 04401_trim_left_unchanged_fast_path pass; validated under ASan+UBSan, including boundary sweeps around the 16-byte memcpy boundary, all-spaces, empty strings, and FixedString.

Version info

Merged into: 26.7.1.107

groeneai · 2026-06-22T16:55:48Z

Pre-PR validation gate

#	Question	Answer
a	Deterministic repro?	Yes. `ltrim(toString(number))` over `numbers(10000000)` is the perf query; decimal strings never start with a space, so default `trimLeft` is a deterministic no-op that still ran the regressed per-row scan and copied the whole column. Reporter reproduced the ARM regression on Neoverse-V2 (12/12 runs, +14.04%).
b	Root cause explained?	Affected window baseline `9821949fd802` -> `c1ccce21e7d3` = PR #104228 (NEON rewrite of `find_first_not_symbols`). `trimLeft` calls `find_first_not_symbols<' '>` per row (trim.cpp:218), so the regressed primitive is on the hot path. Query 1 trims nothing yet pays for the per-row scan AND a full column copy into a fresh `ColumnString`.
c	Fix matches root cause?	Yes. Default left-trim removes only leading spaces, so a row is unchanged iff its first byte is not a space. A cheap first-byte-per-row check detects the all-unchanged case and returns the input column directly, skipping both the regressed scan and the copy. It targets the identified mechanism, not a symptom.
d	Test intent preserved / new tests added?	New regression test `04401_trim_left_unchanged_fast_path` pins the fast path (no-op returns identical column), the mixed-column fallback (leading spaces still removed), boundary inputs (empty/single-space/all-space/embedded/trailing), FixedString, and the untouched trimRight/trimBoth/custom-character paths. All existing trim tests still pass.
e	Both directions demonstrated?	Correctness validated locally (x86): no-op returns a byte-identical column (`ltrim(toString(number)) != toString(number)` -> 0 rows), mixed columns trim correctly, all trim tests pass. Speed: prototype build-validated by the reporter on Neoverse-V2 (about -47.7% on the affected build). x86 microbenchmarks are not representative of ARM, so the ARM speedup is left to the perf-CI Performance Comparison job on this PR.
f	Fix is general, not a narrow patch?	The change is scoped to the default-whitespace `trimLeft` path only, which is exactly where the regression lives. `trimRight`/`trimBoth` use `find_last_not_symbols_or_null` and would need a separate (different) check; custom-character trims use a lookup table; FixedString uses a separate path. None are touched, so no sibling path is left with the same issue and no behavior is changed elsewhere. Not a defensive guard.

Session id: cron:clickhouse-worker-slot-8:20260622-163200

groeneai · 2026-06-22T16:56:19Z

cc @rschu1ze — could you review this? It adds a fast path to default trimLeft/ltrim: when no row in the column starts with a space (checked by a cheap per-row first-byte scan), the input column is returned directly instead of running the per-row symbol scan and copying the whole column. This recovers the ~12-14% ARM regression in trim_numbers query 1 reported in #106601 (only the default-whitespace left-trim path is touched; trimRight/trimBoth/custom-chars/FixedString are unchanged).

clickhouse-gh · 2026-06-22T17:31:30Z

Workflow [PR], commit [c8966fe]

Summary: ❌

job_name	test_name	status	info	comment
Integration tests (amd_msan, 5/6)		FAIL
	test_keeper_snapshot_chunked_transfer/test.py::test_recover_after_s3_read_error_during_transfer	FAIL	cidb	IGNORED
Stress test (arm_release)		FAIL
	Hung check failed, possible deadlock found	FAIL	cidb, issue	ISSUE EXISTS
Stress test (arm_msan)		FAIL
	Hung check failed, possible deadlock found	FAIL	cidb, issue	ISSUE EXISTS

AI Review

Summary

This PR rewrites the default-space trimLeft/trimRight/trimBoth string path to batch rows that need no trimming, then uses the existing SIMD space scan only for rows that actually start or end with a space. The current code shape looks consistent with the latest maintainer feedback, but I would not approve it yet because the PR metadata still describes an older direct-return implementation and the latest commit's performance CI is still pending.

PR Metadata

Changelog category: Performance Improvement is appropriate.
Changelog entry: required and present, but currently inaccurate. It says the input column is "returned directly"; the current implementation always returns col_res and copies unchanged rows in a batch. Suggested replacement:

Speed up default-whitespace trim functions, including trimLeft/ltrim, by batching unchanged rows into one copy and only running the space scan for rows with leading or trailing spaces.
The description should also be updated: the current implementation is no longer limited to only default-whitespace trimLeft/ltrim, and it does not return the original ColumnString. This is the same unresolved metadata issue as the existing bot thread: Speed up default trimLeft when no row needs trimming #108177 (comment).
The Closes: https://github.com/ClickHouse/ClickHouse/issues/106601 line is still inside the template HTML comment, so it will not be rendered as an active PR relationship. Move it outside the comment block if this PR should close that issue.

Missing context / blind spots

⚠️ Latest Praktika for c8966fe52bc3ca169cee306629a05478aaa06057 is still running: builds, stateless tests, and both AMD/ARM Performance Comparison shards were pending when reviewed. Completed CI would close this gap.
⚠️ The PR is a Performance Improvement, but machine-readable ARM performance results for the latest commit are not available yet. The key evidence is trim_numbers improvement without reintroducing the trim_whitespace regression.

Tests

⚠️ Wait for the latest ARM Performance Comparison, or provide equivalent controlled Neoverse-V2 numbers, showing the current commit keeps the intended trim_numbers speedup and restores trim_whitespace to master parity.

Final Verdict

Status: ⚠️ Request changes

Minimum required actions: update the PR body/changelog to match the final implementation and wait for current CI/performance evidence on c8966fe52bc3ca169cee306629a05478aaa06057.

clickhouse-gh · 2026-06-22T20:41:38Z

📊 Cloud Performance Report

✅ AI verdict: no_change — no significant changes across 38 queries analysed

no significant changes detected. K_source=6 K_base=30 flagged=0/65

clickbench

🟢 No significant changes

tpch_adapted_1_official

🟢 No significant changes

Debug info

StressHouse run: 4f36e7fb-85c3-4585-9172-f108448d397b
MIRAI run: c37ee672-668f-4443-b2b1-fde5334eae4a
PR check IDs:
- clickbench_83083_1782419922
- clickbench_83089_1782419922
- clickbench_83099_1782419922
- tpch_adapted_1_official_83109_1782419922
- tpch_adapted_1_official_83123_1782419922
- tpch_adapted_1_official_83131_1782419922

groeneai · 2026-06-22T20:44:53Z

Prior CI failure (`arm_asan_ubsan, azure, sequential`) is a pre-existing, unrelated crash

The single blocking failure on the previous commit (35d2de3) was a SEGV (STID 5743-5b74):

__asan_memset
ExpressionActions.h:48: Action
ExpressionActions.cpp:439: ExpressionActions::linearizeActions(...)
ExpressionActions.cpp:67:  ExpressionActions::ExpressionActions(ActionsDAG, ...)
InsertDependenciesBuilder.cpp:703: ExecutingInnerQueryFromViewTransform::process(...)

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108177&sha=35d2de313fdcc4aac0626d3d7a43e282f70fbfaf&name_0=PR&name_1=Stateless%20tests%20%28arm_asan_ubsan%2C%20azure%2C%20sequential%29

This is a heap-corruption victim, not the corruptor: ASan aborts in __asan_memset while linearizeActions does a push_back of a DAG node during the materialized-view insert pipeline build (InsertDependenciesBuilder.cpp:703). That code path manipulates ActionsDAG nodes only; it never touches the string column produced by trimLeft.

This PR changes only trim.cpp. The added fast path is read-only (it scans first bytes guarded by offsets[i] > prev_offset, performs no writes, and on the no-op case returns the input column unchanged), so it cannot corrupt the heap reached by the MV pipeline.

Evidence it is pre-existing and unrelated:

The exact same crash signature (ExecutingInnerQueryFromViewTransform + SEGV in linearizeActions) hit 4 distinct unrelated PRs in the last 30 days, none of which touch trim.cpp (their topics: an ALTER/RENAME race fix, a SipHash pointer fix, AST-fuzzer oracles, and this PR). 0 hits on master in 60 days. The signature is rare and non-deterministic.
On 2026-06-22 18:00 UTC this check had only these 2 rows fleet-wide (the SEGV and its Server died), so it was not an azure infra outage.

Local reproduction (clang-21 ASAN+UBSAN, -DSANITIZE=address,undefined) of the trim fast path is clean:

Boundary cases (empty column, all-empty strings, single char, no-trim-needed, mixed first/late, empty-then-space, all-whitespace) and a 10M-row no-trim fast-path scan: no ASan reports.
20 concurrent trimLeft queries over mixed column shapes (200k rows each): clean, server survived.
A materialized-view insert workload with trimLeft inside the MV query (50 inserts x 5000 rows, exercising the InsertDependenciesBuilder ExpressionActions path): clean, server survived.

Rebased onto current master and force-pushed to retrigger CI on a fresh base.

groeneai · 2026-06-23T03:33:07Z

CI finished on c4499f5 (rebased base). The only red is Stress test (arm_asan_ubsan, s3), which is the chronic RefreshTask shutdown race, not caused by this PR (which only adds a read-only first-byte fast path in trim.cpp):

Stress test (arm_asan_ubsan, s3) / Unexpected exception in refresh scheduling, STID 2508-34af (plus its "Cannot start clickhouse-server" / "Check failed" collateral from the same crashed run): chronic Context weak_ptr / RefreshTask shutdown race, 84 distinct PRs in 30d (last seen on other PRs minutes ago). Unrelated to trimLeft.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108177&sha=c4499f5c720bf2f1188a3d506a86d86b09d8b97f&name_0=PR&name_1=Stress%20test%20%28arm_asan_ubsan%2C%20s3%29

The earlier one-off SEGV (STID 5743-5b74) was on the pre-rebase commit 35d2de3; it did not recur on this base, and local ASan+UBSan reproduction of the trim fast path (empty/single/all-whitespace/no-trim/mixed + 10M-row scan + concurrent + MV-insert workloads) was fully clean.

groeneai · 2026-06-23T06:15:14Z

CI finished on HEAD c4499f5. All checks green except Stress test (arm_asan_ubsan, s3), which is a known chronic trunk issue unrelated to this string-function change:

Logical error: 'Unexpected exception in refresh scheduling' (STID 2508-34af) + the consequent Cannot start clickhouse-server / Check failed cascade. This is the RefreshTask shutdown scheduling race (85 distinct PRs / 100 hits / 5 master in 30d), tracked separately; it is in the materialized-view refresh scheduler, not reachable from trimLeft/TrimImpl.

This PR only adds a fast path to default trimLeft (src/Functions/trim.cpp); no trim/string functional test failed. Ready for review.

Algunenano

I find this extremely narrow and covering a single case.

Here is my proposal:

Separate the templates, they are worthless and make everything confusing. What's the point of having a template if you do `if constexpr bool else
The simple case of just space and custom_trim_characters should be separated as well.

Now, the simpler case: left and just space

The loop inside vectorImpl should not call resize. It was reserved previously. You can set the size at the end.
Instead of calling executeImpl (another unnecessary template) we should to a while look checking if the first character is a space. If it is then we If it's not then we keep going. When we are done we copy all skipped values.

Right + space is the same.

Left / Right with custom characters can leave the slow path for now. But in this case we've moved the if away from the very beginning, so they might also become faster

The trim_numbers performance test query 1 (ltrim(toString(number))) regressed ~12-14% on ARM/aarch64 after the NEON rewrite of find_first_not_symbols (PR ClickHouse#104228). Decimal strings have no leading whitespace, so default trimLeft still paid the per-row SIMD scan setup for a no-op. Restructure FunctionTrim into three separate non-template paths (vectorSpace, vectorCustom, vectorFixed) instead of the template-with-if-constexpr machinery plus executeImpl. The default (space) path now advances begin/end with a scalar while loop comparing single bytes, which beats the SIMD scan for short strings and exits after one comparison when there is no leading space. Output capacity is reserved once and the size is set after the loop, removing the per-row resize. trim_left/trim_right are cheap runtime ifs in the loop body. Custom trim character sets keep O(1) table lookups; FixedString is handled by a single non-template function. Results are unchanged for all inputs (verified against the regex baseline and under ASan/UBSan). Closes: ClickHouse#106601 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-24T14:40:26Z

Reworked per your proposal (3257ed3a3fc):

Templates dropped. vectorImpl<L,R>, vectorFixedImpl<L,R>, executeImpl<L,R> and dispatch are gone. Three plain non-template methods now: vectorSpace, vectorCustom, vectorFixed. trim_left/trim_right are cheap runtime ifs in the loop body.
Space and custom separated. Default whitespace trim is vectorSpace; the second-argument character set is vectorCustom.
No resize in the loop. Capacity is reserved once (reserve_exact(input_data.size())); the loop writes into the reserved buffer and resize_exact(res_offset) sets the final size at the end.
No find_first/last_not_symbols. The space path is a scalar while (begin < end && *begin == ' ') ++begin; / while (end > begin && end[-1] == ' ') --end;, then one memcpy of the kept bytes. This is what sidesteps the regressed NEON scan for the no-leading-space hot path. Right + space is symmetric.
Custom left/right stay on the table-lookup path as you suggested, but now run after the early branch.

I dropped the original "return the input column unchanged" shortcut you flagged as too narrow: the scalar while loop already makes the common no-leading-space case fast (one byte compare per row) without a special case.

Validated: existing trim tests (00997_trim, 03224, 03315, 04327, 02711, 03722, 04236, 03890) plus 04401 all pass; results cross-checked against the replaceRegexpAll baseline over 5M randomized rows; ASan+UBSan clean on the full set and a 5M-row stress. ARM perf will be confirmed by the Performance Comparison job.

groeneai · 2026-06-24T14:40:44Z

Pre-PR validation gate

#	Question	Answer
a	Deterministic repro?	Not a bug fix; a perf rework requested in review. The behavior is deterministic: every input goes through the new scalar trim loops on demand.
b	Root cause explained?	The ARM regression (issue #106601) is the per-row SIMD `find_first_not_symbols` setup cost on strings that have no leading whitespace. The rework replaces that scan with a scalar `while` loop that exits after one byte comparison when the first byte is not a space, sidestepping the regressed NEON path.
c	Fix matches root cause?	Yes. The hot path no longer calls the regressed symbol scan at all; it does a single-byte compare per row plus one `memcpy` of the kept bytes.
d	Test intent preserved / new tests added?	`04401_trim_left_unchanged_fast_path` covers the no-leading-space case, the mixed-column fallback, boundary inputs, FixedString, and trimRight/trimBoth/custom. All 8 existing trim tests still pass unchanged.
e	Both directions demonstrated?	Correctness cross-checked against the `replaceRegexpAll` regex baseline over 5M randomized rows (leading/trailing space counts 0..40, payloads near the 16-byte memcpy boundary): trimLeft/trimRight/trimBoth all match exactly.
f	Fix is general across code paths?	All three input paths reworked consistently: `vectorSpace` (default), `vectorCustom` (character set), `vectorFixed` (FixedString, both space and custom). No path left on the old template.
g	Fix generalizes across inputs?	Verified: empty column, empty string, single char, all-spaces (result length 0), leading-only, trailing-only, embedded spaces preserved, custom set longer than 16, FixedString NUL padding kept (default trim removes only spaces), and 15/16/17-byte boundary strings.
h	Backward compatible?	Yes. No setting, format, or default change; output is byte-for-byte identical to before for every input.
i	Invariants and contracts preserved?	`res_offset <= input_data.size()` = reserved capacity, so the final `resize_exact` never reallocates/moves and the write-into-reserved-storage is in bounds. `memcpySmallAllowReadWriteOverflow15`'s 15-byte slop is covered by the column's `pad_right` (>= 32). ColumnString offset invariant (`offsets[i]` monotonic, `chars.size() == offsets.back()`) holds. Confirmed clean under ASan+UBSan on the full test set and a 5M-row stress.

Session id: cron:clickhouse-worker-slot-2:20260624-132300

clickhouse-gh · 2026-06-24T14:45:01Z

+                    --end;
+
+            const size_t length = end - begin;
+            memcpySmallAllowReadWriteOverflow15(res_begin + res_offset, begin, length);


The PR body/changelog say the default trimLeft no-op case returns the input column directly, but this path still copies every row into col_res. For the motivating ltrim(toString(number)) query all rows are unchanged, so the advertised no-copy fast path and the prototype measurement do not apply to this commit; this only removes the SIMD symbol scan. Please either add the all-unchanged trimLeft fast path before vectorSpace and return the original ColumnString, or update the description/changelog and provide perf results for this broader rewrite.

Algunenano · 2026-06-24T14:46:04Z

+            const UInt8 * end = input_data.data() + input_offsets[i];
+
+            if (do_trim_left)
+                while (begin < end && *begin == ' ')


Please don't do if without brackets if there is more than one line. if + while + statement is not ok

Address review: in the default-space path, accumulate consecutive rows that need no trimming and copy the whole run in one memcpy, breaking the run only at a row that actually needs trimming. Output capacity is reserved once and sized at the end (no per-row resize). Add braces to multi-line if bodies in the custom and FixedString paths. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-24T15:54:25Z

@Algunenano reworked per the two comments.

Default-space path (vectorSpace): the loop now batches. It accumulates a run of consecutive rows that need no trimming and copies the whole run in a single memcpy; the run is broken only when a row actually needs a leading/trailing space removed, which is trimmed and copied on its own. When no row in the column needs trimming the entire output is one memcpy. Capacity is reserved once and the size is set at the end (no per-row resize).

Braces: added to every multi-line if body (the if + while + statement cases) in the space, custom, and FixedString paths.

Custom characters and FixedString keep the per-row path as you suggested.

Validation (debug + ASan/UBSan): trimLeft/Right/Both match replaceRegexpOne over 5M mixed rows and a 0..40 length sweep across the 16-byte memcpy boundary; all-clean, all-trimmed, alternating, trimmed-head/clean-tail and clean-head/trimmed-tail runs all clean; 00997_trim (100k random combos) and 04401 pass. ARM perf via the Performance Comparison job.

groeneai · 2026-06-24T15:55:02Z

Pre-PR validation gate (round-2 rework)

#	Question	Answer
a	Deterministic repro?	Behavior-preserving refactor (no bug to repro). Equivalence is deterministic: trimLeft/Right/Both vs `replaceRegexpOne` over 5M mixed rows and a 0..40 length sweep across the 16-byte memcpy boundary = 0 mismatches.
b	Root cause explained?	Review asked for memcpy batching by run, separate non-template methods, no per-row resize, braces on multi-line `if`.
c	Fix matches root cause?	`vectorSpace` accumulates a run of untrimmed rows and copies it in one `memcpy`, breaking only at a row needing a trim. Templates already dropped (3 plain methods). Capacity reserved once, sized at end. Braces added to all multi-line `if` bodies.
d	Test intent preserved / new tests added?	04401 + 00997_trim unchanged and passing; no assertions weakened.
e	Both directions demonstrated?	Output byte-identical to the regex baseline across all shapes; debug + ASan/UBSan both clean.
f	General across code paths?	All three paths (space, custom, FixedString) covered; custom/FixedString keep the per-row path as requested.
g	Generalizes across inputs?	Empty, single char, all-spaces (len to 0), leading/trailing/embedded, 15/16/17-byte boundaries, all-clean run, all-trimmed, alternating, run-then-break + final flush all verified under ASan.
h	Backward compatible?	N/A (internal implementation, identical results).
i	Invariants and contracts preserved?	`res_offset <= input_data.size()` (reserved capacity) so the final `resize_exact` never reallocates; batched region `[batch_input_begin, prev_offset)` is contiguous in input and reproduced exactly; the 15-byte memcpy slop is covered by `ColumnString` chars `pad_right` (63).

Session id: cron:clickhouse-worker-slot-2:20260624-152800

Algunenano · 2026-06-25T15:34:38Z

trim_numbers is improved. trim_whitespaces was degraded

Algunenano · 2026-06-25T15:38:48Z

+            if (!trim_this_left && !trim_this_right)
+            {
+                /// Nothing to trim: extend the current run, defer the copy.
+                if (!batch_open)


Maybe instead of opening a batch there should be a while loop here and keep going, then once we reach a row that needs editing we process the batch and the row itself. It should also reduce the number of registers needed for iterating (i is the beginning of the loop, j is the current row)

Addresses the third round of review feedback on the default-whitespace trim path: - Restore the compile-time `<do_trim_left, do_trim_right>` template parameters via `dispatch`, removing the per-row runtime trim-direction branches that the previous revision introduced. Those branches regressed trim_whitespace (every row has long leading/trailing space runs) by ~10-15% on ARM. - Restore the SIMD space scan (`find_first_not_symbols<' '>` / `find_last_not_symbols_or_null<' '>`) for rows that actually need trimming. The previous scalar byte-by-byte loop was the real cause of the trim_whitespace slowdown: those rows have ~17 spaces per side, where the SIMD scan (16 bytes per instruction) is far faster than a scalar loop. - Reshape the run batching into a tight i/j while loop (i = run start, j = current row) instead of explicit open/flush bookkeeping, reducing the number of live loop variables. The fast path that motivated the PR is preserved: a cheap first/last-byte check decides whether a row needs trimming, so a column with no leading or trailing spaces (e.g. ltrim(toString(number))) is copied in a single memcpy and never pays for the per-row trim work. trim_numbers stays ~1.4x faster on ARM; trim_urls stays faster; trim_whitespace returns to master parity. Output capacity is reserved once and the result column is sized at the end, so there is no per-row reallocation. The result pointer is cached after the reserve and never invalidated because the output never exceeds the reserved input size, which addresses the heap-corruption origin of this PR. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-25T17:20:27Z

Reworked the default-whitespace path (commit c8966fe) along the three points raised:

1. Template parameters restored. dispatch resolves trim_left/trim_right back to compile-time <do_trim_left, do_trim_right> template parameters, so the inner loop has no per-row trim-direction branch. The previous revision's runtime branches were what regressed trim_whitespace.

2. SIMD scan restored for rows that need trimming. The rows in trim_whitespace carry ~17 spaces per side, so the scalar byte loop was the real slowdown. Rows that need trimming now use find_first_not_symbols<' '> / find_last_not_symbols_or_null<' '> again (16 bytes per instruction).

3. i/j while loop. The run is tracked with i = run start and j = current row, no separate open/flush bookkeeping. A cheap first/last-byte check decides run membership; when j hits a row that needs trimming, the run [i, j) is flushed in one memcpy, that row is SIMD-trimmed, and i resets. Output capacity is reserved once and sized at the end (no per-row reallocation).

The PR's fast path is preserved: a column with no leading/trailing spaces is copied in a single memcpy and never pays per-row trim work.

Local perf direction matches the prior CI run's improvements and the regression is gone (master-parity on trim_whitespace, ~1.4x faster on ARM trim_numbers). The CI Performance Comparison job on the new commit will confirm.

Validation: all 9 stateless trim tests pass (00997 = 100k random space combos, 04327 long custom char sets, 04401, 03890 FixedString). ASan+UBSan with abort_on_error=1: 5M-row cross-check of all three directions vs replaceRegexp (0 mismatches), boundary sweep around the 16-byte memcpy boundary, all-clean / all-trimmed / alternating runs, and 24 concurrent trim workloads, all clean.

groeneai · 2026-06-25T17:20:48Z

Pre-PR validation gate (click to expand)

#	Question	Answer
a	Deterministic repro?	Yes. The perf regression is reproducible: the prior commit (`171198e`) scalar byte-loop slows `trim_whitespace` trimLeft/trimRight/trimBoth by ~1.07-1.15x on the CI Performance Comparison job (rows have ~17 spaces/side). The original P0 SEGV (STID 5743-5b74) was already concluded pre-existing/unrelated (MV-insert heap crash), not the trim path.
b	Root cause explained?	Yes. Round-2 dropped the compile-time `<do_trim_left,do_trim_right>` template params (adding per-row runtime branches) AND replaced the SIMD space scan with a scalar byte loop. For space-heavy rows the scalar skip is far slower than `find_first_not_symbols<' '>` (16 bytes/instruction), so `trim_whitespace` regressed while `trim_numbers` (no spaces) sped up.
c	Fix matches root cause?	Yes. Restores the template params via `dispatch` (removes runtime branches) and the SIMD scan for rows that need trimming (recovers space-heavy throughput), while keeping the cheap boundary-byte check + single-memcpy run batching that makes the no-space hot path fast.
d	Test intent preserved / new tests added?	Yes. No test weakened. The existing 04401 regression test (no-op hot path, mixed fallback, boundaries, FixedString, custom) still passes unchanged; all 9 stateless trim tests pass.
e	Both directions demonstrated?	Yes. Correctness cross-checked vs `replaceRegexp` baseline over 5M mixed rows (0 mismatches) for all 3 directions; the prior revision and this one both produce identical results, the difference is purely performance, which the CI Performance Comparison job re-validates.
f	Fix is general across code paths?	Yes. Space (vectorSpace), custom-character (vectorCustom) and FixedString (vectorFixed) are separate, all template-dispatched; no sibling path left on the slow scalar variant where SIMD applies.
g	Fix generalizes across inputs (params/datatypes/wrappers)?	Yes. Validated across empty column, empty string, single char, all-spaces (len -> 0), lead-only, trail-only, embedded-space-preserved, custom sets >16 chars, FixedString NUL padding, and 15/16/17-byte lengths around the memcpy boundary.
h	Backward compatible?	N/A. Pure performance refactor of an internal function; no settings, formats, or user-visible behavior changed.
i	Invariants and contracts preserved?	Yes. Output capacity reserved once via `reserve_exact(input_data.size())`; res_offset never exceeds reserved size (trimming only removes bytes) so the cached result pointer is never invalidated; offsets written consistently. ASan+UBSan (abort_on_error=1): 5M-row sweep, boundary sweep, and 24 concurrent trim workloads all clean, addressing the heap-corruption origin.

Session id: cron:clickhouse-worker-slot-1:20260625-161500

clickhouse-gh · 2026-06-25T21:23:13Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	85.40%	85.40%	+0.00%
Functions	92.60%	92.60%	+0.00%
Branches	77.60%	77.60%	+0.00%

Changed lines: Changed C/C++ lines covered by tests: 112/114 (98.25%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 3 line(s) · Uncovered code

Full report · Diff report

groeneai · 2026-06-25T22:09:59Z

CI finish ledger — `c8966fe`

Every failure below has an owner: a fixing PR (ours or external), or a full-effort fix task whose fixing-PR link will be posted here when it opens. The PR's own diff (only src/Functions/trim.cpp + the 04401 test) does not touch Keeper, S3, or threading, so none of these are PR-caused.

Check / test	Reason	Owner / fixing PR
Stress test (arm_msan) / Hung check, possible deadlock	deadlock	#101680 / #105905 (ours, open)
Stress test (arm_release) / Hung check, possible deadlock	deadlock	#101680 / #105905 (ours, open)
Integration tests (amd_msan, 5/6) / test_keeper_snapshot_chunked_transfer::test_recover_after_s3_read_error_during_transfer	flaky (24 hits / ~18 unrelated PRs / 1 master in 30d)	a fix task is created (investigating at full effort — fixing-PR link to follow here)

Session id: cron:our-pr-ci-monitor:20260625-220000

groeneai · 2026-06-25T23:01:19Z

Fixing PR for the test_keeper_snapshot_chunked_transfer/test.py::test_recover_* flaky (the Integration tests (amd_msan, 5/6) failure on this PR's commit c8966fe) created: #108557. Root cause: the recover subtests restart a SIGKILL'd follower whose snapshot recovery legitimately takes ~95 s under msan, but the test waited only start_clickhouse(20) (20 s). Not caused by this PR.

Algunenano

LGTM.

@groeneai Can you please update the PR description (see #108177 (comment) for the details about its issues) and have a look at current CI failures, to see if they are reported or not, and deal with them if not in flight (but use a different PR please)

groeneai · 2026-06-26T14:13:18Z

groeneai mentioned this pull request Jun 22, 2026

ARM performance regression: trim_numbers query 1 is slower #106601

Closed

Algunenano added the can be tested Allows running workflows for external contributors label Jun 22, 2026

Algunenano self-assigned this Jun 22, 2026

clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label Jun 22, 2026

groeneai force-pushed the fix-trim-left-perf-106601 branch from 35d2de3 to c4499f5 Compare June 22, 2026 20:44

Algunenano requested changes Jun 24, 2026

View reviewed changes

groeneai force-pushed the fix-trim-left-perf-106601 branch from c4499f5 to 3257ed3 Compare June 24, 2026 14:40

clickhouse-gh Bot reviewed Jun 24, 2026

View reviewed changes

Algunenano requested changes Jun 24, 2026

View reviewed changes

Algunenano reviewed Jun 25, 2026

View reviewed changes

Algunenano approved these changes Jun 26, 2026

View reviewed changes

Algunenano enabled auto-merge June 26, 2026 14:15

Algunenano added this pull request to the merge queue Jun 26, 2026

Merged via the queue into ClickHouse:master with commit 0599068 Jun 26, 2026
163 of 167 checks passed

robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 26, 2026

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

groeneai commented Jun 22, 2026 • edited by robot-ch-test-poll4 Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Description

Version info

Uh oh!

groeneai commented Jun 22, 2026

Pre-PR validation gate

Uh oh!

groeneai commented Jun 22, 2026

Uh oh!

clickhouse-gh Bot commented Jun 22, 2026 • edited by Algunenano Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

PR Metadata

Missing context / blind spots

Tests

Final Verdict

Uh oh!

clickhouse-gh Bot commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

clickbench

tpch_adapted_1_official

Uh oh!

groeneai commented Jun 22, 2026

Prior CI failure (arm_asan_ubsan, azure, sequential) is a pre-existing, unrelated crash

Uh oh!

groeneai commented Jun 23, 2026

Uh oh!

groeneai commented Jun 23, 2026

Uh oh!

Algunenano left a comment

Choose a reason for hiding this comment

Uh oh!

groeneai commented Jun 24, 2026

Uh oh!

groeneai commented Jun 24, 2026

Pre-PR validation gate

Uh oh!

clickhouse-gh Bot Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Algunenano Jun 24, 2026

Choose a reason for hiding this comment

Uh oh!

groeneai commented Jun 24, 2026

Uh oh!

groeneai commented Jun 24, 2026

Pre-PR validation gate (round-2 rework)

Uh oh!

Algunenano commented Jun 25, 2026

Uh oh!

Uh oh!

Algunenano Jun 25, 2026

Choose a reason for hiding this comment

Uh oh!

groeneai commented Jun 25, 2026

Uh oh!

groeneai commented Jun 25, 2026

Uh oh!

clickhouse-gh Bot commented Jun 25, 2026

LLVM Coverage Report

Uh oh!

groeneai commented Jun 25, 2026

CI finish ledger — c8966fe

Uh oh!

groeneai commented Jun 25, 2026

Uh oh!

Algunenano left a comment

Choose a reason for hiding this comment

Uh oh!

groeneai commented Jun 26, 2026

Uh oh!

groeneai commented Jun 22, 2026 •

edited by robot-ch-test-poll4

Loading

clickhouse-gh Bot commented Jun 22, 2026 •

edited by Algunenano

Loading

clickhouse-gh Bot commented Jun 22, 2026 •

edited

Loading

Prior CI failure (`arm_asan_ubsan, azure, sequential`) is a pre-existing, unrelated crash

CI finish ledger — `c8966fe`