Speed up default trimLeft when no row needs trimming#108177
Conversation
|
cc @rschu1ze — could you review this? It adds a fast path to default |
|
Workflow [PR], commit [c8966fe] Summary: ❌
AI ReviewSummaryThis PR rewrites the default-space PR Metadata
Missing context / blind spots
Tests
Final VerdictStatus: Minimum required actions: update the PR body/changelog to match the final implementation and wait for current CI/performance evidence on |
|
📊 Cloud Performance Report ✅ AI verdict: no significant changes detected. K_source=6 K_base=30 flagged=0/65 clickbench🟢 No significant changes tpch_adapted_1_official🟢 No significant changes Debug info
|
35d2de3 to
c4499f5
Compare
Prior CI failure (
|
|
CI finished on c4499f5 (rebased base). The only red is
The earlier one-off SEGV (STID 5743-5b74) was on the pre-rebase commit 35d2de3; it did not recur on this base, and local ASan+UBSan reproduction of the trim fast path (empty/single/all-whitespace/no-trim/mixed + 10M-row scan + concurrent + MV-insert workloads) was fully clean. |
|
CI finished on HEAD c4499f5. All checks green except
This PR only adds a fast path to default |
Algunenano
left a comment
There was a problem hiding this comment.
I find this extremely narrow and covering a single case.
Here is my proposal:
- Separate the templates, they are worthless and make everything confusing. What's the point of having a template if you do `if constexpr bool else
- The simple case of just space and custom_trim_characters should be separated as well.
Now, the simpler case: left and just space
- The loop inside
vectorImplshould not call resize. It was reserved previously. You can set the size at the end. - Instead of calling executeImpl (another unnecessary template) we should to a
whilelook checking if the first character is a space. If it is then we If it's not then we keep going. When we are done we copy all skipped values.
Right + space is the same.
Left / Right with custom characters can leave the slow path for now. But in this case we've moved the if away from the very beginning, so they might also become faster
The trim_numbers performance test query 1 (ltrim(toString(number))) regressed ~12-14% on ARM/aarch64 after the NEON rewrite of find_first_not_symbols (PR ClickHouse#104228). Decimal strings have no leading whitespace, so default trimLeft still paid the per-row SIMD scan setup for a no-op. Restructure FunctionTrim into three separate non-template paths (vectorSpace, vectorCustom, vectorFixed) instead of the template-with-if-constexpr machinery plus executeImpl. The default (space) path now advances begin/end with a scalar while loop comparing single bytes, which beats the SIMD scan for short strings and exits after one comparison when there is no leading space. Output capacity is reserved once and the size is set after the loop, removing the per-row resize. trim_left/trim_right are cheap runtime ifs in the loop body. Custom trim character sets keep O(1) table lookups; FixedString is handled by a single non-template function. Results are unchanged for all inputs (verified against the regex baseline and under ASan/UBSan). Closes: ClickHouse#106601 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
c4499f5 to
3257ed3
Compare
|
Reworked per your proposal (
I dropped the original "return the input column unchanged" shortcut you flagged as too narrow: the scalar Validated: existing trim tests ( |
Pre-PR validation gate
Session id: cron:clickhouse-worker-slot-2:20260624-132300 |
| --end; | ||
|
|
||
| const size_t length = end - begin; | ||
| memcpySmallAllowReadWriteOverflow15(res_begin + res_offset, begin, length); |
There was a problem hiding this comment.
The PR body/changelog say the default trimLeft no-op case returns the input column directly, but this path still copies every row into col_res. For the motivating ltrim(toString(number)) query all rows are unchanged, so the advertised no-copy fast path and the prototype measurement do not apply to this commit; this only removes the SIMD symbol scan. Please either add the all-unchanged trimLeft fast path before vectorSpace and return the original ColumnString, or update the description/changelog and provide perf results for this broader rewrite.
| const UInt8 * end = input_data.data() + input_offsets[i]; | ||
|
|
||
| if (do_trim_left) | ||
| while (begin < end && *begin == ' ') |
There was a problem hiding this comment.
Please don't do if without brackets if there is more than one line. if + while + statement is not ok
Address review: in the default-space path, accumulate consecutive rows that need no trimming and copy the whole run in one memcpy, breaking the run only at a row that actually needs trimming. Output capacity is reserved once and sized at the end (no per-row resize). Add braces to multi-line if bodies in the custom and FixedString paths. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
@Algunenano reworked per the two comments. Default-space path ( Braces: added to every multi-line Custom characters and FixedString keep the per-row path as you suggested. Validation (debug + ASan/UBSan): trimLeft/Right/Both match |
Pre-PR validation gate (round-2 rework)
Session id: cron:clickhouse-worker-slot-2:20260624-152800 |
| if (!trim_this_left && !trim_this_right) | ||
| { | ||
| /// Nothing to trim: extend the current run, defer the copy. | ||
| if (!batch_open) |
There was a problem hiding this comment.
Maybe instead of opening a batch there should be a while loop here and keep going, then once we reach a row that needs editing we process the batch and the row itself. It should also reduce the number of registers needed for iterating (i is the beginning of the loop, j is the current row)
Addresses the third round of review feedback on the default-whitespace trim path: - Restore the compile-time `<do_trim_left, do_trim_right>` template parameters via `dispatch`, removing the per-row runtime trim-direction branches that the previous revision introduced. Those branches regressed trim_whitespace (every row has long leading/trailing space runs) by ~10-15% on ARM. - Restore the SIMD space scan (`find_first_not_symbols<' '>` / `find_last_not_symbols_or_null<' '>`) for rows that actually need trimming. The previous scalar byte-by-byte loop was the real cause of the trim_whitespace slowdown: those rows have ~17 spaces per side, where the SIMD scan (16 bytes per instruction) is far faster than a scalar loop. - Reshape the run batching into a tight i/j while loop (i = run start, j = current row) instead of explicit open/flush bookkeeping, reducing the number of live loop variables. The fast path that motivated the PR is preserved: a cheap first/last-byte check decides whether a row needs trimming, so a column with no leading or trailing spaces (e.g. ltrim(toString(number))) is copied in a single memcpy and never pays for the per-row trim work. trim_numbers stays ~1.4x faster on ARM; trim_urls stays faster; trim_whitespace returns to master parity. Output capacity is reserved once and the result column is sized at the end, so there is no per-row reallocation. The result pointer is cached after the reserve and never invalidated because the output never exceeds the reserved input size, which addresses the heap-corruption origin of this PR. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Reworked the default-whitespace path (commit c8966fe) along the three points raised: 1. Template parameters restored. 2. SIMD scan restored for rows that need trimming. The rows in 3. i/j while loop. The run is tracked with The PR's fast path is preserved: a column with no leading/trailing spaces is copied in a single Local perf direction matches the prior CI run's improvements and the regression is gone (master-parity on Validation: all 9 stateless trim tests pass (00997 = 100k random space combos, 04327 long custom char sets, 04401, 03890 FixedString). ASan+UBSan with |
Pre-PR validation gate (click to expand)
Session id: cron:clickhouse-worker-slot-1:20260625-161500 |
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered by tests: 112/114 (98.25%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 3 line(s) · Uncovered code |
CI finish ledger — c8966feEvery failure below has an owner: a fixing PR (ours or external), or a full-effort fix task whose fixing-PR link will be posted here when it opens. The PR's own diff (only
Session id: cron:our-pr-ci-monitor:20260625-220000 |
|
Fixing PR for the |
Algunenano
left a comment
There was a problem hiding this comment.
LGTM.
@groeneai Can you please update the PR description (see #108177 (comment) for the details about its issues) and have a look at current CI failures, to see if they are reported or not, and deal with them if not in flight (but use a different PR please)
0599068






Closes: #106601
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Speed up the default-whitespace
trimLeft/trimRight/trimBothfunctions (and their aliasesltrim/rtrim/trim): consecutive rows that need no trimming are copied in a single batch, and the per-row space scan runs only for rows that actually have a leading or trailing space.Description
Motivation (#106601):
trim_numbers(SELECT count() FROM numbers(10000000) WHERE NOT ignore(ltrim(toString(number)))) regressed about 12-14% on ARM/aarch64 after #104228 (the NEON rewrite offind_first_not_symbols). The default trim path callsfind_first_not_symbols<' '>once per row, so the regressed primitive sits on the hot path even when the strings (decimal numbers here) have no leading whitespace and nothing is trimmed.Change (default-whitespace path,
vectorSpace): rows that need no trimming are batched. A run of consecutive untrimmed rows is copied with a singlememcpy; the run breaks at the first row that has a leading or trailing space, which is trimmed with the existing SIMD space scan, then a new run starts. So the per-row scan is paid only for rows that actually need trimming, and a column with no trimmable rows is copied in one shot. Output capacity is reserved once and the final size is set at the end, with no per-row reallocation. The custom-trim-character andFixedStringpaths are kept as separate methods; their behaviour is unchanged.This does not return the original column; it always builds and returns the result column. An earlier revision returned the input column directly in the no-op case, but that was narrower and was replaced with the batching approach above on reviewer request.
Performance (this commit): both AMD and ARM Performance Comparison shards pass. On ARM,
trim_numbersis about 1.4x faster (ltrim/trim/rtrim) andtrim_whitespaceis back to master parity (the earlier scalar-only revision had regressed it).Correctness: a column with no trimmable rows returns byte-identical output, and mixed columns trim correctly. Existing trim tests plus the new
04401_trim_left_unchanged_fast_pathpass; validated under ASan+UBSan, including boundary sweeps around the 16-byte memcpy boundary, all-spaces, empty strings, andFixedString.Version info
26.7.1.107