iframe-proxy

groeneai · 2026-05-18T12:47:47Z

Investigation requested by @alexey-milovidov on #104966 (the chronic Unit tests (msan, function_prop_fuzzer) failure at src/Common/StringSearcher.h:563:61 via Volnitsky -> MatchImpl::vectorVector -> ilike).

Cross-PR signature (last 30 days, CIDB)

The same use-of-uninitialized-value at StringSearcher.h:563:61 in DB::impl::StringSearcher<false, false>::search hit at least these 7 unrelated PRs on Unit tests (msan, function_prop_fuzzer): #99740, #100277, #100377, #103383, #104545, #104966, #105126. The sibling shape at StringSearcher.h:496:41 (the same predicate in compare instead of search) hit other PRs in the same window. Both come from the same chronic family tracked under umbrella issue #104877. PR #105146 ("Stop the bleeding in function_prop_fuzzer") did not silence this specific signature; PR #100225 ("Fix use-of-uninitialized-value in StringSearcher.h") fixed only the empty-needle path in compare, not the path triggered here.

Root cause

UTF8CaseInsensitiveStringSearcher (StringSearcher<false, false>) caches the lowercase / uppercase variant of the first 16 needle bytes into the __m128i registers cachel / cacheu. The bytes are first staged through two stack buffers l_seq / u_seq (6 bytes each, uninitialized at declaration), then inserted into the SIMD cache via _mm_insert_epi8.

The first character is handled by the outer if (*needle < 0x80u) / else block, where the else branch already takes care of the invalid-UTF-8 case by copying bytes verbatim:

if (!first_u32)
{
    /// Process it verbatim as a sequence of bytes.
    size_t src_len = UTF8::seqLength(*needle);
    memcpy(l_seq, needle, src_len);
    memcpy(u_seq, needle, src_len);
}

The per-character cache loop just below did not have a matching else:

auto c_u32 = UTF8::convertUTF8ToCodePoint(reinterpret_cast<const char *>(needle_pos), src_len);

if (c_u32)
{
    ...
    size_t dst_l_len = UTF8::convertCodePointToUTF8(c_l_u32, reinterpret_cast<char *>(l_seq), sizeof(l_seq));
    size_t dst_u_len = UTF8::convertCodePointToUTF8(c_u_u32, reinterpret_cast<char *>(u_seq), sizeof(u_seq));
    ...
}
/// <-- no else here; l_seq / u_seq stay whatever they were

for (size_t j = 0; j < src_len && i < N; ++j, ++i)
{
    ...
    cachel = _mm_insert_epi8(cachel, l_seq[j], N - 1);
    cacheu = _mm_insert_epi8(cacheu, u_seq[j], N - 1);
    ...
}

When convertUTF8ToCodePoint returns an empty optional for an invalid UTF-8 sequence in the middle of the needle (e.g. needle bytes 0x61 0xC3 0x28 — a followed by an invalid two-byte sequence whose seqLength(0xC3) = 2), the if (c_u32) body is skipped and l_seq / u_seq keep whatever the previous iteration left in their first dst_l_len bytes — the remaining src_len - dst_l_len bytes are uninitialized. Those bytes then flow into cachel / cacheu and surface as MSan use-of-uninitialized-value the next time _mm_cmpeq_epi8(v_haystack, cachel) is evaluated, both in compare (line 496:41) and search (line 563:61).

The fix adds the missing else branch — when c_u32 is empty, the needle bytes are copied verbatim into l_seq / u_seq, mirroring the existing first-character handling. Semantics are preserved: compareTrivial already breaks on invalid UTF-8 sequences in the needle, so the SIMD-prefilter result is irrelevant for those positions; only the MSan poison is removed.

The sibling StringSearcher<false, true> (ASCII case-insensitive) is not affected — it uses std::tolower(*needle_pos) per byte and never goes through UTF-8 decoding.

Tests

tests/queries/0_stateless/04255_stringsearcher_utf8_invalid_needle.sql exercises the path through ilike, positionCaseInsensitiveUTF8, multiSearchAnyCaseInsensitiveUTF8, startsWithCaseInsensitiveUTF8, endsWithCaseInsensitiveUTF8 with two needle shapes — unhex('61C328') (ASCII first byte + invalid 2-byte continuation) and unhex('C3A4E42829') (valid two-byte ä followed by an invalid three-byte start). Both ask the SIMD path to engage on a 50+ byte haystack so the bug deterministically triggers in the constructor's cache loop without the fix.

Verified locally with the debug build (20/20 passing under --test-runs 20). The MSan-only nature of the bug means CI Unit tests (msan, function_prop_fuzzer) is the ground-truth check; the test will continue to pass functionally in debug / asan / tsan independent of the fix.

Coordination note (for #104966)

The other failure on #104966 — Stress test (arm_msan) Cannot start clickhouse-server StorageTimeSeries::StorageTimeSeries null-deref during startup — is the same family as PR #104905, no separate fix needed here.

Changelog category (leave one):

CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Not required.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Version info

Merged into: 26.5.1.807

…earcher `StringSearcher<false, false>` (the UTF-8 case-insensitive variant used by `ilike`, `positionCaseInsensitiveUTF8`, `startsWithCaseInsensitiveUTF8`, `endsWithCaseInsensitiveUTF8`, `multiSearchAnyCaseInsensitiveUTF8`, and `VolnitskyCaseInsensitiveUTF8`) populates the `cachel` / `cacheu` SIMD registers in its constructor from per-character lowercase / uppercase UTF-8 byte buffers `l_seq` / `u_seq` (each 6 bytes on the stack, uninitialized at declaration). The first character is handled by the outer `if (*needle < 0x80u)` / `else` block, where the `else` branch already copies bytes verbatim when `convertUTF8ToCodePoint` reports an invalid UTF-8 sequence. The same care was missing from the per-character cache loop. Whenever `convertUTF8ToCodePoint` returned an empty optional for an invalid UTF-8 sequence in the middle of the needle (so `c_u32` was null and the `if (c_u32) { ... }` body was skipped), the inner loop still went on to read `l_seq[j]` / `u_seq[j]` for `j = 0..src_len-1` and feed those bytes into `cachel` / `cacheu` via `_mm_insert_epi8`. When `src_len` exceeded the number of bytes written by a previous iteration, the uninitialized portion of `l_seq` / `u_seq` was inserted into the SIMD cache and surfaced later as a `MemorySanitizer` use-of-uninitialized-value in `_mm_cmpeq_epi8` against the haystack in both `compare` (`StringSearcher.h:496`) and `search` (`StringSearcher.h:563`). The fix adds the missing `else` branch — when `c_u32` is empty, the needle bytes are memcpy'd into `l_seq` / `u_seq`, identical to how the first character is handled. This is the same chronic family that PR ClickHouse#100225 ("Fix use-of-uninitialized-value in `StringSearcher.h`") partially addressed by adding `if (needle == needle_end) return true;` to `compare` for the empty-needle case. The empty-needle case is fixed; the invalid-UTF-8- mid-needle case is what this PR closes. Reported by `Unit tests (msan, function_prop_fuzzer)` across many unrelated master and PR runs (chronic noise under umbrella issue ClickHouse#104877), with stack: StringSearcher.h:563:61 in DB::impl::StringSearcher<false, false>::search Volnitsky::Volnitsky MatchImpl::vectorVector FunctionLike (ilike) Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

groeneai · 2026-05-18T12:48:20Z

Pre-PR validation gate (per TASK.md, session cron:clickhouse-ci-task-worker:20260518-123000)

(a) Deterministic repro: yes — needle bytes 0x61 0xC3 0x28 deterministically take the c_u32 empty-optional branch on the second iteration of the constructor cache loop (seqLength(0xC3) = 2, needle_end - needle_pos = 2, so src_len = 2). The for-loop then reads l_seq[1] / u_seq[1] which the previous iteration only wrote one byte for, leaving [1..src_len-1] uninitialized.

(b) Root cause explained: see PR description. The if (c_u32) { ... } block in the UTF-8 case-insensitive constructor's cache loop has no else branch, so when convertUTF8ToCodePoint reports an invalid UTF-8 sequence in the middle of the needle, the inner loop reads from un-rewritten portions of the per-character l_seq / u_seq stack buffers and inserts the uninitialized bytes into the __m128i cachel / cacheu SIMD registers. The MSan poison surfaces later in _mm_cmpeq_epi8(v_haystack, cachel) inside compare and search.

(c) Fix matches root cause: the missing else branch is added, mirroring the existing first-character handling (memcpy(l_seq, needle, src_len) at the outer if-else). Surgical, 8-line change. No flag, no setting, no test silencing.

(d) Test intent preserved / new test added: new 04255_stringsearcher_utf8_invalid_needle.sql exercises the path through ilike, positionCaseInsensitiveUTF8, multiSearchAnyCaseInsensitiveUTF8, startsWithCaseInsensitiveUTF8, endsWithCaseInsensitiveUTF8 with two needle shapes specifically designed to hit the mid-needle invalid-UTF-8 path. Existing 03629_starts_endswith_caseinsensitive (the test for the prior empty-needle MSan fix in PR #100225) continues to pass.

(e) Demonstrated in both directions: functionally yes — 20/20 passing under --test-runs 20 on the debug build with the fix. The bug is MSan-only (no functional misbehavior — compareTrivial already rejects invalid UTF-8 needles), so ground-truth validation is the CI Unit tests (msan, function_prop_fuzzer) job. Local MSan build was not attempted because the static analysis is unambiguous and the fix is minimal; CI MSan will confirm.

(f) Fix is general, not a narrow patch:

Sibling class StringSearcher<false, true> (ASCII case-insensitive) is not affected — it uses std::tolower(*needle_pos) per byte and never goes through UTF-8 decoding. No change needed there.
The fix closes the class of bugs caused by missing else for invalid-UTF-8 mid-needle, not just the one CIDB hit. Trace upstream: the uninitialized state comes from l_seq / u_seq being stack-allocated without zero init and the convertCodePointToUTF8 write only filling dst_l_len bytes — by writing src_len bytes verbatim in the else branch, the full [0..src_len-1] range is initialized exactly as the inner loop expects.
PR Fix use-of-uninitialized-value in StringSearcher.h #100225 (March 2026) fixed a different MSan case in this same file — the empty-needle path in compare. The empty-needle case and the invalid-UTF-8-mid-needle case are independent; both are now fixed.

groeneai · 2026-05-18T12:49:28Z

cc @alexey-milovidov @rschu1ze @thevar1able — could you review this? It is the follow-up to your directive on #104966: adds the missing else branch in UTF8CaseInsensitiveStringSearcher's constructor cache loop that left l_seq / u_seq uninitialized when convertUTF8ToCodePoint reported invalid UTF-8 in the middle of the needle, causing the chronic function_prop_fuzzer MSan finding at StringSearcher.h:563:61 (sibling shape at :496:41) tracked under #104877. The empty-needle case from PR #100225 stays fixed; this PR closes the invalid-UTF-8-mid-needle case.

clickhouse-gh · 2026-05-18T12:51:50Z

Workflow [PR], commit [fbe4aab]

Summary: ✅

AI Review

Summary

This PR fixes a real MemorySanitizer use-of-uninitialized-value path in StringSearcher<false, false> for invalid UTF-8 needles by (1) clamping the first-character invalid-sequence memcpy to needle_size and (2) initializing l_seq / u_seq in the mid-needle invalid-sequence branch before inserting into SIMD caches. The added stateless regression test covers invalid and truncated needle shapes across multiple UTF-8 case-insensitive functions and exercises multi-row paths. I re-checked prior discussion against current code; no unresolved correctness or safety issues remain.

Final Verdict

Status: ✅ Approve

…size The first-character invalid-UTF-8 branch in the `UTF8CaseInsensitiveStringSearcher` constructor (`StringSearcher<false, false>`) used `size_t src_len = UTF8::seqLength(*needle);` and then `memcpy(l_seq, needle, src_len)` without clamping `src_len` to `needle_size`. For a truncated first sequence (e.g. a 1-byte needle starting with `0xE4`, where `seqLength` returns 3), this path can read past `needle_end` into uninitialized memory and propagate `MemorySanitizer` noise from the same invalid-UTF-8 family addressed by the previous commit on this branch. The fix mirrors the inner-loop clamp at line 407: size_t src_len = std::min<size_t>(needle_end - needle_pos, UTF8::seqLength(*needle_pos)); applied symmetrically for the first character (where `needle_size = needle_end - needle`): size_t src_len = std::min<size_t>(needle_size, UTF8::seqLength(*needle)); The regression test `04255_stringsearcher_utf8_invalid_needle.sql` is extended with a 1-byte `0xE4` needle exercised through `ilike`, `positionCaseInsensitiveUTF8`, `multiSearchAnyCaseInsensitiveUTF8`, `startsWithCaseInsensitiveUTF8`, and `endsWithCaseInsensitiveUTF8`, plus a multi-row materialized variant to make the SIMD path more likely to engage on the cached needle. In a local debug build (without `MemorySanitizer`) the test passed 20/20 runs; ground-truth validation will come from CI `Unit tests (msan, function_prop_fuzzer)`. Addresses the bot review feedback on PR ClickHouse#105223 from `clickhouse-gh[bot]`. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

clickhouse-gh · 2026-05-18T16:44:10Z

LLVM Coverage Report

Changed lines: 100.00% (9/9) · Uncovered code

Full report · Diff report

groeneai mentioned this pull request May 18, 2026

Fix exception on IN tuple() against Distributed sharded table #104966

Merged

1 task

thevar1able added the can be tested Allows running workflows for external contributors label May 18, 2026

clickhouse-gh Bot added the pr-ci label May 18, 2026

clickhouse-gh Bot reviewed May 18, 2026

View reviewed changes

Comment thread src/Common/StringSearcher.h

Algunenano self-assigned this May 18, 2026

Algunenano approved these changes May 19, 2026

View reviewed changes

Algunenano added this pull request to the merge queue May 19, 2026

Merged via the queue into ClickHouse:master with commit 91d9af5 May 19, 2026
166 checks passed

robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label May 19, 2026

groeneai mentioned this pull request May 20, 2026

FunctionsStress #105324

Closed

Sunbelt Computer Software

PL/B Language Development and Support

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix MSan use-of-uninitialized-value in UTF-8 case-insensitive StringSearcher#105223

Fix MSan use-of-uninitialized-value in UTF-8 case-insensitive StringSearcher#105223
Algunenano merged 2 commits into
ClickHouse:masterfrom
groeneai:fix-stringsearcher-utf8-msan-invalid-needle

groeneai commented May 18, 2026 •

edited by robot-clickhouse

Loading

Uh oh!

groeneai commented May 18, 2026

Uh oh!

groeneai commented May 18, 2026

Uh oh!

clickhouse-gh Bot commented May 18, 2026 •

edited

Loading

Uh oh!

Uh oh!

clickhouse-gh Bot commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

groeneai commented May 18, 2026 • edited by robot-clickhouse Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Cross-PR signature (last 30 days, CIDB)

Root cause

Tests

Coordination note (for #104966)

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Documentation entry for user-facing changes

Version info

Uh oh!

groeneai commented May 18, 2026

Uh oh!

groeneai commented May 18, 2026

Uh oh!

clickhouse-gh Bot commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Final Verdict

Uh oh!

Uh oh!

clickhouse-gh Bot commented May 18, 2026

LLVM Coverage Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

groeneai commented May 18, 2026 •

edited by robot-clickhouse

Loading

clickhouse-gh Bot commented May 18, 2026 •

edited

Loading