Fix MSan use-of-uninitialized-value in UTF-8 case-insensitive StringSearcher#105223
Conversation
…earcher
`StringSearcher<false, false>` (the UTF-8 case-insensitive variant used by
`ilike`, `positionCaseInsensitiveUTF8`, `startsWithCaseInsensitiveUTF8`,
`endsWithCaseInsensitiveUTF8`, `multiSearchAnyCaseInsensitiveUTF8`, and
`VolnitskyCaseInsensitiveUTF8`) populates the `cachel` / `cacheu` SIMD
registers in its constructor from per-character lowercase / uppercase
UTF-8 byte buffers `l_seq` / `u_seq` (each 6 bytes on the stack,
uninitialized at declaration).
The first character is handled by the outer `if (*needle < 0x80u)` /
`else` block, where the `else` branch already copies bytes verbatim
when `convertUTF8ToCodePoint` reports an invalid UTF-8 sequence.
The same care was missing from the per-character cache loop. Whenever
`convertUTF8ToCodePoint` returned an empty optional for an invalid
UTF-8 sequence in the middle of the needle (so `c_u32` was null and the
`if (c_u32) { ... }` body was skipped), the inner loop still went on to
read `l_seq[j]` / `u_seq[j]` for `j = 0..src_len-1` and feed those
bytes into `cachel` / `cacheu` via `_mm_insert_epi8`. When `src_len`
exceeded the number of bytes written by a previous iteration, the
uninitialized portion of `l_seq` / `u_seq` was inserted into the SIMD
cache and surfaced later as a `MemorySanitizer` use-of-uninitialized-value
in `_mm_cmpeq_epi8` against the haystack in both `compare`
(`StringSearcher.h:496`) and `search` (`StringSearcher.h:563`).
The fix adds the missing `else` branch — when `c_u32` is empty, the
needle bytes are memcpy'd into `l_seq` / `u_seq`, identical to how the
first character is handled.
This is the same chronic family that PR ClickHouse#100225 ("Fix
use-of-uninitialized-value in `StringSearcher.h`") partially addressed
by adding `if (needle == needle_end) return true;` to `compare` for the
empty-needle case. The empty-needle case is fixed; the invalid-UTF-8-
mid-needle case is what this PR closes.
Reported by `Unit tests (msan, function_prop_fuzzer)` across many
unrelated master and PR runs (chronic noise under umbrella issue
ClickHouse#104877), with stack:
StringSearcher.h:563:61 in DB::impl::StringSearcher<false, false>::search
Volnitsky::Volnitsky
MatchImpl::vectorVector
FunctionLike (ilike)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
|
cc @alexey-milovidov @rschu1ze @thevar1able — could you review this? It is the follow-up to your directive on #104966: adds the missing |
|
Workflow [PR], commit [fbe4aab] Summary: ✅ AI ReviewSummaryThis PR fixes a real Final Verdict
|
…size
The first-character invalid-UTF-8 branch in the `UTF8CaseInsensitiveStringSearcher`
constructor (`StringSearcher<false, false>`) used
`size_t src_len = UTF8::seqLength(*needle);` and then `memcpy(l_seq, needle, src_len)`
without clamping `src_len` to `needle_size`.
For a truncated first sequence (e.g. a 1-byte needle starting with `0xE4`,
where `seqLength` returns 3), this path can read past `needle_end` into
uninitialized memory and propagate `MemorySanitizer` noise from the same
invalid-UTF-8 family addressed by the previous commit on this branch.
The fix mirrors the inner-loop clamp at line 407:
size_t src_len = std::min<size_t>(needle_end - needle_pos, UTF8::seqLength(*needle_pos));
applied symmetrically for the first character (where `needle_size = needle_end - needle`):
size_t src_len = std::min<size_t>(needle_size, UTF8::seqLength(*needle));
The regression test `04255_stringsearcher_utf8_invalid_needle.sql` is extended
with a 1-byte `0xE4` needle exercised through `ilike`, `positionCaseInsensitiveUTF8`,
`multiSearchAnyCaseInsensitiveUTF8`, `startsWithCaseInsensitiveUTF8`, and
`endsWithCaseInsensitiveUTF8`, plus a multi-row materialized variant to make the
SIMD path more likely to engage on the cached needle.
In a local debug build (without `MemorySanitizer`) the test passed 20/20 runs;
ground-truth validation will come from CI `Unit tests (msan, function_prop_fuzzer)`.
Addresses the bot review feedback on PR ClickHouse#105223 from `clickhouse-gh[bot]`.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
LLVM Coverage ReportChanged lines: 100.00% (9/9) · Uncovered code |

Investigation requested by @alexey-milovidov on #104966 (the chronic
Unit tests (msan, function_prop_fuzzer)failure atsrc/Common/StringSearcher.h:563:61viaVolnitsky->MatchImpl::vectorVector->ilike).Cross-PR signature (last 30 days, CIDB)
The same
use-of-uninitialized-valueatStringSearcher.h:563:61inDB::impl::StringSearcher<false, false>::searchhit at least these 7 unrelated PRs onUnit tests (msan, function_prop_fuzzer): #99740, #100277, #100377, #103383, #104545, #104966, #105126. The sibling shape atStringSearcher.h:496:41(the same predicate incompareinstead ofsearch) hit other PRs in the same window. Both come from the same chronic family tracked under umbrella issue #104877. PR #105146 ("Stop the bleeding infunction_prop_fuzzer") did not silence this specific signature; PR #100225 ("Fix use-of-uninitialized-value inStringSearcher.h") fixed only the empty-needle path incompare, not the path triggered here.Root cause
UTF8CaseInsensitiveStringSearcher(StringSearcher<false, false>) caches the lowercase / uppercase variant of the first 16 needle bytes into the__m128iregisterscachel/cacheu. The bytes are first staged through two stack buffersl_seq/u_seq(6 bytes each, uninitialized at declaration), then inserted into the SIMD cache via_mm_insert_epi8.The first character is handled by the outer
if (*needle < 0x80u)/elseblock, where theelsebranch already takes care of the invalid-UTF-8 case by copying bytes verbatim:The per-character cache loop just below did not have a matching
else:When
convertUTF8ToCodePointreturns an empty optional for an invalid UTF-8 sequence in the middle of the needle (e.g. needle bytes0x61 0xC3 0x28—afollowed by an invalid two-byte sequence whoseseqLength(0xC3) = 2), theif (c_u32)body is skipped andl_seq/u_seqkeep whatever the previous iteration left in their firstdst_l_lenbytes — the remainingsrc_len - dst_l_lenbytes are uninitialized. Those bytes then flow intocachel/cacheuand surface as MSan use-of-uninitialized-value the next time_mm_cmpeq_epi8(v_haystack, cachel)is evaluated, both incompare(line496:41) andsearch(line563:61).The fix adds the missing
elsebranch — whenc_u32is empty, the needle bytes are copied verbatim intol_seq/u_seq, mirroring the existing first-character handling. Semantics are preserved:compareTrivialalready breaks on invalid UTF-8 sequences in the needle, so the SIMD-prefilter result is irrelevant for those positions; only the MSan poison is removed.The sibling
StringSearcher<false, true>(ASCII case-insensitive) is not affected — it usesstd::tolower(*needle_pos)per byte and never goes through UTF-8 decoding.Tests
tests/queries/0_stateless/04255_stringsearcher_utf8_invalid_needle.sqlexercises the path throughilike,positionCaseInsensitiveUTF8,multiSearchAnyCaseInsensitiveUTF8,startsWithCaseInsensitiveUTF8,endsWithCaseInsensitiveUTF8with two needle shapes —unhex('61C328')(ASCII first byte + invalid 2-byte continuation) andunhex('C3A4E42829')(valid two-byteäfollowed by an invalid three-byte start). Both ask the SIMD path to engage on a 50+ byte haystack so the bug deterministically triggers in the constructor's cache loop without the fix.Verified locally with the debug build (20/20 passing under
--test-runs 20). The MSan-only nature of the bug means CIUnit tests (msan, function_prop_fuzzer)is the ground-truth check; the test will continue to pass functionally in debug / asan / tsan independent of the fix.Coordination note (for #104966)
The other failure on #104966 —
Stress test (arm_msan)Cannot start clickhouse-serverStorageTimeSeries::StorageTimeSeriesnull-deref during startup — is the same family as PR #104905, no separate fix needed here.Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Not required.
Documentation entry for user-facing changes
Version info
26.5.1.807