Fix MSan use-of-uninitialized-value in UTF-8 case-insensitive StringSearcher by groeneai · Pull Request #105223 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix MSan use-of-uninitialized-value in UTF-8 case-insensitive StringSearcher#105223

Merged
Algunenano merged 2 commits into
ClickHouse:masterfrom
groeneai:fix-stringsearcher-utf8-msan-invalid-needle
May 19, 2026
Merged

Fix MSan use-of-uninitialized-value in UTF-8 case-insensitive StringSearcher#105223
Algunenano merged 2 commits into
ClickHouse:masterfrom
groeneai:fix-stringsearcher-utf8-msan-invalid-needle

Conversation

@groeneai

@groeneai groeneai commented May 18, 2026

Copy link
Copy Markdown
Contributor

Investigation requested by @alexey-milovidov on #104966 (the chronic Unit tests (msan, function_prop_fuzzer) failure at src/Common/StringSearcher.h:563:61 via Volnitsky -> MatchImpl::vectorVector -> ilike).

Cross-PR signature (last 30 days, CIDB)

The same use-of-uninitialized-value at StringSearcher.h:563:61 in DB::impl::StringSearcher<false, false>::search hit at least these 7 unrelated PRs on Unit tests (msan, function_prop_fuzzer): #99740, #100277, #100377, #103383, #104545, #104966, #105126. The sibling shape at StringSearcher.h:496:41 (the same predicate in compare instead of search) hit other PRs in the same window. Both come from the same chronic family tracked under umbrella issue #104877. PR #105146 ("Stop the bleeding in function_prop_fuzzer") did not silence this specific signature; PR #100225 ("Fix use-of-uninitialized-value in StringSearcher.h") fixed only the empty-needle path in compare, not the path triggered here.

Root cause

UTF8CaseInsensitiveStringSearcher (StringSearcher<false, false>) caches the lowercase / uppercase variant of the first 16 needle bytes into the __m128i registers cachel / cacheu. The bytes are first staged through two stack buffers l_seq / u_seq (6 bytes each, uninitialized at declaration), then inserted into the SIMD cache via _mm_insert_epi8.

The first character is handled by the outer if (*needle < 0x80u) / else block, where the else branch already takes care of the invalid-UTF-8 case by copying bytes verbatim:

if (!first_u32)
{
    /// Process it verbatim as a sequence of bytes.
    size_t src_len = UTF8::seqLength(*needle);
    memcpy(l_seq, needle, src_len);
    memcpy(u_seq, needle, src_len);
}

The per-character cache loop just below did not have a matching else:

auto c_u32 = UTF8::convertUTF8ToCodePoint(reinterpret_cast<const char *>(needle_pos), src_len);

if (c_u32)
{
    ...
    size_t dst_l_len = UTF8::convertCodePointToUTF8(c_l_u32, reinterpret_cast<char *>(l_seq), sizeof(l_seq));
    size_t dst_u_len = UTF8::convertCodePointToUTF8(c_u_u32, reinterpret_cast<char *>(u_seq), sizeof(u_seq));
    ...
}
/// <-- no else here; l_seq / u_seq stay whatever they were

for (size_t j = 0; j < src_len && i < N; ++j, ++i)
{
    ...
    cachel = _mm_insert_epi8(cachel, l_seq[j], N - 1);
    cacheu = _mm_insert_epi8(cacheu, u_seq[j], N - 1);
    ...
}

When convertUTF8ToCodePoint returns an empty optional for an invalid UTF-8 sequence in the middle of the needle (e.g. needle bytes 0x61 0xC3 0x28a followed by an invalid two-byte sequence whose seqLength(0xC3) = 2), the if (c_u32) body is skipped and l_seq / u_seq keep whatever the previous iteration left in their first dst_l_len bytes — the remaining src_len - dst_l_len bytes are uninitialized. Those bytes then flow into cachel / cacheu and surface as MSan use-of-uninitialized-value the next time _mm_cmpeq_epi8(v_haystack, cachel) is evaluated, both in compare (line 496:41) and search (line 563:61).

The fix adds the missing else branch — when c_u32 is empty, the needle bytes are copied verbatim into l_seq / u_seq, mirroring the existing first-character handling. Semantics are preserved: compareTrivial already breaks on invalid UTF-8 sequences in the needle, so the SIMD-prefilter result is irrelevant for those positions; only the MSan poison is removed.

The sibling StringSearcher<false, true> (ASCII case-insensitive) is not affected — it uses std::tolower(*needle_pos) per byte and never goes through UTF-8 decoding.

Tests

tests/queries/0_stateless/04255_stringsearcher_utf8_invalid_needle.sql exercises the path through ilike, positionCaseInsensitiveUTF8, multiSearchAnyCaseInsensitiveUTF8, startsWithCaseInsensitiveUTF8, endsWithCaseInsensitiveUTF8 with two needle shapes — unhex('61C328') (ASCII first byte + invalid 2-byte continuation) and unhex('C3A4E42829') (valid two-byte ä followed by an invalid three-byte start). Both ask the SIMD path to engage on a 50+ byte haystack so the bug deterministically triggers in the constructor's cache loop without the fix.

Verified locally with the debug build (20/20 passing under --test-runs 20). The MSan-only nature of the bug means CI Unit tests (msan, function_prop_fuzzer) is the ground-truth check; the test will continue to pass functionally in debug / asan / tsan independent of the fix.

Coordination note (for #104966)

The other failure on #104966Stress test (arm_msan) Cannot start clickhouse-server StorageTimeSeries::StorageTimeSeries null-deref during startup — is the same family as PR #104905, no separate fix needed here.

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Not required.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Version info

  • Merged into: 26.5.1.807

…earcher

`StringSearcher<false, false>` (the UTF-8 case-insensitive variant used by
`ilike`, `positionCaseInsensitiveUTF8`, `startsWithCaseInsensitiveUTF8`,
`endsWithCaseInsensitiveUTF8`, `multiSearchAnyCaseInsensitiveUTF8`, and
`VolnitskyCaseInsensitiveUTF8`) populates the `cachel` / `cacheu` SIMD
registers in its constructor from per-character lowercase / uppercase
UTF-8 byte buffers `l_seq` / `u_seq` (each 6 bytes on the stack,
uninitialized at declaration).

The first character is handled by the outer `if (*needle < 0x80u)` /
`else` block, where the `else` branch already copies bytes verbatim
when `convertUTF8ToCodePoint` reports an invalid UTF-8 sequence.

The same care was missing from the per-character cache loop. Whenever
`convertUTF8ToCodePoint` returned an empty optional for an invalid
UTF-8 sequence in the middle of the needle (so `c_u32` was null and the
`if (c_u32) { ... }` body was skipped), the inner loop still went on to
read `l_seq[j]` / `u_seq[j]` for `j = 0..src_len-1` and feed those
bytes into `cachel` / `cacheu` via `_mm_insert_epi8`. When `src_len`
exceeded the number of bytes written by a previous iteration, the
uninitialized portion of `l_seq` / `u_seq` was inserted into the SIMD
cache and surfaced later as a `MemorySanitizer` use-of-uninitialized-value
in `_mm_cmpeq_epi8` against the haystack in both `compare`
(`StringSearcher.h:496`) and `search` (`StringSearcher.h:563`).

The fix adds the missing `else` branch — when `c_u32` is empty, the
needle bytes are memcpy'd into `l_seq` / `u_seq`, identical to how the
first character is handled.

This is the same chronic family that PR ClickHouse#100225 ("Fix
use-of-uninitialized-value in `StringSearcher.h`") partially addressed
by adding `if (needle == needle_end) return true;` to `compare` for the
empty-needle case. The empty-needle case is fixed; the invalid-UTF-8-
mid-needle case is what this PR closes.

Reported by `Unit tests (msan, function_prop_fuzzer)` across many
unrelated master and PR runs (chronic noise under umbrella issue
ClickHouse#104877), with stack:

    StringSearcher.h:563:61 in DB::impl::StringSearcher<false, false>::search
    Volnitsky::Volnitsky
    MatchImpl::vectorVector
    FunctionLike (ilike)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

@groeneai

Copy link
Copy Markdown
Contributor Author

cc @alexey-milovidov @rschu1ze @thevar1able — could you review this? It is the follow-up to your directive on #104966: adds the missing else branch in UTF8CaseInsensitiveStringSearcher's constructor cache loop that left l_seq / u_seq uninitialized when convertUTF8ToCodePoint reported invalid UTF-8 in the middle of the needle, causing the chronic function_prop_fuzzer MSan finding at StringSearcher.h:563:61 (sibling shape at :496:41) tracked under #104877. The empty-needle case from PR #100225 stays fixed; this PR closes the invalid-UTF-8-mid-needle case.

@thevar1able thevar1able added the can be tested Allows running workflows for external contributors label May 18, 2026
@clickhouse-gh

clickhouse-gh Bot commented May 18, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [fbe4aab]

Summary:


AI Review

Summary

This PR fixes a real MemorySanitizer use-of-uninitialized-value path in StringSearcher<false, false> for invalid UTF-8 needles by (1) clamping the first-character invalid-sequence memcpy to needle_size and (2) initializing l_seq / u_seq in the mid-needle invalid-sequence branch before inserting into SIMD caches. The added stateless regression test covers invalid and truncated needle shapes across multiple UTF-8 case-insensitive functions and exercises multi-row paths. I re-checked prior discussion against current code; no unresolved correctness or safety issues remain.

Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-ci label May 18, 2026
Comment thread src/Common/StringSearcher.h
@Algunenano Algunenano self-assigned this May 18, 2026
…size

The first-character invalid-UTF-8 branch in the `UTF8CaseInsensitiveStringSearcher`
constructor (`StringSearcher<false, false>`) used
`size_t src_len = UTF8::seqLength(*needle);` and then `memcpy(l_seq, needle, src_len)`
without clamping `src_len` to `needle_size`.

For a truncated first sequence (e.g. a 1-byte needle starting with `0xE4`,
where `seqLength` returns 3), this path can read past `needle_end` into
uninitialized memory and propagate `MemorySanitizer` noise from the same
invalid-UTF-8 family addressed by the previous commit on this branch.

The fix mirrors the inner-loop clamp at line 407:

    size_t src_len = std::min<size_t>(needle_end - needle_pos, UTF8::seqLength(*needle_pos));

applied symmetrically for the first character (where `needle_size = needle_end - needle`):

    size_t src_len = std::min<size_t>(needle_size, UTF8::seqLength(*needle));

The regression test `04255_stringsearcher_utf8_invalid_needle.sql` is extended
with a 1-byte `0xE4` needle exercised through `ilike`, `positionCaseInsensitiveUTF8`,
`multiSearchAnyCaseInsensitiveUTF8`, `startsWithCaseInsensitiveUTF8`, and
`endsWithCaseInsensitiveUTF8`, plus a multi-row materialized variant to make the
SIMD path more likely to engage on the cached needle.

In a local debug build (without `MemorySanitizer`) the test passed 20/20 runs;
ground-truth validation will come from CI `Unit tests (msan, function_prop_fuzzer)`.

Addresses the bot review feedback on PR ClickHouse#105223 from `clickhouse-gh[bot]`.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented May 18, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.20% 84.20% +0.00%
Functions 91.40% 91.40% +0.00%
Branches 76.60% 76.60% +0.00%

Changed lines: 100.00% (9/9) · Uncovered code

Full report · Diff report

@Algunenano Algunenano added this pull request to the merge queue May 19, 2026
Merged via the queue into ClickHouse:master with commit 91d9af5 May 19, 2026
166 checks passed
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label May 19, 2026
@groeneai groeneai mentioned this pull request May 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants