Fix use-of-uninitialized-value in StringSearcher.h by thevar1able · Pull Request #100225 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix use-of-uninitialized-value in StringSearcher.h#100225

Merged
thevar1able merged 1 commit into
masterfrom
fix-msan-stringsearcher
Mar 31, 2026
Merged

Fix use-of-uninitialized-value in StringSearcher.h#100225
thevar1able merged 1 commit into
masterfrom
fix-msan-stringsearcher

Conversation

@thevar1able

@thevar1able thevar1able commented Mar 20, 2026

Copy link
Copy Markdown
Member

Changelog category (leave one):

  • Critical Bug Fix (crash, data loss, RBAC)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix use-of-uninitialized-value in StringSearcher.h

Resolves #99165

Version info

  • Merged into: 26.4.1.456
  • Backported to: 26.3.10.3, 26.2.15.4, 26.1.10.5

@clickhouse-gh

clickhouse-gh Bot commented Mar 20, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added pr-critical-bugfix pr-must-backport Pull request should be backported intentionally. Use this label with great care! labels Mar 20, 2026
@clickhouse-gh

clickhouse-gh Bot commented Mar 20, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 83.80% 83.80% +0.00%
Functions 24.00% 24.00% +0.00%
Branches 76.30% 76.40% +0.10%

PR changed lines: PR changed-lines coverage: 100.00% (26/26, 0 noise lines excluded)
Diff coverage report
Uncovered code

@thevar1able thevar1able marked this pull request as ready for review March 31, 2026 17:15
@evillique evillique self-assigned this Mar 31, 2026
@thevar1able thevar1able added this pull request to the merge queue Mar 31, 2026
Merged via the queue into master with commit 626e92c Mar 31, 2026
152 checks passed
@thevar1able thevar1able deleted the fix-msan-stringsearcher branch March 31, 2026 17:30
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Mar 31, 2026
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR label Mar 31, 2026
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Apr 14, 2026
thevar1able added a commit that referenced this pull request Apr 14, 2026
Backport #100225 to 26.1: Fix use-of-uninitialized-value in StringSearcher.h
thevar1able added a commit that referenced this pull request Apr 14, 2026
Backport #100225 to 26.2: Fix use-of-uninitialized-value in StringSearcher.h
thevar1able added a commit that referenced this pull request Apr 14, 2026
Backport #100225 to 26.3: Fix use-of-uninitialized-value in StringSearcher.h
leshikus pushed a commit to leshikus/ClickHouse that referenced this pull request May 19, 2026
…earcher

`StringSearcher<false, false>` (the UTF-8 case-insensitive variant used by
`ilike`, `positionCaseInsensitiveUTF8`, `startsWithCaseInsensitiveUTF8`,
`endsWithCaseInsensitiveUTF8`, `multiSearchAnyCaseInsensitiveUTF8`, and
`VolnitskyCaseInsensitiveUTF8`) populates the `cachel` / `cacheu` SIMD
registers in its constructor from per-character lowercase / uppercase
UTF-8 byte buffers `l_seq` / `u_seq` (each 6 bytes on the stack,
uninitialized at declaration).

The first character is handled by the outer `if (*needle < 0x80u)` /
`else` block, where the `else` branch already copies bytes verbatim
when `convertUTF8ToCodePoint` reports an invalid UTF-8 sequence.

The same care was missing from the per-character cache loop. Whenever
`convertUTF8ToCodePoint` returned an empty optional for an invalid
UTF-8 sequence in the middle of the needle (so `c_u32` was null and the
`if (c_u32) { ... }` body was skipped), the inner loop still went on to
read `l_seq[j]` / `u_seq[j]` for `j = 0..src_len-1` and feed those
bytes into `cachel` / `cacheu` via `_mm_insert_epi8`. When `src_len`
exceeded the number of bytes written by a previous iteration, the
uninitialized portion of `l_seq` / `u_seq` was inserted into the SIMD
cache and surfaced later as a `MemorySanitizer` use-of-uninitialized-value
in `_mm_cmpeq_epi8` against the haystack in both `compare`
(`StringSearcher.h:496`) and `search` (`StringSearcher.h:563`).

The fix adds the missing `else` branch — when `c_u32` is empty, the
needle bytes are memcpy'd into `l_seq` / `u_seq`, identical to how the
first character is handled.

This is the same chronic family that PR ClickHouse#100225 ("Fix
use-of-uninitialized-value in `StringSearcher.h`") partially addressed
by adding `if (needle == needle_end) return true;` to `compare` for the
empty-needle case. The empty-needle case is fixed; the invalid-UTF-8-
mid-needle case is what this PR closes.

Reported by `Unit tests (msan, function_prop_fuzzer)` across many
unrelated master and PR runs (chronic noise under umbrella issue
ClickHouse#104877), with stack:

    StringSearcher.h:563:61 in DB::impl::StringSearcher<false, false>::search
    Volnitsky::Volnitsky
    MatchImpl::vectorVector
    FunctionLike (ilike)

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-critical-bugfix pr-must-backport Pull request should be backported intentionally. Use this label with great care! pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

MemorySanitizer: use-of-uninitialized-value (STID: 2535-3382)

5 participants