Bump cld2 to fix MSan use-of-uninitialized-value in detectLanguage by Algunenano · Pull Request #104257 · ClickHouse/ClickHouse · GitHub

Bump cld2 to fix MSan use-of-uninitialized-value in detectLanguage#104257

Merged
Algunenano merged 1 commit into ClickHouse:master from Algunenano:bump-cld2-msan-oob-fix on May 7, 2026

@Algunenano Algunenano commented May 6, 2026

Member

CLD2::ScriptScanner had two paths that read one byte past the input buffer when a consumed UTF-8 character ended exactly at byte_length_. With a ColumnString whose trailing padding bytes are uninitialized, this tripped MSan in detectLanguageMixed (and the other detectLanguage* functions) and showed up as flaky failures of the function_prop_fuzzer unit test on master.

Patched in ClickHouse/cld2#3 — guards the lookahead in GetOneScriptSpan and the post-loop back-up in GetOneTextSpan against take == byte_length_.
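
The kind of guard described above can be sketched as follows. This is a minimal illustration, not the cld2 code: `utf8_char_len` and `next_is_ascii_letter` are hypothetical helpers, and only the names `take` and the idea of comparing it with the buffer length come from the description. The point is that once a consumed character ends exactly at the end of the input, `buf[take]` lies outside the input and must not be read.

```cpp
#include <cstddef>

// Hypothetical helper (not from cld2): byte length of the UTF-8 sequence
// introduced by lead byte `b`. Invalid lead bytes are treated as length 1.
std::size_t utf8_char_len(unsigned char b) {
    if (b < 0x80) return 1;
    if ((b & 0xE0) == 0xC0) return 2;
    if ((b & 0xF0) == 0xE0) return 3;
    if ((b & 0xF8) == 0xF0) return 4;
    return 1;
}

// Hypothetical lookahead (not from cld2): consumes the character that starts
// at `pos` (caller guarantees pos < len) and reports whether the byte right
// after it is an ASCII letter.
bool next_is_ascii_letter(const char* buf, std::size_t len, std::size_t pos) {
    std::size_t take = pos + utf8_char_len(static_cast<unsigned char>(buf[pos]));
    // Guard: if the consumed character ends at (or claims to run past) the
    // end of the input, there is nothing to look at, so do not touch buf[take].
    if (take >= len) return false;
    unsigned char c = static_cast<unsigned char>(buf[take]);
    return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
```

For the reproducer input `abcdα`, the string is 6 bytes long (`α` encodes as the two bytes `CE B1`), so consuming `α` at offset 4 gives `take == 6`, which is exactly the boundary case the guard covers.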

Reproduced and confirmed fixed against an MSan build:

SET allow_experimental_nlp_functions = 1;
SELECT detectLanguageMixed(materialize('abcdα'));

Closes #103765

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix MemorySanitizer: use-of-uninitialized-value in detectLanguage* functions when the input contains a UTF-8 character ending exactly at the buffer boundary.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Version info

  • Merged into: 26.5.1.381

`CLD2::ScriptScanner` had two paths that read one byte past the input
buffer when a consumed UTF-8 character ended exactly at `byte_length_`.
With a `ColumnString` whose trailing padding bytes are uninitialized,
this tripped MSan in `detectLanguageMixed` (and the other `detectLanguage*`
functions) and showed up as flaky failures of the `function_prop_fuzzer`
unit test on master.

Patched in `ClickHouse/cld2#3`. Add a stateless test that exercises the
boundary, including the inputs reported on the issue.

Closes ClickHouse#103765

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

clickhouse-gh Bot commented May 6, 2026

Contributor

@clickhouse-gh clickhouse-gh Bot added pr-bugfix Pull request with bugfix, not backported by default submodule changed At least one submodule changed in this PR. labels May 6, 2026
@alexey-milovidov alexey-milovidov self-assigned this May 6, 2026
@Algunenano Algunenano added this pull request to the merge queue May 7, 2026
Merged via the queue into ClickHouse:master with commit c22bd98 May 7, 2026
162 of 165 checks passed
@Algunenano Algunenano deleted the bump-cld2-msan-oob-fix branch May 7, 2026 10:18
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label May 7, 2026

Development

Successfully merging this pull request may close these issues.

MemorySanitizer: use-of-uninitialized-value contrib/cld2/internal/utf8statetable.h
