iframe-proxy

yakov-olkhovskiy · 2026-05-01T15:52:33Z

The PR now contains only the MultiVolnitsky fix.

For UTF-8 case-insensitive search paths, a failed putNGram call could leave partial hash inserts for a needle and continue with inconsistent state. This change handles the failure explicitly: it rolls back hashes inserted for the current needle and routes that needle to fallback searchers.

Context from CI investigations: #99740 and this PR thread #103864.

Changelog category (leave one):

Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix MultiVolnitsky UTF-8 case-insensitive search handling by rolling back partial putNGram inserts on failure and switching failed needles to fallback searchers to avoid inconsistent state and sanitizer exceptions.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Version info

Merged into: 26.5.1.612

Avoid calling CLD2 language detection on empty UTF-8 strings in detectLanguage and detectLanguageMixed. This prevents a MemorySanitizer use-of-uninitialized-value in UTF8OneCharLen observed in PR CI run #99740. Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>

CLAassistant · 2026-05-01T15:52:41Z

clickhouse-gh · 2026-05-01T15:53:10Z

Workflow [PR], commit [9fb0cb7]

Summary: ✅

AI Review

Summary

This PR hardens UTF-8 case-insensitive MultiVolnitsky search by adding bounds validation before reading converted UTF-8 n-gram bytes and by rolling back partially inserted hash cells when a needle cannot be converted, then routing that needle to fallback search. I did not find unresolved correctness, safety, or performance issues in the current diff.

Final Verdict

Status: ✅ Approve

alexey-milovidov · 2026-05-01T16:14:41Z

@yakov-olkhovskiy, this needs a test.

Cover empty strings for detectLanguage and detectLanguageMixed, including an empty value produced via reinterpret/toString path that triggered CI sanitizer failure context in #99740. Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>

Keep CLD2 bypass for empty input to avoid sanitizer exceptions while preserving existing detectLanguageMixed behavior: empty strings should return {} rather than {'un':0}. Follow-up to CI investigation in #99740 and review in #103864 (comment). Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>

Keep CLD2 bypass for empty strings to avoid MemorySanitizer exception, but preserve existing detectLanguageMixed behavior by returning an empty map for empty input instead of {'un':0}. Related CI context: #99740 and follow-up discussion: #103864 (comment). Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>

Align reference output with query order in 04146_detect_language_empty_inputs.sql: two detectLanguage string results first, then two detectLanguageMixed map results. Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>

Align with existing NLP classification tests: detectLanguage/detectLanguageMixed depend on cld2 + nlp-data and are not available in Fast test jobs. Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>

Fix an MSan use-of-uninitialized-value path in UTF-8 case-insensitive ngram generation by validating seq offset bounds before byte reads in VolnitskyTraits::putNGramUTF8CaseInsensitive. Also make MultiVolnitskyBase::hasMoreToSearch honor putNGram failure for a needle: rollback partial hash inserts for that needle and move it to fallback searchers, matching single-needle Volnitsky behavior. Related CI failure context: #103864 #99740 Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>

Add a deterministic regression check in 02364_multiSearch_function_family that exercises UTF-8 case-insensitive multi-search fallback handling path and verifies stable successful execution. Related CI context: #103864 Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>

Capture rollback vector by reference in hasMoreToSearch callback so the rollback path compiles and receives inserted cell indices. Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>

Use a concrete expected-value check for case-insensitive UTF-8 multi-search regression coverage. Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>

Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>

yakov-olkhovskiy · 2026-05-02T17:28:20Z

@vdimir could you please review this - we with AI :) have found unit test failure in Volnitsky's search
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=103864&sha=430df9772ba27a96a2d1bdae93cfd4b2119bd374&name_0=PR&name_1=Unit+tests+%28msan%2C+function_prop_fuzzer%29
along with inconsistent behavior in multi-search
I hope you are the right person to ping on this issue

vdimir · 2026-05-04T15:58:43Z

+
+SET allow_experimental_nlp_functions = 1;
+
+SELECT detectLanguage('');


Did any of those test reproduce bug for you? I just tried with msan build, all queries passes: https://pastila.nl/?00158282/c0b0d9427139643194bf2dca658d483e#o62SpnRYdvfXWD8o5lP6Gg==GCM

Maybe it can be reproduced with unit test instead? Or it should be case with illegal utf-8?

hm... the issue was caught by existing unit test here:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=99740&sha=24cadae80401a21c72645ca905fc85fe0d23fac0&name_0=PR
then I proceeded with Cursor making (at the time obvious) fix and producing this test... but now I'm tracing back the call stack and see that there is already a guard against zero-sized string:
https://github.com/ClickHouse/cld2/blob/217ba8b8805b41557faadaa47bb6e99f2242eea3/internal/compact_lang_det_impl.cc#L1734
...my suspicion now is that it's not zero length string but some kind of broken UTF-8...

I've converted this PR to draft - this fix is not proper apparently

after some tinkering I was able to catch and reproduce it with next:

SET allow_experimental_nlp_functions = 1; SELECT detectLanguageMixed(materialize(CAST('Ðь' AS String))); -- or SELECT detectLanguageMixed(materialize(CAST(unhex('C390D18C') AS String)));

ok, I completely removed this fix and test since bug in cld2 library itself - will fix it there
@vdimir could you please review Volnitsky fix - seems you are the last person who dealt with it

PR
ClickHouse/cld2#2

What issue current changes addresses once cld2 is patched, do we have some example?

@vdimir unit test fail in this PR:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=103864&sha=430df9772ba27a96a2d1bdae93cfd4b2119bd374&name_0=PR

…eference

alexey-milovidov · 2026-05-10T11:52:27Z

    }

-    void putNGramBase(const VolnitskyTraits::Ngram ngram, const int offset, const size_t num)
+    void putNGramBase(const VolnitskyTraits::Ngram ngram, const int offset, const size_t num, std::vector<size_t> * inserted_cells = nullptr)


Add a comment about inserted_cells, and better to avoid default values.

- Drop the `nullptr` default for the `inserted_cells` parameter in the multi-needle `putNGramBase`. There is exactly one caller and it always passes a real container, so the parameter becomes a non-owning reference rather than an optional pointer. - Add a doc comment describing the rollback contract `inserted_cells` participates in: the caller uses it to undo partially-inserted hashes for a needle whose later n-gram fails to convert, preventing the hash table from leaking into searches for other needles. Per #103864 (comment)... . Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ge-empty-guard-461b

clickhouse-gh · 2026-05-12T23:15:34Z

LLVM Coverage Report

Changed lines: 91.43% (32/35) · Uncovered code

Full report · Diff report

clickhouse-gh Bot added the pr-improvement Pull request with some product improvements label May 1, 2026

yakov-olkhovskiy marked this pull request as ready for review May 1, 2026 15:54

clickhouse-gh Bot reviewed May 1, 2026

View reviewed changes

Comment thread src/Functions/FunctionsLanguageClassification.cpp Outdated

cursoragent and others added 2 commits May 1, 2026 16:21

clickhouse-gh Bot reviewed May 1, 2026

View reviewed changes

Comment thread tests/queries/0_stateless/04146_detect_language_empty_inputs.reference Outdated

cursoragent and others added 2 commits May 1, 2026 16:36

Mark empty-language test as no-fasttest

430df97

Align with existing NLP classification tests: detectLanguage/detectLanguageMixed depend on cld2 + nlp-data and are not available in Fast test jobs. Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>

groeneai mentioned this pull request May 1, 2026

Add web terminal interface to clickhouse-server #100277

Merged

1 task

clickhouse-gh Bot reviewed May 2, 2026

View reviewed changes

Comment thread src/Common/Volnitsky.h Outdated

cursoragent and others added 2 commits May 2, 2026 16:42

Fix inserted_cells lambda capture in MultiVolnitsky

4870cc2

Capture rollback vector by reference in hasMoreToSearch callback so the rollback path compiles and receives inserted cell indices. Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>

clickhouse-gh Bot reviewed May 2, 2026

View reviewed changes

Comment thread tests/queries/0_stateless/02364_multiSearch_function_family.sql Outdated

cursoragent and others added 2 commits May 2, 2026 16:49

Replace tautological UTF8 multisearch assertion

6c99859

Use a concrete expected-value check for case-insensitive UTF-8 multi-search regression coverage. Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>

Drop confusing regression comment wording

af6dfc2

Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>

cursor Bot changed the title ~~Guard empty NLP language inputs before CLD2 calls~~ Harden NLP empty-input handling and MultiVolnitsky UTF-8 fallback May 2, 2026

Add missing reference lines for 02364 fast test

fe254c9

Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>

yakov-olkhovskiy requested a review from vdimir May 2, 2026 17:21

vdimir self-assigned this May 4, 2026

vdimir reviewed May 4, 2026

View reviewed changes

yakov-olkhovskiy marked this pull request as draft May 5, 2026 00:11

yakov-olkhovskiy added 2 commits May 5, 2026 10:21

revert FunctionsLanguageClassification.cpp

ebd57d4

Delete tests/queries/0_stateless/04146_detect_language_empty_inputs.sql

14cacb0

Delete tests/queries/0_stateless/04146_detect_language_empty_inputs.r…

f791803

…eference

cursor Bot changed the title ~~Harden NLP empty-input handling and MultiVolnitsky UTF-8 fallback~~ Fix MultiVolnitsky fallback for failed UTF-8 ngram inserts May 5, 2026

Algunenano mentioned this pull request May 7, 2026

Bump cld2 to fix MSan use-of-uninitialized-value in detectLanguage #104257

Merged

1 task

This was referenced May 10, 2026

Implement FIFO admission queue with per-waiter wakeup for ProcessList #101099

Open

Add KQL conformance tests from open-source implementations #102159

Merged

alexey-milovidov reviewed May 10, 2026

View reviewed changes

alexey-milovidov and others added 2 commits May 10, 2026 12:01

Merge remote-tracking branch 'origin/master' into cursor/detectlangua…

48c3ea0

…ge-empty-guard-461b

This was referenced May 10, 2026

Enable count_distinct_optimization by default #104472

Closed

Fix heap-use-after-free in executeAggregateMultiply self-merge #103536

Merged

Fix wrong results with LIMIT BY and DISTINCT over UNION ALL of sorted subqueries #103231

Merged

Merge remote-tracking branch 'origin/master' into cursor/detectlangua…

9fb0cb7

…ge-empty-guard-461b

alexey-milovidov approved these changes May 12, 2026

View reviewed changes

alexey-milovidov marked this pull request as ready for review May 12, 2026 20:30

alexey-milovidov self-assigned this May 12, 2026

yakov-olkhovskiy added this pull request to the merge queue May 13, 2026

Merged via the queue into master with commit a5438f8 May 13, 2026
165 checks passed

yakov-olkhovskiy deleted the cursor/detectlanguage-empty-guard-461b branch May 13, 2026 13:26

robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label May 13, 2026


		SET allow_experimental_nlp_functions = 1;

		SELECT detectLanguage('');

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

yakov-olkhovskiy commented May 1, 2026 • edited by robot-clickhouse Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Documentation entry for user-facing changes

Version info

Uh oh!

CLAassistant commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

clickhouse-gh Bot commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Final Verdict

Uh oh!

Uh oh!

alexey-milovidov commented May 1, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

yakov-olkhovskiy commented May 2, 2026

Uh oh!

vdimir May 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yakov-olkhovskiy May 5, 2026

Choose a reason for hiding this comment

Uh oh!

yakov-olkhovskiy May 5, 2026

Choose a reason for hiding this comment

Uh oh!

yakov-olkhovskiy May 5, 2026

Choose a reason for hiding this comment

Uh oh!

yakov-olkhovskiy May 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yakov-olkhovskiy May 5, 2026

Choose a reason for hiding this comment

Uh oh!

vdimir May 7, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

yakov-olkhovskiy May 7, 2026

Choose a reason for hiding this comment

Uh oh!

alexey-milovidov May 10, 2026

Choose a reason for hiding this comment

Uh oh!

clickhouse-gh Bot commented May 12, 2026

LLVM Coverage Report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

yakov-olkhovskiy commented May 1, 2026 •

edited by robot-clickhouse

Loading

CLAassistant commented May 1, 2026 •

edited

Loading

clickhouse-gh Bot commented May 1, 2026 •

edited

Loading

vdimir May 4, 2026 •

edited

Loading

yakov-olkhovskiy May 5, 2026 •

edited

Loading

vdimir May 7, 2026 •

edited

Loading