Fix `MultiVolnitsky` fallback for failed UTF-8 ngram inserts by yakov-olkhovskiy · Pull Request #103864 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix MultiVolnitsky fallback for failed UTF-8 ngram inserts#103864

Merged
yakov-olkhovskiy merged 17 commits into
masterfrom
cursor/detectlanguage-empty-guard-461b
May 13, 2026
Merged

Fix MultiVolnitsky fallback for failed UTF-8 ngram inserts#103864
yakov-olkhovskiy merged 17 commits into
masterfrom
cursor/detectlanguage-empty-guard-461b

Conversation

@yakov-olkhovskiy

@yakov-olkhovskiy yakov-olkhovskiy commented May 1, 2026

Copy link
Copy Markdown
Member

The PR now contains only the MultiVolnitsky fix.

For UTF-8 case-insensitive search paths, a failed putNGram call could leave partial hash inserts for a needle and continue with inconsistent state. This change handles the failure explicitly: it rolls back hashes inserted for the current needle and routes that needle to fallback searchers.

Context from CI investigations: #99740 and this PR thread #103864.

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix MultiVolnitsky UTF-8 case-insensitive search handling by rolling back partial putNGram inserts on failure and switching failed needles to fallback searchers to avoid inconsistent state and sanitizer exceptions.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)
Open in Web Open in Cursor 

Version info

  • Merged into: 26.5.1.612

Avoid calling CLD2 language detection on empty UTF-8 strings in detectLanguage and detectLanguageMixed. This prevents a MemorySanitizer use-of-uninitialized-value in UTF8OneCharLen observed in PR CI run #99740.

Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
@CLAassistant

CLAassistant commented May 1, 2026

Copy link
Copy Markdown

@clickhouse-gh

clickhouse-gh Bot commented May 1, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [9fb0cb7]

Summary:


AI Review

Summary

This PR hardens UTF-8 case-insensitive MultiVolnitsky search by adding bounds validation before reading converted UTF-8 n-gram bytes and by rolling back partially inserted hash cells when a needle cannot be converted, then routing that needle to fallback search. I did not find unresolved correctness, safety, or performance issues in the current diff.

Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-improvement Pull request with some product improvements label May 1, 2026
@yakov-olkhovskiy yakov-olkhovskiy marked this pull request as ready for review May 1, 2026 15:54
Comment thread src/Functions/FunctionsLanguageClassification.cpp Outdated
@alexey-milovidov

Copy link
Copy Markdown
Member

@yakov-olkhovskiy, this needs a test.

cursoragent and others added 2 commits May 1, 2026 16:21
Cover empty strings for detectLanguage and detectLanguageMixed, including an empty value produced via reinterpret/toString path that triggered CI sanitizer failure context in #99740.

Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
Keep CLD2 bypass for empty input to avoid sanitizer exceptions while preserving existing detectLanguageMixed behavior: empty strings should return {} rather than {'un':0}. Follow-up to CI investigation in #99740 and review in #103864 (comment).

Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
cursor Bot pushed a commit that referenced this pull request May 1, 2026
Keep CLD2 bypass for empty strings to avoid MemorySanitizer exception, but preserve existing detectLanguageMixed behavior by returning an empty map for empty input instead of {'un':0}. Related CI context: #99740 and follow-up discussion: #103864 (comment).

Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
Comment thread tests/queries/0_stateless/04146_detect_language_empty_inputs.reference Outdated
cursoragent and others added 2 commits May 1, 2026 16:36
Align reference output with query order in 04146_detect_language_empty_inputs.sql: two detectLanguage string results first, then two detectLanguageMixed map results.

Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
Align with existing NLP classification tests: detectLanguage/detectLanguageMixed depend on cld2 + nlp-data and are not available in Fast test jobs.

Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
Fix an MSan use-of-uninitialized-value path in UTF-8 case-insensitive ngram generation by validating seq offset bounds before byte reads in VolnitskyTraits::putNGramUTF8CaseInsensitive.

Also make MultiVolnitskyBase::hasMoreToSearch honor putNGram failure for a needle: rollback partial hash inserts for that needle and move it to fallback searchers, matching single-needle Volnitsky behavior.

Related CI failure context:
#103864
#99740

Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
Comment thread src/Common/Volnitsky.h Outdated
cursoragent and others added 2 commits May 2, 2026 16:42
Add a deterministic regression check in 02364_multiSearch_function_family that exercises UTF-8 case-insensitive multi-search fallback handling path and verifies stable successful execution.

Related CI context:
#103864

Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
Capture rollback vector by reference in hasMoreToSearch callback so the rollback path compiles and receives inserted cell indices.

Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
Comment thread tests/queries/0_stateless/02364_multiSearch_function_family.sql Outdated
cursoragent and others added 2 commits May 2, 2026 16:49
Use a concrete expected-value check for case-insensitive UTF-8 multi-search regression coverage.

Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
@cursor cursor Bot changed the title Guard empty NLP language inputs before CLD2 calls Harden NLP empty-input handling and MultiVolnitsky UTF-8 fallback May 2, 2026
Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
@yakov-olkhovskiy yakov-olkhovskiy requested a review from vdimir May 2, 2026 17:21
@yakov-olkhovskiy

Copy link
Copy Markdown
Member Author

@vdimir could you please review this - we with AI :) have found unit test failure in Volnitsky's search
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=103864&sha=430df9772ba27a96a2d1bdae93cfd4b2119bd374&name_0=PR&name_1=Unit+tests+%28msan%2C+function_prop_fuzzer%29
along with inconsistent behavior in multi-search
I hope you are the right person to ping on this issue

@vdimir vdimir self-assigned this May 4, 2026

SET allow_experimental_nlp_functions = 1;

SELECT detectLanguage('');

@vdimir vdimir May 4, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did any of those test reproduce bug for you? I just tried with msan build, all queries passes: https://pastila.nl/?00158282/c0b0d9427139643194bf2dca658d483e#o62SpnRYdvfXWD8o5lP6Gg==GCM

Maybe it can be reproduced with unit test instead? Or it should be case with illegal utf-8?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hm... the issue was caught by existing unit test here:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=99740&sha=24cadae80401a21c72645ca905fc85fe0d23fac0&name_0=PR
then I proceeded with Cursor making (at the time obvious) fix and producing this test... but now I'm tracing back the call stack and see that there is already a guard against zero-sized string:
https://github.com/ClickHouse/cld2/blob/217ba8b8805b41557faadaa47bb6e99f2242eea3/internal/compact_lang_det_impl.cc#L1734
...my suspicion now is that it's not zero length string but some kind of broken UTF-8...

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've converted this PR to draft - this fix is not proper apparently

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after some tinkering I was able to catch and reproduce it with next:

SET allow_experimental_nlp_functions = 1;
SELECT detectLanguageMixed(materialize(CAST('Ðь' AS String)));
-- or
SELECT detectLanguageMixed(materialize(CAST(unhex('C390D18C') AS String)));

@yakov-olkhovskiy yakov-olkhovskiy May 5, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I completely removed this fix and test since bug in cld2 library itself - will fix it there
@vdimir could you please review Volnitsky fix - seems you are the last person who dealt with it

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vdimir vdimir May 7, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What issue current changes addresses once cld2 is patched, do we have some example?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yakov-olkhovskiy yakov-olkhovskiy marked this pull request as draft May 5, 2026 00:11
@cursor cursor Bot changed the title Harden NLP empty-input handling and MultiVolnitsky UTF-8 fallback Fix MultiVolnitsky fallback for failed UTF-8 ngram inserts May 5, 2026
Comment thread src/Common/Volnitsky.h Outdated
}

void putNGramBase(const VolnitskyTraits::Ngram ngram, const int offset, const size_t num)
void putNGramBase(const VolnitskyTraits::Ngram ngram, const int offset, const size_t num, std::vector<size_t> * inserted_cells = nullptr)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment about inserted_cells, and better to avoid default values.

alexey-milovidov and others added 2 commits May 10, 2026 12:01
- Drop the `nullptr` default for the `inserted_cells` parameter in the
  multi-needle `putNGramBase`. There is exactly one caller and it always
  passes a real container, so the parameter becomes a non-owning
  reference rather than an optional pointer.
- Add a doc comment describing the rollback contract `inserted_cells`
  participates in: the caller uses it to undo partially-inserted hashes
  for a needle whose later n-gram fails to convert, preventing the
  hash table from leaking into searches for other needles.

Per #103864 (comment)... .

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexey-milovidov alexey-milovidov marked this pull request as ready for review May 12, 2026 20:30
@alexey-milovidov alexey-milovidov self-assigned this May 12, 2026
@clickhouse-gh

clickhouse-gh Bot commented May 12, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.10% 84.10% +0.00%
Functions 91.10% 91.10% +0.00%
Branches 76.50% 76.50% +0.00%

Changed lines: 91.43% (32/35) · Uncovered code

Full report · Diff report

@yakov-olkhovskiy yakov-olkhovskiy added this pull request to the merge queue May 13, 2026
Merged via the queue into master with commit a5438f8 May 13, 2026
165 checks passed
@yakov-olkhovskiy yakov-olkhovskiy deleted the cursor/detectlanguage-empty-guard-461b branch May 13, 2026 13:26
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-synced-to-cloud The PR is synced to the cloud repo label May 13, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-improvement Pull request with some product improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants