Fix MultiVolnitsky fallback for failed UTF-8 ngram inserts#103864
Conversation
Avoid calling CLD2 language detection on empty UTF-8 strings in detectLanguage and detectLanguageMixed. This prevents a MemorySanitizer use-of-uninitialized-value in UTF8OneCharLen observed in PR CI run #99740. Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
|
Workflow [PR], commit [9fb0cb7] Summary: ✅ AI ReviewSummaryThis PR hardens UTF-8 case-insensitive Final Verdict
|
|
@yakov-olkhovskiy, this needs a test. |
Cover empty strings for detectLanguage and detectLanguageMixed, including an empty value produced via reinterpret/toString path that triggered CI sanitizer failure context in #99740. Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
Keep CLD2 bypass for empty input to avoid sanitizer exceptions while preserving existing detectLanguageMixed behavior: empty strings should return {} rather than {'un':0}. Follow-up to CI investigation in #99740 and review in #103864 (comment).
Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
Keep CLD2 bypass for empty strings to avoid MemorySanitizer exception, but preserve existing detectLanguageMixed behavior by returning an empty map for empty input instead of {'un':0}. Related CI context: #99740 and follow-up discussion: #103864 (comment).
Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
Align reference output with query order in 04146_detect_language_empty_inputs.sql: two detectLanguage string results first, then two detectLanguageMixed map results. Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
Align with existing NLP classification tests: detectLanguage/detectLanguageMixed depend on cld2 + nlp-data and are not available in Fast test jobs. Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
Fix an MSan use-of-uninitialized-value path in UTF-8 case-insensitive ngram generation by validating seq offset bounds before byte reads in VolnitskyTraits::putNGramUTF8CaseInsensitive. Also make MultiVolnitskyBase::hasMoreToSearch honor putNGram failure for a needle: rollback partial hash inserts for that needle and move it to fallback searchers, matching single-needle Volnitsky behavior. Related CI failure context: #103864 #99740 Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
Add a deterministic regression check in 02364_multiSearch_function_family that exercises UTF-8 case-insensitive multi-search fallback handling path and verifies stable successful execution. Related CI context: #103864 Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
Capture rollback vector by reference in hasMoreToSearch callback so the rollback path compiles and receives inserted cell indices. Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
Use a concrete expected-value check for case-insensitive UTF-8 multi-search regression coverage. Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
Co-authored-by: Yakov Olkhovskiy <yakov-olkhovskiy@users.noreply.github.com>
|
@vdimir could you please review this - we with AI :) have found unit test failure in Volnitsky's search |
|
|
||
| SET allow_experimental_nlp_functions = 1; | ||
|
|
||
| SELECT detectLanguage(''); |
There was a problem hiding this comment.
Did any of those test reproduce bug for you? I just tried with msan build, all queries passes: https://pastila.nl/?00158282/c0b0d9427139643194bf2dca658d483e#o62SpnRYdvfXWD8o5lP6Gg==GCM
Maybe it can be reproduced with unit test instead? Or it should be case with illegal utf-8?
There was a problem hiding this comment.
hm... the issue was caught by existing unit test here:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=99740&sha=24cadae80401a21c72645ca905fc85fe0d23fac0&name_0=PR
then I proceeded with Cursor making (at the time obvious) fix and producing this test... but now I'm tracing back the call stack and see that there is already a guard against zero-sized string:
https://github.com/ClickHouse/cld2/blob/217ba8b8805b41557faadaa47bb6e99f2242eea3/internal/compact_lang_det_impl.cc#L1734
...my suspicion now is that it's not zero length string but some kind of broken UTF-8...
There was a problem hiding this comment.
I've converted this PR to draft - this fix is not proper apparently
There was a problem hiding this comment.
after some tinkering I was able to catch and reproduce it with next:
SET allow_experimental_nlp_functions = 1;
SELECT detectLanguageMixed(materialize(CAST('Ðь' AS String)));
-- or
SELECT detectLanguageMixed(materialize(CAST(unhex('C390D18C') AS String)));
There was a problem hiding this comment.
ok, I completely removed this fix and test since bug in cld2 library itself - will fix it there
@vdimir could you please review Volnitsky fix - seems you are the last person who dealt with it
There was a problem hiding this comment.
There was a problem hiding this comment.
What issue current changes addresses once cld2 is patched, do we have some example?
There was a problem hiding this comment.
MultiVolnitsky fallback for failed UTF-8 ngram inserts
| } | ||
|
|
||
| void putNGramBase(const VolnitskyTraits::Ngram ngram, const int offset, const size_t num) | ||
| void putNGramBase(const VolnitskyTraits::Ngram ngram, const int offset, const size_t num, std::vector<size_t> * inserted_cells = nullptr) |
There was a problem hiding this comment.
Add a comment about inserted_cells, and better to avoid default values.
- Drop the `nullptr` default for the `inserted_cells` parameter in the multi-needle `putNGramBase`. There is exactly one caller and it always passes a real container, so the parameter becomes a non-owning reference rather than an optional pointer. - Add a doc comment describing the rollback contract `inserted_cells` participates in: the caller uses it to undo partially-inserted hashes for a needle whose later n-gram fails to convert, preventing the hash table from leaking into searches for other needles. Per #103864 (comment)... . Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ge-empty-guard-461b
…ge-empty-guard-461b
LLVM Coverage ReportChanged lines: 91.43% (32/35) · Uncovered code |

The PR now contains only the
MultiVolnitskyfix.For UTF-8 case-insensitive search paths, a failed
putNGramcall could leave partial hash inserts for a needle and continue with inconsistent state. This change handles the failure explicitly: it rolls back hashes inserted for the current needle and routes that needle to fallback searchers.Context from CI investigations: #99740 and this PR thread #103864.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix
MultiVolnitskyUTF-8 case-insensitive search handling by rolling back partialputNGraminserts on failure and switching failed needles to fallback searchers to avoid inconsistent state and sanitizer exceptions.Documentation entry for user-facing changes
Version info
26.5.1.612