Make the text index postprocessor compatible with positional phrase search (positions = 1 / hasPhrase). #108432
Ergus wants to merge 5 commits into
1. Add hasPhrase to needApplyPostprocessor.
2. Process the haystack by postprocessing its tokens and rejoining them.
3. Add the postprocessing code to the has_positions branch of the condition.
4. Use a counter to set positions in addDocumentsFromArray.
/// separator; the function re-tokenizes them. Tokens the postprocessor dropped are empty array
/// elements that become adjacent separators and produce no token on re-split, reproducing the
/// index's dense position sequence. hasAnyTokens/hasAllTokens accept the Array(String) directly.
if (function_name == "hasToken" || function_name == "hasPhrase")
This rewrite is unsafe for `hasToken` when the index tokenizer can emit tokens that are not valid `hasToken` tokens, for example `tokenizer = 'array'` or `splitByString` on values like `Foo Bar` / `foo-bar`. The index build stores the whole postprocessed token (`foo bar`), but this fallback rejoins the token array and `hasToken` then re-tokenizes it with `splitByNonAlpha`; `hasToken(val, 'Foo')` on an unmaterialized part can return true while the materialized text-index condition looks for `foo` and prunes the part because only `foo bar` is stored. That breaks the contract that direct read on/off and mixed materialization agree. Please either avoid using the text index for `hasToken` unless the index/postprocessor guarantees `hasToken`-safe tokens, or make the fallback and index lookup preserve the same whole-token semantics.
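To make the disagreement concrete, here is a minimal standalone sketch. It is not the ClickHouse code: `lowerToken`, `splitByNonAlphaSketch`, `indexLookup` and `fallbackHasToken` are hypothetical stand-ins, and I am assuming a lowercasing postprocessor and a whole-element `array` tokenizer.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for a lowercasing postprocessor.
std::string lowerToken(std::string s)
{
    std::transform(s.begin(), s.end(), s.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return s;
}

// Simplified stand-in for splitByNonAlpha: maximal runs of ASCII alphanumerics.
std::vector<std::string> splitByNonAlphaSketch(const std::string & text)
{
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text)
    {
        if (std::isalnum(c))
            current.push_back(static_cast<char>(c));
        else if (!current.empty())
        {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        tokens.push_back(current);
    return tokens;
}

// Materialized path (assumed): exact lookup of the postprocessed needle among stored whole tokens.
bool indexLookup(const std::vector<std::string> & stored, const std::string & needle)
{
    return std::find(stored.begin(), stored.end(), lowerToken(needle)) != stored.end();
}

// Fallback path as described above: join the stored tokens with a space, then re-tokenize.
bool fallbackHasToken(const std::vector<std::string> & stored, const std::string & needle)
{
    std::string joined;
    for (size_t i = 0; i < stored.size(); ++i)
    {
        if (i != 0)
            joined += ' ';
        joined += stored[i];
    }
    const auto retokenized = splitByNonAlphaSketch(joined);
    return std::find(retokenized.begin(), retokenized.end(), lowerToken(needle)) != retokenized.end();
}

// One array element "Foo Bar" is stored as the single token "foo bar".
void demoHasTokenDisagreement()
{
    const std::vector<std::string> stored{lowerToken("Foo Bar")};
    assert(!indexLookup(stored, "Foo"));     // materialized part: pruned
    assert(fallbackHasToken(stored, "Foo")); // unmaterialized part: matches
}
```

Under these assumptions the two paths return different answers for the same row, which is the inconsistency I am worried about.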
**Function support**.

For predicates that consult the text index, the preprocessor and postprocessor are applied to the search value before the granule-level check so that the index lookup uses the same tokens that were stored at index build.
For most functions (`=`, `IN`, `hasPhrase`, `startsWith`, `endsWith`, `LIKE`, `mapContains*`), the text index is used only to skip irrelevant data blocks; ClickHouse still verifies each surviving row using the original predicate against the original column data.
This sentence includes `hasPhrase`, but the implementation no longer verifies `hasPhrase` against the original column semantics: `processTextIndexFunction` rewrites the haystack and phrase through the tokenizer/postprocessor, and with `positions = 1` exact direct read can replace the predicate entirely. For example, a stop-word postprocessor can make `hasPhrase(col, 'see cat')` match `see the cat`. Please remove `hasPhrase` from this list or document it as an exception whose semantics are the postprocessed token sequence.
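A small standalone sketch of the stop-word case, to show why the documented "original predicate" wording no longer holds. The helpers (`splitOnWhitespace`, `dropStopWords`, `containsPhrase`, `hasPhraseRaw`, `hasPhrasePostprocessed`) are hypothetical and only model a whitespace tokenizer plus a stop-word postprocessor that leaves no gap for dropped tokens.

```cpp
#include <algorithm>
#include <cassert>
#include <set>
#include <sstream>
#include <string>
#include <vector>

// Simplified tokenizer: split on whitespace (stand-in for the index tokenizer).
std::vector<std::string> splitOnWhitespace(const std::string & text)
{
    std::istringstream in(text);
    std::vector<std::string> tokens;
    for (std::string token; in >> token;)
        tokens.push_back(token);
    return tokens;
}

// Hypothetical stop-word postprocessor: dropped tokens leave no gap,
// so the surviving tokens get dense (consecutive) positions.
std::vector<std::string> dropStopWords(const std::vector<std::string> & tokens,
                                       const std::set<std::string> & stop_words)
{
    std::vector<std::string> kept;
    for (const auto & token : tokens)
        if (stop_words.count(token) == 0)
            kept.push_back(token);
    return kept;
}

// Phrase match: the phrase tokens must appear as a contiguous run in the haystack.
// Returning false for an empty phrase is a choice made for this sketch only.
bool containsPhrase(const std::vector<std::string> & haystack, const std::vector<std::string> & phrase)
{
    if (phrase.empty())
        return false;
    return std::search(haystack.begin(), haystack.end(), phrase.begin(), phrase.end()) != haystack.end();
}

// Phrase match on the raw token sequence (original column semantics).
bool hasPhraseRaw(const std::string & text, const std::string & phrase)
{
    return containsPhrase(splitOnWhitespace(text), splitOnWhitespace(phrase));
}

// Phrase match on the postprocessed token sequence (what the PR implements, as I read it).
bool hasPhrasePostprocessed(const std::string & text, const std::string & phrase,
                            const std::set<std::string> & stop_words)
{
    return containsPhrase(dropStopWords(splitOnWhitespace(text), stop_words),
                          dropStopWords(splitOnWhitespace(phrase), stop_words));
}

void demoStopWordPhrase()
{
    const std::set<std::string> stop_words{"the"};
    assert(!hasPhraseRaw("see the cat", "see cat"));                      // raw tokens: no match
    assert(hasPhrasePostprocessed("see the cat", "see cat", stop_words)); // postprocessed: match
}
```

The two evaluators disagree on `see the cat`, so the docs should describe `hasPhrase` in terms of the postprocessed sequence if that is the intended behavior.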
/// separator; the function re-tokenizes them. Tokens the postprocessor dropped are empty array
/// elements that become adjacent separators and produce no token on re-split, reproducing the
/// index's dense position sequence. hasAnyTokens/hasAllTokens accept the Array(String) directly.
if (function_name == "hasToken" || function_name == "hasPhrase")
This row-scan/default-expression path does not preserve the index token stream for `hasPhrase`. After postprocessing, `new_children[0]` is already an `Array(String)`, but this joins it with a literal space and then calls `hasPhrase` with the original index tokenizer. That only round-trips for tokenizers that split on spaces. For example, with `tokenizer = splitByString(['|'])`, a row `a|b` is indexed as tokens `['a', 'b']`, but the fallback turns it into `a b`; `hasPhrase('a b', 'b', 'splitByString([''|''])')` sees one haystack token `a b` and rejects the row while the materialized positional index can match token `b`. The same artificial-token-stream problem applies to ngrams. Please keep the fallback on token-array semantics, or use a dedicated phrase-over-token-array evaluator, instead of re-tokenizing a space-joined string with the original tokenizer.
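The round-trip failure can be reproduced with a few lines of standalone code. This is only a sketch: `splitBySeparator`, `matchesPhrase`, `phraseOverTokenArray` and `phraseViaSpaceJoin` are hypothetical names, and I am assuming the separator tokenizer skips empty tokens.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Simplified stand-in for splitByString with a single one-character separator.
// Assumption for this sketch: empty tokens are skipped.
std::vector<std::string> splitBySeparator(const std::string & text, char separator)
{
    std::vector<std::string> tokens;
    std::string current;
    for (char c : text)
    {
        if (c == separator)
        {
            if (!current.empty())
                tokens.push_back(current);
            current.clear();
        }
        else
            current.push_back(c);
    }
    if (!current.empty())
        tokens.push_back(current);
    return tokens;
}

// Phrase match: phrase tokens must appear as a contiguous run of haystack tokens.
bool matchesPhrase(const std::vector<std::string> & haystack, const std::vector<std::string> & phrase)
{
    if (phrase.empty())
        return false; // choice made for this sketch only
    return std::search(haystack.begin(), haystack.end(), phrase.begin(), phrase.end()) != haystack.end();
}

// Token-array semantics: evaluate the phrase directly on the indexed token stream.
bool phraseOverTokenArray(const std::string & row, const std::string & phrase, char separator)
{
    return matchesPhrase(splitBySeparator(row, separator), splitBySeparator(phrase, separator));
}

// Fallback as described above: join tokens with a space, then re-tokenize with the original tokenizer.
bool phraseViaSpaceJoin(const std::string & row, const std::string & phrase, char separator)
{
    const auto tokens = splitBySeparator(row, separator);
    std::string joined;
    for (size_t i = 0; i < tokens.size(); ++i)
    {
        if (i != 0)
            joined += ' ';
        joined += tokens[i];
    }
    return matchesPhrase(splitBySeparator(joined, separator), splitBySeparator(phrase, separator));
}

void demoSpaceJoinRoundTrip()
{
    assert(phraseOverTokenArray("a|b", "b", '|')); // index view: tokens ['a', 'b'] contain 'b'
    assert(!phraseViaSpaceJoin("a|b", "b", '|'));  // fallback view: single token 'a b'
}
```

With a space separator both functions agree, which is why the issue only shows up for tokenizers that do not split on spaces.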
LLVM Coverage Report
Changed C/C++ lines covered by tests: 47/49 (95.92%). Lost baseline coverage: none.

Previously a text index could not be defined with both a postprocessor and `positions = 1`, and `hasPhrase` ignored the postprocessor entirely (it matched on raw tokens). This change lets the two work together: the per-token postprocessor (e.g. lowercasing, stemming, stop-word removal) is now applied consistently to `hasPhrase`, so a phrase query matches the postprocessed token sequence.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Text index postprocessor can now be combined with `positions = 1`, and `hasPhrase` applies the postprocessor to its argument so phrase search works over lowercased/stemmed/stop-word-filtered tokens.

Related: #98939
Related: #103172