Make the text index postprocessor compatible with positional phrase search (positions = 1 / hasPhrase). by Ergus · Pull Request #108432 · ClickHouse/ClickHouse · GitHub

Make the text index postprocessor compatible with positional phrase search (positions = 1 / hasPhrase). #108432

Closed
Ergus wants to merge 5 commits into ClickHouse:master from Ergus:Compat_hasPhrase_postprocessor

Ergus commented Jun 25, 2026

Member

Previously a text index could not be defined with both a postprocessor and positions = 1, and hasPhrase ignored the postprocessor entirely (it matched on raw tokens). This change lets the two work together: the per-token postprocessor (e.g. lowercasing, stemming, stop-word removal) is now applied consistently to hasPhrase, so a phrase query matches the postprocessed token sequence.

Changelog category (leave one):

  • Experimental Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Text index postprocessor can now be combined with positions = 1, and hasPhrase applies the postprocessor to its argument so phrase search works over lowercased/stemmed/stop-word-filtered tokens.

Related: #98939
Related: #103172

Ergus added 4 commits June 25, 2026 00:45
1. Add hasPhrase to needApplyPostprocessor
2. Process haystack by postprocessing and rejoining.
3. Add postprocess code to has_positions branch in condition
4. Use a counter to set positions when addDocumentsFromArray.
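
As an illustration of commit 4, here is a minimal Python sketch (not ClickHouse code) of one reading of "use a counter to set positions". It assumes that tokens dropped by the postprocessor arrive as empty strings and do not advance the counter, which is what a dense position sequence would require. The function names and the sample tokens are hypothetical.

```python
def assign_dense_positions(tokens):
    """Map each kept token to its positions.

    Assumption: empty strings stand for tokens the postprocessor dropped,
    and they do not advance the position counter.
    """
    positions = {}
    counter = 0
    for token in tokens:
        if token == "":
            continue
        positions.setdefault(token, []).append(counter)
        counter += 1
    return positions


def rejoin_and_resplit(tokens, separator=" "):
    """Join with a separator and split again.

    Empty elements turn into adjacent separators, which yield no token.
    """
    return [piece for piece in separator.join(tokens).split(separator) if piece]


# "the" was dropped by a hypothetical stop-word postprocessor.
postprocessed = ["see", "", "cat"]
dense = assign_dense_positions(postprocessed)
resplit = rejoin_and_resplit(postprocessed)
```

Under these assumptions, counting only kept tokens and re-splitting the rejoined string give the same positions, provided the separator is one the re-tokenizer actually splits on.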
clickhouse-gh Bot commented Jun 25, 2026

Contributor

clickhouse-gh Bot added the pr-experimental (Experimental Feature) label Jun 25, 2026
/// separator; the function re-tokenizes them. Tokens the postprocessor dropped are empty array
/// elements that become adjacent separators and produce no token on re-split, reproducing the
/// index's dense position sequence. hasAnyTokens/hasAllTokens accept the Array(String) directly.
if (function_name == "hasToken" || function_name == "hasPhrase")

Contributor

This rewrite is unsafe for hasToken when the index tokenizer can emit tokens that are not valid hasToken tokens, for example tokenizer = 'array' or splitByString on values like Foo Bar / foo-bar. The index build stores the whole postprocessed token (foo bar), but this fallback rejoins the token array and hasToken then re-tokenizes it with splitByNonAlpha; hasToken(val, 'Foo') on an unmaterialized part can return true while the materialized text-index condition looks for foo and prunes the part because only foo bar is stored. That breaks the contract that direct read on/off and mixed materialization agree. Please either avoid using the text index for hasToken unless the index/postprocessor guarantees hasToken-safe tokens, or make the fallback and index lookup preserve the same whole-token semantics.
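
The disagreement described above can be reproduced with a small Python model (not ClickHouse code). It assumes a lowercasing postprocessor, treats `tokenizer = 'array'` as "each element is one whole token", and uses a regex as a rough stand-in for splitByNonAlpha. All function names are hypothetical.

```python
import re


def postprocess(token):
    # Hypothetical per-token postprocessor: lowercasing only.
    return token.lower()


def array_tokenizer(values):
    # Model of tokenizer = 'array': every array element is one whole token.
    return list(values)


def split_by_non_alpha(text):
    # Rough stand-in for splitByNonAlpha: split on non-alphanumeric ASCII.
    return [t for t in re.split(r"[^A-Za-z0-9]+", text) if t]


def build_index(values):
    # The index stores whole postprocessed tokens.
    return {postprocess(t) for t in array_tokenizer(values)}


def index_lookup(stored_tokens, needle):
    # Materialized part: look up the postprocessed needle as one token.
    return postprocess(needle) in stored_tokens


def fallback_has_token(values, needle):
    # Unmaterialized part: rejoin the postprocessed tokens with a space,
    # then let hasToken re-tokenize the result.
    rejoined = " ".join(postprocess(t) for t in array_tokenizer(values))
    return postprocess(needle) in split_by_non_alpha(rejoined)


values = ["Foo Bar"]
stored = build_index(values)                      # {"foo bar"}
via_index = index_lookup(stored, "Foo")           # index has no token "foo"
via_fallback = fallback_has_token(values, "Foo")  # re-split yields "foo"
```

In this model the fallback accepts the row while the index lookup would prune it, which is the mixed-materialization disagreement the comment points out.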

**Function support**.

For predicates that consult the text index, the preprocessor and postprocessor are applied to the search value before the granule-level check so that the index lookup uses the same tokens that were stored at index build.
For most functions (`=`, `IN`, `hasPhrase`, `startsWith`, `endsWith`, `LIKE`, `mapContains*`), the text index is used only to skip irrelevant data blocks; ClickHouse still verifies each surviving row using the original predicate against the original column data.

Contributor

This sentence includes hasPhrase, but the implementation no longer verifies hasPhrase against the original column semantics: processTextIndexFunction rewrites the haystack and phrase through the tokenizer/postprocessor, and with positions = 1 exact direct read can replace the predicate entirely. For example, a stop-word postprocessor can make hasPhrase(col, 'see cat') match see the cat. Please remove hasPhrase from this list or document it as an exception whose semantics are the postprocessed token sequence.
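
A short Python sketch (not ClickHouse code) of the `see cat` example: the same phrase check gives different answers on raw tokens and on postprocessed tokens. The whitespace tokenizer, the stop-word list, and the function names are assumptions made for illustration.

```python
STOP_WORDS = {"the", "a", "an"}  # hypothetical stop-word list


def tokenize(text):
    # Assumed whitespace tokenizer.
    return text.split()


def postprocess(tokens):
    # Hypothetical postprocessor: lowercase, then drop stop words.
    lowered = (t.lower() for t in tokens)
    return [t for t in lowered if t not in STOP_WORDS]


def contains_phrase(haystack_tokens, phrase_tokens):
    # True if phrase_tokens occurs as a contiguous run in haystack_tokens.
    # An empty phrase is treated as no match, an arbitrary choice here.
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(
        haystack_tokens[i:i + n] == phrase_tokens
        for i in range(len(haystack_tokens) - n + 1)
    )


row, phrase = "see the cat", "see cat"

# Original-column semantics: "the" sits between "see" and "cat".
raw_match = contains_phrase(tokenize(row), tokenize(phrase))

# Postprocessed semantics: "the" is gone, so the tokens become adjacent.
post_match = contains_phrase(
    postprocess(tokenize(row)), postprocess(tokenize(phrase))
)
```

Here `raw_match` is false and `post_match` is true, so a row-level re-check against the original column would not confirm what the postprocessed index reports.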

/// separator; the function re-tokenizes them. Tokens the postprocessor dropped are empty array
/// elements that become adjacent separators and produce no token on re-split, reproducing the
/// index's dense position sequence. hasAnyTokens/hasAllTokens accept the Array(String) directly.
if (function_name == "hasToken" || function_name == "hasPhrase")

Contributor

This row-scan/default-expression path does not preserve the index token stream for hasPhrase. After postprocessing, new_children[0] is already an Array(String), but this joins it with a literal space and then calls hasPhrase with the original index tokenizer. That only round-trips for tokenizers that split on spaces. For example, with tokenizer = splitByString(['|']), a row a|b is indexed as tokens ['a', 'b'], but the fallback turns it into a b; hasPhrase('a b', 'b', 'splitByString([''|''])') sees one haystack token a b and rejects the row while the materialized positional index can match token b. The same artificial-token-stream problem applies to ngrams. Please keep the fallback on token-array semantics, or use a dedicated phrase-over-token-array evaluator, instead of re-tokenizing a space-joined string with the original tokenizer.
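
The round-trip failure for `a|b` can be modeled in a few lines of Python (not ClickHouse code). `split_by_string` is a rough, hypothetical stand-in for a splitByString tokenizer, and `contains_phrase` is a plain contiguous-run check over token lists.

```python
def split_by_string(text, separators):
    # Rough stand-in for a splitByString tokenizer: split on each
    # separator in turn and drop empty pieces.
    tokens = [text]
    for sep in separators:
        tokens = [piece for tok in tokens for piece in tok.split(sep)]
    return [t for t in tokens if t]


def contains_phrase(haystack_tokens, phrase_tokens):
    # True if phrase_tokens occurs as a contiguous run in haystack_tokens.
    n = len(phrase_tokens)
    if n == 0:
        return False
    return any(
        haystack_tokens[i:i + n] == phrase_tokens
        for i in range(len(haystack_tokens) - n + 1)
    )


SEPARATORS = ["|"]
row = "a|b"
phrase_tokens = split_by_string("b", SEPARATORS)

# What the index stores for the row.
index_tokens = split_by_string(row, SEPARATORS)

# What the fallback sees: tokens joined with a space, then re-tokenized
# with the original tokenizer, which does not split on spaces.
rejoined = " ".join(index_tokens)
fallback_tokens = split_by_string(rejoined, SEPARATORS)

index_match = contains_phrase(index_tokens, phrase_tokens)
fallback_match = contains_phrase(fallback_tokens, phrase_tokens)
```

In this model the index side matches on token `b` while the fallback sees a single token `a b` and rejects the row. Evaluating the phrase directly over `index_tokens`, without the join and re-split, avoids the mismatch in this sketch.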

clickhouse-gh Bot commented Jun 25, 2026

Contributor

LLVM Coverage Report

| Metric | Baseline | Current | Δ |
|---|---|---|---|
| Lines | 85.40% | 85.40% | +0.00% |
| Functions | 92.60% | 92.60% | +0.00% |
| Branches | 77.60% | 77.60% | +0.00% |

Changed C/C++ lines covered by tests: 47/49 (95.92%). Lost baseline coverage: none.

Ergus commented Jun 26, 2026

Member Author

Ergus closed this Jun 26, 2026
Ergus deleted the Compat_hasPhrase_postprocessor branch June 26, 2026 13:52