
Ergus · 2026-06-25T01:18:26Z

Previously a text index could not be defined with both a postprocessor and positions = 1, and hasPhrase ignored the postprocessor entirely (it matched on raw tokens). This change lets the two work together: the per-token postprocessor (e.g. lowercasing, stemming, stop-word removal) is now applied consistently to hasPhrase, so a phrase query matches the postprocessed token sequence.

Changelog category (leave one):

Experimental Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Text index postprocessor can now be combined with positions = 1, and hasPhrase applies the postprocessor to its argument so phrase search works over lowercased/stemmed/stop-word-filtered tokens.

Related: #98939
Related: #103172

1. Add hasPhrase to needApplyPostprocessor.
2. Process the haystack by postprocessing and rejoining.
3. Add postprocess code to the has_positions branch in the condition.
4. Use a counter to set positions in addDocumentsFromArray.
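The counter in step 4 can be pictured with a small toy model. This is only a sketch of the idea under stated assumptions, not the ClickHouse implementation: `postprocess`, `build_postings`, and the stop-word list are hypothetical names, and the assumption (taken from the code comment quoted in the review) is that dropped tokens do not consume a position, so positions stay dense.

```python
from collections import defaultdict

STOP_WORDS = {"the", "a"}  # hypothetical stop-word list for this sketch


def postprocess(token):
    """Toy postprocessor: lowercase, and return '' for a dropped stop word."""
    lowered = token.lower()
    return "" if lowered in STOP_WORDS else lowered


def build_postings(documents):
    """Map each surviving token to a list of (doc_id, position) pairs."""
    postings = defaultdict(list)
    for doc_id, raw_tokens in enumerate(documents):
        position = 0  # counter advances only for tokens that survive
        for raw in raw_tokens:
            token = postprocess(raw)
            if not token:
                continue  # dropped token: no posting, no position consumed
            postings[token].append((doc_id, position))
            position += 1
    return dict(postings)


postings = build_postings([["See", "the", "Cat"], ["cat", "sees"]])
```

In this model `see` and `cat` end up at adjacent positions 0 and 1 in the first document even though `the` sat between them in the raw input, which is what lets a phrase lookup treat them as neighbours.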

clickhouse-gh · 2026-06-25T01:19:03Z

Workflow [PR], commit [e09c198]

Summary: ❌

Performance Comparison: Performance dashboard

| job_name | test_name | status | info | comment |
|---|---|---|---|---|
| Fast test (arm_darwin) | | FAIL | | |
| | 03135_keeper_client_find_commands | FAIL | cidb | |

AI Review

Summary

This PR removes the guard that prevented positions = 1 text indexes from using a postprocessor and rewrites hasPhrase so both the positional index path and row-scan fallback operate on postprocessed tokens. The current implementation still has unresolved correctness gaps where materialized direct reads, unmaterialized/default-expression reads, and direct-read-off execution can disagree, so the PR should not merge as-is.

Findings

❌ Blockers

[src/Processors/QueryPlan/Optimizations/optimizeDirectReadFromTextIndex.cpp:641] The hasToken fallback converts postprocessed index tokens back to a space-joined string and lets hasToken re-tokenize it with splitByNonAlpha. For index tokenizers/postprocessors that store whole tokens containing hasToken separators, such as array or splitByString with Foo Bar, unmaterialized/default-expression evaluation can match Foo while the materialized index stores only foo bar and prunes it. Keep the fallback and index lookup on the same whole-token semantics, or reject exact direct read for cases that cannot round-trip safely.
[src/Processors/QueryPlan/Optimizations/optimizeDirectReadFromTextIndex.cpp:641] The new hasPhrase fallback has the same token-stream mismatch for supported tokenizers that do not split the artificial space-joined string back into the original tokens. With tokenizer = splitByString(['|']), a row a|b is indexed as ['a', 'b'], but the fallback evaluates hasPhrase over a b with the splitByString(['|']) tokenizer and sees one token, so direct-read-off or fallback execution can reject rows that the positional index returns. Use token-array phrase matching or another representation that preserves the postprocessed token sequence instead of re-tokenizing a joined string.

⚠️ Majors

[docs/en/engines/table-engines/mergetree-family/textindexes.md:433] The docs still say hasPhrase is only used as a skipping hint and rows are verified against the original predicate, but this PR intentionally rewrites hasPhrase to postprocessed-token semantics and exact positional direct read can replace the predicate entirely. Update the text index docs, and the hasPhrase source-level function documentation, to describe the postprocessed token-sequence semantics and stop-word/dropped-token behavior.

Tests

⚠️ Add focused coverage for the fallback/direct-read divergence cases above: at minimum compare query_plan_direct_read_from_text_index = 0 and 1, plus a partially materialized index, for hasPhrase with splitByString(['|']) and a postprocessor, and for hasToken where the index stores a whole postprocessed token that hasToken would split.

Final Verdict

Status: ❌ Block

Minimum required actions: fix the fallback/direct-read semantic mismatches for hasToken and hasPhrase, update the documentation contract for hasPhrase, and add regression coverage for the affected tokenizer/postprocessor combinations.

clickhouse-gh · 2026-06-25T01:26:49Z

+            /// separator; the function re-tokenizes them. Tokens the postprocessor dropped are empty array
+            /// elements that become adjacent separators and produce no token on re-split, reproducing the
+            /// index's dense position sequence. hasAnyTokens/hasAllTokens accept the Array(String) directly.
+            if (function_name == "hasToken" || function_name == "hasPhrase")


This rewrite is unsafe for hasToken when the index tokenizer can emit tokens that are not valid hasToken tokens, for example tokenizer = 'array' or splitByString on values like Foo Bar / foo-bar. The index build stores the whole postprocessed token (foo bar), but this fallback rejoins the token array and hasToken then re-tokenizes it with splitByNonAlpha; hasToken(val, 'Foo') on an unmaterialized part can return true while the materialized text-index condition looks for foo and prunes the part because only foo bar is stored. That breaks the contract that direct read on/off and mixed materialization agree. Please either avoid using the text index for hasToken unless the index/postprocessor guarantees hasToken-safe tokens, or make the fallback and index lookup preserve the same whole-token semantics.
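The disagreement described above can be reproduced with a toy model. This is a hedged sketch, not ClickHouse code: `split_by_non_alpha` is a simplified stand-in for splitByNonAlpha, the postprocessor is assumed to be plain lowercasing, and `index_lookup` / `fallback_has_token` are hypothetical names for the two evaluation paths.

```python
import re


def split_by_non_alpha(text):
    """Toy stand-in for splitByNonAlpha: maximal runs of ASCII alphanumerics."""
    return re.findall(r"[A-Za-z0-9]+", text)


def index_lookup(array_value, needle):
    """Index path: each array element is stored as one whole lowercased token."""
    stored = {element.lower() for element in array_value}
    return needle.lower() in stored


def fallback_has_token(array_value, needle):
    """Fallback path: rejoin postprocessed tokens with spaces, then re-tokenize."""
    joined = " ".join(element.lower() for element in array_value)
    return needle.lower() in split_by_non_alpha(joined)


row = ["Foo Bar"]  # tokenizer = 'array': the whole element is a single token
index_result = index_lookup(row, "Foo")           # looks for 'foo' among {'foo bar'}
fallback_result = fallback_has_token(row, "Foo")  # 'foo bar' re-splits into 'foo', 'bar'
```

Here the index path prunes the row while the fallback accepts it, which is the on/off divergence the comment asks to eliminate.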

clickhouse-gh · 2026-06-25T01:26:49Z

+**Function support**.
+
+For predicates that consult the text index, the preprocessor and postprocessor are applied to the search value before the granule-level check so that the index lookup uses the same tokens that were stored at index build.
+For most functions (`=`, `IN`, `hasPhrase`, `startsWith`, `endsWith`, `LIKE`, `mapContains*`), the text index is used only to skip irrelevant data blocks; ClickHouse still verifies each surviving row using the original predicate against the original column data.


This sentence includes hasPhrase, but the implementation no longer verifies hasPhrase against the original column semantics: processTextIndexFunction rewrites the haystack and phrase through the tokenizer/postprocessor, and with positions = 1 exact direct read can replace the predicate entirely. For example, a stop-word postprocessor can make hasPhrase(col, 'see cat') match see the cat. Please remove hasPhrase from this list or document it as an exception whose semantics are the postprocessed token sequence.
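The `see cat` example can be checked against a toy model of phrase matching. This is a sketch under assumptions, not the ClickHouse evaluator: tokens are split on whitespace, the postprocessor lowercases and drops the single hypothetical stop word `the`, and `contains_phrase` is an illustrative helper.

```python
STOP_WORDS = {"the"}  # hypothetical stop-word list for this sketch


def tokens_for(text):
    """Whitespace tokenizer plus a toy postprocessor (lowercase, drop stop words)."""
    return [w.lower() for w in text.split() if w.lower() not in STOP_WORDS]


def contains_phrase(haystack_tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run in haystack_tokens."""
    n = len(phrase_tokens)
    if n == 0:
        return False  # arbitrary choice for this sketch
    return any(
        haystack_tokens[i:i + n] == phrase_tokens
        for i in range(len(haystack_tokens) - n + 1)
    )


# Original-column semantics: 'the' separates 'see' and 'cat', so no match.
raw_match = contains_phrase("see the cat".split(), "see cat".split())

# Postprocessed semantics: 'the' is dropped and the two tokens become adjacent.
postprocessed_match = contains_phrase(tokens_for("See the cat"), tokens_for("see cat"))
```

The two results differ, so a row that the original predicate would reject is accepted under postprocessed-token semantics, which is why the documentation sentence no longer holds for hasPhrase.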


clickhouse-gh · 2026-06-25T09:53:59Z

+            /// separator; the function re-tokenizes them. Tokens the postprocessor dropped are empty array
+            /// elements that become adjacent separators and produce no token on re-split, reproducing the
+            /// index's dense position sequence. hasAnyTokens/hasAllTokens accept the Array(String) directly.
+            if (function_name == "hasToken" || function_name == "hasPhrase")


This row-scan/default-expression path does not preserve the index token stream for hasPhrase. After postprocessing, new_children[0] is already an Array(String), but this joins it with a literal space and then calls hasPhrase with the original index tokenizer. That only round-trips for tokenizers that split on spaces. For example, with tokenizer = splitByString(['|']), a row a|b is indexed as tokens ['a', 'b'], but the fallback turns it into a b; hasPhrase('a b', 'b', 'splitByString([''|''])') sees one haystack token a b and rejects the row while the materialized positional index can match token b. The same artificial-token-stream problem applies to ngrams. Please keep the fallback on token-array semantics, or use a dedicated phrase-over-token-array evaluator, instead of re-tokenizing a space-joined string with the original tokenizer.
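The round-trip failure for `splitByString(['|'])` can be illustrated with a toy model. This is a sketch only: `split_by_string` is a simplified stand-in for the real tokenizer (it splits on each separator and drops empty pieces), and `contains_phrase` is a hypothetical helper for contiguous-subsequence matching.

```python
def split_by_string(text, separators):
    """Toy splitByString tokenizer: split on any separator, drop empty pieces."""
    pieces = [text]
    for sep in separators:
        pieces = [part for piece in pieces for part in piece.split(sep)]
    return [p for p in pieces if p]


def contains_phrase(haystack_tokens, phrase_tokens):
    """True if phrase_tokens occurs as a contiguous run in haystack_tokens."""
    n = len(phrase_tokens)
    return n > 0 and any(
        haystack_tokens[i:i + n] == phrase_tokens
        for i in range(len(haystack_tokens) - n + 1)
    )


SEPARATORS = ["|"]
indexed_tokens = split_by_string("a|b", SEPARATORS)  # what the index stores
phrase_tokens = split_by_string("b", SEPARATORS)

# Positional index path: matching runs directly over the stored token sequence.
index_match = contains_phrase(indexed_tokens, phrase_tokens)

# Fallback path: join with a space, then re-tokenize with the index tokenizer.
rejoined = " ".join(indexed_tokens)
fallback_tokens = split_by_string(rejoined, SEPARATORS)
fallback_match = contains_phrase(fallback_tokens, phrase_tokens)
```

Because `'|'` is the only separator, the rejoined string `a b` comes back as a single token, so the fallback rejects a row that the index accepts. Matching over the token array directly, as the comment suggests, avoids the re-tokenization step altogether.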

clickhouse-gh · 2026-06-25T13:56:38Z

LLVM Coverage Report

| Metric | Baseline | Current | Δ |
|---|---|---|---|
| Lines | 85.40% | 85.40% | +0.00% |
| Functions | 92.60% | 92.60% | +0.00% |
| Branches | 77.60% | 77.60% | +0.00% |

Changed lines: Changed C/C++ lines covered by tests: 47/49 (95.92%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

Ergus · 2026-06-26T13:52:03Z

Ergus added 4 commits June 25, 2026 00:45

Remove obsolete guard and test

444a10c

Add new tests that must success after changes

bc74a11

Make hasPhrase work with postprocessor

85bf0ef

1. Add hasPhrase to needApplyPostprocessor.
2. Process the haystack by postprocessing and rejoining.
3. Add postprocess code to the has_positions branch in the condition.
4. Use a counter to set positions in addDocumentsFromArray.

Update the stem test for new behavior

ef4dfc7

clickhouse-gh Bot added the pr-experimental Experimental Feature label Jun 25, 2026

clickhouse-gh Bot reviewed Jun 25, 2026


Merge remote-tracking branch 'origin/master' into Compat_hasPhrase_po…

e09c198

…stprocessor

clickhouse-gh Bot reviewed Jun 25, 2026


Ergus mentioned this pull request Jun 26, 2026

Text index postprocessor resubmit #108606

Merged

Ergus closed this Jun 26, 2026

Ergus deleted the Compat_hasPhrase_postprocessor branch June 26, 2026 13:52

Make the text index postprocessor compatible with positional phrase search (positions = 1 / hasPhrase).#108432

Ergus wants to merge 5 commits into ClickHouse:master from Ergus:Compat_hasPhrase_postprocessor
