Ergus · 2026-03-06T16:15:47Z

Parameters:

postprocessor : The postprocessor is an arbitrary expression that transforms each token after tokenization.

Example use:

CREATE TABLE users (
    str String,
    INDEX idx(str) TYPE text(tokenizer = 'splitByNonAlpha', postprocessor = lower(str))
)
ENGINE = MergeTree ORDER BY tuple();

SELECT * FROM users WHERE hasAnyTokens(str, 'crash');
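
The parameter description above can be modelled outside ClickHouse. The following is a minimal sketch, not the PR's implementation: `splitByNonAlpha`, `lowerToken`, `indexTokens` and `hasAnyTokens` are hypothetical stand-ins written for illustration, and the tokenizer is assumed to split on any byte that is not an ASCII letter or digit.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical stand-in for the splitByNonAlpha tokenizer (assumption:
// it splits on every byte that is not an ASCII letter or digit).
std::vector<std::string> splitByNonAlpha(const std::string & text)
{
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text)
    {
        if (std::isalnum(c))
            current.push_back(static_cast<char>(c));
        else if (!current.empty())
        {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        tokens.push_back(current);
    return tokens;
}

// Hypothetical stand-in for postprocessor = lower(str), applied to one token.
std::string lowerToken(std::string token)
{
    std::transform(token.begin(), token.end(), token.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return token;
}

// Tokenize first, then postprocess every token.
std::vector<std::string> indexTokens(const std::string & text)
{
    std::vector<std::string> tokens = splitByNonAlpha(text);
    for (auto & token : tokens)
        token = lowerToken(token);
    return tokens;
}

// Model of hasAnyTokens: the needle goes through the same pipeline as the row.
bool hasAnyTokens(const std::string & haystack, const std::string & needle)
{
    const std::vector<std::string> hay = indexTokens(haystack);
    for (const auto & n : indexTokens(needle))
        if (std::find(hay.begin(), hay.end(), n) != hay.end())
            return true;
    return false;
}
```

In this model a row such as `System CRASH at 10:42` matches the needle `crash` because both sides are lowercased after tokenization.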

Changelog category (leave one):

New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Added a postprocessor to the text index which transforms the tokens after tokenization.

Version info

Merged into: 26.7.1.47
Backported to: 26.6.1.1191

clickhouse-gh · 2026-03-06T16:16:26Z

Workflow [PR], commit [1deafd2]

AI Review

Summary

This PR adds a postprocessor option for text indexes and wires it into index build, index condition construction, direct-read rewrites, tests, and documentation. The main flow is close, but there are still unresolved result-correctness issues where hasToken can evaluate a different token set on the row-scan/direct-read fallback than the materialized text index path, plus several user-facing documentation mismatches around the same feature. Verdict: block until the hasToken equivalence problems are fixed or explicitly rejected for unsupported cases.

Findings

❌ Blockers

[src/Processors/QueryPlan/Optimizations/optimizeDirectReadFromTextIndex.cpp:645] hasToken fallback joins postprocessed haystack tokens with a space and then lets hasToken split that string again. If a postprocessor intentionally emits a token containing separators, the materialized index stores that value as one token but the fallback turns it into multiple tokens. For example, with postprocessor = if(val = 'foo', 'bar baz', val), a row containing foo can match hasToken(val, 'baz') on the fallback path even though the built index only contains token bar baz. Existing thread: Text index postprocessor #98939 (comment). Suggested fix: keep the fallback on Array(String) with exact element semantics, or reject/disable the optimized hasToken path when final postprocessed tokens may contain separators.
[src/Processors/QueryPlan/Optimizations/optimizeDirectReadFromTextIndex.cpp:656] [dismissed by author -- https://github.com/ClickHouse/ClickHouse/pull/98939#discussion_r3407943138] The fallback postprocesses the raw hasToken needle as one token, while the materialized index path tokenizes first and postprocesses every generated token. This still makes non-splitByNonAlpha tokenizers such as ngrams(3) disagree between materialized-index and fallback/direct-read paths; a query like hasToken(val, 'hello') with postprocessor = substring(val, 1, 2) can compare only he on the fallback path instead of the indexed token set he, el, ll. The linked follow-up is not in this PR, and current code still marks hasToken as exact for all tokenizers. Suggested fix: tokenize before postprocessing and express the fallback as an all-token check, or disable hasToken text-index optimization outside tokenizers whose single-token contract is actually valid.
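
The first blocker can be reproduced with a small model. This is a sketch under stated assumptions, not code from the PR: `splitOnSpaces` is a hypothetical whitespace tokenizer, `postprocess` mimics `if(val = 'foo', 'bar baz', val)`, and `fallbackTokens` imitates the join-then-split behaviour the finding describes.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Hypothetical whitespace splitter standing in for the tokenizer.
Tokens splitOnSpaces(const std::string & text)
{
    std::istringstream in(text);
    Tokens tokens;
    for (std::string t; in >> t;)
        tokens.push_back(t);
    return tokens;
}

// Models postprocessor = if(val = 'foo', 'bar baz', val).
std::string postprocess(const std::string & token)
{
    return token == "foo" ? "bar baz" : token;
}

// Materialized path: every postprocessed token is stored as one element.
Tokens indexedTokens(const std::string & row)
{
    Tokens out;
    for (const auto & t : splitOnSpaces(row))
        out.push_back(postprocess(t));
    return out;
}

// Fallback path as described in the finding: join with a space, split again.
Tokens fallbackTokens(const std::string & row)
{
    std::string joined;
    for (const auto & t : indexedTokens(row))
    {
        if (!joined.empty())
            joined += ' ';
        joined += t;
    }
    return splitOnSpaces(joined);
}

// Exact element lookup, the semantics the suggested fix asks to preserve.
bool contains(const Tokens & tokens, const std::string & needle)
{
    return std::find(tokens.begin(), tokens.end(), needle) != tokens.end();
}
```

For the row `foo`, `contains(indexedTokens("foo"), "baz")` is false while `contains(fallbackTokens("foo"), "baz")` is true, which is the divergence reported above.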

⚠️ Majors

[docs/en/engines/table-engines/mergetree-family/textindexes.md:327] The docs say hasToken, hasAllTokens, and hasAnyTokens work with any postprocessor and apply it to both haystack and search needle, but they omit the current hasToken restriction that a postprocessed needle containing separators throws BAD_ARGUMENTS. Existing thread: Text index postprocessor #98939 (comment). Suggested fix: document the restriction precisely, or change the implementation so the documented contract is true.
[src/Functions/hasAnyAllTokens.cpp:455,598] The function documentation says an index-defined postprocessor is applied to hasAnyTokens and hasAllTokens needle/input tokens in general. The SQL function only receives the tokenizer argument; index-specific postprocessing is injected by the text-index rewrite/default-expression path, so use_skip_indexes = 0 or a query without a usable index cannot apply that postprocessor. Existing thread: Text index postprocessor #98939 (comment). Suggested fix: scope this wording to the text-index path and spell out the same limitation as the preprocessor note.
[src/Functions/hasTokenCaseInsensitive.cpp:33] The source docs only say hasTokenCaseInsensitive has pitfalls with non-default tokenizers and transforms, but do not state the operational rule users need: this function is not text-index-aware for the new preprocessor/postprocessor rewrite path. Existing thread: Text index postprocessor #98939 (comment). Suggested fix: explicitly direct users to hasAnyTokens or hasAllTokens when they expect index-defined transforms to be applied.
[docs/en/engines/table-engines/mergetree-family/textindexes.md:435] The docs say postprocessor-mapped empty search tokens are ignored, but the current behavior for an all-empty search is no match, not a vacuous match. Existing thread: Text index postprocessor #98939 (comment). Suggested fix: document the all-empty case and the observable result.
[docs/en/engines/table-engines/mergetree-family/textindexes.md:426] The stemming example says a query token stemmed to run matches rows containing ran; the English stem behavior does not conflate ran with run. Existing thread: Text index postprocessor #98939 (comment). Suggested fix: remove ran from the example or choose examples that the documented stemmer actually normalizes together.
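
The all-empty search case from the findings above is easy to get wrong, because an "all tokens present" check over an empty needle set is vacuously true. A minimal sketch of the documented behaviour, assuming hypothetical helpers `dropEmpty` and `hasAllTokensModel` that are not part of ClickHouse:

```cpp
#include <algorithm>
#include <iterator>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Search tokens that the postprocessor mapped to the empty string are ignored.
Tokens dropEmpty(const Tokens & tokens)
{
    Tokens out;
    std::copy_if(tokens.begin(), tokens.end(), std::back_inserter(out),
                 [](const std::string & t) { return !t.empty(); });
    return out;
}

// Model of an all-tokens check over already postprocessed needle tokens.
bool hasAllTokensModel(const Tokens & haystack, const Tokens & postprocessedNeedle)
{
    const Tokens needle = dropEmpty(postprocessedNeedle);

    // Special case: if nothing survives, report no match. Without this
    // branch std::all_of over an empty range would return true.
    if (needle.empty())
        return false;

    return std::all_of(needle.begin(), needle.end(), [&](const std::string & n) {
        return std::find(haystack.begin(), haystack.end(), n) != haystack.end();
    });
}
```

With this model a needle of `{"a", ""}` still matches a row containing `a`, while a needle of `{"", ""}` matches nothing.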

💡 Nits

[src/Processors/QueryPlan/Optimizations/optimizeDirectReadFromTextIndex.cpp:650-655] The comment still says separator-containing postprocessed tokens keep the original needle and mentions hasToken*, but the code now throws BAD_ARGUMENTS and the optimizer no longer rewrites hasTokenOrNull with a postprocessor. Existing thread: Text index postprocessor #98939 (comment). Suggested fix: update the comment to match the current branch.

Tests

⚠️ Please add a focused regression for hasToken with a non-splitByNonAlpha tokenizer, for example ngrams(3) plus postprocessor = substring(val, 1, 2), proving that hasToken(val, 'hello') does not match an unrelated value that only shares the first postprocessed token.
⚠️ Please add a focused regression for a separator-emitting postprocessor, for example postprocessor = if(val = 'foo', 'bar baz', val), proving that hasToken(val, 'baz') behaves the same with materialized-index, unmaterialized-part, query_plan_direct_read_from_text_index = 1, and query_plan_direct_read_from_text_index = 0 paths.
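
The ngrams(3) regression requested above can be reasoned about with a small model before writing the SQL test. This is a sketch, not the PR's code: `ngrams3` assumes single-byte text, `postprocess` mimics `substring(val, 1, 2)`, and all function names are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Hypothetical model of ngrams(3) over single-byte text.
Tokens ngrams3(const std::string & text)
{
    Tokens out;
    for (std::size_t i = 0; i + 3 <= text.size(); ++i)
        out.push_back(text.substr(i, 3));
    return out;
}

// Models postprocessor = substring(val, 1, 2): keep the first two bytes.
std::string postprocess(const std::string & token)
{
    return token.substr(0, 2);
}

// Index path: tokenize first, then postprocess every generated token.
Tokens indexPathTokens(const std::string & text)
{
    Tokens out;
    for (const auto & t : ngrams3(text))
        out.push_back(postprocess(t));
    return out;
}

// Fallback needle as described in the finding: the raw needle is
// postprocessed as a single token, without tokenizing it first.
Tokens fallbackNeedleTokens(const std::string & needle)
{
    return {postprocess(needle)};
}

// True when every needle token occurs among the haystack tokens.
bool containsAll(const Tokens & haystack, const Tokens & needle)
{
    return std::all_of(needle.begin(), needle.end(), [&](const std::string & n) {
        return std::find(haystack.begin(), haystack.end(), n) != haystack.end();
    });
}
```

In this model the needle `hello` yields `he`, `el`, `ll` on the index path but only `he` on the fallback path, so an unrelated row such as `hero` (tokens `he`, `er`) matches on the fallback path only.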

Final Verdict

Status: ❌ Block

Minimum required actions: make the hasToken fallback/direct-read behavior equivalent to the materialized text index path, or restrict unsupported tokenizer/postprocessor combinations, then update the source and user-facing docs so the documented contract matches the implementation.

alexey-milovidov · 2026-04-09T03:28:10Z

The MSan stress test failure (MemorySanitizer: use-of-uninitialized-value, STID 4179-5154 or 4148-3044) is a known pre-existing issue unrelated to this PR. Fix: #102158

Use the same approach as the preprocessor in that case.
MergeTreeIndexTextUtils.h: New file to avoid repeating code common to the preprocessor and the postprocessor.
optimizeDirectReadFromTextIndex.cpp: Applies the postprocessor to the haystack symmetrically with the needle.

Ergus · 2026-06-24T17:56:41Z

The last 3 CI runs are failing in tsan builds due to #108393 introduced in #82414

This is totally orthogonal to the changes and seems to be also present in master. Everything else ran correctly.

There is already a fix for the root cause of the failure: #108391

And we don't want to delay the merge into master anymore.

ahmadov · 2026-06-24T23:01:04Z

+
+    /// A postprocessor may drop or rewrite tokens, which would desynchronize the recorded
+    /// positions from the actual token sequence and break positional phrase search.
+    if (positions && postprocessor.hasActions())


This sounds wrong. Can you give an example where this might not work? A postprocessor is a powerful way to drop stop words and get rid of unnecessary tokens. A postprocessor without positions support would be slightly less useful for phrase search.

For now we have to disallow using them together. After postprocessing a phrase, the relative positions of its tokens can change, so we need to decide how to resolve the conflict (that is, which position information to store so that it stays meaningful for hasPhrase and any other feature). But that is more of a follow-up development.

We need to test correctness extensively first.
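
The conflict discussed in this thread can be made concrete with a small model. It is only a sketch of the two obvious choices, not a proposal from the PR: `dropStopWords` is a hypothetical postprocessor that removes the token `the`, and the two functions show which positions would be stored under each policy.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

using Tokens = std::vector<std::string>;
using Positioned = std::vector<std::pair<std::string, std::size_t>>;

// Hypothetical stop-word postprocessor: an empty result means "drop the token".
std::string dropStopWords(const std::string & token)
{
    return token == "the" ? std::string{} : token;
}

// Policy 1: surviving tokens keep the position assigned by the tokenizer.
Positioned keepOriginalPositions(const Tokens & tokens)
{
    Positioned out;
    for (std::size_t i = 0; i < tokens.size(); ++i)
    {
        const std::string t = dropStopWords(tokens[i]);
        if (!t.empty())
            out.emplace_back(t, i);
    }
    return out;
}

// Policy 2: surviving tokens are renumbered consecutively.
Positioned renumberPositions(const Tokens & tokens)
{
    Positioned out;
    for (const auto & token : tokens)
    {
        const std::string t = dropStopWords(token);
        if (!t.empty())
        {
            const std::size_t next = out.size();
            out.emplace_back(t, next);
        }
    }
    return out;
}

// True when `second` is stored exactly one position after `first`.
bool isAdjacent(const Positioned & positions, const std::string & first, const std::string & second)
{
    for (const auto & a : positions)
        for (const auto & b : positions)
            if (a.first == first && b.first == second && b.second == a.second + 1)
                return true;
    return false;
}
```

For the tokens `kill`, `the`, `process`, the phrase `kill process` is adjacent under the renumbering policy but not when original positions are kept, so the stored positions and the query-side positions have to follow the same policy.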

clickhouse-gh · 2026-06-25T01:31:41Z

LLVM Coverage Report

Changed lines: Changed C/C++ lines covered by tests: 495/511 (96.87%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

Ergus added 6 commits March 6, 2026 13:06

New auxiliary function: tokenizeToArray

dc69d2c

Added as part of the tokenizer interface to tokenize complete String and Array(String) columns into Array(String).

Add a postprocessor class

c0c3a59

Created with Claude's help after a lot of explanations, iterations and tokens... The basic implementation is extremely inefficient and the tuning changes were hard for the AI to get right.

Integrate postprocessor into index creation and update

1b49076

Fix error when using arrays with has*Tokens

cffd1fc

Limit postprocessor to has*Token functions

b8979ce

This is not a definitive change, but I don't want to break anything and at the moment that's everything I will test.

Document postprocessor

c63dbca

clickhouse-gh Bot added the pr-feature Pull request with new product feature label Mar 6, 2026

Ergus added 2 commits March 9, 2026 15:24

Add 'postprocessor' to aspell

577c69d

Merge remote-tracking branch 'principal/master' into text_index_postp…

e1f46bc

…rocessor

rschu1ze mentioned this pull request Mar 18, 2026

Remove stop_words param for unicode tokenizer #99834

Merged

1 task

Ergus added 3 commits March 18, 2026 13:04

Make clang tidy happy

78d72f9

Merge remote-tracking branch 'principal/master' into text_index_postp…

cb15cff

…rocessor

Merge remote-tracking branch 'principal/master' into text_index_postp…

acccd99

…rocessor

clickhouse-gh Bot reviewed Mar 24, 2026

Comment thread src/Interpreters/ITokenizer.cpp

clickhouse-gh Bot reviewed Mar 24, 2026

Comment thread src/Storages/MergeTree/MergeTreeIndexText.cpp Outdated

Merge remote-tracking branch 'principal/master' into text_index_postp…

eb73a87

…rocessor

clickhouse-gh Bot reviewed Apr 7, 2026

Comment thread src/Storages/MergeTree/MergeTreeIndexConditionText.cpp Outdated

clickhouse-gh Bot reviewed Apr 7, 2026

Comment thread src/Storages/MergeTree/MergeTreeIndexTextPostprocessor.cpp

Ergus added 2 commits April 7, 2026 16:45

Don't lie to the compiler.

7aeaa6a

Add postprocessor test

82e8cad

clickhouse-gh Bot reviewed Apr 7, 2026

Comment thread src/Processors/QueryPlan/Optimizations/optimizeDirectReadFromTextIndex.cpp Outdated

Fix empty corner case.

53f675f

It is possible that the postprocessor empties all the tokens from tokenizer. In that case we need a special handler to be coherent with the has*Token functions' contracts.

clickhouse-gh Bot reviewed Apr 7, 2026

Comment thread docs/en/engines/table-engines/mergetree-family/textindexes.md Outdated

Merge remote-tracking branch 'origin/master' into tmp-ergus

fada40e

alexey-milovidov and others added 2 commits April 9, 2026 10:30

Merge remote-tracking branch 'origin/master' into conflict-98939

c89ee2e

# Conflicts: # src/Storages/MergeTree/MergeTreeIndexText.cpp

clickhouse-gh Bot reviewed Apr 9, 2026

Comment thread src/Processors/QueryPlan/Optimizations/optimizeDirectReadFromTextIndex.cpp Outdated

Merge remote-tracking branch 'origin/master' into text_index_postproc…

c699f5b

…essor

Ergus added this pull request to the merge queue Jun 24, 2026

Ergus removed this pull request from the merge queue due to a manual request Jun 24, 2026

Ergus added 2 commits June 24, 2026 20:46

Merge remote-tracking branch 'origin/master' into text_index_postproc…

c39cab9

…essor

Add exclusion check in textIndexValidator to avoid positions and post…

2bccdd4

…processor simultaneously

Ergus mentioned this pull request Jun 24, 2026

Text index: store positions for a better phrase search #103172

Merged

3 tasks

ahmadov reviewed Jun 24, 2026

Merge remote-tracking branch 'origin/master' into text_index_postproc…

1deafd2

…essor

Ergus mentioned this pull request Jun 25, 2026

Make the text index postprocessor compatible with positional phrase search (positions = 1 / hasPhrase). #108432

Closed

Ergus added this pull request to the merge queue Jun 25, 2026

Ergus added the v26.6-must-backport label Jun 25, 2026

Merged via the queue into ClickHouse:master with commit d7f7159 Jun 25, 2026
16 of 18 checks passed

Ergus deleted the text_index_postprocessor branch June 25, 2026 02:21

robot-ch-test-poll4 mentioned this pull request Jun 25, 2026

Cherry pick #98939 to 26.6: Text index postprocessor #108437

Merged

robot-ch-test-poll4 added a commit that referenced this pull request Jun 25, 2026

Merge pull request #108437 from ClickHouse/cherrypick/26.6/98939

9233932

Cherry pick #98939 to 26.6: Text index postprocessor

robot-clickhouse added a commit that referenced this pull request Jun 25, 2026

Backport #98939 to 26.6: Text index postprocessor

04ead7e

robot-ch-test-poll4 mentioned this pull request Jun 25, 2026

Backport #98939 to 26.6: Text index postprocessor #108438

Merged

robot-clickhouse-ci-2 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Jun 25, 2026

clickhouse-gh Bot added a commit that referenced this pull request Jun 25, 2026

Merge pull request #108438 from ClickHouse/backport/26.6/98939

479d6dd

Backport #98939 to 26.6: Text index postprocessor

robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 25, 2026

Ergus mentioned this pull request Jun 25, 2026

Revert "Text index postprocessor" #108456

Merged

robot-clickhouse-ci-2 added the pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR label Jun 25, 2026

Ergus mentioned this pull request Jun 26, 2026

Text index postprocessor resubmit #108606

Merged

fm4v removed pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR labels Jul 3, 2026

robot-ch-test-poll added pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore labels Jul 3, 2026

Text index postprocessor #98939
Ergus merged 170 commits into ClickHouse:master from Ergus:text_index_postprocessor

10 participants
