Text index postprocessor by Ergus · Pull Request #98939 · ClickHouse/ClickHouse · GitHub
Skip to content

Text index postprocessor#98939

Merged
Ergus merged 170 commits into
ClickHouse:masterfrom
Ergus:text_index_postprocessor
Jun 25, 2026
Merged

Text index postprocessor#98939
Ergus merged 170 commits into
ClickHouse:masterfrom
Ergus:text_index_postprocessor

Conversation

@Ergus

@Ergus Ergus commented Mar 6, 2026

Copy link
Copy Markdown
Member

Parameters:

postprocessor : The postprocessor is an arbitrary expression that transforms each token after tokenization.

Example use:

CREATE TABLE users (
    str String,
    index idx() type text(tokenizer = 'spliByNonApha', postprocessor = lower(str))
ENGINE = MergeTree ORDER BY tuple();

SELECT * FROM users WHERE hasAnyTokens(name, 'crash');

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Added a postprocessor to the text index which transforms the tokens after tokenization.

Version info

  • Merged into: 26.7.1.47
  • Backported to: 26.6.1.1191

Ergus added 6 commits March 6, 2026 13:06
Added as part of the tokenizer interface to tokenize complete String and Array(String) columns into Array(String).
Created with Claude help after a lot of explications, iterations and tokens...
The basic implementation is extremely inefficient and the tunning changes were hard to get by the IA.
This is not a definitive change, but I don't want to break anything and at the moment that's everything I will test.
@clickhouse-gh

clickhouse-gh Bot commented Mar 6, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-feature Pull request with new product feature label Mar 6, 2026
Comment thread src/Interpreters/ITokenizer.cpp
Comment thread src/Storages/MergeTree/MergeTreeIndexText.cpp Outdated
Comment thread src/Storages/MergeTree/MergeTreeIndexConditionText.cpp Outdated
Comment thread src/Storages/MergeTree/MergeTreeIndexTextPostprocessor.cpp
Comment thread src/Processors/QueryPlan/Optimizations/optimizeDirectReadFromTextIndex.cpp Outdated
It is possible that the postprocessor empties all the tokens from tokenizer.
In that case we need a special handler to be coherent with the has*Token functions' contracts.
Comment thread docs/en/engines/table-engines/mergetree-family/textindexes.md Outdated
@alexey-milovidov

This comment was marked as resolved.

@alexey-milovidov

Copy link
Copy Markdown
Member

The MSan stress test failure (MemorySanitizer: use-of-uninitialized-value, STID 4179-5154 or 4148-3044) is a known pre-existing issue unrelated to this PR. Fix: #102158

alexey-milovidov and others added 2 commits April 9, 2026 10:30
# Conflicts:
#	src/Storages/MergeTree/MergeTreeIndexText.cpp
Use the same approach than preprocessor in that case.
MergeTreeIndexTextUtils.h: New file to avoid repeating code common to pre and post processor.
optimizeDirectReadFromTextIndex.cpp: Applies the postprocessor to the haystack symmetrically with the needle.
Comment thread src/Processors/QueryPlan/Optimizations/optimizeDirectReadFromTextIndex.cpp Outdated
@Ergus

Ergus commented Jun 24, 2026

Copy link
Copy Markdown
Member Author

The last 3 CI runs are failing in tsan builds due to #108393 introduced in #82414

This is totally orthogonal to the changes and seems to be also present in master. Everything else ran correctly.

There is already a fix for the root cause of the failure: #108391

And we don't want to delay the merge into master anymore.

@Ergus Ergus added this pull request to the merge queue Jun 24, 2026
@Ergus Ergus removed this pull request from the merge queue due to a manual request Jun 24, 2026

/// A postprocessor may drop or rewrite tokens, which would desynchronize the recorded
/// positions from the actual token sequence and break positional phrase search.
if (positions && postprocessor.hasActions())

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this sounds wrong. Can you give an example where this might not work? Because sostprocessor is a powerful way to avoid stop words and get rid of unnecessary tokens. Having postprocessor without positions support would be slightly less useful for phrase search.

@Ergus Ergus Jun 24, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now we have to disable them together, After postprocesing a phrase the relative positions will/could change, so we need to decide how we prefer to solve the conflict (so what position information will be stored that will be sensible for hasPhrase and any other feature). But that's more a sort of followup development.

We need to test correctness extensively before.

@clickhouse-gh

clickhouse-gh Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.30% 85.30% +0.00%
Functions 92.60% 92.60% +0.00%
Branches 77.50% 77.50% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 495/511 (96.87%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

@Ergus Ergus added this pull request to the merge queue Jun 25, 2026
Merged via the queue into ClickHouse:master with commit d7f7159 Jun 25, 2026
16 of 18 checks passed
@Ergus Ergus deleted the text_index_postprocessor branch June 25, 2026 02:21
robot-ch-test-poll4 added a commit that referenced this pull request Jun 25, 2026
Cherry pick #98939 to 26.6: Text index postprocessor
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Jun 25, 2026
clickhouse-gh Bot added a commit that referenced this pull request Jun 25, 2026
Backport #98939 to 26.6: Text index postprocessor
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 25, 2026
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR label Jun 25, 2026
@fm4v fm4v removed pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR labels Jul 3, 2026
@robot-ch-test-poll robot-ch-test-poll added pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore labels Jul 3, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-feature Pull request with new product feature pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR pr-synced-to-cloud The PR is synced to the cloud repo submodule changed At least one submodule changed in this PR. v26.6-must-backport

Projects

None yet

Development

Successfully merging this pull request may close these issues.

10 participants