Text index postprocessor #98939
Added as part of the tokenizer interface to tokenize complete String and Array(String) columns into Array(String).
Created with Claude's help after a lot of explanations, iterations and tokens... The basic implementation is extremely inefficient, and the tuning changes were hard to get out of the AI.
This is not a definitive change, but I don't want to break anything and at the moment that's everything I will test.
It is possible that the postprocessor empties all the tokens from tokenizer. In that case we need a special handler to be coherent with the has*Token functions' contracts.
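The empty-result case can be sketched as follows. This is a minimal illustration and not the code from this PR: `dropStopWords`, `NeedleState` and `classifyNeedle` are hypothetical names, and the stop-word filter only stands in for whatever the configured postprocessor does. The point is that a needle which tokenizes to something but postprocesses to nothing is a distinct state, so the caller can apply whatever the has*Token contracts prescribe for it.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

using Tokens = std::vector<std::string>;

// Hypothetical postprocessor: removes every token found in `stop_words`.
Tokens dropStopWords(const Tokens & tokens, const std::unordered_set<std::string> & stop_words)
{
    Tokens result;
    for (const auto & token : tokens)
        if (stop_words.count(token) == 0)
            result.push_back(token);
    return result;
}

// Hypothetical classification of a search needle after both stages have run.
enum class NeedleState
{
    EmptyBeforePostprocessing, // the tokenizer itself produced no tokens
    EmptiedByPostprocessor,    // the tokenizer produced tokens, the postprocessor removed all of them
    HasTokens                  // normal path: the remaining tokens can be looked up
};

NeedleState classifyNeedle(const Tokens & tokenized, const Tokens & postprocessed)
{
    if (tokenized.empty())
        return NeedleState::EmptyBeforePostprocessing;
    if (postprocessed.empty())
        return NeedleState::EmptiedByPostprocessor;
    return NeedleState::HasTokens;
}
```

With stop words `{"the", "of"}`, the needle tokens `{"the", "of"}` end up in `EmptiedByPostprocessor`, while `{"fox"}` stays in `HasTokens`. Which result each has*Token function should return for the emptied case is deliberately left to the caller in this sketch.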
The MSan stress test failure (MemorySanitizer: use-of-uninitialized-value, STID 4179-5154 or 4148-3044) is a known pre-existing issue unrelated to this PR. Fix: #102158
# Conflicts: # src/Storages/MergeTree/MergeTreeIndexText.cpp
Use the same approach as the preprocessor in that case. MergeTreeIndexTextUtils.h: New file to avoid repeating code common to the pre- and postprocessor. optimizeDirectReadFromTextIndex.cpp: Applies the postprocessor to the haystack symmetrically with the needle.
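Why the haystack and the needle need the same treatment can be shown with a small sketch. It is an illustration under assumptions, not code from optimizeDirectReadFromTextIndex.cpp: `lowerTokens` is a hypothetical postprocessor that ASCII-lowercases tokens, and `containsAll` is a hypothetical stand-in for a token lookup.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

using Tokens = std::vector<std::string>;

// Hypothetical postprocessor: ASCII-lowercases every token.
Tokens lowerTokens(Tokens tokens)
{
    for (auto & token : tokens)
        std::transform(token.begin(), token.end(), token.begin(),
                       [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    return tokens;
}

// Returns true if every needle token occurs among the haystack tokens.
bool containsAll(const Tokens & haystack, const Tokens & needle)
{
    return std::all_of(needle.begin(), needle.end(), [&](const std::string & token)
    {
        return std::find(haystack.begin(), haystack.end(), token) != haystack.end();
    });
}
```

If only the haystack `{"Quick", "Brown", "Fox"}` is lowercased, the needle `{"quick", "FOX"}` no longer matches because `"FOX"` is compared against `"fox"`. Applying `lowerTokens` to both sides restores the match.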
The last 3 CI runs are failing in TSan builds due to #108393, introduced in #82414. This is totally orthogonal to these changes and also seems to be present in master. Everything else ran correctly. There is already a fix for the root cause of the failure: #108391. We don't want to delay the merge into master any longer.
/// A postprocessor may drop or rewrite tokens, which would desynchronize the recorded
/// positions from the actual token sequence and break positional phrase search.
if (positions && postprocessor.hasActions())
This sounds wrong. Can you give an example where this might not work? A postprocessor is a powerful way to avoid stop words and get rid of unnecessary tokens. Having a postprocessor without positions support would be slightly less useful for phrase search.
For now we have to disable them together. After postprocessing a phrase, the relative positions will (or could) change, so we need to decide how we prefer to resolve the conflict, i.e. what position information should be stored so that it is sensible for hasPhrase and any other feature. But that is more of a follow-up development.
We need to test correctness extensively before.
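A small sketch of the kind of conflict meant here, assuming a stop-word postprocessor and a deliberately naive phrase check. None of this is code from the PR: `PositionedTokens`, `withPositions`, `dropStopWords` and `hasPhraseNaive` are hypothetical names, and the real hasPhrase logic may well differ.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_set>
#include <vector>

struct PositionedTokens
{
    std::vector<std::string> tokens;
    std::vector<std::size_t> positions; // positions[i] belongs to tokens[i]
};

// Assigns tokenize-time positions 0, 1, 2, ... to the tokens.
PositionedTokens withPositions(const std::vector<std::string> & tokens)
{
    PositionedTokens result;
    for (std::size_t i = 0; i < tokens.size(); ++i)
    {
        result.tokens.push_back(tokens[i]);
        result.positions.push_back(i);
    }
    return result;
}

// Hypothetical postprocessor: drops stop words. With `renumber == false` the surviving
// tokens keep their original positions; with `renumber == true` they are renumbered 0, 1, ...
PositionedTokens dropStopWords(
    const PositionedTokens & input, const std::unordered_set<std::string> & stop_words, bool renumber)
{
    PositionedTokens result;
    for (std::size_t i = 0; i < input.tokens.size(); ++i)
    {
        if (stop_words.count(input.tokens[i]) != 0)
            continue;
        result.positions.push_back(renumber ? result.tokens.size() : input.positions[i]);
        result.tokens.push_back(input.tokens[i]);
    }
    return result;
}

// Naive phrase check: needle[k] must occur exactly k positions after needle[0].
bool hasPhraseNaive(const PositionedTokens & haystack, const std::vector<std::string> & needle)
{
    assert(haystack.tokens.size() == haystack.positions.size());
    if (needle.empty())
        return false; // arbitrary choice for this sketch only
    for (std::size_t i = 0; i < haystack.tokens.size(); ++i)
    {
        if (haystack.tokens[i] != needle[0])
            continue;
        bool matched = true;
        for (std::size_t k = 1; k < needle.size() && matched; ++k)
        {
            bool found = false;
            for (std::size_t j = 0; j < haystack.tokens.size(); ++j)
                if (haystack.tokens[j] == needle[k] && haystack.positions[j] == haystack.positions[i] + k)
                    found = true;
            matched = found;
        }
        if (matched)
            return true;
    }
    return false;
}
```

Take the document tokens `war of the worlds` with stop words `of` and `the`. If the original positions are kept, `war` stays at 0 and `worlds` at 3, so the postprocessed needle `war worlds` no longer matches a document that literally contains the phrase. If the positions are renumbered, `war` and `worlds` become adjacent, and a search for `war worlds` matches even though the two words were never adjacent in the text. Either choice changes what a phrase match means, which is the decision deferred to the follow-up.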
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered by tests: 495/511 (96.87%) | Lost baseline coverage: none
Cherry pick #98939 to 26.6: Text index postprocessor
Backport #98939 to 26.6: Text index postprocessor

Parameters:
Example use:
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Added a postprocessor to the text index which transforms the tokens after tokenization.
Version info
26.7.1.47, 26.6.1.1191