WIP: Projection Index Text by amosbird · Pull Request #93114 · ClickHouse/ClickHouse · GitHub
Skip to content

WIP: Projection Index Text#93114

Draft
amosbird wants to merge 1 commit into
ClickHouse:masterfrom
amosbird:projection-index-text-squash
Draft

WIP: Projection Index Text#93114
amosbird wants to merge 1 commit into
ClickHouse:masterfrom
amosbird:projection-index-text-squash

Conversation

@amosbird

Copy link
Copy Markdown
Collaborator

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

TODO

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@clickhouse-gh

clickhouse-gh Bot commented Dec 29, 2025

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added pr-improvement Pull request with some product improvements submodule changed At least one submodule changed in this PR. labels Dec 29, 2025
@amosbird amosbird force-pushed the projection-index-text-squash branch 6 times, most recently from bf78dd7 to 52bb372 Compare December 30, 2025 07:41
@rschu1ze

Copy link
Copy Markdown
Member

Hi @amosbird

I know you put in a lot of work into this, that this is a draft, and I might not have all background information.

Users have a common expectation that full-text search is backed by a inverted text index. Implementing it differently as projection index will mean a lot explaining - projections are actually an unusual feature for databases. Funtionality- and performance-wise, the inverted index and the projection index will be similar but of course, we'll need to maintain both. Also, a lot of tweaking and tuning went into the existing inverted index (different format versions, tokenizers, functions). A projection index means re-engineering a lot of this.

I'm of course happy to hear some good reasons why ClickHouse should offer also a projection index, but for now I would vote for not merging this one.

@amosbird

amosbird commented Dec 30, 2025

Copy link
Copy Markdown
Collaborator Author

Hi @rschu1ze, thanks for the feedback. Let me try to clarify a few points regarding the projection index text PR and its motivation.

Projection index turns index structures into first-class, typed columns.
The core innovation is not a new indexing algorithm, but a change in abstraction: index data is modeled as regular columns that the engine can natively understand and operate on.

This enables cross-layer reuse between the engine and indexes. Engine-level optimizations (e.g. ColumnString FSST encoding, primary key layout improvements, string serialization) automatically apply to index data, while optimizations introduced for index-specific column types (such as PostingList) become available to the engine once expressed through the column and type system.

It also allows index structures to act as reusable analytical primitives, not just lookup accelerators. For example, a PostingList column introduced for text projections can also support union-heavy analytical workloads more efficiently than bitmap-based approaches (see e.g. https://dl.acm.org/doi/10.1145/3035918.3064007), achieving benefits for both indexing and query execution.

Finally, because projection index data is stored as regular columns, it is directly observable via SQL, making index behavior transparent and enabling easier introspection, debugging, and iteration of future index designs.

Regarding the concern that functionality- and performance-wise the inverted index and the projection index would be similar and both would need to be maintained:

First, while this PR could serve as a foundation for a production-ready text index in the future, its primary goal is to demonstrate the projection-based indexing concept and its value. Similar to my earlier performance-focused PR #81944, this work should be viewed as a conceptual and architectural demo. The implementation in this PR will be split into smaller, meaningful components and submitted incrementally.

From a maintenance perspective, a projection-based implementation is in fact well aligned with text indexing needs in several ways. It is naturally part-level (unlike skip indices which are granule-level), it already benefits from projection merge and mutation support (which the current text index merge logic partially relies on as well), and it can be easily extended to support Lucene-like payloads, i.e. attaching additional data to index entries. In this sense, some recent optimizations in the text index—such as direct-read behavior—are closer to materialized expression semantics, which are a natural fit for projections.

The intent here is not to introduce long-term maintenance overhead. Implementing text-index-related capabilities on top of projections requires only minimal adaptation, rather than maintaining a parallel indexing subsystem. In parallel, our team continues to actively optimize the existing text index implementation (e.g. #92871), and this work is orthogonal to the projection-based approach. At the same time, we are also investing in additional index types, such as key–value, vector, and spatial indexes.

@rschu1ze

Copy link
Copy Markdown
Member

@amosbird Ah, it makes more sense to me now, thanks.

@amosbird amosbird force-pushed the projection-index-text-squash branch 2 times, most recently from 99e8b67 to cc56347 Compare January 5, 2026 01:50
@UnamedRus

Copy link
Copy Markdown
Contributor

Users have a common expectation that full-text search is backed by a inverted text index.

There is somewhat common feature called covering index. (basically, when you can specify extra columns for index column, so just reading index will be sufficient for answering query)

I can imagine, that projection index can have "simpler" expansion to cover this feature.
How hard to implement with current inverted index implementation, if it will be required?

@amosbird amosbird force-pushed the projection-index-text-squash branch 4 times, most recently from 069a41a to a75bcec Compare January 10, 2026 22:17
@amosbird amosbird force-pushed the projection-index-text-squash branch 2 times, most recently from 52098a9 to b470cbb Compare January 14, 2026 14:41
@amosbird amosbird force-pushed the projection-index-text-squash branch 6 times, most recently from 0643c1e to 760931a Compare February 4, 2026 15:11
@amosbird amosbird marked this pull request as ready for review February 5, 2026 00:55
@amosbird amosbird marked this pull request as draft February 9, 2026 09:39
@amosbird amosbird force-pushed the projection-index-text-squash branch from db1abde to a0245d4 Compare April 2, 2026 11:15
Comment thread src/Storages/MergeTree/ProjectionIndex/PostingListData.cpp
Comment thread src/Storages/MergeTree/ProjectionIndex/TurboPForBlockDecodeBuffer.h Outdated
Comment thread src/Storages/MergeTree/ProjectionIndex/ProjectionIndexText.cpp Outdated
Comment thread src/Storages/MergeTree/ProjectionIndex/PostingListData.cpp Outdated
@amosbird amosbird force-pushed the projection-index-text-squash branch from 36a61cf to b4bb939 Compare June 6, 2026 10:09
Comment thread src/Storages/MergeTree/ProjectionIndex/PostingListData.cpp
auto & dbuf = pst_stream->decodeBuffer();
dbuf.reset();
const uint8_t * ptr = dbuf.ptr();
if (count == TURBOPFOR_BLOCK_SIZE)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This precise mark-filter path decodes a .pst packed block from offsets supplied by .pidx, but it never validates the decoded doc ids before caching them. If corrupted .pidx points at the wrong bytes, entry->doc_ids can be non-monotonic or outside [delta_base + 1, lb.lastDocIdOf(lo)]; the later lower_bound can then miss a real hit and hasDocInRange returns false, skipping a mark that may contain matching rows.

Please treat the decoded block as untrusted: advance by the decoder-consumed byte count, validate strict monotonicity and range against the packed-block metadata, and throw INCORRECT_DATA instead of caching/using invalid doc ids.

Comment thread src/Storages/MergeTree/ProjectionIndex/TurboPForBlockDecodeBuffer.h Outdated
bool ok = true;
for (UInt32 vi = 1; vi < count && ok; ++vi)
ok = (decode_buf[vi] > decode_buf[vi - 1]);
if (!ok || decode_buf[0] <= delta_base || decode_buf[count - 1] > lb.lastDocIdOf(block_idx))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This detects corrupted decoded doc ids, but only logs and then continues with decode_buf as if it were valid. For projection text index direct reads, those decoded ids drive the virtual-column result, so non-monotonic or out-of-range data can return wrong rows instead of reporting a corrupted index. The same pattern exists in iterateLargeBlock below.

Please make this a hard runtime validation: throw INCORRECT_DATA (or fall back to a safe materialization path that still validates) before decode_buf is used.

Comment thread src/Storages/MergeTree/ProjectionIndex/ProjectionIndexText.cpp Outdated
Comment thread src/Storages/MergeTree/ProjectionIndex/ProjectionIndexText.cpp Outdated
Comment thread src/Storages/MergeTree/ProjectionIndex/PostingListState.cpp
Comment thread src/Storages/MergeTree/ProjectionIndex/PostingListData.cpp Outdated
Comment thread src/Storages/MergeTree/ProjectionIndex/PostingListData.h Outdated
Comment thread src/Storages/MergeTree/ProjectionIndex/PostingListData.cpp
Comment thread src/DataTypes/DataTypeFactory.cpp
@amosbird amosbird force-pushed the projection-index-text-squash branch 2 times, most recently from 05c3af7 to 77c63fd Compare June 7, 2026 10:55
@amosbird amosbird force-pushed the projection-index-text-squash branch from 77c63fd to d8ebab6 Compare June 7, 2026 15:13
Comment thread src/Storages/MergeTree/ProjectionIndex/PositionCursor.cpp
Comment thread src/Storages/MergeTree/ProjectionIndex/ProjectionIndexText.cpp
Comment thread src/Storages/MergeTree/ProjectionIndex/ProjectionIndexText.cpp Outdated
}
else
{
abs_pos = doc_state.last_position + delta + 1;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

last_position + delta + 1 is done in UInt32, so corrupted .pos data can wrap an absolute position back to a small value or to NO_MORE_POSITIONS. PhraseCursor treats that sentinel as end-of-doc and otherwise assumes positions are increasing, so this can turn corrupted phrase data into false negatives/positives instead of an INCORRECT_DATA exception.

Please compute the next absolute position in UInt64, require it to be greater than the previous position and < NO_MORE_POSITIONS, and throw on overflow or non-monotonic deltas.

Comment thread src/Storages/MergeTree/MergeTreeReaderWide.cpp Outdated
Comment thread src/Storages/MergeTree/ProjectionIndex/PhraseCursor.cpp Outdated
@clickhouse-gh

clickhouse-gh Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.60% 84.50% -0.10%
Functions 92.30% 92.30% +0.00%
Branches 77.30% 77.20% -0.10%

Changed lines: Changed C/C++ lines covered by tests: 3424/4646 (73.70%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 31 line(s) · Uncovered code

Full report · Diff report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-improvement Pull request with some product improvements submodule changed At least one submodule changed in this PR.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants