
amosbird · 2025-12-29T00:00:29Z

Changelog category (leave one):

Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

TODO

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

clickhouse-gh · 2025-12-29T00:00:57Z

Workflow [PR], commit [af95673]

Summary: ⏳

Performance Comparison: Performance dashboard

job_name	test_name	status	info
AST fuzzer (amd_debug, targeted)		FAIL
	Logical error: Hash is not set for serialization A (STID: 2521-3cb7)	FAIL	cidb
Stateless tests (amd_asan_ubsan, flaky check)		FAIL
	Server died	FAIL	cidb
	03800_projection_text_index_prewhere_support	UNKNOWN	cidb
	03800_projection_text_index_prewhere_support	UNKNOWN	cidb
	03800_projection_text_index_prewhere_support	UNKNOWN	cidb
	03800_projection_text_index_prewhere_support	UNKNOWN	cidb
	03800_projection_text_index_zookeeper	UNKNOWN	cidb
Stateless tests (amd_tsan, flaky check)		FAIL
	Server died	FAIL	cidb
	03800_projection_text_index_zookeeper	UNKNOWN	cidb
	03800_projection_text_index_zookeeper	UNKNOWN	cidb
	03800_projection_text_index_zookeeper	UNKNOWN	cidb
	03800_projection_text_index_zookeeper	UNKNOWN	cidb
	03800_projection_text_index_zookeeper	UNKNOWN	cidb
Stateless tests (amd_msan, flaky check)		FAIL
	Server died	FAIL	cidb
	03800_projection_text_index_zookeeper	UNKNOWN	cidb
	03800_projection_text_index_zookeeper	UNKNOWN	cidb
	03800_projection_text_index_zookeeper	UNKNOWN	cidb
	03800_projection_text_index_zookeeper	UNKNOWN	cidb
	03800_projection_text_index_zookeeper	UNKNOWN	cidb
Stateless tests (amd_debug, flaky check)		FAIL
	03800_projection_text_index_parallel_replicas	FAIL	cidb
Stateless tests (amd_asan_ubsan, distributed plan, parallel, 2/2)		FAIL
	03800_projection_text_index_zookeeper	FAIL	cidb
Stateless tests (amd_debug, parallel)		FAIL
	03800_projection_text_index_zookeeper	FAIL	cidb
Stateless tests (amd_tsan, parallel, 2/2)		FAIL
	03800_projection_text_index_zookeeper	FAIL	cidb
Stateless tests (amd_debug, distributed plan, s3 storage, parallel)		FAIL
	03800_projection_text_index_zookeeper	FAIL	cidb
Stateless tests (amd_tsan, s3 storage, parallel, 2/2)		FAIL
	03800_projection_text_index_zookeeper	FAIL	cidb

AI Review

Summary

This PR adds an experimental projection-backed text index for MergeTree, including on-disk posting/position streams, direct-read virtual columns, phrase-query support, a feature gate, and stateless coverage. I would still block merge: several corrupted-index paths can still read invalid metadata, allocate from impossible counts, or return wrong rows instead of throwing INCORRECT_DATA.

PR Metadata

💡 Improvement is the right category, but the required changelog entry is still TODO. Suggested entry: Add an experimental projection-backed text index for MergeTree tables, gated by allow_experimental_projection_text_index.
💡 The documentation checkbox is still unchecked for a user-facing experimental feature. Either add source/docs coverage for the SQL surface and examples, or fill the template entry that should be added to docs.clickhouse.com.

Missing context / blind spots

⚠️ The PR body still does not describe motivation, expected user workflow, or rollout contract beyond the placeholder changelog. Filling the PR description/docs entry would close this.
⚠️ I did not run a local build or stateless suite in this pass. The Praktika helper currently reports no failed PR checks, but focused corruption tests are still needed for the failure modes below.

Findings

❌ Blockers

[src/Storages/MergeTree/ProjectionIndex/PositionCursor.cpp:218] ensurePosBlockDecoded validates block_start_delta, but still indexes lb.pos_cum_bytes[pos_block_idx - 1] without validating that pos_block_idx <= lb.pos_cum_bytes.size(). Inconsistent .pidx/.pos metadata can therefore read past the vector instead of throwing INCORRECT_DATA. Validate the position-block index against the stored byte-offset table before computing byte_offset.
[src/Storages/MergeTree/ProjectionIndex/PostingListData.cpp:1326] PostingListStream::read still accepts any num_large_blocks <= doc_count and then sizes lb_ranges, pst_offsets, pidx_offsets, and pos_offsets from that on-disk value. The expected count is determined by doc_count - 1 and posting_list_block_size; accepting impossible values can force excessive allocations or decode inconsistent metadata. Compute the expected large-block count before allocation, require an exact match, and check the num_large_blocks * 2 product before using it.
[src/Storages/MergeTree/ProjectionIndex/ProjectionTokenInfo.cpp:155] precise mark filtering decodes doc ids from .pst offsets supplied by .pidx and caches them without validating monotonicity or range. If corrupted metadata points at the wrong bytes, std::lower_bound can miss a real hit and skip a mark containing matching rows. Validate decoded ids are strictly increasing and inside [delta_base + 1, lb.lastDocIdOf(lo)], and throw before caching.
[src/Storages/MergeTree/ProjectionIndex/PostingListCursor.cpp:688] iterateLargeBlock still detects invalid decoded doc ids but only logs, advances the stream, and calls on_decoded with the bad buffer. Those ids drive direct-read virtual columns and materialization, so corrupted postings can return wrong rows. Mirror loadPackedBlock: throw INCORRECT_DATA before pst_decode_buf->advance or on_decoded.
[src/Storages/MergeTree/ProjectionIndex/PositionCursor.cpp:385] the overflow guard still allows last_position + delta + 1 == PositionCursor::NO_MORE_POSITIONS (UINT32_MAX). PhraseCursor treats that sentinel as end-of-doc, so corrupted .pos data can turn a decoded position into a false negative instead of an exception. Require the computed position to be < PositionCursor::NO_MORE_POSITIONS before storing it.
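
As a rough illustration of the first two blockers, the bounds and count checks could look something like the sketch below. This is not the PR's code. The helper names `checkedPosBlockOffset` and `checkedLargeBlockRangeSize` are hypothetical, `IncorrectData` is a stand-in for a ClickHouse exception with the INCORRECT_DATA error code, and the convention that block 0 starts at byte offset 0 is an assumption. The expected large-block count is taken as a parameter because how it is derived from doc_count and posting_list_block_size is not shown in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in for a ClickHouse Exception carrying ErrorCodes::INCORRECT_DATA (assumption).
struct IncorrectData : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

// Hypothetical helper: byte offset at which position block `pos_block_idx` starts.
// Assumed layout: block 0 starts at offset 0, block i > 0 starts at pos_cum_bytes[i - 1].
inline uint64_t checkedPosBlockOffset(const std::vector<uint64_t> & pos_cum_bytes, size_t pos_block_idx)
{
    // Validate the index against the stored table before reading from it.
    if (pos_block_idx > pos_cum_bytes.size())
        throw IncorrectData(
            "position block index " + std::to_string(pos_block_idx) + " is outside the byte-offset table of size "
            + std::to_string(pos_cum_bytes.size()));
    return pos_block_idx == 0 ? 0 : pos_cum_bytes[pos_block_idx - 1];
}

// Hypothetical helper: accept the on-disk large-block count only if it matches the
// count the caller derived from trusted metadata, and return count * 2 only after
// checking that the product cannot overflow size_t.
inline size_t checkedLargeBlockRangeSize(uint64_t num_large_blocks, uint64_t expected_large_blocks)
{
    if (num_large_blocks != expected_large_blocks)
        throw IncorrectData(
            "unexpected large-block count " + std::to_string(num_large_blocks) + ", expected "
            + std::to_string(expected_large_blocks));
    if (num_large_blocks > std::numeric_limits<size_t>::max() / 2)
        throw IncorrectData("large-block count overflows when doubled");
    return static_cast<size_t>(num_large_blocks) * 2;
}
```

Both helpers run before any allocation or vector access, so an inconsistent file is rejected rather than read out of bounds or used to size a buffer.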

Tests

⚠️ Add focused corruption tests for oversized or inconsistent num_large_blocks, pos_block_idx beyond the pos_cum_bytes table, non-monotonic/out-of-range decoded doc ids in both precise mark filtering and iterateLargeBlock, and decoded positions equal to NO_MORE_POSITIONS. These are the boundaries that would prove the remaining fail-closed behavior.

Final Verdict

Status: ❌ Block

Minimum required actions: fix the remaining corruption-handling paths above, add focused regression coverage for them, and replace the placeholder changelog/documentation metadata before merge.

rschu1ze · 2025-12-30T10:15:41Z

Hi @amosbird

I know you put a lot of work into this, that this is a draft, and that I might not have all the background information.

Users have a common expectation that full-text search is backed by an inverted text index. Implementing it differently, as a projection index, will mean a lot of explaining, since projections are actually an unusual feature for databases. Functionality- and performance-wise, the inverted index and the projection index will be similar, but of course we'll need to maintain both. Also, a lot of tweaking and tuning went into the existing inverted index (different format versions, tokenizers, functions). A projection index means re-engineering a lot of this.

I'm of course happy to hear some good reasons why ClickHouse should also offer a projection index, but for now I would vote for not merging this one.

amosbird · 2025-12-30T16:48:34Z

Hi @rschu1ze, thanks for the feedback. Let me try to clarify a few points regarding the projection index text PR and its motivation.

Projection index turns index structures into first-class, typed columns.
The core innovation is not a new indexing algorithm, but a change in abstraction: index data is modeled as regular columns that the engine can natively understand and operate on.

This enables cross-layer reuse between the engine and indexes. Engine-level optimizations (e.g. ColumnString FSST encoding, primary key layout improvements, string serialization) automatically apply to index data, while optimizations introduced for index-specific column types (such as PostingList) become available to the engine once expressed through the column and type system.

It also allows index structures to act as reusable analytical primitives, not just lookup accelerators. For example, a PostingList column introduced for text projections can also support union-heavy analytical workloads more efficiently than bitmap-based approaches (see e.g. https://dl.acm.org/doi/10.1145/3035918.3064007), achieving benefits for both indexing and query execution.

Finally, because projection index data is stored as regular columns, it is directly observable via SQL, making index behavior transparent and enabling easier introspection, debugging, and iteration of future index designs.

Regarding the concern that functionality- and performance-wise the inverted index and the projection index would be similar and both would need to be maintained:

First, while this PR could serve as a foundation for a production-ready text index in the future, its primary goal is to demonstrate the projection-based indexing concept and its value. Similar to my earlier performance-focused PR #81944, this work should be viewed as a conceptual and architectural demo. The implementation in this PR will be split into smaller, meaningful components and submitted incrementally.

From a maintenance perspective, a projection-based implementation is in fact well aligned with text indexing needs in several ways. It is naturally part-level (unlike skip indices which are granule-level), it already benefits from projection merge and mutation support (which the current text index merge logic partially relies on as well), and it can be easily extended to support Lucene-like payloads, i.e. attaching additional data to index entries. In this sense, some recent optimizations in the text index—such as direct-read behavior—are closer to materialized expression semantics, which are a natural fit for projections.

The intent here is not to introduce long-term maintenance overhead. Implementing text-index-related capabilities on top of projections requires only minimal adaptation, rather than maintaining a parallel indexing subsystem. In parallel, our team continues to actively optimize the existing text index implementation (e.g. #92871), and this work is orthogonal to the projection-based approach. At the same time, we are also investing in additional index types, such as key–value, vector, and spatial indexes.

rschu1ze · 2025-12-30T17:35:18Z

@amosbird Ah, it makes more sense to me now, thanks.

UnamedRus · 2026-01-06T12:51:06Z

Users have a common expectation that full-text search is backed by an inverted text index.

There is a fairly common feature called a covering index: you can specify extra columns to store alongside the indexed column, so that reading the index alone is sufficient to answer the query.

I can imagine that a projection index could have a "simpler" path to covering this feature.
How hard would it be to implement with the current inverted index implementation, if it were required?

clickhouse-gh · 2026-06-06T10:20:24Z

+            auto & dbuf = pst_stream->decodeBuffer();
+            dbuf.reset();
+            const uint8_t * ptr = dbuf.ptr();
+            if (count == TURBOPFOR_BLOCK_SIZE)


This precise mark-filter path decodes a .pst packed block from offsets supplied by .pidx, but it never validates the decoded doc ids before caching them. If corrupted .pidx points at the wrong bytes, entry->doc_ids can be non-monotonic or outside [delta_base + 1, lb.lastDocIdOf(lo)]; the later lower_bound can then miss a real hit and hasDocInRange returns false, skipping a mark that may contain matching rows.

Please treat the decoded block as untrusted: advance by the decoder-consumed byte count, validate strict monotonicity and range against the packed-block metadata, and throw INCORRECT_DATA instead of caching/using invalid doc ids.
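
A minimal sketch of the requested validation, assuming the decoded ids are available as a `std::vector<uint32_t>`. The function name `validateDecodedDocIds` and the `IncorrectData` type (a stand-in for a ClickHouse exception with INCORRECT_DATA) are hypothetical, and an empty block is treated as invalid here purely for illustration:

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <vector>

// Stand-in for a ClickHouse Exception carrying ErrorCodes::INCORRECT_DATA (assumption).
struct IncorrectData : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

// Hypothetical helper: reject a decoded packed block unless its doc ids are strictly
// increasing and lie inside [delta_base + 1, last_doc_id].
inline void validateDecodedDocIds(const std::vector<uint32_t> & doc_ids, uint32_t delta_base, uint32_t last_doc_id)
{
    if (doc_ids.empty())
        throw IncorrectData("decoded block is empty");

    // First id must be above the delta base, last id must not exceed the block's last doc id.
    if (doc_ids.front() <= delta_base || doc_ids.back() > last_doc_id)
        throw IncorrectData("decoded doc ids are outside the block's range");

    // adjacent_find with greater_equal returns the first pair (a, b) where a >= b,
    // i.e. the first violation of strict monotonicity.
    if (std::adjacent_find(doc_ids.begin(), doc_ids.end(), std::greater_equal<uint32_t>()) != doc_ids.end())
        throw IncorrectData("decoded doc ids are not strictly increasing");
}

// Range probe in the spirit of hasDocInRange: only meaningful on ids that passed the
// validation above, because std::lower_bound requires sorted input.
inline bool hasDocInRange(const std::vector<uint32_t> & doc_ids, uint32_t lo, uint32_t hi)
{
    auto it = std::lower_bound(doc_ids.begin(), doc_ids.end(), lo);
    return it != doc_ids.end() && *it <= hi;
}
```

Running the validation before the block is cached means the binary search only ever sees sorted, in-range ids, so a corrupted block surfaces as an exception rather than as a silently skipped mark.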

clickhouse-gh · 2026-06-06T10:20:24Z

+        bool ok = true;
+        for (UInt32 vi = 1; vi < count && ok; ++vi)
+            ok = (decode_buf[vi] > decode_buf[vi - 1]);
+        if (!ok || decode_buf[0] <= delta_base || decode_buf[count - 1] > lb.lastDocIdOf(block_idx))


This detects corrupted decoded doc ids, but only logs and then continues with decode_buf as if it were valid. For projection text index direct reads, those decoded ids drive the virtual-column result, so non-monotonic or out-of-range data can return wrong rows instead of reporting a corrupted index. The same pattern exists in iterateLargeBlock below.

Please make this a hard runtime validation: throw INCORRECT_DATA (or fall back to a safe materialization path that still validates) before decode_buf is used.

clickhouse-gh · 2026-06-10T12:16:35Z

+    }
+    else
+    {
+        abs_pos = doc_state.last_position + delta + 1;


last_position + delta + 1 is done in UInt32, so corrupted .pos data can wrap an absolute position back to a small value or to NO_MORE_POSITIONS. PhraseCursor treats that sentinel as end-of-doc and otherwise assumes positions are increasing, so this can turn corrupted phrase data into false negatives/positives instead of an INCORRECT_DATA exception.

Please compute the next absolute position in UInt64, require it to be greater than the previous position and < NO_MORE_POSITIONS, and throw on overflow or non-monotonic deltas.
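
One way the widened computation could be written is sketched below, covering only the non-first-position branch shown in the diff. `nextAbsolutePosition` and `IncorrectData` are hypothetical names, and `NO_MORE_POSITIONS` is taken to be UINT32_MAX as stated in the review summary:

```cpp
#include <cstdint>
#include <limits>
#include <stdexcept>

// Stand-in for a ClickHouse Exception carrying ErrorCodes::INCORRECT_DATA (assumption).
struct IncorrectData : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

// Sentinel that the phrase cursor interprets as end-of-doc.
constexpr uint32_t NO_MORE_POSITIONS = std::numeric_limits<uint32_t>::max();

// Hypothetical helper: decode the next absolute position from a stored delta.
// The sum is computed in 64 bits, so it cannot wrap around.
inline uint32_t nextAbsolutePosition(uint32_t last_position, uint32_t delta)
{
    const uint64_t next = static_cast<uint64_t>(last_position) + delta + 1;

    // Rejects both 32-bit overflow and a value equal to the sentinel.
    if (next >= NO_MORE_POSITIONS)
        throw IncorrectData("decoded position overflows or collides with NO_MORE_POSITIONS");

    return static_cast<uint32_t>(next);
}
```

Because the sum is at least `last_position + 1` in 64-bit arithmetic, any result that passes the range check is also strictly greater than the previous position, so a separate monotonicity test is not needed in this formulation.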

…rojection

clickhouse-gh · 2026-06-12T11:53:01Z

LLVM Coverage Report

Changed lines: Changed C/C++ lines covered by tests: 3424/4646 (73.70%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 31 line(s) · Uncovered code

Full report · Diff report

clickhouse-gh Bot added the pr-improvement (Pull request with some product improvements) and submodule changed (At least one submodule changed in this PR) labels Dec 29, 2025

amosbird force-pushed the projection-index-text-squash branch 6 times, most recently from bf78dd7 to 52bb372 Compare December 30, 2025 07:41

amosbird force-pushed the projection-index-text-squash branch 2 times, most recently from 99e8b67 to cc56347 Compare January 5, 2026 01:50

amosbird force-pushed the projection-index-text-squash branch 4 times, most recently from 069a41a to a75bcec Compare January 10, 2026 22:17

amosbird force-pushed the projection-index-text-squash branch 2 times, most recently from 52098a9 to b470cbb Compare January 14, 2026 14:41

amosbird force-pushed the projection-index-text-squash branch 6 times, most recently from 0643c1e to 760931a Compare February 4, 2026 15:11

amosbird marked this pull request as ready for review February 5, 2026 00:55

amosbird marked this pull request as draft February 9, 2026 09:39

amosbird force-pushed the projection-index-text-squash branch from db1abde to a0245d4 Compare April 2, 2026 11:15