WIP: Projection Index Text#93114
Conversation
bf78dd7 to
52bb372
Compare
|
Hi @amosbird I know you put in a lot of work into this, that this is a draft, and I might not have all background information. Users have a common expectation that full-text search is backed by a inverted text index. Implementing it differently as projection index will mean a lot explaining - projections are actually an unusual feature for databases. Funtionality- and performance-wise, the inverted index and the projection index will be similar but of course, we'll need to maintain both. Also, a lot of tweaking and tuning went into the existing inverted index (different format versions, tokenizers, functions). A projection index means re-engineering a lot of this. I'm of course happy to hear some good reasons why ClickHouse should offer also a projection index, but for now I would vote for not merging this one. |
|
Hi @rschu1ze, thanks for the feedback. Let me try to clarify a few points regarding the projection index text PR and its motivation. Projection index turns index structures into first-class, typed columns. This enables cross-layer reuse between the engine and indexes. Engine-level optimizations (e.g. It also allows index structures to act as reusable analytical primitives, not just lookup accelerators. For example, a Finally, because projection index data is stored as regular columns, it is directly observable via SQL, making index behavior transparent and enabling easier introspection, debugging, and iteration of future index designs. Regarding the concern that functionality- and performance-wise the inverted index and the projection index would be similar and both would need to be maintained: First, while this PR could serve as a foundation for a production-ready text index in the future, its primary goal is to demonstrate the projection-based indexing concept and its value. Similar to my earlier performance-focused PR #81944, this work should be viewed as a conceptual and architectural demo. The implementation in this PR will be split into smaller, meaningful components and submitted incrementally. From a maintenance perspective, a projection-based implementation is in fact well aligned with text indexing needs in several ways. It is naturally part-level (unlike skip indices which are granule-level), it already benefits from projection merge and mutation support (which the current text index merge logic partially relies on as well), and it can be easily extended to support Lucene-like payloads, i.e. attaching additional data to index entries. In this sense, some recent optimizations in the text index—such as direct-read behavior—are closer to materialized expression semantics, which are a natural fit for projections. The intent here is not to introduce long-term maintenance overhead. Implementing text-index-related capabilities on top of projections requires only minimal adaptation, rather than maintaining a parallel indexing subsystem. In parallel, our team continues to actively optimize the existing text index implementation (e.g. #92871), and this work is orthogonal to the projection-based approach. At the same time, we are also investing in additional index types, such as key–value, vector, and spatial indexes. |
|
@amosbird Ah, it makes more sense to me now, thanks. |
99e8b67 to
cc56347
Compare
There is somewhat common feature called covering index. (basically, when you can specify extra columns for index column, so just reading index will be sufficient for answering query) I can imagine, that projection index can have "simpler" expansion to cover this feature. |
069a41a to
a75bcec
Compare
52098a9 to
b470cbb
Compare
0643c1e to
760931a
Compare
db1abde to
a0245d4
Compare
36a61cf to
b4bb939
Compare
| auto & dbuf = pst_stream->decodeBuffer(); | ||
| dbuf.reset(); | ||
| const uint8_t * ptr = dbuf.ptr(); | ||
| if (count == TURBOPFOR_BLOCK_SIZE) |
There was a problem hiding this comment.
This precise mark-filter path decodes a .pst packed block from offsets supplied by .pidx, but it never validates the decoded doc ids before caching them. If corrupted .pidx points at the wrong bytes, entry->doc_ids can be non-monotonic or outside [delta_base + 1, lb.lastDocIdOf(lo)]; the later lower_bound can then miss a real hit and hasDocInRange returns false, skipping a mark that may contain matching rows.
Please treat the decoded block as untrusted: advance by the decoder-consumed byte count, validate strict monotonicity and range against the packed-block metadata, and throw INCORRECT_DATA instead of caching/using invalid doc ids.
| bool ok = true; | ||
| for (UInt32 vi = 1; vi < count && ok; ++vi) | ||
| ok = (decode_buf[vi] > decode_buf[vi - 1]); | ||
| if (!ok || decode_buf[0] <= delta_base || decode_buf[count - 1] > lb.lastDocIdOf(block_idx)) |
There was a problem hiding this comment.
This detects corrupted decoded doc ids, but only logs and then continues with decode_buf as if it were valid. For projection text index direct reads, those decoded ids drive the virtual-column result, so non-monotonic or out-of-range data can return wrong rows instead of reporting a corrupted index. The same pattern exists in iterateLargeBlock below.
Please make this a hard runtime validation: throw INCORRECT_DATA (or fall back to a safe materialization path that still validates) before decode_buf is used.
05c3af7 to
77c63fd
Compare
77c63fd to
d8ebab6
Compare
| } | ||
| else | ||
| { | ||
| abs_pos = doc_state.last_position + delta + 1; |
There was a problem hiding this comment.
last_position + delta + 1 is done in UInt32, so corrupted .pos data can wrap an absolute position back to a small value or to NO_MORE_POSITIONS. PhraseCursor treats that sentinel as end-of-doc and otherwise assumes positions are increasing, so this can turn corrupted phrase data into false negatives/positives instead of an INCORRECT_DATA exception.
Please compute the next absolute position in UInt64, require it to be greater than the previous position and < NO_MORE_POSITIONS, and throw on overflow or non-monotonic deltas.
LLVM Coverage ReportChanged lines: Changed C/C++ lines covered by tests: 3424/4646 (73.70%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 31 line(s) · Uncovered code |

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
TODO
Documentation entry for user-facing changes