Add embedded documentation and a system table for data skipping index types #106186
… types

There was no system table exposing the available data skipping index types (`system.data_skipping_indices` lists per-table index instances, not types). This adds `system.data_skipping_index_types`, and attaches the shared `Documentation` struct (introduced for table engines) to skipping index types.

- `MergeTreeIndexFactory::registerCreator` now takes a final `Documentation` argument, stored in a per-type map, with a `getDocumentation` accessor.
- The new `system.data_skipping_index_types` table exposes `name` and the embedded documentation columns `description`, `syntax`, `examples`, `introduced_in`, and `related`.
- All index types (`minmax`, `set`, `ngrambf_v1`, `tokenbf_v1`, `sparse_grams`, `bloom_filter`, `vector_similarity`, `text`, `hypothesis`) get a description and syntax.

This is a follow-up to the embedded-documentation changes for table engines, database engines, data types, formats, dictionary layouts/sources, and aggregate function combinators, and reuses the `Documentation` struct from `src/Common/Documentation.h`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
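The registration change described above can be pictured with a minimal sketch. Everything here is an assumption made for illustration: `IndexTypeRegistry` and `makeExampleRegistry` are hypothetical names, the `Documentation` struct is a two-field stand-in for the one in `src/Common/Documentation.h`, the creator callback is left out, and the description texts are placeholders rather than the strings added by this PR.

```cpp
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the shared Documentation struct. The real one is
// in src/Common/Documentation.h; only two of its fields are modelled here.
struct Documentation
{
    std::string description;
    std::string syntax;
};

// Simplified registry (an assumption, not MergeTreeIndexFactory itself).
class IndexTypeRegistry
{
public:
    // Documentation is the final argument, as in the described registerCreator
    // change. The creator callback itself is omitted from this sketch.
    void registerCreator(const std::string & name, Documentation doc)
    {
        const bool inserted = docs.emplace(name, std::move(doc)).second;
        if (!inserted)
            throw std::logic_error("index type already registered: " + name);
    }

    // Accessor for the per-type documentation; throws for unknown types.
    const Documentation & getDocumentation(const std::string & name) const
    {
        const auto it = docs.find(name);
        if (it == docs.end())
            throw std::out_of_range("unknown index type: " + name);
        return it->second;
    }

    // Names in sorted order, e.g. to fill the `name` column of a system table.
    std::vector<std::string> names() const
    {
        std::vector<std::string> result;
        for (const auto & entry : docs)
            result.push_back(entry.first);
        return result;
    }

private:
    std::map<std::string, Documentation> docs;
};

// Builds a registry with two placeholder entries (texts are illustrative only).
inline IndexTypeRegistry makeExampleRegistry()
{
    IndexTypeRegistry registry;
    registry.registerCreator("minmax", {"placeholder description", "minmax"});
    registry.registerCreator("set", {"placeholder description", "set(max_rows)"});
    return registry;
}
```

Keeping the documentation in a map keyed by type name means a system table can be filled by iterating `names()` and calling `getDocumentation` for each row, without touching the index implementations.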
Three of the embedded `syntax`/`description` strings for data skipping index types described declarations that the validators would reject or referenced functions that do not exist:

- `sparse_grams`: `bloomFilterIndexTextValidator` requires 5 or 6 arguments (two or three tokenizer parameters before the bloom filter parameters), but the string showed a 4-argument form. Use `sparse_grams(min_ngram_length, max_ngram_length[, min_cutoff_length], size_in_bytes, num_hash_functions, seed)`.
- `vector_similarity`: `vectorSimilarityIndexValidator` requires the `dimensions` argument (3 or 6 arguments total), but the string omitted it. Add `dimensions` after `distance_function`.
- `text`: the description referenced `searchAny` and `searchAll`, which are not ClickHouse functions. Reference the actual text-search functions `hasToken`, `hasAnyTokens`, `hasAllTokens`, and `hasPhrase`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
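The argument-count constraints quoted in this commit message can be written down as a small lookup, which makes it easy to see why a 4-argument `sparse_grams` form is rejected. This is only a sketch: `allowedArgumentCounts` and `hasValidArgumentCount` are hypothetical helpers, not the ClickHouse validators, and the counts are the ones stated above (5 or 6 for `sparse_grams`, 3 or 6 for `vector_similarity`).

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Argument counts taken from the commit message above: sparse_grams accepts
// 5 or 6 arguments, vector_similarity accepts 3 or 6. The table and helper
// are hypothetical and are not the ClickHouse validators.
inline const std::map<std::string, std::set<std::size_t>> & allowedArgumentCounts()
{
    static const std::map<std::string, std::set<std::size_t>> counts = {
        {"sparse_grams", {5, 6}},
        {"vector_similarity", {3, 6}},
    };
    return counts;
}

// Returns true when num_args is an accepted count for index_type.
// Types missing from the table are rejected in this sketch.
inline bool hasValidArgumentCount(const std::string & index_type, std::size_t num_args)
{
    const auto & counts = allowedArgumentCounts();
    const auto it = counts.find(index_type);
    return it != counts.end() && it->second.count(num_args) > 0;
}
```

For example, `hasValidArgumentCount("sparse_grams", 4)` is false, which corresponds to the documented form that the validator would have rejected.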
Address review feedback on the regression for `system.data_skipping_index_types`:

- Assert that no registered index type exposes an empty `description` or `syntax`, instead of checking only a handful of hard-coded names.
- Add focused checks for the argument-sensitive syntax strings: `sparse_grams` must mention `min_ngram_length`/`max_ngram_length`, and `vector_similarity` must include the required `dimensions` argument.
- Guard the `vector_similarity` check so it passes whether or not that type is registered in the current build (it requires USearch).

Also list `system.data_skipping_index_types` in `02117_show_create_table_system` as suggested in review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
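The logic of these checks can be sketched in C++, although the actual regression test is not reproduced here. The names `IndexTypeDoc`, `typesWithMissingDocs`, and `syntaxMentionsIfRegistered` are hypothetical, and the map stands in for the rows of `system.data_skipping_index_types`.

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical row model: one entry per registered index type.
struct IndexTypeDoc
{
    std::string description;
    std::string syntax;
};

using IndexTypeDocs = std::map<std::string, IndexTypeDoc>;

// Returns the names of all types whose description or syntax is empty.
// The invariant from the review is that this list is empty.
inline std::vector<std::string> typesWithMissingDocs(const IndexTypeDocs & docs)
{
    std::vector<std::string> missing;
    for (const auto & entry : docs)
        if (entry.second.description.empty() || entry.second.syntax.empty())
            missing.push_back(entry.first);
    return missing;
}

// Guarded check: passes when the type is not registered in this build,
// otherwise requires the syntax string to contain `needle`.
inline bool syntaxMentionsIfRegistered(
    const IndexTypeDocs & docs, const std::string & type, const std::string & needle)
{
    const auto it = docs.find(type);
    if (it == docs.end())
        return true; // type absent from this build: nothing to check
    return it->second.syntax.find(needle) != std::string::npos;
}
```

Checking every registered type instead of a fixed list means a newly added index type without documentation fails the check automatically, while the guard keeps the `vector_similarity` assertion valid on builds without USearch.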
Strengthened the test in …

Verified locally on a build with USearch: …
The failures of … They were introduced by #104690 ("Add … #104690 was merged in violation of the ClickHouse team rules: its own CI already showed these two tests failing (10 times between May 12 and June 1) before it was merged.

Please update your branch to pick up the revert; the tests should pass again.
LLVM Coverage Report

Changed C/C++ lines covered by tests: 86/97 (88.66%). Lost baseline coverage (was covered on master, now uncovered in this PR): 2 line(s).

Part of the series adding embedded, runtime-introspectable documentation to ClickHouse component registries (table engines #106177, database engines #106178, data types #106180, formats #106181, dictionary layouts #106182, dictionary sources #106184, aggregate function combinators #106185). This one covers data skipping index types — and, since there was no system table for them, adds one.
What it does:
- Adds a new `system.data_skipping_index_types` table listing every data skipping index type with embedded documentation columns `description`, `syntax`, `examples`, `introduced_in`, and `related`. (This complements `system.data_skipping_indices`, which lists the index instances defined on existing tables.)
- Passes the shared `Documentation` struct (`src/Common/Documentation.h`) through `MergeTreeIndexFactory::registerCreator`, stores it in a per-type map, and adds a `getDocumentation` accessor.
- Documents all index types: `minmax`, `set`, `ngrambf_v1`, `tokenbf_v1`, `sparse_grams`, `bloom_filter`, `vector_similarity`, `text`, `hypothesis`.

Note: this PR shares `src/Common/Documentation.h`/`.cpp` with the earlier PRs in the series (trivial add/add on merge).

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Added a new `system.data_skipping_index_types` table that lists the available data skipping index types together with embedded documentation (`description`, `syntax`, `examples`, `introduced_in`, `related`).

Documentation entry for user-facing changes

… `system.data_skipping_index_types` and a new `system.data_skipping_index_types` system-table page).

Version info

26.6.1.372