Add embedded documentation and a system table for data skipping index types by alexey-milovidov · Pull Request #106186 · ClickHouse/ClickHouse · GitHub
Add embedded documentation and a system table for data skipping index types#106186

Merged
alexey-milovidov merged 5 commits into master from skipping-index-documentation on Jun 4, 2026

@alexey-milovidov commented May 31, 2026

Part of the series adding embedded, runtime-introspectable documentation to ClickHouse component registries (table engines #106177, database engines #106178, data types #106180, formats #106181, dictionary layouts #106182, dictionary sources #106184, aggregate function combinators #106185). This one covers data skipping index types — and, since there was no system table for them, adds one.

What it does:

  • Adds a new system.data_skipping_index_types table listing every data skipping index type with embedded documentation columns description, syntax, examples, introduced_in, and related. (This complements system.data_skipping_indices, which lists the index instances defined on existing tables.)
  • Threads the shared Documentation struct (src/Common/Documentation.h) through MergeTreeIndexFactory::registerCreator, stores it in a per-type map, and adds a getDocumentation accessor.
  • Populates all index types: minmax, set, ngrambf_v1, tokenbf_v1, sparse_grams, bloom_filter, vector_similarity, text, hypothesis.

Note: this PR shares src/Common/Documentation.h/.cpp with the earlier PRs in the series (trivial add/add on merge).
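
The factory change in the second bullet above can be pictured with a minimal sketch. This is not the code from the PR: `IndexTypeRegistry` is a hypothetical, simplified stand-in for `MergeTreeIndexFactory`, the `Creator` signature is a placeholder, and the field types of `Documentation` are assumed from the column names listed above. The description and syntax strings are illustrative, not the ones embedded by the PR.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Assumed shape of the shared documentation struct: one field per
// documentation column named in the PR description. Field types are guesses.
struct Documentation
{
    std::string description;
    std::string syntax;
    std::string examples;
    std::string introduced_in;
    std::vector<std::string> related;
};

// Hypothetical, simplified registry; the real factory creates index objects.
class IndexTypeRegistry
{
public:
    using Creator = std::function<std::string()>;  // placeholder signature

    // The documentation is the final argument and is stored per type name.
    void registerCreator(const std::string & name, Creator creator, Documentation doc)
    {
        if (!creators.emplace(name, std::move(creator)).second)
            throw std::logic_error("index type '" + name + "' is already registered");
        documentation.emplace(name, std::move(doc));
    }

    // Accessor that a system table could call once per registered type.
    const Documentation & getDocumentation(const std::string & name) const
    {
        auto it = documentation.find(name);
        if (it == documentation.end())
            throw std::out_of_range("unknown index type '" + name + "'");
        return it->second;
    }

    // Names in sorted order, one per row of the listing.
    std::vector<std::string> getAllNames() const
    {
        std::vector<std::string> names;
        for (const auto & entry : documentation)
            names.push_back(entry.first);
        return names;
    }

private:
    std::map<std::string, Creator> creators;
    std::map<std::string, Documentation> documentation;
};

// Registers two index types with illustrative documentation strings.
IndexTypeRegistry makeExampleRegistry()
{
    IndexTypeRegistry registry;
    registry.registerCreator("minmax", [] { return std::string("minmax"); },
        Documentation{"Illustrative text: keeps value extremes per block.", "minmax", "", "", {}});
    registry.registerCreator("set", [] { return std::string("set"); },
        Documentation{"Illustrative text: keeps distinct values per block.", "set(max_rows)", "", "", {"minmax"}});
    return registry;
}
```

A table-filling routine would then iterate over `getAllNames()` and emit one row per name from `getDocumentation(name)`. Keeping the documentation in a map separate from the creators means the lookup never has to construct an index.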

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Added a new system.data_skipping_index_types table that lists the available data skipping index types together with embedded documentation (description, syntax, examples, introduced_in, related).

Documentation entry for user-facing changes

  • Documentation is provided as part of this change (embedded documentation in system.data_skipping_index_types and a new system.data_skipping_index_types system-table page).

Version info

  • Merged into: 26.6.1.372

… types

There was no system table exposing the available data skipping index types
(`system.data_skipping_indices` lists per-table index instances, not types).
This adds `system.data_skipping_index_types`, and attaches the shared
`Documentation` struct (introduced for table engines) to skipping index types.

- `MergeTreeIndexFactory::registerCreator` now takes a final `Documentation`
  argument, stored in a per-type map, with a `getDocumentation` accessor.
- The new `system.data_skipping_index_types` table exposes `name` and the
  embedded documentation columns `description`, `syntax`, `examples`,
  `introduced_in`, and `related`.
- All index types (`minmax`, `set`, `ngrambf_v1`, `tokenbf_v1`, `sparse_grams`,
  `bloom_filter`, `vector_similarity`, `text`, `hypothesis`) get a description
  and syntax.

This is a follow-up to the embedded-documentation changes for table engines,
database engines, data types, formats, dictionary layouts/sources, and
aggregate function combinators, and reuses the `Documentation` struct from
`src/Common/Documentation.h`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented May 31, 2026

@clickhouse-gh Bot added the pr-improvement label (Pull request with some product improvements) May 31, 2026
Comment thread src/Storages/MergeTree/MergeTreeIndices.cpp Outdated
Comment thread src/Storages/MergeTree/MergeTreeIndices.cpp Outdated
Comment thread docs/en/operations/system-tables/data_skipping_index_types.md
Comment thread src/Storages/MergeTree/MergeTreeIndices.cpp Outdated
Three of the embedded `syntax`/`description` strings for data skipping
index types described declarations that the validators would reject or
referenced functions that do not exist:

- `sparse_grams`: `bloomFilterIndexTextValidator` requires 5 or 6
  arguments (two or three tokenizer parameters before the bloom filter
  parameters), but the string showed a 4-argument form. Use
  `sparse_grams(min_ngram_length, max_ngram_length[, min_cutoff_length], size_in_bytes, num_hash_functions, seed)`.
- `vector_similarity`: `vectorSimilarityIndexValidator` requires the
  `dimensions` argument (3 or 6 arguments total), but the string omitted
  it. Add `dimensions` after `distance_function`.
- `text`: the description referenced `searchAny` and `searchAll`, which
  are not ClickHouse functions. Reference the actual text-search
  functions `hasToken`, `hasAnyTokens`, `hasAllTokens`, and `hasPhrase`.
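
The argument-count mismatches described above can be illustrated with a small sketch. It models only the two counts stated in this commit message (5 or 6 for `sparse_grams`, 3 or 6 for `vector_similarity`); the function names are hypothetical and the real `bloomFilterIndexTextValidator` and `vectorSimilarityIndexValidator` do considerably more than count arguments.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <string>

// Accepted argument counts, taken from the commit message above.
// Only the two index types whose counts are stated there are modeled.
const std::map<std::string, std::set<std::size_t>> & acceptedArgumentCounts()
{
    static const std::map<std::string, std::set<std::size_t>> counts = {
        {"sparse_grams", {5, 6}},
        {"vector_similarity", {3, 6}},
    };
    return counts;
}

// Returns true if a declaration with `count` arguments would pass the
// modeled count check. Types missing from the table return false only
// because this sketch does not model them.
bool hasAcceptedArgumentCount(const std::string & index_type, std::size_t count)
{
    const auto & counts = acceptedArgumentCounts();
    auto it = counts.find(index_type);
    return it != counts.end() && it->second.count(count) > 0;
}
```

Under this model, the 4-argument `sparse_grams` form from the original string is rejected, while the corrected form with two or three tokenizer parameters followed by three bloom filter parameters is accepted.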

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@jkartseva jkartseva self-assigned this Jun 1, 2026

@jkartseva jkartseva left a comment

alexey-milovidov and others added 2 commits June 3, 2026 04:18
Address review feedback on the regression test for
`system.data_skipping_index_types`:
- Assert that no registered index type exposes an empty `description` or
  `syntax`, instead of checking only a handful of hard-coded names.
- Add focused checks for the argument-sensitive syntax strings:
  `sparse_grams` must mention `min_ngram_length`/`max_ngram_length`, and
  `vector_similarity` must include the required `dimensions` argument.
- Guard the `vector_similarity` check so it passes whether or not that
  type is registered in the current build (it requires USearch).

Also list `system.data_skipping_index_types` in
`02117_show_create_table_system` as suggested in review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Strengthened the test in 30f8c84ae3e:

  • Added a check that returns any registered index type with an empty description or syntax (expected output is empty), so it now covers all exposed rows rather than five hard-coded names.
  • Added focused checks for the argument-sensitive syntax strings: sparse_grams must mention both min_ngram_length and max_ngram_length, and vector_similarity must include the required dimensions argument.
  • The vector_similarity check uses countIf(syntax NOT LIKE '%dimensions%') = 0, which yields 1 whether or not the type is registered in the current build (it requires USearch), so the test passes in builds without it and still validates the syntax when it is present.

Verified locally on a build with USearch: vector_similarity is registered and its syntax contains dimensions, and all rows have non-empty description/syntax.
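
The reason the `countIf(...) = 0` guard passes in both kinds of build is that counting violations over zero rows gives zero. A minimal sketch of that logic follows; the row struct and function name are hypothetical, and the filter on the type name is an assumption, since the full query is not quoted in this comment.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical in-memory model of two columns of the system table.
struct IndexTypeRow
{
    std::string name;
    std::string syntax;
};

// Models `countIf(syntax NOT LIKE '%required%') = 0` restricted to rows
// with the given name (the restriction is assumed, not quoted from the test).
// A row is a violation when its syntax lacks the required substring.
bool syntaxCheckPasses(const std::vector<IndexTypeRow> & rows,
                       const std::string & name,
                       const std::string & required)
{
    auto violations = std::count_if(rows.begin(), rows.end(),
        [&](const IndexTypeRow & row)
        {
            return row.name == name && row.syntax.find(required) == std::string::npos;
        });
    return violations == 0;  // also true when no row has this name
}
```

With no `vector_similarity` row present there are no violations, so the check passes. With the row present, it passes only if the syntax string contains `dimensions`.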

@alexey-milovidov

The failures of 00175_obfuscator_schema_inference and 00096_obfuscator_save_load in Stateless tests (amd_tsan, parallel) are NOT caused by this PR.

They were introduced by #104690 ("Add UntrackedMemory asynchronous metric"), which made clickhouse-obfuscator abort (SIGABRT) on process teardown: the query results are correct, but the Aborted line on stderr fails the test. #104690 has now been reverted (#106365).

#104690 was merged in violation of the ClickHouse team rules: its own CI already showed these two tests failing (10 times between May 12 and June 1) before it was merged.

Please update your branch to pick up the revert; the tests should pass again.

@clickhouse-gh

clickhouse-gh Bot commented Jun 3, 2026

LLVM Coverage Report

| Metric    | Baseline | Current | Δ      |
|-----------|----------|---------|--------|
| Lines     | 84.40%   | 84.40%  | +0.00% |
| Functions | 92.40%   | 92.40%  | +0.00% |
| Branches  | 77.00%   | 77.00%  | +0.00% |

Changed lines: Changed C/C++ lines covered by tests: 86/97 (88.66%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 2 line(s) · Uncovered code

@alexey-milovidov alexey-milovidov merged commit 0579d4b into master Jun 4, 2026
165 of 167 checks passed
@alexey-milovidov alexey-milovidov deleted the skipping-index-documentation branch June 4, 2026 06:28
@robot-ch-test-poll4 added the pr-synced-to-cloud label (The PR is synced to the cloud repo) Jun 4, 2026