Add embedded documentation and a system table for dictionary sources by alexey-milovidov · Pull Request #106184 · ClickHouse/ClickHouse · GitHub

Add embedded documentation and a system table for dictionary sources#106184

Merged
alexey-milovidov merged 12 commits into master from dictionary-source-documentation
Jun 9, 2026
@alexey-milovidov alexey-milovidov commented May 31, 2026

Member

Sixth PR in the series that adds embedded, runtime-introspectable documentation to ClickHouse component registries (after table engines #106177, database engines #106178, data types #106180, formats #106181, and dictionary layouts #106182). This one covers dictionary sources — and, since there was no system table for them, adds one.

What it does:

  • Adds a new system.dictionary_sources table listing every dictionary source with embedded documentation columns description, syntax, examples, introduced_in, and related.
  • Threads the shared Documentation struct (src/Common/Documentation.h) through DictionarySourceFactory::registerSource, stores it in a per-source map, and adds a getDocumentation accessor.
  • Populates all 16 sources: clickhouse, mysql, postgresql, mongodb, redis, cassandra, file, executable, executable_pool, http, library, odbc, jdbc, ytsaurus, null, yamlregexptree.
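
The registration change described above can be pictured with a minimal, self-contained sketch. It is not the actual `DictionarySourceFactory` code: `DocumentationSketch`, `SourceSketch`, `SourceFactorySketch`, `getAllSourceNames`, and `makeExampleFactory` are hypothetical names, the field types are assumed, and the description text is a placeholder. Only the idea (documentation passed as the final argument, stored in a per-source map, read back through an accessor) comes from the PR text.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Simplified stand-in for the shared Documentation struct. The field names
// follow the columns listed in this PR; the field types are assumptions.
struct DocumentationSketch
{
    std::string description;
    std::string syntax;
    std::string examples;
    std::string introduced_in;
    std::vector<std::string> related;
};

// Placeholder source interface, only here to give the creator a return type.
struct SourceSketch
{
    virtual ~SourceSketch() = default;
};

class SourceFactorySketch
{
public:
    using Creator = std::function<std::unique_ptr<SourceSketch>()>;

    // Documentation is the final argument and is stored next to the creator.
    void registerSource(const std::string & name, Creator creator, DocumentationSketch doc)
    {
        if (creators.count(name))
            throw std::logic_error("source '" + name + "' is already registered");
        creators.emplace(name, std::move(creator));
        documentation.emplace(name, std::move(doc));
    }

    // Accessor for the per-source documentation map.
    const DocumentationSketch & getDocumentation(const std::string & name) const
    {
        auto it = documentation.find(name);
        if (it == documentation.end())
            throw std::out_of_range("unknown source '" + name + "'");
        return it->second;
    }

    // Names in sorted order, roughly what a system table could iterate over.
    std::vector<std::string> getAllSourceNames() const
    {
        std::vector<std::string> names;
        for (const auto & entry : creators)
            names.push_back(entry.first);
        return names;
    }

private:
    std::map<std::string, Creator> creators;
    std::map<std::string, DocumentationSketch> documentation;
};

// Registers one source with placeholder documentation text.
inline SourceFactorySketch makeExampleFactory()
{
    SourceFactorySketch factory;
    factory.registerSource(
        "null",
        [] { return std::make_unique<SourceSketch>(); },
        DocumentationSketch{"Placeholder description.", "SOURCE(NULL())", "", "", {}});
    assert(factory.getAllSourceNames().size() == 1);
    return factory;
}
```

Under these assumptions, a system table implementation would walk `getAllSourceNames()` and emit one row per name using `getDocumentation(name)`.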

Note on doc embedding: like dictionary layouts, dictionary sources are documented on a single combined reference page rather than one page per source, so this PR uses accurate concise descriptions (with syntax and related) rather than embedding a full per-source markdown page.

Note: this PR shares src/Common/Documentation.h/.cpp with the earlier PRs in the series (trivial add/add on merge).

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Added a new system.dictionary_sources table that lists the available dictionary sources together with embedded documentation (description, syntax, examples, introduced_in, related).

Documentation entry for user-facing changes

  • Documentation is provided as part of this change (embedded documentation in system.dictionary_sources and a new system.dictionary_sources system-table page).

Version info

  • Merged into: 26.6.1.528

There was no system table exposing the available dictionary sources. This adds
`system.dictionary_sources`, and attaches the shared `Documentation` struct
(introduced for table engines) to dictionary sources.

- `DictionarySourceFactory::registerSource` now takes a final `Documentation`
  argument, stored in a per-source map, with a `getDocumentation` accessor.
- The new `system.dictionary_sources` table exposes `name` and the embedded
  documentation columns `description`, `syntax`, `examples`, `introduced_in`,
  and `related`.
- All 16 sources (`clickhouse`, `mysql`, `postgresql`, `mongodb`, `redis`,
  `cassandra`, `file`, `executable`, `executable_pool`, `http`, `library`,
  `odbc`, `jdbc`, `ytsaurus`, `null`, `yamlregexptree`) get a description and
  syntax. Sources share a single combined documentation page, so concise
  descriptions are used.

This is a follow-up to the embedded-documentation changes for table engines,
database engines, data types, formats, and dictionary layouts, and reuses the
`Documentation` struct from `src/Common/Documentation.h`.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented May 31, 2026

Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-improvement Pull request with some product improvements label May 31, 2026
Comment thread src/Dictionaries/YTsaurusDictionarySource.cpp Outdated
Comment thread src/Dictionaries/XDBCDictionarySource.cpp Outdated
alexey-milovidov and others added 3 commits June 1, 2026 09:25
The new `system.dictionary_sources` documentation page linked to the
non-existent `/sql-reference/dictionaries` path, breaking the
`Build docusaurus` job. Repoint it to the existing combined dictionary
sources reference page `/sql-reference/statements/create/dictionary/sources`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The embedded `syntax` for the `ytsaurus` dictionary source used
`http_proxy_url`, but the source reads only `http_proxy_urls` from the
configuration, so a user copying the documented `SOURCE(YTSAURUS(...))`
would get a missing-key exception. Use the real key `http_proxy_urls`
and add a regression check in `04304_dictionary_sources_documentation`
that pins the documented key to the one the source actually reads.
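
The real regression check lives in a stateless test, which is not reproduced here. As a rough illustration of what "pinning the documented key" means, the sketch below checks that a documented `syntax` string mentions a key as a whole identifier, so that `http_proxy_url` does not pass by matching a prefix of `http_proxy_urls`. The helper name `syntaxMentionsKey` and the sample syntax string are hypothetical.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string_view>

// Returns true when `key` occurs in `syntax` as a whole identifier,
// i.e. not as a prefix or suffix of a longer identifier.
inline bool syntaxMentionsKey(std::string_view syntax, std::string_view key)
{
    if (key.empty())
        return false;

    auto is_identifier_char = [](char c)
    {
        return std::isalnum(static_cast<unsigned char>(c)) != 0 || c == '_';
    };

    for (std::size_t pos = syntax.find(key); pos != std::string_view::npos;
         pos = syntax.find(key, pos + 1))
    {
        const std::size_t end = pos + key.size();
        const bool left_ok = pos == 0 || !is_identifier_char(syntax[pos - 1]);
        const bool right_ok = end == syntax.size() || !is_identifier_char(syntax[end]);
        if (left_ok && right_ok)
            return true;
    }
    return false;
}

// Illustrative syntax string; only the key name is taken from the commit message.
inline constexpr std::string_view example_syntax = "SOURCE(YTSAURUS(http_proxy_urls 'placeholder'))";
```

With this helper, the corrected key is found in the sample string and the old, wrong key is not, which is the property the commit message says the test pins down.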

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `jdbc` dictionary source creator unconditionally throws
`SUPPORT_IS_DISABLED`, so listing it in `system.dictionary_sources`
with a description that reads as a usable source is misleading. State
in the embedded description that the source is currently disabled,
pending consistent support for nullable fields.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/Storages/System/StorageSystemDictionarySources.cpp
Comment thread docs/en/operations/system-tables/dictionary_sources.md
alexey-milovidov and others added 3 commits June 3, 2026 00:21
Sources whose support is compiled out (`cassandra`, `mongodb`,
`ytsaurus`, `yamlregexptree`, `mysql`, `postgresql`) are still
registered so that creating such a dictionary throws a helpful
`SUPPORT_IS_DISABLED` exception rather than `UNKNOWN_ELEMENT_IN_CONFIG`.
Before this change, `system.dictionary_sources` still showed a
working-looking description for them, so a user could copy a
`SOURCE(...)` clause that the current build cannot use.

Append a build-time note to the `description` (guarded by the same
`USE_*` macros) so the embedded documentation reflects the disabled
state of the current build, mirroring the existing treatment of the
permanently disabled `jdbc` source.
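
A minimal sketch of the build-time note, assuming a hypothetical `SKETCH_USE_MYSQL` macro in place of the real `USE_*` flags and placeholder wording for both the description and the note:

```cpp
#include <string>

// Hypothetical stand-in for a real USE_* build flag; defaults to "compiled out".
#ifndef SKETCH_USE_MYSQL
#define SKETCH_USE_MYSQL 0
#endif

// Builds the embedded description and appends a note when support is compiled out,
// so the text reflects what the current build can actually do.
inline std::string mysqlSourceDescriptionSketch()
{
    std::string description = "Placeholder description of the source.";
#if !SKETCH_USE_MYSQL
    description += " Note: support for this source is disabled in this build.";
#endif
    return description;
}
```

Because the note is guarded by the same kind of macro that guards the implementation, the description and the behavior of the creator cannot drift apart between builds.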

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/Storages/System/StorageSystemDictionarySources.cpp Outdated
@alexey-milovidov

Member Author

The failures of 00175_obfuscator_schema_inference and 00096_obfuscator_save_load in Stateless tests (amd_tsan, parallel) are NOT caused by this PR.

They were introduced by #104690 ("Add UntrackedMemory asynchronous metric"), which made clickhouse-obfuscator abort (SIGABRT) on process teardown: the query results are correct, but the Aborted line on stderr fails the test. #104690 has now been reverted (#106365).

#104690 was merged in violation of the ClickHouse team rules: its own CI already showed these two tests failing (10 times between May 12 and June 1) before it was merged.

Please update your branch to pick up the revert; the tests should pass again.

alexey-milovidov and others added 3 commits June 3, 2026 00:39
…ocumentation

# Conflicts:
#	src/Storages/System/attachSystemTables.cpp
The `syntax` column of `system.dictionary_sources` shows the structure of
the `SOURCE` clause, but some sources are subject to access control when a
dictionary is created from a DDL query rather than from a server
configuration file. Copying such a row into a `CREATE DICTIONARY` query
could fail at runtime.

Reword the `syntax` column description to clarify it documents the clause
structure (not that it is always permitted as DDL), and add per-source
notes to the affected descriptions:
- `executable` and `executable_pool` cannot be created from DDL at all
  (only from a server configuration file), for security reasons.
- `file`, `library`, and `yamlregexptree` accept DDL only when the path is
  inside the configured safe directory (`user_files` or the dictionaries
  library directory).

Keep the `docs/en/operations/system-tables/dictionary_sources.md` page in
sync with the new `syntax` column description.
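
The restrictions listed above can be summarized in a small decision function. This is only a sketch of the rules as stated in the commit message, not the checks ClickHouse performs: `OriginSketch`, `isInsideDirectory`, and `isSourceAllowedSketch` are hypothetical names, the path check is purely lexical (no symlink resolution), and other gates such as experimental settings are ignored.

```cpp
#include <algorithm>
#include <filesystem>
#include <string>

enum class OriginSketch { ConfigFile, DDL };

// Lexical containment check: true when `path` is `base` or lies below it
// after normalizing "." and ".." components. Symlinks are not resolved.
inline bool isInsideDirectory(const std::filesystem::path & path, const std::filesystem::path & base)
{
    const std::filesystem::path p = path.lexically_normal();
    std::filesystem::path b = base.lexically_normal();
    if (!b.has_filename())
        b = b.parent_path();  // drop the empty element left by a trailing slash
    return std::mismatch(b.begin(), b.end(), p.begin(), p.end()).first == b.end();
}

// Applies the per-source rules described in the commit message.
inline bool isSourceAllowedSketch(
    const std::string & source,
    OriginSketch origin,
    const std::filesystem::path & path,
    const std::filesystem::path & safe_directory)
{
    if (origin == OriginSketch::ConfigFile)
        return true;  // the restrictions apply to DDL-created dictionaries
    if (source == "executable" || source == "executable_pool")
        return false;  // not allowed from DDL at all
    if (source == "file" || source == "library" || source == "yamlregexptree")
        return isInsideDirectory(path, safe_directory);
    return true;  // other sources: no path-based restriction modeled here
}
```

For example, under these assumptions a `file` source pointing at `/data/user_files/../secret.csv` is rejected for DDL when the safe directory is `/data/user_files`, because the normalized path escapes it.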

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Member Author

Picked up the PR:

  • Merged master (was 619 commits behind). The only conflict was the expected add/add in attachSystemTables.cpp against the newly-added data_skipping_index_types attachment line — kept both.
  • Addressed the remaining review thread on the syntax column. The bot correctly noted that several rows are not always usable verbatim as DDL: executable/executable_pool throw DICTIONARY_ACCESS_DENIED for DDL-created dictionaries (allowed only from a server config file), and file/library/yamlregexptree accept DDL only when the path is inside the configured safe directory (user_files / dictionaries library directory). I went with the suggested first option:
    • Reworded the syntax column description to say it documents the structure of the SOURCE clause (not that it is always permitted as DDL), pointing to the per-source description for restrictions.
    • Added concise per-source notes to the affected descriptions.
    • Kept the dictionary_sources.md page in sync.

Built clean and verified locally: system.dictionary_sources shows the updated descriptions/column comment, and 04306_dictionary_sources_documentation matches its reference.

CI was green before the merge.

@alexey-milovidov

Member Author

Merged master (was 375 commits behind) into the branch to refresh CI — clean merge, no conflicts. Rebuilt the two changed translation units (StorageSystemDictionarySources.cpp, DictionarySourceFactory.cpp) clean, and verified the test 04306_dictionary_sources_documentation passes locally (output matches the reference).

The only red was 04103_user_network_bandwidth_throttler, a timing-sensitive network-throttler test that is widely flaky on master (it fails on PR=0 master runs and on dozens of unrelated PRs over the past month with "result differs with reference"). It is unrelated to this documentation change.

@groeneai, please take a look at the flakiness of 04103_user_network_bandwidth_throttler and provide a fix in a separate PR if one isn't already in progress.

Comment thread src/Dictionaries/YTsaurusDictionarySource.cpp
@groeneai

groeneai commented Jun 7, 2026

Contributor

@alexey-milovidov, fix posted as #106680.

Investigation summary. PR #103586 (merged 2026-06-05 by @tiandiwonder) added no-random-settings to remove S3-prefetch randomization as one source of slow natural read rate. CIDB confirms zero master failures on the test in 14 days, but four post-merge sightings on Stateless tests (amd_tsan, parallel, 2/2) (#106222, #102039, #49966, #106184) all show the same read 1 1 0 shape (duration ok, bytes ok, sleep below threshold). All four PRs already include the no-random-settings tag, so the residual flakiness is contention-driven, not random-settings driven.

Mechanism. Throttler::throttle only enters its sleep path when tokens_value < 0, which requires natural rate to exceed max_speed. On heavily-loaded TSAN parallel runners, natural S3 read rate drops to ~0.9 MB/s (sometimes near or below the previous 1 MB/s limit), so the token bucket never goes negative and UserThrottlerSleepMicroseconds stays low.
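
To make the mechanism concrete, here is a simplified token bucket. It is not the code in `src/Common/Throttler.cpp`: `TokenBucketSketch` and `simulateSleepSeconds` are hypothetical, the burst size is assumed equal to one second of `max_speed`, and reads are modeled as one-second chunks at the natural rate. It only illustrates why a natural rate at or below the limit never produces sleep time.

```cpp
#include <algorithm>

// Simplified token bucket: tokens are bytes, refilled at max_speed per second.
struct TokenBucketSketch
{
    double max_speed;  // refill rate, bytes per second
    double max_burst;  // upper bound on the token balance
    double tokens;     // current balance; may go negative

    // Accounts for `bytes` consumed `elapsed_seconds` after the previous call.
    // Returns how long the caller should sleep: zero unless the balance is negative.
    double throttle(double bytes, double elapsed_seconds)
    {
        tokens = std::min(tokens + max_speed * elapsed_seconds, max_burst);
        tokens -= bytes;
        if (tokens >= 0.0)
            return 0.0;
        return -tokens / max_speed;
    }
};

// Simulates `chunks` reads, each taking one second at `natural_rate` bytes per second,
// and returns the total sleep time requested by the bucket.
inline double simulateSleepSeconds(double natural_rate, double max_speed, int chunks)
{
    TokenBucketSketch bucket{max_speed, max_speed, max_speed};
    double total_sleep = 0.0;
    double previous_sleep = 0.0;
    for (int i = 0; i < chunks; ++i)
    {
        // Time since the previous call: one second of reading plus the previous sleep.
        const double sleep = bucket.throttle(natural_rate, 1.0 + previous_sleep);
        total_sleep += sleep;
        previous_sleep = sleep;
    }
    return total_sleep;
}
```

In this model, a natural rate of 0.9 MB/s against a 1 MB/s limit accumulates no sleep at all, while the same natural rate against a 200 KB/s limit sleeps on every chunk, which matches the direction of the fix described in the next paragraph.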

Fix in #106680. Lowers the throttle limit and dataset 5x (1 MB/s -> 200 KB/s, 1e6 -> 2e5 rows). Wall-clock stays ~8s, with a 4.5x safety margin against the worst observed CI natural rate. Local 10/10 iterations against S3 produced sleep_us 16.8-17.3s, comfortably above the 3.5s threshold.

pull Bot pushed a commit to sowelswl/ClickHouse that referenced this pull request Jun 7, 2026
The test asserts that ProfileEvents['UserThrottlerSleepMicroseconds'] is
greater than half of the target query duration. The throttler only sleeps
when the natural read rate exceeds the configured throttle limit; when the
natural rate drops close to the limit the token bucket never goes negative
and the throttler skips its sleep path (see `Throttler::throttle` in
`src/Common/Throttler.cpp`).

PR ClickHouse#103586 added `no-random-settings` to remove S3-prefetch settings as one
source of slow natural rate, but contention on the `Stateless tests
(amd_tsan, parallel, 2/2)` runner still occasionally drives the natural S3
read rate close to or below the previous 1 MB/s limit. Post-merge CIDB:
`pull_request_number != 0 AND check_start_time > 2026-06-05` shows 4
sightings (PRs ClickHouse#106222, ClickHouse#102039, ClickHouse#49966, ClickHouse#106184) all on amd_tsan parallel
2/2 with the `read 1 1 0` shape (duration ok, bytes ok, sleep below
threshold). All four PRs already include the `no-random-settings` tag, so
the remaining flakiness is contention-driven, not random-settings driven.

Drop the throttle limit and dataset 5x so the throttle is well below any
plausible natural rate (200 KB/s vs ~0.9 MB/s observed worst case = 4.5x
safety margin). Test wall-clock stays at ~8s. Local 10/10 runs against an
S3 disk produced sleep_us between 16.8s and 17.3s, comfortably above the
3.5s threshold.

Closes: ClickHouse#103422
Related: ClickHouse#103586
Related: ClickHouse#106184

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…ry source

The `ytsaurus` dictionary source is gated by the
`allow_experimental_ytsaurus_dictionary_source` setting: creating a
dictionary with `SOURCE(YTSAURUS(...))` throws `UNKNOWN_STORAGE` unless
that setting is enabled. The `system.dictionary_sources` description did
not mention this, so the row looked as readily usable as the other
remote sources and a user copying the syntax would get an exception.

Mention the experimental setting in the description, and pin that text in
the stateless test alongside the existing `http_proxy_urls` check.
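
The gating described in this commit message can be sketched as follows. `SettingsSketch` and `ensureYtsaurusSourceAllowed` are hypothetical, and a plain `std::runtime_error` stands in for the real exception carrying the `UNKNOWN_STORAGE` code; only the setting name and the error code name come from the text above.

```cpp
#include <stdexcept>

// Hypothetical settings holder; only the setting name is taken from the commit message.
struct SettingsSketch
{
    bool allow_experimental_ytsaurus_dictionary_source = false;
};

// Throws unless the experimental setting is enabled.
inline void ensureYtsaurusSourceAllowed(const SettingsSketch & settings)
{
    if (!settings.allow_experimental_ytsaurus_dictionary_source)
        throw std::runtime_error(
            "UNKNOWN_STORAGE: enable allow_experimental_ytsaurus_dictionary_source to use this source");
}
```

A description that names the setting lets a reader of `system.dictionary_sources` see this precondition before copying the `SOURCE(YTSAURUS(...))` clause.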

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 8, 2026

Contributor

LLVM Coverage Report

| Metric | Baseline | Current | Δ |
|---|---|---|---|
| Lines | 84.50% | 84.50% | +0.00% |
| Functions | 92.30% | 92.30% | +0.00% |
| Branches | 77.20% | 77.20% | +0.00% |

Changed C/C++ lines covered by tests: 113/116 (97.41%). Lost baseline coverage (was covered on master, now uncovered in this PR): 1 line(s).

@alexey-milovidov alexey-milovidov left a comment

Member Author

Ok.

@alexey-milovidov alexey-milovidov self-assigned this Jun 9, 2026
@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jun 9, 2026
Merged via the queue into master with commit 4de5e4a Jun 9, 2026
166 checks passed
@alexey-milovidov alexey-milovidov deleted the dictionary-source-documentation branch June 9, 2026 01:38
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 9, 2026

Labels

pr-improvement Pull request with some product improvements pr-synced-to-cloud The PR is synced to the cloud repo

3 participants