Lazy-load column statistics during query planning by zoomxi · Pull Request #104691 · ClickHouse/ClickHouse · GitHub
Skip to content

Lazy-load column statistics during query planning#104691

Open
zoomxi wants to merge 16 commits into
ClickHouse:masterfrom
zoomxi:statistic-lazy
Open

Lazy-load column statistics during query planning#104691
zoomxi wants to merge 16 commits into
ClickHouse:masterfrom
zoomxi:statistic-lazy

Conversation

@zoomxi

@zoomxi zoomxi commented May 12, 2026

Copy link
Copy Markdown
Contributor

When the query planner needs column statistics — for join reordering, for prewhere selectivity estimation, or for part pruning — it currently loads statistics for every column of the table from disk on the first access, even when the query only filters or joins on a handful of columns. This PR reduces statistics-file I/O during query planning on wide tables.

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Column statistics are now loaded on demand for only the columns the query planner needs instead of for every column of the table on first access.

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label May 12, 2026
@clickhouse-gh

clickhouse-gh Bot commented May 12, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-improvement Pull request with some product improvements label May 12, 2026
Comment thread src/Processors/QueryPlan/Optimizations/optimizeJoin.cpp Outdated
Comment thread src/Processors/QueryPlan/Optimizations/optimizeJoin.cpp
Comment thread src/Storages/MergeTree/IMergeTreeDataPart.cpp
Comment thread src/Storages/Statistics/StatisticsPartPruner.h Outdated
@alexey-milovidov

Copy link
Copy Markdown
Member

This was fixed by #105146. Let's update the branch.

@rschu1ze

Copy link
Copy Markdown
Member

@hanfei1991 Something interesting to review when you are back.

alexey-milovidov and others added 3 commits June 16, 2026 23:08
# Conflicts:
#	src/Processors/QueryPlan/Optimizations/optimizeJoin.cpp
#	src/Storages/MergeTree/IMergeTreeDataPart.cpp
The merge silently combined the PR's `getUsedColumns` (which remapped `.null`
virtual keys to their parent column via `virtual_key_to_parent`) with master's
member list, which no longer declares `virtual_key_to_parent` or
`null_subcolumns_to_normalize`. Master reverted `StatisticsPartPruner` to a
MinMax-only pruner and dropped the `NullCount` / `.null`-subcolumn machinery the
PR's change depended on, so the combined header failed to compile with
`use of undeclared identifier 'virtual_key_to_parent'`.

Take master's `getUsedColumns`, which returns `used_column_names` populated by
master's `StatisticsPartPruner.cpp`. The lazy per-column loading the PR relies on
is preserved: part pruning still loads statistics only for the filter columns via
`getEstimates(getUsedColumns())`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`IMergeTreeDataPart::getEstimates` only recorded a column in
`estimates_attempted_columns` (the negative cache) when the column had no
statistics declared in metadata. A column that *did* declare statistics but
whose statistics file was unreadable was neither cached in `estimates` nor
marked as attempted, so every query re-read, re-failed, and re-logged the same
deterministic miss for that part. This defeats the lazy-loading goal of the PR
and is the issue raised by the AI Review verdict and the `clickhouse-gh`
review thread.

The miss is deterministic by the time the caching block runs: a column absent
from `estimates` at that point either has no statistics file, or a file that
fails to deserialize (corrupted, unsupported version, or a stale stored type
left by an in-progress `MODIFY COLUMN`, for which `ColumnStatistics::deserialize`
now returns a null pointer after merging master). Transient errors while opening
the statistics file propagate out of `loadStatistics` and never reach this point,
so they remain retryable, and a fresh part object (after reload or mutation)
starts with an empty cache. Record every such miss in the negative cache.

Add `04341_stats_lazy_deterministic_miss`: it corrupts the statistics file of a
metadata-declared column (preserving its size so the on-load size check still
passes), then runs the same statistics-using query twice and asserts the first
query logs a `Cannot load statistics for column` warning while the second does
not, proving the column is probed at most once per part object.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/Storages/MergeTree/MergeTreeDataSelectExecutor.cpp Outdated
Comment thread src/Processors/QueryPlan/Optimizations/optimizeJoin.cpp
alexey-milovidov and others added 4 commits June 18, 2026 18:44
After merging master, `makeDistributed.cpp` still forward-declared the
two-argument `estimateReadRowsCount`, while the definition in `optimizeJoin.cpp`
had gained a third `estimator` argument. The mismatched declaration referenced a
symbol that is never defined, so linking `clickhouse` failed in the Fast test
with `undefined symbol:
DB::QueryPlanOptimizations::estimateReadRowsCount(DB::QueryPlan::Node&, DB::ActionsDAG::Node const*)`.

Update the forward declaration to the three-argument signature; the callers here
pass no estimator and keep relying on the index-analysis fallback.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104691&sha=e4e143995a3ca517c0665e983af01612733199f7&name_0=PR
PR: ClickHouse#104691

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`filterPartsByStatistics` called `getEstimates(getUsedColumns())`, but
`StatisticsPartPruner::getUsedColumns` is populated only once `checkPartCanMatch`
builds a key condition, which first happens inside the per-part loop. Before the
loop it returns an empty list, and `IMergeTreeDataPart::getEstimates` treats an
empty list as "load every column's statistics" -- so statistics-based part
pruning deserialized the statistics of all columns instead of only the filter
columns, defeating the lazy-loading goal of this pull request.

Add `StatisticsPartPruner::getCandidateColumns`, which returns the filter columns
that carry `MinMax`/`Basic` statistics (known right after construction), and pass
it to `getEstimates`. `getUsedColumns` is kept for the post-loop query-plan
reporting, where it reflects the columns the key conditions actually used.

Covered by `04209_stats_lazy_columns` (case `04209_part_lazy`).

PR: ClickHouse#104691

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`estimateReadRowsCount` recurses through a generic row-preserving
`ITransformingStep` (e.g. a window function), but `collectStatsColumnsForRelation`
-- which decides which columns the estimator should be built for -- did not. For
a relation subtree like `WindowStep -> FilterStep -> ReadFromMergeTree` the
collector returned no `ReadFromMergeTree`, so no estimator was built and the
relation lost its statistics-backed row estimate, even though the fold still
treated the step as a pass-through.

Handle the generic row-preserving transform in `collectStatsColumnsForRelation`
the same way the fold does, and forward the `estimator` in the fold's recursion
so the estimate is actually computed for such subtrees.

Add `04342_stats_lazy_window_relation`: a window function above a filtered
`ReadFromMergeTree` in a join relation must load the same statistics columns as
the equivalent relation without the window.

PR: ClickHouse#104691

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per-column statistics are now stored inside a single `statistics.packed` file
rather than legacy per-column `statistics_<col>.stats` files, and
`materialize_statistics_on_insert` is a query setting, not a table setting. The
test corrupted a non-existent `statistics_b.stats` and never materialized
statistics, so its premise no longer held.

Materialize statistics on insert and zero out only column `b`'s blob inside
`statistics.packed` (parsing the packed index so the container still loads and
column `a` stays intact), which makes deserializing `b` deterministically fail.
Stream the resulting warning away from the client with `--send_logs_level=error`
so the test asserts it via `system.text_log` only.

PR: ClickHouse#104691

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member

Pushed e4e143995a3..d6016fbd7f8 addressing the red Fast test build and the AI review's "Request changes" actions:

  • Build (undefined symbol) — 0d3104f9ec5. After the master merge makeDistributed.cpp still forward-declared the 2-arg estimateReadRowsCount, but the definition gained a 3rd estimator argument, so the Fast test link failed with undefined symbol: ...estimateReadRowsCount(DB::QueryPlan::Node&, DB::ActionsDAG::Node const*). Updated the forward declaration to the 3-arg signature.
  • Selective column discovery for part pruning — e2609c3872e. getUsedColumns() is empty before the per-part loop (it is only filled once checkPartCanMatch builds a key condition), so getEstimates({}) was loading every column's statistics. Added StatisticsPartPruner::getCandidateColumns() (filter columns carrying MinMax/Basic statistics, known at construction) and pass it to getEstimates. Covered by 04209_stats_lazy_columns.
  • Statistics through row-preserving transforms — a4015918658. collectStatsColumnsForRelation now treats a generic row-preserving ITransformingStep the same as the estimateReadRowsCount fold (which also now forwards the estimator), so a WindowStep -> FilterStep -> ReadFromMergeTree relation keeps its statistics-backed estimate. New regression test 04342_stats_lazy_window_relation.
  • 04341 test — d6016fbd7f8. Repaired the deterministic-miss test for the current packed statistics format (statistics.packed, parsing the packed index to corrupt only column b's blob).

Locally (a build with rust/mongo/azure disabled) 04209_stats_lazy_columns, 04209_statistics_retry_load, 04341_stats_lazy_deterministic_miss, and 04342_stats_lazy_window_relation all pass across repeated runs.

Comment thread src/Processors/QueryPlan/Optimizations/makeDistributed.cpp
After the lazy-statistics change, `estimateReadRowsCount` no longer builds
a `ConditionSelectivityEstimator` by itself; the caller must build one and
pass it. The distributed-plan strategy choices in `makeDistributed.cpp`
(`tryMakeDistributedJoin`, `tryMakeDistributedAggregation`) were only updated
to match the new signature, so they passed no estimator and silently took the
no-statistics fallback. For a filtered relation that fallback returns no row
estimate at all, so the broadcast/shuffle decision degraded (a filtered right
side that should broadcast was shuffled; a filtered aggregation input that
should keep partial aggregation was shuffled).

Build the estimator with the shared `buildEstimatorForRelation` (the same
helper `optimizeJoinLegacy` uses) for the filter/prewhere columns on the path
to the read step, and pass it. This restores statistics-backed row estimation
while still loading statistics for only those columns, matching the goal of
this PR. `buildEstimatorForRelation` is made non-`static` and forward-declared
in `makeDistributed.cpp`.

Per-key NDV is intentionally not loaded for these callers: in the current
distributed plan structure it cannot change the strategy (the read is wrapped
in a `GatherExchange` before the parent is visited in post-order, which the
estimate walk does not traverse, and when the read is not wrapped the per-key
NDV never exceeds the row count, which is already bounded by the same
threshold), so loading it would be exactly the wasted I/O this PR removes.

Add `04343_distributed_plan_lazy_stats_strategy`, which proves the distributed
aggregation strategy still uses column statistics (partial aggregation with
statistics vs shuffle without) and that statistics are read for only the filter
column, not for every column of the table.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member

Pushed d6016fbd7f8..afb2194e91e to address the AI review's "Request changes" on makeDistributed.cpp (the only unresolved thread).

What changed (afb2194e91e)

  • tryMakeDistributedJoin and tryMakeDistributedAggregation now build a ConditionSelectivityEstimator (via the shared buildEstimatorForRelation, made non-static) for the filter/prewhere columns on the path to the read and pass it to estimateReadRowsCount. After the lazy change, estimateReadRowsCount stopped building its own estimator, so these callers silently took the no-statistics fallback — for a filtered relation that fallback returns no row estimate at all, so a filtered right side that should broadcast was shuffled, and a filtered aggregation input that should keep partial aggregation was shuffled. Only the filter/prewhere columns are loaded, preserving the lazy-loading goal.
  • New test 04343_distributed_plan_lazy_stats_strategy: distributed aggregation keeps partial aggregation with statistics and falls back to shuffle without, and statistics are read for only the filter column (not every column). I verified it locally against a server built with these changes; 04209_stats_lazy_columns, 04209_statistics_retry_load, 04342_stats_lazy_window_relation still pass.

Note on per-key NDV (the AI also asked for this): I intentionally did not load the GROUP BY keys' NDV. In the current distributed plan structure it cannot influence the strategy — the passes run post-order, so the read is wrapped in a GatherExchange before the aggregation is visited, and the estimate walk does not traverse GatherExchange (same on master); when the read is not wrapped, distributed_plan_max_rows_to_broadcast already bounds the row count, which per-key NDV never exceeds. Loading the keys' statistics would be exactly the wasted I/O this PR removes. Making it effective would mean teaching the estimate walk to traverse the distributed exchange steps — a separate optimization that never worked on master, best as a follow-up. Details in the resolved thread.

CI: the two red checks (AST fuzzer (amd_debug, targeted)Block structure mismatch in A stream in UnionStep; Stress test (arm_debug)Inconsistent AST formatting) are unrelated master-wide AST-fuzzer flakes (88 and 44 distinct PRs respectively in the last 30 days, both seen on master today), not in any file this PR touches.

Master integration: the branch is ~1279 commits behind. I did not re-merge: the statistics subsystem evolved meaningfully on master (e.g. countmin estimate fixes, the lazy-column-replication revisit) and that integration is best done by you, who owns the domain — a blind text-clean merge into the very subsystem this PR rewrites is hard to validate without a full rebuild. The redness above is unrelated flakes, so merging would not clear it.

@clickhouse-gh

clickhouse-gh Bot commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.30% 85.30% +0.00%
Functions 92.60% 92.50% -0.10%
Branches 77.50% 77.50% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 267/274 (97.45%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 4 line(s) · Uncovered code

Full report · Diff report

@rschu1ze rschu1ze left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All statistics for the columns in a part are stored in the same file, i.e. we wouldn't save on I/O anyways. Therefore, is it worth to optimize the statistics loading further?

/// next query.
mutable std::mutex estimates_mutex;
mutable std::optional<Estimates> estimates TSA_GUARDED_BY(estimates_mutex);
mutable Estimates estimates TSA_GUARDED_BY(estimates_mutex);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we change Estimates from being a map<String, Estimate>to being amap<String, std::optional>, then we can get rid of extra members estimates_attempted_columnsandestimates_fully_loaded`.

ColumnsStatistics loadStatistics(const Names & required_columns) const;
Estimates getEstimates() const;
/// Returns estimates for the requested columns, loading missing ones and caching them.
/// Empty `required_columns` means "load and cache every column with statistics".

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

L. 182: That's odd semantics, empty should mean "don't load statistics".


ColumnsStatistics IMergeTreeDataPart::loadStatistics(const Names & required_columns) const
{
fiu_do_on(FailPoints::merge_tree_load_statistics_throw,

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont' see any test which exercises this failpoint.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Explicitly enumerating step types we can encounter looks brittle.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-improvement Pull request with some product improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants