Lazy-load column statistics during query planning#104691
Conversation
…o statistic-lazy
|
This was fixed by #105146. Let's update the branch. |
|
@hanfei1991 Something interesting to review when you are back. |
# Conflicts: # src/Processors/QueryPlan/Optimizations/optimizeJoin.cpp # src/Storages/MergeTree/IMergeTreeDataPart.cpp
The merge silently combined the PR's `getUsedColumns` (which remapped `.null` virtual keys to their parent column via `virtual_key_to_parent`) with master's member list, which no longer declares `virtual_key_to_parent` or `null_subcolumns_to_normalize`. Master reverted `StatisticsPartPruner` to a MinMax-only pruner and dropped the `NullCount` / `.null`-subcolumn machinery the PR's change depended on, so the combined header failed to compile with `use of undeclared identifier 'virtual_key_to_parent'`. Take master's `getUsedColumns`, which returns `used_column_names` populated by master's `StatisticsPartPruner.cpp`. The lazy per-column loading the PR relies on is preserved: part pruning still loads statistics only for the filter columns via `getEstimates(getUsedColumns())`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`IMergeTreeDataPart::getEstimates` only recorded a column in `estimates_attempted_columns` (the negative cache) when the column had no statistics declared in metadata. A column that *did* declare statistics but whose statistics file was unreadable was neither cached in `estimates` nor marked as attempted, so every query re-read, re-failed, and re-logged the same deterministic miss for that part. This defeats the lazy-loading goal of the PR and is the issue raised by the AI Review verdict and the `clickhouse-gh` review thread. The miss is deterministic by the time the caching block runs: a column absent from `estimates` at that point either has no statistics file, or a file that fails to deserialize (corrupted, unsupported version, or a stale stored type left by an in-progress `MODIFY COLUMN`, for which `ColumnStatistics::deserialize` now returns a null pointer after merging master). Transient errors while opening the statistics file propagate out of `loadStatistics` and never reach this point, so they remain retryable, and a fresh part object (after reload or mutation) starts with an empty cache. Record every such miss in the negative cache. Add `04341_stats_lazy_deterministic_miss`: it corrupts the statistics file of a metadata-declared column (preserving its size so the on-load size check still passes), then runs the same statistics-using query twice and asserts the first query logs a `Cannot load statistics for column` warning while the second does not, proving the column is probed at most once per part object. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After merging master, `makeDistributed.cpp` still forward-declared the two-argument `estimateReadRowsCount`, while the definition in `optimizeJoin.cpp` had gained a third `estimator` argument. The mismatched declaration referenced a symbol that is never defined, so linking `clickhouse` failed in the Fast test with `undefined symbol: DB::QueryPlanOptimizations::estimateReadRowsCount(DB::QueryPlan::Node&, DB::ActionsDAG::Node const*)`. Update the forward declaration to the three-argument signature; the callers here pass no estimator and keep relying on the index-analysis fallback. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104691&sha=e4e143995a3ca517c0665e983af01612733199f7&name_0=PR PR: ClickHouse#104691 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`filterPartsByStatistics` called `getEstimates(getUsedColumns())`, but `StatisticsPartPruner::getUsedColumns` is populated only once `checkPartCanMatch` builds a key condition, which first happens inside the per-part loop. Before the loop it returns an empty list, and `IMergeTreeDataPart::getEstimates` treats an empty list as "load every column's statistics" -- so statistics-based part pruning deserialized the statistics of all columns instead of only the filter columns, defeating the lazy-loading goal of this pull request. Add `StatisticsPartPruner::getCandidateColumns`, which returns the filter columns that carry `MinMax`/`Basic` statistics (known right after construction), and pass it to `getEstimates`. `getUsedColumns` is kept for the post-loop query-plan reporting, where it reflects the columns the key conditions actually used. Covered by `04209_stats_lazy_columns` (case `04209_part_lazy`). PR: ClickHouse#104691 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`estimateReadRowsCount` recurses through a generic row-preserving `ITransformingStep` (e.g. a window function), but `collectStatsColumnsForRelation` -- which decides which columns the estimator should be built for -- did not. For a relation subtree like `WindowStep -> FilterStep -> ReadFromMergeTree` the collector returned no `ReadFromMergeTree`, so no estimator was built and the relation lost its statistics-backed row estimate, even though the fold still treated the step as a pass-through. Handle the generic row-preserving transform in `collectStatsColumnsForRelation` the same way the fold does, and forward the `estimator` in the fold's recursion so the estimate is actually computed for such subtrees. Add `04342_stats_lazy_window_relation`: a window function above a filtered `ReadFromMergeTree` in a join relation must load the same statistics columns as the equivalent relation without the window. PR: ClickHouse#104691 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Per-column statistics are now stored inside a single `statistics.packed` file rather than legacy per-column `statistics_<col>.stats` files, and `materialize_statistics_on_insert` is a query setting, not a table setting. The test corrupted a non-existent `statistics_b.stats` and never materialized statistics, so its premise no longer held. Materialize statistics on insert and zero out only column `b`'s blob inside `statistics.packed` (parsing the packed index so the container still loads and column `a` stays intact), which makes deserializing `b` deterministically fail. Stream the resulting warning away from the client with `--send_logs_level=error` so the test asserts it via `system.text_log` only. PR: ClickHouse#104691 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Pushed
Locally (a build with |
After the lazy-statistics change, `estimateReadRowsCount` no longer builds a `ConditionSelectivityEstimator` by itself; the caller must build one and pass it. The distributed-plan strategy choices in `makeDistributed.cpp` (`tryMakeDistributedJoin`, `tryMakeDistributedAggregation`) were only updated to match the new signature, so they passed no estimator and silently took the no-statistics fallback. For a filtered relation that fallback returns no row estimate at all, so the broadcast/shuffle decision degraded (a filtered right side that should broadcast was shuffled; a filtered aggregation input that should keep partial aggregation was shuffled). Build the estimator with the shared `buildEstimatorForRelation` (the same helper `optimizeJoinLegacy` uses) for the filter/prewhere columns on the path to the read step, and pass it. This restores statistics-backed row estimation while still loading statistics for only those columns, matching the goal of this PR. `buildEstimatorForRelation` is made non-`static` and forward-declared in `makeDistributed.cpp`. Per-key NDV is intentionally not loaded for these callers: in the current distributed plan structure it cannot change the strategy (the read is wrapped in a `GatherExchange` before the parent is visited in post-order, which the estimate walk does not traverse, and when the read is not wrapped the per-key NDV never exceeds the row count, which is already bounded by the same threshold), so loading it would be exactly the wasted I/O this PR removes. Add `04343_distributed_plan_lazy_stats_strategy`, which proves the distributed aggregation strategy still uses column statistics (partial aggregation with statistics vs shuffle without) and that statistics are read for only the filter column, not for every column of the table. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Pushed What changed (
Note on per-key NDV (the AI also asked for this): I intentionally did not load the GROUP BY keys' NDV. In the current distributed plan structure it cannot influence the strategy — the passes run post-order, so the read is wrapped in a CI: the two red checks ( Master integration: the branch is ~1279 commits behind. I did not re-merge: the statistics subsystem evolved meaningfully on master (e.g. countmin estimate fixes, the lazy-column-replication revisit) and that integration is best done by you, who owns the domain — a blind text-clean merge into the very subsystem this PR rewrites is hard to validate without a full rebuild. The redness above is unrelated flakes, so merging would not clear it. |
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered by tests: 267/274 (97.45%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 4 line(s) · Uncovered code |
rschu1ze
left a comment
There was a problem hiding this comment.
All statistics for the columns in a part are stored in the same file, i.e. we wouldn't save on I/O anyways. Therefore, is it worth to optimize the statistics loading further?
| /// next query. | ||
| mutable std::mutex estimates_mutex; | ||
| mutable std::optional<Estimates> estimates TSA_GUARDED_BY(estimates_mutex); | ||
| mutable Estimates estimates TSA_GUARDED_BY(estimates_mutex); |
There was a problem hiding this comment.
I think if we change Estimates from being a map<String, Estimate>to being amap<String, std::optional>, then we can get rid of extra members estimates_attempted_columnsandestimates_fully_loaded`.
| ColumnsStatistics loadStatistics(const Names & required_columns) const; | ||
| Estimates getEstimates() const; | ||
| /// Returns estimates for the requested columns, loading missing ones and caching them. | ||
| /// Empty `required_columns` means "load and cache every column with statistics". |
There was a problem hiding this comment.
L. 182: That's odd semantics, empty should mean "don't load statistics".
|
|
||
| ColumnsStatistics IMergeTreeDataPart::loadStatistics(const Names & required_columns) const | ||
| { | ||
| fiu_do_on(FailPoints::merge_tree_load_statistics_throw, |
There was a problem hiding this comment.
I dont' see any test which exercises this failpoint.
There was a problem hiding this comment.
Explicitly enumerating step types we can encounter looks brittle.

When the query planner needs column statistics — for join reordering, for prewhere selectivity estimation, or for part pruning — it currently loads statistics for every column of the table from disk on the first access, even when the query only filters or joins on a handful of columns. This PR reduces statistics-file I/O during query planning on wide tables.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Column statistics are now loaded on demand for only the columns the query planner needs instead of for every column of the table on first access.