Enable count_distinct_optimization by default#104472
Conversation
`count_distinct_optimization` rewrites `count(DISTINCT col)` queries into a subquery with `GROUP BY col`, which avoids materializing the large `uniq` aggregate state and is generally faster for high-cardinality columns. The pass has been in the codebase for a long time but disabled by default; flip it on. In `CountDistinctPass`, restrict the rewrite to local tables — `count(DISTINCT)` over `Distributed`/remote storages goes through different distributed-aggregation logic that should not be subverted by the local rewrite. The previous workaround (gating on `optimize_distributed_group_by_sharding_key`) is dropped in favor of a direct `IStorage::isRemote` check via `TableNode`. Originally part of WIP PR #81944.
|
If this optimization is not always beneficial or is not always correct, we should have tests for that. |
…le functions
The rewrite was previously skipped only when the join tree was a `TableNode`
backed by remote storage. With the analyzer, `FROM remote(...)` and similar
remote table functions are represented as `TableFunctionNode`, so the check was
bypassed and the rewrite was still applied to remote sources. That contradicts
the PR goal ("local tables only") and could reintroduce the distributed
behavior the previous `optimize_distributed_group_by_sharding_key` workaround
was guarding against.
Now both `TableNode` and `TableFunctionNode` are gated on `IStorage::isRemote`.
A new test verifies the gating and the result correctness for `MergeTree`,
`Distributed`, and `remote(...)`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The test uses `EXPLAIN QUERY TREE`, which is only supported with the analyzer, so it failed in the `old analyzer` CI job with: Code: 48. DB::Exception: EXPLAIN QUERY TREE is only supported with the analyzer. SET enable_analyzer = 1.. (NOT_IMPLEMENTED) Force the new analyzer at the top of the test. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104472&sha=ce3fd85107a9019aa8073c32b3e4bc333acecf5a&name_0=PR&name_1=Stateless%20tests%20%28amd_llvm_coverage%2C%20old%20analyzer%2C%20s3%20storage%2C%20DatabaseReplicated%2C%20WasmEdge%2C%20parallel%29 PR: #104472 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@groeneai, provide a fix for |
|
@alexey-milovidov — the CIDB cross-check (last 30 days,
The in-flight fix is PR #103864 by @yakov-olkhovskiy — "Fix Suggested path forward:
(Not opening a competing PR — #103864 already has the correct fix and the regression test.) |
It's already detected by the performance tests. I's a duplicate from #88281. I have some ideas to test and improve, but I don't know if I'll end up improving more the countDistinct aggregate function or the group by case, so applying this for the sake of having it doesn't make sense as of right now |
|
The optimization appears to be almost pointless after other optimizations for parallel merge of |
The CI perf comparison for #104472 showed that the rewrite was a net loss for fixed-size numeric columns: +72.4% SELECT uniqExact(a) FROM t_50000 +26.1% SELECT uniqExact(number) FROM numbers_mt(5e7) +23.9% SELECT uniqExact(a) FROM t_10000 +13.2% SELECT count(distinct a) FROM t_10000 (arm) while still giving a clear win on the only `count(DISTINCT)` over a String column in the suite: -10.4% SELECT COUNT(DISTINCT SearchPhrase) FROM hits_100m_single `uniqExact` has highly tuned open-addressing hash tables for fixed-size numeric values, so the GROUP BY rewrite only adds an extra block materialization stage between the inner aggregation and the outer `count()` without saving on hashing. For `String`, `FixedString`, `Array`, etc. the specialized `GROUP BY` hash tables beat the generic ones used inside `uniqExact`, so the rewrite still wins. This gates the rewrite on `IDataType::isValueRepresentedByNumber` — numbers, `Date`, `DateTime`, `Decimal`, `Enum`, `IPv4`/`IPv6`, and `LowCardinality` over any of these are skipped, while `String`, `FixedString`, `UUID`, `LowCardinality(String)`, `Array`, etc. are still rewritten. CI report: #104472 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…count-distinct-default-on
|
Now it does not help at all. |
…numeric gate `DataTypeNullable` does not override `isValueRepresentedByNumber` and returns `false` by default, and `DataTypeLowCardinality` delegates to its nested dictionary type. So without unwrapping `Nullable` and `LowCardinality(Nullable(...))` first, `Nullable(UInt64)` and `LowCardinality(Nullable(UInt32))` would slip through the numeric gate in `CountDistinctPass` and still take the `GROUP BY` rewrite path, even though they are exactly the cases the gate is meant to skip. Apply `removeNullableOrLowCardinalityNullable` before checking `isValueRepresentedByNumber`, and extend `04218_count_distinct_optimization_numeric` with focused coverage for `Nullable(UInt64)` and `LowCardinality(Nullable(UInt32))`. CI report: #104472 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r path `InterpreterSelectQuery` runs `RewriteCountDistinctFunctionVisitor` when `count_distinct_optimization` is enabled. Until now the visitor did not check `IStorage::isRemote` or the column type, so with `enable_analyzer = 0` the rewrite still applied to `Distributed`/remote-backed tables and to fixed-size numeric columns - violating the same contract the new-analyzer `CountDistinctPass` was tightened to satisfy. Pass the query `Context` to the matcher, resolve the storage from `database_and_table_name`, and apply the same gates as `CountDistinctPass`: skip remote storages and skip columns whose type, after stripping `Nullable` or `LowCardinality(Nullable(...))` wrappers, satisfies `isValueRepresentedByNumber`. If the storage cannot be resolved at this point, skip the rewrite (fail-closed) rather than apply a potentially incorrect transformation. Add `04235_count_distinct_optimization_remote_old_analyzer.sql` to verify that the old-analyzer rewrite still produces correct results across the `String`/`Distributed`/`remote(...)`/`UInt64`/`Nullable(UInt64)`/ `LowCardinality(Nullable(UInt32))` cases gated by the new logic. CI report: #104472 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@groeneai, the Unit tests (tsan, function_prop_fuzzer) job is failing on The fuzzer reports that for the same inputs,
Please investigate and provide a fix in a separate PR. If a fix is already in progress, please link it here. |
| ``` | ||
| )", 0) \ | ||
| DECLARE(Bool, count_distinct_optimization, false, R"( | ||
| DECLARE(Bool, count_distinct_optimization, true, R"( |
There was a problem hiding this comment.
This change flips count_distinct_optimization to default-on globally, and the PR is categorized as Performance Improvement, so we need current-head performance evidence before merge.
For the current head commit (efc66bfe), the CI report still shows all Performance Comparison (*, master_head, */6) jobs as pending, so there is no completed before/after proof yet.
Please attach completed perf-comparison results (or an equivalent focused benchmark) that demonstrate the net effect after the remote/numeric gates.
There was a problem hiding this comment.
The perf comparison has now completed on the current head. Filtering client_time results for DISTINCT/count(DISTINCT) queries (see fetch_perf_report.py):
Faster (4) — all are direct count(DISTINCT) workloads:
| Δ | Before | After | Query |
|---|---|---|---|
| -27.7% | 114.5 ms | 82.8 ms | SELECT count(DISTINCT col_low_cardinality) FROM table_Native_with_nullable_columns |
| -13.2% | 478.8 ms | 416.0 ms | SELECT COUNT(DISTINCT SearchPhrase) FROM hits_100m_single (ClickBench Q5) |
| -10.6% | 161.8 ms | 144.6 ms | SELECT COUNT(DISTINCT Title) FROM test.hits SETTINGS max_threads = 64 |
| -10.3% | 68.3 ms | 61.3 ms | select distinctJSONPaths(json) from test_distinct_json_paths_compact_dynamic_… |
Slower (0) in the DISTINCT filter.
Over the full suite the totals are 20 faster | 11 slower | 135 unstable across 3778 queries. The 11 slower entries are unrelated queries (e.g. count() WHERE URL LIKE '%google%', quantilesExactLow(0.5)(d32), cos(cos(...)), mapPopulateSeries, etc.) — none of them exercise count(DISTINCT) and they look like measurement noise rather than a systematic regression.
So the targeted workloads improve by 10–28% and there is no regression on the count(DISTINCT) queries the rewrite actually affects.
There was a problem hiding this comment.
I re-checked the latest PR head commit (c686ab23a42b319eae63240e03fa3338f4e6a09a) in result_pr.json, and all Performance Comparison (*, master_head, */6) jobs are still PENDING there.
So for the current head we still do not have completed before/after evidence yet. Once those 12 perf shards finish, please attach the updated result (or a focused benchmark on this head), then this thread can be resolved.
…inctPass` The previous `isRemote` guard only handled direct `TableNode` and `TableFunctionNode` join-tree roots. A subquery wrapping a remote source (e.g. `SELECT count(DISTINCT s) FROM (SELECT s FROM remote(...))`) reached the rewrite via a `QueryNode` join tree and slipped past the check, violating the "local tables only" contract. Fail-close: skip the rewrite for any non-table join-tree root. Adds a new test covering the subquery-over-`remote` and subquery-over-local cases. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Performance comparison for the current head (
So after the remote/numeric gates, the net effect on |
|
The The server is killed by That line dereferences the result of @groeneai, please investigate the failure and provide a fix in a separate PR. If a fix is already in progress, link it here. |
| auto metadata_snapshot = storage->getInMemoryMetadataPtr(data.context, false); | ||
| if (!metadata_snapshot) | ||
| return; | ||
| auto physical_column = metadata_snapshot->getColumns().tryGetPhysical(column_name); |
There was a problem hiding this comment.
ASTIdentifier::name preserves qualifiers, so for SELECT count(DISTINCT t.s) FROM data AS t this becomes t.s.
Then tryGetPhysical(column_name) looks up t.s in table metadata, misses (s is the physical column name), and the new fail-close branch returns early. As a result, with enable_analyzer = 0 the rewrite is silently disabled for qualified identifiers on local string-like columns.
That regresses the PR contract (count_distinct_optimization default-on with old/new analyzer parity) for a common query shape.
Can we normalize the identifier to a storage column name before metadata lookup (strip table/db qualifiers but keep nested subcolumn paths), and keep using the original identifier text for the rewritten AST?
LLVM Coverage Report
Changed lines: 88.71% (55/62) · Uncovered code |
…-4165) A `BackgroundJobsAssignee` task on a pool thread can race with `~StorageMergeTree` / `~StorageReplicatedMergeTree` and dispatch `data.scheduleDataProcessingJob(*this)` through a vtable that has already demoted to `MergeTreeData`. `scheduleDataProcessingJob` is pure virtual at the `MergeTreeData` level (only the leaf storages override it), so the dispatch lands on `__cxa_pure_virtual` and aborts the server. This is reachable in practice when `shutdown` throws partway through the cleanup. `StorageReplicatedMergeTree::~StorageReplicatedMergeTree` wraps the `shutdown` call in a try/catch and swallows the exception, so member destruction proceeds with the background scheduler still alive. `~BackgroundJobsAssignee::finish()` then calls `deactivate()`, which blocks on `exec_mutex` waiting for the in-flight `threadFunc` to finish — but `threadFunc`'s next call is the pure-virtual dispatch above. The fix is a defensive `false`-returning override of `scheduleDataProcessingJob` on `MergeTreeData`. In normal operation, the derived overrides win the virtual dispatch and this default is dead code. During the destruction race, the default takes the dispatch: `threadFunc` reads `false` as "no work scheduled" and falls through to `postpone()`, which is a no-op once `finish()` has moved the holder out. The pool thread completes `execute()` normally, the destructor's `deactivate()` returns, and destruction proceeds safely. Affected runs (all identical stack traces, all unrelated PRs): - PR ClickHouse#102343 (Paimon) — Stress test (arm_asan_ubsan) - PR ClickHouse#101027 (obfuscateQuery revert) — Stress test (arm_asan_ubsan, s3) - PR ClickHouse#100272 (h3 → h3o) — Stress test (arm_asan_ubsan, s3) - PR ClickHouse#104472 (count_distinct_optimization) — Stress test (arm_debug) CI report (latest): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104472&sha=ad31d65c8cb512098dc41a848bfa9e42650e123e&name_0=PR&name_1=Stress%20test%20%28arm_debug%29 Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>


Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Enable
count_distinct_optimizationby default. The optimization rewritescount(DISTINCT col)into a subquery withGROUP BY col, which avoids materializing theuniqaggregate state and is generally faster for high-cardinality columns. The rewrite is now restricted to local tables only.Documentation entry for user-facing changes
Why
count_distinct_optimizationhas been in the codebase for a long time but has been off by default. It generally beats theuniqExactpath for high-cardinality columns because the group-by stage avoids the large hash-set state thatuniqaggregate functions accumulate.What changed
count_distinct_optimizationfromfalsetotrue.CountDistinctPass, only apply the rewrite to local tables.count(DISTINCT)over aDistributed/remote storage already goes through distributed-aggregation logic that the local rewrite would interfere with. The previous workaround (gating onoptimize_distributed_group_by_sharding_key) is dropped in favor of a directIStorage::isRemotecheck viaTableNode.Origin
Split out from WIP PR #81944.