Enable `count_distinct_optimization` by default by alexey-milovidov · Pull Request #104472 · ClickHouse/ClickHouse · GitHub
Skip to content

Enable count_distinct_optimization by default#104472

Closed
alexey-milovidov wants to merge 15 commits into
masterfrom
count-distinct-default-on
Closed

Enable count_distinct_optimization by default#104472
alexey-milovidov wants to merge 15 commits into
masterfrom
count-distinct-default-on

Conversation

@alexey-milovidov

Copy link
Copy Markdown
Member

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Enable count_distinct_optimization by default. The optimization rewrites count(DISTINCT col) into a subquery with GROUP BY col, which avoids materializing the uniq aggregate state and is generally faster for high-cardinality columns. The rewrite is now restricted to local tables only.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Why

count_distinct_optimization has been in the codebase for a long time but has been off by default. It generally beats the uniqExact path for high-cardinality columns because the group-by stage avoids the large hash-set state that uniq aggregate functions accumulate.

What changed

  • Flip the default of count_distinct_optimization from false to true.
  • In CountDistinctPass, only apply the rewrite to local tables. count(DISTINCT) over a Distributed/remote storage already goes through distributed-aggregation logic that the local rewrite would interfere with. The previous workaround (gating on optimize_distributed_group_by_sharding_key) is dropped in favor of a direct IStorage::isRemote check via TableNode.

Origin

Split out from WIP PR #81944.

`count_distinct_optimization` rewrites `count(DISTINCT col)` queries
into a subquery with `GROUP BY col`, which avoids materializing the
large `uniq` aggregate state and is generally faster for high-cardinality
columns. The pass has been in the codebase for a long time but disabled
by default; flip it on.

In `CountDistinctPass`, restrict the rewrite to local tables —
`count(DISTINCT)` over `Distributed`/remote storages goes through
different distributed-aggregation logic that should not be subverted by
the local rewrite. The previous workaround (gating on
`optimize_distributed_group_by_sharding_key`) is dropped in favor of a
direct `IStorage::isRemote` check via `TableNode`.

Originally part of WIP PR #81944.
@clickhouse-gh

clickhouse-gh Bot commented May 9, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label May 9, 2026
Comment thread src/Analyzer/Passes/CountDistinctPass.cpp Outdated
@alexey-milovidov

Copy link
Copy Markdown
Member Author

If this optimization is not always beneficial or is not always correct, we should have tests for that.

alexey-milovidov and others added 3 commits May 10, 2026 00:20
…le functions

The rewrite was previously skipped only when the join tree was a `TableNode`
backed by remote storage. With the analyzer, `FROM remote(...)` and similar
remote table functions are represented as `TableFunctionNode`, so the check was
bypassed and the rewrite was still applied to remote sources. That contradicts
the PR goal ("local tables only") and could reintroduce the distributed
behavior the previous `optimize_distributed_group_by_sharding_key` workaround
was guarding against.

Now both `TableNode` and `TableFunctionNode` are gated on `IStorage::isRemote`.
A new test verifies the gating and the result correctness for `MergeTree`,
`Distributed`, and `remote(...)`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The test uses `EXPLAIN QUERY TREE`, which is only supported with
the analyzer, so it failed in the `old analyzer` CI job with:

  Code: 48. DB::Exception: EXPLAIN QUERY TREE is only supported with
  the analyzer. SET enable_analyzer = 1.. (NOT_IMPLEMENTED)

Force the new analyzer at the top of the test.

CI report:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104472&sha=ce3fd85107a9019aa8073c32b3e4bc333acecf5a&name_0=PR&name_1=Stateless%20tests%20%28amd_llvm_coverage%2C%20old%20analyzer%2C%20s3%20storage%2C%20DatabaseReplicated%2C%20WasmEdge%2C%20parallel%29
PR: #104472

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai, provide a fix for function_prop_fuzzer

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov — the Unit tests (msan, function_prop_fuzzer) failure on this PR is the chronic trunk MSan bug, not a regression from count_distinct_optimization:

SUMMARY: MemorySanitizer: use-of-uninitialized-value src/Common/Volnitsky.h:724:31 in putNGramBase

CIDB cross-check (last 30 days, function_prop_fuzzer failures with Volnitsky.h:724 / Volnitsky.h:475):

The in-flight fix is PR #103864 by @yakov-olkhovskiy"Fix MultiVolnitsky fallback for failed UTF-8 ngram inserts". It rolls back partial putNGram inserts in MultiVolnitskyBase so a failed UTF-8 ngram for one needle no longer leaks state into searches for other needles, and adds a regression case to 02364_multiSearch_function_family.sql (the test you asked for on 2026-05-01). Last commit: 2026-05-10, mergeable but blocked on review — @yakov-olkhovskiy pinged @vdimir for review on 2026-05-02 but no APPROVED/CHANGES_REQUESTED has been left.

Suggested path forward:

(Not opening a competing PR — #103864 already has the correct fix and the regression test.)

@Algunenano

Copy link
Copy Markdown
Member

If this optimization is not always beneficial or is not always correct, we should have tests for that.

It's already detected by the performance tests.

I's a duplicate from #88281. I have some ideas to test and improve, but I don't know if I'll end up improving more the countDistinct aggregate function or the group by case, so applying this for the sake of having it doesn't make sense as of right now

@alexey-milovidov

Copy link
Copy Markdown
Member Author

@alexey-milovidov

Copy link
Copy Markdown
Member Author

The optimization appears to be almost pointless after other optimizations for parallel merge of uniqExact.

alexey-milovidov and others added 3 commits May 11, 2026 17:56
The CI perf comparison for #104472 showed that the rewrite was a net loss
for fixed-size numeric columns:

  +72.4%   SELECT uniqExact(a) FROM t_50000
  +26.1%   SELECT uniqExact(number) FROM numbers_mt(5e7)
  +23.9%   SELECT uniqExact(a) FROM t_10000
  +13.2%   SELECT count(distinct a) FROM t_10000 (arm)

while still giving a clear win on the only `count(DISTINCT)` over a String
column in the suite:

  -10.4%   SELECT COUNT(DISTINCT SearchPhrase) FROM hits_100m_single

`uniqExact` has highly tuned open-addressing hash tables for fixed-size
numeric values, so the GROUP BY rewrite only adds an extra block
materialization stage between the inner aggregation and the outer
`count()` without saving on hashing. For `String`, `FixedString`,
`Array`, etc. the specialized `GROUP BY` hash tables beat the generic
ones used inside `uniqExact`, so the rewrite still wins.

This gates the rewrite on `IDataType::isValueRepresentedByNumber` —
numbers, `Date`, `DateTime`, `Decimal`, `Enum`, `IPv4`/`IPv6`, and
`LowCardinality` over any of these are skipped, while `String`,
`FixedString`, `UUID`, `LowCardinality(String)`, `Array`, etc. are still
rewritten.

CI report: #104472

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Now it does not help at all.

Comment thread src/Core/Settings.cpp
Comment thread src/Analyzer/Passes/CountDistinctPass.cpp Outdated
alexey-milovidov and others added 2 commits May 13, 2026 15:46
…numeric gate

`DataTypeNullable` does not override `isValueRepresentedByNumber` and
returns `false` by default, and `DataTypeLowCardinality` delegates to its
nested dictionary type. So without unwrapping `Nullable` and
`LowCardinality(Nullable(...))` first, `Nullable(UInt64)` and
`LowCardinality(Nullable(UInt32))` would slip through the numeric gate
in `CountDistinctPass` and still take the `GROUP BY` rewrite path, even
though they are exactly the cases the gate is meant to skip.

Apply `removeNullableOrLowCardinalityNullable` before checking
`isValueRepresentedByNumber`, and extend `04218_count_distinct_optimization_numeric`
with focused coverage for `Nullable(UInt64)` and `LowCardinality(Nullable(UInt32))`.

CI report: #104472

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…r path

`InterpreterSelectQuery` runs `RewriteCountDistinctFunctionVisitor` when
`count_distinct_optimization` is enabled. Until now the visitor did not
check `IStorage::isRemote` or the column type, so with `enable_analyzer = 0`
the rewrite still applied to `Distributed`/remote-backed tables and to
fixed-size numeric columns - violating the same contract the new-analyzer
`CountDistinctPass` was tightened to satisfy.

Pass the query `Context` to the matcher, resolve the storage from
`database_and_table_name`, and apply the same gates as `CountDistinctPass`:
skip remote storages and skip columns whose type, after stripping `Nullable`
or `LowCardinality(Nullable(...))` wrappers, satisfies
`isValueRepresentedByNumber`. If the storage cannot be resolved at this
point, skip the rewrite (fail-closed) rather than apply a potentially
incorrect transformation.

Add `04235_count_distinct_optimization_remote_old_analyzer.sql` to verify
that the old-analyzer rewrite still produces correct results across the
`String`/`Distributed`/`remote(...)`/`UInt64`/`Nullable(UInt64)`/
`LowCardinality(Nullable(UInt32))` cases gated by the new logic.

CI report: #104472

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai, the Unit tests (tsan, function_prop_fuzzer) job is failing on FunctionsStress.stress with a determinism violation in hilbertEncode, which is unrelated to this PR (it enables count_distinct_optimization by default).

The fuzzer reports that for the same inputs, hilbertEncode returns different results when executed on one row vs as part of a multi-row batch:

  • Single-row: SELECT hilbertEncode(materialize(CAST((7) AS Tuple(UInt64))), materialize(CAST(3 AS UInt16))) returns 384.
  • Multi-row at row 23 (same inputs a=(7), b=3): returns 3.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104472&sha=791e269903c45a9fdd0d3a3bb5db399854edc7f0&name_0=PR&name_1=Unit%20tests%20%28tsan%2C%20function_prop_fuzzer%29

Please investigate and provide a fix in a separate PR. If a fix is already in progress, please link it here.

Comment thread src/Analyzer/Passes/CountDistinctPass.cpp
Comment thread src/Core/Settings.cpp
```
)", 0) \
DECLARE(Bool, count_distinct_optimization, false, R"(
DECLARE(Bool, count_distinct_optimization, true, R"(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change flips count_distinct_optimization to default-on globally, and the PR is categorized as Performance Improvement, so we need current-head performance evidence before merge.

For the current head commit (efc66bfe), the CI report still shows all Performance Comparison (*, master_head, */6) jobs as pending, so there is no completed before/after proof yet.

Please attach completed perf-comparison results (or an equivalent focused benchmark) that demonstrate the net effect after the remote/numeric gates.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The perf comparison has now completed on the current head. Filtering client_time results for DISTINCT/count(DISTINCT) queries (see fetch_perf_report.py):

Faster (4) — all are direct count(DISTINCT) workloads:

Δ Before After Query
-27.7% 114.5 ms 82.8 ms SELECT count(DISTINCT col_low_cardinality) FROM table_Native_with_nullable_columns
-13.2% 478.8 ms 416.0 ms SELECT COUNT(DISTINCT SearchPhrase) FROM hits_100m_single (ClickBench Q5)
-10.6% 161.8 ms 144.6 ms SELECT COUNT(DISTINCT Title) FROM test.hits SETTINGS max_threads = 64
-10.3% 68.3 ms 61.3 ms select distinctJSONPaths(json) from test_distinct_json_paths_compact_dynamic_…

Slower (0) in the DISTINCT filter.

Over the full suite the totals are 20 faster | 11 slower | 135 unstable across 3778 queries. The 11 slower entries are unrelated queries (e.g. count() WHERE URL LIKE '%google%', quantilesExactLow(0.5)(d32), cos(cos(...)), mapPopulateSeries, etc.) — none of them exercise count(DISTINCT) and they look like measurement noise rather than a systematic regression.

So the targeted workloads improve by 10–28% and there is no regression on the count(DISTINCT) queries the rewrite actually affects.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I re-checked the latest PR head commit (c686ab23a42b319eae63240e03fa3338f4e6a09a) in result_pr.json, and all Performance Comparison (*, master_head, */6) jobs are still PENDING there.

So for the current head we still do not have completed before/after evidence yet. Once those 12 perf shards finish, please attach the updated result (or a focused benchmark on this head), then this thread can be resolved.

…inctPass`

The previous `isRemote` guard only handled direct `TableNode` and
`TableFunctionNode` join-tree roots. A subquery wrapping a remote source
(e.g. `SELECT count(DISTINCT s) FROM (SELECT s FROM remote(...))`) reached
the rewrite via a `QueryNode` join tree and slipped past the check,
violating the "local tables only" contract.

Fail-close: skip the rewrite for any non-table join-tree root. Adds a new
test covering the subquery-over-`remote` and subquery-over-local cases.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Performance comparison for the current head (efc66bfe) has now completed across all 12 shards (6×AMD, 6×ARM):

  • Net: 14 faster, 12 slower, 113 unstable out of 3778 queries.
  • Among count(DISTINCT ...) queries specifically (the queries the change targets): 1 faster, 0 slower, 0 unstable out of 64 measured.
    • formats_columns_nullable #79: SELECT count(DISTINCT col_low_cardinality) FROM table_Native_with_nullable_columns−25.9% (110.6ms → 81.9ms) on AMD shard 5.
  • All other 63 count(DISTINCT ...) queries are unchanged.
  • The 12 "slower" queries are unrelated to count_distinct_optimization (bfloat16, MD5, mapPopulateSeries, datetime64_conversion, etc.) — typical CI noise. None of them go through the CountDistinctPass rewrite path.

So after the remote/numeric gates, the net effect on count(DISTINCT) workloads is positive (one clear win on LowCardinality(Nullable(...)), no regressions), and no other query class is affected.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

The Stress test (amd_debug) failure is unrelated to this change.

The server is killed by SIGSEGV during the post-stress restart inside StorageTimeSeries:

3. /ClickHouse/src/Storages/StorageTimeSeries.cpp:171:36: DB::StorageTimeSeries::StorageTimeSeries(...)

That line dereferences the result of DatabaseCatalog::instance().tryGetTable(target.table_id, getContext()) without a nullptr check — so when the external metrics table is missing on startup the server crashes. The code is identical on master and was last touched by the unrelated getInMemoryMetadataPtr refactor (commit 16e69dfc0f5, make getInMemoryMetadataPtr aware of context); this PR does not touch StorageTimeSeries or table-loading.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104472&sha=cc71b07ce843c12cc6e654117c431ce0734340c7&name_0=PR&name_1=Stress%20test%20%28amd_debug%29

@groeneai, please investigate the failure and provide a fix in a separate PR. If a fix is already in progress, link it here.

auto metadata_snapshot = storage->getInMemoryMetadataPtr(data.context, false);
if (!metadata_snapshot)
return;
auto physical_column = metadata_snapshot->getColumns().tryGetPhysical(column_name);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ASTIdentifier::name preserves qualifiers, so for SELECT count(DISTINCT t.s) FROM data AS t this becomes t.s.

Then tryGetPhysical(column_name) looks up t.s in table metadata, misses (s is the physical column name), and the new fail-close branch returns early. As a result, with enable_analyzer = 0 the rewrite is silently disabled for qualified identifiers on local string-like columns.

That regresses the PR contract (count_distinct_optimization default-on with old/new analyzer parity) for a common query shape.

Can we normalize the identifier to a storage column name before metadata lookup (strip table/db qualifiers but keep nested subcolumn paths), and keep using the original identifier text for the rewritten AST?

@clickhouse-gh

clickhouse-gh Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.10% 84.10% +0.00%
Functions 92.00% 92.00% +0.00%
Branches 76.50% 76.60% +0.10%

Changed lines: 88.71% (55/62) · Uncovered code

Full report · Diff report

@alexey-milovidov

Copy link
Copy Markdown
Member Author

pull Bot pushed a commit to AnotherGenZ/ClickHouse that referenced this pull request Jun 6, 2026
…-4165)

A `BackgroundJobsAssignee` task on a pool thread can race with
`~StorageMergeTree` / `~StorageReplicatedMergeTree` and dispatch
`data.scheduleDataProcessingJob(*this)` through a vtable that has already
demoted to `MergeTreeData`. `scheduleDataProcessingJob` is pure virtual
at the `MergeTreeData` level (only the leaf storages override it), so
the dispatch lands on `__cxa_pure_virtual` and aborts the server.

This is reachable in practice when `shutdown` throws partway through
the cleanup. `StorageReplicatedMergeTree::~StorageReplicatedMergeTree`
wraps the `shutdown` call in a try/catch and swallows the exception, so
member destruction proceeds with the background scheduler still alive.
`~BackgroundJobsAssignee::finish()` then calls `deactivate()`, which
blocks on `exec_mutex` waiting for the in-flight `threadFunc` to finish
— but `threadFunc`'s next call is the pure-virtual dispatch above.

The fix is a defensive `false`-returning override of
`scheduleDataProcessingJob` on `MergeTreeData`. In normal operation,
the derived overrides win the virtual dispatch and this default is
dead code. During the destruction race, the default takes the dispatch:
`threadFunc` reads `false` as "no work scheduled" and falls through to
`postpone()`, which is a no-op once `finish()` has moved the holder
out. The pool thread completes `execute()` normally, the destructor's
`deactivate()` returns, and destruction proceeds safely.

Affected runs (all identical stack traces, all unrelated PRs):
- PR ClickHouse#102343 (Paimon) — Stress test (arm_asan_ubsan)
- PR ClickHouse#101027 (obfuscateQuery revert) — Stress test (arm_asan_ubsan, s3)
- PR ClickHouse#100272 (h3 → h3o) — Stress test (arm_asan_ubsan, s3)
- PR ClickHouse#104472 (count_distinct_optimization) — Stress test (arm_debug)

CI report (latest):
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104472&sha=ad31d65c8cb512098dc41a848bfa9e42650e123e&name_0=PR&name_1=Stress%20test%20%28arm_debug%29

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-performance Pull request with some performance improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants