Fix uniqExact concurrent-merge crash with GROUPING SETS/ROLLUP/CUBE by Algunenano · Pull Request #108928 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix uniqExact concurrent-merge crash with GROUPING SETS/ROLLUP/CUBE#108928

Merged
Algunenano merged 8 commits into
ClickHouse:masterfrom
Algunenano:fix-uniqexact-grouping-sets-concurrent-merge
Jul 2, 2026
Merged

Fix uniqExact concurrent-merge crash with GROUPING SETS/ROLLUP/CUBE#108928
Algunenano merged 8 commits into
ClickHouse:masterfrom
Algunenano:fix-uniqexact-grouping-sets-concurrent-merge

Conversation

@Algunenano

@Algunenano Algunenano commented Jun 30, 2026

Copy link
Copy Markdown
Member

Closes: #108912

uniqExact's getTwoLevelSet() is const but performed a copy-on-write of the source set (doDeepCopyIfNeeded) on a path that is read concurrently. With GROUPING SETS/ROLLUP/CUBE the same state is merged into several destinations from multiple threads, so concurrent calls mutated the shared source's two_level_set pointer without synchronization, corrupting it and crashing with a segmentation fault.

The fix moves the copy-on-write to the destination's mutable accessor (asTwoLevelChecked), which runs on a thread-local accumulator and forks before any in-place mutation. getTwoLevelSet() becomes genuinely read-only, so an already-two-level source is now shared across threads via shared_ptr instead of being copied. This removes copies rather than adding them: the old read path deep-copied the source the first time it was shared, whereas the new code shares it; the necessary copy is deferred to the destination side and bounded by the number of mutated destinations, not the thread count. mutable is dropped from two_level_set so the compiler enforces that a const set cannot be modified.

Changelog category (leave one):

  • Critical Bug Fix (crash, data loss, RBAC)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix a segmentation fault when merging uniqExact aggregate states with GROUPING SETS, ROLLUP or CUBE and max_threads > 1.

Version info

  • Merged into: 26.7.1.423
  • Backported to: 26.6.2.22, 26.5.6.11, 26.4.5.96, 26.3.17.18, 25.8.26.11

The const getTwoLevelSet mutated the shared source set via doDeepCopyIfNeeded;
concurrent merges of the same state corrupted it. Do the copy-on-write only on
the destination's mutable path. Closes ClickHouse#108912
@Algunenano Algunenano requested a review from nickitat June 30, 2026 11:34
@clickhouse-gh

clickhouse-gh Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added pr-critical-bugfix pr-must-backport Pull request should be backported intentionally. Use this label with great care! labels Jun 30, 2026
-- Each (a, b, c) group accumulates > 100000 distinct values, so the uniqExact sets become two-level.
INSERT INTO st_04489 SELECT number % 4, number % 3, number % 2, uniqExactState(number) FROM numbers_mt(4000000) GROUP BY 1, 2, 3;

-- Repeat a few times: the race is timing dependent but used to crash almost immediately.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that even though it was timing dependent in release mode, TSAN showed the race every time running this test. That's why I didn't add a race condition test (well, I added it and then removed it because it wasn't needed)

@UnamedRus

UnamedRus commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

We have two-level enable/disable/trigger settings for group by, but not for uniqExactHash optimization, which looks a bit lacking. (it would allow to have W/A for when this more "complex" optimization is getting triggered)

@nickitat nickitat self-assigned this Jun 30, 2026
Correctness is checked in other tests, we only want to fix the race conditions / crashes
Comment thread src/AggregateFunctions/UniqExactSet.h
Algunenano and others added 2 commits July 1, 2026 17:42
The 4M-row insert made the test exceed the 180s per-run limit in the
amd_debug flaky check. 1.5M rows keeps each of the 12 groups at 125000
distinct values, still above the 100000 uniqExact auto-two-level
threshold, so the concurrent two-level merge path is still exercised.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.40% +0.00%
Functions 92.60% 92.60% +0.00%
Branches 77.60% 77.60% +0.00%

Changed lines: Changed C/C++ lines covered: 5/5 (100.00%) · Uncovered code

Full report · Diff report

@robot-ch-test-poll3 robot-ch-test-poll3 added pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-synced-to-cloud The PR is synced to the cloud repo labels Jul 2, 2026
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR label Jul 2, 2026
clickhouse-gh Bot added a commit that referenced this pull request Jul 2, 2026
Backport #108928 to 26.6: Fix uniqExact concurrent-merge crash with GROUPING SETS/ROLLUP/CUBE
Algunenano added a commit that referenced this pull request Jul 2, 2026
Backport #108928 to 26.5: Fix uniqExact concurrent-merge crash with GROUPING SETS/ROLLUP/CUBE
Algunenano added a commit that referenced this pull request Jul 2, 2026
Backport #108928 to 26.4: Fix uniqExact concurrent-merge crash with GROUPING SETS/ROLLUP/CUBE
Algunenano added a commit that referenced this pull request Jul 2, 2026
Backport #108928 to 26.3: Fix uniqExact concurrent-merge crash with GROUPING SETS/ROLLUP/CUBE
Algunenano added a commit that referenced this pull request Jul 2, 2026
Backport #108928 to 25.8: Fix uniqExact concurrent-merge crash with GROUPING SETS/ROLLUP/CUBE
groeneai added a commit to groeneai/ClickHouse that referenced this pull request Jul 4, 2026
STID 1464-31ed. The two-level pointee of a UniqExactSet is shared across
several UniqExactSet objects: the merge fast path (two_level_set =
other.getTwoLevelSet()) aliases one pointee into an empty destination, and
parallelizeMergeMulti pre-fetches raw TwoLevelSet* to every state and reads
those buckets across threads. The parallel merge of ROLLUP/CUBE/GROUPING SETS
states thus reads one pointee on one thread while another thread merges into a
state that owns the same pointee.

doDeepCopyIfNeeded forked (copied the pointee before mutating) only when
shared_ptr::use_count() > 1. use_count() is not a synchronization primitive: a
sibling holder being destroyed drops the count 2 -> 1 with no happens-before
against the pointee's memory, so one state can observe use_count == 1 and write
the pointee's buckets in place while another thread is still reading the same
pointee. That is a data race (TSan reports it; issue ClickHouse#108912 was a segfault
from the same in-place write). PR ClickHouse#108928 moved the copy-on-write from the
read-intent getTwoLevelSet() to the destination path but kept the racy
use_count() guard, so the race persisted.

Attach a monotonic is_shared atomic flag to the pointee (SharedTwoLevelSet
wrapper). getTwoLevelSet() sets it (release) before the pointee escapes;
doDeepCopyIfNeeded() forks when it is set (acquire) instead of consulting
use_count(). A pointee that another thread can reach is therefore never mutated
in place; a freshly forked copy is solely owned and unshared, so the common
single-consumer case keeps mutating in place (preserving ClickHouse#108928's copy
elision).

Reproduced under TSan with a unit test that merges into and reads one shared
two-level pointee from two threads: races on the pointee buckets before the fix
(WRITE emplaceNonZeroImpl HashTable.h <- UniqExactSet::merge; READ
HashTable::size), TSan-clean after. Existing UniqExactParallelMerge gtests and
04489_uniq_exact_grouping_sets_concurrent_merge stay green, and parallel vs
single-thread GROUPING SETS results are identical.

Related: ClickHouse#108912

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
groeneai added a commit to groeneai/ClickHouse that referenced this pull request Jul 4, 2026
STID 1464-31ed. The two-level pointee of a UniqExactSet is shared across
several UniqExactSet objects: the merge fast path (two_level_set =
other.getTwoLevelSet()) aliases one pointee into an empty destination, and
parallelizeMergeMulti pre-fetches raw TwoLevelSet* to every state and reads
those buckets across threads. The parallel merge of ROLLUP/CUBE/GROUPING SETS
states thus reads one pointee on one thread while another thread merges into a
state that owns the same pointee.

doDeepCopyIfNeeded forked (copied the pointee before mutating) only when
shared_ptr::use_count() > 1. use_count() is not a synchronization primitive: a
sibling holder being destroyed drops the count 2 -> 1 with no happens-before
against the pointee's memory, so one state can observe use_count == 1 and write
the pointee's buckets in place while another thread is still reading the same
pointee. That is a data race (TSan reports it; issue ClickHouse#108912 was a segfault
from the same in-place write). PR ClickHouse#108928 moved the copy-on-write from the
read-intent getTwoLevelSet() to the destination path but kept the racy
use_count() guard, so the race persisted.

Attach a monotonic is_shared atomic flag to the pointee (SharedTwoLevelSet
wrapper). getTwoLevelSet() sets it (release) before the pointee escapes;
doDeepCopyIfNeeded() forks when it is set (acquire) instead of consulting
use_count(). A pointee that another thread can reach is therefore never mutated
in place; a freshly forked copy is solely owned and unshared, so the common
single-consumer case keeps mutating in place (preserving ClickHouse#108928's copy
elision).

Regression test (gtest UniqExactSharedPointeeCoW): a direct, deterministic test
of the exact lifetime transition of STID 1464-31ed. A pointee P escapes to a
second UniqExactSet via the merge fast path (use_count 2), the sibling holder is
dropped (use_count 2 -> 1), then a merge mutates the surviving state. The test
asserts the pointee address changes, i.e. the mutation forks P instead of
writing the escaped instance in place. It fails before this change (in-place
mutation, same address) and passes after it (fork, new address). A concurrent
reader cannot form a valid fails-before/passes-after gate: any holder that keeps
P alive for the reader also bumps use_count so the buggy guard forks too, while
dropping it to force use_count == 1 makes the fix free P under the reader. So
the fork-vs-in-place decision, the necessary and sufficient cause of the race,
is asserted directly. Existing UniqExactParallelMerge gtests and
04489_uniq_exact_grouping_sets_concurrent_merge stay green, and parallel vs
single-thread GROUPING SETS results are identical.

Related: ClickHouse#108912
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-critical-bugfix pr-must-backport Pull request should be backported intentionally. Use this label with great care! pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Crash getTwoLevelSet + uniqExact + GROUPING SETS

5 participants