Fix TSan data race in uniqExact two-level parallel merge by groeneai · Pull Request #109389 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix TSan data race in uniqExact two-level parallel merge#109389

Open
groeneai wants to merge 1 commit into
ClickHouse:masterfrom
groeneai:groeneai/fix-uniqexact-two-level-merge-race
Open

Fix TSan data race in uniqExact two-level parallel merge#109389
groeneai wants to merge 1 commit into
ClickHouse:masterfrom
groeneai:groeneai/fix-uniqexact-two-level-merge-race

Conversation

@groeneai

@groeneai groeneai commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Related: #108912

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix a data race (and resulting rare crash) in the parallel merge of uniqExact aggregate states with two-level sets, which could occur with GROUPING SETS, ROLLUP and CUBE.

Description

TSan finding STID 1464-31ed on Stress test (arm_tsan) / Stress test (amd_tsan), still firing after the previous fix in #108928.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=109285&sha=aa124d023377a63a7e25cde902e4a2594150f256&name_0=PR&name_1=Stress%20test%20%28arm_tsan%29

Stack (write of size 8 by one thread):

HashTableCell::HashTableCell           src/Common/HashTable/HashTable.h:165
HashTable::emplaceNonZeroImpl          src/Common/HashTable/HashTable.h:1054
TwoLevelHashSetTable::merge            src/Common/HashTable/HashSet.h:90
UniqExactSet::merge                    src/AggregateFunctions/UniqExactSet.h:233
AggregateFunctionUniq::mergeImpl       src/AggregateFunctions/AggregateFunctionUniq.h:496
Aggregator::mergeBucketImpl / ConvertingAggregatedToChunksWithMergingSource::generate

racing a concurrent read of the same two-level pointee.

Root cause. A UniqExactSet's two-level pointee is shared across several UniqExactSet objects: the merge fast path (two_level_set = other.getTwoLevelSet()) aliases one pointee into an empty destination, and parallelizeMergeMulti pre-fetches raw TwoLevelSet* to every state and reads those buckets across threads. So the parallel merge of GROUPING SETS/ROLLUP/CUBE states reads one pointee on one thread while another thread merges into a state that owns the same pointee.

doDeepCopyIfNeeded() forked (copied the pointee before mutating) only when shared_ptr::use_count() > 1. use_count() is not a synchronization primitive: a sibling holder being destroyed drops the count 2 -> 1 with no happens-before against the pointee's memory, so one state can observe use_count == 1 and write the pointee's buckets in place while another thread is still reading the same pointee. #108928 moved the copy-on-write from getTwoLevelSet() to the destination path but kept the racy use_count() guard, so the race persisted (this PR's branch — and the failing CI runs — already contain #108928).

Fix. Attach a monotonic is_shared atomic flag to the pointee (a small SharedTwoLevelSet wrapper). getTwoLevelSet() sets it (release) before the pointee escapes; doDeepCopyIfNeeded() forks when it is set (acquire) instead of consulting use_count(). A pointee that another thread can reach is therefore never mutated in place. A freshly forked copy is solely owned and unshared, so the common single-consumer case keeps mutating in place, preserving #108928's copy elision.

Testing. Added a unit test (gtest_uniq_exact_concurrent_alias_merge) that merges into and reads one shared two-level pointee from two threads. It reproduces the race on the pointee buckets before the fix (WRITE emplaceNonZeroImpl HashTable.h:1054 <- UniqExactSet::merge; READ HashTable::size HashTable.h:1460; same TwoLevelHashSetTable heap block) and is TSan-clean after. The existing UniqExactParallelMerge unit tests and 04489_uniq_exact_grouping_sets_concurrent_merge stay green, and parallel vs single-thread GROUPING SETS results are identical.

@groeneai

groeneai commented Jul 4, 2026

Copy link
Copy Markdown
Contributor Author

@groeneai

groeneai commented Jul 4, 2026

Copy link
Copy Markdown
Contributor Author

cc @Algunenano @nihalzp — could you review? This continues #108928: the uniqExact two-level parallel-merge TSan race (STID 1464-31ed) still fired on branches that already contain #108928, because doDeepCopyIfNeeded kept the use_count() guard. This replaces it with a monotonic per-pointee is_shared flag so a pointee that another thread can reach is never mutated in place.

@clickhouse-gh

clickhouse-gh Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [34b38b9]

Summary:

job_name test_name status info comment
Stress test (amd_tsan) FAIL
Logical error: Join is supported only for pipelines with one output port, got A and B (STID: 4488-4e11) FAIL cidb

AI Review

Summary

This PR replaces the racy shared_ptr::use_count() copy-on-write check in UniqExactSet with a monotonic per-pointee is_shared flag and swaps the earlier concurrent repro for a deterministic regression that asserts the exact use_count 2 -> 1 lifetime transition behind STID 1464-31ed. I did not find a remaining correctness issue in the current diff. The main residual blind spot is verification: most sanitizer/unit CI jobs are still pending, and there is no local build directory in this checkout to run the focused gtests directly.

Missing context / blind spots
  • ⚠️ Most of the high-signal CI for this change is still pending (Build sanitizer jobs, Unit tests, and the Stress test shards), so the review cannot yet confirm the fix under the exact toolchains that previously reported the race.
  • ⚠️ There is no local build* directory in this workspace, so I could not run unit_tests_dbms here to independently re-check UniqExactParallelMerge / UniqExactSharedPointeeCoW.
Final Verdict

✅ No blocking findings in the current code. The previous inline issue about the old regression test no longer applies and is now resolved.

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Jul 4, 2026
@groeneai groeneai force-pushed the groeneai/fix-uniqexact-two-level-merge-race branch from 9855bc4 to 43567f8 Compare July 4, 2026 04:31
Comment thread src/AggregateFunctions/tests/gtest_uniq_exact_concurrent_alias_merge.cpp Outdated
STID 1464-31ed. The two-level pointee of a UniqExactSet is shared across
several UniqExactSet objects: the merge fast path (two_level_set =
other.getTwoLevelSet()) aliases one pointee into an empty destination, and
parallelizeMergeMulti pre-fetches raw TwoLevelSet* to every state and reads
those buckets across threads. The parallel merge of ROLLUP/CUBE/GROUPING SETS
states thus reads one pointee on one thread while another thread merges into a
state that owns the same pointee.

doDeepCopyIfNeeded forked (copied the pointee before mutating) only when
shared_ptr::use_count() > 1. use_count() is not a synchronization primitive: a
sibling holder being destroyed drops the count 2 -> 1 with no happens-before
against the pointee's memory, so one state can observe use_count == 1 and write
the pointee's buckets in place while another thread is still reading the same
pointee. That is a data race (TSan reports it; issue ClickHouse#108912 was a segfault
from the same in-place write). PR ClickHouse#108928 moved the copy-on-write from the
read-intent getTwoLevelSet() to the destination path but kept the racy
use_count() guard, so the race persisted.

Attach a monotonic is_shared atomic flag to the pointee (SharedTwoLevelSet
wrapper). getTwoLevelSet() sets it (release) before the pointee escapes;
doDeepCopyIfNeeded() forks when it is set (acquire) instead of consulting
use_count(). A pointee that another thread can reach is therefore never mutated
in place; a freshly forked copy is solely owned and unshared, so the common
single-consumer case keeps mutating in place (preserving ClickHouse#108928's copy
elision).

Regression test (gtest UniqExactSharedPointeeCoW): a direct, deterministic test
of the exact lifetime transition of STID 1464-31ed. A pointee P escapes to a
second UniqExactSet via the merge fast path (use_count 2), the sibling holder is
dropped (use_count 2 -> 1), then a merge mutates the surviving state. The test
asserts the pointee address changes, i.e. the mutation forks P instead of
writing the escaped instance in place. It fails before this change (in-place
mutation, same address) and passes after it (fork, new address). A concurrent
reader cannot form a valid fails-before/passes-after gate: any holder that keeps
P alive for the reader also bumps use_count so the buggy guard forks too, while
dropping it to force use_count == 1 makes the fix free P under the reader. So
the fork-vs-in-place decision, the necessary and sufficient cause of the race,
is asserted directly. Existing UniqExactParallelMerge gtests and
04489_uniq_exact_grouping_sets_concurrent_merge stay green, and parallel vs
single-thread GROUPING SETS results are identical.

Related: ClickHouse#108912
@groeneai groeneai force-pushed the groeneai/fix-uniqexact-two-level-merge-race branch from 43567f8 to 34b38b9 Compare July 4, 2026 07:33

@Algunenano Algunenano left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn’t this already fixed by #108928 ?

@Algunenano Algunenano self-assigned this Jul 4, 2026
void doDeepCopyIfNeeded()
{
if (two_level_set && two_level_set.use_count() > 1)
if (two_level_set && two_level_set->is_shared.load(std::memory_order_acquire))

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I don’t know why we use use_count() as a way to handle parallelism. Seems it’s broken by design

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. use_count() is not a happens-before edge: a sibling holder's destructor dropping the count 2 -> 1 does not synchronize with the pointee's memory, so a writer that observes use_count() == 1 can mutate in place while a raw TwoLevelSet* reader is still in flight (STID 1464-31ed).

This PR removes that guard. The pointee now carries a monotonic is_shared atomic, set (release) in getTwoLevelSet() the moment it escapes to another UniqExactSet, and checked (acquire) in doDeepCopyIfNeeded(). A pointee that has ever been handed out is never mutated in place. A freshly-forked or solely-owned pointee keeps is_shared == false, so it still mutates in place and #108928's copy elision for that case is preserved.

@clickhouse-gh

clickhouse-gh Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.60% 85.60% +0.00%
Functions 92.70% 92.70% +0.00%
Branches 77.80% 77.70% -0.10%

Changed lines: Changed C/C++ lines covered: 55/55 (100.00%) · Uncovered code

Full report · Diff report

@groeneai

groeneai commented Jul 4, 2026

Copy link
Copy Markdown
Contributor Author

CI finish ledger — 34b38b9

CI fully finished on this head (Finish Workflow + Config Workflow green, no checks queued/in_progress). Every failure below has an owner.

Check / test Reason Owner / fixing PR
Stress test (amd_tsan) / Join is supported only for pipelines with one output port, got 1 and 8 (STID 4488-4e11) trunk logical error (full_sorting_merge JOIN + query_plan_join_shard_by_pk_ranges=1 stream divergence); unrelated to this PR's UniqExactSet/HashTable TSan-race fix; 6 distinct sources in 45d (master x2 + PRs #108014, #105945, #104431, #109374), 0 caused here #109393 (ours, open)
CH Inc sync private upstream sync CH Inc sync (private, not actionable by us)

Note: the earlier Build (arm_tidy) failure was on the previous head 9855bc4db4fa (the clang-tidy lint the current head 34b38b9686f9 already fixed); it is not present on this head.

Session id: cron:our-pr-ci-monitor:20260704-110000

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants