Fix TSan data race in uniqExact two-level parallel merge#109389
Fix TSan data race in uniqExact two-level parallel merge#109389groeneai wants to merge 1 commit into
Conversation
|
cc @Algunenano @nihalzp — could you review? This continues #108928: the |
|
Workflow [PR], commit [34b38b9] Summary: ❌
AI ReviewSummaryThis PR replaces the racy Missing context / blind spots
Final Verdict✅ No blocking findings in the current code. The previous inline issue about the old regression test no longer applies and is now resolved. |
9855bc4 to
43567f8
Compare
STID 1464-31ed. The two-level pointee of a UniqExactSet is shared across several UniqExactSet objects: the merge fast path (two_level_set = other.getTwoLevelSet()) aliases one pointee into an empty destination, and parallelizeMergeMulti pre-fetches raw TwoLevelSet* to every state and reads those buckets across threads. The parallel merge of ROLLUP/CUBE/GROUPING SETS states thus reads one pointee on one thread while another thread merges into a state that owns the same pointee. doDeepCopyIfNeeded forked (copied the pointee before mutating) only when shared_ptr::use_count() > 1. use_count() is not a synchronization primitive: a sibling holder being destroyed drops the count 2 -> 1 with no happens-before against the pointee's memory, so one state can observe use_count == 1 and write the pointee's buckets in place while another thread is still reading the same pointee. That is a data race (TSan reports it; issue ClickHouse#108912 was a segfault from the same in-place write). PR ClickHouse#108928 moved the copy-on-write from the read-intent getTwoLevelSet() to the destination path but kept the racy use_count() guard, so the race persisted. Attach a monotonic is_shared atomic flag to the pointee (SharedTwoLevelSet wrapper). getTwoLevelSet() sets it (release) before the pointee escapes; doDeepCopyIfNeeded() forks when it is set (acquire) instead of consulting use_count(). A pointee that another thread can reach is therefore never mutated in place; a freshly forked copy is solely owned and unshared, so the common single-consumer case keeps mutating in place (preserving ClickHouse#108928's copy elision). Regression test (gtest UniqExactSharedPointeeCoW): a direct, deterministic test of the exact lifetime transition of STID 1464-31ed. A pointee P escapes to a second UniqExactSet via the merge fast path (use_count 2), the sibling holder is dropped (use_count 2 -> 1), then a merge mutates the surviving state. The test asserts the pointee address changes, i.e. the mutation forks P instead of writing the escaped instance in place. It fails before this change (in-place mutation, same address) and passes after it (fork, new address). A concurrent reader cannot form a valid fails-before/passes-after gate: any holder that keeps P alive for the reader also bumps use_count so the buggy guard forks too, while dropping it to force use_count == 1 makes the fix free P under the reader. So the fork-vs-in-place decision, the necessary and sufficient cause of the race, is asserted directly. Existing UniqExactParallelMerge gtests and 04489_uniq_exact_grouping_sets_concurrent_merge stay green, and parallel vs single-thread GROUPING SETS results are identical. Related: ClickHouse#108912
43567f8 to
34b38b9
Compare
Algunenano
left a comment
There was a problem hiding this comment.
Isn’t this already fixed by #108928 ?
| void doDeepCopyIfNeeded() | ||
| { | ||
| if (two_level_set && two_level_set.use_count() > 1) | ||
| if (two_level_set && two_level_set->is_shared.load(std::memory_order_acquire)) |
There was a problem hiding this comment.
Yeah, I don’t know why we use use_count() as a way to handle parallelism. Seems it’s broken by design
There was a problem hiding this comment.
Agreed. use_count() is not a happens-before edge: a sibling holder's destructor dropping the count 2 -> 1 does not synchronize with the pointee's memory, so a writer that observes use_count() == 1 can mutate in place while a raw TwoLevelSet* reader is still in flight (STID 1464-31ed).
This PR removes that guard. The pointee now carries a monotonic is_shared atomic, set (release) in getTwoLevelSet() the moment it escapes to another UniqExactSet, and checked (acquire) in doDeepCopyIfNeeded(). A pointee that has ever been handed out is never mutated in place. A freshly-forked or solely-owned pointee keeps is_shared == false, so it still mutates in place and #108928's copy elision for that case is preserved.
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered: 55/55 (100.00%) · Uncovered code |
CI finish ledger — 34b38b9CI fully finished on this head (Finish Workflow + Config Workflow green, no checks queued/in_progress). Every failure below has an owner. Note: the earlier Session id: cron:our-pr-ci-monitor:20260704-110000 |

Related: #108912
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix a data race (and resulting rare crash) in the parallel merge of
uniqExactaggregate states with two-level sets, which could occur withGROUPING SETS,ROLLUPandCUBE.Description
TSan finding STID 1464-31ed on
Stress test (arm_tsan)/Stress test (amd_tsan), still firing after the previous fix in #108928.Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=109285&sha=aa124d023377a63a7e25cde902e4a2594150f256&name_0=PR&name_1=Stress%20test%20%28arm_tsan%29
Stack (write of size 8 by one thread):
racing a concurrent read of the same two-level pointee.
Root cause. A
UniqExactSet's two-level pointee is shared across severalUniqExactSetobjects: the merge fast path (two_level_set = other.getTwoLevelSet()) aliases one pointee into an empty destination, andparallelizeMergeMultipre-fetches rawTwoLevelSet*to every state and reads those buckets across threads. So the parallel merge ofGROUPING SETS/ROLLUP/CUBEstates reads one pointee on one thread while another thread merges into a state that owns the same pointee.doDeepCopyIfNeeded()forked (copied the pointee before mutating) only whenshared_ptr::use_count() > 1.use_count()is not a synchronization primitive: a sibling holder being destroyed drops the count 2 -> 1 with no happens-before against the pointee's memory, so one state can observeuse_count == 1and write the pointee's buckets in place while another thread is still reading the same pointee. #108928 moved the copy-on-write fromgetTwoLevelSet()to the destination path but kept the racyuse_count()guard, so the race persisted (this PR's branch — and the failing CI runs — already contain #108928).Fix. Attach a monotonic
is_sharedatomic flag to the pointee (a smallSharedTwoLevelSetwrapper).getTwoLevelSet()sets it (release) before the pointee escapes;doDeepCopyIfNeeded()forks when it is set (acquire) instead of consultinguse_count(). A pointee that another thread can reach is therefore never mutated in place. A freshly forked copy is solely owned and unshared, so the common single-consumer case keeps mutating in place, preserving #108928's copy elision.Testing. Added a unit test (
gtest_uniq_exact_concurrent_alias_merge) that merges into and reads one shared two-level pointee from two threads. It reproduces the race on the pointee buckets before the fix (WRITEemplaceNonZeroImplHashTable.h:1054<-UniqExactSet::merge; READHashTable::sizeHashTable.h:1460; sameTwoLevelHashSetTableheap block) and is TSan-clean after. The existingUniqExactParallelMergeunit tests and04489_uniq_exact_grouping_sets_concurrent_mergestay green, and parallel vs single-threadGROUPING SETSresults are identical.