Fix recursive MSan report in HandledSignals::reset on shutdown#107442
Fix recursive MSan report in HandledSignals::reset on shutdown#107442groeneai wants to merge 1 commit into
Conversation
|
cc @azat @Algunenano — could you review this? It marks |
|
Workflow [PR], commit [fe7e8f8] Summary: ❌
AI ReviewSummaryThe PR changes Final VerdictStatus: ✅ Approve |
36f9f92 to
f057d41
Compare
Pre-PR validation gate (revised fix)
Session id: cron:clickhouse-worker-slot-3:20260614-184900 |
The HandledSignals singleton is reachable from signal handlers and the sanitizer death callback, both of which can run after static destructors. With the previous Meyers singleton it was destroyed at process exit, so HandledSignals::reset() iterated the destroyed handled_signals vector. Under MSan that read fired a use-of-uninitialized-value warning, whose death callback called reset() again, recursing until the original report was lost. Leak the singleton (same idiom as DateLUT::getInstance) so its members outlive all teardown and reset() always reads valid memory. The explicit ~HandledSignals (which only called reset() during static teardown, the exact window removed) is dropped; normal shutdown still resets via the explicit instance().reset() calls in ~BaseDaemon and ~ClientApplicationBase. Do not disable instrumentation on reset(): an uninstrumented reset() does not write defined shadow for its loop-iterator locals, so they inherit stale poison from just-destroyed locals in the preceding writeSignalIDtoSignalPipe call; the still-instrumented operator== then reports a false positive on every client/local exit. The leak alone removes the recursion seed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
f057d41 to
fe7e8f8
Compare
Pre-PR validation gate (updated for head fe7e8f8)
Session id: cron:clickhouse-worker-slot-0:20260614-225000 |
LLVM Coverage ReportChanged lines: Changed C/C++ lines covered by tests: 3/3 (100.00%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 107 line(s) · Uncovered code |
`test_implicit_index_upgrade_alter_replay` creates a second replica `node2` and asserts `SELECT count() FROM test_alter_replay` equals `10001` immediately after `wait_for_active_replica`. But `wait_for_active_replica` only waits for `is_readonly = 0`, not for the freshly-joined replica to finish fetching parts. So `node2` can observe only the small `VALUES` part (1 row, `key = 99999`) before the 10000-row part is fetched, and the assertion fails with `assert '1' == '10001'`. This raced once each on several unrelated PRs over the last 30 days (e.g. ClickHouse#108084, ClickHouse#99280, ClickHouse#107566, ClickHouse#107442). Add `SYSTEM SYNC REPLICA test_alter_replay` after the replica becomes active, which blocks until all parts are fetched, before checking the row count. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108770&sha=bb760616d54e3d8e5d20968022085506c31b5f37&name_0=PR&name_1=Integration%20tests%20%28arm_binary%2C%20distributed%20plan%2C%201%2F4%29 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
...
Description
Related: #107425 (where this failure was triaged and the fix was requested).
Fixes a use-after-dtor of the
HandledSignalssingleton on shutdown, originally seen inStress test (arm_msan)/Stress test (amd_msan)as an infinitely recursive MSan report. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=107425&sha=b2f68b66f7510cdaaec64fb58e7c5601b58180c0&name_0=PR&name_1=Stress%20test%20%28arm_msan%29Root cause:
HandledSignals::instance()was a function-local static, so the singleton was destroyed at process exit. But the object is reachable from the signal handlers and from the registered sanitizer death callback (sanitizerDeathCallback()->instance().reset(false)), both of which can run after static destructors. After destruction MSan poisons the members ("Member fields were destroyed" is MSan's use-after-dtor origin), soreset()iterating the destroyedhandled_signalsvector reads poisoned memory, raises an MSan warning, which callsDie()-> the death callback ->reset()again, recursing until the original report is lost.Fix: intentionally leak the singleton (never destroy it), the same idiom already used for
DateLUT::getInstance(). Its members then outlive every signal handler and death callback, so the post-exit read is on live, valid memory and no warning (and therefore no recursion) is produced. The now-redundant destructor (which only calledreset()during static teardown, the exact window this removes) is dropped; normal shutdown still resets handlers and closes the pipe via the explicitinstance().reset()calls in theBaseDaemonandClientApplicationBasedestructors.No behavior change in non-sanitizer builds.