Fix recursive MSan report in HandledSignals::reset on shutdown by groeneai · Pull Request #107442 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix recursive MSan report in HandledSignals::reset on shutdown#107442

Open
groeneai wants to merge 1 commit into
ClickHouse:masterfrom
groeneai:fix-msan-recursion-handledsignals-reset
Open

Fix recursive MSan report in HandledSignals::reset on shutdown#107442
groeneai wants to merge 1 commit into
ClickHouse:masterfrom
groeneai:fix-msan-recursion-handledsignals-reset

Conversation

@groeneai

@groeneai groeneai commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Description

Related: #107425 (where this failure was triaged and the fix was requested).

Fixes a use-after-dtor of the HandledSignals singleton on shutdown, originally seen in Stress test (arm_msan) / Stress test (amd_msan) as an infinitely recursive MSan report. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=107425&sha=b2f68b66f7510cdaaec64fb58e7c5601b58180c0&name_0=PR&name_1=Stress%20test%20%28arm_msan%29

Root cause: HandledSignals::instance() was a function-local static, so the singleton was destroyed at process exit. But the object is reachable from the signal handlers and from the registered sanitizer death callback (sanitizerDeathCallback() -> instance().reset(false)), both of which can run after static destructors. After destruction MSan poisons the members ("Member fields were destroyed" is MSan's use-after-dtor origin), so reset() iterating the destroyed handled_signals vector reads poisoned memory, raises an MSan warning, which calls Die() -> the death callback -> reset() again, recursing until the original report is lost.

Fix: intentionally leak the singleton (never destroy it), the same idiom already used for DateLUT::getInstance(). Its members then outlive every signal handler and death callback, so the post-exit read is on live, valid memory and no warning (and therefore no recursion) is produced. The now-redundant destructor (which only called reset() during static teardown, the exact window this removes) is dropped; normal shutdown still resets handlers and closes the pipe via the explicit instance().reset() calls in the BaseDaemon and ClientApplicationBase destructors.

No behavior change in non-sanitizer builds.

@groeneai

Copy link
Copy Markdown
Contributor Author

@groeneai

Copy link
Copy Markdown
Contributor Author

cc @azat @Algunenano — could you review this? It marks HandledSignals::reset with DISABLE_SANITIZER_INSTRUMENTATION to stop an infinite MSan recursion on shutdown: the sanitizer death callback (already non-instrumented) calls reset(), which iterates the poisoned handled_signals while still instrumented, re-triggering Die(). Same callback hardened earlier in #78178.

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Jun 14, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [fe7e8f8]

Summary:

job_name test_name status info comment
Performance Comparison (arm_release, master_head, 3/6) FAIL Performance dashboard
group_by_fixed_keys #5::old FAIL query history
group_by_fixed_keys #5::new FAIL query history
group_by_fixed_keys #6::old FAIL query history
group_by_fixed_keys #6::new FAIL query history
group_by_fixed_keys #9::old FAIL query history
group_by_fixed_keys #9::new FAIL query history
group_by_fixed_keys #10::old FAIL query history
group_by_fixed_keys #10::new FAIL query history
group_by_fixed_keys #11::old FAIL query history
group_by_fixed_keys #11::new FAIL query history
10 more test cases not shown

AI Review

Summary

The PR changes HandledSignals::instance from a destructed Meyers singleton to an intentionally leaked singleton and removes the destructor that reset handlers during static teardown. That matches the shutdown invariant the PR is trying to restore: signal handlers and the sanitizer death callback can still reach HandledSignals after static destructors, so the storage they read must not be destroyed. I did not find any remaining code issue that warrants an inline review comment.

Final Verdict

Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-ci label Jun 14, 2026
Comment thread src/Common/SignalHandlers.cpp Outdated
@groeneai groeneai force-pushed the fix-msan-recursion-handledsignals-reset branch from 36f9f92 to f057d41 Compare June 14, 2026 19:00
@groeneai

Copy link
Copy Markdown
Contributor Author

Pre-PR validation gate (revised fix)

# Question Answer
a Deterministic repro? The failure is a sanitizer-only, nondeterministic shutdown artifact (exit-time death-callback reentrancy, no SQL trigger), so a stable on-demand functional repro is not feasible. Instead the exact mechanism is reproduced deterministically at the IR level: compiling instance() with the CI MSan flags (-fsanitize=memory -fsanitize-memory-use-after-dtor) shows the Meyers-singleton form schedules a destructor (__cxa_atexit) and poisons members on destruction (__sanitizer_dtor_callback), which is the use-after-dtor seed read by reset().
b Root cause explained? HandledSignals::instance() was a function-local static, destroyed at process exit. The object is reachable from signal handlers and the registered sanitizer death callback (sanitizerDeathCallback -> instance().reset(false)), both of which can run after static destructors. After destruction MSan poisons the members; reset() iterating the destroyed handled_signals vector reads poisoned memory, fires an MSan warning, which calls Die() -> the death callback -> reset() again, recursing until the original report is lost and the job becomes ERROR.
c Fix matches root cause? Yes. The bad state is a destroyed-then-read singleton. The fix makes the singleton never-destroyed (intentionally leaked), so the read is on live, valid memory rather than freed/poisoned storage. This addresses the source (invalid access), not just the symptom (the MSan report).
d Test intent preserved / new tests added? No existing test is modified or weakened. No new functional test is added: a sanitizer-only, nondeterministic, no-SQL-trigger shutdown artifact has no stable functional regression test; the IR-level structural check (no __cxa_atexit / no dtor poisoning for the leaked instance) is the verification.
e Both directions demonstrated? Yes, at the IR level with the CI MSan flags: OLD (Meyers) emits __cxa_atexit=2 and __sanitizer_dtor_callback=2 (the seed present); NEW (leaked) emits __cxa_atexit=0 and __sanitizer_dtor_callback=0 (seed removed). The modified TU also compiles clean in the debug build (RC=0).
f Fix is general, not a narrow patch? Yes. The leak also closes the related window where signal handlers (which write to instance().signal_pipe at lines 75/101/126/213) could touch a destroyed PipeFDs if a signal arrived during/after static teardown, not only the reset() path the bot flagged. All other instance().reset() callers (BaseDaemon, ClientApplicationBase, disks, gtest) are unaffected and still perform explicit cleanup.

Session id: cron:clickhouse-worker-slot-3:20260614-184900

The HandledSignals singleton is reachable from signal handlers and the
sanitizer death callback, both of which can run after static destructors.
With the previous Meyers singleton it was destroyed at process exit, so
HandledSignals::reset() iterated the destroyed handled_signals vector. Under
MSan that read fired a use-of-uninitialized-value warning, whose death
callback called reset() again, recursing until the original report was lost.

Leak the singleton (same idiom as DateLUT::getInstance) so its members
outlive all teardown and reset() always reads valid memory. The explicit
~HandledSignals (which only called reset() during static teardown, the exact
window removed) is dropped; normal shutdown still resets via the explicit
instance().reset() calls in ~BaseDaemon and ~ClientApplicationBase.

Do not disable instrumentation on reset(): an uninstrumented reset() does
not write defined shadow for its loop-iterator locals, so they inherit stale
poison from just-destroyed locals in the preceding writeSignalIDtoSignalPipe
call; the still-instrumented operator== then reports a false positive on every
client/local exit. The leak alone removes the recursion seed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai groeneai force-pushed the fix-msan-recursion-handledsignals-reset branch from f057d41 to fe7e8f8 Compare June 14, 2026 23:54
@groeneai

Copy link
Copy Markdown
Contributor Author

Pre-PR validation gate (updated for head fe7e8f8)

# Question Answer
a Deterministic repro? Yes. clickhouse local -q "SELECT 1" under MSan aborts on every exit (exit 134). Confirmed on the amd_msan binary for the prior head f057d41 and reproduced locally.
b Root cause explained? Two layers. (1) Original: the Meyers-singleton HandledSignals is destroyed at process exit, but it is reachable from signal handlers + the sanitizer death callback (which run after static dtors); reset() then iterates the destroyed handled_signals vector -> MSan warning -> Die() -> death callback -> reset() -> recursion. (2) The first revision kept DISABLE_SANITIZER_INSTRUMENTATION on reset(); an uninstrumented reset() never writes "defined" shadow for its range-loop iterator locals, so they inherit stale poison from the just-destroyed locals in the preceding writeSignalIDtoSignalPipe() call, and the instrumented operator== callee reports it -> deterministic abort on the normal shutdown path (~ClientApplicationBase:67).
c Fix matches root cause? Yes. Leak the singleton so members are never destroyed/poisoned (fixes both the recursion and the destroyed-state access at the source), and remove DISABLE_SANITIZER_INSTRUMENTATION (the cause of layer 2; unnecessary once nothing is poisoned). Not a suppression.
d Test intent preserved / new tests added? No assertions weakened. No dedicated regression test added: the artifact is a sanitizer-only, nondeterministic shutdown report with no SQL trigger, so a stable functional test is not feasible; the amd_msan job itself (every client/local invocation exits cleanly) is the regression guard, and the integration suite exercises it thousands of times per run.
e Both directions demonstrated? Yes, with amd_msan binaries built from source. OLD (f057d41): clickhouse local -q "SELECT 1" -> exit 134, 36 use-of-uninit warnings. NEW (fe7e8f8): exit 0, 0 warnings, HandledSignals::reset absent from all stacks. Also clean: numbers(1000), version(), client --help. MSAN_OPTIONS=poison_in_dtor=0 makes OLD pass, confirming the use-after-dtor origin.
f Fix is general, not a narrow patch? Yes. The fix is at the source (object lifetime), not a guard at the crash site. The only other callers of HandledSignals::instance().reset() (~BaseDaemon, ~ClientApplicationBase) now also operate on the live, never-destroyed object, so the same class of post-teardown read is fixed for all of them rather than one path. No sibling implementations exist.

Session id: cron:clickhouse-worker-slot-0:20260614-225000

@clickhouse-gh

clickhouse-gh Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.10% 85.10% +0.00%
Functions 92.40% 92.40% +0.00%
Branches 77.40% 77.30% -0.10%

Changed lines: Changed C/C++ lines covered by tests: 3/3 (100.00%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 107 line(s) · Uncovered code

Full report · Diff report

pull Bot pushed a commit to sowelswl/ClickHouse that referenced this pull request Jun 30, 2026
`test_implicit_index_upgrade_alter_replay` creates a second replica `node2`
and asserts `SELECT count() FROM test_alter_replay` equals `10001` immediately
after `wait_for_active_replica`. But `wait_for_active_replica` only waits for
`is_readonly = 0`, not for the freshly-joined replica to finish fetching parts.

So `node2` can observe only the small `VALUES` part (1 row, `key = 99999`)
before the 10000-row part is fetched, and the assertion fails with
`assert '1' == '10001'`. This raced once each on several unrelated PRs over
the last 30 days (e.g. ClickHouse#108084, ClickHouse#99280, ClickHouse#107566, ClickHouse#107442).

Add `SYSTEM SYNC REPLICA test_alter_replay` after the replica becomes active,
which blocks until all parts are fetched, before checking the row count.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108770&sha=bb760616d54e3d8e5d20968022085506c31b5f37&name_0=PR&name_1=Integration%20tests%20%28arm_binary%2C%20distributed%20plan%2C%201%2F4%29

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-ci

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants