
r1viollet · 2026-06-23T13:25:58Z

What does this PR do?:

Closes two TOCTOU races between the SIGPROF signal handler and JFR lifecycle transitions that could cause SIGSEGV or hangs in the test JVM during the 60-second recording cycle rotation.

Race 1 — stop() side (ctimer_linux.cpp):

disableEngines() set _enabled=false, but a handler that had already passed the _enabled=true check could still be executing inside recordSample() when _jfr.stop() freed the JFR buffers → use-after-free → SIGSEGV (or a hang if the crash is caught by crashtracking).

Fix: add an _inflight counter, incremented on every handler entry before the _enabled check, decremented on every exit path. CTimer::stop() calls drainInflight() after deleting per-thread timers, spinning until _inflight==0 before returning. The caller (Profiler::stop) then proceeds to _jfr.stop() only once all handlers have fully exited.
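The counter-then-check ordering described above can be sketched as follows. This is a minimal illustration, not the code in ctimer_linux.cpp: the globals, `InflightGuard`, `handlerBody()` and `stopAndDrain()` are hypothetical names, `g_samples` stands in for recordSample(), and plain function calls stand in for signal delivery.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-ins for the profiler state; names only mirror the PR text.
static std::atomic<bool> g_enabled{false};
static std::atomic<int>  g_inflight{0};
static std::atomic<long> g_samples{0};

// RAII guard: the destructor runs on every return path, so no exit
// can forget the decrement.
struct InflightGuard {
    InflightGuard()  { g_inflight.fetch_add(1); }
    ~InflightGuard() { g_inflight.fetch_sub(1); }
};

// Models the handler body: count first, check the enabled flag second.
void handlerBody() {
    InflightGuard guard;
    assert(g_inflight.load() > 0);   // our own increment is already counted
    if (!g_enabled.load()) return;   // early exit still decrements via the guard
    g_samples.fetch_add(1);          // stands in for recordSample()
}

// Models drainInflight(): spin until no handler is between entry and exit.
void drainInflight() {
    while (g_inflight.load() != 0) {
        std::this_thread::yield();
    }
}

// Models the stop() ordering: disable, drain, and only then let the caller
// free the buffers. Returns the sample count observed after the drain.
long stopAndDrain() {
    g_enabled.store(false);
    drainInflight();
    return g_samples.load();
}
```

The sketch deliberately uses the default sequentially consistent ordering on all four operations (increment, flag load, flag store, counter load). With weaker orderings, the handler's flag load and the stopper's counter load could both observe stale values, which is exactly the window the counter is meant to close.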

Race 2 — start() side (profiler.cpp):

enableEngines() set _enabled=true before _jfr.start() had completed. A SIGPROF delivered in that window would see _enabled=true and call recordSample() on partially-initialized JFR structures.

Fix: move enableEngines() to after both _jfr.start() and _cpu_engine->start() have returned successfully (immediately before _state.store(RUNNING)).
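A rough model of the reordered start path is shown below. `FakeJfr`, `FakeCpuEngine` and `ProfilerSketch` are hypothetical types invented for illustration, and the rollback of JFR when the engine fails to start is an assumption of this sketch, not something the PR states.

```cpp
#include <atomic>
#include <cassert>

enum class State { IDLE, RUNNING };

// Hypothetical stand-in for the JFR recorder.
struct FakeJfr {
    bool ready = false;
    bool fail = false;
    bool start() { if (fail) return false; ready = true; return true; }
    void stop() { ready = false; }
};

// Hypothetical stand-in for the CPU engine that arms the timers.
struct FakeCpuEngine {
    bool armed = false;
    bool fail = false;
    bool start() { if (fail) return false; armed = true; return true; }
};

struct ProfilerSketch {
    std::atomic<bool> enabled{false};
    std::atomic<State> state{State::IDLE};
    FakeJfr jfr;
    FakeCpuEngine cpu;

    bool start() {
        if (!jfr.start()) return false;
        if (!cpu.start()) {
            jfr.stop();              // assumed rollback, not taken from the PR
            return false;
        }
        enabled.store(true);         // enableEngines(): only after both starts succeeded
        state.store(State::RUNNING);
        return true;
    }

    // Models the handler: returns true when it would have recorded a sample.
    bool onSignal() const {
        if (!enabled.load()) return false;
        assert(jfr.ready);           // with this ordering, JFR is always initialized here
        return true;
    }
};
```

In this arrangement a signal that arrives at any point before `enabled.store(true)` takes the early-return branch, so the handler never observes partially initialized recorder state.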

Motivation:

Discovered while investigating intermittent SIGSEGV (exit 139) and hang failures in DataDog/profiling-backend CI. Bisected to a dd-trace-java commit that changed instrumentation initialization timing, shifting when the 60-second recording cycle boundary fell relative to test thread activity — exposing both races reliably enough to isolate.

How to test the change?:

Controlled reproducer in DataDog/profiling-backend using AnalysisEndpointTest.testResourceExhausted with the bad dd-trace-java agent (0e13e90dac) and a patched libjavaProfiler.so:

Without fix: ~60% failure rate per iteration (SIGSEGV / hang)
Race 1 fix only (drainInflight): ~20% failure rate — Race 2 still active
Race 2 fix only (move enableEngines): ~40% failure rate — Race 1 still active
Both fixes together: 12/12 iterations clean against v_1.44.0 baseline

Additional Notes:

drainInflight() is an unbounded spin. In practice recordSample() completes in microseconds so this is safe, but a bounded spin with a log warning could be added as a follow-up.
The _inflight counter is incremented even when CriticalSection fails (handler returns early without touching JFR). This is intentional: it makes the drain conservative and guarantees the counter reaches zero only after all code paths between the counter increment and any potential JFR access have completed.
Related: Revert "Ignore capturing connection continuation for armeria (#11657)" dd-trace-java#11685 (revert of the dd-trace-java commit that exposed these races).
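The bounded spin suggested as a follow-up in the notes above might look roughly like this. `drainInflightBounded()` and its parameters are hypothetical, and logging through `fprintf` is a placeholder for whatever logging facility the profiler actually uses.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstdio>
#include <thread>

// Spins until `inflight` reaches zero or `budget` elapses.
// Returns true on a clean drain, false (after logging a warning) on timeout.
bool drainInflightBounded(const std::atomic<int>& inflight,
                          std::chrono::milliseconds budget) {
    assert(budget.count() >= 0);
    const auto deadline = std::chrono::steady_clock::now() + budget;
    while (inflight.load() != 0) {
        if (std::chrono::steady_clock::now() >= deadline) {
            std::fprintf(stderr,
                         "drainInflight: %d handler(s) still in flight after %lld ms\n",
                         inflight.load(),
                         static_cast<long long>(budget.count()));
            return false;
        }
        std::this_thread::yield();
    }
    return true;
}
```

What the caller should do on a `false` return is left open here: proceeding to _jfr.stop() anyway would reintroduce the use-after-free window, so a timeout is probably better treated as a diagnostic than as permission to continue.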


… lifecycle

The CPU profiler sends SIGPROF to all threads via per-thread kernel timers. The signal handler checks _enabled and, if true, calls recordSample(), which accesses JFR buffers. Two races existed around the recording cycle transition (default every 60 s) where JFR structures could be in mid-init or mid-teardown while the handler was active:

Race 1, stop() side (TOCTOU on _enabled vs _jfr.stop()):
A handler that passed the _enabled=true check could still be executing inside recordSample() when disableEngines() set _enabled=false and _jfr.stop() freed JFR buffers: use-after-free → SIGSEGV.

Fix: add an _inflight counter (incremented on handler entry, decremented on all exits). CTimer::stop() calls drainInflight() after deleting per-thread timers, spinning until _inflight==0 before returning to the caller that proceeds to _jfr.stop(). Any handler that fires after disableEngines() sees _enabled=false and returns early without touching JFR.

Race 2, start() side (enableEngines() before _jfr.start()):
enableEngines() set _enabled=true before _jfr.start() had completed. A SIGPROF in that window would see _enabled=true and call recordSample() on partially-initialized JFR structures.

Fix: move enableEngines() to after _jfr.start() and _cpu_engine->start() have both returned successfully (just before _state.store(RUNNING)).

Validated empirically: a controlled reproducer in DataDog/profiling-backend (AnalysisEndpointTest.testResourceExhausted with a 60 s recording period) showed ~60% failure rate without the fix (SIGSEGV / hang), 0% with both fixes applied (12/12 iterations clean). Each fix alone only partially addressed the failures, confirming both races were independently active.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

datadog-prod-us1-3 · 2026-06-23T13:32:57Z

dd-octo-sts · 2026-06-23T13:45:31Z

CI Test Results

Run: #28235552371 | Commit: cb4dca9 | Duration: 13m 19s (longest job)

❌ 5 of 32 test jobs failed

Status Overview

JDK	glibc-aarch64/debug	glibc-amd64/debug	musl-aarch64/debug	musl-amd64/debug
8	-	✅	-	-
8-ibm	-	✅	-	-
8-j9	✅	✅	-	-
8-librca	-	-	❌	✅
8-orcl	-	✅	-	-
11	-	✅	-	-
11-j9	✅	✅	-	-
11-librca	-	-	❌	✅
17	✅	✅	-	-
17-graal	✅	✅	-	-
17-j9	✅	✅	-	-
17-librca	-	-	❌	✅
21	✅	✅	-	-
21-graal	✅	✅	-	-
21-librca	-	-	❌	✅
25	✅	✅	-	-
25-graal	✅	✅	-	-
25-librca	-	-	❌	✅

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled