Fix jemalloc verifySetup on builds without background_thread mallctl by sayondeep · Pull Request #104330 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix jemalloc verifySetup on builds without background_thread mallctl#104330

Merged
alexey-milovidov merged 14 commits into
ClickHouse:masterfrom
sayondeep:fix-jemalloc-verify-setup-darwin
May 17, 2026
Merged

Fix jemalloc verifySetup on builds without background_thread mallctl#104330
alexey-milovidov merged 14 commits into
ClickHouse:masterfrom
sayondeep:fix-jemalloc-verify-setup-darwin

Conversation

@sayondeep

@sayondeep sayondeep commented May 8, 2026

Copy link
Copy Markdown
Contributor

Motivation

On macOS (and any jemalloc build without JEMALLOC_BACKGROUND_THREAD), je_mallctl for background_thread and max_background_threads returns ENOENT. Jemalloc::setup already no-ops for those writes, but Jemalloc::verifySetup still read via getValue and compared the result to ServerSettings, using uninitialized memory when the read failed. That produced a false jemalloc_enable_background_threads mismatch and aborted debug builds (see #102183).

Changes

  • Add tryGetValue and use it in verifySetup so background-thread mallctls are only checked when je_mallctl succeeds.
  • Zero-initialize the local in getValue when mallctl fails.

Changelog category:

  • Bug Fix

Changelog entry:

Fix a fatal logical error at server startup on macOS (and similar jemalloc builds) where Jemalloc::verifySetup incorrectly reported a jemalloc_enable_background_threads mismatch because optional background_thread / max_background_threads mallctls are absent; verification is skipped when those mallctls are unavailable, and getValue no longer leaves the output uninitialized on failure. Closes #102183.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

@CLAassistant

CLAassistant commented May 8, 2026

Copy link
Copy Markdown

@Algunenano Algunenano added the can be tested Allows running workflows for external contributors label May 8, 2026
@clickhouse-gh

clickhouse-gh Bot commented May 8, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [8196609]

Summary:


AI Review

Summary

This PR hardens jemalloc handling on builds where optional mallctls are unavailable by introducing fallible reads (tryGetValue), making strict reads assert in debug mode, migrating optional call sites away from strict reads, and adding a regression test for verifySetup on missing background_thread mallctls. I reviewed the current diff and all existing discussion threads; previously raised correctness concerns appear addressed, and I did not find new live correctness/performance/safety issues.

Final Verdict

Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label May 8, 2026
@antonio2368 antonio2368 self-assigned this May 8, 2026
Comment thread src/Common/Jemalloc.h Outdated
sayondeep added a commit to sayondeep/ClickHouse that referenced this pull request May 9, 2026
Per review (ClickHouse#104330): collapse the je_mallctl boilerplate by implementing
getValue on top of tryGetValue, and chassert that the read succeeded.
This makes the contracts explicit:

  - tryGetValue is the fallible API for optional mallctls.
  - getValue means "this mallctl must exist; failing is a programming
    error" and trips chassert in debug if violated.

Drop the now-incompatible "GetValueReturnsZeroOnUnknownMallctl" gtest
since calling getValue on a missing mallctl is no longer defined.
The remaining tests still cover the regression from ClickHouse#102183.
sayondeep added a commit to sayondeep/ClickHouse that referenced this pull request May 9, 2026
Previously implemented Jemalloc::getValue on top of tryGetValue and
chasserted that the read succeeded. CI for PR ClickHouse#104330 showed that this
contract aborts the server at startup: AsynchronousMetrics::update reads
jemalloc stats via getValue, and on builds where any individual mallctl
is unavailable (e.g. background-thread mallctls on macOS, or any future
optional name) the chassert fires inside saveJemallocMetricImpl and
debug builds abort with signal 6.

Switch back to a non-fatal contract that better matches the call sites:

  - getValue returns a zero-initialized value when je_mallctl fails.
    AsynchronousMetrics will report 0 for an unreadable metric instead
    of aborting the entire server.
  - tryGetValue stays the explicit API for callers that must distinguish
    "missing" from "present" (verifySetup already uses it).
  - MibCache::getValue zero-initializes its local for the same reason,
    so a failing je_mallctlbymib no longer returns uninitialized memory.

Restore the gtest pinning the zero-default contract for getValue, which
was dropped in 2da2008 when that contract was briefly undefined.

Closes ClickHouse#102183.
Comment thread src/Common/Jemalloc.h Outdated
size_t value_size = sizeof(T);
je_mallctl(name, &value, &value_size, nullptr, 0);
T value{};
tryGetValue(name, value);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
tryGetValue(name, value);
[[maybe_unused]] auto success = tryGetValue(name, value);
chassert(success, fmt::format("Failed to get jemalloc value for '{}'", name));

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried adding this in previous commit (here).

It aborts the server because callers like AsynchronousMetrics::saveJemallocMetricImpl can hit optional/build-dependent mallctls.

CI reference:
Job
Report

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They should be converted to tryGetValue and not add value to async metrics if they don't exist. Otherwise we are pushing random values there.

sayondeep added 6 commits May 11, 2026 17:24
Jemalloc verifySetup compared getValue for background_thread and
max_background_threads against server settings. On Darwin jemalloc is
built without JEMALLOC_BACKGROUND_THREAD, so je_mallctl returns ENOENT
and the old code read uninitialized memory, tripping chassert in debug.

Skip those checks when the mallctl is unavailable; zero-initialize getValue
on failure.

Closes ClickHouse#102183.
…thread

Covers the regression from issue ClickHouse#102183: on builds without
JEMALLOC_BACKGROUND_THREAD (e.g. macOS), je_mallctl for "background_thread"
returns ENOENT. The new gtest exercises three guarantees from the fix:

  - tryGetValue reports failure for an unknown mallctl and does not
    write through the output pointer.
  - getValue returns a zero-initialized value when the mallctl is missing.
  - verifySetup does not abort or warn when the only potential mismatch
    comes from optional background-thread mallctls that are absent on
    the running jemalloc build.
Per review (ClickHouse#104330): collapse the je_mallctl boilerplate by implementing
getValue on top of tryGetValue, and chassert that the read succeeded.
This makes the contracts explicit:

  - tryGetValue is the fallible API for optional mallctls.
  - getValue means "this mallctl must exist; failing is a programming
    error" and trips chassert in debug if violated.

Drop the now-incompatible "GetValueReturnsZeroOnUnknownMallctl" gtest
since calling getValue on a missing mallctl is no longer defined.
The remaining tests still cover the regression from ClickHouse#102183.
Previously implemented Jemalloc::getValue on top of tryGetValue and
chasserted that the read succeeded. CI for PR ClickHouse#104330 showed that this
contract aborts the server at startup: AsynchronousMetrics::update reads
jemalloc stats via getValue, and on builds where any individual mallctl
is unavailable (e.g. background-thread mallctls on macOS, or any future
optional name) the chassert fires inside saveJemallocMetricImpl and
debug builds abort with signal 6.

Switch back to a non-fatal contract that better matches the call sites:

  - getValue returns a zero-initialized value when je_mallctl fails.
    AsynchronousMetrics will report 0 for an unreadable metric instead
    of aborting the entire server.
  - tryGetValue stays the explicit API for callers that must distinguish
    "missing" from "present" (verifySetup already uses it).
  - MibCache::getValue zero-initializes its local for the same reason,
    so a failing je_mallctlbymib no longer returns uninitialized memory.

Restore the gtest pinning the zero-default contract for getValue, which
was dropped in 2da2008 when that contract was briefly undefined.

Closes ClickHouse#102183.
saveJemallocMetricImpl now uses Jemalloc::tryGetValue and only inserts
into AsynchronousMetricValues when je_mallctl succeeds. Optional or
build-dependent stats (e.g. background_thread.* on macOS) no longer
appear as misleading zeros.

jemalloc.epoch is only published when epoch can be read without forcing
an update (same tryGetValue rule).

MergeTree arena derived bytes metrics are emitted only when both
pactive and pdirty reads succeed.
chassert when je_mallctl fails, so absent mallctl names are only used where
call sites explicitly tolerate failure.

Use tryGetValue for best-effort reads that must not abort debug builds:
MemoryTracker (prof.active, opt.prof_prefix when flushing profiles),
MemoryWorker (arenas.narenas before per-arena dirty_decay_ms), and
KeeperJemallocStatusHandler (opt.prof, prof.active, prof.lg_sample for JSON).
Remove the try/catch around the old getValue path there because getValue
does not throw.

AsynchronousMetrics jemalloc stats already used tryGetValue.

Drop the gtest that expected getValue to return zero for a bogus mallctl;
that conflicts with the strict getValue contract.
@sayondeep sayondeep force-pushed the fix-jemalloc-verify-setup-darwin branch from 02cb141 to 758c206 Compare May 11, 2026 13:31
Comment thread src/Server/KeeperJemallocHandler.cpp Outdated
sayondeep added 2 commits May 11, 2026 19:14
…alue,

so a failed tryGetValue still produced an engaged optional with value-
initialized data (false/0) and never appended to errors.

On failure, return std::nullopt and add the mallctl name to the JSON errors
array so /jemalloc/status does not report fake profiling state and clients
can see which mallctls failed.
Comment thread src/Common/Jemalloc.h Outdated
sayondeep added 2 commits May 12, 2026 14:09
MibCache::getValue ignored the je_mallctlbymib return code, so after the
previous value{} zero-initialization a failed MIB read silently returned
0/false (e.g. /jemalloc/status reporting thread_active_init=false on
builds without prof support).
Track the MIB validity from je_mallctlnametomib at construction, add a
fallible MibCache::tryGetValue, and make MibCache::getValue chassert on
read failure to match the free-function Jemalloc::getValue contract.
KeeperJemallocStatusHandler switches to tryGetValue for prof.thread_active_init
so an absent mallctl is surfaced via the JSON "errors" array instead of
asserting in debug.
Comment thread src/Server/KeeperJemallocHandler.cpp Outdated
Comment thread src/Common/Jemalloc.cpp
`prof.lg_sample`; reading them through the strict
`MibCache::getValue` / `Jemalloc::getValue` introduced earlier in
PR ClickHouse#104330 chasserts during debug-build startup before the existing
`background_thread` guards are reached.

Match the `background_thread` pattern: read with `tryGetValue` and only
compare against the expected value when the mallctl exists. setup() is
already symmetric (the corresponding writes are no-ops when the
mallctls are absent), so skipping verification is the correct behavior
rather than a workaround.
Comment thread src/Common/Jemalloc.h
Comment thread src/Common/tests/gtest_jemalloc_verify_setup.cpp Outdated
antonio2368 added a commit that referenced this pull request May 14, 2026
Expose `FailPointInjection::hasAnyFailPointBeenRegistered` so hot paths can reuse the existing libfiu relaxed-atomic guard before doing local failpoint-specific checks.

Use it in `QueryMetricLog::finishQuery` to skip the test-only query-id prefix comparison when no failpoint has ever been enabled in the process.

CI: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104330&sha=d53d6108634f5740d1ff713e32213dbec4267213&name_0=PR&name_1=Stateless%20tests%20%28amd_tsan%2C%20parallel%2C%201%2F2%29
PR: #104330
pull Bot pushed a commit to sonvt1710/ClickHouse that referenced this pull request May 14, 2026
A late periodic `QueryMetricLog::collectMetric` can update `last_collect_time` past the `BlockIO::onFinish` finish timestamp before `QueryMetricLog::finishQuery` emits the final row. Do not apply periodic timestamp deduplication to final rows.

Update `03203_system_query_metric_log` to check that a `system.query_metric_log` row exists at the `system.query_log` `QueryFinish` timestamp, instead of requiring that timestamp to be the maximum metric timestamp. Add a failpoint regression test for the race.

CI: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104330&sha=d53d6108634f5740d1ff713e32213dbec4267213&name_0=PR&name_1=Stateless%20tests%20%28amd_tsan%2C%20parallel%2C%201%2F2%29
PR: ClickHouse#104330
@antonio2368

Copy link
Copy Markdown
Member

@sayondeep can you address the last 2 comments from reviewer bot?

sayondeep added 2 commits May 15, 2026 17:13
…of.*

The strict MibCache::getValue / Jemalloc::getValue contract introduced
earlier in PR ClickHouse#104330 leaves several optional `prof.*` reads aborting
debug builds on jemalloc builds without JEMALLOC_PROF:
- `setProfileSamplingRate` reads `prof.lg_sample` (reachable from
  non-default `jemalloc_profiler_sampling_rate` config in `setup`).
- `ThreadStatus::detachFromGroup` reads `prof.thread_active_init`
  (reachable from query-level `jemalloc_enable_profiler=1`).
- The gtest `VerifySetupNoAbortWhenBackgroundThreadMallctlMissing`
  unconditionally reads `prof.thread_active_init`, so the test itself
  asserts on the exact build profile it is meant to cover.
Route all three through tryGetValue: treat a missing mallctl as a
no-op (the matching writes in `setup` are already no-ops), and let the
test fall back to `default_enable_global_profiler` when prof is absent.
@alexey-milovidov

Copy link
Copy Markdown
Member

This was fixed by #105146. Let's update the branch.

@clickhouse-gh

clickhouse-gh Bot commented May 17, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.20% 84.20% +0.00%
Functions 91.40% 91.40% +0.00%
Branches 76.60% 76.70% +0.10%

Changed lines: 78.15% (118/151) | lost baseline coverage: 3 line(s) · Uncovered code

Full report · Diff report

@alexey-milovidov alexey-milovidov added this pull request to the merge queue May 17, 2026
Merged via the queue into ClickHouse:master with commit 4e69101 May 17, 2026
167 checks passed
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-synced-to-cloud The PR is synced to the cloud repo label May 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Clickhouse Debug build failing on MAC OS 26.3.1

6 participants