Dynamically adjust memory limit based on system free/cached memory #104964
`MemoryWorker` now periodically reads `MemFree + Cached` from `/proc/meminfo` and updates the global hard memory limit to `(tracked + free + cached) * max_server_memory_usage_to_ram_ratio`. This shrinks ClickHouse's effective limit when other processes on the host consume memory, reducing the risk of OOM-killing. The ratio is the existing `max_server_memory_usage_to_ram_ratio` server setting (default 0.9 in Server and Local); setting it to 0 disables the dynamic adjustment. Keeper does not expose the ratio, so dynamic adjustment is off there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The dynamic hard-limit adjustment added in the previous commit read `MemFree + Cached` from `/proc/meminfo`. That source describes the **host**, not the cgroup, so when ClickHouse runs in a small cgroup on a big host (8 GiB cgroup on a 192 GiB host) the formula `(tracked + free + cached) * ratio` inflates the global hard limit to ~145 GiB on every tick, overriding the cgroup-derived 7.2 GiB that `Server.cpp` set at startup. `MemoryTracker` then never throws `MEMORY_LIMIT_EXCEEDED` and the kernel OOM-killer fires at 8 GiB RSS, killing the server.
The same `setHardLimit` call also overrode an explicit `max_server_memory_usage` from the config, since the dynamic adjustment ran every 50 ms regardless of the value `Server.cpp` had just installed.
Two changes:
* When `MemoryWorker` has a cgroup reader, the dynamic adjustment reads the cgroup's `memory.max` paired with the cgroup-aware usage from `cgroups_reader->readMemoryUsage` -- the same sources `AsynchronousMetrics` reports as `CGroupMemoryTotal` and `CGroupMemoryUsed`. The available-memory term becomes `memory.max - cgroup_used`. The `/proc/meminfo` path is kept as a fallback for the no-cgroup case.
* `Server.cpp` and `LocalServer.cpp` now hand the configured `max_server_memory_usage` to `MemoryWorker` via the new `setExternalHardLimit` setter. The dynamic adjustment caps its computed value at this ceiling, so it can only shrink the budget below the user's setting, never raise it above. The setter is re-called from the config-reload callback whenever the setting changes. Before it is called for the first time, the dynamic adjustment is suppressed, so the worker cannot inflate the limit during the brief window between `MemoryWorker::start` and the first reload.
Reproduced and verified with an 8 GiB systemd-run scope on a 192 GiB host: before the fix, five memory-heavy queries all triggered the cgroup OOM-killer; after the fix, all five are caught cleanly by `MemoryTracker` and the server stays alive across the battery. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
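The cgroup-aware computation and the ceiling cap described in this commit can be sketched as follows. This is a minimal illustration, not the code from the PR: the function name `cgroupAwareLimitSketch` and the use of `std::optional` to model "the ceiling has not been handed over yet" are assumptions made for the example.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>

// Hypothetical helper for illustration; not the ClickHouse implementation.
// `ceiling` is std::nullopt until the configured limit has been handed over,
// which models "the adjustment is suppressed before the first setter call".
std::optional<uint64_t> cgroupAwareLimitSketch(
    uint64_t tracked,
    uint64_t cgroup_max,
    uint64_t cgroup_used,
    double ratio,
    std::optional<uint64_t> ceiling)
{
    if (!ceiling || ratio <= 0.0)
        return std::nullopt;

    // Available memory is what the cgroup still allows; it never goes negative.
    const uint64_t available = cgroup_max > cgroup_used ? cgroup_max - cgroup_used : 0;
    const auto computed = static_cast<uint64_t>(static_cast<double>(tracked + available) * ratio);

    // The ceiling can only shrink the result, never raise it.
    return std::min(computed, *ceiling);
}
```

With a finite ceiling the result is always `min(computed, ceiling)`, so a large host can no longer push the limit above what the configuration allows.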
The previous commits used `total_memory_tracker.get()` as the "memory we already own" term of the `(used + available) * ratio` formula. That value counts only allocations that went through `Allocator` -- jemalloc-internal fragmentation, mmap'd pages, page cache, and any untracked allocation are excluded. Under load, the tracker can be orders of magnitude smaller than the actual RSS, which makes the formula compute a new hard limit barely above `tracked` while RSS is far higher -- pinning the limit at (or below) current RSS and rejecting every subsequent allocation.
Reproduced on `Stateless tests (amd_asan_ubsan, distributed plan, parallel, 2/2)` (https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104964&sha=ced8215d96e5342ac0619ba94e7582618c0b60ca&name_0=PR&name_1=Stateless%20tests%20%28amd_asan_ubsan%2C%20distributed%20plan%2C%20parallel%2C%202%2F2%29): 168 tests failed with `Connection reset by peer` after the lead query got `(total) memory limit exceeded: would use 780.35 MiB (attempt to allocate chunk of 1.01 MiB), current RSS: 28.67 GiB, maximum: 28.67 GiB`. The hard limit had been driven to 29 GiB by `(tracked: 780 MiB + available: 31 GiB) * 0.9 = 28.6 GiB` while actual RSS was 28.67 GiB.
PR: #104964
Three changes:
* Use `resident` (the jemalloc/cgroup RSS already computed at the top of the worker tick) as the baseline. `(resident + available) * ratio` always exceeds `resident` when `available > resident * (1/ratio - 1)` and tracks actual memory pressure faithfully.
* Refuse to apply a new hard limit that lies at or below `resident`, even after the ceiling clamp. Shrinking under RSS cannot succeed (the server has no way to release memory instantly) and would only break in-flight queries -- the purpose of the adjustment is to leave room for *other* processes, not to throttle our own work.
* Roll the previous `setExternalHardLimit(ceiling)` setter into `setDynamicHardLimitSettings(ceiling, ratio)`, so a config-reload change to `max_server_memory_usage_to_ram_ratio` takes effect on the next worker tick. The previous design captured the ratio once in the constructor; subsequent reloads updated `total_memory_tracker`'s hard limit but left the worker's formula running with the stale ratio.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
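To make the arithmetic from the failing CI run concrete, here is a small sketch that reproduces it and shows the "refuse a limit at or below `resident`" rule. The helper names are hypothetical, and the figures are the rounded values quoted in the commit message rather than exact measurements.

```cpp
#include <cstdint>
#include <optional>

constexpr uint64_t MiB = 1ULL << 20;
constexpr uint64_t GiB = 1ULL << 30;

// (baseline + available) * ratio, truncated to whole bytes.
uint64_t applyRatioSketch(uint64_t baseline, uint64_t available, double ratio)
{
    return static_cast<uint64_t>(static_cast<double>(baseline + available) * ratio);
}

// Hypothetical: returns the limit to apply, or std::nullopt when the candidate
// lies at or below resident memory and therefore must not be applied.
std::optional<uint64_t> limitFromResidentSketch(uint64_t resident, uint64_t available, double ratio)
{
    const uint64_t candidate = applyRatioSketch(resident, available, ratio);
    if (candidate <= resident)
        return std::nullopt;
    return candidate;
}
```

With the tracked baseline, `applyRatioSketch(780 * MiB, 31 * GiB, 0.9)` lands just under the reported 28.67 GiB RSS, which is the failure described above. With `resident` as the baseline, the candidate only falls at or below `resident` when `available <= resident * (1/ratio - 1)`, and in that case the sketch declines to apply it.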
…flags The `meminfo_warnings_printed` and `cgroup_memory_max_warnings_printed` fields in `MemoryWorker` are only read or assigned from `#if defined(OS_LINUX)` blocks, so on Darwin and FreeBSD they triggered `-Werror,-Wunused-private-field`. Mark them `[[maybe_unused]]` so the non-Linux builds compile cleanly without changing the Linux code paths. Build failure reports: - https://s3.amazonaws.com/clickhouse-test-reports/PRs/104964/f33774ad1cb4166380827bb0b1dabf606a9449e3/build_amd_darwin/build_clickhouse/build_clickhouse.log - https://s3.amazonaws.com/clickhouse-test-reports/PRs/104964/f33774ad1cb4166380827bb0b1dabf606a9449e3/build_arm_darwin/build_clickhouse/build_clickhouse.log - https://s3.amazonaws.com/clickhouse-test-reports/PRs/104964/f33774ad1cb4166380827bb0b1dabf606a9449e3/build_amd_freebsd/build_clickhouse/build_clickhouse.log PR: #104964 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…er` dynamic limit
Two correctness issues flagged by the AI review on #104964:
1. In cgroup v2, when the leaf cgroup has `memory.max = max` but an ancestor cgroup has a finite limit, the previous code parsed only the leaf's `memory.max`, fell back to `/proc/meminfo`, and computed `available` from host-scoped memory. That could inflate the dynamic hard limit above the effective cgroup budget. Fix: walk all ancestors up to the cgroup mount root, open every `memory.max`, and on each tick take the minimum finite value. Only fall back to `/proc/meminfo` when no ancestor has a finite limit (i.e. the cgroup truly imposes no memory cap).
2. Reload race: a worker tick that read a positive `ratio` could compute `new_hard_limit` and write it via `setHardLimit` after a concurrent reload set the ratio to `0` (and reapplied its own hard limit from `total_memory_tracker.setHardLimit(max_server_memory_usage)`), leaving a stale positive limit. Fix: add a `settings_generation` counter bumped by `setDynamicHardLimitSettings` after writing the new ratio/ceiling. The worker captures the generation when it starts a dynamic update tick and re-checks just before `setHardLimit`, skipping the write if a reload happened in flight.
PR: #104964
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
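The "minimum finite `memory.max` over all ancestors" rule from the first fix could look roughly like this. It is a sketch under stated assumptions: the function names are hypothetical, the input is the already-read text of each level's `memory.max`, and unparseable text is lumped together with "no limit", which a real implementation would want to distinguish.

```cpp
#include <algorithm>
#include <charconv>
#include <cstdint>
#include <optional>
#include <string_view>
#include <system_error>
#include <vector>

// Parses one memory.max value. "max" means no limit at this level.
// Assumption: unparseable text is also reported as "no limit" here.
std::optional<uint64_t> parseMemoryMaxSketch(std::string_view text)
{
    while (!text.empty() && (text.back() == '\n' || text.back() == ' '))
        text.remove_suffix(1);
    if (text == "max")
        return std::nullopt;

    uint64_t value = 0;
    const char * end = text.data() + text.size();
    auto [ptr, ec] = std::from_chars(text.data(), end, value);
    if (ec != std::errc{} || ptr != end)
        return std::nullopt;
    return value;
}

// Returns the tightest finite limit among all levels, or std::nullopt when
// no level imposes one (the only case where a host-wide fallback is safe).
std::optional<uint64_t> effectiveCgroupLimitSketch(const std::vector<std::string_view> & levels)
{
    std::optional<uint64_t> result;
    for (std::string_view text : levels)
        if (auto value = parseMemoryMaxSketch(text))
            result = result ? std::min(*result, *value) : *value;
    return result;
}
```

For a leaf reporting `max` under an ancestor reporting `8589934592`, the sketch returns the ancestor's 8 GiB, which is the case the previous code missed.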
azat
left a comment
This looks questionable to me, do we really want to take care of this?
Maybe it is better to configure maximum allowed memory usage for server properly?
Isn't it the job of cgroups? (i.e. run clickhouse-server in a separate cgroup)
Or, instead of configuring a hard limit, should we release the memory we can under memory pressure, e.g. jemalloc dirty pages, the marks cache, etc.?
Many ClickHouse users run it on machines with other stuff, like their web servers, PHP, Perl, etc.
Three feedback items on `MemoryWorker`'s dynamic limit:
* Use `MemAvailable` from `/proc/meminfo` instead of `MemFree + Cached`.
`MemAvailable` is the kernel's own estimate of memory available for
new allocations; it already accounts for the *reclaimable* portion of
the page cache and slab, so we don't have to add `Cached` ourselves
and we don't claim memory the kernel considers pinned (dirty pages,
low watermarks, mlocked).
* Handle the cgroup-v2 `memory.max` value `"max"` explicitly. The
previous code relied on `tryReadIntText` returning false for the
non-numeric string; this is the same effect but doesn't rely on
parse-failure semantics and is easier to read.
* Drop the redundant `> 0` check on a `uint64_t` value.
The dynamic-limit formula is now
`(resident + MemAvailable) * max_server_memory_usage_to_ram_ratio`
in the no-cgroup-limit fallback path, instead of
`(resident + MemFree + Cached) * ratio`.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
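For readers unfamiliar with the file format, a minimal way to extract `MemAvailable` from `/proc/meminfo`-style text is sketched below. The function name is hypothetical and the parser is deliberately simple; it is not the lazy reader added in this PR. It relies on the kernel reporting the value in `kB` units (multiples of 1024 bytes).

```cpp
#include <cstdint>
#include <istream>
#include <optional>
#include <sstream>
#include <string>

// Hypothetical parser: scans "Key:   value kB" lines and returns MemAvailable
// in bytes, or std::nullopt when the field is absent.
std::optional<uint64_t> parseMemAvailableBytesSketch(std::istream & in)
{
    std::string line;
    while (std::getline(in, line))
    {
        std::istringstream fields(line);
        std::string key;
        std::string unit;
        uint64_t value = 0;
        if (fields >> key >> value && key == "MemAvailable:")
        {
            fields >> unit;
            return unit == "kB" ? value * 1024 : value;
        }
    }
    return std::nullopt;
}
```

On a real system the stream would be an `std::ifstream` opened on `/proc/meminfo`; returning `std::nullopt` when the field is missing keeps "could not read" separate from "zero bytes available".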
A huge amount of memory can be eaten by the OS for sockets.
This was fixed by #105146. Let's update the branch.
…p instead of skip
Three follow-up fixes for `MemoryWorker`'s dynamic hard-limit logic, based on
re-review comments from `clickhouse-gh`.
Reload race in `setHardLimit` (TOCTOU between generation check and apply).
The previous design captured `settings_generation` at the start of the worker
tick and re-checked it just before `setHardLimit`. But the actual apply was
not under any lock, and `Server.cpp` performed its own `setHardLimit` outside
the worker's view, so a worker tick that observed the old generation could
still call `setHardLimit` *after* a reload had installed its own value,
overwriting it with a stale number computed against the old ratio. With
`max_server_memory_usage_to_ram_ratio = 0`, subsequent ticks skip the
adjustment and the stale limit could persist indefinitely.
Fixed by:
* Adding `dynamic_hard_limit_apply_mutex` that both `setDynamicHardLimitSettings`
and the worker's apply step take. The mutex serializes the two writers, so
the worker's "re-check generation then `setHardLimit`" is now atomic with
respect to a reload's "store new settings then `setHardLimit`".
* Moving the `total_memory_tracker.setHardLimit(max_server_memory_usage)`
call from `Server.cpp` / `LocalServer.cpp` into `setDynamicHardLimitSettings`
so the hard limit is installed under the same mutex.
Cgroup v1 "no limit" sentinel.
`memory.limit_in_bytes` uses a huge numeric sentinel (`PAGE_COUNTER_MAX`,
~2^63) to mean "no limit". The previous code treated any numeric `> 0` as
finite, so the sentinel made `any_finite == true` and the dynamic limit
stayed pinned to the startup ceiling on v1-unlimited hosts instead of
falling back to `/proc/meminfo`. Captured host RAM at construction (raw
`sysconf(_SC_PHYS_PAGES) * _SC_PAGESIZE`, not the cgroup-aware
`getMemoryAmount`) and skip any limit `>= host_memory_bytes`, the same way
the v2 `"max"` token is skipped.
Skip-when-pressure path actually leaves us with the old, larger limit.
Under high memory pressure the formula can produce `new_hard_limit <= used`.
The previous code skipped the tick to avoid rejecting in-flight allocations
— but that left the previous (often much larger) limit in place, breaking
the contract of shrinking the budget as available memory drops. Replaced
the skip with a clamp: `new_hard_limit = max(formula, used + 64 MiB)`,
still capped at `ceiling`, so the shrink takes effect with a small headroom
for in-flight queries instead of being thrown away.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
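The clamp that replaces the skip can be written in a few lines. This is a sketch only: the function name is hypothetical, and treating `ceiling == 0` as "no ceiling" is an assumption made for the example.

```cpp
#include <algorithm>
#include <cstdint>

// Headroom kept above current usage so in-flight queries are not rejected.
constexpr uint64_t safety_margin = 64ULL << 20; // 64 MiB

// Hypothetical helper: new_hard_limit = max(formula, used + 64 MiB), capped at ceiling.
// Assumption: ceiling == 0 means "no ceiling configured".
uint64_t clampDynamicLimitSketch(uint64_t formula, uint64_t used, uint64_t ceiling)
{
    uint64_t limit = std::max(formula, used + safety_margin);
    if (ceiling != 0)
        limit = std::min(limit, ceiling);
    return limit;
}
```

Under pressure, when `formula` drops below `used`, the result becomes `used + 64 MiB` instead of leaving the older, larger limit in place.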
Address the unresolved review thread on `src/Common/MemoryWorker.cpp:727`: returning `0` from `readAvailableForDynamicLimit` was overloaded — it meant both "we could not read the metric" (legitimate to skip this tick) and "there is genuinely no memory available" (the highest-pressure case, where the dynamic limit must still shrink). The previous `if (available)` guard collapsed those into a single "skip" branch, so under maximum pressure the worker would leave the previous (often larger) hard limit in place — exactly when ClickHouse should be tightening its budget the most. Switch `readAvailableForDynamicLimit` and `readSystemAvailableMemory` to `std::optional<uint64_t>` so `std::nullopt` signals "no source could be read" and a returned `0` signals "no memory available right now". The caller now treats `0` as valid input: the existing clamp-to-`used + safety_margin` path will shrink the hard limit at the high-pressure boundary instead of skipping the adjustment.
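The difference between the overloaded `0` and the `std::optional` return value is easiest to see side by side. Both functions below are hypothetical illustrations of the caller's decision, not code from `MemoryWorker`.

```cpp
#include <cstdint>
#include <optional>

enum class TickAction { SkipTick, Adjust };

// Old shape: 0 meant both "could not read" and "nothing available",
// so the highest-pressure case was skipped.
TickAction legacyActionSketch(uint64_t available)
{
    return available ? TickAction::Adjust : TickAction::SkipTick;
}

// New shape: only a missing value skips the tick; 0 is valid input.
TickAction actionSketch(std::optional<uint64_t> available)
{
    return available.has_value() ? TickAction::Adjust : TickAction::SkipTick;
}
```

With the optional, a reading of zero available bytes still reaches the clamp path, so the limit shrinks at the moment it matters most.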
Addresses the unresolved bot thread on `MemoryWorker.cpp:556`. When running in cgroup mode (`cgroups_reader && !cgroup_memory_max_bufs.empty()`) and every `memory.max` read throws, the old code fell through to host-wide `/proc/meminfo`. On a containerized deployment with a finite effective cgroup limit, that is fail-open: the host's `MemAvailable` can be far above the cgroup budget, and using it as headroom can let `total_memory_tracker` exceed the cgroup limit and trigger a cgroup OOM kill. Distinguish "cgroup has no finite limit" (safe to use host meminfo) from "failed to read cgroup limits" (skip this tick): track a `any_read_failure` flag in the read loop; when no limit was parsed finite and at least one read threw, return `std::nullopt` so the worker leaves the hard limit alone and retries on the next tick. The "all reads succeeded but every value was `max`/sentinel" case continues to fall through to host `MemAvailable` as before, since that is the genuinely-unbounded cgroup case.
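The three outcomes described in this commit (use the cgroup limit, fall back to host meminfo, or skip the tick) can be sketched as a small decision function. All type and function names here are hypothetical; only the `any_read_failure` flag mirrors the name used in the commit message.

```cpp
#include <algorithm>
#include <cstdint>
#include <optional>
#include <vector>

enum class ReadKind { Finite, Unbounded, Failed };
struct LevelRead
{
    ReadKind kind;
    uint64_t limit = 0; // meaningful only when kind == Finite
};

enum class Source { Cgroup, HostMeminfo, SkipTick };
struct SourceDecision
{
    Source source;
    uint64_t limit = 0; // meaningful only when source == Cgroup
};

// Hypothetical aggregation over all cgroup levels read during one tick.
SourceDecision chooseSourceSketch(const std::vector<LevelRead> & levels)
{
    std::optional<uint64_t> min_finite;
    bool any_read_failure = false;
    for (const auto & level : levels)
    {
        if (level.kind == ReadKind::Failed)
            any_read_failure = true;
        else if (level.kind == ReadKind::Finite)
            min_finite = min_finite ? std::min(*min_finite, level.limit) : level.limit;
    }

    if (min_finite)
        return {Source::Cgroup, *min_finite};
    if (any_read_failure)
        return {Source::SkipTick}; // fail closed: never use host-wide numbers here
    return {Source::HostMeminfo};  // every level was genuinely unbounded
}
```

The host fallback is reached only when every read succeeded and none reported a finite limit, which is the fail-closed behavior the commit describes.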
…'t raise shrunk limit on reload, add unit tests
Three fixes for the dynamic `total_memory_tracker` hard-limit adjustment in `MemoryWorker`, addressing review feedback:
- Constructor-time cgroup open failures no longer drop a level permanently. `CgroupMemoryLevel` now retains the `memory.max`/`memory.current` paths, and `readAvailableForDynamicLimit` lazily (re)opens any level whose buffers are null and fails the whole tick closed on a persistent open/read failure. This prevents computing the headroom minimum from an incomplete ancestor set and overestimating the budget when the dropped ancestor was the tighter one.
- `setDynamicHardLimitSettings` no longer raises an already-shrunk dynamic limit back to the static ceiling on an unrelated config reload. When dynamic adjustment is enabled and the worker has previously shrunk the limit under memory pressure, the reload only lowers the limit (never raises it); the next worker tick recomputes from live headroom. `0` is treated as "unlimited" on both sides of the comparison.
- Factored the per-level decision into the pure helper `MemoryWorkerHelpers::decideCgroupLevelAvailability` and added focused unit tests covering finite `memory.max` with same-level usage, the `"max"` token, the cgroup v1 "no limit" sentinel, at/over-limit (zero headroom), and zero/unparseable values.
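A pure per-level decision of the kind described in the last item might look like the sketch below. It is not `decideCgroupLevelAvailability` itself: the names, the signature, and the use of `std::nullopt` for the `"max"` token are assumptions, and the zero/unparseable case is left out because the commit message does not state its outcome.

```cpp
#include <cstdint>
#include <optional>

enum class LevelKindSketch { Unbounded, Finite };

struct LevelAvailabilitySketch
{
    LevelKindSketch kind = LevelKindSketch::Unbounded;
    uint64_t available = 0;
};

// Hypothetical per-level decision.
// limit == std::nullopt models the cgroup v2 "max" token.
// A limit at or above host memory models the cgroup v1 "no limit" sentinel.
LevelAvailabilitySketch decideLevelSketch(
    std::optional<uint64_t> limit, uint64_t usage, uint64_t host_memory_bytes)
{
    if (!limit || *limit >= host_memory_bytes)
        return {};

    // At or over the limit there is zero headroom, but the level is still finite.
    const uint64_t headroom = *limit > usage ? *limit - usage : 0;
    return {LevelKindSketch::Finite, headroom};
}
```

Keeping the decision free of I/O is what makes the listed cases easy to cover with unit tests.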
`clang-tidy` (`cppcoreguidelines-pro-type-member-init`) flagged `MemoryWorkerHelpers::CgroupLevelAvailability::kind` as uninitialized by the implicit constructor. Give it a default member initializer of `CgroupLevelKind::Unbounded`, matching the sibling `available = 0`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
If a path operation in the cgroup v2 ancestor walk threw after `cgroups_reader` and `source` were assigned and some (but not all) levels were pushed, the outer `catch` only logged `Cannot use cgroups reader` and left the partial state in place. `readAvailableForDynamicLimit` gates the cgroup branch on `cgroups_reader && !cgroup_memory_levels.empty()`, so it would then compute the headroom minimum from an incomplete ancestor set and overestimate the real budget when the dropped ancestor was the tighter one. Reset `cgroups_reader` / `cgroup_memory_levels` and clear `source` in the `catch` so the worker fails closed and falls back to the jemalloc source and host-wide `/proc/meminfo`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`setDynamicHardLimitSettings` kept `min(current, ceiling)` on every reload after the first whenever `ratio > 0`, on the assumption that the worker will raise the limit again on a later tick. That assumption is false when the dynamic adjustment is a no-op: on non-Linux `readAvailableForDynamicLimit` always returns `std::nullopt`, and when `source == None` (e.g. no jemalloc, no cgroups) `start` runs no worker tick at all. In those cases raising `max_server_memory_usage` (e.g. from 8 GiB to 16 GiB) on reload would leave `total_memory_tracker` pinned at the old, smaller value indefinitely — whereas before this PR the direct `setHardLimit` applied the new config immediately. Only preserve a previously-shrunk value when the worker can actually recompute it: a live `source` on Linux. Otherwise apply the new `ceiling` directly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
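The reload rule after this fix can be summarized in one small function. This is a sketch under assumptions: the name is hypothetical, `worker_can_recompute` stands for "dynamic adjustment is enabled and a live `source` exists on Linux", and `0` is treated as "unlimited" on both sides as described in the earlier commit.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical: which hard limit to install on config reload.
// 0 means "unlimited" for both current and ceiling.
uint64_t limitAfterReloadSketch(uint64_t current, uint64_t ceiling, bool worker_can_recompute)
{
    // Nobody will raise the limit later, so apply the new config directly.
    if (!worker_can_recompute)
        return ceiling;

    // Preserve a previously shrunk value: only lower, never raise.
    if (current == 0)
        return ceiling;
    if (ceiling == 0)
        return current;
    return std::min(current, ceiling);
}
```

In the 8 GiB to 16 GiB example from the commit message, the limit becomes 16 GiB immediately when no worker tick can recompute it, and stays at 8 GiB until the next tick otherwise.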
Continued work on this PR:
Verified locally: The design question raised in review (whether the server should manage this vs. relying on cgroups / memory-pressure reclaim) is left for you to decide.
The cgroup branch of `MemoryWorker::readAvailableForDynamicLimit` computed per-level headroom as `memory.max - memory.current`. `memory.current` includes reclaimable page cache (`active_file` + `inactive_file`) and reclaimable slab (`slab_reclaimable`), which the kernel frees under pressure before invoking the OOM killer. Counting them as usage shrinks the headroom on a read-heavy server with a warm page cache (`memory.current` close to `memory.max`) and can make ClickHouse throw `MEMORY_LIMIT_EXCEEDED` even though most of that memory is reclaimable. Read each cgroup v2 level's `memory.stat` and subtract its reclaimable bytes from `memory.current`, so the cgroup headroom mirrors the host-wide `MemAvailable` path. `resident`, the formula's baseline, already excludes reclaimable memory, so on a dedicated cgroup `resident + available` converges to `memory.max`. The cgroup v1 path is unaffected: leaf usage from `cgroups_reader` already excludes reclaimable memory. Add a unit test for the new `reclaimableFromCgroupV2Stat` helper. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Merged latest Addressed the two open
CI: the only failures are perf-comparison "slower" queries, 7 of 12 clustered on the
…e limit adjustment Address review: a positive max_server_memory_usage_to_ram_ratio used to opt the server into runtime hard-limit shrinking with no way to keep the previous static cap, while setting the ratio to 0 removed the cap entirely. The new memory_worker_dynamic_hard_limit server setting (enabled by default) gates only the runtime adjustment: when disabled, max_server_memory_usage_to_ram_ratio caps the hard memory limit statically at startup and on configuration reload, exactly as in previous versions. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Pushed The previous
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered by tests: 256/311 (82.32%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 23 line(s)
alexey-milovidov
left a comment
Sounds very much okay.
This is a constant log spam:
2026.06.15 17:17:57.967192 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.97 GiB to 89.97 GiB (resident: 383.91 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.017390 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.97 GiB to 89.98 GiB (resident: 383.91 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.067590 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.91 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.117877 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.91 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.168146 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.91 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.268706 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.419694 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.469854 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.721332 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.771520 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.972527 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.022732 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.072957 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.123229 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.374414 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.675770 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.776425 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.826696 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.927067 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.977327 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:18:00.027615 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:18:00.077792 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.61 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:18:00.128096 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.61 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:18:00.278766 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.61 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:18:00.580050 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.61 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:18:00.780905 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.61 GiB, ceiling: 112.88 GiB, ratio: 0.9)
The background `MemoryWorker` recomputes the total memory hard limit on
every tick (every ~50ms) from `resident` and `MemAvailable`, both of
which jitter by a few MiB on each read. The previous guard re-applied and
logged the new limit whenever it differed from the current one by even a
single byte, so on an otherwise idle server the trace log was flooded:
<Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
repeated dozens of times per second, and `setHardLimit` was called every
tick to no practical effect.
Adjust only when the change exceeds 1% of the current limit. A genuine
memory-pressure shift moves the limit by far more than this, so the
dynamic adjustment still reacts promptly when it actually needs to.
Discussion: ClickHouse#104964 (comment)
Related: ClickHouse#104964
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
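The 1% guard amounts to a single comparison. The function name below is hypothetical, and the sketch assumes the threshold is measured against the current limit, as the commit message states.

```cpp
#include <cstdint>

// Hypothetical guard: apply (and log) a new limit only when it differs
// from the current one by more than 1% of the current limit.
bool shouldApplyLimitSketch(uint64_t current, uint64_t candidate)
{
    const uint64_t diff = current > candidate ? current - candidate : candidate - current;
    return diff > current / 100;
}
```

A jitter of a few MiB on a limit near 90 GiB, as in the log excerpt above, stays far below the roughly 0.9 GiB threshold and is ignored, while a shift of several GiB still goes through on the same tick.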

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
The background `MemoryWorker` now periodically updates the server's hard memory limit based on the current memory usage and the amount of memory the kernel reports as available, so ClickHouse leaves room for other processes running on the same host. The formula is `(resident memory + system MemAvailable) * max_server_memory_usage_to_ram_ratio`. The same `max_server_memory_usage_to_ram_ratio` server setting controls both the startup cap and the dynamic adjustment; set it to `0` to disable both. To keep only the static startup/reload cap (the behavior of previous versions), set the new server setting `memory_worker_dynamic_hard_limit` to `0`.
Documentation entry for user-facing changes
Motivation
The server's hard memory limit is currently computed once at startup (and on config reload) as `total_RAM * max_server_memory_usage_to_ram_ratio`. On hosts shared with other processes (sidecars, system services, other workloads), this can be too permissive: as the other processes grow, ClickHouse keeps using its full configured share until the host runs out of memory and the OOM killer fires.
This change makes `MemoryWorker` continuously recompute the limit from a more honest baseline (what we already own plus what the kernel reports as free or reclaimable) and apply the same ratio to leave headroom for the rest of the system.
Behavior
On each tick of `MemoryWorker` (default ~50ms on cgroups, ~100ms on jemalloc), in addition to the existing RSS/allocated bookkeeping, the worker reads `MemAvailable` from `/proc/meminfo` (or, when running in a cgroup with a finite `memory.max`, derives the per-cgroup equivalent) and computes `new_hard_limit = (resident + available) * max_server_memory_usage_to_ram_ratio`.
It then calls `total_memory_tracker.setHardLimit(new_hard_limit)`.
* As `MemAvailable` shrinks (or the cgroup's free budget shrinks), the dynamic limit follows, capping our growth before the host runs out of memory.
* `max_server_memory_usage_to_ram_ratio = 0` (no startup cap by ratio) also disables the dynamic adjustment.
* `memory_worker_dynamic_hard_limit` (enabled by default) gates only the runtime adjustment: when disabled, `max_server_memory_usage_to_ram_ratio` caps the hard memory limit statically at startup and on configuration reload, exactly as in previous versions.
* Keeper does not expose `max_server_memory_usage_to_ram_ratio`, so the adjustment is off there.
* On non-Linux platforms `/proc/meminfo` is unavailable, so the adjustment is a no-op.
Files
* `src/Common/MemoryWorker.{h,cpp}`: new `MemoryWorkerConfig::dynamic_hard_limit_ratio`; lazy `/proc/meminfo` reader for `MemAvailable`; cgroup-aware `memory.max` walk for cgroup v2; per-tick limit update with `LOG_TRACE` whenever the limit changes.
* `programs/server/Server.cpp`, `programs/local/LocalServer.cpp`: feed `max_server_memory_usage_to_ram_ratio` into the new field, gated by `memory_worker_dynamic_hard_limit`.
* `src/Core/ServerSettings.cpp`: new `memory_worker_dynamic_hard_limit` setting.
Note
High Risk
Changes how `total_memory_tracker` hard limits are applied and updated at runtime based on `/proc/meminfo`/cgroup headroom, which can affect query admission and memory-related stability under load; also introduces new concurrency/atomic logic around config reload vs worker ticks.
Overview
Adds a dynamic total-memory hard limit driven by `MemoryWorker`, recomputing the server ceiling each tick as `(resident + available) * max_server_memory_usage_to_ram_ratio` and applying it to `total_memory_tracker`.
Updates `Server.cpp` and `LocalServer.cpp` to pass the ratio into `MemoryWorker` and to apply `max_server_memory_usage` via the new `MemoryWorker::setDynamicHardLimitSettings` (removing direct `setHardLimit` calls) to avoid races with the worker.
Extends `MemoryWorker` with Linux-aware available-memory reading (including cgroup v2 ancestor `memory.max` min selection, v1 sentinel handling, and fail-close behavior) plus synchronization (`settings_generation` + mutex) so config reloads and worker adjustments cannot overwrite each other.
Reviewed by Cursor Bugbot for commit 9a207ef. Bugbot is set up for automated code reviews on this repo.
Version info
26.6.1.728