Dynamically adjust memory limit based on system free/cached memory by alexey-milovidov · Pull Request #104964 · ClickHouse/ClickHouse · GitHub


Merged: alexey-milovidov merged 27 commits into master from memory-worker-dynamic-limit on Jun 12, 2026

@alexey-milovidov commented May 14, 2026

Member

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

The background MemoryWorker now periodically updates the server's hard memory limit based on the current memory usage and the amount of memory the kernel reports as available, so ClickHouse leaves room for other processes running on the same host. The formula is (resident memory + system MemAvailable) * max_server_memory_usage_to_ram_ratio. The same max_server_memory_usage_to_ram_ratio server setting controls both the startup cap and the dynamic adjustment; set it to 0 to disable both. To keep only the static startup/reload cap (the behavior of previous versions), set the new server setting memory_worker_dynamic_hard_limit to 0.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Motivation

The server's hard memory limit is currently computed once at startup (and on config reload) as total_RAM * max_server_memory_usage_to_ram_ratio. On hosts shared with other processes (sidecars, system services, other workloads), this can be too permissive: as the other processes grow, ClickHouse keeps using its full configured share until the host runs out of memory and the kernel OOM killer starts killing processes.

This change makes MemoryWorker continuously recompute the limit from a more honest baseline — what we already own plus what the kernel reports as free or reclaimable — and apply the same ratio to leave headroom for the rest of the system.

Behavior

On each tick of MemoryWorker (default ~50ms on cgroups, ~100ms on jemalloc), in addition to the existing RSS/allocated bookkeeping, the worker reads MemAvailable from /proc/meminfo (or, when running in a cgroup with a finite memory.max, derives the per-cgroup equivalent) and computes:

new_hard_limit = (max(0, resident) + available) * max_server_memory_usage_to_ram_ratio

It then calls total_memory_tracker.setHardLimit(new_hard_limit).

  • On a dedicated host with no other significant memory consumers, this converges to roughly the value the existing startup logic would produce.
  • When other processes grow, MemAvailable shrinks (or the cgroup's free budget shrinks) and the dynamic limit follows, capping our growth before the host runs out of memory.
  • max_server_memory_usage_to_ram_ratio = 0 (no startup cap by ratio) also disables the dynamic adjustment.
  • The new server setting memory_worker_dynamic_hard_limit (enabled by default) gates only the runtime adjustment: when disabled, max_server_memory_usage_to_ram_ratio caps the hard memory limit statically at startup and on configuration reload, exactly as in previous versions.
  • Keeper does not expose max_server_memory_usage_to_ram_ratio, so the adjustment is off there.
  • On non-Linux systems /proc/meminfo is unavailable, so the adjustment is a no-op.
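
The formula above can be illustrated with a minimal Python sketch. It is not the C++ code from this PR: the function name `dynamic_hard_limit` and the choice to return `None` when the ratio is 0 are assumptions made for illustration, and the byte counts are made-up inputs.

```python
from typing import Optional

GIB = 1024 ** 3

def dynamic_hard_limit(resident: int, available: int, ratio: float) -> Optional[int]:
    """Return (max(0, resident) + available) * ratio in bytes.

    Returns None when ratio is 0, standing in for "dynamic adjustment
    disabled" (hypothetical convention for this sketch only).
    """
    if ratio <= 0:
        return None
    return int((max(0, resident) + available) * ratio)

# Dedicated host: most memory is either ours or still available.
quiet = dynamic_hard_limit(resident=2 * GIB, available=60 * GIB, ratio=0.9)

# Shared host: other processes grew, so the kernel reports less as available.
busy = dynamic_hard_limit(resident=2 * GIB, available=20 * GIB, ratio=0.9)

print(f"quiet host: {quiet / GIB:.2f} GiB, busy host: {busy / GIB:.2f} GiB")
```

With the same resident memory, the limit follows the available term down, which is the behavior the bullets above describe.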

Files

  • src/Common/MemoryWorker.{h,cpp} — new MemoryWorkerConfig::dynamic_hard_limit_ratio; lazy /proc/meminfo reader for MemAvailable; cgroup-aware memory.max walk for cgroup v2; per-tick limit update with LOG_TRACE whenever the limit changes.
  • programs/server/Server.cpp, programs/local/LocalServer.cpp — feed max_server_memory_usage_to_ram_ratio into the new field, gated by memory_worker_dynamic_hard_limit.
  • src/Core/ServerSettings.cpp — new memory_worker_dynamic_hard_limit setting.

Note

High Risk
Changes how total_memory_tracker hard limits are applied and updated at runtime based on /proc/meminfo/cgroup headroom, which can affect query admission and memory-related stability under load; also introduces new concurrency/atomic logic around config reload vs worker ticks.

Overview
Adds a dynamic total-memory hard limit driven by MemoryWorker, recomputing the server ceiling each tick as (resident + available) * max_server_memory_usage_to_ram_ratio and applying it to total_memory_tracker.

Updates Server.cpp and LocalServer.cpp to pass the ratio into MemoryWorker and to apply max_server_memory_usage via the new MemoryWorker::setDynamicHardLimitSettings (removing direct setHardLimit calls) to avoid races with the worker.

Extends MemoryWorker with Linux-aware available-memory reading (including cgroup v2 ancestor memory.max min selection, v1 sentinel handling, and fail-close behavior) plus synchronization (settings_generation + mutex) so config reloads and worker adjustments cannot overwrite each other.

Reviewed by Cursor Bugbot for commit 9a207ef.

Version info

  • Merged into: 26.6.1.728

`MemoryWorker` now periodically reads `MemFree + Cached` from
`/proc/meminfo` and updates the global hard memory limit to
`(tracked + free + cached) * max_server_memory_usage_to_ram_ratio`.
This shrinks ClickHouse's effective limit when other processes on
the host consume memory, reducing the risk of OOM-killing.

The ratio is the existing `max_server_memory_usage_to_ram_ratio`
server setting (default 0.9 in Server and Local); setting it to 0
disables the dynamic adjustment.

Keeper does not expose the ratio, so dynamic adjustment is off there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented May 14, 2026

Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-improvement Pull request with some product improvements label May 14, 2026
Comment thread src/Common/MemoryWorker.cpp
Comment thread src/Common/MemoryWorker.cpp
@alexey-milovidov alexey-milovidov added the memory When memory usage is higher than expected label May 14, 2026
alexey-milovidov and others added 3 commits May 14, 2026 17:10
The dynamic hard-limit adjustment added in the previous commit read
`MemFree + Cached` from `/proc/meminfo`. That source describes the
**host**, not the cgroup, so when ClickHouse runs in a small cgroup
on a big host (8 GiB cgroup on a 192 GiB host) the formula
`(tracked + free + cached) * ratio` inflates the global hard limit
to ~145 GiB on every tick, overriding the cgroup-derived 7.2 GiB
that `Server.cpp` set at startup. `MemoryTracker` then never throws
`MEMORY_LIMIT_EXCEEDED` and the kernel OOM-killer fires at 8 GiB
RSS, killing the server.

The same `setHardLimit` call also overrode an explicit
`max_server_memory_usage` from the config, since the dynamic
adjustment ran every 50 ms regardless of the value `Server.cpp`
had just installed.

Two changes:

* When `MemoryWorker` has a cgroup reader, the dynamic adjustment
  reads the cgroup's `memory.max` paired with the cgroup-aware
  usage from `cgroups_reader->readMemoryUsage` -- the same sources
  `AsynchronousMetrics` reports as `CGroupMemoryTotal` and
  `CGroupMemoryUsed`. The available-memory term becomes
  `memory.max - cgroup_used`. The `/proc/meminfo` path is kept as
  a fallback for the no-cgroup case.

* `Server.cpp` and `LocalServer.cpp` now hand the configured
  `max_server_memory_usage` to `MemoryWorker` via the new
  `setExternalHardLimit` setter. The dynamic adjustment caps its
  computed value at this ceiling, so it can only shrink the budget
  below the user's setting, never raise it above. The setter is
  re-called from the config-reload callback whenever the setting
  changes. Before it is called for the first time, the dynamic
  adjustment is suppressed, so the worker cannot inflate the limit
  during the brief window between `MemoryWorker::start` and the
  first reload.
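
The two changes can be sketched in Python as follows. This is an illustration under stated assumptions, not the PR's C++ code: `capped_dynamic_limit` is a hypothetical name, `None` stands in for "ceiling not set yet, adjustment suppressed", and the host-side figures (1 GiB tracked, 160 GiB free plus cached) are invented to land near the ~145 GiB mentioned above.

```python
from typing import Optional

GIB = 1024 ** 3

def capped_dynamic_limit(used: int, available: int, ratio: float,
                         ceiling: Optional[int]) -> Optional[int]:
    """Compute (used + available) * ratio, never exceeding the ceiling.

    ceiling=None models the window before the external ceiling was handed
    to the worker: the adjustment is suppressed and None is returned.
    """
    if ceiling is None:
        return None
    return min(int((used + available) * ratio), ceiling)

ratio = 0.9
cgroup_max = 8 * GIB
ceiling = int(cgroup_max * ratio)  # the cgroup-derived 7.2 GiB from startup

# Host-scoped inputs (illustrative): the uncapped formula lands near 145 GiB.
host_based = int((1 * GIB + 160 * GIB) * ratio)

# Cgroup-scoped inputs: available = memory.max - cgroup_used.
cgroup_used = 1 * GIB
cgroup_based = capped_dynamic_limit(cgroup_used, cgroup_max - cgroup_used, ratio, ceiling)

print(f"host-based: {host_based / GIB:.1f} GiB, cgroup-based: {cgroup_based / GIB:.1f} GiB")
```

Even if host-scoped numbers were fed in by mistake, the `min` against the ceiling keeps the result at or below the configured budget.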

Reproduced and verified with an 8 GiB systemd-run scope on a 192
GiB host: before the fix, five memory-heavy queries all triggered
the cgroup OOM-killer; after the fix, all five are caught cleanly
by `MemoryTracker` and the server stays alive across the battery.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous commits used `total_memory_tracker.get()` as the "memory we
already own" term of the `(used + available) * ratio` formula. That value
counts only allocations that went through `Allocator` -- jemalloc-internal
fragmentation, mmap'd pages, page cache, and any untracked allocation are
excluded. Under load, the tracker can be orders of magnitude smaller than
the actual RSS, which makes the formula compute a new hard limit barely
above `tracked` while RSS is far higher -- pinning the limit at (or below)
current RSS and rejecting every subsequent allocation.

Reproduced on `Stateless tests (amd_asan_ubsan, distributed plan, parallel,
2/2)` (https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104964&sha=ced8215d96e5342ac0619ba94e7582618c0b60ca&name_0=PR&name_1=Stateless%20tests%20%28amd_asan_ubsan%2C%20distributed%20plan%2C%20parallel%2C%202%2F2%29):
168 tests failed with `Connection reset by peer` after the lead query
got `(total) memory limit exceeded: would use 780.35 MiB (attempt to
allocate chunk of 1.01 MiB), current RSS: 28.67 GiB, maximum: 28.67 GiB`.
The hard limit had been driven to 29 GiB by `(tracked: 780 MiB +
available: 31 GiB) * 0.9 = 28.6 GiB` while actual RSS was 28.67 GiB. PR:
#104964

Three changes:

* Use `resident` (the jemalloc/cgroup RSS already computed at the top of
  the worker tick) as the baseline. `(resident + available) * ratio`
  always exceeds `resident` when `available > resident * (1/ratio - 1)`
  and tracks actual memory pressure faithfully.

* Refuse to apply a new hard limit that lies at or below `resident`,
  even after the ceiling clamp. Shrinking under RSS cannot succeed (the
  server has no way to release memory instantly) and would only break
  in-flight queries -- the purpose of the adjustment is to leave room
  for *other* processes, not to throttle our own work.

* Roll the previous `setExternalHardLimit(ceiling)` setter into
  `setDynamicHardLimitSettings(ceiling, ratio)`, so a config-reload
  change to `max_server_memory_usage_to_ram_ratio` takes effect on the
  next worker tick. The previous design captured the ratio once in the
  constructor; subsequent reloads updated `total_memory_tracker`'s hard
  limit but left the worker's formula running with the stale ratio.
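
A worked version of the arithmetic, as a hedged Python sketch. The inputs are the rounded figures quoted in this commit message (780 MiB tracked, 28.67 GiB RSS, 31 GiB available, ratio 0.9), so the results are approximate; the variable names are illustrative only.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2

ratio = 0.9
tracked = 780 * MIB      # what total_memory_tracker.get() reported
rss = 28.67 * GIB        # actual resident memory
available = 31 * GIB     # memory reported as available

# Old baseline: tracked allocations only. The result lands at or below RSS.
old_limit = (tracked + available) * ratio

# New baseline: resident memory.
new_limit = (rss + available) * ratio

# (resident + available) * ratio > resident  <=>  available > resident * (1/ratio - 1)
min_available = rss * (1 / ratio - 1)

print(f"old: {old_limit / GIB:.2f} GiB, new: {new_limit / GIB:.2f} GiB, "
      f"needs available > {min_available / GIB:.2f} GiB")
```

With these figures the old formula yields roughly 28.6 GiB, which does not clear the 28.67 GiB RSS, while the resident-based formula yields roughly 53.7 GiB.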

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…flags

The `meminfo_warnings_printed` and `cgroup_memory_max_warnings_printed`
fields in `MemoryWorker` are only read or assigned from
`#if defined(OS_LINUX)` blocks, so on Darwin and FreeBSD they triggered
`-Werror,-Wunused-private-field`. Mark them `[[maybe_unused]]` so the
non-Linux builds compile cleanly without changing the Linux code paths.

Build failure reports:
- https://s3.amazonaws.com/clickhouse-test-reports/PRs/104964/f33774ad1cb4166380827bb0b1dabf606a9449e3/build_amd_darwin/build_clickhouse/build_clickhouse.log
- https://s3.amazonaws.com/clickhouse-test-reports/PRs/104964/f33774ad1cb4166380827bb0b1dabf606a9449e3/build_arm_darwin/build_clickhouse/build_clickhouse.log
- https://s3.amazonaws.com/clickhouse-test-reports/PRs/104964/f33774ad1cb4166380827bb0b1dabf606a9449e3/build_amd_freebsd/build_clickhouse/build_clickhouse.log

PR: #104964

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/Common/MemoryWorker.cpp
alexey-milovidov and others added 2 commits May 15, 2026 15:34
…er` dynamic limit

Two correctness issues flagged by the AI review on
#104964:

1. In cgroup v2, when the leaf cgroup has `memory.max = max` but an
   ancestor cgroup has a finite limit, the previous code parsed only the
   leaf's `memory.max`, fell back to `/proc/meminfo`, and computed
   `available` from host-scoped memory. That could inflate the dynamic
   hard limit above the effective cgroup budget. Fix: walk all ancestors
   up to the cgroup mount root, open every `memory.max`, and on each
   tick take the minimum finite value. Only fall back to `/proc/meminfo`
   when no ancestor has a finite limit (i.e. the cgroup truly imposes
   no memory cap).

2. Reload race: a worker tick that read a positive `ratio` could compute
   `new_hard_limit` and write it via `setHardLimit` after a concurrent
   reload set the ratio to `0` (and reapplied its own hard limit from
   `total_memory_tracker.setHardLimit(max_server_memory_usage)`),
   leaving a stale positive limit. Fix: add a `settings_generation`
   counter bumped by `setDynamicHardLimitSettings` after writing the
   new ratio/ceiling. The worker captures the generation when it starts
   a dynamic update tick and re-checks just before `setHardLimit`,
   skipping the write if a reload happened in flight.
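
The ancestor walk in item 1 might look like the following Python sketch. It is an assumption-laden illustration, not the PR's implementation: both function names are hypothetical, and the sketch excludes the mount root itself because the cgroup v2 root has no `memory.max` file.

```python
from pathlib import PurePosixPath
from typing import List, Optional

def ancestor_memory_max_paths(mount_root: str, cgroup_dir: str) -> List[str]:
    """List memory.max paths from the leaf cgroup up to, but excluding, the mount root."""
    root = PurePosixPath(mount_root)
    current = PurePosixPath(cgroup_dir)
    paths = []
    while current != root and root in current.parents:
        paths.append(str(current / "memory.max"))
        current = current.parent
    return paths

def min_finite_limit(raw_values: List[str]) -> Optional[int]:
    """Return the smallest numeric limit, ignoring the "max" token.

    None means no level has a finite limit, i.e. the case where falling
    back to /proc/meminfo is appropriate.
    """
    finite = [int(v) for v in (s.strip() for s in raw_values) if v != "max"]
    return min(finite) if finite else None

paths = ancestor_memory_max_paths(
    "/sys/fs/cgroup", "/sys/fs/cgroup/system.slice/clickhouse.service")
# Leaf says "max", but the parent slice is capped at 8 GiB.
effective = min_finite_limit(["max\n", "8589934592\n"])
```

Here the leaf's `max` is ignored and the parent's 8 GiB becomes the effective budget, which is the situation the fix targets.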

PR: #104964

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@azat azat left a comment

Member

This looks questionable to me, do we really want to take care of this?

Maybe it is better to configure the maximum allowed memory usage for the server properly?
Isn't it the job of cgroups? (i.e. run clickhouse-server in a separate cgroup)
Or, instead of configuring a hard limit, should we shrink the memory that we can under memory pressure, e.g. jemalloc dirty pages, the mark cache, etc.?

Comment thread src/Common/MemoryWorker.cpp Outdated
Comment thread src/Common/MemoryWorker.cpp Outdated
Comment thread src/Common/MemoryWorker.cpp Outdated
@azat azat self-assigned this May 15, 2026
@alexey-milovidov

Member Author

> This looks questionable to me, do we really want to take care of this?

Many ClickHouse users run it on machines with other stuff, like their web servers, PHP, Perl, etc.

Three feedback items on `MemoryWorker`'s dynamic limit:

  * Use `MemAvailable` from `/proc/meminfo` instead of `MemFree + Cached`.
    `MemAvailable` is the kernel's own estimate of memory available for
    new allocations; it already accounts for the *reclaimable* portion of
    the page cache and slab, so we don't have to add `Cached` ourselves
    and we don't claim memory the kernel considers pinned (dirty pages,
    low watermarks, mlocked).

  * Handle the cgroup-v2 `memory.max` value `"max"` explicitly. The
    previous code relied on `tryReadIntText` returning false for the
    non-numeric string; this is the same effect but doesn't rely on
    parse-failure semantics and is easier to read.

  * Drop the redundant `> 0` check on a `uint64_t` value.

The dynamic-limit formula is now
`(resident + MemAvailable) * max_server_memory_usage_to_ram_ratio`
in the no-cgroup-limit fallback path, instead of
`(resident + MemFree + Cached) * ratio`.
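
For readers unfamiliar with the file, here is a small Python sketch of extracting `MemAvailable` from `/proc/meminfo`-style text. It parses a string rather than the real file so it runs anywhere; `parse_mem_available` is a hypothetical helper and the sample values are invented.

```python
from typing import Optional

def parse_mem_available(meminfo_text: str) -> Optional[int]:
    """Return MemAvailable in bytes, or None if the field is missing or malformed."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            fields = line.split()
            # Expected shape: ["MemAvailable:", "<number>", "kB"]
            if len(fields) >= 2 and fields[1].isdigit():
                return int(fields[1]) * 1024
            return None
    return None

sample = (
    "MemTotal:       16384000 kB\n"
    "MemFree:         1024000 kB\n"
    "MemAvailable:    8192000 kB\n"
    "Cached:          6000000 kB\n"
)
available_bytes = parse_mem_available(sample)
```

In the sample, `MemAvailable` differs from `MemFree + Cached`, which mirrors the point made above: the kernel's estimate is its own figure, not a sum the reader has to reconstruct.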

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/Common/MemoryWorker.cpp Outdated
Comment thread src/Common/MemoryWorker.cpp Outdated
@alexey-milovidov

Member Author

A huge amount of memory can be eaten by the OS for sockets.

@alexey-milovidov

Member Author

This was fixed by #105146. Let's update the branch.

alexey-milovidov and others added 2 commits May 17, 2026 20:50
…p instead of skip

Three follow-up fixes for `MemoryWorker`'s dynamic hard-limit logic, based on
re-review comments from `clickhouse-gh`.

Reload race in `setHardLimit` (TOCTOU between generation check and apply).
The previous design captured `settings_generation` at the start of the worker
tick and re-checked it just before `setHardLimit`. But the actual apply was
not under any lock, and `Server.cpp` performed its own `setHardLimit` outside
the worker's view, so a worker tick that observed the old generation could
still call `setHardLimit` *after* a reload had installed its own value,
overwriting it with a stale number computed against the old ratio. With
`max_server_memory_usage_to_ram_ratio = 0`, subsequent ticks skip the
adjustment and the stale limit could persist indefinitely.

Fixed by:
  * Adding `dynamic_hard_limit_apply_mutex` that both `setDynamicHardLimitSettings`
    and the worker's apply step take. The mutex serializes the two writers, so
    the worker's "re-check generation then `setHardLimit`" is now atomic with
    respect to a reload's "store new settings then `setHardLimit`".
  * Moving the `total_memory_tracker.setHardLimit(max_server_memory_usage)`
    call from `Server.cpp` / `LocalServer.cpp` into `setDynamicHardLimitSettings`
    so the hard limit is installed under the same mutex.
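
The locking scheme can be sketched in Python as below. This is a simplified model, not the C++ code: the class and method names are hypothetical, a plain integer replaces the atomic generation counter, and the tracker's hard limit is modeled as an attribute.

```python
import threading

class DynamicLimitState:
    """Toy model of a reload path and a worker path sharing one mutex."""

    def __init__(self) -> None:
        self._apply_mutex = threading.Lock()
        self.generation = 0
        self.ratio = 0.0
        self.ceiling = 0
        self.hard_limit = 0

    def set_settings(self, ceiling: int, ratio: float) -> None:
        """Reload path: store settings, bump generation, install the limit."""
        with self._apply_mutex:
            self.ceiling = ceiling
            self.ratio = ratio
            self.generation += 1
            self.hard_limit = ceiling

    def begin_tick(self) -> int:
        """Worker path: remember which settings generation this tick started with."""
        with self._apply_mutex:
            return self.generation

    def apply_from_worker(self, captured_generation: int, new_limit: int) -> bool:
        """Worker path: re-check and apply under the same mutex as the reload."""
        with self._apply_mutex:
            if captured_generation != self.generation:
                return False  # a reload happened mid-tick; drop the stale value
            self.hard_limit = new_limit
            return True

state = DynamicLimitState()
state.set_settings(ceiling=100, ratio=0.9)
tick_generation = state.begin_tick()
state.set_settings(ceiling=200, ratio=0.0)  # reload lands while the tick is in flight
stale_applied = state.apply_from_worker(tick_generation, new_limit=90)
```

Because the check and the write happen while holding the mutex that the reload also takes, the stale value 90 cannot overwrite the reload's 200.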

Cgroup v1 "no limit" sentinel.
`memory.limit_in_bytes` uses a huge numeric sentinel (`PAGE_COUNTER_MAX`,
~2^63) to mean "no limit". The previous code treated any numeric `> 0` as
finite, so the sentinel made `any_finite == true` and the dynamic limit
stayed pinned to the startup ceiling on v1-unlimited hosts instead of
falling back to `/proc/meminfo`. The fix captures host RAM at construction (raw
`sysconf(_SC_PHYS_PAGES) * sysconf(_SC_PAGESIZE)`, not the cgroup-aware
`getMemoryAmount`) and skips any limit `>= host_memory_bytes`, the same way
the v2 `"max"` token is skipped.

Skip-when-pressure path actually leaves us with the old, larger limit.
Under high memory pressure the formula can produce `new_hard_limit <= used`.
The previous code skipped the tick to avoid rejecting in-flight allocations
— but that left the previous (often much larger) limit in place, breaking
the contract of shrinking the budget as available memory drops. Replaced
the skip with a clamp: `new_hard_limit = max(formula, used + 64 MiB)`,
still capped at `ceiling`, so the shrink takes effect with a small headroom
for in-flight queries instead of being thrown away.
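
The clamp can be written out as a short Python sketch. The function name is hypothetical, and treating `ceiling == 0` as "no ceiling" is an assumption of this sketch.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3
SAFETY_MARGIN = 64 * MIB

def clamp_limit(formula: int, used: int, ceiling: int) -> int:
    """Apply max(formula, used + 64 MiB), then cap at the ceiling.

    ceiling == 0 is treated as "no ceiling" (assumption for this sketch).
    """
    limit = max(formula, used + SAFETY_MARGIN)
    if ceiling > 0:
        limit = min(limit, ceiling)
    return limit

# High pressure: the formula falls below current usage, so the clamp wins.
pressured = clamp_limit(formula=9 * GIB, used=10 * GIB, ceiling=100 * GIB)

# Low pressure: the formula is comfortably above usage and is used as-is.
relaxed = clamp_limit(formula=50 * GIB, used=10 * GIB, ceiling=100 * GIB)
```

In the pressured case the limit still shrinks, to usage plus 64 MiB, instead of staying at whatever larger value the previous tick installed.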

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/Common/MemoryWorker.cpp Outdated
Address the unresolved review thread on `src/Common/MemoryWorker.cpp:727`:
returning `0` from `readAvailableForDynamicLimit` was overloaded — it meant
both "we could not read the metric" (legitimate to skip this tick) and
"there is genuinely no memory available" (the highest-pressure case, where
the dynamic limit must still shrink). The previous `if (available)` guard
collapsed those into a single "skip" branch, so under maximum pressure the
worker would leave the previous (often larger) hard limit in place — exactly
when ClickHouse should be tightening its budget the most.

Switch `readAvailableForDynamicLimit` and `readSystemAvailableMemory` to
`std::optional<uint64_t>` so `std::nullopt` signals "no source could be
read" and a returned `0` signals "no memory available right now". The
caller now treats `0` as valid input: the existing clamp-to-`used +
safety_margin` path will shrink the hard limit at the high-pressure boundary
instead of skipping the adjustment.
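
The same pitfall exists in Python, where `if available:` is false for both `None` and `0`. A minimal sketch of the corrected caller, with hypothetical names and an inline clamp standing in for the real one:

```python
from typing import Optional

MIB = 1024 ** 2
GIB = 1024 ** 3

def next_hard_limit(resident: int, available: Optional[int],
                    ratio: float, safety_margin: int) -> Optional[int]:
    """Return the limit to install, or None to skip this tick.

    None  -> no source could be read, leave the current limit alone.
    0     -> genuinely no memory available; still a valid input.
    """
    if available is None:  # not `if not available`, which would also catch 0
        return None
    formula = int((resident + available) * ratio)
    return max(formula, resident + safety_margin)

skipped = next_hard_limit(10 * GIB, None, 0.9, 64 * MIB)
tightened = next_hard_limit(10 * GIB, 0, 0.9, 64 * MIB)
```

With `available == 0` the formula gives 9 GiB, below the 10 GiB in use, so the result is the clamped 10 GiB plus the margin rather than a skipped tick.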
Comment thread src/Common/MemoryWorker.cpp
Addresses the unresolved bot thread on `MemoryWorker.cpp:556`. When
running in cgroup mode (`cgroups_reader && !cgroup_memory_max_bufs.empty()`)
and every `memory.max` read throws, the old code fell through to
host-wide `/proc/meminfo`. On a containerized deployment with a finite
effective cgroup limit, that is fail-open: the host's `MemAvailable` can
be far above the cgroup budget, and using it as headroom can let
`total_memory_tracker` exceed the cgroup limit and trigger a cgroup
OOM kill.

Distinguish "cgroup has no finite limit" (safe to use host meminfo)
from "failed to read cgroup limits" (skip this tick): track a
`any_read_failure` flag in the read loop; when no limit was parsed
finite and at least one read threw, return `std::nullopt` so the
worker leaves the hard limit alone and retries on the next tick.

The "all reads succeeded but every value was `max`/sentinel" case
continues to fall through to host `MemAvailable` as before, since
that is the genuinely-unbounded cgroup case.
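
The three outcomes described in this commit can be sketched in Python. This models only the behavior stated here (a later commit tightens it further); the function name, the string labels, and the use of `OSError` for a failed read are all assumptions of the sketch.

```python
from typing import Callable, List, Optional, Tuple

def read_cgroup_limit(readers: List[Callable[[], str]]) -> Tuple[str, Optional[int]]:
    """Classify one tick's memory.max reads.

    ("finite", n)              -> use the smallest finite limit n
    ("skip_tick", None)        -> nothing finite parsed and a read failed
    ("use_host_meminfo", None) -> every level read fine and said "max"
    """
    finite: Optional[int] = None
    any_read_failure = False
    for read in readers:
        try:
            raw = read().strip()
        except OSError:
            any_read_failure = True
            continue
        if raw == "max":
            continue
        value = int(raw)
        finite = value if finite is None else min(finite, value)
    if finite is not None:
        return ("finite", finite)
    if any_read_failure:
        return ("skip_tick", None)
    return ("use_host_meminfo", None)

def failing_read() -> str:
    raise OSError("memory.max is unreadable")

unbounded = read_cgroup_limit([lambda: "max\n", lambda: "max\n"])
fail_closed = read_cgroup_limit([failing_read, lambda: "max\n"])
```

Only the fully successful, all-`max` case reaches the host-wide fallback; a failed read with no finite value ends the tick without touching the limit.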
…'t raise shrunk limit on reload, add unit tests

Three fixes for the dynamic `total_memory_tracker` hard-limit adjustment in
`MemoryWorker`, addressing review feedback:

- Constructor-time cgroup open failures no longer drop a level permanently.
  `CgroupMemoryLevel` now retains the `memory.max`/`memory.current` paths, and
  `readAvailableForDynamicLimit` lazily (re)opens any level whose buffers are
  null and fails the whole tick closed on a persistent open/read failure. This
  prevents computing the headroom minimum from an incomplete ancestor set and
  overestimating the budget when the dropped ancestor was the tighter one.

- `setDynamicHardLimitSettings` no longer raises an already-shrunk dynamic limit
  back to the static ceiling on an unrelated config reload. When dynamic
  adjustment is enabled and the worker has previously shrunk the limit under
  memory pressure, the reload only lowers the limit (never raises it); the next
  worker tick recomputes from live headroom. `0` is treated as "unlimited" on
  both sides of the comparison.

- Factored the per-level decision into the pure helper
  `MemoryWorkerHelpers::decideCgroupLevelAvailability` and added focused unit
  tests covering finite `memory.max` with same-level usage, the `"max"` token,
  the cgroup v1 "no limit" sentinel, at/over-limit (zero headroom), and
  zero/unparseable values.
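
A Python sketch of what such a per-level decision helper could look like, covering the cases the unit tests are said to exercise. It is not the PR's `decideCgroupLevelAvailability`: the enum members other than the idea of an "unbounded" kind are hypothetical, and mapping zero or unparseable input to `INVALID` is purely an assumption, since the commit message does not say how those values are classified.

```python
from enum import Enum
from typing import NamedTuple

GIB = 1024 ** 3

class LevelKind(Enum):
    UNBOUNDED = "unbounded"
    FINITE = "finite"
    INVALID = "invalid"  # assumption: how zero/unparseable input is reported

class LevelAvailability(NamedTuple):
    kind: LevelKind
    available: int = 0

def decide_level_availability(raw_limit: str, usage: int,
                              host_memory_bytes: int) -> LevelAvailability:
    text = raw_limit.strip()
    if text == "max":  # cgroup v2 "no limit" token
        return LevelAvailability(LevelKind.UNBOUNDED)
    if not text.isdigit() or int(text) == 0:
        return LevelAvailability(LevelKind.INVALID)
    limit = int(text)
    if limit >= host_memory_bytes:  # cgroup v1 "no limit" sentinel
        return LevelAvailability(LevelKind.UNBOUNDED)
    return LevelAvailability(LevelKind.FINITE, max(0, limit - usage))

host = 192 * GIB
finite = decide_level_availability(str(8 * GIB), 3 * GIB, host)
v1_unlimited = decide_level_availability("9223372036854771712", 3 * GIB, host)
```

Keeping the decision pure (strings and integers in, a small value out) is what makes it testable without a real cgroup filesystem.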
Comment thread src/Common/MemoryWorker.cpp Outdated
Comment thread src/Common/MemoryWorker.cpp
alexey-milovidov and others added 4 commits June 5, 2026 02:18
`clang-tidy` (`cppcoreguidelines-pro-type-member-init`) flagged
`MemoryWorkerHelpers::CgroupLevelAvailability::kind` as uninitialized by
the implicit constructor. Give it a default member initializer of
`CgroupLevelKind::Unbounded`, matching the sibling `available = 0`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
If a path operation in the cgroup v2 ancestor walk threw after
`cgroups_reader` and `source` were assigned and some (but not all) levels
were pushed, the outer `catch` only logged `Cannot use cgroups reader` and
left the partial state in place. `readAvailableForDynamicLimit` gates the
cgroup branch on `cgroups_reader && !cgroup_memory_levels.empty()`, so it
would then compute the headroom minimum from an incomplete ancestor set and
overestimate the real budget when the dropped ancestor was the tighter one.

Reset `cgroups_reader` / `cgroup_memory_levels` and clear `source` in the
`catch` so the worker fails closed and falls back to the jemalloc source and
host-wide `/proc/meminfo`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`setDynamicHardLimitSettings` kept `min(current, ceiling)` on every reload
after the first whenever `ratio > 0`, on the assumption that the worker will
raise the limit again on a later tick. That assumption is false when the
dynamic adjustment is a no-op: on non-Linux `readAvailableForDynamicLimit`
always returns `std::nullopt`, and when `source == None` (e.g. no jemalloc,
no cgroups) `start` runs no worker tick at all. In those cases raising
`max_server_memory_usage` (e.g. from 8 GiB to 16 GiB) on reload would leave
`total_memory_tracker` pinned at the old, smaller value indefinitely —
whereas before this PR the direct `setHardLimit` applied the new config
immediately.

Only preserve a previously-shrunk value when the worker can actually
recompute it: a live `source` on Linux. Otherwise apply the new `ceiling`
directly.
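
The reload rule can be summarized in a small Python sketch. The function name and the single `worker_can_recompute` flag (standing for "dynamic adjustment enabled with a live source on Linux") are simplifications made for illustration; `0` means "unlimited" on both sides, as described earlier in this thread.

```python
GIB = 1024 ** 3

def limit_after_reload(current: int, ceiling: int, worker_can_recompute: bool) -> int:
    """Pick the hard limit to install on a config reload (0 = unlimited)."""
    if not worker_can_recompute:
        return ceiling          # nobody would ever raise it again, apply directly
    if current == 0:
        return ceiling          # unlimited -> anything finite is a lowering
    if ceiling == 0:
        return current          # keep the shrunk value; the next tick recomputes
    return min(current, ceiling)

# Raising max_server_memory_usage from 8 GiB to 16 GiB:
no_worker = limit_after_reload(8 * GIB, 16 * GIB, worker_can_recompute=False)
with_worker = limit_after_reload(8 * GIB, 16 * GIB, worker_can_recompute=True)
```

Without a worker the new 16 GiB applies immediately; with one, the shrunk 8 GiB is kept until the next tick recomputes from live headroom.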

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Member Author

Continued work on this PR:

  • Merged master (was 573 commits behind, with red CI). No conflicts. The merge also picks up b533bbfa529 ("Fix tidy build in master"), which clears the unrelated BinaryRowInputFormat.cpp cppcoreguidelines-init-variables error that was also failing the arm_tidy build.
  • Fixed the arm_tidy build failure (56a6685ee1a): MemoryWorkerHelpers::CgroupLevelAvailability::kind was flagged by cppcoreguidelines-pro-type-member-init; gave it a default member initializer.
  • Addressed the two open Bugbot threads:
    • fdaf1b58913 — fail closed on a partial cgroup hierarchy setup: the outer catch now resets cgroups_reader / cgroup_memory_levels and clears source, so an incomplete ancestor set can no longer be used to overestimate headroom.
    • 66a135bfa5b — only preserve a previously-shrunk hard limit on reload when the worker can actually recompute it (live source on Linux); otherwise apply the new ceiling directly, so a reload that raises max_server_memory_usage takes effect immediately on non-Linux / source == None configurations.

Verified locally: MemoryWorker.cpp, Server.cpp, LocalServer.cpp, and gtest_cgroups_reader.cpp compile cleanly, and the *Cgroup*/*MemoryWorker* gtests (6 tests, incl. DecideCgroupLevelAvailability) all pass.

The design question raised in review (whether the server should manage this vs. relying on cgroups / memory-pressure reclaim) is left for you to decide.

Comment thread src/Common/MemoryWorker.cpp
Comment thread src/Common/MemoryWorker.cpp
alexey-milovidov and others added 2 commits June 7, 2026 04:44
The cgroup branch of `MemoryWorker::readAvailableForDynamicLimit` computed
per-level headroom as `memory.max - memory.current`. `memory.current` includes
reclaimable page cache (`active_file` + `inactive_file`) and reclaimable slab
(`slab_reclaimable`), which the kernel frees under pressure before invoking the
OOM killer. Counting them as usage shrinks the headroom on a read-heavy server
with a warm page cache (`memory.current` close to `memory.max`) and can make
ClickHouse throw `MEMORY_LIMIT_EXCEEDED` even though most of that memory is
reclaimable.

Read each cgroup v2 level's `memory.stat` and subtract its reclaimable bytes
from `memory.current`, so the cgroup headroom mirrors the host-wide
`MemAvailable` path. `resident`, the formula's baseline, already excludes
reclaimable memory, so on a dedicated cgroup `resident + available` converges to
`memory.max`. The cgroup v1 path is unaffected: leaf usage from `cgroups_reader`
already excludes reclaimable memory.

Add a unit test for the new `reclaimableFromCgroupV2Stat` helper.
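
A Python sketch of the reclaimable-memory adjustment, assuming `memory.stat` lines of the form `key value`. The helper names are hypothetical (they are not the PR's `reclaimableFromCgroupV2Stat`), and the sample numbers are invented.

```python
RECLAIMABLE_KEYS = ("active_file", "inactive_file", "slab_reclaimable")

def reclaimable_from_stat(stat_text: str) -> int:
    """Sum active_file + inactive_file + slab_reclaimable from memory.stat text."""
    total = 0
    for line in stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in RECLAIMABLE_KEYS and parts[1].isdigit():
            total += int(parts[1])
    return total

def cgroup_headroom(memory_max: int, memory_current: int, stat_text: str) -> int:
    """Headroom = memory.max - (memory.current - reclaimable), floored at 0."""
    non_reclaimable = max(0, memory_current - reclaimable_from_stat(stat_text))
    return max(0, memory_max - non_reclaimable)

sample_stat = (
    "anon 500\n"
    "file 300\n"
    "active_file 200\n"
    "inactive_file 100\n"
    "slab_reclaimable 50\n"
    "slab_unreclaimable 10\n"
)
naive = 1000 - 900                                  # memory.max - memory.current
adjusted = cgroup_headroom(1000, 900, sample_stat)  # discounts the warm cache
```

With a warm page cache the naive headroom is 100 while the adjusted one is 450, which is the gap that caused the premature `MEMORY_LIMIT_EXCEEDED` described above.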

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Member Author

Merged latest master (was 482 commits behind) and pushed 46986e81306.

Addressed the two open clickhouse-gh review threads:

  1. Cgroup headroom now mirrors MemAvailable — the cgroup v2 path read memory.max - memory.current, but memory.current includes reclaimable page cache and slab. On a read-heavy server with a warm cache (memory.current near memory.max) that shrank the headroom and could throw MEMORY_LIMIT_EXCEEDED for memory the kernel can reclaim. The worker now reads each level's memory.stat and subtracts its reclaimable bytes (active_file + inactive_file + slab_reclaimable) from memory.current. With resident (the baseline) already excluding reclaimable memory, resident + available converges to memory.max on a dedicated cgroup. Unit test added.

  2. Resident-unit limit vs. tracker units — not an issue: the limit is installed on total_memory_tracker (level Global), where the hard limit is enforced against RSS as well as the tracked amount (will_be_rss > current_hard_limit), and rss is kept in resident units. The RSS check binds in the same units the limit is computed in, so there is no over-admission. Explained in-thread.

CI: the only failures are perf-comparison "slower" queries, 7 of 12 clustered on the arm/1 shard (rand, formats_columns_nullable, joins, etc.) — the classic single-noisy-runner signature, with no memory-subsystem relationship to this change (the MemoryWorker tick runs on a background thread and setHardLimit is an atomic store). The fresh run from this push should clear them.

Comment thread src/Common/MemoryWorker.cpp
…e limit adjustment

Address review: a positive max_server_memory_usage_to_ram_ratio used to
opt the server into runtime hard-limit shrinking with no way to keep the
previous static cap, while setting the ratio to 0 removed the cap
entirely. The new memory_worker_dynamic_hard_limit server setting
(enabled by default) gates only the runtime adjustment: when disabled,
max_server_memory_usage_to_ram_ratio caps the hard memory limit
statically at startup and on configuration reload, exactly as in
previous versions.
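
How the two settings combine can be shown with a tiny Python sketch. It only encodes the rules stated in this description; `limit_modes` is a hypothetical name and the real server does not expose such a function.

```python
from typing import Tuple

def limit_modes(ratio: float, dynamic_hard_limit: bool = True) -> Tuple[bool, bool]:
    """Return (static_cap_enabled, dynamic_adjustment_enabled).

    ratio              -> max_server_memory_usage_to_ram_ratio
    dynamic_hard_limit -> memory_worker_dynamic_hard_limit (enabled by default)
    """
    static_cap = ratio > 0
    dynamic = static_cap and dynamic_hard_limit
    return static_cap, dynamic

default_mode = limit_modes(0.9)                           # static cap + runtime adjustment
static_only = limit_modes(0.9, dynamic_hard_limit=False)  # behavior of previous versions
uncapped = limit_modes(0.0)                               # ratio 0 disables both
```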

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@alexey-milovidov

Member Author

Pushed 4adb205478b: added the memory_worker_dynamic_hard_limit server setting (enabled by default) to address the remaining review thread — disabling it keeps only the static startup/reload cap from max_server_memory_usage_to_ram_ratio, as in previous versions. Verified the build and that the setting appears in system.server_settings.

The previous CH Inc sync failure was two unrelated flakes in the private CI (a session timeout at 99% with zero failed tests, and a SharedMergeTree backup restore-visibility race that passed on the in-job retry); no memory-related errors anywhere, and test_memory_limit_observer passed. The new push re-runs the sync.

@clickhouse-gh

clickhouse-gh Bot commented Jun 11, 2026

Contributor

LLVM Coverage Report

| Metric | Baseline | Current | Δ |
|---|---|---|---|
| Lines | 84.60% | 84.60% | +0.00% |
| Functions | 92.30% | 92.30% | +0.00% |
| Branches | 77.20% | 77.20% | +0.00% |

Changed C/C++ lines covered by tests: 256/311 (82.32%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 23 line(s)


@alexey-milovidov alexey-milovidov left a comment

Member Author

Sounds very much okay.

@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jun 12, 2026
Merged via the queue into master with commit d049b50 Jun 12, 2026
166 checks passed
@alexey-milovidov alexey-milovidov deleted the memory-worker-dynamic-limit branch June 12, 2026 20:07
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 12, 2026

Member

This is a constant log spam:

2026.06.15 17:17:57.967192 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.97 GiB to 89.97 GiB (resident: 383.91 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.017390 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.97 GiB to 89.98 GiB (resident: 383.91 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.067590 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.91 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.117877 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.91 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.168146 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.91 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.268706 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.419694 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.469854 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.721332 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.771520 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.972527 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.022732 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.072957 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.123229 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.374414 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.675770 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.776425 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.826696 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.927067 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.977327 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:18:00.027615 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:18:00.077792 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.61 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:18:00.128096 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.61 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:18:00.278766 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.61 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:18:00.580050 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.61 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:18:00.780905 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.61 GiB, ceiling: 112.88 GiB, ratio: 0.9)

pull Bot pushed a commit to SINHASantos/ClickHouse that referenced this pull request Jun 18, 2026
The background `MemoryWorker` recomputes the total memory hard limit on
every tick (every ~50ms) from `resident` and `MemAvailable`, both of
which jitter by a few MiB on each read. The previous guard re-applied and
logged the new limit whenever it differed from the current one by even a
single byte, so on an otherwise idle server the trace log was flooded:

    <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)

repeated dozens of times per second, and `setHardLimit` was called every
tick to no practical effect.

Adjust only when the change exceeds 1% of the current limit. A genuine
memory-pressure shift moves the limit by far more than this, so the
dynamic adjustment still reacts promptly when it actually needs to.

Discussion: ClickHouse#104964 (comment)
Related: ClickHouse#104964

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
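
The guard described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the actual ClickHouse code: the function names `computeCandidateLimit` and `shouldAdjustLimit` are hypothetical, and treating `ceiling` as an upper bound on the computed limit is an assumption inferred from the log line rather than something the commit message states.

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: candidate limit is (resident + available) * ratio.
// Assumption: the "ceiling" value from the log acts as an upper bound.
inline uint64_t computeCandidateLimit(uint64_t resident, uint64_t available, double ratio, uint64_t ceiling)
{
    const auto candidate = static_cast<uint64_t>(static_cast<double>(resident + available) * ratio);
    return std::min(candidate, ceiling);
}

// Hypothetical guard: adjust only when the candidate differs from the
// current limit by more than 1% of the current limit, so that jitter of
// a few MiB in resident / MemAvailable does not trigger a re-apply and a log line.
inline bool shouldAdjustLimit(uint64_t current, uint64_t candidate)
{
    const uint64_t diff = current > candidate ? current - candidate : candidate - current;
    return diff > current / 100;
}
```

With a current limit of about 90 GiB, the threshold in this sketch is roughly 0.9 GiB, so the sub-0.01 GiB movements visible in the trace log above would be skipped, while a shift of several GiB would still be applied on the next tick.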

Labels

- memory: When memory usage is higher than expected
- pr-improvement: Pull request with some product improvements
- pr-synced-to-cloud: The PR is synced to the cloud repo


4 participants