
alexey-milovidov · 2026-05-14T16:23:10Z

Changelog category (leave one):

Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

The background MemoryWorker now periodically updates the server's hard memory limit based on the current memory usage and the amount of memory the kernel reports as available, so ClickHouse leaves room for other processes running on the same host. The formula is (resident memory + system MemAvailable) * max_server_memory_usage_to_ram_ratio. The same max_server_memory_usage_to_ram_ratio server setting controls both the startup cap and the dynamic adjustment; set it to 0 to disable both. To keep only the static startup/reload cap (the behavior of previous versions), set the new server setting memory_worker_dynamic_hard_limit to 0.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Motivation

The server's hard memory limit is currently computed once at startup (and on config reload) as total_RAM * max_server_memory_usage_to_ram_ratio. On hosts shared with other processes (sidecars, system services, other workloads), this can be too permissive: as the other processes grow, ClickHouse keeps using its full configured share until the host runs out of memory and the kernel OOM killer fires.

This change makes MemoryWorker continuously recompute the limit from a more honest baseline — what we already own plus what the kernel reports as free or reclaimable — and apply the same ratio to leave headroom for the rest of the system.

Behavior

On each tick of MemoryWorker (default ~50ms on cgroups, ~100ms on jemalloc), in addition to the existing RSS/allocated bookkeeping, the worker reads MemAvailable from /proc/meminfo (or, when running in a cgroup with a finite memory.max, derives the per-cgroup equivalent) and computes:

new_hard_limit = (max(0, resident) + available) * max_server_memory_usage_to_ram_ratio

It then calls total_memory_tracker.setHardLimit(new_hard_limit).

On a dedicated host with no other significant memory consumers, this converges to roughly the value the existing startup logic would produce.
When other processes grow, MemAvailable shrinks (or the cgroup's free budget shrinks) and the dynamic limit follows, capping our growth before the host runs out of memory.
max_server_memory_usage_to_ram_ratio = 0 (no startup cap by ratio) also disables the dynamic adjustment.
The new server setting memory_worker_dynamic_hard_limit (enabled by default) gates only the runtime adjustment: when disabled, max_server_memory_usage_to_ram_ratio caps the hard memory limit statically at startup and on configuration reload, exactly as in previous versions.
Keeper does not expose max_server_memory_usage_to_ram_ratio, so the adjustment is off there.
On non-Linux systems /proc/meminfo is unavailable, so the adjustment is a no-op.
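
As a rough illustration of the per-tick computation described above, here is a minimal sketch. The function name `computeDynamicHardLimit` and its signature are hypothetical and not taken from `MemoryWorker.cpp`; the sketch only mirrors the stated formula and the rule that a ratio of 0 disables the adjustment.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical helper, not the actual MemoryWorker code.
// Returns std::nullopt when the dynamic adjustment is disabled (ratio == 0).
std::optional<uint64_t> computeDynamicHardLimit(int64_t resident, uint64_t available, double ratio)
{
    assert(ratio >= 0.0);
    if (ratio == 0.0)
        return std::nullopt;

    // max(0, resident): guard against a transiently negative resident estimate.
    const auto owned = static_cast<uint64_t>(std::max<int64_t>(0, resident));
    return static_cast<uint64_t>(static_cast<double>(owned + available) * ratio);
}
```

The caller would pass the result to the tracker only when a value is returned; a `std::nullopt` leaves the statically configured limit untouched.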

Files

src/Common/MemoryWorker.{h,cpp} — new MemoryWorkerConfig::dynamic_hard_limit_ratio; lazy /proc/meminfo reader for MemAvailable; cgroup-aware memory.max walk for cgroup v2; per-tick limit update with LOG_TRACE whenever the limit changes.
programs/server/Server.cpp, programs/local/LocalServer.cpp — feed max_server_memory_usage_to_ram_ratio into the new field, gated by memory_worker_dynamic_hard_limit.
src/Core/ServerSettings.cpp — new memory_worker_dynamic_hard_limit setting.

Note

High Risk
Changes how total_memory_tracker hard limits are applied and updated at runtime based on /proc/meminfo/cgroup headroom, which can affect query admission and memory-related stability under load; also introduces new concurrency/atomic logic around config reload vs worker ticks.

Overview
Adds a dynamic total-memory hard limit driven by MemoryWorker, recomputing the server ceiling each tick as (resident + available) * max_server_memory_usage_to_ram_ratio and applying it to total_memory_tracker.

Updates Server.cpp and LocalServer.cpp to pass the ratio into MemoryWorker and to apply max_server_memory_usage via the new MemoryWorker::setDynamicHardLimitSettings (removing direct setHardLimit calls) to avoid races with the worker.

Extends MemoryWorker with Linux-aware available-memory reading (including cgroup v2 ancestor memory.max min selection, v1 sentinel handling, and fail-close behavior) plus synchronization (settings_generation + mutex) so config reloads and worker adjustments cannot overwrite each other.

Reviewed by Cursor Bugbot for commit 9a207ef.

Version info

Merged into: 26.6.1.728

`MemoryWorker` now periodically reads `MemFree + Cached` from `/proc/meminfo` and updates the global hard memory limit to `(tracked + free + cached) * max_server_memory_usage_to_ram_ratio`. This shrinks ClickHouse's effective limit when other processes on the host consume memory, reducing the risk of OOM-killing.

The ratio is the existing `max_server_memory_usage_to_ram_ratio` server setting (default 0.9 in Server and Local); setting it to 0 disables the dynamic adjustment. Keeper does not expose the ratio, so dynamic adjustment is off there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

clickhouse-gh · 2026-05-14T16:23:47Z

Workflow [PR], commit [4adb205]

Summary: ✅

Performance Comparison: Performance dashboard

AI Review

Summary

This PR makes MemoryWorker dynamically adjust the global hard memory limit from resident memory plus host/cgroup available memory, and adds memory_worker_dynamic_hard_limit to preserve the previous static-cap behavior. Most earlier cgroup-accounting and reload-race issues are addressed in the current code, but the user-facing config docs are still inconsistent and one important runtime boundary remains under-tested.

Findings

⚠️ Majors

[programs/server/config.xml:470 / programs/server/config.yaml.example:232] The XML docs and ServerSettings text now explain the runtime adjustment and memory_worker_dynamic_hard_limit, but programs/server/config.yaml.example still has the old static-cap comment and is linked from docs/en/operations/configuration-files.md as the default YAML config. Users who configure ClickHouse with YAML will not see that max_server_memory_usage_to_ram_ratio now changes the hard limit at runtime or how to disable that behavior while keeping the static cap. Mirror the XML/setting text in the YAML example.

Tests

⚠️ [dismissed by author, see discussion_r3321701055 on #104964] The added unit tests cover the pure cgroup decision helpers, but there is still no focused automated check for the runtime apply boundary that protects the compatibility behavior: memory_worker_dynamic_hard_limit = 0 preserving the static cap, reload to ratio 0 clearing a previously dynamic limit, and read/open failures skipping adjustment instead of falling back to host MemAvailable. A small unit seam around setDynamicHardLimitSettings/availability reads or a targeted integration test would close this.

Final Verdict

Status: ⚠️ Request changes

Minimum required actions: update programs/server/config.yaml.example for the new runtime semantics and escape hatch; add focused coverage for the static opt-out/reload/fail-close boundary or provide a stronger reason this cannot be tested.

The dynamic hard-limit adjustment added in the previous commit read `MemFree + Cached` from `/proc/meminfo`. That source describes the **host**, not the cgroup, so when ClickHouse runs in a small cgroup on a big host (8 GiB cgroup on a 192 GiB host) the formula `(tracked + free + cached) * ratio` inflates the global hard limit to ~145 GiB on every tick, overriding the cgroup-derived 7.2 GiB that `Server.cpp` set at startup. `MemoryTracker` then never throws `MEMORY_LIMIT_EXCEEDED` and the kernel OOM-killer fires at 8 GiB RSS, killing the server.

The same `setHardLimit` call also overrode an explicit `max_server_memory_usage` from the config, since the dynamic adjustment ran every 50 ms regardless of the value `Server.cpp` had just installed.

Two changes:

* When `MemoryWorker` has a cgroup reader, the dynamic adjustment reads the cgroup's `memory.max` paired with the cgroup-aware usage from `cgroups_reader->readMemoryUsage` -- the same sources `AsynchronousMetrics` reports as `CGroupMemoryTotal` and `CGroupMemoryUsed`. The available-memory term becomes `memory.max - cgroup_used`. The `/proc/meminfo` path is kept as a fallback for the no-cgroup case.
* `Server.cpp` and `LocalServer.cpp` now hand the configured `max_server_memory_usage` to `MemoryWorker` via the new `setExternalHardLimit` setter. The dynamic adjustment caps its computed value at this ceiling, so it can only shrink the budget below the user's setting, never raise it above. The setter is re-called from the config-reload callback whenever the setting changes. Before it is called for the first time, the dynamic adjustment is suppressed, so the worker cannot inflate the limit during the brief window between `MemoryWorker::start` and the first reload.

Reproduced and verified with an 8 GiB systemd-run scope on a 192 GiB host: before the fix, five memory-heavy queries all triggered the cgroup OOM-killer; after the fix, all five are caught cleanly by `MemoryTracker` and the server stays alive across the battery.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
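
The two changes in this commit can be sketched roughly as follows. The helper names `cgroupAvailable` and `limitWithCeiling` are hypothetical, and treating a ceiling of 0 as "no ceiling" is an assumption of this sketch rather than something this commit message states.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

constexpr uint64_t GiB = 1ULL << 30;

// Headroom inside a cgroup: memory.max minus the cgroup's current usage, saturating at zero.
uint64_t cgroupAvailable(uint64_t memory_max, uint64_t cgroup_used)
{
    return memory_max > cgroup_used ? memory_max - cgroup_used : 0;
}

// Apply the ratio, then cap the result at the configured ceiling.
// Assumption of this sketch: ceiling == 0 means "no ceiling".
uint64_t limitWithCeiling(uint64_t used, uint64_t available, double ratio, uint64_t ceiling)
{
    assert(ratio > 0.0);
    const auto computed = static_cast<uint64_t>(static_cast<double>(used + available) * ratio);
    return ceiling == 0 ? computed : std::min(computed, ceiling);
}
```

With an 8 GiB `memory.max` and 2 GiB in use, the headroom is 6 GiB, so the computed limit can never exceed `8 GiB * ratio` no matter how much memory the host itself has free.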

The previous commits used `total_memory_tracker.get()` as the "memory we already own" term of the `(used + available) * ratio` formula. That value counts only allocations that went through `Allocator` -- jemalloc-internal fragmentation, mmap'd pages, page cache, and any untracked allocation are excluded. Under load, the tracker can be orders of magnitude smaller than the actual RSS, which makes the formula compute a new hard limit barely above `tracked` while RSS is far higher -- pinning the limit at (or below) current RSS and rejecting every subsequent allocation.

Reproduced on `Stateless tests (amd_asan_ubsan, distributed plan, parallel, 2/2)` (https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104964&sha=ced8215d96e5342ac0619ba94e7582618c0b60ca&name_0=PR&name_1=Stateless%20tests%20%28amd_asan_ubsan%2C%20distributed%20plan%2C%20parallel%2C%202%2F2%29): 168 tests failed with `Connection reset by peer` after the lead query got `(total) memory limit exceeded: would use 780.35 MiB (attempt to allocate chunk of 1.01 MiB), current RSS: 28.67 GiB, maximum: 28.67 GiB`. The hard limit had been driven to 29 GiB by `(tracked: 780 MiB + available: 31 GiB) * 0.9 = 28.6 GiB` while actual RSS was 28.67 GiB.

PR: #104964

Three changes:

* Use `resident` (the jemalloc/cgroup RSS already computed at the top of the worker tick) as the baseline. `(resident + available) * ratio` always exceeds `resident` when `available > resident * (1/ratio - 1)` and tracks actual memory pressure faithfully.
* Refuse to apply a new hard limit that lies at or below `resident`, even after the ceiling clamp. Shrinking under RSS cannot succeed (the server has no way to release memory instantly) and would only break in-flight queries -- the purpose of the adjustment is to leave room for *other* processes, not to throttle our own work.
* Roll the previous `setExternalHardLimit(ceiling)` setter into `setDynamicHardLimitSettings(ceiling, ratio)`, so a config-reload change to `max_server_memory_usage_to_ram_ratio` takes effect on the next worker tick. The previous design captured the ratio once in the constructor; subsequent reloads updated `total_memory_tracker`'s hard limit but left the worker's formula running with the stale ratio.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
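
To make the arithmetic from the failing CI run concrete, here is a small sketch that compares the two baselines using the numbers quoted in this commit message. The helper names are hypothetical, and the values are kept in GiB as floating point for readability only.

```cpp
#include <cassert>

// Hypothetical helper mirroring `(baseline + available) * ratio`, all quantities in GiB.
double limitFromBaseline(double baseline_gib, double available_gib, double ratio)
{
    assert(ratio > 0.0);
    return (baseline_gib + available_gib) * ratio;
}

// The rule introduced in this commit: a limit at or below current RSS is not applied.
bool isApplicable(double new_limit_gib, double resident_gib)
{
    return new_limit_gib > resident_gib;
}
```

With `tracked` at 780 MiB, `available` at 31 GiB and a ratio of 0.9, the tracked baseline yields about 28.6 GiB, which is below the 28.67 GiB RSS. The resident baseline yields `(28.67 + 31) * 0.9`, which is well above RSS.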

…flags

The `meminfo_warnings_printed` and `cgroup_memory_max_warnings_printed` fields in `MemoryWorker` are only read or assigned from `#if defined(OS_LINUX)` blocks, so on Darwin and FreeBSD they triggered `-Werror,-Wunused-private-field`. Mark them `[[maybe_unused]]` so the non-Linux builds compile cleanly without changing the Linux code paths.

Build failure reports:
- https://s3.amazonaws.com/clickhouse-test-reports/PRs/104964/f33774ad1cb4166380827bb0b1dabf606a9449e3/build_amd_darwin/build_clickhouse/build_clickhouse.log
- https://s3.amazonaws.com/clickhouse-test-reports/PRs/104964/f33774ad1cb4166380827bb0b1dabf606a9449e3/build_arm_darwin/build_clickhouse/build_clickhouse.log
- https://s3.amazonaws.com/clickhouse-test-reports/PRs/104964/f33774ad1cb4166380827bb0b1dabf606a9449e3/build_amd_freebsd/build_clickhouse/build_clickhouse.log

PR: #104964

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…er` dynamic limit

Two correctness issues flagged by the AI review on #104964:

1. In cgroup v2, when the leaf cgroup has `memory.max = max` but an ancestor cgroup has a finite limit, the previous code parsed only the leaf's `memory.max`, fell back to `/proc/meminfo`, and computed `available` from host-scoped memory. That could inflate the dynamic hard limit above the effective cgroup budget. Fix: walk all ancestors up to the cgroup mount root, open every `memory.max`, and on each tick take the minimum finite value. Only fall back to `/proc/meminfo` when no ancestor has a finite limit (i.e. the cgroup truly imposes no memory cap).
2. Reload race: a worker tick that read a positive `ratio` could compute `new_hard_limit` and write it via `setHardLimit` after a concurrent reload set the ratio to `0` (and reapplied its own hard limit from `total_memory_tracker.setHardLimit(max_server_memory_usage)`), leaving a stale positive limit. Fix: add a `settings_generation` counter bumped by `setDynamicHardLimitSettings` after writing the new ratio/ceiling. The worker captures the generation when it starts a dynamic update tick and re-checks just before `setHardLimit`, skipping the write if a reload happened in flight.

PR: #104964

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ic-limit

azat

This looks questionable to me, do we really want to take care of this?

Maybe it is better to configure the maximum allowed memory usage for the server properly?
Isn't that the job of cgroups (i.e. run clickhouse-server in a separate cgroup)?
Or, instead of adjusting the hard limit, should we release whatever memory we can under memory pressure, e.g. jemalloc dirty pages, the marks cache, etc.?

alexey-milovidov · 2026-05-15T18:12:34Z

> This looks questionable to me, do we really want to take care of this?

Many ClickHouse users run it on machines with other stuff, like their web servers, PHP, Perl, etc.

Three feedback items on `MemoryWorker`'s dynamic limit:

* Use `MemAvailable` from `/proc/meminfo` instead of `MemFree + Cached`. `MemAvailable` is the kernel's own estimate of memory available for new allocations; it already accounts for the *reclaimable* portion of the page cache and slab, so we don't have to add `Cached` ourselves and we don't claim memory the kernel considers pinned (dirty pages, low watermarks, mlocked).
* Handle the cgroup-v2 `memory.max` value `"max"` explicitly. The previous code relied on `tryReadIntText` returning false for the non-numeric string; this is the same effect but doesn't rely on parse-failure semantics and is easier to read.
* Drop the redundant `> 0` check on a `uint64_t` value.

The dynamic-limit formula is now `(resident + MemAvailable) * max_server_memory_usage_to_ram_ratio` in the no-cgroup-limit fallback path, instead of `(resident + MemFree + Cached) * ratio`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
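
For readers unfamiliar with the `/proc/meminfo` layout, a minimal sketch of extracting `MemAvailable` might look like the following. The function `parseMemAvailableBytes` is hypothetical and takes a stream so it can be exercised without a real `/proc`; the actual reader in `MemoryWorker.cpp` may be structured quite differently.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <optional>
#include <sstream>
#include <string>

// Parses /proc/meminfo-style text and returns MemAvailable in bytes,
// or std::nullopt if the field is missing or its value cannot be parsed.
std::optional<uint64_t> parseMemAvailableBytes(std::istream & in)
{
    std::string line;
    while (std::getline(in, line))
    {
        std::istringstream fields(line);
        std::string key;
        uint64_t value_kb = 0;
        if ((fields >> key >> value_kb) && key == "MemAvailable:")
            return value_kb * 1024;   // the kernel reports this field in kB
    }
    return std::nullopt;
}
```

On Linux the same function could be fed an `std::ifstream` opened on `/proc/meminfo`; returning `std::nullopt` rather than 0 keeps "could not read" distinct from "nothing available".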

alexey-milovidov · 2026-05-17T01:42:45Z

A huge amount of memory can be eaten by the OS for sockets.

alexey-milovidov · 2026-05-17T18:43:02Z

This was fixed by #105146. Let's update the branch.

…p instead of skip

Three follow-up fixes for `MemoryWorker`'s dynamic hard-limit logic, based on re-review comments from `clickhouse-gh`.

Reload race in `setHardLimit` (TOCTOU between generation check and apply). The previous design captured `settings_generation` at the start of the worker tick and re-checked it just before `setHardLimit`. But the actual apply was not under any lock, and `Server.cpp` performed its own `setHardLimit` outside the worker's view, so a worker tick that observed the old generation could still call `setHardLimit` *after* a reload had installed its own value, overwriting it with a stale number computed against the old ratio. With `max_server_memory_usage_to_ram_ratio = 0`, subsequent ticks skip the adjustment and the stale limit could persist indefinitely. Fixed by:

* Adding `dynamic_hard_limit_apply_mutex` that both `setDynamicHardLimitSettings` and the worker's apply step take. The mutex serializes the two writers, so the worker's "re-check generation then `setHardLimit`" is now atomic with respect to a reload's "store new settings then `setHardLimit`".
* Moving the `total_memory_tracker.setHardLimit(max_server_memory_usage)` call from `Server.cpp` / `LocalServer.cpp` into `setDynamicHardLimitSettings` so the hard limit is installed under the same mutex.

Cgroup v1 "no limit" sentinel. `memory.limit_in_bytes` uses a huge numeric sentinel (`PAGE_COUNTER_MAX`, ~2^63) to mean "no limit". The previous code treated any numeric `> 0` as finite, so the sentinel made `any_finite == true` and the dynamic limit stayed pinned to the startup ceiling on v1-unlimited hosts instead of falling back to `/proc/meminfo`. Fixed by capturing host RAM at construction (raw `sysconf(_SC_PHYS_PAGES) * sysconf(_SC_PAGESIZE)`, not the cgroup-aware `getMemoryAmount`) and skipping any limit `>= host_memory_bytes`, the same way the v2 `"max"` token is skipped.

Skip-when-pressure path actually leaves us with the old, larger limit. Under high memory pressure the formula can produce `new_hard_limit <= used`. The previous code skipped the tick to avoid rejecting in-flight allocations, but that left the previous (often much larger) limit in place, breaking the contract of shrinking the budget as available memory drops. Replaced the skip with a clamp: `new_hard_limit = max(formula, used + 64 MiB)`, still capped at `ceiling`, so the shrink takes effect with a small headroom for in-flight queries instead of being thrown away.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
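
The reload-race fix combines a generation counter with a mutex shared by both writers. A minimal sketch of that pattern is shown below, under stated assumptions: `DynamicLimitState` and its members are hypothetical stand-ins that only loosely follow the names in the commit text, and `hard_limit` substitutes for the limit held by `total_memory_tracker`.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <mutex>

// Hypothetical stand-in for the worker's shared state.
struct DynamicLimitState
{
    std::mutex apply_mutex;                 // serializes the reload path and the worker path
    std::atomic<uint64_t> generation{0};    // bumped on every settings change
    std::atomic<double> ratio{0.9};
    std::atomic<uint64_t> hard_limit{0};    // stand-in for the tracker's hard limit

    // Reload path: store settings, bump the generation and install the limit under one lock.
    void setSettings(uint64_t ceiling, double new_ratio)
    {
        std::lock_guard<std::mutex> lock(apply_mutex);
        ratio.store(new_ratio, std::memory_order_relaxed);
        generation.fetch_add(1, std::memory_order_release);
        hard_limit.store(ceiling, std::memory_order_relaxed);
    }

    // Worker path: apply a value computed against generation `gen_before`,
    // but only if no reload happened in the meantime. Returns true if applied.
    bool tryApply(uint64_t gen_before, uint64_t new_limit)
    {
        std::lock_guard<std::mutex> lock(apply_mutex);
        if (generation.load(std::memory_order_acquire) != gen_before)
            return false;   // stale computation, drop it
        hard_limit.store(new_limit, std::memory_order_relaxed);
        return true;
    }
};
```

Because the generation check and the store happen under the same lock as the reload's store, a worker tick that started before a reload can no longer overwrite the reload's value afterwards.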

Address the unresolved review thread on `src/Common/MemoryWorker.cpp:727`: returning `0` from `readAvailableForDynamicLimit` was overloaded — it meant both "we could not read the metric" (legitimate to skip this tick) and "there is genuinely no memory available" (the highest-pressure case, where the dynamic limit must still shrink). The previous `if (available)` guard collapsed those into a single "skip" branch, so under maximum pressure the worker would leave the previous (often larger) hard limit in place — exactly when ClickHouse should be tightening its budget the most. Switch `readAvailableForDynamicLimit` and `readSystemAvailableMemory` to `std::optional<uint64_t>` so `std::nullopt` signals "no source could be read" and a returned `0` signals "no memory available right now". The caller now treats `0` as valid input: the existing clamp-to-`used + safety_margin` path will shrink the hard limit at the high-pressure boundary instead of skipping the adjustment.
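
The distinction between "could not read" and "nothing available" can be sketched as follows. The function `nextHardLimit` is hypothetical; the 64 MiB margin comes from the earlier commit message, and treating a ceiling of 0 as "no ceiling" is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>

constexpr uint64_t MiB = 1ULL << 20;
constexpr uint64_t safety_margin = 64 * MiB;

// Returns the limit to install, or std::nullopt to leave the current limit alone.
// `available == std::nullopt` means "could not read"; `available == 0` is a valid reading.
std::optional<uint64_t> nextHardLimit(uint64_t used, std::optional<uint64_t> available, double ratio, uint64_t ceiling)
{
    assert(ratio > 0.0);
    if (!available)
        return std::nullopt;                         // unreadable metric: skip this tick

    auto limit = static_cast<uint64_t>(static_cast<double>(used + *available) * ratio);
    limit = std::max(limit, used + safety_margin);   // never pin the limit at or below current usage
    if (ceiling != 0)
        limit = std::min(limit, ceiling);            // assumption: 0 means "no ceiling"
    return limit;
}
```

With `available == 0` the formula falls below `used`, so the clamp takes over and the limit shrinks to `used + 64 MiB` instead of staying at its previous, larger value.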

…ic-limit

Addresses the unresolved bot thread on `MemoryWorker.cpp:556`. When running in cgroup mode (`cgroups_reader && !cgroup_memory_max_bufs.empty()`) and every `memory.max` read throws, the old code fell through to host-wide `/proc/meminfo`. On a containerized deployment with a finite effective cgroup limit, that is fail-open: the host's `MemAvailable` can be far above the cgroup budget, and using it as headroom can let `total_memory_tracker` exceed the cgroup limit and trigger a cgroup OOM kill.

Distinguish "cgroup has no finite limit" (safe to use host meminfo) from "failed to read cgroup limits" (skip this tick): track an `any_read_failure` flag in the read loop; when no limit was parsed finite and at least one read threw, return `std::nullopt` so the worker leaves the hard limit alone and retries on the next tick. The "all reads succeeded but every value was `max`/sentinel" case continues to fall through to host `MemAvailable` as before, since that is the genuinely-unbounded cgroup case.
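
The three outcomes described in this commit message can be sketched as a small decision function. The types `LevelRead`, `Source` and `Decision` are hypothetical, and the sketch follows the rule exactly as worded here: the tick is skipped only when no finite limit was parsed and at least one read failed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Result of reading one ancestor level (hypothetical type).
struct LevelRead
{
    bool read_failed = false;                  // the read threw
    std::optional<uint64_t> finite_headroom;   // set only when the level has a finite limit
};

enum class Source { CgroupHeadroom, HostMemAvailable, SkipTick };

struct Decision
{
    Source source;
    uint64_t headroom = 0;   // meaningful only for Source::CgroupHeadroom
};

Decision decide(const std::vector<LevelRead> & levels)
{
    std::optional<uint64_t> min_headroom;
    bool any_read_failure = false;
    for (const auto & level : levels)
    {
        if (level.read_failed)
            any_read_failure = true;
        else if (level.finite_headroom)
            min_headroom = min_headroom ? std::min(*min_headroom, *level.finite_headroom)
                                        : *level.finite_headroom;
    }

    if (min_headroom)
        return {Source::CgroupHeadroom, *min_headroom};   // tightest finite ancestor wins
    if (any_read_failure)
        return {Source::SkipTick, 0};                     // fail closed, retry next tick
    return {Source::HostMemAvailable, 0};                 // genuinely unbounded cgroup
}
```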

…'t raise shrunk limit on reload, add unit tests

Three fixes for the dynamic `total_memory_tracker` hard-limit adjustment in `MemoryWorker`, addressing review feedback:

- Constructor-time cgroup open failures no longer drop a level permanently. `CgroupMemoryLevel` now retains the `memory.max`/`memory.current` paths, and `readAvailableForDynamicLimit` lazily (re)opens any level whose buffers are null and fails the whole tick closed on a persistent open/read failure. This prevents computing the headroom minimum from an incomplete ancestor set and overestimating the budget when the dropped ancestor was the tighter one.
- `setDynamicHardLimitSettings` no longer raises an already-shrunk dynamic limit back to the static ceiling on an unrelated config reload. When dynamic adjustment is enabled and the worker has previously shrunk the limit under memory pressure, the reload only lowers the limit (never raises it); the next worker tick recomputes from live headroom. `0` is treated as "unlimited" on both sides of the comparison.
- Factored the per-level decision into the pure helper `MemoryWorkerHelpers::decideCgroupLevelAvailability` and added focused unit tests covering finite `memory.max` with same-level usage, the `"max"` token, the cgroup v1 "no limit" sentinel, at/over-limit (zero headroom), and zero/unparseable values.
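
A pure per-level decision of the kind described in the last item could look roughly like this. It is a sketch only: `decideLevel`, `LevelKind` and `LevelAvailability` are hypothetical names, not the real `MemoryWorkerHelpers` API, and classifying zero or unparseable values as `Invalid` is an assumption, since the commit message lists those cases without stating their outcome.

```cpp
#include <cassert>
#include <charconv>
#include <cstdint>
#include <string>
#include <system_error>

enum class LevelKind { Finite, Unbounded, Invalid };

struct LevelAvailability
{
    LevelKind kind = LevelKind::Unbounded;
    uint64_t available = 0;
};

// `limit_text` is the content of memory.max (v2) or memory.limit_in_bytes (v1), without a trailing newline.
LevelAvailability decideLevel(const std::string & limit_text, uint64_t usage, uint64_t host_memory_bytes)
{
    if (limit_text == "max")
        return {LevelKind::Unbounded, 0};     // cgroup v2 "no limit" token

    uint64_t limit = 0;
    const char * begin = limit_text.data();
    const char * end = begin + limit_text.size();
    const auto [ptr, ec] = std::from_chars(begin, end, limit);
    if (ec != std::errc{} || ptr != end || limit == 0)
        return {LevelKind::Invalid, 0};       // assumption: zero or unparseable is not usable

    if (limit >= host_memory_bytes)
        return {LevelKind::Unbounded, 0};     // cgroup v1 sentinel: larger than physical RAM

    // Saturating subtraction: at or over the limit means zero headroom.
    return {LevelKind::Finite, limit > usage ? limit - usage : 0};
}
```

Keeping the decision free of I/O is what makes each of the listed cases easy to cover with a unit test.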

…ic-limit

`clang-tidy` (`cppcoreguidelines-pro-type-member-init`) flagged `MemoryWorkerHelpers::CgroupLevelAvailability::kind` as uninitialized by the implicit constructor. Give it a default member initializer of `CgroupLevelKind::Unbounded`, matching the sibling `available = 0`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

If a path operation in the cgroup v2 ancestor walk threw after `cgroups_reader` and `source` were assigned and some (but not all) levels were pushed, the outer `catch` only logged `Cannot use cgroups reader` and left the partial state in place. `readAvailableForDynamicLimit` gates the cgroup branch on `cgroups_reader && !cgroup_memory_levels.empty()`, so it would then compute the headroom minimum from an incomplete ancestor set and overestimate the real budget when the dropped ancestor was the tighter one. Reset `cgroups_reader` / `cgroup_memory_levels` and clear `source` in the `catch` so the worker fails closed and falls back to the jemalloc source and host-wide `/proc/meminfo`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

`setDynamicHardLimitSettings` kept `min(current, ceiling)` on every reload after the first whenever `ratio > 0`, on the assumption that the worker will raise the limit again on a later tick. That assumption is false when the dynamic adjustment is a no-op: on non-Linux `readAvailableForDynamicLimit` always returns `std::nullopt`, and when `source == None` (e.g. no jemalloc, no cgroups) `start` runs no worker tick at all. In those cases raising `max_server_memory_usage` (e.g. from 8 GiB to 16 GiB) on reload would leave `total_memory_tracker` pinned at the old, smaller value indefinitely — whereas before this PR the direct `setHardLimit` applied the new config immediately. Only preserve a previously-shrunk value when the worker can actually recompute it: a live `source` on Linux. Otherwise apply the new `ceiling` directly. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
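
The reload rule from this commit can be summarized in a small sketch. The helper names are hypothetical, and the special handling of the very first reload mentioned above is left out for brevity.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// 0 means "unlimited" for both arguments.
uint64_t minTreatingZeroAsUnlimited(uint64_t a, uint64_t b)
{
    if (a == 0)
        return b;
    if (b == 0)
        return a;
    return std::min(a, b);
}

// Hypothetical decision for what to install on a config reload.
// `worker_can_recompute` stands for "a live source on Linux".
uint64_t limitOnReload(uint64_t current, uint64_t ceiling, double ratio, bool worker_can_recompute)
{
    assert(ratio >= 0.0);
    const bool dynamic_enabled = ratio > 0.0;
    if (dynamic_enabled && worker_can_recompute)
        return minTreatingZeroAsUnlimited(current, ceiling);   // only lower; the next tick recomputes
    return ceiling;                                            // nobody would raise it later, apply directly
}
```

In the 8 GiB to 16 GiB example, a configuration without a working tick gets 16 GiB immediately, while a Linux server with a live source keeps its current value until the next tick recomputes it.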

alexey-milovidov · 2026-06-05T02:23:31Z

Continued work on this PR:

Merged master (was 573 commits behind, with red CI). No conflicts. The merge also picks up b533bbfa529 ("Fix tidy build in master"), which clears the unrelated BinaryRowInputFormat.cpp cppcoreguidelines-init-variables error that was also failing the arm_tidy build.
Fixed the arm_tidy build failure (56a6685ee1a): MemoryWorkerHelpers::CgroupLevelAvailability::kind was flagged by cppcoreguidelines-pro-type-member-init; gave it a default member initializer.
Addressed the two open Bugbot threads:
- fdaf1b58913 — fail closed on a partial cgroup hierarchy setup: the outer catch now resets cgroups_reader / cgroup_memory_levels and clears source, so an incomplete ancestor set can no longer be used to overestimate headroom.
- 66a135bfa5b — only preserve a previously-shrunk hard limit on reload when the worker can actually recompute it (live source on Linux); otherwise apply the new ceiling directly, so a reload that raises max_server_memory_usage takes effect immediately on non-Linux / source == None configurations.

Verified locally: MemoryWorker.cpp, Server.cpp, LocalServer.cpp, and gtest_cgroups_reader.cpp compile cleanly, and the *Cgroup*/*MemoryWorker* gtests (6 tests, incl. DecideCgroupLevelAvailability) all pass.

The design question raised in review (whether the server should manage this vs. relying on cgroups / memory-pressure reclaim) is left for you to decide.

…ic-limit

The cgroup branch of `MemoryWorker::readAvailableForDynamicLimit` computed per-level headroom as `memory.max - memory.current`. `memory.current` includes reclaimable page cache (`active_file` + `inactive_file`) and reclaimable slab (`slab_reclaimable`), which the kernel frees under pressure before invoking the OOM killer. Counting them as usage shrinks the headroom on a read-heavy server with a warm page cache (`memory.current` close to `memory.max`) and can make ClickHouse throw `MEMORY_LIMIT_EXCEEDED` even though most of that memory is reclaimable. Read each cgroup v2 level's `memory.stat` and subtract its reclaimable bytes from `memory.current`, so the cgroup headroom mirrors the host-wide `MemAvailable` path. `resident`, the formula's baseline, already excludes reclaimable memory, so on a dedicated cgroup `resident + available` converges to `memory.max`. The cgroup v1 path is unaffected: leaf usage from `cgroups_reader` already excludes reclaimable memory. Add a unit test for the new `reclaimableFromCgroupV2Stat` helper. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
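
A minimal sketch of the reclaimable-memory correction, assuming the flat `key value` layout of cgroup v2 `memory.stat`, is given below. The function names are hypothetical and are not the `reclaimableFromCgroupV2Stat` helper itself; the three counters are the ones named in this commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <sstream>
#include <string>

// Sums the reclaimable counters from memory.stat-style text (one "key value" pair per line, in bytes).
uint64_t reclaimableFromStat(std::istream & in)
{
    uint64_t total = 0;
    std::string key;
    uint64_t value = 0;
    while (in >> key >> value)
        if (key == "active_file" || key == "inactive_file" || key == "slab_reclaimable")
            total += value;
    return total;
}

// Headroom = memory.max - (memory.current - reclaimable), with both subtractions saturating at zero.
uint64_t cgroupHeadroom(uint64_t memory_max, uint64_t memory_current, uint64_t reclaimable)
{
    const uint64_t non_reclaimable = memory_current > reclaimable ? memory_current - reclaimable : 0;
    return memory_max > non_reclaimable ? memory_max - non_reclaimable : 0;
}
```

On a cgroup with a warm page cache, most of `memory.current` is reclaimable, so the corrected headroom stays large where the raw `memory.max - memory.current` difference would be close to zero.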

alexey-milovidov · 2026-06-07T04:59:38Z

Merged latest master (was 482 commits behind) and pushed 46986e81306.

Addressed the two open clickhouse-gh review threads:

Cgroup headroom now mirrors MemAvailable — the cgroup v2 path read memory.max - memory.current, but memory.current includes reclaimable page cache and slab. On a read-heavy server with a warm cache (memory.current near memory.max) that shrank the headroom and could throw MEMORY_LIMIT_EXCEEDED for memory the kernel can reclaim. The worker now reads each level's memory.stat and subtracts its reclaimable bytes (active_file + inactive_file + slab_reclaimable) from memory.current. With resident (the baseline) already excluding reclaimable memory, resident + available converges to memory.max on a dedicated cgroup. Unit test added.
Resident-unit limit vs. tracker units — not an issue: the limit is installed on total_memory_tracker (level Global), where the hard limit is enforced against RSS as well as the tracked amount (will_be_rss > current_hard_limit), and rss is kept in resident units. The RSS check binds in the same units the limit is computed in, so there is no over-admission. Explained in-thread.

CI: the only failures are perf-comparison "slower" queries, 7 of 12 clustered on the arm/1 shard (rand, formats_columns_nullable, joins, etc.) — the classic single-noisy-runner signature, with no memory-subsystem relationship to this change (the MemoryWorker tick runs on a background thread and setHardLimit is an atomic store). The fresh run from this push should clear them.

…e limit adjustment Address review: a positive max_server_memory_usage_to_ram_ratio used to opt the server into runtime hard-limit shrinking with no way to keep the previous static cap, while setting the ratio to 0 removed the cap entirely. The new memory_worker_dynamic_hard_limit server setting (enabled by default) gates only the runtime adjustment: when disabled, max_server_memory_usage_to_ram_ratio caps the hard memory limit statically at startup and on configuration reload, exactly as in previous versions. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

alexey-milovidov · 2026-06-11T16:18:34Z

Pushed 4adb205478b: added the memory_worker_dynamic_hard_limit server setting (enabled by default) to address the remaining review thread — disabling it keeps only the static startup/reload cap from max_server_memory_usage_to_ram_ratio, as in previous versions. Verified the build and that the setting appears in system.server_settings.

The previous CH Inc sync failure was two unrelated flakes in the private CI (a session timeout at 99% with zero failed tests, and a SharedMergeTree backup restore-visibility race that passed on the in-job retry); no memory-related errors anywhere, and test_memory_limit_observer passed. The new push re-runs the sync.

clickhouse-gh · 2026-06-11T19:26:53Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	84.60%	84.60%	+0.00%
Functions	92.30%	92.30%	+0.00%
Branches	77.20%	77.20%	+0.00%

Changed lines: Changed C/C++ lines covered by tests: 256/311 (82.32%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 23 line(s) · Uncovered code


alexey-milovidov

Sounds very much okay.

Algunenano · 2026-06-15T15:19:43Z

+                            std::lock_guard apply_lock(dynamic_hard_limit_apply_mutex);
+                            if (settings_generation.load(std::memory_order_acquire) == gen_before)
+                            {
+                                LOG_TRACE(


This is a constant log spam:

2026.06.15 17:17:57.967192 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.97 GiB to 89.97 GiB (resident: 383.91 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.017390 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.97 GiB to 89.98 GiB (resident: 383.91 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.067590 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.91 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.117877 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.91 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.168146 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.91 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.268706 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.419694 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.469854 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.721332 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.771520 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:58.972527 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.022732 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.072957 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.123229 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.374414 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.675770 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.776425 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.826696 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.927067 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:17:59.977327 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:18:00.027615 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:18:00.077792 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.61 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:18:00.128096 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.61 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:18:00.278766 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.61 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:18:00.580050 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.61 GiB, ceiling: 112.88 GiB, ratio: 0.9)
2026.06.15 17:18:00.780905 [ 565461 ] {} <Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.61 GiB, ceiling: 112.88 GiB, ratio: 0.9)

The background `MemoryWorker` recomputes the total memory hard limit on every tick (every ~50ms) from `resident` and `MemAvailable`, both of which jitter by a few MiB on each read. The previous guard re-applied and logged the new limit whenever it differed from the current one by even a single byte, so on an otherwise idle server the trace log was flooded:

<Trace> MemoryWorker: Adjusting total memory hard limit from 89.98 GiB to 89.98 GiB (resident: 383.83 MiB, available: 99.60 GiB, ceiling: 112.88 GiB, ratio: 0.9)

repeated dozens of times per second, and `setHardLimit` was called every tick to no practical effect.

Adjust only when the change exceeds 1% of the current limit. A genuine memory-pressure shift moves the limit by far more than this, so the dynamic adjustment still reacts promptly when it actually needs to.

Discussion: ClickHouse#104964 (comment)
Related: ClickHouse#104964

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
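
The 1% guard described in the commit message can be sketched roughly as below. This is a minimal illustration, not the code from `src/Common/MemoryWorker.cpp`: the function name `isSignificantChange` and the integer-division threshold are assumptions made for the example.

```cpp
#include <cstdint>

/// Hypothetical helper (not the actual ClickHouse implementation).
/// Returns true only when the proposed hard limit differs from the current
/// one by more than 1% of the current limit, in either direction.
bool isSignificantChange(uint64_t current_limit, uint64_t proposed_limit)
{
    /// Absolute difference without unsigned underflow.
    const uint64_t diff = current_limit > proposed_limit
        ? current_limit - proposed_limit
        : proposed_limit - current_limit;

    /// Assumed threshold: 1% of the current limit, using integer division.
    /// With a current limit of 0, any non-zero proposal counts as significant.
    return diff > current_limit / 100;
}
```

With a current limit of about 89.98 GiB, as in the log above, this threshold is roughly 0.9 GiB, so the MiB-scale jitter in `resident` and `MemAvailable` would no longer trigger an adjustment or a log line.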

clickhouse-gh Bot added the pr-improvement label (Pull request with some product improvements) May 14, 2026

clickhouse-gh Bot reviewed May 14, 2026

Comment thread src/Common/MemoryWorker.cpp

clickhouse-gh Bot reviewed May 14, 2026

Comment thread src/Common/MemoryWorker.cpp

alexey-milovidov added the memory label (When memory usage is higher than expected) May 14, 2026

alexey-milovidov and others added 3 commits May 14, 2026 17:10

clickhouse-gh Bot reviewed May 15, 2026

Comment thread src/Common/MemoryWorker.cpp

alexey-milovidov and others added 2 commits May 15, 2026 15:34

Merge remote-tracking branch 'origin/master' into memory-worker-dynamic-limit

c297e58

azat reviewed May 15, 2026

Comment thread src/Common/MemoryWorker.cpp Outdated

Comment thread src/Common/MemoryWorker.cpp Outdated

Comment thread src/Common/MemoryWorker.cpp Outdated

azat self-assigned this May 15, 2026

clickhouse-gh Bot reviewed May 16, 2026

Comment thread src/Common/MemoryWorker.cpp Outdated

clickhouse-gh Bot reviewed May 16, 2026

Comment thread src/Common/MemoryWorker.cpp Outdated

alexey-milovidov mentioned this pull request May 17, 2026

Stop the bleeding in function_prop_fuzzer #105146

Merged

1 task

alexey-milovidov and others added 2 commits May 17, 2026 20:50

Merge branch 'master' into memory-worker-dynamic-limit

63e0e56

clickhouse-gh Bot reviewed May 18, 2026

Comment thread src/Common/MemoryWorker.cpp Outdated

alexey-milovidov added 4 commits May 19, 2026 19:15

Merge remote-tracking branch 'origin/master' into memory-worker-dynamic-limit

fb3f174

Merge remote-tracking branch 'origin/master' into memory-worker-dynamic-limit

3a6b59d

Merge remote-tracking branch 'origin/master' into memory-worker-dynamic-limit

4e8e393

clickhouse-gh Bot reviewed May 21, 2026

Comment thread src/Common/MemoryWorker.cpp

clickhouse-gh Bot reviewed Jun 2, 2026

Comment thread src/Common/MemoryWorker.cpp Outdated

clickhouse-gh Bot reviewed Jun 2, 2026

Comment thread src/Common/MemoryWorker.cpp

alexey-milovidov and others added 4 commits June 5, 2026 02:18

Merge remote-tracking branch 'origin/master' into memory-worker-dynamic-limit

24a2f36

clickhouse-gh Bot reviewed Jun 5, 2026

Comment thread src/Common/MemoryWorker.cpp

clickhouse-gh Bot reviewed Jun 5, 2026

Comment thread src/Common/MemoryWorker.cpp

alexey-milovidov mentioned this pull request Jun 6, 2026

Pipelined SQL syntax #101038

Open

alexey-milovidov and others added 2 commits June 7, 2026 04:44

Merge remote-tracking branch 'origin/master' into memory-worker-dynamic-limit

b49be01

clickhouse-gh Bot reviewed Jun 7, 2026

Comment thread src/Common/MemoryWorker.cpp

This was referenced Jun 10, 2026

Add embedded documentation for table engines #106177

Merged

Fix flaky 00377_shard_group_uniq_array_of_string_array under shared-runner memory pressure #107054

Closed

alexey-milovidov commented Jun 12, 2026

alexey-milovidov added this pull request to the merge queue Jun 12, 2026

Merged via the queue into master with commit d049b50 Jun 12, 2026
166 checks passed

alexey-milovidov deleted the memory-worker-dynamic-limit branch June 12, 2026 20:07

robot-clickhouse-ci-1 added the pr-synced-to-cloud label (The PR is synced to the cloud repo) Jun 12, 2026

Algunenano reviewed Jun 15, 2026

alexey-milovidov mentioned this pull request Jun 16, 2026

Adjust dynamic memory hard limit only on significant change #107645

Merged

1 task

groeneai mentioned this pull request Jun 27, 2026

Runner-wide OOM (OOM in dmesg / Server died) on Stateless test runners, especially arm_asan_ubsan, azure #108689

Open

Conversation

alexey-milovidov commented May 14, 2026 (edited by robot-clickhouse)

clickhouse-gh Bot commented May 14, 2026 (edited): AI Review

azat left a comment

alexey-milovidov commented May 15, 2026

alexey-milovidov commented May 17, 2026

alexey-milovidov commented May 17, 2026

alexey-milovidov commented Jun 5, 2026

alexey-milovidov commented Jun 7, 2026

alexey-milovidov commented Jun 11, 2026

clickhouse-gh Bot commented Jun 11, 2026: LLVM Coverage Report

alexey-milovidov left a comment

Algunenano Jun 15, 2026

4 participants