Add server setting `additional_memory_tracking_per_thread` by alexey-milovidov · Pull Request #104965 · ClickHouse/ClickHouse · GitHub
Skip to content

Add server setting additional_memory_tracking_per_thread#104965

Draft
alexey-milovidov wants to merge 54 commits into
masterfrom
additional-memory-tracking-per-thread
Draft

Add server setting additional_memory_tracking_per_thread#104965
alexey-milovidov wants to merge 54 commits into
masterfrom
additional-memory-tracking-per-thread

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented May 14, 2026

Copy link
Copy Markdown
Member

Each thread accumulates up to max_untracked_memory (4 MiB by default) of allocations before reporting them to the server-wide MemoryTracker. With many threads, this unreported memory can sum to a large amount, causing the server's tracked memory usage to under-count actual consumption and leading to OOM.

This PR introduces a server-level setting additional_memory_tracking_per_thread (default 4 MiB) and speculatively charges this amount to the server-wide MemoryTracker around every job executed in our ThreadPool workers. The global tracked memory becomes a safe upper bound on actual consumption.

The reservation is charged on the server-wide (total) tracker only — deliberately not through the query's tracker chain. Query-level accounting feeds heuristics that compare memory deltas against byte thresholds (conversion of aggregation hash tables to two-level via group_by_two_level_threshold_bytes, spill-to-disk decisions), and phantom reservations of num_threads * 4 MiB trip those thresholds immediately: an earlier revision of this PR that charged the query tracker showed consistent slowdowns of GROUP BY queries in performance tests for exactly this reason. With the server-wide-only reservation, query-level and user-level accounting (max_memory_usage, memory_usage in system.processes) are unaffected.

The speculative reservation uses the throwing path of the memory tracker, so when it would exceed the server memory limit the corresponding job is treated as failed with MEMORY_LIMIT_EXCEEDED — the same behavior as if the job itself had exceeded the limit.

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Added a new server setting additional_memory_tracking_per_thread (default 4 MiB) which speculatively reserves this amount on the server-wide memory tracker around every ThreadPool job. It compensates for the up to max_untracked_memory of un-reported allocations per thread, making the server's tracked memory a safe upper bound on actual consumption and reducing the risk of OOM with many concurrent threads. Query-level and user-level memory accounting are not affected.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)
Modify your CI run

NOTE: If your merge the PR with modified CI you MUST KNOW what you are doing
NOTE: Checked options will be applied if set before CI RunConfig/PrepareRunConfig step

Include tests (required builds will be added automatically):

  • Fast test
  • Integration tests
  • Stateless tests
  • Stateful tests
  • Unit tests
  • Performance tests
  • All with ASan
  • All with TSan
  • All with MSan
  • All with UBSan
  • All with Coverage
  • All with Aarch64

Exclude tests:

  • Fast test
  • Integration tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASan
  • All with TSan
  • All with MSan
  • All with UBSan
  • All with Coverage
  • All with Aarch64

Extra options:

  • Add tests with aarch64 builds
  • do not test (only style check)
  • disable merge-commit (no merge from master before tests)
  • disable CI cache (job reuse)

Only specified batches in multi-batch jobs:

  • 1
  • 2
  • 3
  • 4

Each thread accumulates up to `max_untracked_memory` (4 MiB by default)
of allocations before reporting them to the server-wide `MemoryTracker`.
With many threads, this unreported memory can sum to a large amount,
causing the server's tracked memory to under-count actual consumption
and leading to OOM.

Introduce a server-level setting `additional_memory_tracking_per_thread`
(default 4 MiB) and charge this amount to the `MemoryTracker` around
every job executed in our `ThreadPool` workers. The global tracked
memory becomes a safe upper bound on actual consumption.

The speculative allocation uses the throwing path of `CurrentMemoryTracker`,
so when it would exceed the limit the corresponding job is treated as
failed with `MEMORY_LIMIT_EXCEEDED` — the same behavior as if the job
itself had exceeded the limit.
@clickhouse-gh

clickhouse-gh Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-improvement Pull request with some product improvements label May 14, 2026
Comment thread src/Common/ThreadPool.cpp Outdated
@alexey-milovidov alexey-milovidov added the memory When memory usage is higher than expected label May 14, 2026
alexey-milovidov and others added 2 commits May 14, 2026 18:16
…read group

The original commit reserved the memory inside `ThreadPoolImpl::worker`,
before `job_data->job()` ran. The query's thread group is only attached
inside the job itself (via `ThreadGroupSwitcher`), so the speculative
`CurrentMemoryTracker::alloc` ran against the global tracker only and,
when it threw `MEMORY_LIMIT_EXCEEDED`, the job was skipped entirely.

The pipeline jobs submitted by `PullingAsyncPipelineExecutor`,
`PushingAsyncPipelineExecutor` and `PipelineExecutor::spawnThreads`
rely on running to completion to push a "done" sentinel into the
output queue; when their job is skipped, the consumer hangs in
`ConcurrentBoundedQueue::popImpl`. `KILL QUERY ... SYNC` then also
hangs in `PullingAsyncPipelineExecutor::cancel` waiting on the same
unsignalled event. Reproduced with 200 concurrent SELECTs in an 8 GiB
cgroup -- 56 queries hung permanently and could not be killed.

Move the reservation into the per-job lambdas, after the
`ThreadGroupSwitcher` is constructed and inside the same `try` block
that propagates exceptions through the pipeline's normal failure path.
The throw now goes to the query's `MemoryTracker` chain and the
existing `catch` finishes the executor (signalling consumers).

Verified with the same 200-query workload:
* before this commit: 56 hung queries, server unkillable
* after:              clean `MEMORY_LIMIT_EXCEEDED` on every query that
                      hits the limit, server stays alive

With `additional_memory_tracking_per_thread` bumped from the default
4 MiB to 8 MiB, the workload survives end-to-end (cgroup peak 7.99 GiB,
`memory.events:oom_kill = 0`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* `04238_additional_memory_tracking_per_thread.sql` -- single-query
  smoke test. Verifies the server setting is visible, and that a query
  whose `max_memory_usage` cannot accommodate the speculative reservation
  fails with `MEMORY_LIMIT_EXCEEDED` instead of hanging. A hang would
  surface as the runner's per-test timeout firing.

* `04239_additional_memory_tracking_per_thread_concurrent.sh` --
  regression test for the original deadlock: 64 concurrent SELECTs are
  launched in parallel, each with `max_memory_usage = 1`, so every one
  trips the speculative reservation inside the pipeline executor. The
  bug it guards against was that the speculative throw, raised before
  the query's thread group was attached, was swallowed by the
  ThreadPool worker -- the pipeline consumer then blocked forever on
  the empty output queue. The test runs each client through `timeout`
  and counts how many did not exit (`hung`); the expected count is
  zero, and a final `SELECT 'survived'` confirms the server is still
  serving traffic afterwards.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/Core/ServerSettings.cpp Outdated
alexey-milovidov and others added 2 commits May 14, 2026 23:54
Many stateless tests intentionally use very small `max_memory_usage`
values (down to ~10 MiB) to exercise specific memory-limit behaviour.
The PR's production default of 4 MiB speculative reservation per thread
pushes those queries over their limits and breaks the tests. Affected
tests included `01017_uniqCombined_memory_usage`,
`01019_Buffer_and_max_memory_usage`,
`01062_pm_all_join_with_block_continuation`,
`01062_pm_multiple_all_join_same_value`, `01109_inflating_cross_join`,
`01278_min_insert_block_size_rows_for_materialized_views`,
`03030_system_flush_distributed_settings`,
`03217_primary_index_memory_leak`,
`03829_insert_deduplication_info_memory`, and
`04146_spilling_hash_join_low_threshold`.

Set `additional_memory_tracking_per_thread = 0` via a test config so the
existing tests keep covering what they were written to cover. The
feature itself remains covered by `04238_additional_memory_tracking_per_thread`
and `04239_additional_memory_tracking_per_thread_concurrent`, both of
which still pass with the setting disabled because they rely on
`max_memory_usage = 1` rather than on the speculative reservation
specifically tripping the limit.

CI report: #104965

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread tests/config/config.d/additional_memory_tracking_per_thread.xml Outdated
Comment thread programs/server/Server.cpp Outdated
alexey-milovidov and others added 8 commits May 15, 2026 06:45
The stateless-test config now sets `additional_memory_tracking_per_thread = 0`
so it does not interfere with the many existing tests that use intentionally
small `max_memory_usage` values. With the speculative reservation off, the
1-byte query limit no longer trips reliably from the reservation itself --
small in-query allocations sit in `max_untracked_memory` (4 MiB by default)
and never reach the query's `MemoryTracker`.

Add `max_untracked_memory = 0` to the failing queries so every allocation
is reported immediately. The first in-query allocation now exceeds the
1-byte limit and throws inside the pipeline lambda, exercising the same
`catch` block (in `PipelineExecutor` / `PullingAsyncPipelineExecutor` /
`PushingAsyncPipelineExecutor`) that the speculative reservation would
have exercised -- which is the actual property the test cares about
(no hang on the consumer's output queue).

Also redirect `wait` stderr in `04239_*_concurrent.sh` so bash's
job-control messages ("Killed", "Terminated") do not contaminate the
test stderr when the OS happens to SIGKILL background clients.

CI report: #104965

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Mirror the stateless-test config in the shared integration-test instance
config so the per-thread speculative reservation is also off there.

Several existing integration tests intentionally set `max_memory_usage = 1`
to verify constraint clamping and early-memory-limit behaviour
(`test_early_memory_limit_exception`,
`test_settings_constraints_distributed::test_select_clamps_settings`,
`test_settings_constraints_distributed_ddl::test_ddl_worker_clamps_settings_constraints`).
The production default of 4 MiB speculative reservation pushes those
queries over the limit at the local pipeline executor before the
constraint clamp can take effect, so they fail with
`MEMORY_LIMIT_EXCEEDED`. Disabling the feature in the common config
keeps the existing assertions intact; the feature itself is still
covered by the dedicated stateless tests
`04238_additional_memory_tracking_per_thread` and
`04239_additional_memory_tracking_per_thread_concurrent`.

CI report: #104965

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The test 04117_max_threads_min_free_memory_per_thread spins up
`clickhouse-local` with a tight `max_server_memory_usage = 4G` to make the
free-memory limiter actually kick in for `max_threads_min_free_memory_per_thread`.
The production default of 4 MiB speculative per-thread reservation interacts
badly with that small server limit, intermittently failing the test with
`MEMORY_LIMIT_EXCEEDED` before the limiter under test can take effect.

Set `additional_memory_tracking_per_thread = 0` in the same temp config so
the speculative reservation is off for this test only. The feature itself
is still covered by `04238_additional_memory_tracking_per_thread` and
`04239_additional_memory_tracking_per_thread_concurrent`.

CI report: #104965

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two corrections from the review of #104965:

* `programs/server/Server.cpp` cast a `UInt64` setting straight to `int64_t`
  before storing it in the atomic that the executors read. A misconfigured
  value above `INT64_MAX` would wrap negative and silently disable the
  feature via the `> 0` check. Clamp at `INT64_MAX` before the cast.

* `tests/queries/0_stateless/04239_additional_memory_tracking_per_thread_concurrent.sh`
  captured `$?` inside an `if ! wait "$pid"; then` branch, where the status
  reflects the negated condition (always 0), not the child's real exit
  code. As a result the timeout-firing path (exit 124) was never seen and
  `hung` could not increment -- the regression test would silently pass
  even when queries hung. Capture the real `wait` status into `rc` first.

CI report: #104965

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`CurrentMemoryTracker::alloc` first buffers in the per-thread
`untracked_memory` counter and only flushes to the parent trackers when
`untracked_memory > untracked_memory_limit` (strict `>`). With the default
`additional_memory_tracking_per_thread = 4 MiB` and the default
`max_untracked_memory = 4 MiB`, our 4 MiB reservation sets
`untracked_memory` to exactly 4 MiB, does not exceed the limit, and so
never reaches the global tracker. The undercount that this PR is trying
to close therefore still remains, and the throwing behaviour the
pipeline `catch` block expects can never trigger in the default case.

Force a flush right after the speculative `alloc` and re-run the limit
check explicitly. If the reservation pushes the query's tracker over its
limit, `CurrentMemoryTracker::check` throws `MEMORY_LIMIT_EXCEEDED`,
which the surrounding `catch` propagates through the pipeline as before
(no consumer hang). The `SCOPE_EXIT` was moved above the `alloc` and
guarded by a `reserved` flag so the matching `free` only runs when the
allocation succeeded, otherwise we would double-decrement when
`CurrentMemoryTracker::alloc` itself throws (it already rolls back
internally).

CI report: #104965

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…thread`

The setting description said the charge is applied for \"every thread
executing a job in a \`ThreadPool\`\" and that server-wide tracking
becomes a safe upper bound, but the reservation is only wired into the
three pipeline executors (\`PipelineExecutor\`,
\`PullingAsyncPipelineExecutor\`, \`PushingAsyncPipelineExecutor\`).
Other \`ThreadPool\` users (background merges, fetches, etc.) still
keep up to \`max_untracked_memory\` per thread unreported.

Narrow the documented contract to match the implementation and call out
the limitation explicitly so it does not surprise operators.

CI report: #104965

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ckhouse-local`

`Server.cpp` propagates the server setting `additional_memory_tracking_per_thread`
into the global atomic that the pipeline executors read, but `LocalServer.cpp`
did not. As a result `clickhouse-local --config-file ...` ignored the setting
entirely: the atomic stayed at the 4 MiB default regardless of what the config
said.

That is why 04117 was still failing on every build after disabling the
speculative reservation in its config -- the disablement was a no-op.

Apply the same store in `LocalServer::processConfig` (mirroring `Server.cpp`,
including the `INT64_MAX` clamp so a huge misconfigured value cannot wrap
negative).

This also addresses the review comment from `clickhouse-gh`
(#104965 (comment)...) on
`tests/config/config.d/additional_memory_tracking_per_thread.xml`: the new
test `04240_additional_memory_tracking_per_thread_speculative.sh` runs
`clickhouse-local` with a private config that re-enables the setting and
verifies that the speculative reservation -- not `max_untracked_memory = 0` --
trips the limit and surfaces as a normal `MEMORY_LIMIT_EXCEEDED`. The earlier
04238 / 04239 tests force the same exception path via `max_untracked_memory`,
so they pass even if the speculative path is broken; 04240 closes that gap.

CI report: #104965

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`04239_additional_memory_tracking_per_thread_concurrent` spawns 64 background
`clickhouse-client` processes and gives them 15 s to finish. Under the flaky
check that runs alongside many other parallel tests, the per-process startup
overhead alone can take longer than that, so up to 24 / 64 clients failed to
return inside the window and the test reported a phantom hang
(`hung=24` vs the expected `hung=0`).

The regression we are guarding against would manifest as queries never
finishing at all -- so finishing slowly under heavy CI load is fine, but
finishing slowly inside a tight timeout is a false alarm. Lower the
concurrency to 32 (still 128 worker jobs across the 4-thread queries -- more
than enough to surface the original deadlock) and raise the timeout to 60 s
so the test has room to breathe.

CI report:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104965&sha=51217d026f4482d5c6bde582e9e0a11be51a7fcd&name_0=PR&name_1=Stateless%20tests%20%28amd_asan_ubsan%2C%20flaky%20check%29

PR: #104965

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexey-milovidov

alexey-milovidov commented May 15, 2026

Copy link
Copy Markdown
Member Author

Do not disable this setting for tests; it should be on.
It is only ok to edit specific tests.

@azat azat self-assigned this May 15, 2026

@azat azat left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, this does not cover allocations not from pipeline, and also may lead to significant memory over accounting

I think this should be a better alternative - #104125

@alexey-milovidov

alexey-milovidov commented May 15, 2026

Copy link
Copy Markdown
Member Author

and also may lead to significant memory over accounting

Memory overaccounting is the goal of this PR. It is exactly what it does - makes sure we track enough additional memory per thread.

Your PR #104125 is way more complex and it is still a draft, how could it be a better alternative?

@azat

azat commented May 15, 2026

Copy link
Copy Markdown
Member

is way more complex and it is still a draft, how could it be a better alternative?

  • This solution does not cover lots of other thread pools (remote IO, IO, e.t.c.), that can allocate enough memory
  • It will not have huge memory reservations (only up to 4MiB*used CPUs)
    • Right now it does not reserve memory at all, I've tried it in one of iterations, but decided to revert for now, until the performance will be on par

Also it is interesting that performance shows some difference here, probably because some lightweight queries may never enter MemoryTracker before

`04028_arrow_stream_format_detection_memory.sh` uses `clickhouse-local`
with a deliberately tight `max_memory_usage = 100Mi` to guard against the
regression of issue #65036 (ArrowStream format detection allocating
~514 MiB). With the production default of `additional_memory_tracking_per_thread = 4 MiB`
and randomized `max_threads`, each pipeline worker thread charges 4 MiB
upfront against the query tracker, pushing the cap over the edge
intermittently:

    Code: 241. DB::Exception: Query memory limit exceeded:
      would use 100.02 MiB (attempt to allocate chunk of 4.00 MiB),
      maximum: 100.00 MiB.

The retry harness reran the same randomized settings 89 times without a
single failure, confirming the flake is at the edge of the limit rather
than a systematic over-allocation.

Mirror the workaround used in `04117_max_threads_min_free_memory_per_thread.sh`:
generate a private config that sets `additional_memory_tracking_per_thread = 0`
and pass it via `--config-file`. The stateless-test server config already
disables the setting (`tests/config/config.d/additional_memory_tracking_per_thread.xml`),
but it is not picked up by `clickhouse-local`, which reads its own config.

Failing CI report: #104965
(Stateless tests (arm_binary, parallel) at commit 6e029d8).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

It's strange that incrementing a few atomics is comparable to spawning a thread. Then we will have to optimize MemoryTracker.

/// each spawned worker job carries its own (see `spawnThreads`) while the calling
/// thread only waits. A throw lands in the surrounding `catch`, which cancels the
/// pipeline and rethrows to the caller.
SpeculativeMemoryReservation speculative_memory_reservation;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

executeImpl now reserves for spawned workers and the single-thread execute path, but PipelineExecutor::executeStep is still uncovered. PullingPipelineExecutor and PushingPipelineExecutor drive real processor work through executeStep on the caller thread, and many server paths use those wrappers, so those active pipeline workers can still keep up to max_untracked_memory unreported while the setting docs promise a reservation for every PipelineExecutor worker.

Please either put the same server-wide reservation around the executeStepImpl path, without double-charging executeImpl, or explicitly exclude synchronous executeStep executors from the setting contract and tests.

…mory tests

The redesign in `41654c5fdf9` charged the speculative reservation on the
server-wide `total_memory_tracker` (via `CurrentMemoryTracker::allocGlobal`)
and reverted all per-test compensations, on the basis that query-level
accounting now matches master. That holds for tests that constrain
`max_memory_usage` (query/user level), but not for tests that exercise a tight
`max_server_memory_usage` (server/total level) for an unrelated purpose: those
are now charged directly by the reservation.

The enforced 4 MiB-per-thread reservation is a single chunk reported to the
total tracker, so — unlike master, where the same tiny allocations stay under
`max_untracked_memory` and never hit the enforced total check — it trips
`max_server_memory_usage` on the first pipeline job of every query whenever the
tracked/RSS amount is already near the limit. In CI this fails deterministically
(`04117` across all build types: `(total) memory limit exceeded: would use
7.07 GiB ... maximum: 3.73 GiB`, reproducible 15/15 without randomized
settings), and the two integration tests assert behaviour at a tight server
budget (`test_failed_async_inserts` uses `max_server_memory_usage=1000`).

This is intended for the feature itself (see `04240`, which makes a single
reservation exceed the limit on purpose), so the fix is to isolate the
unrelated server-memory tests from the production default, exactly as before
the redesign. Query-level test reverts in `41654c5fdf9` are left as-is; they
pass.

Restored:
- `04117_max_threads_min_free_memory_per_thread` (exercises
  `max_threads_min_free_memory_per_thread` under `max_server_memory_usage=4G`)
- `test_early_memory_limit_exception` (tight `max_server_memory_usage` budget)
- `test_failed_async_inserts` (`max_server_memory_usage=1000`)

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104965&sha=41654c5fdf96e16288a69deba287ff501cc2535c&name_0=PR&name_1=Stateless%20tests%20%28arm_binary%2C%20parallel%29

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

The fresh CI on the global-tracker redesign (41654c5fdf9) surfaced one deterministic, all-builds failure plus two integration failures, all the same root cause.

04117_max_threads_min_free_memory_per_thread fails on every stateless build with Code: 241. (total) memory limit exceeded: would use 7.07 GiB (attempt to allocate chunk of 4.00 MiB) ... maximum: 3.73 GiB (reproducible 15/15 without randomized settings). The redesign reverted all per-test compensations on the basis that query-level accounting now matches master — true for max_memory_usage tests, but 04117 and the two failing integration tests exercise a tight max_server_memory_usage (server/total) budget for an unrelated purpose, and the reservation is now charged directly on that tracker. Unlike master — where the equivalent tiny allocations stay under max_untracked_memory and never hit the enforced total check — the enforced 4 MiB chunk trips max_server_memory_usage on the first pipeline job of every query once the tracked/RSS amount is near the limit (on the CI runner RSS is already ~7 GiB vs the test's 3.73 GiB cap).

This is intended for the feature itself (04240 deliberately makes one reservation exceed the limit), so the fix in dacaa8127a3 just restores the pre-redesign additional_memory_tracking_per_thread=0 isolation for the three server-memory tests (04117, test_early_memory_limit_exception, test_failed_async_inserts). The query-level reverts from 41654c5fdf9 are left as-is — they pass.

04257_intersect_empty_input (amd_tsan, 1 job) is an unrelated transient TIMEOUT_EXCEEDED (4131 ms > 2000 ms under TSan); reruns passed 56/56.

Left untouched for your call: the PipelineExecutor.cpp:664 bot thread notes executeStep (the synchronous PullingPipelineExecutor/PushingPipelineExecutor drive path) is not yet covered by the reservation — a charge-site coverage decision, consistent with the scope you're iterating on.

Branch is ~5 days / 1804 commits behind master; since the reds are PR-caused (not staleness/flakes), I did not force a master merge.

{
/// The returned trace is intentionally dropped: no actual allocation backs
/// this reservation, so it must not be reported to the allocation profiler.
std::ignore = total_memory_tracker.allocImpl(size, /*enforce_memory_limit=*/ true);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

allocGlobal calls total_memory_tracker.allocImpl with the default query_tracker = nullptr. That skips the global overcommit path: MemoryTracker::allocImpl only calls OvercommitTracker::needToStopQuery when the total tracker has a non-null query/process tracker, while normal allocations provide it as they propagate through the thread -> process -> total tracker chain.

With this path, a speculative reservation that crosses max_server_memory_usage throws immediately instead of waiting for/killing the selected overcommitted query, so queries can be rejected under global memory pressure in cases where an actual allocation from the same query would have been allowed to continue. Can we pass the current process-level tracker to the total tracker for overcommit decisions without charging it, or explicitly document and test that reservation failures bypass global memory overcommit?

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Picked this up again via /continue-pr. No conflicts (MERGEABLE), nothing new to push this round — worked through the red CI and re-reviewed the open threads.

CI

The only real failure is AST fuzzer (amd_debug): Logical error: Column identifier A is already registered (STID: 4697-4326) on a DELETE/mutation planner path (MutationsInterpreter::validatebuildSubqueryPlansForSetsAndAddGlobalPlannerContext::createColumnIdentifier). This is a master-wide analyzer/planner fuzzer flake unrelated to a server memory-tracking setting: the Column identifier ... is already registered class fires across dozens of distinct PRs every day over the full 90-day CIDB window (including pull_request_number = 0, i.e. master itself, and 8 different PRs on 2026-06-13). Re-ran the failed job. CH Inc sync and Mergeable Check are derived from this same red and will clear when the re-run passes.

Did not re-merge master: merge-base is ~1 day old (3659e5601ea, 2026-06-13) and the red is a master-wide flake that a re-merge would not clear (it is still present on master).

Open review threads (design calls — left to you)

The two outdated ThreadPool.h / changelog wording threads are already settled by your replies ("Leaving this as your design call"). Two non-outdated bot threads remain, both accurate observations but both design/scope decisions on the feature contract, so I did not change them:

  • PipelineExecutor.cpp:664 — the reservation covers executeImpl/spawnThreads (and CompletedPipelineExecutor), but the synchronous executeStep/executeStepImpl path driven by PullingPipelineExecutor/PushingPipelineExecutor on the caller thread is not covered. Whether to extend coverage there (without double-charging executeImpl) or to explicitly scope it out is a feature-scope call.
  • CurrentMemoryTracker.cpp:129allocGlobal calls total_memory_tracker.allocImpl with query_tracker = nullptr, which skips the overcommit path (MemoryTracker::allocImpl only calls OvercommitTracker::needToStopQuery when query_tracker != nullptr, MemoryTracker.cpp:434). So a reservation crossing max_server_memory_usage throws MEMORY_LIMIT_EXCEEDED immediately instead of waiting for/killing the selected overcommitted query. The immediate-throw behavior matches the PR description's stated intent; whether a server-wide speculative reservation (which has no specific query to blame) should participate in query-level overcommit at all is your call.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

This is interesting. Why do we end up doing it only for PipelineExecutor threads, but not for every thread?
And wouldn't it be easier to ask the OS directly about the number of currently active threads and subtract this amount from max_server_memory_usage?

…racking-per-thread

# Conflicts:
#	tests/queries/0_stateless/04117_max_threads_min_free_memory_per_thread.sh

void CurrentMemoryTracker::freeGlobal(Int64 size)
{
std::ignore = total_memory_tracker.free(size);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

freeGlobal always subtracts the speculative reservation from total_memory_tracker, but MemoryWorker::updateAllocated can replace that same counter from RSS/jemalloc state while the reservation is still live (memory_worker_correct_memory_tracker=1, and also the worker's first correction). Concrete trace: a worker reserves 4 MiB, the memory worker corrects the total tracker to resident memory (which does not include the artificial reservation), then this freeGlobal(4 MiB) subtracts from the corrected value. The total tracker is now below the externally-corrected usage by 4 MiB per reservation that was live during the correction, defeating the "safe upper bound" invariant and potentially re-admitting memory under max_server_memory_usage.

Can we keep speculative reservations in a separate counter and include them when correcting the total tracker, or otherwise make freeGlobal subtract only reservations still represented in the total counter?

Resolve conflict in `src/Processors/Executors/PipelineExecutor.cpp`:
master refactored `executeSingleThread` to take a `WorkloadResources`
argument (`WorkloadResources(slot, process_list_element)`) instead of a
raw slot pointer, while this branch added a `SpeculativeMemoryReservation`
RAII local just before the call in both the multi-threaded `spawnThreads`
worker and the single-threaded branch of `executeImpl`. The two changes
are orthogonal: keep the reservation local and adopt master's new call
signature. Also keep both new includes (`Common/CurrentMemoryTracker.h`
from this branch and `Common/Scheduler/MemoryReservation.h` from master).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
alexey-milovidov and others added 3 commits June 27, 2026 22:04
Resolve three test-isolation conflicts where master refactored tests that
the PR had instrumented to disable `additional_memory_tracking_per_thread`:

- `test_early_memory_limit_exception`: master moved the server config to
  `overrides/server.yaml` + `overrides/users.yaml` and added `stay_alive`;
  kept master's structure and re-added the PR's disable config alongside it.
- `test_failed_async_inserts`: master dropped `configs/config.xml` (now uses
  per-query `max_memory_usage` instead of a server-level limit); kept only the
  PR's `additional_memory_tracking_per_thread` disable config.
- `04117_max_threads_min_free_memory_per_thread.sh`: master switched from a
  generated config file to inline `clickhouse-local` server options; ported the
  PR's `--additional_memory_tracking_per_thread=0` into the `server_opts` array.

No source-code conflicts; all functional changes auto-merged.
Master advanced past the `04238`/`04239`/`04240` prefixes used by the
`additional_memory_tracking_per_thread` tests while this branch was open, so
those prefixes are now shared with unrelated tests added on `master`. Renumber
to the next free sequential numbers and update the embedded temp-config name in
the speculative test:

  04238 -> 04489  additional_memory_tracking_per_thread
  04239 -> 04490  additional_memory_tracking_per_thread_concurrent
  04240 -> 04491  additional_memory_tracking_per_thread_speculative

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Picked this up via /continue-pr. Merged master (was 558 behind, last merge 2 days ago, CI red — clean, no conflicts; merge 68f7783590c) and pushed a test renumber (442b6329961).

CI

The red was the Stress test (amd_msan) cluster: Logical error: 'Unexpected exception in refresh scheduling' (STID: 2508-3e7b) in RefreshTask::doScheduling, with Cannot start clickhouse-server / Check failed as downstream symptoms of that crash. This is #106737, which was closed today as COMPLETED by #105588 (67a1d74c12b, "Stop refreshable MV gracefully on Keeper without MULTI_READ on ATTACH/restore"). That fix is now an ancestor of HEAD via the master merge, so the cluster should clear on the next run — it is unrelated to memory tracking.

The Stress test (arm_msan) Hung check failed was on the same pre-fix merge; its backtrace is a generic TCPHandler::runImpl query path with no memory-tracker signature. Re-running on the fresh merge to confirm.

Tests

Master advanced past the 04238/04239/04240 prefixes while the branch was open, so they now collide with unrelated tests on master. Renumbered to 04489/04490/04491 (and updated the embedded temp-config name in the speculative test). All six affected TUs (CurrentMemoryTracker, ThreadPool, PipelineExecutor, ServerSettings, Server, LocalServer) compile after the merge.

Review threads

Left the open threads untouched as your design calls: the ThreadPool.h comment scope and the changelog wording (both deliberately reverted to the broad "every ThreadPool job" planned-direction wording), the PipelineExecutor::executeStep coverage question, and the two CurrentMemoryTracker global-tracker consequences (overcommit bypass in allocGlobal, and the MemoryWorker correction interaction in freeGlobal) — all downstream of the deliberate "charge the total tracker only" redesign.

Comment thread programs/server/Server.cpp Outdated
alexey-milovidov and others added 2 commits June 29, 2026 23:55
…memory

The setting is `UInt64`, but its value is consumed as a signed `int64_t` delta
that the pipeline executor adds directly into the total `MemoryTracker`
(`will_be = size + amount.fetch_add(size)` in `MemoryTracker::allocImpl`).

Clamping a misconfigured huge value at `INT64_MAX` is not safe: on a running
server `amount`/`rss` are already positive, so a near-`INT64_MAX` reservation
overflows that signed addition, wraps negative and corrupts the total tracker
before the hard-limit check can reject it. The corrupted (negative) total then
silently passes the limit check, so the query wrongly succeeds while server
memory accounting is left broken.

Clamp the value to the physical server memory instead: a per-thread reservation
can never sensibly exceed the total RAM, and the result stays far below
`INT64_MAX`, so the tracker arithmetic cannot overflow. The same rule is applied
in both `clickhouse-server` config reload and `clickhouse-local`.

Adds `04492_additional_memory_tracking_per_thread_oversized` which configures
the setting to `UInt64` max and verifies the query fails cleanly with
`MEMORY_LIMIT_EXCEEDED` instead of corrupting the tracker.

Addresses an AI review finding on
#104965

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed b3622effa08 (on top of a clean merge of current master):

  • Fixed the overflow Major (the only new finding in the latest AI review): clamping additional_memory_tracking_per_thread at INT64_MAX was unsafe, because MemoryTracker::allocImpl computes will_be = size + amount.fetch_add(size) and on a running server amount/rss are already positive, so a near-INT64_MAX reservation overflows the signed addition and corrupts the total tracker before the hard-limit check. The value is now clamped to the physical server memory (getMemoryAmount()) in both Server.cpp (config reload) and LocalServer.cpp; a per-thread reservation can never sensibly exceed total RAM, and the result stays far below INT64_MAX. Added 04492_additional_memory_tracking_per_thread_oversized, which sets the value to UInt64 max and asserts the query fails cleanly with MEMORY_LIMIT_EXCEEDED instead of succeeding on a corrupted tracker. Thread resolved.
  • The remaining open review threads (broad ThreadPool scope wording, executeStep-driven executors, overcommit bypass, MemoryWorker correction interaction) are intentional design decisions already discussed inline.

The only red CI check, 01244_optimize_distributed_group_by_sharding_key, is an arm_binary, parallel infrastructure timeout flake: over the last 3 days, 12 different distributed/heavy tests each timed out once on that job across many PRs (including master itself and the sibling 01247_optimize_distributed_group_by_sharding_key_dist_on_dist). A default-4 MiB server-wide reservation cannot make a distributed GROUP BY test time out; the fresh CI run re-runs it.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

CI for b3622effa08 finished. Both failures are unrelated flakes, not regressions from this change:

1. Stress test (amd_tsan)Logical error: 'std::exception. Code: 1001, type: std::future_error, e.what() = The associated promise has been destructed prior to the associated state becoming ready' (STID 2508-38c6).
It aborts in MergePlainMergeTreeTask::executeSteptryLogCurrentException on a raw std::future_error that escaped merge_task->execute() (query id …::201403_1_31_1, i.e. a background merge). This is the long-standing broken-promise race in the async-read / merge teardown path (#55155, which CheSema noted is "still with us"), independent of this PR: SpeculativeMemoryReservation is constructed once at the entry of each PipelineExecutor worker — before any pipeline future/promise exists — inside the worker's try, so a failing reservation surfaces as a MEMORY_LIMIT_EXCEEDED DB::Exception routed through finish, never as a raw std::future_error. It cannot produce a broken promise.

2. Fast test (arm_darwin)03135_keeper_client_find_commands.
A pre-existing keeper-client test, failing on the macOS shard only; this PR touches no Keeper code. The find_super_nodes / find_big_family output in the failure log matches the reference, so the difference is an environment/connection issue specific to arm_darwin.

The latest AI review (❌) restates only the four findings it itself tags [dismissed by author] (freeGlobal vs MemoryWorker::updateAllocated correction window, uncovered executeStep, allocGlobal overcommit bypass, and the ThreadPool.h / changelog wording) — the documented trade-offs of charging the server-wide tracker only. No new actionable issue, so no code change; the two flaky checks just need a re-run.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Re-ran both failed jobs (attempt 2 now in progress). CIDB master history confirms both are widespread fleet flakes, not regressions from this change:

  • Stress test (amd_tsan) broken-promise future_error (STID 2508-38c6): hit on ~25 distinct PRs over the last 30 days plus master itself (pull_request_number = 0 on 2026-06-09) — the recurring #55155-class async-read / merge-teardown race. As noted above, SpeculativeMemoryReservation is constructed at PipelineExecutor worker entry inside the worker try and can only surface as MEMORY_LIMIT_EXCEEDED, never as a raw broken promise.
  • Fast test (arm_darwin) 03135_keeper_client_find_commands: hit on ~20 distinct PRs over the last 30 days, all on the macOS (arm_darwin) shard; this PR touches no Keeper code.

No code change this round.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Picked this up via /continue-pr. MERGEABLE, no conflicts, nothing to push this round.

Public CI for b3622effa08 is now fully green — the two flakes you re-ran (the amd_tsan broken-promise std::future_error of the #55155 class and 03135_keeper_client_find_commands on the arm_darwin shard) both passed.

The only remaining red is CH Inc sync (tests failed). I checked the private sync run: three failures, all flaky and none related to memory tracking. There is no MEMORY_LIMIT_EXCEEDED/MemoryTracker signature anywhere, and this PR's own test_early_memory_limit_exception passed there. Two were integration tests in unrelated subsystems that passed on automatic retry, and one was a Stress test Hung check on a single AST-fuzzer formatQuery(...) query (~2 MB memory) that the harness itself auto-marked flaky against #107941. I re-ran the failed private jobs (attempt 2 in progress).

The 5 unresolved threads remain the reserved design trade-offs of charging the server-wide tracker only (freeGlobal/updateAllocated correction window, allocGlobal overcommit bypass, executeStep coverage, the ThreadPool.h/changelog wording) — your call.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Picked this up via /continue-pr. Head unchanged (b3622effa08), nothing to push this round.

The CH Inc sync loop from the last round is now closed: your reruns of the private integration jobs (old analyzer, 1/7 & 5/7) drained the Cloud-runner queue and passed, so CH Inc sync is now green. Public CI stays fully green (155 success / 19 skipped / 0 failures), and a dry-run merge of current master into the branch is conflict-free — so the PR is effectively fully green.

Independently re-verified the 5 remaining unresolved threads against the current head; all are your reserved design/scope calls, not overridden:

  • ThreadPool.h:460 comment + changelog "every ThreadPool job" wording — the intended planned-scope direction (reverted the narrowing in bdc05f16e52).
  • PipelineExecutor.cpp:776executeStep (the synchronous PullingPipelineExecutor/PushingPipelineExecutor drive path) is uncovered; you already marked this "left untouched for your call — a charge-site coverage decision."
  • CurrentMemoryTracker.cpp:142 allocGlobal overcommit bypass — consistent with the documented throwing-path intent (MEMORY_LIMIT_EXCEEDED immediately).
  • CurrentMemoryTracker.cpp:147 freeGlobal vs MemoryTracker::updateAllocated — the AI's ❌ blocker. Confirmed mechanically real (updateAllocated does amount.exchange(external), discarding a live reservation, so a following freeGlobal drops amount below observed usage), but it is gated on memory_worker_correct_memory_tracker (default 0; otherwise only first_run/total<0), transient (self-heals on the next MemoryWorker tick), bounded by live_reservations * additional_memory_tracking_per_thread, and inherent to the exchange-based correction mechanism rather than specific to this reservation. A separate live-reservation counter that updateAllocated re-adds would restore the strict upper bound if you want it — leaving that as your design call.

Remaining blockers are non-CI: un-draft + review approval + your adjudication of the threads above.

# Conflicts:
#	src/Core/ServerSettings.cpp
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed cc7c847c859: merged current master to resolve the conflict (the PR was CONFLICTING/DIRTY; it is now MERGEABLE).

The only conflict was in src/Core/ServerSettings.cpp. Since the last push, master ran a file-wide documentation-indentation normalization (stripping the leading 4 spaces from every setting's R"(...)" doc body). The conflict was purely the adjacency of that reformatting of the merges_mutations_memory_usage_to_ram_ratio "See also" lines with the newly-added additional_memory_tracking_per_thread DECLARE. I took master's unindented "See also" lines and also unindented the new setting's doc body to match the now file-wide convention (whitespace-only inside the raw string, no behavior change). All of the PR's own changes are intact (21 files); the getMemoryAmount() clamping in Server.cpp/LocalServer.cpp auto-merged cleanly.

No code/behavior changes. The remaining open review threads are the documented server-wide-tracker-only design trade-offs already discussed inline. Fresh CI re-runs on the merge commit.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Status after the master merge in cc7c847c859: MERGEABLE, fresh CI is in progress.

The only completed failure so far is Stress test (amd_debug)Logical error: 'Inconsistent AST formatting' (STID 1941-26fa). This is the tracked fleet-wide AST-fuzzer flake #108372 (exact STID match; it fires on master and across ~68 PRs in the last 14 days). It is not a regression from the merge: the stack is entirely the AST format/re-parse round-trip in executeASTFuzzerQueriesexecuteQueryImpl, with no memory-tracking frames — the additional_memory_tracking_per_thread reservation is not involved. It can be re-run once the whole CI run completes.

The 5 remaining unresolved review threads are unchanged — the server-wide-tracker-only accounting design points (ThreadPool.h, ServerSettings.cpp, PipelineExecutor.cpp executeStep coverage, CurrentMemoryTracker.cpp allocGlobal/freeGlobal), left as your design calls.

@clickhouse-gh

clickhouse-gh Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.50% 85.50% +0.00%
Functions 92.60% 92.60% +0.00%
Branches 77.70% 77.70% +0.00%

Changed lines: Changed C/C++ lines covered: 53/53 (100.00%) · Uncovered code

Full report · Diff report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

memory When memory usage is higher than expected pr-improvement Pull request with some product improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants