Workload scheduling: Memory reservations by serxa · Pull Request #82414 · ClickHouse/ClickHouse · GitHub

Workload scheduling: Memory reservations#82414

Merged
serxa merged 186 commits into master from workload-memory-scheduling
Jun 24, 2026

Conversation

@serxa

@serxa serxa commented Jun 23, 2025

Member

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Introduce a memory reservation feature for workloads. More details: https://clickhouse.com/docs/operations/workload-scheduling

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Details

To enable memory reservations for workloads, create a MEMORY RESERVATION resource and set at least one limit on the total reserved memory using workload settings:

CREATE RESOURCE memory (MEMORY RESERVATION)
CREATE WORKLOAD all SETTINGS max_memory = '2Gi'

ClickHouse tracks memory allocations of all queries and background activities. The number of allocated bytes is aggregated through the scheduling hierarchy up to the root. Every query has an associated allocation in the leaf workload it belongs to. If a query has the reserve_memory setting greater than zero, the allocation is created in a pending state. A pending allocation reserves the requested amount of memory in the workload hierarchy. If there is not enough memory available, the allocation remains pending until enough memory is freed or other allocations are evicted (killed). Once an allocation is admitted, it becomes running. A running allocation can increase or decrease its size dynamically according to the memory consumption of the query. The allocation life cycle can be depicted with the following state diagram:

stateDiagram-v2
    [*] --> Pending: init [reserve_memory > 0]
    [*] --> Running: init [reserve_memory == 0]

    Pending --> Running: admit

    state Running {
        %% Region 1: increase flow
        NotIncreasing --> Increasing: request
        Increasing --> NotIncreasing: approve

        --

        %% Region 2: decrease flow
        NotDecreasing --> Decreasing: request
        Decreasing --> NotDecreasing: approve
    }


    Running --> Killed: evict
    Running --> Released: finish
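
The transitions in the diagram can also be read as a small state machine. The sketch below is only an illustration of the diagram: `AllocationSketch`, `AllocationState` and the method names are hypothetical and are not taken from the ClickHouse sources. The two boolean flags stand for the two parallel regions of the Running state.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Hypothetical names: this mirrors the state diagram, not ClickHouse's classes.
enum class AllocationState { Pending, Running, Killed, Released };

struct AllocationSketch
{
    AllocationState state;
    bool increasing = false; // Region 1 of Running: an increase was requested
    bool decreasing = false; // Region 2 of Running: a decrease was requested

    // init: a non-zero reserve_memory starts the allocation as Pending,
    // otherwise it starts as Running.
    explicit AllocationSketch(std::uint64_t reserve_memory)
        : state(reserve_memory > 0 ? AllocationState::Pending : AllocationState::Running)
    {
    }

    void admit() { transition(AllocationState::Pending, AllocationState::Running); }
    void evict() { transition(AllocationState::Running, AllocationState::Killed); }
    void finish() { transition(AllocationState::Running, AllocationState::Released); }

    // Size changes are only possible while the allocation is running.
    void requestIncrease() { requireRunning(); increasing = true; }
    void approveIncrease() { requireRunning(); assert(increasing); increasing = false; }
    void requestDecrease() { requireRunning(); decreasing = true; }
    void approveDecrease() { requireRunning(); assert(decreasing); decreasing = false; }

private:
    void requireRunning() const
    {
        if (state != AllocationState::Running)
            throw std::logic_error("size changes are only possible while running");
    }

    void transition(AllocationState from, AllocationState to)
    {
        if (state != from)
            throw std::logic_error("transition is not part of the state diagram");
        state = to;
    }
};
```

For example, `AllocationSketch(1024)` starts as Pending and must be admitted before it can request an increase, while `AllocationSketch(0)` starts as Running and rejects `admit()`.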

Pending allocations of a leaf workload are admitted in FIFO order. When multiple workloads have pending allocations, they are admitted according to their precedence and weight settings. Higher-precedence workloads are served first. Sibling workloads with the same precedence share memory according to their weights in a max-min fair manner, which means that the workload with the lower normalized memory usage, (current usage + requested increase) / weight, is served first. The reverse logic is applied during eviction: when memory needs to be freed, workloads with lower precedence and higher normalized memory usage are evicted first.
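
The ordering among sibling workloads can be sketched as follows. This is a simplified model, not the scheduler's code: `SiblingWorkload`, `normalizedUsage` and `pickNextToAdmit` are hypothetical names, every sibling in the vector is assumed to have a pending request, and a lower precedence value is taken to mean higher precedence (the PR text treats the default 0 as higher than 1 and 2).

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical model of one sibling workload that has a request waiting.
struct SiblingWorkload
{
    std::string name;
    int precedence;          // lower value = higher precedence
    double weight;           // must be positive
    std::uint64_t usage;     // bytes currently allocated in the subtree
    std::uint64_t requested; // size of the pending or increasing request
};

// (current usage + requested increase) / weight
double normalizedUsage(const SiblingWorkload & w)
{
    assert(w.weight > 0);
    return static_cast<double>(w.usage + w.requested) / w.weight;
}

// Returns the index of the sibling to serve next, or -1 if there is none.
// Higher precedence wins; ties are broken by lower normalized usage.
int pickNextToAdmit(const std::vector<SiblingWorkload> & siblings)
{
    int best = -1;
    for (int i = 0; i < static_cast<int>(siblings.size()); ++i)
    {
        if (best < 0)
        {
            best = i;
            continue;
        }
        const SiblingWorkload & candidate = siblings[i];
        const SiblingWorkload & current = siblings[best];
        const bool higher_precedence = candidate.precedence < current.precedence;
        const bool same_precedence = candidate.precedence == current.precedence;
        if (higher_precedence || (same_precedence && normalizedUsage(candidate) < normalizedUsage(current)))
            best = i;
    }
    return best;
}
```

With usage expressed in arbitrary units, a workload with weight 3, usage 3 and a request of 1 has normalized usage 4/3 and is served before a same-precedence sibling with weight 1, usage 1 and a request of 1 (normalized usage 2). A sibling with a lower precedence waits behind both, whatever its usage.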

Note that time-shared resources use priority, while space-shared resources use precedence. These are independent settings and can be set to different values. Higher priority implies non-destructive preemption (delay or throttling), while higher precedence may imply destructive eviction (the victim stops with an error). A workload can have high priority for CPU scheduling but the same precedence for memory reservation, to avoid evicting other workloads and losing work they have already done.

Every workload with a max_memory limit ensures that the total memory allocated in its subtree does not exceed the limit. If a pending or increasing allocation would exceed the limit, the eviction procedure is initiated to free memory. The eviction procedure selects a victim to be killed. The least common ancestor workload of the killer and the victim prevents eviction in the following situations:

  • A pending allocation cannot evict running allocations in the same workload (the killer and victim workloads coincide).
  • A pending allocation of lower precedence never kills an allocation of higher precedence.
  • A pending allocation cannot kill an allocation of the same precedence. Note that running allocations of the same precedence may evict each other based on normalized memory usage.

If eviction is prevented or does not free enough memory, the new allocation is blocked until enough memory is freed. These rules allow excess queries to be queued based on memory pressure and provide a convenient way to avoid MEMORY_LIMIT_EXCEEDED errors.
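
The three rules for pending allocations can be condensed into a single predicate. The sketch below is an interpretation of the rules, not the implementation: `AllocationSide` and `pendingMayEvict` are hypothetical names, and it assumes that the precedence compared is the one of the child of the least common ancestor that contains each allocation, with a lower value meaning higher precedence. Eviction between running allocations is deliberately left out.

```cpp
#include <cassert>
#include <string>

// Hypothetical description of one side of an eviction decision.
struct AllocationSide
{
    std::string leaf_workload; // leaf workload that owns the allocation
    int precedence_under_lca;  // precedence of the LCA's child on this side; lower = higher precedence
};

// May a pending allocation (killer) evict a running allocation (victim)?
bool pendingMayEvict(const AllocationSide & killer, const AllocationSide & victim)
{
    // Rule 1: killer and victim workloads coincide.
    if (killer.leaf_workload == victim.leaf_workload)
        return false;
    // Rules 2 and 3: the killer needs strictly higher precedence than the victim.
    return killer.precedence_under_lca < victim.precedence_under_lca;
}

// Outcomes for workloads with precedence 1 (production, staging) and 2 (testing).
inline void checkSiblingExample()
{
    const AllocationSide production{"production", 1};
    const AllocationSide staging{"staging", 1};
    const AllocationSide testing{"testing", 2};
    assert(pendingMayEvict(production, testing));     // higher precedence evicts lower
    assert(!pendingMayEvict(production, staging));    // same precedence: wait in the queue
    assert(!pendingMayEvict(testing, production));    // lower never kills higher
    assert(!pendingMayEvict(production, production)); // same workload
}
```

When the predicate returns false, the pending allocation simply stays queued until memory is freed.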

NOTE: Workload limits are independent of other ways to limit memory consumption, such as the max_memory_usage query setting. They can be used together to achieve better control over memory consumption. It is also possible to set independent memory limits based on users (not workloads). This is less flexible and does not provide features like memory reservation and queueing of pending queries. See Memory overcommit.

The workload setting max_waiting_queries limits the number of pending allocations for the workload. When the limit is reached, the server returns a SERVER_OVERLOADED error.

Memory reservation scheduling is not supported for merges and mutations yet.

Only queries with the reserve_memory setting greater than zero are subject to blocking while waiting for memory reservation. However, queries with zero reserve_memory are also accounted for in their workload memory footprint, and they can be evicted if necessary to free memory for other pending or increasing allocations. Queries without proper workload markup are not subject to memory reservation scheduling and cannot be evicted by the scheduler.

To provide a non-elastic memory reservation for a query, set both the reserve_memory and max_memory_usage query settings to the same value. In this case, the query reserves a fixed amount of memory and cannot increase its allocation dynamically.

Let's consider an example of configuration:

CREATE RESOURCE memory (MEMORY RESERVATION)
CREATE WORKLOAD all SETTINGS max_memory = '10Gi'
CREATE WORKLOAD system IN all SETTINGS weight = 1
CREATE WORKLOAD user IN all SETTINGS weight = 9
CREATE WORKLOAD production IN user SETTINGS precedence = 1, weight = 3
CREATE WORKLOAD staging IN user SETTINGS precedence = 1, weight = 1
CREATE WORKLOAD testing IN user SETTINGS precedence = 2

In this example, the total memory reserved by all queries and background activities cannot exceed 10 GiB. The system workload has a guarantee of at least 1 GiB (10% of 10 GiB), while the user workload has a guarantee of at least 9 GiB (90% of 10 GiB). Inside the user workload, production and staging workloads share memory according to weights (3 to 1) with equal precedence of 1. Testing workload has precedence 2, which is lower than production and staging. Therefore, testing workload can only use memory that is not used by production and staging.
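
The guarantees quoted above follow from splitting the parent limit in proportion to the weights. A minimal sketch of that arithmetic, assuming equal-precedence siblings and small integer weights (`guaranteedBytes` is a hypothetical helper, not a ClickHouse function):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Splits parent_limit among siblings in proportion to their weights.
// Assumption: the weights are small enough that parent_limit * weight
// does not overflow 64 bits; overflow handling is omitted for brevity.
std::vector<std::uint64_t> guaranteedBytes(std::uint64_t parent_limit, const std::vector<std::uint64_t> & weights)
{
    std::uint64_t total_weight = 0;
    for (std::uint64_t w : weights)
        total_weight += w;
    assert(total_weight > 0);

    std::vector<std::uint64_t> guarantees;
    guarantees.reserve(weights.size());
    for (std::uint64_t w : weights)
        guarantees.push_back(parent_limit * w / total_weight);
    return guarantees;
}
```

Calling `guaranteedBytes(10ULL << 30, {1, 9})` yields 1 GiB for system and 9 GiB for user, matching the 10% and 90% figures in the text.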

If memory pressure arises, testing workload allocations will be evicted first. Then, if more memory needs to be freed, staging workload allocations will be evicted before production workload allocations if they exceed their guarantees. Note that pending queries in production and staging can evict running allocations in testing workload to free memory, but they cannot evict each other because they have the same precedence. In case of memory pressure, they will wait in queues, which allows the system to avoid MEMORY_LIMIT_EXCEEDED errors due to too many concurrently executing queries.

Note that the system workload has precedence 0 (the default), which is higher than that of the production, staging and testing workloads, but they are not sibling workloads. The least common ancestor is the workload all, both children of which have equal precedence. So a pending allocation in the system workload cannot evict allocations in any of them, and vice versa. This ensures that system activities cannot easily be evicted.

Version info

  • Merged into: 26.6.1.1190
  • Backported to: 26.6.2.5

@clickhouse-gh

clickhouse-gh Bot commented Jun 23, 2025

Contributor

@serxa serxa marked this pull request as draft June 23, 2025 12:52
@serxa serxa changed the title [WIP [WIP] Workload memory scheduling Jun 23, 2025
@serxa serxa changed the title [WIP] Workload memory scheduling [WIP] Workload scheduling: memory reservations Jun 24, 2025
@serxa serxa changed the title [WIP] Workload scheduling: memory reservations [WIP] Workload scheduling: Memory reservations Jun 24, 2025
@serxa serxa mentioned this pull request Aug 3, 2025
29 tasks
@clickhouse-gh

clickhouse-gh Bot commented Aug 26, 2025

Contributor

Dear @serxa, this PR hasn't been updated for a while. Will you continue working on it? If not, please close it. Otherwise, ignore this message.

@clickhouse-gh

clickhouse-gh Bot commented Oct 28, 2025

Contributor

Dear @serxa, this PR hasn't been updated for a while. Will you continue working on it? If not, please close it. Otherwise, ignore this message.

@clickhouse-gh clickhouse-gh Bot added the pr-feature Pull request with new product feature label Nov 11, 2025
@azat azat self-assigned this Nov 16, 2025
@serxa

serxa commented Jun 23, 2026

Member Author

serxa and others added 3 commits June 23, 2026 17:13
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@groeneai

Contributor

@serxa Not caused by this PR, and not yet fixed. It is a pre-existing data race in the MVCC transaction code that the server-side AST fuzzer surfaces; it is unrelated to the scheduler changes here.

What the report shows: a TSan race on Context::merge_tree_transaction (a shared_ptr<MergeTreeTransaction>, Context.h:696).

  • Read side (Restore_MakeTbl thread): RestorerFromBackup::createTable -> Context::createCopy (Context.cpp:1368, holds SharedLockGuard(other->mutex)) -> ContextData copy ctor reads merge_tree_transaction (Context.cpp:1316).
  • Write side (TCPHandler thread): executeASTFuzzerQueries (executeQuery.cpp:2114) -> Context::setCurrentTransaction (Context.cpp:7543) does merge_tree_transaction = std::move(txn) with no lock. getCurrentTransaction (Context.cpp:7548) is lock-free too.

Root cause: asymmetric locking. The copy path takes Context::mutex (shared), but setCurrentTransaction/getCurrentTransaction take no lock, so the reader's shared lock guards nothing. Any concurrent createCopy(ctx) racing a setCurrentTransaction(ctx, ...) on the same Context is a data race.

Why it is not this PR:

Fix direction (for the transaction owners, not this PR): make access to merge_tree_transaction symmetric, i.e. take Context::mutex in setCurrentTransaction/getCurrentTransaction as well, or otherwise serialize Context copy vs transaction mutation. There is a hot-path consideration on getCurrentTransaction, so the exact locking granularity is a call for the transaction maintainers (cc the MVCC area). I did not push anything here since it is outside this PR and needs a TSan repro plus owner sign-off. I can open a tracking issue with this analysis if you want.

Base max_memory_ratio on total_memory_tracker.getHardLimit() so that
max_server_memory_usage (and its derived RAM ratio) is respected, falling
back to physical memory only when no server hard limit is configured.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

// Update increase pointer in case the removed allocation was the current one
if (setIncrease() && parent)
    propagate(Update().setIncrease(increase));

Contributor


updateMinMaxAllocated still holds AllocationQueue::mutex when it propagates the new increase upward. That can deadlock the scheduler thread if the new current increase also needs eviction: e.g. one running allocation of 10, pending P1=30, pending P2=20, then max_memory is lowered to 25. This method rejects P1, makes P2 current, and calls propagate under the lock; AllocationLimit::setIncrease sees 10 + 20 > 25, calls selectAllocationToKill, and recurses back into AllocationQueue::selectAllocationToKill, which tries to take the same mutex.

Please collect whether the increase pointer changed while holding the lock, release the mutex, and only then propagate the update, or otherwise ensure victim selection cannot re-enter the same queue while its mutex is held.
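
The requested pattern (decide under the lock, release it, then propagate) could look roughly like the sketch below. It is a stand-alone illustration with hypothetical names (`QueueSketch`, `rejectFront`, `selectVictim`), not the `AllocationQueue` code from this PR: the propagate callback stands in for the parent limit, and `selectVictim` stands in for the re-entrant victim selection that takes the same mutex.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <utility>

// Hypothetical stand-in for a queue whose parent may call back into it.
class QueueSketch
{
public:
    using Propagate = std::function<void(std::uint64_t)>;

    QueueSketch(std::uint64_t running_bytes_, Propagate propagate_)
        : running_bytes(running_bytes_), propagate(std::move(propagate_))
    {
        assert(propagate);
    }

    void pushPending(std::uint64_t bytes)
    {
        std::lock_guard<std::mutex> lock(mutex);
        pending.push_back(bytes);
    }

    // Rejects the front pending request and reports the new current request upward.
    void rejectFront()
    {
        std::optional<std::uint64_t> new_current;
        {
            std::lock_guard<std::mutex> lock(mutex);
            if (pending.empty())
                return;
            pending.pop_front();
            if (!pending.empty())
                new_current = pending.front(); // only record the change under the lock
        } // the mutex is released here

        if (new_current)
            propagate(*new_current); // safe: the callback may call selectVictim()
    }

    // Re-entrant path: takes the same mutex, so it must not be reached while it is held.
    std::optional<std::uint64_t> selectVictim()
    {
        std::lock_guard<std::mutex> lock(mutex);
        if (running_bytes == 0)
            return std::nullopt;
        return running_bytes;
    }

private:
    std::mutex mutex;
    std::uint64_t running_bytes;
    std::deque<std::uint64_t> pending;
    Propagate propagate;
};
```

With the numbers from the comment (one running allocation of 10, pending requests of 30 and 20, a limit of 25), rejecting the first request makes 20 current. The callback sees 10 + 20 > 25 and asks the queue for a victim, which completes because the mutex is no longer held at that point.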

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

// Update increase pointer in case the removed allocation was the current one
if (setIncrease() && parent)
    propagate(Update().setIncrease(increase));

Contributor


updateQueueLimit can re-enter this same queue while AllocationQueue::mutex is still held. A concrete path is: lowering max_waiting_queries rejects the front pending allocation, setIncrease exposes the next pending/increasing request, then this propagate call reaches AllocationLimit::setIncrease; if the new request is still over max_memory, victim selection calls back into AllocationQueue::selectAllocationToKill, which tries to lock mutex again and deadlocks the scheduler thread.

Please follow the same pattern needed for updateMinMaxAllocated: decide whether the increase pointer changed while holding the lock, release mutex, and only then propagate the update upward.

@azat

azat commented Jun 23, 2026

Member

It will not turn itself on without some "admin-level" CREATE RESOURCE memory (MEMORY RESERVATION). And I don't think we need to mark the query setting reserve_memory with an experimental_ prefix, because we have this switch. So it is at least safe to release. And yes, I consider it experimental. What is the best way to reflect it?

Probably some server-side experimental setting, indeed. I am not sure whether it makes sense to add a basic one just for workflows (non-experimental).

@serxa serxa enabled auto-merge June 23, 2026 22:15
@clickhouse-gh

clickhouse-gh Bot commented Jun 23, 2026

Contributor

LLVM Coverage Report

Metric     Baseline  Current  Δ
Lines      85.40%    85.40%   +0.00%
Functions  92.60%    92.50%   -0.10%
Branches   77.60%    77.60%   +0.00%

Changed lines: Changed C/C++ lines covered by tests: 3491/3740 (93.34%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 9 line(s) · Uncovered code

Full report · Diff report

@serxa serxa added this pull request to the merge queue Jun 24, 2026
Merged via the queue into master with commit a4e0784 Jun 24, 2026
166 of 167 checks passed
@serxa serxa deleted the workload-memory-scheduling branch June 24, 2026 00:51
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 24, 2026
@Ergus Ergus mentioned this pull request Jun 24, 2026
alexey-milovidov added a commit that referenced this pull request Jun 24, 2026
Add changelog entries for PRs that landed on the 26.6 release line after
the first pass: #82414, #108042 (New Feature), #107428 (Improvement),
and #108128, #108288, #100205, #108029 (Bug Fix).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@robot-ch-test-poll robot-ch-test-poll added the pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR label Jun 27, 2026
robot-clickhouse-ci-2 added a commit that referenced this pull request Jun 27, 2026
Cherry pick #82414 to 26.6: Workload scheduling: Memory reservations
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Jun 27, 2026
clickhouse-gh Bot added a commit that referenced this pull request Jun 27, 2026
Backport #82414 to 26.6: Workload scheduling: Memory reservations

Labels

  • pr-backports-created: Backport PRs are successfully created, it won't be processed by CI script anymore
  • pr-feature: Pull request with new product feature
  • pr-must-backport-synced: The `*-must-backport` labels are synced into the cloud Sync PR
  • pr-synced-to-cloud: The PR is synced to the cloud repo
  • v26.6-must-backport


6 participants