Fix max_memory_usage_soft_limit hot reload by nauu · Pull Request #104940 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix max_memory_usage_soft_limit hot reload#104940

Merged
Ergus merged 3 commits into
ClickHouse:masterfrom
nauu:fix-keeper
May 20, 2026
Merged

Fix max_memory_usage_soft_limit hot reload#104940
Ergus merged 3 commits into
ClickHouse:masterfrom
nauu:fix-keeper

Conversation

@nauu

@nauu nauu commented May 14, 2026

Copy link
Copy Markdown
Member

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

When max_memory_usage_soft_limit is not set explicitly and is auto-calculated from max_memory_usage_soft_limit_ratio, initializeKeeperMemorySoftLimit called config.setUInt64() to cache the result into the writable MapConfiguration layer(PRIO_APPLICATION = -100).

On config reload triggered by CgroupsMemoryUsageObserver(e.g., when the cgroup memory limit increases), LayeredConfiguration::replace() only swaps the file layer (PRIO_DEFAULT = 0) but leaves MapConfiguration intact.

Because MapConfiguration has higher priority, the stale startup value shadows the new file config, and updateKeeperMemorySoftLimit reads the old value instead of recalculating.

Fix: extract calculateMemorySoftLimit and remove the config.setUInt64() call, so each reload recalculates from the current config and getMemoryAmount().

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed the max_memory_usage_soft_limit was not updated on-the-fly when the configuration changed. The soft memory limit was calculated once at startup and cached, so subsequent config reloads had no effect. After this fix, the limit is recalculated and applied immediately on every config reload, no Keeper restart required.

Version info

  • Merged into: 26.5.1.893

@nauu nauu added the can be tested Allows running workflows for external contributors label May 14, 2026
@clickhouse-gh

clickhouse-gh Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label May 14, 2026

@Ergus Ergus left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix as is seems correct (LGTM), but there are a couple or corner cases/paths not covered in tests:

  1. If we set max_memory_usage_soft_limit directly in the config then calculateMemorySoftLimit takes the hasProperty branch and there is no test for that:
    We need a test that:

  2. Sets an explicit value at startup and verifies it is respected

  3. Changes the explicit value on reload and verifies the new value is picked up

  4. If we change ratio at startup. After the PR's fix, this should work: the MapConfiguration is clean (as there is not call to setUInt64 now), so hasProperty("...soft_limit") is false at startup (correct ratio path), and true after reload with an explicit value (correct explicit path).

  5. When calculateMemorySoftLimit returns 0 (i.e ratio = 0). Callers of getKeeperMemorySoftLimit() must treat 0 as "no limit".

Comment thread src/Coordination/KeeperContext.cpp Outdated
@nauu

nauu commented May 14, 2026

Copy link
Copy Markdown
Member Author

The fix as is seems correct (LGTM), but there are a couple or corner cases/paths not covered in tests:

  1. If we set max_memory_usage_soft_limit directly in the config then calculateMemorySoftLimit takes the hasProperty branch and there is no test for that:
    We need a test that:
  2. Sets an explicit value at startup and verifies it is respected
  3. Changes the explicit value on reload and verifies the new value is picked up
  4. If we change ratio at startup. After the PR's fix, this should work: the MapConfiguration is clean (as there is not call to setUInt64 now), so hasProperty("...soft_limit") is false at startup (correct ratio path), and true after reload with an explicit value (correct explicit path).
    as "no limit".

I haven't change any other existing logic in this PR, however I've added more test cases

  1. When calculateMemorySoftLimit returns 0 (i.e ratio = 0). Callers of getKeeperMemorySoftLimit() must treat 0

I thought this is guaranteed by this part

bool KeeperServer::isExceedingMemorySoftLimit() const

@nauu

nauu commented May 15, 2026

Copy link
Copy Markdown
Member Author

The failed test 02346_text_index_queries has a known issue #104923

It is unrelated to this PR.

@Ergus

Ergus commented May 15, 2026

Copy link
Copy Markdown
Member

I haven't change any other existing logic in this PR, however I've added more test cases

I know, but as this part is a fragile defined behavior, the extra tests simplify detectin conflicting changes in apparently unrelated parts in the future.

  1. When calculateMemorySoftLimit returns 0 (i.e ratio = 0). Callers of getKeeperMemorySoftLimit() must treat 0

I thought this is guaranteed by this part

bool KeeperServer::isExceedingMemorySoftLimit() const

Indeed, but like in crypto: "Don't Trust, Verify" ;)

@nauu

nauu commented May 15, 2026

Copy link
Copy Markdown
Member Author

I haven't change any other existing logic in this PR, however I've added more test cases

I know, but as this part is a fragile defined behavior, the extra tests simplify detectin conflicting changes in apparently unrelated parts in the future.

  1. When calculateMemorySoftLimit returns 0 (i.e ratio = 0). Callers of getKeeperMemorySoftLimit() must treat 0

I thought this is guaranteed by this part

bool KeeperServer::isExceedingMemorySoftLimit() const

Indeed, but like in crypto: "Don't Trust, Verify" ;)

Hi @Ergus thanks your reply .

In the original logic, when ratio = 0, memory_soft_limit remains 0 and setUInt64 is never called . so the behavior is identical to what we have now.

The only caller of getKeeperMemorySoftLimit() is isExceedingMemorySoftLimit, which already guards with mem_soft_limit > 0.

Could you clarify what specific test case you have in mind here? I am happy to add it if there's a concrete scenario to cover.

@rschu1ze

Copy link
Copy Markdown
Member

@nauu Normal users won't be able to decipher the changelog entry Fixed Keeper max_memory_usage_soft_limit not hot updating after config reload. Please rewrite and expand.

@Ergus

Ergus commented May 19, 2026

Copy link
Copy Markdown
Member

@rschu1ze

I would rephrase the changelog more or less like this:

Fixed on-the-fly update of soft memory limit when the configuration changes by  recalculating and applying the value on every config reload without requiring keeper restart.

Comment thread src/Coordination/KeeperContext.cpp Outdated

LOG_INFO(log, "keeper_server.max_memory_usage_soft_limit is set to {}", formatReadableSizeWithBinarySuffix(memory_soft_limit));
void KeeperContext::initializeKeeperMemorySoftLimit(Poco::Util::AbstractConfiguration & config, Poco::Logger * log)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name of this function is misleading now because it is not changing the memory_soft_limit

Comment thread src/Coordination/KeeperContext.cpp Outdated

LOG_INFO(log, "keeper_server.max_memory_usage_soft_limit is set to {}", formatReadableSizeWithBinarySuffix(memory_soft_limit));
void KeeperContext::initializeKeeperMemorySoftLimit(Poco::Util::AbstractConfiguration & config, Poco::Logger * log)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also I think that the config argument can be const now. Which requires updatign the comment in the header

const UInt64 expected_reload = static_cast<UInt64>(static_cast<double>(getMemoryAmount()) * 0.5);
ASSERT_EQ(ctx->getKeeperMemorySoftLimit(), expected_reload);
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would recommend to add a test for explicit value -> ratio-only config on reload.

Similar to ExplicitZeroFallsBackToRatio bit where max_memory_usage_soft_limit_ratio=0

Comment thread src/Coordination/KeeperContext.cpp Outdated
config.setUInt64("keeper_server.max_memory_usage_soft_limit", memory_soft_limit);
}
}
memory_soft_limit = calculateMemorySoftLimit(config);

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a log line in this function so the actual stored value is always observable.

Comment thread src/Coordination/KeeperContext.cpp Outdated
void KeeperContext::initializeKeeperMemorySoftLimit(Poco::Util::AbstractConfiguration & config, Poco::Logger * log)
{
const auto limit = calculateMemorySoftLimit(config);
LOG_INFO(log, "keeper_server.max_memory_usage_soft_limit is set to {}", formatReadableSizeWithBinarySuffix(limit));

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that this is the log line needed in updateKeeperMemorySoftLimit

@Ergus

Ergus commented May 19, 2026

Copy link
Copy Markdown
Member

The failing test is unrelated.

@rschu1ze rschu1ze May 19, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nauu Can you please write an integration test instead? A C++ unit test seems too low level for the job.

@@ -104,3 +108,61 @@ def test_soft_limit_create(started_cluster):
return

raise Exception("all records are inserted but no error occurs")


def test_soft_limit_hot_reload(started_cluster):

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs a test-level comment for the next guy who will read this test in a couple of weeks from now, basically what is the test supposed to test.

Comment thread tests/integration/test_keeper_memory_soft_limit/test.py Outdated
@clickhouse-gh

clickhouse-gh Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.00% 84.10% +0.10%
Functions 91.30% 91.40% +0.10%
Branches 76.30% 76.50% +0.20%

Changed lines: 56.00% (14/25) | lost baseline coverage: 4 line(s) · Uncovered code

Full report · Diff report

@Ergus Ergus left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I liked the unit tests, but I agree that with integration tests it is simpler to test.

@Ergus Ergus added this pull request to the merge queue May 20, 2026
Merged via the queue into ClickHouse:master with commit b9e7652 May 20, 2026
165 checks passed
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label May 20, 2026
DavidHe-2008 pushed a commit to DavidHe-2008/ClickHouse that referenced this pull request Jun 1, 2026
Fix max_memory_usage_soft_limit hot reload
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants