Support hot-reloading keeper server settings by al13n321 · Pull Request #100773 · ClickHouse/ClickHouse · GitHub
Skip to content

Support hot-reloading keeper server settings#100773

Merged
al13n321 merged 17 commits into
masterfrom
hot
Apr 11, 2026
Merged

Support hot-reloading keeper server settings#100773
al13n321 merged 17 commits into
masterfrom
hot

Conversation

@al13n321

@al13n321 al13n321 commented Mar 26, 2026

Copy link
Copy Markdown
Member

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

A few Keeper server settings are now hot-reloaded if config file is changed at runtime: max_requests_batch_size, max_requests_batch_bytes_size, max_request_size, quorum_reads.


I got tired of restarting the server all the time when benchmarking, so here's a way to tweak things on the fly.

To make a setting hot-reloadable, add HOT_RELOAD flag to its declaration in CoordinationSettings.cpp, and use keeper_context->getCoordinationSettings() to get the latest value.

This turned out much easier than expected. I always thought this would have to be slow or complicated (rcu! hazard pointers! thread-cached refcount!) and require changing all the call sites. But no, it's fine, you can just cache a shared_ptr in a thread-local variable to avoid all cacheline contention.

This PR only makes a couple easiest settings reloadable, just to make sure it works: max_requests_batch_size, max_requests_batch_bytes_size, quorum_reads.

Maybe we should do this to ServerSettings too, maybe replacing some of the laborious manual change propagation in the config reload callback.

Version info

  • Merged into: 26.4.1.826

@clickhouse-gh

clickhouse-gh Bot commented Mar 26, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label Mar 26, 2026
Comment thread src/Coordination/KeeperContext.cpp Outdated
Comment thread src/Server/KeeperTCPHandler.cpp Outdated
Comment thread src/Coordination/CoordinationSettings.cpp Outdated
@al13n321 al13n321 requested a review from antonio2368 March 27, 2026 22:13
@antonio2368 antonio2368 self-assigned this Mar 30, 2026
Comment thread src/Coordination/KeeperContext.h Outdated
Comment thread src/Coordination/CoordinationSettings.cpp Outdated
uint64_t max_batch_bytes_size = coordination_settings[CoordinationSetting::max_requests_batch_bytes_size];
size_t max_batch_size = coordination_settings[CoordinationSetting::max_requests_batch_size];
const auto & dynamic_settings = keeper_context->getDynamicSettings();
uint64_t operation_timeout_ms = dynamic_settings[CoordinationSetting::operation_timeout_ms].totalMilliseconds();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but you didn't mark it as hot reloadable

@al13n321 al13n321 Apr 1, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

operation_timeout_ms is passed into nuraft, so wouldn't be easy to hot reload.

It's ok to use getDynamicSettings() for non-reloadable settings. Unless you think it would be better to have a style convention to always do getCoordinationSettings() even if there's already a perfectly good instance of CoordinationSettings right there in a local variable. (I was thinking the opposite: use getDynamicSettings everywhere and remove getCoordinationSettings. But that would require checking all call sites to fix the ones that hold on to the returned reference for too long.)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah now I see what is going on. It is a bit confusing and people could easily mess up as you have access to all settings in both.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So what do you propose? Convention to use getCoordinationSettings for non-reloadable settings? Split settings into two separate structs? Remove getCoordinationSettings and update all call sites to use getDynamicSettings? Merge as is? Abandon this PR? I'm ok with most of these options, unclear which is best.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would just have getCoordinationSettings, and have the logic for dynamic settings.
Using getDynamicSettings everywhere could be misleading and understood as all settings here are dynamic.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean remove getDynamicSettings or have a convention to not use it for non-reloadable settings?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I meant removing getDynamic

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed getCoordinationSettings to getFixedCoordinationSettings, and getDynamicSettings to getCoordinationSettings.

Comment thread tests/integration/test_keeper_dynamic_settings/test.py
@alexey-milovidov

Copy link
Copy Markdown
Member

The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.

@alexey-milovidov

Copy link
Copy Markdown
Member

The test 02859_replicated_db_name_zookeeper is fixed in #101952

Comment thread tests/integration/test_keeper_dynamic_settings/test.py
@alexey-milovidov

Copy link
Copy Markdown
Member

The 02346_text_index_bug101913 test failure is fixed by #102108 (already merged). Please rebase or merge master to pick up the fix.

@antonio2368

Copy link
Copy Markdown
Member

Update changelog, this is a very good change.

Comment thread src/Coordination/KeeperContext.cpp
@al13n321 al13n321 enabled auto-merge April 9, 2026 19:35
@clickhouse-gh clickhouse-gh Bot added pr-improvement Pull request with some product improvements and removed pr-not-for-changelog This PR should not be mentioned in the changelog labels Apr 10, 2026
@antonio2368

Copy link
Copy Markdown
Member

Stress test (arm_msan) - tracking issue: #102241, should be fixed by #102305

Comment thread tests/integration/test_keeper_dynamic_settings/test.py
@clickhouse-gh

clickhouse-gh Bot commented Apr 10, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.00% 84.20% +0.20%
Functions 90.90% 90.90% +0.00%
Branches 76.50% 76.70% +0.20%

Changed lines: 83.48% (283/339) | lost baseline coverage: 8 line(s) · Uncovered code

Full report · Diff report

@al13n321

Copy link
Copy Markdown
Member Author

@al13n321 al13n321 added this pull request to the merge queue Apr 11, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Apr 11, 2026
@al13n321 al13n321 added this pull request to the merge queue Apr 11, 2026
Merged via the queue into master with commit 52c89f4 Apr 11, 2026
157 of 164 checks passed
@al13n321 al13n321 deleted the hot branch April 11, 2026 02:02
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 11, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-improvement Pull request with some product improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants