Keep max_bytes_before_external_join peak under the configured cap#103838
Conversation
The auto-spilling join could still hit `MEMORY_LIMIT_EXCEEDED` even when
`max_bytes_before_external_join` was set. With a stats-warm cache, e.g.
running the user's transcript:
SET max_memory_usage = '10G';
SET max_bytes_before_external_join = '5G';
SELECT count() FROM
(SELECT number AS id, randomString(100) AS s FROM numbers(200000000)) AS a
JOIN (SELECT number AS id FROM numbers(200000000)) AS b USING (id);
failed at peak `12.5 GiB` while attempting an `8 GiB` chunk allocation,
even though the threshold was supposed to cap it well below `10 GiB`.
Three independent issues conspired:
1. `ConcurrentHashJoin::reserveSpaceInHashMaps` preallocated capacity for
`hint->ht_size` rows (e.g. 200M from a previous run) up front, blowing
past the threshold before `SpillingHashJoin` could check it. Now we
cap the entries by `max_bytes_before_external_join / (8 * cell_size)`
so the preallocated buffer stays under half the threshold while still
keeping the optimization for queries that fit.
2. `SpillingHashJoin::addBlockToJoin` checked the threshold after the
inner `addBlockToJoin` returned. Hash tables grow in power-of-two
steps and a doubling from X to 2X transiently holds 3X, so the inner
call could OOM during the resize before the post-call check ever
ran. The check now runs proactively before the inner call, and uses
half the threshold so the next doubling cannot push past the cap.
3. `GraceHashJoin::hasMemoryOverflow` only spilled on
`max_rows_in_join` / `max_bytes_in_join`. When the wrapper handed
data over, the in-memory bucket grew unbounded and the wrapper's
spill decision became meaningless. It now also rehashes when
`total_bytes` reaches half of `max_bytes_before_external_join`,
gated by a 64 MiB minimum so unrealistically tiny test thresholds
don't blow `grace_hash_join_max_buckets`.
Additionally, after rehashing, the new in-memory join was created with
`reserve_num = prev_keys_num`, which allocated a power-of-two buffer
for the pre-rehash size - itself another multi-gigabyte chunk. Reserve
is now `prev_keys_num / buckets_snapshot.size()`, matching the rows
that actually stay in this bucket after the rehash.
With all four together the user's query completes (200M result, peak
`6.5 GiB`, well under the `10 GiB` cap). Existing
`03915_spilling_hash_join` and `04061_spilling_hash_join_overflow_limits`
still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Workflow [PR], commit [0d8e9ca] Summary: ✅ AI ReviewSummaryThe PR fixes real memory-threshold enforcement gaps in the ClickHouse Rules
Final VerdictStatus: ✅ Approve |
alexey-milovidov
left a comment
There was a problem hiding this comment.
The code looks good to me.
`rehashBuckets` doubles the bucket count from `N` to `2N`, so the rows in the current bucket are split between bucket `i` and bucket `i + N` under the new modulus, with about half staying in this bucket. Reserve `prev_keys_num / 2` rather than `prev_keys_num / buckets_snapshot.size()`, which underestimated by a factor of `N` (e.g. with 128 buckets it reserved ~1/128 instead of ~1/2) and would force repeated re-grow/rehash cycles in the inner `HashJoin`, regressing the build path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `04123_spilling_hash_join_peak_under_cap`. The test runs an INNER JOIN whose right-side build payload (~140 MiB of random strings) is well above `max_bytes_before_external_join`, with a strict `max_memory_usage` that previously could have been exceeded by the combination of statistics-driven preallocation, the SpillingHashJoin threshold check running after the inner doubling, and the GraceHashJoin not yet enforcing `max_bytes_before_external_join`. The query must now complete on both the single-thread `hash` and the concurrent `parallel_hash` paths. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| return chosen_join->addBlockToJoin(block, check_limits); | ||
|
|
||
| return true; | ||
| return concurrent_join->addBlockToJoin(block, check_limits); |
There was a problem hiding this comment.
Moving the spill check to pre-insert avoids the resize-time OOM, but removing the post-insert check introduces a missed-switch case: if the last build block takes memory from below threshold / 2 to above it, SpillingHashJoin never switches because there is no next addBlockToJoin call.
Concrete trace: with max_bytes_before_external_join = 80 MiB, suppose getTotalByteCount() is 39 MiB before the final block. Pre-check passes (39 * 2 < 80). After inserting the final block, usage can be 60+ MiB (or even above 80 MiB), but we return immediately and onBuildPhaseFinish promotes the in-memory join without re-checking the threshold.
Please keep the pre-check and add a post-insert re-check (or an equivalent check in onBuildPhaseFinish) so terminal blocks cannot bypass spilling.
The earlier scale (1M rows × 140-byte strings, `max_memory_usage = 200Mi`, `max_bytes_before_external_join = 80Mi`) did not actually trip `MEMORY_LIMIT_EXCEEDED` on master: peak stayed around 175 MiB and the hash table never reached the spill threshold, so the test passed on the unfixed code as well as the fixed one. Bugfix validation flagged this as "Failed to reproduce the bug": https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=103838&sha=1c320f5a08784eb2c657ac12282d0c94268f3f89&name_0=PR&name_1=Bugfix%20validation%20%28functional%20tests%29 PR: #103838 The new scale (3M integer keys, `max_memory_usage = 256Mi`, `max_bytes_before_external_join = 64Mi`) hits the actual doubling pathology: on master the build-side hash table doubles past the threshold and tries to allocate a `~515 MiB` chunk, exceeding `max_memory_usage`, while on the fixed code `SpillingHashJoin` switches to `GraceHashJoin` before the doubling and the peak stays around `~150 MiB`. Verified locally against the master binary at the merge base (`b7507c4b7b3`) and the PR binary for SHA `1c320f5`: master, hash, 3M: fails with `659.49 MiB` peak master, parallel_hash, 3M: fails with `663.66 MiB` peak fixed, hash, 3M: passes with `149.11 MiB` peak fixed, parallel_hash, 3M: passes with `156.12 MiB` peak Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The proactive pre-insert check in `SpillingHashJoin::addBlockToJoin` only fires on a *subsequent* call. If the very last build block crosses `max_bytes_before_external_join` (a single huge block, or a doubling that takes the table from below `threshold/2` to above `threshold`), the post-insert state is over the cap and `onBuildPhaseFinish` promotes the in-memory join as-is instead of switching to `GraceHashJoin`. Add a final check in `onBuildPhaseFinish`: if the build-phase total already exceeds the threshold, switch to `GraceHashJoin` instead of promoting `HashJoin` / `ConcurrentHashJoin`. This restores the "`max_bytes_before_external_join` decides whether to spill" contract for the terminal-block edge case. Raised in PR review: #103838 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| { | ||
| static constexpr size_t MIN_THRESHOLD_FOR_AUTOMATIC_SPILL = 64 * 1024 * 1024; | ||
| const auto external_join_threshold = table_join->maxBytesBeforeExternalJoin(); | ||
| if (external_join_threshold >= MIN_THRESHOLD_FOR_AUTOMATIC_SPILL && total_bytes * 2 >= external_join_threshold) |
There was a problem hiding this comment.
This introduces a hard cutoff at 64 MiB, so max_bytes_before_external_join is no longer enforced for any lower value.
Concrete case: with max_bytes_before_external_join = 32 MiB, SpillingHashJoin can switch to GraceHashJoin, but this check stays disabled (32 < 64), so the in-memory bucket can keep growing past 32 MiB until some other limit is hit. That breaks the setting contract and reopens the same memory-peak issue for low-but-valid thresholds.
Please remove the fixed 64 MiB gate and apply the threshold check for any positive max_bytes_before_external_join (or make the lower bound configurable as a setting if we really need one).
The previous 64 MiB hardcoded floor on the `max_bytes_before_external_join` check inside `GraceHashJoin::hasMemoryOverflow` silently disabled the cap for any lower value. A user setting `max_bytes_before_external_join = 32 MiB` would still see in-memory buckets grow well past the configured limit, breaking the setting contract. The floor exists because the rehash only halves a bucket; for total data `D` and threshold `T` we need ~`log2(D / T)` rehashes per bucket (capped by `grace_hash_join_max_buckets`). With pathologically small thresholds (e.g. 1 KiB in `03915_spilling_hash_join`) the per-row hash cell overhead (~32 bytes) alone exceeds `T/2` before any rehash can help, forcing one rehash per added block and quickly exhausting the bucket cap. 1 MiB is small enough to enforce the contract for any practical configuration (the AI-flagged 32 MiB case is now respected) while still leaving low test thresholds unaffected. Existing tests `03915_spilling_hash_join`, `04061_spilling_hash_join_overflow_limits`, and `04123_spilling_hash_join_peak_under_cap` continue to pass. Add `04146_spilling_hash_join_low_threshold` as a regression for a threshold (`16 MiB`) below the previous floor: with `max_memory_usage = 160 MiB`, the unfixed code peaked at 167 MiB during the doubling step, while the fixed code stays around 131 MiB. Raised in PR review: #103838 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`04146_spilling_hash_join_low_threshold` was failing with `MEMORY_LIMIT_EXCEEDED` during the build phase even after the `SpillingHashJoin` pre-insert check landed. The same problem the `SpillingHashJoin` fix solved for the wrapper still existed one level down: `GraceHashJoin::addBlockToJoinImpl` checked `hasMemoryOverflow` *after* `hash_join->addBlockToJoin` returned, so the in-memory hash table inside the bucket could double its buffer past the cap before the rehash decision ran. Move the check before the inner `addBlockToJoin` using projected `(rows, allocatedBytes)`. When the projection already triggers `hasMemoryOverflow`, skip the inner add and include the un-added block in the rescatter pass so the data is preserved and redistributed to the new bucket layout. CI: #103838
The CI runs with random per-test settings injection. The test caps memory at `max_memory_usage = 160Mi` to make the absence of the `64 MiB` floor in `GraceHashJoin::hasMemoryOverflow` visible: with the fix in place, the build side peaks at around `131 MiB`, leaving a `~29 MiB` margin. That margin is too narrow to absorb the memory overhead added by random settings such as `filesystem_prefetch_max_memory_usage 32Mi`, `temporary_files_buffer_size`, and friends, so the cap was being tripped during the build phase even with the fix applied. The test failed in CI: #103838 with `MEMORY_LIMIT_EXCEEDED` at peak `~160 MiB` across multiple shards. Add `no-random-settings, no-random-merge-tree-settings` so the regression remains deterministic. The setting tag is strictly necessary here because the very setting under test (`max_bytes_before_external_join`) is among those randomized, and the tight memory cap is intentional.
| /// will actually hold (which adds cell overhead and load-factor padding), but it is | ||
| /// enough for the pre-check to fire one rehash earlier than the post-check would. | ||
| const size_t projected_keys = prev_keys_num + current_block.rows(); | ||
| const size_t projected_bytes = pre_total_bytes + current_block.allocatedBytes(); |
There was a problem hiding this comment.
projected_bytes = pre_total_bytes + current_block.allocatedBytes() is only a lower bound and can miss the same resize-time OOM pattern this change is trying to prevent.
Concrete trace with max_bytes_before_external_join = 80 MiB:
- Before insert:
pre_total_bytes = 39 MiB(39 * 2 < 80), bucket is already nearmaxFill. - Incoming
current_blockis tiny (e.g. one row), soprojected_bytesstays< 40 MiB, and the pre-check does not rehash. hash_join->addBlockToJoincrossesmaxFill, triggersX -> 2Xgrowth, and can hit the transient~3Xpeak inside the allocator before the post-insert check runs.
So a tiny block can still bypass the pre-check and fail at allocation time. Please base the pre-check on current hash-table occupancy/capacity growth risk (for example, a getTotalByteCount() * 2 >= threshold style check, as done in SpillingHashJoin) instead of current_block.allocatedBytes().
The previous pre-check used `pre_total_bytes + current_block.allocatedBytes()` fed into `hasMemoryOverflow`, which has two problems: 1. `block.allocatedBytes()` is a misleading lower bound on the in-memory hash table cost: cell overhead, load-factor padding, and the resize-time peak are not represented. A tiny block can leave a near-full bucket, with the projected sum still under the cap, then `addBlockToJoin` triggers an X→2X buffer doubling and OOMs in the allocator before the post-add check runs. 2. `hasMemoryOverflow` also performs `softCheck` against `max_bytes_in_join`, so the projected check fired against very small soft limits used in tests (e.g. `max_bytes_in_join = '10K'` in `02274_full_sort_join_nodistinct`). Combined with `block.allocatedBytes()` rounding, the pre-check rehashed on every block and quickly hit `grace_hash_join_max_buckets`. The pre-check now mirrors `SpillingHashJoin::addBlockToJoin`: rehash when the existing bucket alone is already past half of `max_bytes_before_external_join`, gated by the same 1 MiB floor (`MIN_THRESHOLD_FOR_AUTOMATIC_SPILL`, lifted to file scope so both call sites share it). Soft limits stay enforced by the post-insert call to `hasMemoryOverflow`, so behaviour is unchanged when `max_bytes_before_external_join` is unset. CI report (failure on `02274_full_sort_join_nodistinct`): https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=103838&sha=0025061ddbeaca0d44371f337595d27e86064c17&name_0=PR&name_1=Stateless%20tests%20%28amd_debug%2C%20distributed%20plan%2C%20s3%20storage%2C%20parallel%29 PR: #103838 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| /// buffer while the old one is still alive, transiently using 3X memory. We must trigger the | ||
| /// switch BEFORE the inner `addBlockToJoin` runs (and possibly doubles the buffer); a check | ||
| /// that runs after the call would race with the doubling and observe the OOM only as an | ||
| /// allocator exception. Threshold is half of `max_bytes_before_external_join` so that after |
There was a problem hiding this comment.
This proactive pre-insert threshold check changes the semantics from "we already exceeded memory" to "we are approaching the spill threshold".
But switchToGraceHashJoin still logs Memory limit exceeded ... switching to GraceHashJoin.
That message is now misleading in normal (non-exceptional) switching paths and can send incident triage in the wrong direction. Please update the log text to something like Memory spill threshold reached.
| /// any rehash can help, forcing one rehash per added block and quickly hitting the bucket | ||
| /// cap. 1 MiB enforces the contract for any practical configuration while still leaving low | ||
| /// test thresholds (e.g. 1 KiB in `03915_spilling_hash_join`) unaffected. | ||
| static constexpr size_t MIN_THRESHOLD_FOR_AUTOMATIC_SPILL = 1024 * 1024; |
There was a problem hiding this comment.
I don't like the fact that we are adding this artificial limit just to keep the tests working. Basically the tests now test a different behavior then in real world use cases. I would much rather prefer changing those tests.
| /// `SpillingHashJoin` ever runs its threshold check. We still want preallocation - just bounded by | ||
| /// the memory budget. We aim for about half of the threshold so that the cap itself plus the live | ||
| /// data plus the conversion peak still fit under it. | ||
| const auto external_join_threshold = hash_join.getTableJoin().maxBytesBeforeExternalJoin(); |
There was a problem hiding this comment.
I think it would be better if we would separate max_bytes_before_external_join and maximum size that concurrent hash join can reserve. E.g.: join_algorithm = 'parallel_hash', max_bytes_before_external_join = 10MB will cause concurrent hash performance to plummet. Let's make max reserve size as a parameter of ConcurrentHash join.
| /// on `MIN_THRESHOLD_FOR_AUTOMATIC_SPILL` for why a small floor is required. | ||
| if (!has_overflow) | ||
| { | ||
| const auto external_join_threshold = table_join->maxBytesBeforeExternalJoin(); |
There was a problem hiding this comment.
Same here as for concurrent hash join: join_algorithm = 'grace_hash', max_bytes_before_external_join = 10MB affects grace hash join when it shouldn't.
The pre-insert threshold check in `addBlockToJoin` switches before any memory limit is actually exceeded, so the previous "Memory limit exceeded ... switching to GraceHashJoin" log line was misleading for normal threshold-triggered switches and could mislead incident triage. Reword to "Memory spill threshold reached ..." to reflect the new semantics. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
| /// rehashing buckets early prevents that doubling from exceeding the cap. See the comment | ||
| /// on `MIN_THRESHOLD_FOR_AUTOMATIC_SPILL` for why a small floor is required. | ||
| if (!has_overflow) | ||
| { |
There was a problem hiding this comment.
max_bytes_before_external_join is documented and serialized as applying to hash/parallel_hash auto-conversion paths, not to explicit join_algorithm = 'grace_hash' (src/Core/Settings.cpp and src/Processors/QueryPlan/QueryPlanSerializationSettings.cpp).
This new check runs inside GraceHashJoin unconditionally, so users who intentionally picked grace_hash now get different spill behavior whenever max_bytes_before_external_join is set (including inherited profile settings). That is a behavior-contract break and can cause unexpected extra spilling/perf regressions in existing deployments.
Please gate this branch to the wrapper path only (e.g. only when GraceHashJoin is used as a spill target from SpillingHashJoin) instead of applying it to standalone grace_hash.
`max_bytes_before_external_join` was being read directly inside `ConcurrentHashJoin::reserveSpaceInHashMaps` and `GraceHashJoin::hasMemoryOverflow`, which meant the new cap also fired for users who explicitly picked `join_algorithm = 'parallel_hash'` or `'grace_hash'` standalone. The setting is documented as governing the auto-conversion path (`join_algorithm = 'auto'` or `'hash'` / `'parallel_hash'` with the threshold set), so applying it inside those joins unconditionally was a behavior-contract break - a `parallel_hash` build with a low `max_bytes_before_external_join` would lose its statistics- driven preallocation, and a standalone `grace_hash` would change spill timing. Pass the threshold explicitly via a new constructor parameter `external_join_threshold_` on both joins, set only when `SpillingHashJoin` constructs them. When 0 (standalone use), the new caps are skipped and the prior behavior is preserved. Also remove `MIN_THRESHOLD_FOR_AUTOMATIC_SPILL`. The 1 MiB floor existed solely so that the unconditional check would not break tests with very low thresholds; now that the check is wrapper-only, the floor is no longer needed. Update `03915_spilling_hash_join` to use a realistic 100 KiB threshold for the single-thread spill block (the old `1000`-byte value is below any reasonable production setting and kept the suite passing only because of the floor); the spilled queries now also produce delayed blocks, matching the concurrent-path tests.
LLVM Coverage ReportChanged lines: 98.71% (153/155) | lost baseline coverage: 1 line(s) · Uncovered code |
Cherry pick #103838 to 26.4: Keep `max_bytes_before_external_join` peak under the configured cap
…under the configured cap
Backport #103838 to 26.4: Keep `max_bytes_before_external_join` peak under the configured cap

The auto-spilling join (#97813) could still hit
MEMORY_LIMIT_EXCEEDEDeven whenmax_bytes_before_external_joinwas set. With a stats-warm cache, the user's transcriptfailed at peak
12.5 GiBwhile attempting an8 GiBchunk allocation, even though the threshold was supposed to cap memory well below10 GiB.Three independent issues conspired:
ConcurrentHashJoin::reserveSpaceInHashMapspreallocated capacity forhint->ht_sizerows (e.g. 200M from a previous run) up front, blowing past the threshold beforeSpillingHashJoincould check it. Now we cap entries bymax_bytes_before_external_join / (8 * cell_size)so the preallocated buffer stays under half the threshold while still keeping the optimization for queries that fit.SpillingHashJoin::addBlockToJoinchecked the threshold after the inneraddBlockToJoinreturned. Hash tables grow in power-of-two steps and a doubling from X to 2X transiently holds 3X, so the inner call could OOM during the resize before the post-call check ran. The check now runs proactively before the inner call, and uses half the threshold so the next doubling cannot push past the cap.GraceHashJoin::hasMemoryOverflowonly spilled onmax_rows_in_join/max_bytes_in_join. When the wrapper handed data over, the in-memory bucket grew unbounded and the wrapper's spill decision became meaningless. It now also rehashes whentotal_bytesreaches half ofmax_bytes_before_external_join, gated by a 64 MiB minimum so unrealistically tiny test thresholds don't blowgrace_hash_join_max_buckets.Additionally, after rehashing, the new in-memory join was created with
reserve_num = prev_keys_num, which itself allocated a power-of-two buffer for the pre-rehash size — another multi-gigabyte chunk. Reserve is nowprev_keys_num / buckets_snapshot.size(), matching the rows that actually stay in this bucket.The user's transcript: https://pastila.nl/?000b64fe/c41bd280359481e9fdf4c2b285b31943
Verification (peak observed via
clickhouse-client --memory-usage=default):max_memory_usage = 10Gonlymax_memory_usage = 10G+max_bytes_before_external_join = 5GExisting
03915_spilling_hash_joinand04061_spilling_hash_join_overflow_limitsstill pass against the modified server.Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Keep the auto-spilling hash join's actual memory peak under
max_bytes_before_external_join. Previously, statistics-driven preallocation, in-place hash table doubling, and unboundedGraceHashJoinin-memory buckets could each push the query past the configured cap and tripMEMORY_LIMIT_EXCEEDED.Documentation entry for user-facing changes
Version info
26.5.1.27426.4.1.1136