Keep `max_bytes_before_external_join` peak under the configured cap by alexey-milovidov · Pull Request #103838 · ClickHouse/ClickHouse · GitHub
Skip to content

Keep max_bytes_before_external_join peak under the configured cap#103838

Merged
antaljanosbenjamin merged 16 commits into
masterfrom
fix-spilling-join-peak
May 5, 2026
Merged

Keep max_bytes_before_external_join peak under the configured cap#103838
antaljanosbenjamin merged 16 commits into
masterfrom
fix-spilling-join-peak

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented May 1, 2026

Copy link
Copy Markdown
Member

The auto-spilling join (#97813) could still hit MEMORY_LIMIT_EXCEEDED even when max_bytes_before_external_join was set. With a stats-warm cache, the user's transcript

SET max_memory_usage = '10G';
SET max_bytes_before_external_join = '5G';
SELECT count() FROM
  (SELECT number AS id, randomString(100) AS s FROM numbers(200000000)) AS a
  JOIN (SELECT number AS id FROM numbers(200000000)) AS b USING (id);

failed at peak 12.5 GiB while attempting an 8 GiB chunk allocation, even though the threshold was supposed to cap memory well below 10 GiB.

Three independent issues conspired:

  1. ConcurrentHashJoin::reserveSpaceInHashMaps preallocated capacity for hint->ht_size rows (e.g. 200M from a previous run) up front, blowing past the threshold before SpillingHashJoin could check it. Now we cap entries by max_bytes_before_external_join / (8 * cell_size) so the preallocated buffer stays under half the threshold while still keeping the optimization for queries that fit.
  2. SpillingHashJoin::addBlockToJoin checked the threshold after the inner addBlockToJoin returned. Hash tables grow in power-of-two steps and a doubling from X to 2X transiently holds 3X, so the inner call could OOM during the resize before the post-call check ran. The check now runs proactively before the inner call, and uses half the threshold so the next doubling cannot push past the cap.
  3. GraceHashJoin::hasMemoryOverflow only spilled on max_rows_in_join/max_bytes_in_join. When the wrapper handed data over, the in-memory bucket grew unbounded and the wrapper's spill decision became meaningless. It now also rehashes when total_bytes reaches half of max_bytes_before_external_join, gated by a 64 MiB minimum so unrealistically tiny test thresholds don't blow grace_hash_join_max_buckets.

Additionally, after rehashing, the new in-memory join was created with reserve_num = prev_keys_num, which itself allocated a power-of-two buffer for the pre-rehash size — another multi-gigabyte chunk. Reserve is now prev_keys_num / buckets_snapshot.size(), matching the rows that actually stay in this bucket.

The user's transcript: https://pastila.nl/?000b64fe/c41bd280359481e9fdf4c2b285b31943

Verification (peak observed via clickhouse-client --memory-usage=default):

scenario result peak
baseline (no limits) 200M rows, 4.2s 17.4 GiB
max_memory_usage = 10G only fails (matches transcript) 9.31 GiB
max_memory_usage = 10G + max_bytes_before_external_join = 5G succeeds, 28.2s 6.5 GiB

Existing 03915_spilling_hash_join and 04061_spilling_hash_join_overflow_limits still pass against the modified server.

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Keep the auto-spilling hash join's actual memory peak under max_bytes_before_external_join. Previously, statistics-driven preallocation, in-place hash table doubling, and unbounded GraceHashJoin in-memory buckets could each push the query past the configured cap and trip MEMORY_LIMIT_EXCEEDED.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Version info

  • Merged into: 26.5.1.274
  • Backported to: 26.4.1.1136

The auto-spilling join could still hit `MEMORY_LIMIT_EXCEEDED` even when
`max_bytes_before_external_join` was set. With a stats-warm cache, e.g.
running the user's transcript:

    SET max_memory_usage = '10G';
    SET max_bytes_before_external_join = '5G';
    SELECT count() FROM
      (SELECT number AS id, randomString(100) AS s FROM numbers(200000000)) AS a
      JOIN (SELECT number AS id FROM numbers(200000000)) AS b USING (id);

failed at peak `12.5 GiB` while attempting an `8 GiB` chunk allocation,
even though the threshold was supposed to cap it well below `10 GiB`.

Three independent issues conspired:

1. `ConcurrentHashJoin::reserveSpaceInHashMaps` preallocated capacity for
   `hint->ht_size` rows (e.g. 200M from a previous run) up front, blowing
   past the threshold before `SpillingHashJoin` could check it. Now we
   cap the entries by `max_bytes_before_external_join / (8 * cell_size)`
   so the preallocated buffer stays under half the threshold while still
   keeping the optimization for queries that fit.

2. `SpillingHashJoin::addBlockToJoin` checked the threshold after the
   inner `addBlockToJoin` returned. Hash tables grow in power-of-two
   steps and a doubling from X to 2X transiently holds 3X, so the inner
   call could OOM during the resize before the post-call check ever
   ran. The check now runs proactively before the inner call, and uses
   half the threshold so the next doubling cannot push past the cap.

3. `GraceHashJoin::hasMemoryOverflow` only spilled on
   `max_rows_in_join` / `max_bytes_in_join`. When the wrapper handed
   data over, the in-memory bucket grew unbounded and the wrapper's
   spill decision became meaningless. It now also rehashes when
   `total_bytes` reaches half of `max_bytes_before_external_join`,
   gated by a 64 MiB minimum so unrealistically tiny test thresholds
   don't blow `grace_hash_join_max_buckets`.

Additionally, after rehashing, the new in-memory join was created with
`reserve_num = prev_keys_num`, which allocated a power-of-two buffer
for the pre-rehash size - itself another multi-gigabyte chunk. Reserve
is now `prev_keys_num / buckets_snapshot.size()`, matching the rows
that actually stay in this bucket after the rehash.

With all four together the user's query completes (200M result, peak
`6.5 GiB`, well under the `10 GiB` cap). Existing
`03915_spilling_hash_join` and `04061_spilling_hash_join_overflow_limits`
still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented May 1, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [0d8e9ca]

Summary:


AI Review

Summary

The PR fixes real memory-threshold enforcement gaps in the SpillingHashJoinGraceHashJoin path (preallocation cap, proactive spill checks, terminal-block handling, and wrapper-only threshold wiring), and adds targeted regressions. I did not find additional blocker/major issues beyond already-raised inline review points.

ClickHouse Rules
Item Status Notes
Deletion logging
Serialization versioning
Core-area scrutiny
No test removal
Experimental gate
No magic constants
Backward compatibility
SettingsChangesHistory.cpp
PR metadata quality
Safe rollout
Compilation time
No large/binary files
Final Verdict

Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label May 1, 2026

@alexey-milovidov alexey-milovidov left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The code looks good to me.

Comment thread src/Interpreters/GraceHashJoin.cpp Outdated
alexey-milovidov and others added 3 commits May 1, 2026 18:59
`rehashBuckets` doubles the bucket count from `N` to `2N`, so the rows
in the current bucket are split between bucket `i` and bucket `i + N`
under the new modulus, with about half staying in this bucket. Reserve
`prev_keys_num / 2` rather than `prev_keys_num / buckets_snapshot.size()`,
which underestimated by a factor of `N` (e.g. with 128 buckets it
reserved ~1/128 instead of ~1/2) and would force repeated re-grow/rehash
cycles in the inner `HashJoin`, regressing the build path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `04123_spilling_hash_join_peak_under_cap`. The test runs an
INNER JOIN whose right-side build payload (~140 MiB of random strings)
is well above `max_bytes_before_external_join`, with a strict
`max_memory_usage` that previously could have been exceeded by the
combination of statistics-driven preallocation, the SpillingHashJoin
threshold check running after the inner doubling, and the GraceHashJoin
not yet enforcing `max_bytes_before_external_join`. The query must now
complete on both the single-thread `hash` and the concurrent
`parallel_hash` paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
return chosen_join->addBlockToJoin(block, check_limits);

return true;
return concurrent_join->addBlockToJoin(block, check_limits);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving the spill check to pre-insert avoids the resize-time OOM, but removing the post-insert check introduces a missed-switch case: if the last build block takes memory from below threshold / 2 to above it, SpillingHashJoin never switches because there is no next addBlockToJoin call.

Concrete trace: with max_bytes_before_external_join = 80 MiB, suppose getTotalByteCount() is 39 MiB before the final block. Pre-check passes (39 * 2 < 80). After inserting the final block, usage can be 60+ MiB (or even above 80 MiB), but we return immediately and onBuildPhaseFinish promotes the in-memory join without re-checking the threshold.

Please keep the pre-check and add a post-insert re-check (or an equivalent check in onBuildPhaseFinish) so terminal blocks cannot bypass spilling.

alexey-milovidov and others added 2 commits May 2, 2026 05:33
The earlier scale (1M rows × 140-byte strings, `max_memory_usage = 200Mi`,
`max_bytes_before_external_join = 80Mi`) did not actually trip
`MEMORY_LIMIT_EXCEEDED` on master: peak stayed around 175 MiB and the
hash table never reached the spill threshold, so the test passed
on the unfixed code as well as the fixed one.

Bugfix validation flagged this as "Failed to reproduce the bug":
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=103838&sha=1c320f5a08784eb2c657ac12282d0c94268f3f89&name_0=PR&name_1=Bugfix%20validation%20%28functional%20tests%29

PR: #103838

The new scale (3M integer keys, `max_memory_usage = 256Mi`,
`max_bytes_before_external_join = 64Mi`) hits the actual doubling pathology:
on master the build-side hash table doubles past the threshold and tries to
allocate a `~515 MiB` chunk, exceeding `max_memory_usage`, while on the
fixed code `SpillingHashJoin` switches to `GraceHashJoin` before the
doubling and the peak stays around `~150 MiB`.

Verified locally against the master binary at the merge base
(`b7507c4b7b3`) and the PR binary for SHA `1c320f5`:

  master, hash, 3M:           fails with `659.49 MiB` peak
  master, parallel_hash, 3M:  fails with `663.66 MiB` peak
  fixed,  hash, 3M:           passes with `149.11 MiB` peak
  fixed,  parallel_hash, 3M:  passes with `156.12 MiB` peak

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The proactive pre-insert check in `SpillingHashJoin::addBlockToJoin` only
fires on a *subsequent* call. If the very last build block crosses
`max_bytes_before_external_join` (a single huge block, or a doubling
that takes the table from below `threshold/2` to above `threshold`),
the post-insert state is over the cap and `onBuildPhaseFinish` promotes
the in-memory join as-is instead of switching to `GraceHashJoin`.

Add a final check in `onBuildPhaseFinish`: if the build-phase total
already exceeds the threshold, switch to `GraceHashJoin` instead of
promoting `HashJoin` / `ConcurrentHashJoin`. This restores the
"`max_bytes_before_external_join` decides whether to spill" contract
for the terminal-block edge case.

Raised in PR review:
#103838

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/Interpreters/GraceHashJoin.cpp Outdated
{
static constexpr size_t MIN_THRESHOLD_FOR_AUTOMATIC_SPILL = 64 * 1024 * 1024;
const auto external_join_threshold = table_join->maxBytesBeforeExternalJoin();
if (external_join_threshold >= MIN_THRESHOLD_FOR_AUTOMATIC_SPILL && total_bytes * 2 >= external_join_threshold)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This introduces a hard cutoff at 64 MiB, so max_bytes_before_external_join is no longer enforced for any lower value.

Concrete case: with max_bytes_before_external_join = 32 MiB, SpillingHashJoin can switch to GraceHashJoin, but this check stays disabled (32 < 64), so the in-memory bucket can keep growing past 32 MiB until some other limit is hit. That breaks the setting contract and reopens the same memory-peak issue for low-but-valid thresholds.

Please remove the fixed 64 MiB gate and apply the threshold check for any positive max_bytes_before_external_join (or make the lower bound configurable as a setting if we really need one).

alexey-milovidov and others added 6 commits May 2, 2026 20:54
The previous 64 MiB hardcoded floor on the
`max_bytes_before_external_join` check inside
`GraceHashJoin::hasMemoryOverflow` silently disabled the cap for any
lower value. A user setting `max_bytes_before_external_join = 32 MiB`
would still see in-memory buckets grow well past the configured limit,
breaking the setting contract.

The floor exists because the rehash only halves a bucket; for total data
`D` and threshold `T` we need ~`log2(D / T)` rehashes per bucket
(capped by `grace_hash_join_max_buckets`). With pathologically small
thresholds (e.g. 1 KiB in `03915_spilling_hash_join`) the per-row hash
cell overhead (~32 bytes) alone exceeds `T/2` before any rehash can
help, forcing one rehash per added block and quickly exhausting the
bucket cap.

1 MiB is small enough to enforce the contract for any practical
configuration (the AI-flagged 32 MiB case is now respected) while still
leaving low test thresholds unaffected. Existing tests
`03915_spilling_hash_join`, `04061_spilling_hash_join_overflow_limits`,
and `04123_spilling_hash_join_peak_under_cap` continue to pass.

Add `04146_spilling_hash_join_low_threshold` as a regression for a
threshold (`16 MiB`) below the previous floor: with
`max_memory_usage = 160 MiB`, the unfixed code peaked at 167 MiB during
the doubling step, while the fixed code stays around 131 MiB.

Raised in PR review:
#103838

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`04146_spilling_hash_join_low_threshold` was failing with `MEMORY_LIMIT_EXCEEDED`
during the build phase even after the `SpillingHashJoin` pre-insert check landed.
The same problem the `SpillingHashJoin` fix solved for the wrapper still existed
one level down: `GraceHashJoin::addBlockToJoinImpl` checked `hasMemoryOverflow`
*after* `hash_join->addBlockToJoin` returned, so the in-memory hash table inside
the bucket could double its buffer past the cap before the rehash decision ran.

Move the check before the inner `addBlockToJoin` using projected
`(rows, allocatedBytes)`. When the projection already triggers `hasMemoryOverflow`,
skip the inner add and include the un-added block in the rescatter pass so the
data is preserved and redistributed to the new bucket layout.

CI: #103838
The CI runs with random per-test settings injection. The test caps
memory at `max_memory_usage = 160Mi` to make the absence of the
`64 MiB` floor in `GraceHashJoin::hasMemoryOverflow` visible: with the
fix in place, the build side peaks at around `131 MiB`, leaving a
`~29 MiB` margin. That margin is too narrow to absorb the memory
overhead added by random settings such as
`filesystem_prefetch_max_memory_usage 32Mi`,
`temporary_files_buffer_size`, and friends, so the cap was being
tripped during the build phase even with the fix applied.

The test failed in CI: #103838
with `MEMORY_LIMIT_EXCEEDED` at peak `~160 MiB` across multiple shards.

Add `no-random-settings, no-random-merge-tree-settings` so the regression
remains deterministic. The setting tag is strictly necessary here because
the very setting under test (`max_bytes_before_external_join`) is among
those randomized, and the tight memory cap is intentional.
Comment thread src/Interpreters/GraceHashJoin.cpp Outdated
/// will actually hold (which adds cell overhead and load-factor padding), but it is
/// enough for the pre-check to fire one rehash earlier than the post-check would.
const size_t projected_keys = prev_keys_num + current_block.rows();
const size_t projected_bytes = pre_total_bytes + current_block.allocatedBytes();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

projected_bytes = pre_total_bytes + current_block.allocatedBytes() is only a lower bound and can miss the same resize-time OOM pattern this change is trying to prevent.

Concrete trace with max_bytes_before_external_join = 80 MiB:

  1. Before insert: pre_total_bytes = 39 MiB (39 * 2 < 80), bucket is already near maxFill.
  2. Incoming current_block is tiny (e.g. one row), so projected_bytes stays < 40 MiB, and the pre-check does not rehash.
  3. hash_join->addBlockToJoin crosses maxFill, triggers X -> 2X growth, and can hit the transient ~3X peak inside the allocator before the post-insert check runs.

So a tiny block can still bypass the pre-check and fail at allocation time. Please base the pre-check on current hash-table occupancy/capacity growth risk (for example, a getTotalByteCount() * 2 >= threshold style check, as done in SpillingHashJoin) instead of current_block.allocatedBytes().

The previous pre-check used `pre_total_bytes + current_block.allocatedBytes()`
fed into `hasMemoryOverflow`, which has two problems:

1. `block.allocatedBytes()` is a misleading lower bound on the in-memory hash
   table cost: cell overhead, load-factor padding, and the resize-time peak
   are not represented. A tiny block can leave a near-full bucket, with the
   projected sum still under the cap, then `addBlockToJoin` triggers an X→2X
   buffer doubling and OOMs in the allocator before the post-add check runs.
2. `hasMemoryOverflow` also performs `softCheck` against `max_bytes_in_join`,
   so the projected check fired against very small soft limits used in tests
   (e.g. `max_bytes_in_join = '10K'` in `02274_full_sort_join_nodistinct`).
   Combined with `block.allocatedBytes()` rounding, the pre-check rehashed
   on every block and quickly hit `grace_hash_join_max_buckets`.

The pre-check now mirrors `SpillingHashJoin::addBlockToJoin`: rehash when
the existing bucket alone is already past half of `max_bytes_before_external_join`,
gated by the same 1 MiB floor (`MIN_THRESHOLD_FOR_AUTOMATIC_SPILL`, lifted to
file scope so both call sites share it). Soft limits stay enforced by the
post-insert call to `hasMemoryOverflow`, so behaviour is unchanged when
`max_bytes_before_external_join` is unset.

CI report (failure on `02274_full_sort_join_nodistinct`):
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=103838&sha=0025061ddbeaca0d44371f337595d27e86064c17&name_0=PR&name_1=Stateless%20tests%20%28amd_debug%2C%20distributed%20plan%2C%20s3%20storage%2C%20parallel%29
PR: #103838

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
/// buffer while the old one is still alive, transiently using 3X memory. We must trigger the
/// switch BEFORE the inner `addBlockToJoin` runs (and possibly doubles the buffer); a check
/// that runs after the call would race with the doubling and observe the OOM only as an
/// allocator exception. Threshold is half of `max_bytes_before_external_join` so that after

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This proactive pre-insert threshold check changes the semantics from "we already exceeded memory" to "we are approaching the spill threshold".

But switchToGraceHashJoin still logs Memory limit exceeded ... switching to GraceHashJoin.

That message is now misleading in normal (non-exceptional) switching paths and can send incident triage in the wrong direction. Please update the log text to something like Memory spill threshold reached.

@antaljanosbenjamin antaljanosbenjamin self-assigned this May 4, 2026
Comment thread src/Interpreters/GraceHashJoin.cpp Outdated
/// any rehash can help, forcing one rehash per added block and quickly hitting the bucket
/// cap. 1 MiB enforces the contract for any practical configuration while still leaving low
/// test thresholds (e.g. 1 KiB in `03915_spilling_hash_join`) unaffected.
static constexpr size_t MIN_THRESHOLD_FOR_AUTOMATIC_SPILL = 1024 * 1024;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the fact that we are adding this artificial limit just to keep the tests working. Basically the tests now test a different behavior then in real world use cases. I would much rather prefer changing those tests.

Comment thread src/Interpreters/ConcurrentHashJoin.cpp Outdated
/// `SpillingHashJoin` ever runs its threshold check. We still want preallocation - just bounded by
/// the memory budget. We aim for about half of the threshold so that the cap itself plus the live
/// data plus the conversion peak still fit under it.
const auto external_join_threshold = hash_join.getTableJoin().maxBytesBeforeExternalJoin();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be better if we would separate max_bytes_before_external_join and maximum size that concurrent hash join can reserve. E.g.: join_algorithm = 'parallel_hash', max_bytes_before_external_join = 10MB will cause concurrent hash performance to plummet. Let's make max reserve size as a parameter of ConcurrentHash join.

Comment thread src/Interpreters/GraceHashJoin.cpp Outdated
/// on `MIN_THRESHOLD_FOR_AUTOMATIC_SPILL` for why a small floor is required.
if (!has_overflow)
{
const auto external_join_threshold = table_join->maxBytesBeforeExternalJoin();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here as for concurrent hash join: join_algorithm = 'grace_hash', max_bytes_before_external_join = 10MB affects grace hash join when it shouldn't.

The pre-insert threshold check in `addBlockToJoin` switches before any
memory limit is actually exceeded, so the previous "Memory limit exceeded
... switching to GraceHashJoin" log line was misleading for normal
threshold-triggered switches and could mislead incident triage. Reword to
"Memory spill threshold reached ..." to reflect the new semantics.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/Interpreters/GraceHashJoin.cpp Outdated
/// rehashing buckets early prevents that doubling from exceeding the cap. See the comment
/// on `MIN_THRESHOLD_FOR_AUTOMATIC_SPILL` for why a small floor is required.
if (!has_overflow)
{

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

max_bytes_before_external_join is documented and serialized as applying to hash/parallel_hash auto-conversion paths, not to explicit join_algorithm = 'grace_hash' (src/Core/Settings.cpp and src/Processors/QueryPlan/QueryPlanSerializationSettings.cpp).

This new check runs inside GraceHashJoin unconditionally, so users who intentionally picked grace_hash now get different spill behavior whenever max_bytes_before_external_join is set (including inherited profile settings). That is a behavior-contract break and can cause unexpected extra spilling/perf regressions in existing deployments.

Please gate this branch to the wrapper path only (e.g. only when GraceHashJoin is used as a spill target from SpillingHashJoin) instead of applying it to standalone grace_hash.

`max_bytes_before_external_join` was being read directly inside
`ConcurrentHashJoin::reserveSpaceInHashMaps` and `GraceHashJoin::hasMemoryOverflow`,
which meant the new cap also fired for users who explicitly picked
`join_algorithm = 'parallel_hash'` or `'grace_hash'` standalone. The setting is
documented as governing the auto-conversion path (`join_algorithm = 'auto'` or
`'hash'` / `'parallel_hash'` with the threshold set), so applying it inside
those joins unconditionally was a behavior-contract break - a `parallel_hash`
build with a low `max_bytes_before_external_join` would lose its statistics-
driven preallocation, and a standalone `grace_hash` would change spill timing.

Pass the threshold explicitly via a new constructor parameter
`external_join_threshold_` on both joins, set only when `SpillingHashJoin`
constructs them. When 0 (standalone use), the new caps are skipped and the
prior behavior is preserved.

Also remove `MIN_THRESHOLD_FOR_AUTOMATIC_SPILL`. The 1 MiB floor existed
solely so that the unconditional check would not break tests with very low
thresholds; now that the check is wrapper-only, the floor is no longer needed.
Update `03915_spilling_hash_join` to use a realistic 100 KiB threshold for
the single-thread spill block (the old `1000`-byte value is below any
reasonable production setting and kept the suite passing only because of the
floor); the spilled queries now also produce delayed blocks, matching the
concurrent-path tests.
@clickhouse-gh

clickhouse-gh Bot commented May 5, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.10% 84.00% -0.10%
Functions 91.10% 91.10% +0.00%
Branches 76.60% 76.60% +0.00%

Changed lines: 98.71% (153/155) | lost baseline coverage: 1 line(s) · Uncovered code

Full report · Diff report

@antaljanosbenjamin antaljanosbenjamin added this pull request to the merge queue May 5, 2026
Merged via the queue into master with commit 9a0a52e May 5, 2026
165 checks passed
@antaljanosbenjamin antaljanosbenjamin deleted the fix-spilling-join-peak branch May 5, 2026 09:14
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR label May 5, 2026
robot-ch-test-poll added a commit that referenced this pull request May 5, 2026
Cherry pick #103838 to 26.4: Keep `max_bytes_before_external_join` peak under the configured cap
robot-clickhouse added a commit that referenced this pull request May 5, 2026
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label May 5, 2026
@robot-clickhouse robot-clickhouse added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label May 5, 2026
alexey-milovidov added a commit that referenced this pull request May 5, 2026
Backport #103838 to 26.4: Keep `max_bytes_before_external_join` peak under the configured cap
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-bugfix Pull request with bugfix, not backported by default pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR pr-synced-to-cloud The PR is synced to the cloud repo v26.4-must-backport

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants