antaljanosbenjamin · 2026-02-23T21:38:04Z

Resolves #83677

Changelog category (leave one):

Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add automatic spilling to hash and parallel hash joins by converting them to grace hash join when memory limit is reached. This behavior is controlled by max_bytes_before_external_join.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Note

Medium Risk
Introduces a new join implementation that can switch from in-memory hash joins to disk-spilling GraceHashJoin at runtime, affecting core query execution and concurrency paths. Risk is mitigated by being opt-in via max_bytes_before_external_join plus extensive pipeline/test coverage, but regressions could impact join correctness/performance and read-in-order queries.

Overview
Adds opt-in automatic spilling for hash/parallel_hash/default/auto joins via new setting max_bytes_before_external_join; when the right side exceeds the threshold, the join converts to GraceHashJoin to spill to disk (new SpillingHashJoin).
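
The threshold-driven switch described above can be sketched as follows. This is a minimal, single-threaded illustration under stated assumptions, not the code from this PR: `RightBlock`, `InMemoryJoinSketch`, `DiskBackedJoinSketch` and `SwitchingJoinSketch` are hypothetical stand-ins, and the only behavior taken from the description is that a threshold of 0 disables switching and that exceeding the threshold moves the collected right-side data to the spilling algorithm.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified stand-ins; none of these are ClickHouse classes.
struct RightBlock
{
    std::size_t bytes;
};

struct InMemoryJoinSketch
{
    std::vector<RightBlock> blocks;
    std::size_t total_bytes = 0;

    void addBlock(const RightBlock & block)
    {
        blocks.push_back(block);
        total_bytes += block.bytes;
    }
};

struct DiskBackedJoinSketch
{
    std::size_t blocks_received = 0;
    void addBlock(const RightBlock &) { ++blocks_received; }
};

class SwitchingJoinSketch
{
public:
    explicit SwitchingJoinSketch(std::size_t max_bytes_)
        : max_bytes(max_bytes_), in_memory(std::make_unique<InMemoryJoinSketch>())
    {
    }

    void addBlock(const RightBlock & block)
    {
        if (spilling)
        {
            // Already switched: every later block goes to the spilling algorithm.
            spilling->addBlock(block);
            return;
        }
        in_memory->addBlock(block);
        // A threshold of 0 means automatic switching is disabled.
        if (max_bytes != 0 && in_memory->total_bytes > max_bytes)
            switchToSpilling();
    }

    std::string activeName() const { return spilling ? "spilling" : "in-memory"; }
    std::size_t switchCount() const { return switch_count; }
    std::size_t spilledBlocks() const { return spilling ? spilling->blocks_received : 0; }

private:
    void switchToSpilling()
    {
        assert(!spilling);
        auto target = std::make_unique<DiskBackedJoinSketch>();
        // Hand over everything collected so far, then release the in-memory state
        // so the data lives in exactly one algorithm.
        for (const auto & block : in_memory->blocks)
            target->addBlock(block);
        in_memory.reset();
        spilling = std::move(target);
        ++switch_count;
    }

    std::size_t max_bytes;
    std::unique_ptr<InMemoryJoinSketch> in_memory;
    std::unique_ptr<DiskBackedJoinSketch> spilling;
    std::size_t switch_count = 0;
};
```

With a threshold of 100 bytes, two 60-byte blocks trigger exactly one switch, and both blocks end up in the spilling side. The counter plays the role that a profile event such as JoinSpilledToDisk plays in the PR.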

Updates join selection (analyzer + planner) to instantiate SpillingHashJoin, extends IJoin with keepLeftPipelineInOrder() and canProcessNonJoinedBlocksInParallel(), and adjusts pipeline wiring so delayed-join blocks and parallel non-joined streams can coexist safely.

Improves correctness/observability around spilling by enabling per-slot block extraction for ConcurrentHashJoin, filtering released blocks to avoid duplication, adding new join profile events (non-joined/delayed block counts + spill count), fixing a couple dictionary exception messages, and adding/adjusting stateless tests (including a new 03915_spilling_hash_join suite and a read-in-order negative case).

Written by Cursor Bugbot for commit db5f199.

Version info

Merged into: 26.5.1.140
Backported to: 26.4.1.1125

clickhouse-gh · 2026-02-23T21:38:51Z

Workflow [PR], commit [c40e404]

Summary: ✅

AI Review

Summary

This PR introduces SpillingHashJoin to auto-switch hash/parallel_hash joins to GraceHashJoin when max_bytes_before_external_join is exceeded, plus planner/pipeline wiring and tests. The main risk I found is a performance regression: enabling auto-spill can disable clone-dependent join optimizations even when the query never spills.

PR Metadata

⚠️ Changelog category: Improvement matches the change.
⚠️ Changelog entry: present and user-readable; it correctly describes behavior and controlling setting.
⚠️ Documentation requirement: this is a user-facing feature, but Documentation is written is still unchecked.

Findings

⚠️ Majors

[src/Interpreters/SpillingHashJoin.h:46] SpillingHashJoin does not override isCloneSupported/clone, so when max_bytes_before_external_join is enabled, clone-dependent optimizations (e.g. join swap and outer->inner conversion in legacy optimizer) are skipped even for in-memory executions.
- Suggested fix: implement clone support for the in-memory path (or explicitly gate and document optimization loss when auto-spill is enabled).

ClickHouse Rules

Item	Status	Notes
Deletion logging	➖
Serialization versioning	➖
Core-area scrutiny	✅
No test removal	✅
Experimental gate	➖
No magic constants	✅
Backward compatibility	✅
`SettingsChangesHistory.cpp`	✅
PR metadata quality	⚠️	User-facing feature but documentation checkbox is still unchecked
Safe rollout	⚠️	Auto-spill is opt-in, but it currently suppresses clone-dependent join optimizations
Compilation time	✅
No large/binary files	✅

Performance & Safety

⚠️ Auto-spill wrapping can reduce plan quality by disabling existing join optimizations due to missing clone support in SpillingHashJoin.

Final Verdict

Status: ⚠️ Request changes
Minimum required actions:
1. Add (or explicitly justify/gate) clone behavior for SpillingHashJoin so enabling max_bytes_before_external_join does not silently disable clone-dependent join optimizations.
2. Update PR metadata/doc status for this user-facing feature.

antaljanosbenjamin · 2026-02-23T22:00:38Z

        throw Exception(ErrorCodes::LOGICAL_ERROR, "QueryPipeline is already completed");
 }

-static void checkSource(const ProcessorPtr & source, bool can_have_totals)


These functions were simply unused, so I removed them.

alexey-milovidov

Do not merge until investigating and fixing the bug in transactions:

Logical error: 'txn->getState() != MergeTreeTransaction::COMMITTED' (STID: 2508-2b69)

antaljanosbenjamin · 2026-02-24T13:52:04Z

I fixed the issue that you introduced with server side fuzzing. Now please unblock the PR.

This way we still do the conversion on multiple threads while also having all allocated memory in a single place and not shared in two join algorithms (concurrent hash and grace hash).

antaljanosbenjamin · 2026-02-25T22:37:11Z

optimizeJoinLegacy depends on HashJoin; check whether that is a problem.

antaljanosbenjamin · 2026-02-25T22:53:46Z

+    M(JoinNonJoinedTransformBlockCount, "Number of blocks emitted by NonJoinedBlocksTransform.", ValueType::Number) \
+    M(JoinNonJoinedTransformRowCount, "Number of non-joined rows emitted by NonJoinedBlocksTransform.", ValueType::Number) \
+    M(JoinDelayedJoinedTransformBlockCount, "Number of blocks emitted by DelayedJoinedBlocksWorkerTransform.", ValueType::Number) \
+    M(JoinDelayedJoinedTransformRowCount, "Number of rows emitted by DelayedJoinedBlocksWorkerTransform.", ValueType::Number) \
+    M(JoinSpilledToDisk, "Number of times a hash join was switched to GraceHashJoin due to memory limit.", ValueType::Number) \


I am not 100% happy about these names and metrics, but this is the best I could come up with.

cv4g · 2026-03-12

Would it be better to be more explicit about the emitter of the metric, i.e. the SpillingHashJoin? If yes, an alternative could be to use SpillingHashJoinSwitchedToGraceJoin instead of JoinSpilledToDisk. Is this what you are after?

antaljanosbenjamin · 2026-02-25T22:55:41Z

+    /// The decision should be done at latest in onBuildPhaseFinish, after that the returned value should not change.
+    /// This is important for SpillingHashJoin, which can change algorithms runtime, and parallel non-joined blocks
+    /// processing depends on the algorithm used.
+    virtual bool canProcessNonJoinedBlocksInParallel() const { return supportParallelNonJoinedBlocksProcessing(); }


I am also not happy with the naming here: the difference between support and can is not clear.

The first one is supposed to mean whether the corresponding processors (NonJoinedBlocksTransform) should be included in the pipeline at all.

The second one is supposed to mean whether those processors can actually be used during execution. The decision is made in onBuildPhaseFinish at the latest.

cv4g · 2026-03-17

What about adding the term Now or RightNow to make it more clear that it's a runtime decision that may change: canProcessNonJoinedBlocksInParallelNow()

Or differentiating it from supportParallelNonJoinedBlocksProcessing() by emphasizing that it may request enabling/disabling the NonJoinedBlocksTransform at runtime: isParallelNonJoinedProcessingEnabled()
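
To make the distinction discussed in this thread concrete, here is a small sketch of the two questions side by side. The method names come from the diff above; everything else (`JoinSketch`, `SwitchableJoinSketch`, `markSwitchedToGrace`, and the assumption that the switched-to algorithm cannot process non-joined blocks in parallel) is hypothetical and only serves the illustration.

```cpp
#include <cassert>

// Hypothetical reduction of the interface; not the real IJoin.
struct JoinSketch
{
    virtual ~JoinSketch() = default;

    // Plan-time question: should the parallel non-joined processors be wired into the pipeline?
    virtual bool supportParallelNonJoinedBlocksProcessing() const { return false; }

    // Run-time question: may those processors actually be used? By default it follows the plan-time answer.
    virtual bool canProcessNonJoinedBlocksInParallel() const { return supportParallelNonJoinedBlocksProcessing(); }

    virtual void onBuildPhaseFinish() {}
};

class SwitchableJoinSketch : public JoinSketch
{
public:
    // The processors are always wired in, because the join may stay in memory.
    bool supportParallelNonJoinedBlocksProcessing() const override { return true; }

    // Assumption for this sketch: after switching, parallel processing is no longer possible.
    bool canProcessNonJoinedBlocksInParallel() const override { return !switched_to_grace; }

    // Returns false if the request comes too late, that is, after the build phase has finished.
    bool markSwitchedToGrace()
    {
        if (build_finished)
            return false;
        switched_to_grace = true;
        return true;
    }

    // From here on the run-time answer must stay stable.
    void onBuildPhaseFinish() override { build_finished = true; }

private:
    bool switched_to_grace = false;
    bool build_finished = false;
};
```

In this sketch the plan-time answer never changes, while the run-time answer can flip once and is frozen when the build phase finishes, which is the contract stated in the comment on canProcessNonJoinedBlocksInParallel.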

clickhouse-gh · 2026-04-17T14:55:38Z

+SET max_threads = 6;
+SET max_bytes_before_external_join = 0;

 SELECT * FROM (


This change removes the GLOBAL ANY INNER JOIN ... LIMIT 0 assertion from this test and effectively runs GLOBAL ANY LEFT JOIN twice. That weakens coverage for empty-right-table semantics in sharded execution (LEFT and INNER exercise different code paths/invariants).

Please keep an INNER case in this test (or add a separate deterministic test) so we do not regress coverage while adapting it for spilling.

clickhouse-gh · 2026-04-17T16:11:42Z

    }

+    /// Notify the join that the query plan requires left-side read-in-order preservation.
+    /// SpillingHashJoin overrides this to forbid switching to GraceHashJoin at runtime.


The new IJoin::keepLeftPipelineInOrder contract says SpillingHashJoin "overrides this to forbid switching to GraceHashJoin at runtime", but SpillingHashJoin currently does not override keepLeftPipelineInOrder.

This creates a mismatch between API contract and implementation and can mislead future call sites that rely on this hook to preserve in-order semantics.

Please either:

- implement SpillingHashJoin::keepLeftPipelineInOrder with the intended behavior, or
- adjust the base-interface comment to reflect current behavior.
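
For illustration, the first option could look roughly like the sketch below. It is a hypothetical reduction, not the merged implementation: `JoinHooksSketch`, `OrderAwareSwitchingJoin` and `addRightSideBytes` are invented names, and the only idea taken from the comment above is that calling the hook forbids the runtime switch.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical base with the hook as a no-op, mirroring the shape of the interface comment.
struct JoinHooksSketch
{
    virtual ~JoinHooksSketch() = default;
    virtual void keepLeftPipelineInOrder() {}
};

class OrderAwareSwitchingJoin : public JoinHooksSketch
{
public:
    explicit OrderAwareSwitchingJoin(std::size_t max_bytes_) : max_bytes(max_bytes_) {}

    // The plan needs left-side order preserved, so switching to a spilling algorithm is forbidden.
    void keepLeftPipelineInOrder() override
    {
        // Assumption of this sketch: the hook is called before any right-side data arrives.
        assert(!switched);
        switching_allowed = false;
    }

    void addRightSideBytes(std::size_t bytes)
    {
        right_side_bytes += bytes;
        if (switching_allowed && max_bytes != 0 && right_side_bytes > max_bytes)
            switched = true;
    }

    bool hasSwitched() const { return switched; }

private:
    std::size_t max_bytes;
    std::size_t right_side_bytes = 0;
    bool switching_allowed = true;
    bool switched = false;
};
```

A join that received the hook stays in memory even when the threshold is exceeded, so the usual memory limits would still apply to it; a join that did not receive the hook switches as before.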

clickhouse-gh · 2026-04-17T20:50:00Z

    DECLARE(UInt64, archive_adaptive_buffer_max_size_bytes, 8 * DBMS_DEFAULT_BUFFER_SIZE, R"(
 Limits the maximum size of the adaptive buffer used when writing to archive files (for example, tar archives)", 0) \
+    DECLARE(UInt64, max_bytes_before_external_join, 0, R"(
+If set to a non-zero value and `join_algorithm` is `hash`, `parallel_hash`, `default`, or `auto`, the hash join will automatically be converted to grace hash join to enable spilling to disk when the right-side data exceeds this many bytes. When set to 0 (default), automatic spilling is disabled. It prevents read in order through join optimization.


The setting description sentence "It prevents read in order through join optimization" is a bit unclear/awkward.

Could we rephrase to explicitly name the optimization, e.g. "Enabling this setting disables the read_in_order_through_join optimization"? This will make the user-facing behavior easier to understand.

clickhouse-gh · 2026-04-27T04:55:11Z

+    std::string getName() const override;
+    const TableJoin & getTableJoin() const override { return *table_join; }
+
+    bool addBlockToJoin(const Block & block, bool check_limits) override;


SpillingHashJoin does not override IJoin::addBlockToJoin(const Block &, size_t num_rows, bool), so FillingRightJoinSideTransform calls the base overload and num_rows is dropped.

This regresses correctness for right blocks with zero columns but non-zero rows (the exact PREWHERE/CROSS JOIN case handled in HashJoin::addBlockToJoin(const Block &, size_t, bool)). In the in-memory path, SpillingHashJoin::addBlockToJoin(const Block &, bool) forwards to HashJoin::addBlockToJoin(const Block &, bool), which uses Block::rows() and treats such blocks as 0 rows.

Please override the num_rows overload in SpillingHashJoin and propagate num_rows to the wrapped join (HashJoin/ConcurrentHashJoin) so row counts stay correct when the right block has no columns.
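
The effect is easy to reproduce in a reduced model. In the sketch below all types are hypothetical stand-ins (`BlockSketch`, `JoinBaseSketch`, `CountingJoin`, `LossyWrapper`, `ForwardingWrapper`, `feedRightSide`); only the overload shapes mirror the ones named in this comment. A wrapper that overrides just the two-argument overload loses the explicit row count for a block without columns, while a wrapper that overrides both keeps it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical block: the row count is derived from the first column, so zero columns means zero rows.
struct BlockSketch
{
    std::vector<std::vector<int>> columns;
    std::size_t rows() const { return columns.empty() ? 0 : columns.front().size(); }
};

struct JoinBaseSketch
{
    virtual ~JoinBaseSketch() = default;
    virtual bool addBlockToJoin(const BlockSketch & block, bool check_limits) = 0;

    // Base behavior of the explicit-count overload: num_rows is dropped.
    virtual bool addBlockToJoin(const BlockSketch & block, std::size_t /*num_rows*/, bool check_limits)
    {
        return addBlockToJoin(block, check_limits);
    }
};

// Stand-in for the wrapped join: it honors an explicit row count.
struct CountingJoin : JoinBaseSketch
{
    std::size_t total_rows = 0;

    bool addBlockToJoin(const BlockSketch & block, bool check_limits) override
    {
        return addBlockToJoin(block, block.rows(), check_limits);
    }

    bool addBlockToJoin(const BlockSketch &, std::size_t num_rows, bool) override
    {
        total_rows += num_rows;
        return true;
    }
};

// Overrides only the two-argument overload, so the explicit count never reaches the inner join.
struct LossyWrapper : JoinBaseSketch
{
    using JoinBaseSketch::addBlockToJoin;
    CountingJoin inner;

    bool addBlockToJoin(const BlockSketch & block, bool check_limits) override
    {
        return inner.addBlockToJoin(block, check_limits);
    }
};

// Overrides both overloads and propagates num_rows.
struct ForwardingWrapper : JoinBaseSketch
{
    CountingJoin inner;

    bool addBlockToJoin(const BlockSketch & block, bool check_limits) override
    {
        return inner.addBlockToJoin(block, check_limits);
    }

    bool addBlockToJoin(const BlockSketch & block, std::size_t num_rows, bool check_limits) override
    {
        return inner.addBlockToJoin(block, num_rows, check_limits);
    }
};

// Stand-in for the transform that fills the right side: it always passes the row count explicitly.
inline bool feedRightSide(JoinBaseSketch & join, const BlockSketch & block, std::size_t num_rows)
{
    return join.addBlockToJoin(block, num_rows, /*check_limits=*/true);
}
```

Feeding a block with no columns and five rows through both wrappers leaves the lossy one at 0 rows and the forwarding one at 5.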

clickhouse-gh · 2026-04-28T02:19:22Z

+/// Because hasDelayedBlocks returns true, the read-in-order-through-join optimisation
+/// in optimizeReadInOrder.cpp will NOT propagate through SpillingHashJoin (same as
+/// GraceHashJoin), since spilling may reorder rows.
+class SpillingHashJoin final : public IJoin


Enabling max_bytes_before_external_join wraps hash joins in SpillingHashJoin, but this class does not override isCloneSupported / clone.

That makes join-clone-dependent optimizations silently stop applying (e.g. optimizeJoinLegacy and tryConvertOuterJoinToInnerJoinLegacy both bail out when isCloneSupported is false), even in the common case where the query stays in-memory and never spills.

Please either forward clone support to the active in-memory implementation (and document behavior after switching), or explicitly gate/disable this optimization loss to avoid surprising performance regressions when users only enable auto-spill protection.
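
One possible shape of the first option, forwarding clone support while the join is still in memory, is sketched below. The signatures are deliberately simplified and hypothetical (the sketch `clone` takes no arguments), as are `CloneableJoinSketch`, `InMemoryCloneable`, `ForwardingSpillWrapper` and `tryCloneBasedOptimization`; the sketch only shows the control flow of an optimization that bails out when isCloneSupported is false.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <utility>

// Hypothetical reduction of the clone-related part of the join interface.
struct CloneableJoinSketch
{
    virtual ~CloneableJoinSketch() = default;
    virtual bool isCloneSupported() const { return false; }
    virtual std::shared_ptr<CloneableJoinSketch> clone() const
    {
        throw std::logic_error("clone is not supported");
    }
};

struct InMemoryCloneable : CloneableJoinSketch
{
    bool isCloneSupported() const override { return true; }
    std::shared_ptr<CloneableJoinSketch> clone() const override
    {
        return std::make_shared<InMemoryCloneable>();
    }
};

class ForwardingSpillWrapper : public CloneableJoinSketch
{
public:
    explicit ForwardingSpillWrapper(std::shared_ptr<CloneableJoinSketch> active_) : active(std::move(active_)) {}

    // Clone is only offered while the wrapped in-memory join is still active.
    bool isCloneSupported() const override { return !switched && active->isCloneSupported(); }

    std::shared_ptr<CloneableJoinSketch> clone() const override
    {
        if (!isCloneSupported())
            throw std::logic_error("clone is not supported after switching");
        // The clone is wrapped again so the copy keeps the auto-spill protection.
        return std::make_shared<ForwardingSpillWrapper>(active->clone());
    }

    void markSwitched() { switched = true; }

private:
    std::shared_ptr<CloneableJoinSketch> active;
    bool switched = false;
};

// Stand-in for a clone-dependent optimization: it bails out when clone is unavailable.
inline bool tryCloneBasedOptimization(std::shared_ptr<CloneableJoinSketch> & join)
{
    if (!join->isCloneSupported())
        return false;
    join = join->clone();
    return true;
}
```

With this arrangement the optimization still applies to a wrapper that never switched, and it is skipped only after a switch has actually happened.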

clickhouse-gh · 2026-04-28T05:11:19Z

LLVM Coverage Report

Changed lines: 81.10% (382/471)

Cherry pick #97813 to 26.4: Auto spilling join

Backport #97813 to 26.4: Auto spilling join

antaljanosbenjamin added 7 commits February 19, 2026 17:25

Initial implementation

4009dfd

Use HashJoin to collect blocks

762c0be

Use SpillableHashJoin for multithreaded queries

7a197c1

Use initial buckets settings for spillable hash join

99b2489

Merge remote-tracking branch 'origin/master'

2523c51

Implement parallel processing of non joined blocks for SpillingHashJoin

44916c9

Add some profile events

fe106b5

clickhouse-gh Bot added the pr-improvement Pull request with some product improvements label Feb 23, 2026

Add new setting to setting changes history

aed50bf

antaljanosbenjamin commented Feb 23, 2026

alexey-milovidov requested changes Feb 24, 2026

alexey-milovidov approved these changes Feb 24, 2026

clickhouse-gh Bot assigned alexey-milovidov Feb 24, 2026

antaljanosbenjamin mentioned this pull request Feb 24, 2026

Automatic spill-to-disk for joins #83677

Closed

antaljanosbenjamin and others added 5 commits February 25, 2026 14:22

Convert hash joins completely in addBlockToJoin

b4b4443

This way we still do the conversion on multiple threads while also having all allocated memory in a single place and not shared in two join algorithms (concurrent hash and grace hash).

Make setting non experimental

fbbe1b1

Randomize setting in tests

c8a7514

Move the settings changes to 26.3

1f5cd4a

Merge branch 'master' into auto-spilling-join

3e85814

antaljanosbenjamin added 6 commits February 25, 2026 22:51

Use more precise profile event names

07928e1

Handle runtime decision about parallel of non joined blocks properly

44f2b09

Small renames, more accurate comments

cf82f20

Proper use of temporary data

2f9d7a3

Unified printing of limit exceeded log

2c947c7

Fix some tests

539486a

antaljanosbenjamin commented Feb 25, 2026

antaljanosbenjamin added 3 commits April 17, 2026 11:46

Merge remote-tracking branch 'origin/master' into auto-spilling-join

bb249c5

Fix test

5757406

Mention incompatibility with read in order optimization

473a40b

clickhouse-gh Bot reviewed Apr 17, 2026

Add back accidentally removed queries

d54d7cf

clickhouse-gh Bot reviewed Apr 17, 2026

Fix test

6ca7f84

clickhouse-gh Bot reviewed Apr 17, 2026

antaljanosbenjamin added 3 commits April 27, 2026 04:45

Merge remote-tracking branch 'origin/master' into auto-spilling-join

2250dcf

Disable automatic spilling in pretty explain tests

12163a1

Move settings changes history to 26.5

5fa1df2

clickhouse-gh Bot reviewed Apr 27, 2026

Fix build

c40e404

clickhouse-gh Bot reviewed Apr 28, 2026

alexey-milovidov added this pull request to the merge queue Apr 28, 2026

alexey-milovidov added the v26.4-must-backport label Apr 28, 2026

Merged via the queue into master with commit c19c93a Apr 28, 2026
165 checks passed

alexey-milovidov deleted the auto-spilling-join branch April 28, 2026 07:50

robot-ch-test-poll added the pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR label Apr 28, 2026

robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 28, 2026

robot-ch-test-poll4 mentioned this pull request Apr 28, 2026

Cherry pick #97813 to 26.4: Auto spilling join #103667

Merged

robot-ch-test-poll1 added a commit that referenced this pull request Apr 28, 2026

Merge pull request #103667 from ClickHouse/cherrypick/26.4/97813

aa3a51f

Cherry pick #97813 to 26.4: Auto spilling join

robot-clickhouse added a commit that referenced this pull request Apr 28, 2026

Backport #97813 to 26.4: Auto spilling join

b73f4f5

robot-ch-test-poll1 mentioned this pull request Apr 28, 2026

Backport #97813 to 26.4: Auto spilling join #103680

Merged

robot-ch-test-poll1 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label Apr 28, 2026

alexey-milovidov added a commit that referenced this pull request Apr 29, 2026

Merge pull request #103680 from ClickHouse/backport/26.4/97813

64fe834

Backport #97813 to 26.4: Auto spilling join

alexey-milovidov mentioned this pull request May 1, 2026

Keep max_bytes_before_external_join peak under the configured cap #103838

Merged

1 task

antaljanosbenjamin mentioned this pull request May 5, 2026

Update SettingsChangesHistory.cpp #104097

Merged

1 task

7 participants