Transform JOIN hash table payload to row major by m-selmi · Pull Request #104884 · ClickHouse/ClickHouse · GitHub
Skip to content

Transform JOIN hash table payload to row major#104884

Open
m-selmi wants to merge 62 commits into
ClickHouse:masterfrom
m-selmi:transform-hash-table-payload-to-row-major
Open

Transform JOIN hash table payload to row major#104884
m-selmi wants to merge 62 commits into
ClickHouse:masterfrom
m-selmi:transform-hash-table-payload-to-row-major

Conversation

@m-selmi

@m-selmi m-selmi commented May 13, 2026

Copy link
Copy Markdown
Contributor

Resolves #97883

Introduce an alternative layout for hash table payload in hash joins, where fixed and contiguous columns under a certain size are transformed into one row-major block of data.
Before the change:

    col1 Int32     col2 String    col3 UInt8     col4 Float64
   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌────────────┐
   │   ....   │   │   ....   │   │   ....   │   │    ....    │
   │   ....   │   │   ....   │   │   ....   │   │    ....    │
   └──────────┘   └──────────┘   └──────────┘   └────────────┘
         └──────────────┴──────────────┴───────────────┘
                row N → col1[N], col2[N], col3[N], col4[N]

After the change:

   ROW STORE (col1 + col3 + col4, contiguous per row)     col2 String (columnar)
   ┌──────────┬──────────┬────────────┐                   ┌──────────┐
   │   col1   │   col3   │    col4    │ ← row N (record)  │   ....   │ ← col2[N]
   ├──────────┼──────────┼────────────┤                   ├──────────┤
   │   col1   │   col3   │    col4    │                   │   ....   │
   └──────────┴──────────┴────────────┘                   └──────────┘

Rational:

The row-major format allows the output construction to be a tight loop over each column, basically one pointer load and data copy:

for (size_t i = 0; i < n; ++i)
{
    const char * row_store_ptr = row_store_ptrs[i];
    col->insertData(row_store_ptr + value_offset, value_size);
}

In comparison the columnar version has to do more work: load columns[j], row_numbers[j], replicated_columns access, insertFrom ...
For the following query SELECT * FROM rs_probe_200m l JOIN rs_right_10m r ON (l.k % toUInt64(10000000 / 0.9)) = r.k FORMAT Null from 04054_hash_join_with_row_store.sql the perf counters for LazyOutput::buildOutput in AddedColumns look as follows:

┌─variant────────────────────────────┬─query_id─────────────────────────────┬─runtime_ms─┬───────instr─┬──────cycles─┬──────cmiss─┬──brmiss─┬───ipc─┬─cmiss_per_kinstr─┬─brmiss_per_kinstr─┐
│ row-major-enabled-multi-threaded   │ 9d2d6886-2ad7-430f-af2c-6c7cd32f3808 │       1621 │ 11393048228 │ 31974507197 │ 1470246364 │ 5892574 │ 0.356 │           129.05 │             0.517 │
│ row-major-enabled-multi-threaded   │ 56b33013-8bec-4d7a-a415-2a06beb14bf6 │       1633 │ 11547943445 │ 31874206968 │ 1484117664 │ 6265932 │ 0.362 │           128.52 │             0.543 │
│ row-major-enabled-multi-threaded   │ 65583fa7-9ce0-4426-95eb-977251d38f75 │       1708 │ 11338601601 │ 31825721703 │ 1468670383 │ 5921051 │ 0.356 │           129.53 │             0.522 │
│ row-major-disabled-multi-threaded  │ 5ff1b154-3c3f-40c6-bae9-f6cf0feafe81 │       2548 │ 15453833738 │ 58322156979 │ 2636461094 │ 9533242 │ 0.265 │            170.6 │             0.617 │
│ row-major-disabled-multi-threaded  │ 6e92c1d4-8432-4a79-a6cc-85e38232116d │       2580 │ 15400954730 │ 58683032344 │ 2623136775 │ 9455570 │ 0.262 │           170.32 │             0.614 │
│ row-major-disabled-multi-threaded  │ 133f384f-c09a-4e12-9e69-85b61272a360 │       2686 │ 15419169622 │ 59548322941 │ 2647601264 │ 9746405 │ 0.259 │           171.71 │             0.632 │
│ row-major-enabled-single-threaded  │ 20039e91-b5dc-4b4a-8e22-f4451f6067be │      30610 │  9046504977 │ 16964398066 │  995568814 │ 3235355 │ 0.533 │           110.05 │             0.358 │
│ row-major-disabled-single-threaded │ a3f9a822-dae6-45c1-9a4e-63e87a3592e8 │      56821 │ 15279361788 │ 27940919919 │ 1167368997 │ 4694553 │ 0.547 │             76.4 │             0.307 │
└────────────────────────────────────┴──────────────────────────────────────┴────────────┴─────────────┴─────────────┴────────────┴─────────┴───────┴──────────────────┴───────────────────┘

Changes:

  • A RowDataStore that takes a set of columns and lays them in row major format. It exposes access to individual rows and the layout of each row (offset and size of fields in a row).
  • A HashJoin/ConcurrentHashJoin post processing step that transforms the hash join payload to row-major once all conditions have been fulfilled.
    • In case of ConcurrentHashJoin the payload transformation phase is parallelized.
    • a new FinalizingRightJoinSideTransform drives the post build phase.
  • New IColumn interfaces that handle building the output columns from the row store, used manly by AddedColumns.cpp.
  • Two new settings to control the row-major transformation:
    • min_columns_for_hash_join_row_store: the minimum number of columns that must be suitable to the transformation to apply.
    • max_bytes_for_hash_join_row_store: The maximum size the row store is allowed to reach. This affects the duration of the transformation phase.

Limitations:

  1. Only supports fixed and contiguous types and nullable version of them.
    Supporting non fixed rows would make the row-major layout non-uniform between rows and output column construction loop would not be as tight anymore. Also transforming unbounded unfixed types to row major could be very heavy.
  2. Does not support ColumnReplicated.
    Column replicated would need to be materialized or have special handling where only the index in the row store.
  3. Slowdown in case of wrong join order.
    If the join side are swapped, meaning the build side is larger than the probe side, the gain from using the row store during probe is overshadowed by the overhead of the transformation. Unfortunately we can not determine this case in advance, for example via statistics (if we could we would have chosen the right join sides). The max_bytes_for_hash_join_row_store was introduced for this reason to keep the overhead bounded.
    Transforming around 128 MiB payload with a single thread:
Type Rows Columns avg FinalizingRightJoinSide (ms)
Int64 2M 8 28.2
Int64 1M 16 31.9
FixedString(16) 1M 8 24.6
FixedString(16) 2M 4 21.5

Benchmarks:

from 04054_hash_join_with_row_store.sql:

# Right table Match Join Before (s) After (s) Speedup
0 rs_right_100k 0.9 INNER 0.317 0.269 1.18×
1 rs_right_100k 0.1 INNER 0.158 0.155 1.02×
2 rs_right_10m 0.9 INNER 1.466 0.760 1.93×
3 rs_right_10m 0.1 INNER 0.825 0.499 1.65×
4 rs_right_100k_nullable 0.9 INNER 0.414 0.289 1.43×
5 rs_right_100k_nullable 0.1 INNER 0.173 0.159 1.09×
6 rs_right_10m_nullable 0.9 INNER 2.021 1.102 1.83×
7 rs_right_10m_nullable 0.1 INNER 0.713 0.618 1.15×
8 rs_right_10m 0.9 FULL 1.894 1.135 1.67×
9 rs_right_10m 0.1 FULL 1.359 0.916 1.48×
10 rs_right_1m_x10 0.9 INNER 2.929 2.638 1.11×
11 rs_right_1m_x10 0.1 INNER 0.553 0.516 1.07×
12 rs_right_1m_nullable_x10 0.9 INNER 4.622 2.809 1.65×
13 rs_right_1m_nullable_x10 0.1 INNER 0.790 0.580 1.36×
14 rs_right_mixed_10m 0.9 INNER 2.055 1.941 1.06×
15 rs_right_mixed_10m 0.1 INNER 0.973 0.950 1.02×
16 rs_right_wide_10m 0.9 INNER 6.079 2.211 2.75×
17 rs_right_wide_10m 0.1 INNER 1.451 1.080 1.34×

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Introduce an alternative layout for hash table payload in hash joins, where qualifying columns are transformed into one row-major block of data in order to speed up join output reconstruction. Controlled by two new settings: min_columns_for_hash_join_row_store (default 3, the minimum number of fixed-size payload columns that triggers the row-major layout; set to 0 to disable) and max_bytes_for_hash_join_row_store (default 128 MiB, the size budget for the row store per hash join).

@clickhouse-gh

clickhouse-gh Bot commented May 13, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [5500d5b]

Summary:

job_name test_name status info comment
AST fuzzer (amd_debug, targeted, old_compatibility) FAIL
Logical error: Block structure mismatch in A stream: different columns: (STID: 0993-2cc2) FAIL cidb, issue
Stress test (arm_release) FAIL
Hung check failed, possible deadlock found FAIL cidb, issue
AST fuzzer (amd_debug) FAIL
Logical error: Block structure mismatch in A stream: different columns: (STID: 0993-27f0) FAIL cidb, issue

AI Review

Summary

This PR adds a row-major RowDataStore payload path for HashJoin, with post-build conversion and settings to enable it by default for sufficiently wide join payloads. The current code addresses the earlier correctness concerns I checked around parallel_hash, post-build state transitions, GraceHashJoin, nullable payload decoding, and replicated columns, but the PR still has not closed the evidence gate for a default-on Performance Improvement.

Missing context / blind spots
  • ⚠️ I could not run the local performance-report helper because clickhouse was not available in PATH. A completed current Performance Comparison report for PR head 5500d5b326fb96a540f26c5f0d896c8ef72ef5c4 would close this gap.
  • ⚠️ The latest raw CI report for 5500d5b326fb96a540f26c5f0d896c8ef72ef5c4 still has the Performance Comparison shards pending, so there is no current before/after evidence for the final defaults.
Tests
  • ⚠️ The PR is marked Performance Improvement, but the only available cloud benchmark I found is the older 2026-06-05 comment at Transform JOIN hash table payload to row major #104884 (comment), before the later max_bytes_for_hash_join_row_store = 128_MiB default/cap changes. That report still showed a material tpch_adapted_1_official Q8 regression. Please provide a completed current performance comparison for the current head/defaults, or tune/disable the default guard until the default-on setting has evidence that it does not introduce material regressions.
Performance & Safety
  • ⚠️ max_bytes_for_hash_join_row_store is now enabled by default, so the missing current benchmark is not just a validation gap: it affects the rollout safety of a hot query execution path. The existing stateless tests cover important correctness modes, but they do not prove the default is a safe performance improvement.
Final Verdict

Status: ⚠️ Request changes

Minimum required action: provide completed current before/after performance evidence for the PR head with the final min_columns_for_hash_join_row_store and max_bytes_for_hash_join_row_store defaults, or adjust the defaults/guards so the feature is not enabled by default without that evidence.

@clickhouse-gh clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label May 13, 2026
Comment thread tests/queries/0_stateless/04054_hash_join_with_row_store.sql
Comment thread src/Core/Settings.cpp
@alexey-milovidov

Copy link
Copy Markdown
Member

This was fixed by #105146. Let's update the branch.

@harikrishnan94

Copy link
Copy Markdown
Contributor

@m-selmi
Can we combine null flags together (as bit masks) at the start of each row?
This will reduce the padding requirement drastically.
ie.
From this
[null_flag(1B) | uint32_val(4B) | float64_val(8B) | null_flag(1B) | uint64_val(8B)]
to
[nulls_bitmap(1B) | uint32_val(4B) | float64_val(8B) | uint64_val(8B)]
ofcourse aligned.

@m-selmi

m-selmi commented May 19, 2026

Copy link
Copy Markdown
Contributor Author

@m-selmi Can we combine null flags together (as bit masks) at the start of each row? This will reduce the padding requirement drastically. ie. From this [null_flag(1B) | uint32_val(4B) | float64_val(8B) | null_flag(1B) | uint64_val(8B)] to [nulls_bitmap(1B) | uint32_val(4B) | float64_val(8B) | uint64_val(8B)] ofcourse aligned.

@harikrishnan94 Thanks for taking a look. Right now it at least keeps the same memory footprint as separate columns, but that's a very good point, I can try it out.

@nickitat nickitat self-assigned this May 26, 2026
Comment thread tests/queries/0_stateless/04054_hash_join_with_row_store.sql
@m-selmi m-selmi marked this pull request as ready for review June 12, 2026 08:58
@m-selmi m-selmi requested a review from nickitat June 12, 2026 08:59
Comment thread src/Core/Settings.cpp
Comment thread src/Interpreters/ConcurrentHashJoin.cpp
Comment thread tests/queries/0_stateless/04054_hash_join_with_row_store.sql Outdated
Comment thread src/Core/Settings.cpp Outdated
Comment thread src/Core/Settings.cpp
DECLARE(UInt64, min_columns_for_hash_join_row_store, 3, R"(
Minimum number of payload columns to trigger transforming hash join payload to row-major. 0 disables the row-major transformation.
)", 0) \
DECLARE(UInt64, max_bytes_for_hash_join_row_store, 128_MiB, R"(

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This default-on cap needs current performance evidence. The only cloud benchmark comment I found is #104884 (comment) from 2026-06-05, before the later 128_MiB cap/default changes, and it reported tpch_adapted_1_official Q8 at +17.0% with the hint that the extra row-store transfer may not pay off for that join shape.

For a Performance Improvement, please rerun/provide measurements on current HEAD, or tune the default guard so join shapes like Q8 do not materially regress when max_bytes_for_hash_join_row_store is enabled by default.

@harikrishnan94 harikrishnan94 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we compare filling and gathering from a smaller batch of rows to a block row count for large right-side row widths to see if that improves both?

Comment thread src/Interpreters/RowDataStore.cpp
@clickhouse-gh

clickhouse-gh Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Dear @nickitat, you haven't been active on this PR for 30 days. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

@clickhouse-gh

clickhouse-gh Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.40% +0.00%
Functions 92.60% 92.60% +0.00%
Branches 77.60% 77.60% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 897/963 (93.15%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 9 line(s) · Uncovered code

Full report · Diff report

@m-selmi

m-selmi commented Jun 29, 2026

Copy link
Copy Markdown
Contributor Author

Could we compare filling and gathering from a smaller batch of rows to a block row count for large right-side row widths to see if that improves both?

@harikrishnan94 Great point, here are some results for this:

  • 10M build rows, 20M probe rows
  • build payload 640 bytes per row so a block of 65536 rows has 40 MiB
  • cache sizes: L1 32KiB / L2 1MiB / L3 32MiB
batch size FinalizingRightJoinSide median (ms) JoiningTransform median (ms) Query median (ms)
4 KB 164.58 399.71 1272
8 KB 135.68 337.88 1147
16 KB 121.92 317.02 1159
32 KB 142.82 329.69 1166
64 KB 119.04 322.40 1105
512 KB 106.44 286.51 988
1 MB 150.82 353.54 1201
4 MB 199.06 685.78 1598
8 MB 173.32 722.34 1625
16 MB 124.79 769.57 1495
32 MB 145.74 838.73 1643
no limit 154.32 808.56 1701

Seems like the output fill path has the best results while the batch still fits in L2 cache. The row store gather path is not affected as much because CH reduces the build input block size to 8192 to deal with the wide rows. When forcing the build input blocks to remain large manually there is a similar pattern where there is degradation once we leave L2 cache:

batch size FinalizingRightJoinSide median (ms) JoiningTransform median (ms) Query median (ms)
4 KB 108.61 362.65 948
8 KB 133.36 311.90 897
16 KB 143.74 274.48 872
32 KB 125.20 287.66 883
64 KB 119.24 278.45 839
512 KB 108.76 296.01 982
1 MB 96.73 289.81 892
4 MB 175.38 557.08 1210
8 MB 194.40 620.75 1287
16 MB 211.12 705.10 1460
32 MB 202.65 759.22 1550
no limit 245.07 767.90 1574

Based on this I added batching to L2 cache size / 4 to leave some headroom.

Copy link
Copy Markdown
Contributor

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-performance Pull request with some performance improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Evaluate using Row-Major Storage instead of RowRef/RowRefList in Hash Table

5 participants