m-selmi · 2026-05-13T17:49:22Z

Resolves #97883

Introduce an alternative layout for hash table payload in hash joins, where fixed and contiguous columns under a certain size are transformed into one row-major block of data.
Before the change:

    col1 Int32     col2 String    col3 UInt8     col4 Float64
   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌────────────┐
   │   ....   │   │   ....   │   │   ....   │   │    ....    │
   │   ....   │   │   ....   │   │   ....   │   │    ....    │
   └──────────┘   └──────────┘   └──────────┘   └────────────┘
         └──────────────┴──────────────┴───────────────┘
                row N → col1[N], col2[N], col3[N], col4[N]

After the change:

   ROW STORE (col1 + col3 + col4, contiguous per row)     col2 String (columnar)
   ┌──────────┬──────────┬────────────┐                   ┌──────────┐
   │   col1   │   col3   │    col4    │ ← row N (record)  │   ....   │ ← col2[N]
   ├──────────┼──────────┼────────────┤                   ├──────────┤
   │   col1   │   col3   │    col4    │                   │   ....   │
   └──────────┴──────────┴────────────┘                   └──────────┘

Rationale:

The row-major format allows the output construction to be a tight loop over each column, basically one pointer load and data copy:

for (size_t i = 0; i < n; ++i)
{
    const char * row_store_ptr = row_store_ptrs[i];
    col->insertData(row_store_ptr + value_offset, value_size);
}

In comparison, the columnar version has to do more work per value: load `columns[j]` and `row_numbers[j]`, access `replicated_columns`, call `insertFrom`, and so on.
For the query `SELECT * FROM rs_probe_200m l JOIN rs_right_10m r ON (l.k % toUInt64(10000000 / 0.9)) = r.k FORMAT Null` from `04054_hash_join_with_row_store.sql`, the perf counters for `LazyOutput::buildOutput` in `AddedColumns` look as follows:

┌─variant────────────────────────────┬─query_id─────────────────────────────┬─runtime_ms─┬───────instr─┬──────cycles─┬──────cmiss─┬──brmiss─┬───ipc─┬─cmiss_per_kinstr─┬─brmiss_per_kinstr─┐
│ row-major-enabled-multi-threaded   │ 9d2d6886-2ad7-430f-af2c-6c7cd32f3808 │       1621 │ 11393048228 │ 31974507197 │ 1470246364 │ 5892574 │ 0.356 │           129.05 │             0.517 │
│ row-major-enabled-multi-threaded   │ 56b33013-8bec-4d7a-a415-2a06beb14bf6 │       1633 │ 11547943445 │ 31874206968 │ 1484117664 │ 6265932 │ 0.362 │           128.52 │             0.543 │
│ row-major-enabled-multi-threaded   │ 65583fa7-9ce0-4426-95eb-977251d38f75 │       1708 │ 11338601601 │ 31825721703 │ 1468670383 │ 5921051 │ 0.356 │           129.53 │             0.522 │
│ row-major-disabled-multi-threaded  │ 5ff1b154-3c3f-40c6-bae9-f6cf0feafe81 │       2548 │ 15453833738 │ 58322156979 │ 2636461094 │ 9533242 │ 0.265 │            170.6 │             0.617 │
│ row-major-disabled-multi-threaded  │ 6e92c1d4-8432-4a79-a6cc-85e38232116d │       2580 │ 15400954730 │ 58683032344 │ 2623136775 │ 9455570 │ 0.262 │           170.32 │             0.614 │
│ row-major-disabled-multi-threaded  │ 133f384f-c09a-4e12-9e69-85b61272a360 │       2686 │ 15419169622 │ 59548322941 │ 2647601264 │ 9746405 │ 0.259 │           171.71 │             0.632 │
│ row-major-enabled-single-threaded  │ 20039e91-b5dc-4b4a-8e22-f4451f6067be │      30610 │  9046504977 │ 16964398066 │  995568814 │ 3235355 │ 0.533 │           110.05 │             0.358 │
│ row-major-disabled-single-threaded │ a3f9a822-dae6-45c1-9a4e-63e87a3592e8 │      56821 │ 15279361788 │ 27940919919 │ 1167368997 │ 4694553 │ 0.547 │             76.4 │             0.307 │
└────────────────────────────────────┴──────────────────────────────────────┴────────────┴─────────────┴─────────────┴────────────┴─────────┴───────┴──────────────────┴───────────────────┘
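The derived columns in the table above appear consistent with the usual definitions: `ipc` as instructions per cycle, and the `*_per_kinstr` columns as events per thousand instructions. A minimal sketch of that arithmetic follows; the struct and function names are hypothetical and only serve to make the calculation reproducible.

```cpp
#include <cassert>
#include <cstdint>

// Raw hardware counters for one query run (the instr, cycles, cmiss, brmiss columns).
struct PerfCounters
{
    uint64_t instructions;
    uint64_t cycles;
    uint64_t cache_misses;
    uint64_t branch_misses;
};

// Instructions retired per cycle.
inline double ipc(const PerfCounters & c)
{
    assert(c.cycles != 0);
    return static_cast<double>(c.instructions) / static_cast<double>(c.cycles);
}

// Events (cache misses, branch misses, ...) per thousand instructions.
inline double perKiloInstr(uint64_t events, const PerfCounters & c)
{
    assert(c.instructions != 0);
    return 1000.0 * static_cast<double>(events) / static_cast<double>(c.instructions);
}
```

Feeding in the first row of the table (11393048228 instructions, 31974507197 cycles, 1470246364 cache misses, 5892574 branch misses) reproduces the listed values after rounding: an IPC of about 0.356, about 129.05 cache misses and about 0.517 branch misses per thousand instructions.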

Changes:

A RowDataStore that takes a set of columns and lays them out in row-major format. It exposes access to individual rows and to the layout of each row (offset and size of each field).
A HashJoin/ConcurrentHashJoin post-processing step that transforms the hash join payload to row-major once all conditions have been fulfilled.
- In the case of ConcurrentHashJoin, the payload transformation phase is parallelized.
- A new FinalizingRightJoinSideTransform drives the post-build phase.
New IColumn interfaces that handle building the output columns from the row store, used mainly by AddedColumns.cpp.
Two new settings to control the row-major transformation:
- min_columns_for_hash_join_row_store: the minimum number of columns that must be suitable for the transformation for it to apply.
- max_bytes_for_hash_join_row_store: the maximum size the row store is allowed to reach. This bounds the duration of the transformation phase.
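To make the row store idea concrete, here is a minimal sketch of packing fixed-size columns into one row-major buffer and gathering a single field back out. It is not the PR's `RowDataStore`: the names `FixedColumn`, `FieldLayout`, `RowStoreSketch`, `makeColumn` and `gatherField` are hypothetical, and the layout is packed without padding as a simplifying assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for a fixed-size column: raw bytes plus the width of one value.
struct FixedColumn
{
    size_t value_size = 0;
    std::vector<char> data; // rows * value_size bytes
    size_t rows() const { return value_size == 0 ? 0 : data.size() / value_size; }
};

template <typename T>
FixedColumn makeColumn(const std::vector<T> & values)
{
    FixedColumn col;
    col.value_size = sizeof(T);
    col.data.resize(values.size() * sizeof(T));
    if (!values.empty())
        std::memcpy(col.data.data(), values.data(), col.data.size());
    return col;
}

// Offset and size of one field inside a row.
struct FieldLayout
{
    size_t offset;
    size_t size;
};

// Hypothetical row store: copies the columns into one contiguous buffer, row by row.
class RowStoreSketch
{
public:
    explicit RowStoreSketch(const std::vector<FixedColumn> & columns)
    {
        assert(!columns.empty());
        num_rows = columns.front().rows();
        for (const auto & col : columns)
        {
            assert(col.value_size > 0 && col.rows() == num_rows);
            layout.push_back({row_size, col.value_size});
            row_size += col.value_size; // packed, no padding (simplifying assumption)
        }
        buffer.resize(num_rows * row_size);
        for (size_t j = 0; j < columns.size(); ++j)
            for (size_t i = 0; i < num_rows; ++i)
                std::memcpy(buffer.data() + i * row_size + layout[j].offset,
                            columns[j].data.data() + i * layout[j].size,
                            layout[j].size);
    }

    const char * row(size_t i) const { assert(i < num_rows); return buffer.data() + i * row_size; }
    const FieldLayout & field(size_t j) const { assert(j < layout.size()); return layout[j]; }
    size_t rowSize() const { return row_size; }

private:
    size_t num_rows = 0;
    size_t row_size = 0;
    std::vector<FieldLayout> layout;
    std::vector<char> buffer;
};

// Rebuild one output column for the selected rows: one pointer load and one copy per row.
inline FixedColumn gatherField(const RowStoreSketch & store, size_t field_idx, const std::vector<size_t> & row_ids)
{
    const FieldLayout f = store.field(field_idx);
    FixedColumn out;
    out.value_size = f.size;
    out.data.resize(row_ids.size() * f.size);
    for (size_t i = 0; i < row_ids.size(); ++i)
        std::memcpy(out.data.data() + i * f.size, store.row(row_ids[i]) + f.offset, f.size);
    return out;
}
```

A caller would build the store once after the build phase, for example from an `int`, an `unsigned char` and a `double` column, and then call `gatherField` per output column with the matched row indices during probe. The gather loop has the same shape as the `insertData` loop in the rationale: the per-row work is a base pointer, a fixed offset and a fixed-size copy.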

Limitations:

Only fixed-size, contiguous types and their nullable versions are supported.
Supporting variable-size rows would make the row-major layout non-uniform between rows, and the output column construction loop would no longer be as tight. Transforming unbounded variable-size types to row-major could also be very expensive.
ColumnReplicated is not supported.
A ColumnReplicated would need to be materialized, or would need special handling where only the index is stored in the row store.
Slowdown in case of a wrong join order.
If the join sides are swapped, meaning the build side is larger than the probe side, the gain from using the row store during probe is overshadowed by the overhead of the transformation. Unfortunately, we cannot detect this case in advance, for example via statistics (if we could, we would have chosen the right join sides). The max_bytes_for_hash_join_row_store setting was introduced for this reason, to keep the overhead bounded.
Transforming around 128 MiB payload with a single thread:

Benchmarks:

from `04054_hash_join_with_row_store.sql`:

#	Right table	Match	Join	Before (s)	After (s)	Speedup
0	`rs_right_100k`	0.9	INNER	0.317	0.269	1.18×
1	`rs_right_100k`	0.1	INNER	0.158	0.155	1.02×
2	`rs_right_10m`	0.9	INNER	1.466	0.760	1.93×
3	`rs_right_10m`	0.1	INNER	0.825	0.499	1.65×
4	`rs_right_100k_nullable`	0.9	INNER	0.414	0.289	1.43×
5	`rs_right_100k_nullable`	0.1	INNER	0.173	0.159	1.09×
6	`rs_right_10m_nullable`	0.9	INNER	2.021	1.102	1.83×
7	`rs_right_10m_nullable`	0.1	INNER	0.713	0.618	1.15×
8	`rs_right_10m`	0.9	FULL	1.894	1.135	1.67×
9	`rs_right_10m`	0.1	FULL	1.359	0.916	1.48×
10	`rs_right_1m_x10`	0.9	INNER	2.929	2.638	1.11×
11	`rs_right_1m_x10`	0.1	INNER	0.553	0.516	1.07×
12	`rs_right_1m_nullable_x10`	0.9	INNER	4.622	2.809	1.65×
13	`rs_right_1m_nullable_x10`	0.1	INNER	0.790	0.580	1.36×
14	`rs_right_mixed_10m`	0.9	INNER	2.055	1.941	1.06×
15	`rs_right_mixed_10m`	0.1	INNER	0.973	0.950	1.02×
16	`rs_right_wide_10m`	0.9	INNER	6.079	2.211	2.75×
17	`rs_right_wide_10m`	0.1	INNER	1.451	1.080	1.34×

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Introduce an alternative layout for hash table payload in hash joins, where qualifying columns are transformed into one row-major block of data in order to speed up join output reconstruction. Controlled by two new settings: min_columns_for_hash_join_row_store (default 3, the minimum number of fixed-size payload columns that triggers the row-major layout; set to 0 to disable) and max_bytes_for_hash_join_row_store (default 128 MiB, the size budget for the row store per hash join).

clickhouse-gh · 2026-05-13T17:50:11Z

Workflow [PR], commit [5500d5b]

Summary: ❌

Performance Comparison: Performance dashboard

job_name	test_name	status	info
AST fuzzer (amd_debug, targeted, old_compatibility)		FAIL
	Logical error: Block structure mismatch in A stream: different columns: (STID: 0993-2cc2)	FAIL	cidb, issue
Stress test (arm_release)		FAIL
	Hung check failed, possible deadlock found	FAIL	cidb, issue
AST fuzzer (amd_debug)		FAIL
	Logical error: Block structure mismatch in A stream: different columns: (STID: 0993-27f0)	FAIL	cidb, issue

AI Review

Summary

This PR adds a row-major RowDataStore payload path for HashJoin, with post-build conversion and settings to enable it by default for sufficiently wide join payloads. The current code addresses the earlier correctness concerns I checked around parallel_hash, post-build state transitions, GraceHashJoin, nullable payload decoding, and replicated columns, but the PR still has not closed the evidence gate for a default-on Performance Improvement.

Missing context / blind spots

⚠️ I could not run the local performance-report helper because clickhouse was not available in PATH. A completed current Performance Comparison report for PR head 5500d5b326fb96a540f26c5f0d896c8ef72ef5c4 would close this gap.
⚠️ The latest raw CI report for 5500d5b326fb96a540f26c5f0d896c8ef72ef5c4 still has the Performance Comparison shards pending, so there is no current before/after evidence for the final defaults.

Tests

⚠️ The PR is marked Performance Improvement, but the only available cloud benchmark I found is the older 2026-06-05 comment at Transform JOIN hash table payload to row major #104884 (comment), before the later max_bytes_for_hash_join_row_store = 128_MiB default/cap changes. That report still showed a material tpch_adapted_1_official Q8 regression. Please provide a completed current performance comparison for the current head/defaults, or tune/disable the default guard until the default-on setting has evidence that it does not introduce material regressions.

Performance & Safety

⚠️ max_bytes_for_hash_join_row_store is now enabled by default, so the missing current benchmark is not just a validation gap: it affects the rollout safety of a hot query execution path. The existing stateless tests cover important correctness modes, but they do not prove the default is a safe performance improvement.

Final Verdict

Status: ⚠️ Request changes

Minimum required action: provide completed current before/after performance evidence for the PR head with the final min_columns_for_hash_join_row_store and max_bytes_for_hash_join_row_store defaults, or adjust the defaults/guards so the feature is not enabled by default without that evidence.

alexey-milovidov · 2026-05-17T18:42:32Z

This was fixed by #105146. Let's update the branch.

harikrishnan94 · 2026-05-19T11:43:31Z

@m-selmi Can we combine the null flags together (as bit masks) at the start of each row? This will reduce the padding requirement drastically, i.e. from `[null_flag(1B) | uint32_val(4B) | float64_val(8B) | null_flag(1B) | uint64_val(8B)]` to `[nulls_bitmap(1B) | uint32_val(4B) | float64_val(8B) | uint64_val(8B)]`, aligned of course.

m-selmi · 2026-05-19T13:38:34Z

@harikrishnan94 Thanks for taking a look. Right now it at least keeps the same memory footprint as separate columns, but that's a very good point, I can try it out.
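The padding argument can be checked with a small calculation. The sketch below assumes natural alignment (each field starts at a multiple of its own size, and the row is padded to a multiple of the widest field). The thread does not confirm which alignment rule the row store actually uses, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Size of one row when every field is placed at the next offset that is a
// multiple of its own size (sizes must be powers of two), and the whole row is
// padded to a multiple of the widest field.
inline size_t alignedRowSize(const std::vector<size_t> & field_sizes)
{
    size_t offset = 0;
    size_t max_align = 1;
    for (size_t size : field_sizes)
    {
        assert(size != 0 && (size & (size - 1)) == 0);
        offset = (offset + size - 1) / size * size; // round up to the field's alignment
        offset += size;
        max_align = std::max(max_align, size);
    }
    return (offset + max_align - 1) / max_align * max_align;
}

// Bytes needed for a bitmap holding one null bit per nullable field.
inline size_t nullBitmapBytes(size_t nullable_fields)
{
    return (nullable_fields + 7) / 8;
}
```

Under these assumptions, the interleaved layout `{1, 4, 8, 1, 8}` from the comment occupies 32 bytes per row, while the bitmap-first layout `{1, 4, 8, 8}` occupies 24 bytes. Packed without any padding, the same fields take 22 and 21 bytes respectively, so the saving comes almost entirely from avoided padding.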

clickhouse-gh · 2026-06-18T10:34:50Z

+    DECLARE(UInt64, min_columns_for_hash_join_row_store, 3, R"(
+Minimum number of payload columns to trigger transforming hash join payload to row-major. 0 disables the row-major transformation.
+)", 0) \
+    DECLARE(UInt64, max_bytes_for_hash_join_row_store, 128_MiB, R"(


This default-on cap needs current performance evidence. The only cloud benchmark comment I found is #104884 (comment) from 2026-06-05, before the later 128_MiB cap/default changes, and it reported tpch_adapted_1_official Q8 at +17.0% with the hint that the extra row-store transfer may not pay off for that join shape.

For a Performance Improvement, please rerun/provide measurements on current HEAD, or tune the default guard so join shapes like Q8 do not materially regress when max_bytes_for_hash_join_row_store is enabled by default.

harikrishnan94

Could we compare filling and gathering from a smaller batch of rows to a block row count for large right-side row widths to see if that improves both?

clickhouse-gh · 2026-06-25T14:00:11Z

Dear @nickitat, you haven't been active on this PR for 30 days. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

clickhouse-gh · 2026-06-28T00:24:36Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	85.40%	85.40%	+0.00%
Functions	92.60%	92.60%	+0.00%
Branches	77.60%	77.60%	+0.00%

Changed lines: Changed C/C++ lines covered by tests: 897/963 (93.15%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 9 line(s) · Uncovered code

Full report · Diff report

m-selmi · 2026-06-29T10:34:39Z

> Could we compare filling and gathering from a smaller batch of rows to a block row count for large right-side row widths to see if that improves both?

@harikrishnan94 Great point, here are some results for this:

10M build rows, 20M probe rows
build payload 640 bytes per row so a block of 65536 rows has 40 MiB
cache sizes: L1 32KiB / L2 1MiB / L3 32MiB

batch size	FinalizingRightJoinSide median (ms)	JoiningTransform median (ms)	Query median (ms)
4 KB	164.58	399.71	1272
8 KB	135.68	337.88	1147
16 KB	121.92	317.02	1159
32 KB	142.82	329.69	1166
64 KB	119.04	322.40	1105
512 KB	106.44	286.51	988
1 MB	150.82	353.54	1201
4 MB	199.06	685.78	1598
8 MB	173.32	722.34	1625
16 MB	124.79	769.57	1495
32 MB	145.74	838.73	1643
no limit	154.32	808.56	1701

It seems the output fill path performs best while the batch still fits in the L2 cache. The row store gather path is affected less because ClickHouse reduces the build input block size to 8192 to deal with the wide rows. When the build input blocks are manually forced to remain large, a similar pattern appears, with degradation once the batch no longer fits in the L2 cache:

batch size	FinalizingRightJoinSide median (ms)	JoiningTransform median (ms)	Query median (ms)
4 KB	108.61	362.65	948
8 KB	133.36	311.90	897
16 KB	143.74	274.48	872
32 KB	125.20	287.66	883
64 KB	119.24	278.45	839
512 KB	108.76	296.01	982
1 MB	96.73	289.81	892
4 MB	175.38	557.08	1210
8 MB	194.40	620.75	1287
16 MB	211.12	705.10	1460
32 MB	202.65	759.22	1550
no limit	245.07	767.90	1574

Based on this, I added batching with a batch size of L2 cache size / 4 to leave some headroom.
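As a rough illustration of the L2 / 4 rule, the sketch below converts a byte budget into a number of rows per batch. The function names are hypothetical, and expressing the batch as a row count is an assumption; the comment only states the byte budget.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Rows per batch so that one batch of row data occupies at most a quarter of
// the L2 cache, but never fewer than one row.
inline size_t rowsPerBatch(size_t l2_cache_bytes, size_t row_bytes)
{
    assert(row_bytes != 0);
    const size_t budget = l2_cache_bytes / 4;
    return std::max<size_t>(1, budget / row_bytes);
}

// Number of batches needed to cover a block (ceiling division).
inline size_t batchCount(size_t total_rows, size_t rows_per_batch)
{
    assert(rows_per_batch != 0);
    return (total_rows + rows_per_batch - 1) / rows_per_batch;
}
```

With the setup from the measurements (1 MiB L2, 640 bytes per row), the budget is 256 KiB, which gives 409 rows per batch, so a 65536-row block would be processed in 161 batches. This falls between the 64 KB and 512 KB rows of the tables, which are among the fastest configurations in both runs.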

harikrishnan94 · 2026-06-30T08:06:46Z

m-selmi added 11 commits April 2, 2026 15:17

transform hash join payload to row major v1.

b38b854

clean up code.

7c5fbfd

support concurrent hash join.

4ad56d2

support nullable columns.

76509fc

update tests.

2061676

Merge branch 'master' into transform-hash-table-payload-to-row-major

1938ef5

fix build.

2e8a229

remove unnecessary data copies in concurrent hash join.

39f8b64

add perf test with large build side.

b4d3142

add test for spilling.

fbf2b62

disable row store for replicated columns in the first block.

c194334

clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label May 13, 2026

m-selmi added 2 commits May 15, 2026 07:24

pin timezone in test.

5fd89d4

fix test.

9cb3de9

clickhouse-gh Bot reviewed May 15, 2026

Comment thread tests/queries/0_stateless/04054_hash_join_with_row_store.sql

clickhouse-gh Bot reviewed May 15, 2026

Comment thread src/Core/Settings.cpp

Merge branch 'master' into transform-hash-table-payload-to-row-major

fb36216

alexey-milovidov mentioned this pull request May 17, 2026

Stop the bleeding in function_prop_fuzzer #105146

Merged

1 task

Merge branch 'master' into transform-hash-table-payload-to-row-major

8318c8e

m-selmi added 5 commits May 20, 2026 07:37

Merge branch 'master' into transform-hash-table-payload-to-row-major

520d8ab

fix test.

76793de

remove reserve() from row store hot path.

a65dcfc

Merge branch 'master' into transform-hash-table-payload-to-row-major

4412b1b

move setting to 26.6

28f206c

nickitat self-assigned this May 26, 2026

Merge branch 'master' into transform-hash-table-payload-to-row-major

d47918a

m-selmi added 2 commits June 10, 2026 08:30

Cover duplicates and nullable in tests.

7956056

Remove column names from test.

db23b5d

clickhouse-gh Bot reviewed Jun 10, 2026

Comment thread tests/queries/0_stateless/04054_hash_join_with_row_store.sql

m-selmi added 3 commits June 11, 2026 11:15

Self review.

60ed6cc

Merge branch 'master' into transform-hash-table-payload-to-row-major

d395902

Pre-reserve row store exactly.

ba70547

m-selmi marked this pull request as ready for review June 12, 2026 08:58

m-selmi requested a review from nickitat June 12, 2026 08:59

Make max_bytes_for_hash_join_row_store considered per hash join insta…

5e2900b

…nce.

clickhouse-gh Bot reviewed Jun 15, 2026

Comment thread src/Core/Settings.cpp

Merge branch 'master' into transform-hash-table-payload-to-row-major

cb33f05

clickhouse-gh Bot reviewed Jun 18, 2026

Comment thread src/Interpreters/ConcurrentHashJoin.cpp

clickhouse-gh Bot reviewed Jun 18, 2026

Comment thread tests/queries/0_stateless/04054_hash_join_with_row_store.sql Outdated

remove support for joinGet.

003ab20

clickhouse-gh Bot reviewed Jun 18, 2026

Comment thread src/Core/Settings.cpp Outdated

Fix typo.

a794f85

clickhouse-gh Bot reviewed Jun 18, 2026

harikrishnan94 reviewed Jun 20, 2026

Comment thread src/Interpreters/RowDataStore.cpp

m-selmi added 3 commits June 24, 2026 14:36

Merge branch 'master' into transform-hash-table-payload-to-row-major

d1fe5cc

Fix merge conflicts.

53e0c67

Templatize gather on field size.

ead1be9

clickhouse-gh Bot unassigned nickitat Jun 25, 2026

m-selmi assigned nickitat Jun 25, 2026

m-selmi added 3 commits June 27, 2026 19:03

Add batching to L2 cache level.

1f9cba8

Clean up old tests and add wide data test.

fe45389

Merge branch 'master' into transform-hash-table-payload-to-row-major

5500d5b

Type	Rows	Columns	avg FinalizingRightJoinSide (ms)
Int64	2M	8	28.2
Int64	1M	16	31.9
FixedString(16)	1M	8	24.6
FixedString(16)	2M	4	21.5
