Transform JOIN hash table payload to row major#104884
Conversation
|
Workflow [PR], commit [5500d5b] Summary: ❌
AI ReviewSummaryThis PR adds a row-major Missing context / blind spots
Tests
Performance & Safety
Final VerdictStatus: Minimum required action: provide completed current before/after performance evidence for the PR head with the final |
|
This was fixed by #105146. Let's update the branch. |
|
@m-selmi |
@harikrishnan94 Thanks for taking a look. Right now it at least keeps the same memory footprint as separate columns, but that's a very good point, I can try it out. |
| DECLARE(UInt64, min_columns_for_hash_join_row_store, 3, R"( | ||
| Minimum number of payload columns to trigger transforming hash join payload to row-major. 0 disables the row-major transformation. | ||
| )", 0) \ | ||
| DECLARE(UInt64, max_bytes_for_hash_join_row_store, 128_MiB, R"( |
There was a problem hiding this comment.
This default-on cap needs current performance evidence. The only cloud benchmark comment I found is #104884 (comment) from 2026-06-05, before the later 128_MiB cap/default changes, and it reported tpch_adapted_1_official Q8 at +17.0% with the hint that the extra row-store transfer may not pay off for that join shape.
For a Performance Improvement, please rerun/provide measurements on current HEAD, or tune the default guard so join shapes like Q8 do not materially regress when max_bytes_for_hash_join_row_store is enabled by default.
harikrishnan94
left a comment
There was a problem hiding this comment.
Could we compare filling and gathering from a smaller batch of rows to a block row count for large right-side row widths to see if that improves both?
|
Dear @nickitat, you haven't been active on this PR for 30 days. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself. |
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered by tests: 897/963 (93.15%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 9 line(s) · Uncovered code |
@harikrishnan94 Great point, here are some results for this:
Seems like the output fill path has the best results while the batch still fits in L2 cache. The row store gather path is not affected as much because CH reduces the build input block size to 8192 to deal with the wide rows. When forcing the build input blocks to remain large manually there is a similar pattern where there is degradation once we leave L2 cache:
Based on this I added batching to |

Resolves #97883
Introduce an alternative layout for hash table payload in hash joins, where fixed and contiguous columns under a certain size are transformed into one row-major block of data.
Before the change:
After the change:
Rational:
The row-major format allows the output construction to be a tight loop over each column, basically one pointer load and data copy:
In comparison the columnar version has to do more work: load columns[j], row_numbers[j], replicated_columns access, insertFrom ...
For the following query
SELECT * FROM rs_probe_200m l JOIN rs_right_10m r ON (l.k % toUInt64(10000000 / 0.9)) = r.k FORMAT Nullfrom04054_hash_join_with_row_store.sqlthe perf counters forLazyOutput::buildOutputinAddedColumnslook as follows:Changes:
RowDataStorethat takes a set of columns and lays them in row major format. It exposes access to individual rows and the layout of each row (offset and size of fields in a row).HashJoin/ConcurrentHashJoinpost processing step that transforms the hash join payload to row-major once all conditions have been fulfilled.ConcurrentHashJointhe payload transformation phase is parallelized.FinalizingRightJoinSideTransformdrives the post build phase.IColumninterfaces that handle building the output columns from the row store, used manly byAddedColumns.cpp.Limitations:
Supporting non fixed rows would make the row-major layout non-uniform between rows and output column construction loop would not be as tight anymore. Also transforming unbounded unfixed types to row major could be very heavy.
Column replicated would need to be materialized or have special handling where only the index in the row store.
If the join side are swapped, meaning the build side is larger than the probe side, the gain from using the row store during probe is overshadowed by the overhead of the transformation. Unfortunately we can not determine this case in advance, for example via statistics (if we could we would have chosen the right join sides). The
max_bytes_for_hash_join_row_storewas introduced for this reason to keep the overhead bounded.Transforming around 128 MiB payload with a single thread:
Benchmarks:
from
04054_hash_join_with_row_store.sql:rs_right_100krs_right_100krs_right_10mrs_right_10mrs_right_100k_nullablers_right_100k_nullablers_right_10m_nullablers_right_10m_nullablers_right_10mrs_right_10mrs_right_1m_x10rs_right_1m_x10rs_right_1m_nullable_x10rs_right_1m_nullable_x10rs_right_mixed_10mrs_right_mixed_10mrs_right_wide_10mrs_right_wide_10mChangelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Introduce an alternative layout for hash table payload in hash joins, where qualifying columns are transformed into one row-major block of data in order to speed up join output reconstruction. Controlled by two new settings:
min_columns_for_hash_join_row_store(default 3, the minimum number of fixed-size payload columns that triggers the row-major layout; set to 0 to disable) andmax_bytes_for_hash_join_row_store(default 128 MiB, the size budget for the row store per hash join).