Fix row drop in parallel reads of `Map` primary key with bucketed serialization by groeneai · Pull Request #104540 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix row drop in parallel reads of Map primary key with bucketed serialization#104540

Closed
groeneai wants to merge 7 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-bucketed-map-parallel-read-row-drop
Closed

Fix row drop in parallel reads of Map primary key with bucketed serialization#104540
groeneai wants to merge 7 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-bucketed-map-parallel-read-row-drop

Conversation

@groeneai

@groeneai groeneai commented May 10, 2026

Copy link
Copy Markdown
Contributor

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Fixed missing rows when reading a MergeTree table whose primary key is a Map, contains a Map, or is derived from a Map (mapKeys(m), m.keys, etc.) and whose parts use bucketed Map serialization.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Context

This bug was discovered while investigating flaky 01550_create_map_type (#104515 was an earlier setting-pin attempt that was rejected by @Algunenano and @alexey-milovidov in favor of a real engine fix).

Minimal repro

CREATE TABLE tm (a Map(String, Array(UInt8))) ENGINE = MergeTree() ORDER BY a SETTINGS
    min_bytes_for_wide_part = 0,
    map_serialization_version_for_zero_level_parts = 'with_buckets',
    max_buckets_in_map = 11,
    map_buckets_strategy = 'constant',
    map_buckets_min_avg_size = 2;
INSERT INTO tm VALUES (map('k1', [1,2,3], 'k2', [4,5,6])), (map('k0', [], 'k1', [100,20,90]));
INSERT INTO tm SELECT map('k1', [number, number + 2, number * 2]) FROM numbers(6);
INSERT INTO tm SELECT map('k2', [number, number + 2, number * 2]) FROM numbers(6);

SELECT length(groupArray(a)) FROM tm    -- returns 14 (correct)
SETTINGS merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability = 0, max_threads = 4;

SELECT length(groupArray(a)) FROM tm    -- returns 13 (one row dropped) before fix
SETTINGS merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability = 1, max_threads = 4;

Root cause

For a 2-row part of Map(String, Array(UInt8)) with bucketed serialization (11 buckets), one row has the keys {k1, k2}:

  • Index serialization (SerializationMap::serializeBinary -> basic nested Array(Tuple(K, V))) writes the row as [("k1", [1,2,3]), ("k2", [4,5,6])] -- insertion order.
  • Bulk serialization (SerializationMap::serializeBinaryBulkWithMultipleStreams) splits keys into per-bucket streams; on read, collectMapFromBuckets reassembles them in ascending bucket-index order. With hash("k2") % 11 < hash("k1") % 11, the reassembled row is [("k2", [4,5,6]), ("k1", [1,2,3])] -- bucket order.

ColumnMap::compareAt is positional on the nested Array(Tuple(K, V)), so the two representations compare as different values. splitPartsWithRangesByPrimaryKey computes layer/border PK values from the index but the downstream FilterSortedStreamByRange evaluates the filter against actual data rows, so the row that should fall inside a layer's PK range is dropped.

A simple bonus repro of the same underlying behavior (separate from the splitter, but the same root cause):

SELECT id, a FROM (
    SELECT 1 AS id, map('k1', 1, 'k2', 2, 'k3', 3) AS a
) ORDER BY id;
-- in-memory result: {"k1":1,"k2":2,"k3":3}

-- but after MergeTree round-trip with map_serialization_version_for_zero_level_parts = 'with_buckets':
SELECT id, a FROM tm_buckets;
-- result: {"k3":3,"k2":2,"k1":1}

Fix

Two layers in PartsSplitter::isSafePrimaryDataKeyType / isSafePrimaryKey:

  1. Mark Map as not a safe primary-key type. This also covers Tuple(Map, ...) and similar nested cases via the existing recursive walk.
  2. Walk the PK expression's source storage columns and mark the PK unsafe when any source column is itself a Map (or contains a Map). This covers PKs like ORDER BY mapKeys(m) and ORDER BY m.keys, where the resolved expression type is Array(K) -- safe by itself -- but the underlying values still come from a with_buckets Map and so suffer the same index-vs-data mismatch. (Layer 2 was added in response to a regression confirmed by @tiandiwonder on a 4-part bucketed-Map table: ORDER BY mapKeys(m) and ORDER BY m.keys both went 20 -> 19 under injection=1.)

splitPartsWithRangesByPrimaryKey already has a fast path for unsafe primary keys that builds a single in-order merging pipe from all parts without doing the layer split or the intersecting / non-intersecting separation, both of which depend on the index-vs-data comparison being consistent.

This is a targeted fix: only Map-keyed and Map-derived reads lose the splitter optimization, and only when at least one source storage column is a Map (which is uncommon). Non-Map PKs and Map columns used outside the primary key are unaffected.

What I did NOT fix here

The deeper issue -- that bucketed Map serialization is non-idempotent under positional ColumnMap::compareAt semantics, which also breaks e.g. WHERE a = map('k1', 1, 'k2', 2, 'k3', 3) on a bucketed table -- is left for a separate change. It needs maintainer input on whether the right answer is canonical Map storage on write, an order-insensitive Map comparison, or extending the bucketed format to preserve original per-row key positions. This PR only fixes the splitter-level symptom; the broader Map semantics question is unchanged.

Verification

  • 50/50 iterations of regression test 04217_split_intersecting_bucketed_map_row_drop pass with full randomization on debug build (excluding environment-only failures from missing parallel_replicas cluster and timezone DB).
  • 30/30 iterations of 01550_create_map_type pass with full randomization.
  • Confirmed both directions with git stash / restore: without this commit the regression test fails with groupArray=13 / count=13 on the original Map PK shapes, and with groupArray=19 / count=20 on the new mapKeys(m) and m.keys 4-part shapes; with this commit all rows are returned.
  • Non-Map PK split path still exercised: merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability=1 on a 200-row table with id UInt32 PK still returns 200 rows.
  • Bucketed Map as a non-PK column still works (read returns full row count).

…ialization

When a `Map` column is the primary key and the part uses
`map_serialization_version_for_zero_level_parts = 'with_buckets'`, the
primary key index and the data files end up storing the same logical row
with different internal key orders:

* The primary key index is written through `SerializationMap::serializeBinary`,
  which delegates to the basic nested `Array(Tuple(K, V))` serialization
  and preserves the in-memory positional order of keys.
* The data files are written by `SerializationMap::serializeBinaryBulkWithMultipleStreams`,
  which hashes each key into a bucket and serializes each bucket
  independently. The read path (`deserializeBinaryBulkWithMultipleStreams`
  + `collectMapFromBuckets`) reassembles each row's keys in ascending
  bucket-index order, not insertion order.

Under positional `ColumnMap::compareAt` these two representations are
different values. `splitPartsWithRangesByPrimaryKey` uses the index for
boundary computation but the splitter's downstream `FilterSortedStreamByRange`
sees the bucket-order data, so rows that fall inside a layer's PK range
under one representation can be dropped under the other.

This is reproducible today via
`merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability`
(testing-only setting), and is also reachable from `FINAL` with
`split_parts_ranges_into_intersecting_and_non_intersecting_final = 0`.

Mark `Map` as not a safe primary-key type so the splitter falls into the
single in-order merging-pipe path, which doesn't depend on this
comparison. The fast path for non-`Map` primary keys is unchanged.

Includes a stateless regression test covering both the
`MergeTree`/non-`FINAL` injection path and the `ReplacingMergeTree` /
`FINAL` path.

Reported via flaky `01550_create_map_type` and confirmed during PR
review of ClickHouse#104515 by @Algunenano and @alexey-milovidov.
@groeneai

Copy link
Copy Markdown
Contributor Author

@groeneai

Copy link
Copy Markdown
Contributor Author

cc @Algunenano @alexey-milovidov @CurtizJ @Avogar — engine fix for the parallel-reads-of-bucketed-Map-drop-rows issue we discussed during #104515.

The fix is in PartsSplitter::isSafePrimaryDataKeyTypeMap now joins Float* / Nullable / Object / Variant / Dynamic as a type for which the splitter falls back to its single in-order merging-pipe path. Root cause: bucketed Map serialization writes per-row keys in ascending bucket-index order, but SerializationMap::serializeBinary (used for the primary key index) preserves in-memory insertion order, so under positional ColumnMap::compareAt the index value and the actual data row compare as different values and FilterSortedStreamByRange drops rows that should fall inside a layer's PK range.

The deeper issue — bucketed Map round-trip is non-idempotent under positional compareAt (also breaks WHERE a = map_literal even on a single bucketed part) — is broader than the splitter and depends on a Map-storage design call, so I deliberately did not change Map semantics here. Happy to follow up if you have a direction in mind: canonical Map storage on write, order-insensitive Map comparison, or extending the bucketed format to preserve original per-row key positions.

Regression test included; both directions verified via stash-revert; 50/50 iterations clean with full randomization. PR description has the full root-cause walkthrough.

@tiandiwonder tiandiwonder self-assigned this May 11, 2026
@clickhouse-gh

clickhouse-gh Bot commented May 11, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [ea2cc9c]

Summary:

job_name test_name status info comment
Fast test (arm_darwin) FAIL
01951_distributed_push_down_limit FAIL cidb
01952_optimize_distributed_group_by_sharding_key FAIL cidb
Unit tests (amd_llvm_coverage) FAIL
-FunctionsStress FAIL cidb, issue
LLVM Coverage DROPPED

AI Review

Summary

This PR fixes row drops when bucketed Map data participates in splitter primary-key boundaries by marking Map, nested Map, and Map-derived primary-key expressions as unsafe for layer splitting. The current diff also covers the symmetric FINAL, lazy final, and query_plan_join_shard_by_pk_ranges paths, with focused stateless regressions for direct, nested, subcolumn/function-derived, and join cases. I found no remaining correctness issue in the current code; the earlier bot findings for Map-derived keys and optimizeJoinByShards are addressed.

Final Verdict

Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label May 11, 2026
@tiandiwonder

Copy link
Copy Markdown
Contributor

What I did NOT fix here
The deeper issue — that bucketed Map serialization is non-idempotent under positional ColumnMap::compareAt semantics, which also breaks e.g. WHERE a = map('k1', 1, 'k2', 2, 'k3', 3) on a bucketed table — is left for a separate change. It needs maintainer input on whether the right answer is canonical Map storage on write, an order-insensitive Map comparison, or extending the bucketed format to preserve original per-row key positions. This PR only fixes the splitter-level symptom; the broader Map semantics question is unchanged.

@groeneai @alexey-milovidov it seems a more serious issue here.

@@ -0,0 +1,70 @@
-- Regression test for parallel reads of `Map` primary key with `with_buckets` serialization.
--

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test doesn't cover nested PK types — e.g. ORDER BY (id, m) where m is a Map, or ORDER BY tuple(m, ...). The code logic (recursive isSafePrimaryDataKeyType) does handle these cases, but they aren't explicitly exercised by the test.

@Avogar

Avogar commented May 11, 2026

Copy link
Copy Markdown
Member

which also breaks e.g. WHERE a = map('k1', 1, 'k2', 2, 'k3', 3) on a bucketed table

It's a general issue with the Map type, not related to bucketed serialization. WHERE a = map('k1', 1, 'k2', 2) and WHERE a = map('k2', 2, 'k1', 1) will return different results, because we store keys in insertion order.

We can try to rewrite the code of Map comparison so it doesn't depend on keys order, but this comparison will be slower.

Per PR ClickHouse#104540 review feedback from @tiandiwonder: extend the regression
test to also exercise the recursive `isSafePrimaryDataKeyType` check.

Two new sections added, both using the same 14-row, 3-part, multi-bucket
Map shape as the existing top-level-Map cases:

1. `ORDER BY (id, m)` on `MergeTree` and `ReplacingMergeTree` — composite
   primary key with `Map` as one column. All rows use the same `id` so
   parts intersect on `id` and the splitter has to compare `m` values at
   layer boundaries. Exercises the loop in `isSafePrimaryKey` that iterates
   over every PK column.

2. `ORDER BY c` with `c Tuple(Map(...), UInt32)` on `MergeTree` and
   `ReplacingMergeTree` — primary key whose type is itself a `Tuple`
   containing a `Map`. Exercises the recursive `Tuple` branch of
   `isSafePrimaryDataKeyType`, which must descend into the tuple's
   elements and mark the whole tuple as unsafe when any element is `Map`.

For both shapes, both the injection (`merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability = 1`)
and the `FINAL` (`split_parts_ranges_into_intersecting_and_non_intersecting_final = 0`)
paths are covered.

Verification matrix on local debug build:

  Without fix (`case TypeIndex::Map: return false;` removed from
  `PartsSplitter::isSafePrimaryDataKeyType`):
    - composite-inj groupArray:        13 (FAIL, one row dropped)
    - composite-final-split-0 count:   13 (FAIL)
    - composite-final-split-0 groupArray: 13 (FAIL)
    - tuple-inj groupArray:            13 (FAIL)
    - tuple-final-split-0 count:       13 (FAIL)
    - tuple-final-split-0 groupArray:  13 (FAIL)

  With fix restored:
    - all four new sections return 14 across `count` and
      `length(groupArray(...))`.

`tests/clickhouse-test 04217_split_intersecting_bucketed_map_row_drop
--test-runs 20` is 20/20 green with the current build.
@groeneai

Copy link
Copy Markdown
Contributor Author

Pushed 74741723f9f extending test 04217_split_intersecting_bucketed_map_row_drop.sql with nested PK coverage.

Two new sections (same 14-row / 3-part / multi-bucket shape as the existing cases):

  1. Composite PK ORDER BY (id, m) with m Map(String, Array(UInt8)) — both MergeTree (injection path) and ReplacingMergeTree (FINAL path). All rows use the same id so parts intersect on id, which forces the splitter to compare m values at layer boundaries. Exercises the loop in isSafePrimaryKey that scans every PK column.
  2. Tuple-typed PK ORDER BY c with c Tuple(Map(...), UInt32) — both MergeTree (injection path) and ReplacingMergeTree (FINAL path). Exercises the recursive Tuple branch of isSafePrimaryDataKeyType, which must descend into the tuple's elements and mark the whole tuple as unsafe when any element is a Map.

Verification matrix on local debug build:

Section Without fix (case TypeIndex::Map: return false; removed) With fix
composite-inj groupArray 13 (FAIL) 14
composite-final-split-0 count 13 (FAIL) 14
composite-final-split-0 groupArray 13 (FAIL) 14
tuple-inj groupArray 13 (FAIL) 14
tuple-final-split-0 count 13 (FAIL) 14
tuple-final-split-0 groupArray 13 (FAIL) 14

tests/clickhouse-test 04217_split_intersecting_bucketed_map_row_drop --test-runs 20 is 20/20 green with the current build.


@Avogar — thanks for the confirmation that Map comparison being positional is a general type-level issue, not specific to the bucketed serialization. That's exactly the framing I was hoping for: the splitter fix here is the right scope (don't change Map semantics, just teach the splitter not to assume safe positional comparison), and the broader Map-comparison redesign (order-insensitive compareAt with the perf cost vs canonicalising on write vs extending the bucketed format to preserve per-row key positions) is a separate decision for the Map-type owners.

@tiandiwonder — the recursive type-check coverage you asked for is in 74741723f9f. Let me know if you'd like the cases shaped differently or if there's another nested-PK shape worth adding (e.g. Array(Map(...)) — happy to extend if you want it).

cc @alexey-milovidov @Algunenano @CurtizJ

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label May 17, 2026

@tiandiwonder tiandiwonder left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changlog is too long, suggest:
Fixed missing rows when reading a MergeTree table whose primary key is Map (or contains Map) and whose parts use bucketed Map serialization (map_serialization_version_for_zero_level_parts = 'with_buckets'). The problem affected parallel reads (visible via merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability) and FINAL queries with split_parts_ranges_into_intersecting_and_non_intersecting_final = 0

/// so `FilterSortedStreamByRange` drops rows that would otherwise fall inside a layer's
/// PK range. Falling back to the single in-order merging pipe (the `!isSafePrimaryKey`
/// branch in `splitPartsWithRangesByPrimaryKey`) bypasses the layer split and the
/// intersecting/non-intersecting separation, both of which depend on this comparison.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: the comment explains the with_buckets reorder very clearly, but doesn't say why we also disable the splitter for parts that use the default basic Map serialization (where index and data agree on insertion order). and also the comment is too long:
```
/// Map is unsafe as a splitter primary key: the PK index uses insertion-order
/// `serializeBinary` while `with_buckets` parts write data in bucket-index order,
/// and positional `ColumnMap::compareAt` then disagrees on equality, dropping rows.
/// Disabled unconditionally because the per-part serialization version is not
/// visible here; Map PKs are rare, so the cost is acceptable

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed a38a518e5062 shrinking the comment to 5 lines: states the with_buckets reorder mechanism and the unconditional-disable rationale (per-part serialization version is not visible here; Map PKs are rare). Good catch on the basic-path coverage gap.

Per @tiandiwonder review on PR ClickHouse#104540: the original 13-line comment was
verbose and did not state why we also disable the splitter for parts
that use `basic` (default) Map serialization, where index and data
agree on insertion order. Shortened to 5 lines covering: the
`with_buckets` reorder mechanism, and the unconditional disable
rationale (per-part serialization version is not visible here; `Map`
PKs are rare).
@groeneai

Copy link
Copy Markdown
Contributor Author

@tiandiwonder good call on both reviews. Two updates pushed at a38a518e5062:

  1. Changelog entry shortened — adopted your suggested wording (with backticks for the literal SQL identifiers) in the PR description above. Reads: "Fixed missing rows when reading a MergeTree table whose primary key is Map (or contains Map) and whose parts use bucketed Map serialization ..."
  2. Inline PartsSplitter.cpp comment shrunk to 5 lines — covers the with_buckets reorder mechanism and the unconditional-disable rationale (per-part serialization version is not visible at this call site; Map PKs are rare). Replied on the inline thread.

Verified clean build locally; only the comment text changed. CI should be unaffected.

Comment thread src/Processors/QueryPlan/PartsSplitter.cpp
@tiandiwonder

Copy link
Copy Markdown
Contributor

@groeneai please make change log entry short?

/// and positional `ColumnMap::compareAt` then disagrees on equality, dropping rows.
/// Disabled unconditionally because the per-part serialization version is not
/// visible here; `Map` PKs are rare, so the cost is acceptable.
case TypeIndex::Map:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

empirically confirmed on a 4-part bucketed-Map shape: ORDER BY mapKeys(m) and ORDER BY m.keys both go 20→19 under injection=1. heads-up on the fix — a check on getRequiredColumnsWithTypes for Map won't catch them: optimize_functions_to_subcolumns (default-on at Core/Settings.cpp:4664) rewrites mapKeys(m) to m.keys (FunctionsMapMiscellaneous.cpp:515), and ORDER BY m.keys directly reads the Array(K) sub-column with no Map in the resolved expression. has to look at the original PK AST or sub-column origin.

The previous commit only checked the resolved PK expression type, so PKs that
resolve to a non-Map type but read from a Map storage column under the hood
still hit the same with_buckets index-vs-data mismatch. Cover the remaining
shapes by walking the PK expression's source columns.

Concretely: ORDER BY mapKeys(m) resolves to Array(K) and ORDER BY m.keys reads
the Map sub-column, which also resolves to Array(K). isSafePrimaryDataKeyType
on the resolved type accepts both as safe Arrays, so splitPartsWithRangesByPrimaryKey
still ran the layer split and dropped one row from a 4-part bucketed-Map table
(20 -> 19) when merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability=1.

Add a new isSafePrimaryKey overload that takes ColumnsDescription and walks
primary_key.expression->getRequiredColumnsWithTypes(): for each source column
name, look it up via tryGetColumnOrSubcolumn (which resolves m.keys to a Map
sub-column with getTypeInStorage()=Map(K, V)). If any source storage column
is or contains a Map, the PK is unsafe and the splitter falls back to the
single-pipe path.

The existing single-arg isSafePrimaryKey is kept as a conservative fallback
for callers without storage-column metadata. Both PartsSplitter callers
(splitPartsWithRangesByPrimaryKey, findPKRangesForFinalAfterSkipIndex) and
optimizeLazyFinal.cpp now thread metadata_snapshot->getColumns() through.

Empirical confirmation by tiandiwonder on PR ClickHouse#104540 review (2026-06-05):
ORDER BY mapKeys(m) and ORDER BY m.keys both go 20 -> 19 under injection=1.
The naive getRequiredColumnsWithTypes-only check does not work because
optimize_functions_to_subcolumns rewrites mapKeys(m) -> m.keys, which is a
direct Array(K) sub-column read with no Map in the resolved expression.
The sub-column origin lookup against ColumnsDescription handles both shapes.

Test 04217 extended with 4-part shapes for ORDER BY mapKeys(m), ORDER BY m.keys,
and ReplacingMergeTree FINAL with split_parts_ranges_into_intersecting_and_non_intersecting_final=0.

Closes: ClickHouse#104540

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

groeneai commented Jun 5, 2026

Copy link
Copy Markdown
Contributor Author

@tiandiwonder thanks for the empirical confirmation -- reproduced both shapes locally on a 4-part bucketed-Map table (20 -> 19 under injection=1 for ORDER BY mapKeys(m) and ORDER BY m.keys).

You were right that a getRequiredColumnsWithTypes check on the resolved expression types alone does not work: optimize_functions_to_subcolumns rewrites mapKeys(m) -> m.keys, which is a direct Array(K) sub-column read with no Map in the expression. The fix uses sub-column origin lookup against ColumnsDescription instead: tryGetColumnOrSubcolumn resolves m.keys to a sub-column whose getTypeInStorage() is Map(K, V). If any source column of the PK expression is or contains a Map in storage, the PK is marked unsafe.

Pushed fix-up commit 83bcbacc2fd2 -- splitter now rejects mapKeys(m), m.keys, and any other PK derived from a Map storage column. Test 04217 extended with the two new shapes plus a ReplacingMergeTree FINAL variant. Verified 20/20 with the fix; 20/19 without it.

Also shortened the changelog entry to a single sentence per your earlier ask.

Pre-PR validation gate

# Question Answer
a Deterministic repro? Yes. 4-part bucketed-Map table, ORDER BY mapKeys(m) and ORDER BY m.keys, merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability=1, max_threads=4: returns 20 rows (count()) but length(groupArray(m))=19 deterministically before the fix.
b Root cause explained? Yes. The PK index stores mapKeys(m) / m.keys evaluated against the in-memory insertion-ordered Map. The data on disk uses with_buckets, which reorders per-row Map keys by ascending bucket index. The same expression evaluated against the disk-order Map yields a different Array(K) value than the index. splitPartsWithRangesByPrimaryKey builds layer/border PK values from the index but FilterSortedStreamByRange evaluates them against actual data rows, so the row whose PK falls inside a layer in the index does not in the data. The previous commit only marked the resolved expression type unsafe, but the resolved type here is Array(K), which passes isSafePrimaryDataKeyType.
c Fix matches root cause? Yes. Adding a source-column check against the storage column metadata (tryGetColumnOrSubcolumn) marks the PK unsafe when ANY source column is or contains a Map, regardless of the resolved expression type. The splitter's existing fast path for unsafe primary keys then takes over.
d Test intent preserved? / New tests added? New tests: 04217 has three new sections covering ORDER BY mapKeys(m), ORDER BY m.keys, and ReplacingMergeTree FINAL on mapKeys(m). All three exercise the with_buckets + multi-bucket Map + min_bytes_for_wide_part=0 + multiple parts shape that fails before the fix. Existing test sections are unchanged.
e Both directions demonstrated? Yes. Without commit 83bcbacc2fd2: mapKeys(m) and m.keys both return count()=20, length(groupArray(m))=19 under injection=1 (logged in tmp/repro-4part.log from the pre-fix run on the same source tree). With the commit: both return 20 / 20. The pre-existing 04217 sections (the original Map PK and Tuple(Map,...) shapes) also continue to pass 5/5 under no-random and 20/20 under randomization (excluding two environment-only failures unrelated to this PR: missing parallel_replicas cluster on the local server and Invalid time zone: zoneinfo/UTC from a missing TZ DB entry; both reproduce with the unchanged base test as well).
f Fix is general, not a narrow patch? Yes. The check uses a recursive typeContainsMap walk over the storage column type via forEachChild, so it catches arbitrarily-nested Map (e.g. Tuple(UInt8, Map(...)), Array(Map(...)), LowCardinality(Map(...))). All three known callers of isSafePrimaryKey (splitPartsWithRangesByPrimaryKey, findPKRangesForFinalAfterSkipIndex, optimizeLazyFinal) now pass ColumnsDescription so the new check fires uniformly. The single-arg isSafePrimaryKey overload is kept as a conservative fallback for any callers without storage metadata; behaviour for Map-typed PKs remains unchanged via the existing resolved-type check.

Session id: cron:clickhouse-ci-task-worker:20260605-112400

@clickhouse-gh

clickhouse-gh Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.40% 84.40% +0.00%
Functions 92.30% 92.30% +0.00%
Branches 77.00% 77.00% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 53/58 (91.38%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

@@ -0,0 +1,249 @@
-- Regression test for parallel reads of `Map` primary key with `with_buckets` serialization.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional: add -- Tags: no-object-storage at the top — this test times out >180s on the asan+s3+keeper sync lane (5/33 reruns).

the with_buckets map serialization writes many small per-bucket files per part, so on object storage each is an S3 round-trip under ASan; it runs in ~0.5s on local storage and the splitter bug is storage-agnostic, so the tag loses no coverage.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done in 0743540 @tiandiwonder. Added -- Tags: no-object-storage with your S3-round-trip rationale; the splitter bug is storage-agnostic so local-only coverage is unchanged.

The with_buckets Map serialization writes many small per-bucket files per
part, so on object storage every read is an S3 round-trip; under ASan the
test times out (>180s, 5/33 reruns on the asan+s3+keeper sync lane). It runs
in ~0.5s on local storage, and the PartsSplitter bug under test is
storage-agnostic, so restricting to local storage loses no coverage.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread src/Processors/QueryPlan/PartsSplitter.cpp
optimizeJoinByShards::apply calls splitIntersectingPartsRangesIntoLayers
directly, bypassing the isSafePrimaryKey guard added for the
splitPartsWithRangesByPrimaryKey callers. With
query_plan_join_shard_by_pk_ranges = 1, a join whose common primary-key
prefix is a Map (or Map-derived, e.g. mapKeys(m)/m.keys) on a bucketed-Map
table (map_serialization_version_for_zero_level_parts = 'with_buckets')
therefore still dropped rows: the PK index is built from the in-memory
insertion-ordered Map while the data is read back in bucket order, so the
layer border filters (FilterSortedStreamByRange) drop rows whose positional
ColumnMap::compareAt disagrees.

Guard apply() with the same isSafePrimaryKey predicate; if any source has an
unsafe PK the sharding optimization is skipped (the join then reads through
the normal path). Non-Map PKs are unaffected.

Reproduced on the PR-head binary (which already fixes the non-join path):
INNER JOIN on a Map PK returned 19/20 rows, mapKeys(m) PK 109/110, with
query_plan_join_shard_by_pk_ranges = 1 and external join disabled; both
return the full count after this change. Found by clickhouse-gh[bot] review.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

Follow-up commit ea2cc9c085b closes the optimizeJoinByShards symmetric-path gap raised in the bot review. Pre-PR validation gate for this layer:

# Question Answer
a Deterministic repro? Yes. INNER JOIN on a bucketed-Map PK with query_plan_join_shard_by_pk_ranges = 1, query_plan_join_swap_table = 0, external join disabled, max_threads = 4: shard-off=20, shard-on=19. mapKeys(m) PK: 110 vs 109. Deterministic.
b Root cause explained? optimizeJoinByShards::apply calls splitIntersectingPartsRangesIntoLayers directly, bypassing the isSafePrimaryKey guard the splitPartsWithRangesByPrimaryKey callers use. The PK index is built from the insertion-ordered Map (serializeBinary) while with_buckets reads data back in bucket order, so the layer border filters (FilterSortedStreamByRange, positional ColumnMap::compareAt) drop rows. Same mechanism as the original fix, reached through a different splitter caller.
c Fix matches root cause? Yes. The guard is applied at the exact bypassing call site, reusing the same isSafePrimaryKey(primary_key, columns) predicate. No symptom masking.
d Test intent preserved / new tests added? Two regression sections added to 04217: Map PK join and mapKeys(m) PK join, each asserting equal counts with sharding off vs on. No existing assertions weakened.
e Both directions demonstrated? Yes. Removed the guard, rebuilt (Build ID changed), re-ran: 20->19 and 110->109 (drop reproduced). Restored guard, rebuilt: 20/20 and 110/110. Full test 04217 output matches .reference exactly.
f Fix is general, not a narrow patch? Yes. This was itself the "symmetric path" widening: the original fix guarded splitPartsWithRangesByPrimaryKey; this guards the only other direct splitIntersectingPartsRangesIntoLayers caller (optimizeJoinByShards). Both now route through the single central isSafePrimaryKey predicate, so Map/Map-derived (mapKeys/m.keys) PKs are covered uniformly. Non-Map PKs are unaffected.

Session id: cron:clickhouse-worker-slot-1:20260615-135600

@groeneai

Copy link
Copy Markdown
Contributor Author

CI finished on ea2cc9c085be. Both red checks are pre-existing infra flakes unrelated to this change (which touches PartsSplitter / optimizeJoinByShards + Map-PK tests):

  • Fast test (arm_darwin): 01951_distributed_push_down_limit and 01952_optimize_distributed_group_by_sharding_key. Known macOS-only EXPLAIN/loopback flake hitting many unrelated PRs on arm_darwin; the blocking Fast test passes.
  • Unit tests (amd_llvm_coverage): -FunctionsStress. Chronic coverage-build unit flake hitting 13+ unrelated PRs per day.

Neither is caused by this PR.

@tiandiwonder

Copy link
Copy Markdown
Contributor

CI finished on ea2cc9c085be. Both red checks are pre-existing infra flakes unrelated to this change (which touches PartsSplitter / optimizeJoinByShards + Map-PK tests):

  • Fast test (arm_darwin): 01951_distributed_push_down_limit and 01952_optimize_distributed_group_by_sharding_key. Known macOS-only EXPLAIN/loopback flake hitting many unrelated PRs on arm_darwin; the blocking Fast test passes.
  • Unit tests (amd_llvm_coverage): -FunctionsStress. Chronic coverage-build unit flake hitting 13+ unrelated PRs per day.

Neither is caused by this PR.

@groeneai please fix the flaky tests.

for (const auto & source : data.sources)
{
const auto & metadata = source->getStorageMetadata();
if (!isSafePrimaryKey(metadata->getPrimaryKey(), metadata->getColumns()))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optional: this uses isSafePrimaryKey but the file doesn't #include <Processors/QueryPlan/PartsSplitter.h> — it compiles via a transitive include.

adding the direct include guards against the include graph changing under it later.

@groeneai

Copy link
Copy Markdown
Contributor Author

@tiandiwonder thanks. Both flaky classes are being handled in separate PRs (not folded into this engine fix), and one is already done:

  1. Fast test (arm_darwin) 01951_distributed_push_down_limit + 01952_optimize_distributed_group_by_sharding_key: root cause was a reused macOS runner leaking lo0 aliases 127.0.0.2..16, which made remote('127.{1,2}', ...) look local and changed the EXPLAIN plan vs the committed reference (0 master hits; 55 unrelated PRs in 14d, arm_darwin only). Fixed in Add loopback interface aliases on macOS to run parallel-replica remote() tests #107435 (merged 2026-06-16): fail-closed lo0 de-alias teardown on the macOS Fast test job plus darwin.skip for these two distributed-EXPLAIN tests (joining the already-skipped 01244/01247/01814). That class should stop appearing on fresh runs now.

  2. Unit tests FunctionsStress (-FunctionsStress): a trunk-wide function-property fuzzer flake (100+ distinct PRs across the sanitizer/coverage lanes, ~38 master hits in 30d). Current shape is a re-execution LOGICAL_ERROR, e.g. Unsupported argument types for function geohashesInBox : Float32, Float32, Float32, Float32 raised after const-folding. The offending function rotates run to run, so each shape is a distinct latent argument-validation gap. Tracked separately; will be addressed in its own PR(s), not here.

Neither is caused by this PR, so it should not block merging the Map-PK splitter fix (which is green on its own changes). I will pursue the FunctionsStress fixes independently.

@Avogar

Avogar commented Jul 2, 2026

Copy link
Copy Markdown
Member

Let's close it in favour of #109178?

@groeneai

groeneai commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

Agreed, closing this in favor of #109178.

#109178 fixes the root cause: the index stores Map values in insertion order while bucketed serialization reorders the keys by hash in the data, so ColumnMap::compareAt (positional) disagrees. Restoring the original key order on deserialization makes index and data compare consistently, which removes the PartsSplitter boundary mismatch this PR routed around, and also fixes the broader ORDER BY / DISTINCT / GROUP BY / min / max / equality-filter issues on Map columns. The serialization-layer fix is the correct and more general one; this PR only forced Map primary keys onto the single in-order merging-pipe path.

One thing worth carrying over: #109178 doesn't seem to exercise the parallel split path that dropped rows here. 04217_split_intersecting_bucketed_map_row_drop reproduces it deterministically with merge_tree_read_split_ranges_into_intersecting_and_non_intersecting_injection_probability = 1 and max_threads = 4 on a Map primary key (including the composite / Tuple-containing-Map cases and the FINAL layer-split path). Happy to port it into #109178 as a regression guard, or you can pick it up from this branch.

Closing now.

@groeneai

groeneai commented Jul 2, 2026

Copy link
Copy Markdown
Contributor Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants