Fix UBSan in MergeTreeDataPartWriterCompact::cancel by alexey-milovidov · Pull Request #101292 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix UBSan in MergeTreeDataPartWriterCompact::cancel#101292

Merged
alexey-milovidov merged 23 commits into
masterfrom
fix/ubsan-compact-writer-cancel
Apr 29, 2026
Merged

Fix UBSan in MergeTreeDataPartWriterCompact::cancel#101292
alexey-milovidov merged 23 commits into
masterfrom
fix/ubsan-compact-writer-cancel

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented Mar 31, 2026

Copy link
Copy Markdown
Member

In MergeTreeDataPartWriterCompact::addStreams, streams_by_codec[codec_id] inserts a null shared_ptr via operator[] before std::make_shared runs. If the allocation throws, the null entry remains in the map. When cancel is later called (e.g. from the background executor's error path), it iterates streams_by_codec and dereferences the null entry, causing UBSan to report undefined behavior.

The fix rewrites the insertion to be exception-safe: use find to check for an existing entry, and only emplace the fully-constructed shared_ptr returned by std::make_shared. If the allocation throws, the map is not modified, so no null entries can be observed by cancel.

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix undefined behavior in MergeTreeDataPartWriterCompact::cancel when a stream allocation fails.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Version info

  • Merged into: 26.5.1.164
  • Backported to: 26.4.2.6, 26.3.13.10

`streams_by_codec` can contain null entries when `addStreams` partially
fails: `std::map::operator[]` value-initializes (inserts null) before
`std::make_shared<CompressedStream>` runs, and if the allocation throws,
the null entry persists. When `cancel` is later called (e.g. from the
background executor's catch block), iterating and dereferencing these
null entries is undefined behavior caught by UBSan.

Add a null guard, consistent with the base class
`MergeTreeDataPartWriterOnDisk::cancel` which already guards its
optional stream pointers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Mar 31, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added pr-critical-bugfix pr-must-backport Pull request should be backported intentionally. Use this label with great care! labels Mar 31, 2026
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Let's create a test using failpoints.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

FYI, this is the top reason for crashes of 26.3.

alexey-milovidov and others added 3 commits March 31, 2026 05:04
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ll check

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

After rewriting the code in an exception-safe way, we can no longer add a specific test.

@alexey-milovidov alexey-milovidov added pr-improvement Pull request with some product improvements v26.3-must-backport and removed pr-must-backport Pull request should be backported intentionally. Use this label with great care! pr-critical-bugfix labels Mar 31, 2026
@tuanpach tuanpach self-assigned this Mar 31, 2026
@tuanpach

tuanpach commented Mar 31, 2026

Copy link
Copy Markdown
Member

The Changelog category should be Bug Fix?

Comment thread src/Storages/MergeTree/MergeTreeDataPartWriterCompact.cpp
@alexey-milovidov

Copy link
Copy Markdown
Member Author

The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.

antonio2368 added a commit that referenced this pull request Apr 7, 2026
…cellation

The Wide writer's `addStreams` used `operator[]=make_unique<>()` which
modifies the map before evaluating `make_unique`. If the constructor
throws (e.g. `MEMORY_LIMIT_EXCEEDED`), a null entry remains in
`column_streams`. When `cancel()` iterates the map and dereferences
the null entry, it crashes with SIGSEGV.

Replace with `emplace` so `make_unique` is evaluated before the map is
modified — if it throws, no null entry is left. Also add a defensive
null check in `cancel()`.

Same class of bug as #101292 (Compact writer fix).
ClickHouse/clickhouse-private#1518

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
alexey-milovidov added a commit that referenced this pull request Apr 10, 2026
When the join optimizer reorders a multi-table join, type changes
(e.g., `toNullable` from a LEFT JOIN) were applied at the first step
involving the affected relation, regardless of whether the join that
causes the type change was actually being executed at that step.

For example, with `t1 LEFT JOIN t2 INNER JOIN t3`, if the optimizer
reorders to join t2-t3 first, the Nullable wrapping for t2 (from the
LEFT JOIN with t1) was incorrectly applied at the t2-t3 step. This
replaced original input nodes in `current_input_nodes`, so when the
t1-t2 LEFT JOIN expression was later processed, it could not find the
original `t2.id` INPUT node — causing the exception "Cannot fold
actions for projection".

The fix checks the `join_kinds` dependency before applying type changes:
a type change for a relation is only applied when the other side of the
type-changing join is also present in the current step's joined set.

https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=101292&sha=2d129c6df44b0ed28a05397e55abe0cbbff9bb1e&name_0=PR&name_1=Stress%20test%20%28arm_msan%29
#101292

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Apr 29, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.00% 84.00% +0.00%
Functions 91.10% 91.10% +0.00%
Branches 76.60% 76.50% -0.10%

Changed lines: 100.00% (15/15) · Uncovered code

Full report · Diff report

@alexey-milovidov alexey-milovidov added this pull request to the merge queue Apr 29, 2026
Merged via the queue into master with commit 880f38d Apr 29, 2026
165 checks passed
@alexey-milovidov alexey-milovidov deleted the fix/ubsan-compact-writer-cancel branch April 29, 2026 03:36
@robot-ch-test-poll robot-ch-test-poll added the pr-synced-to-cloud The PR is synced to the cloud repo label Apr 29, 2026
@robot-ch-test-poll1 robot-ch-test-poll1 added the pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR label Apr 29, 2026
robot-ch-test-poll1 added a commit that referenced this pull request May 6, 2026
Cherry pick #101292 to 26.4: Fix UBSan in MergeTreeDataPartWriterCompact::cancel
alexey-milovidov added a commit that referenced this pull request May 6, 2026
Backport #101292 to 26.4: Fix UBSan in MergeTreeDataPartWriterCompact::cancel
@robot-clickhouse robot-clickhouse added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label May 9, 2026
@Algunenano Algunenano mentioned this pull request May 25, 2026
@Algunenano Algunenano self-assigned this May 25, 2026
@robot-ch-test-poll1 robot-ch-test-poll1 removed the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label May 25, 2026
robot-ch-test-poll1 added a commit that referenced this pull request May 25, 2026
Cherry pick #101292 to 26.3: Fix UBSan in MergeTreeDataPartWriterCompact::cancel
@robot-ch-test-poll3 robot-ch-test-poll3 added the pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore label May 25, 2026
Algunenano added a commit that referenced this pull request May 25, 2026
Backport #101292 to 26.3: Fix UBSan in MergeTreeDataPartWriterCompact::cancel
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-backports-created Backport PRs are successfully created, it won't be processed by CI script anymore pr-bugfix Pull request with bugfix, not backported by default pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR pr-synced-to-cloud The PR is synced to the cloud repo v26.3-must-backport v26.4-must-backport

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants