groeneai · 2026-06-26T05:49:08Z

Changelog category (leave one):

Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix a server abort when reading a String column stored with the separate size-stream format (WITH_SIZE_STREAM) from a corrupt or truncated part: a garbage per-row length in the .size sub-stream could trigger an out-of-bounds allocation (Too large size ... passed to allocator) and abort the server under assert/sanitizer builds. Such a corrupt part now raises TOO_LARGE_STRING_SIZE at the point of deserialization instead.

Description

SerializationString::deserializeBinaryBulkWithSizeStream reads per-row lengths from a separate .size sub-stream, accumulates them into ColumnString offsets, then sizes the character buffer with data.resize(...) before reading any character bytes. The accumulation had no bound on the per-row size.

The size and data sub-streams are stored separately, so a corrupt part (or a desynced/seeked sizes stream) can deliver a garbage length with bit 63 set, making bytes_to_read >= 2^63. The pre-read resize reaches Allocator::checkSize (size >= 0x8000000000000000), which throws a LOGICAL_ERROR and aborts under assert/sanitizer builds. Top of the observed stack:

checkSize  src/Common/Allocator.cpp:122  ("Too large size ... passed to allocator")
Allocator<false,false>::realloc
DB::SerializationString::deserializeBinaryBulkWithSizeStream
DB::MergeTreeReaderCompact::readData
...
DB::MergeTreeSource::tryGenerate
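
To make the failure mode concrete, here is a minimal sketch of an unbounded accumulation. It is not the ClickHouse code: `accumulateUnchecked` and `allocatorWouldReject` are hypothetical stand-ins, and the threshold is simply the `0x8000000000000000` value quoted above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the unfixed accumulation: every per-row size read
// from the `.size` sub-stream is added to the running offset with no bound.
inline uint64_t accumulateUnchecked(const std::vector<uint64_t> & sizes, std::vector<uint64_t> & offsets)
{
    const uint64_t prev_last_offset = offsets.empty() ? 0 : offsets.back();
    uint64_t current = prev_last_offset;
    for (uint64_t size : sizes)
    {
        current += size; // unsigned arithmetic: wraps silently, never traps
        offsets.push_back(current);
    }
    // Plays the role of bytes_to_read, the value handed to data.resize(...).
    return current - prev_last_offset;
}

// Mirrors the threshold quoted above for Allocator::checkSize (assumption:
// only the comparison is modelled, not the allocator itself).
inline bool allocatorWouldReject(uint64_t size)
{
    return size >= 0x8000000000000000ULL;
}
```

A single entry with bit 63 set is enough to push the computed byte count past the threshold, before any character byte has been read.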

The existing readBigStrict hardening (#107196) catches the opposite desync (data stream short, sizes intact) but runs after the resize, so it does not help when the sizes stream is the corrupt one.

The fix bounds the per-row size at the accumulation point in appendStringSizesToColumnStringOffsets and throws TOO_LARGE_STRING_SIZE, mirroring the bound the single-stream path already enforces in deserializeBinaryImpl (max_string_size = 16_GiB).
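
A rough sketch of what bounding at the accumulation point looks like, under stated assumptions: `TooLargeStringSize` stands in for a `DB::Exception` carrying `TOO_LARGE_STRING_SIZE`, `appendSizesChecked` is a hypothetical name, and the validate-then-append ordering is one way to keep the offsets untouched on the throw path, not necessarily how the patch does it.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Stand-in for DB::Exception with ErrorCodes::TOO_LARGE_STRING_SIZE (assumption).
struct TooLargeStringSize : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

constexpr uint64_t max_string_size_sketch = 16ULL << 30; // 16 GiB, same bound as quoted above

// Hypothetical guarded accumulation: reject any oversized per-row size before
// it can contribute to the offsets (and therefore to bytes_to_read).
inline void appendSizesChecked(const std::vector<uint64_t> & sizes, std::vector<uint64_t> & offsets)
{
    // Validate first so that nothing is appended if any entry is corrupt.
    for (uint64_t size : sizes)
        if (size > max_string_size_sketch)
            throw TooLargeStringSize("Too large string size in the size stream");

    uint64_t current = offsets.empty() ? 0 : offsets.back();
    for (uint64_t size : sizes)
    {
        current += size;
        offsets.push_back(current);
    }
}
```

With this shape, a bit-63 length surfaces as an ordinary exception the caller can report, instead of reaching the resize.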

A unit test WithSizeStreamCorruptSizeStreamThrows was added (symmetric to the existing WithSizeStreamShortDataStreamThrows): it corrupts the first .size entry with a bit-63 length and asserts the deserializer throws instead of aborting. Verified the test aborts (SIGABRT) on the unfixed code and passes with the fix; all existing StringSerialization.* unit tests still pass.
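
For readers who want to picture the corruption step, a small sketch follows. It assumes the `.size` entries are raw UInt64 values laid out back to back (host byte order here); the helpers `packSizes`, `corruptFirstSize` and `readSizeAt` are hypothetical and are not taken from the gtest.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Pack per-row sizes as consecutive raw UInt64 values (assumed layout).
inline std::string packSizes(const std::vector<uint64_t> & sizes)
{
    std::string out(sizes.size() * sizeof(uint64_t), '\0');
    if (!sizes.empty())
        std::memcpy(&out[0], sizes.data(), out.size());
    return out;
}

// Overwrite the first entry with a garbage length, as the test is described to do.
inline void corruptFirstSize(std::string & size_stream, uint64_t garbage)
{
    if (size_stream.size() >= sizeof(garbage))
        std::memcpy(&size_stream[0], &garbage, sizeof(garbage));
}

// Read back the size stored for a given row.
inline uint64_t readSizeAt(const std::string & size_stream, size_t row)
{
    uint64_t value = 0;
    std::memcpy(&value, size_stream.data() + row * sizeof(uint64_t), sizeof(value));
    return value;
}
```

Only the first entry changes; the data stream and the remaining sizes stay valid, which is what isolates the corrupt-sizes case from the short-data case.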

…ize stream

SerializationString::deserializeBinaryBulkWithSizeStream reads per-row lengths from a separate `.size` sub-stream (WITH_SIZE_STREAM format) and accumulates them into ColumnString offsets in appendStringSizesToColumnStringOffsets, then computes `bytes_to_read = offsets.back() - prev_last_offset` and calls `data.resize(...)` BEFORE reading any character bytes. The accumulation had no bound on the per-row size.

A corrupt or desynced sizes sub-stream (the sizes and data streams are stored separately, so a bad granule / version skew / seek can desync them) can deliver a garbage length with bit 63 set, making bytes_to_read >= 2^63. The pre-read resize then reaches Allocator::checkSize, which throws a LOGICAL_ERROR ("Too large size ... passed to allocator") that aborts the server under assert/sanitizer builds.

The existing readBigStrict hardening (ClickHouse#107196) catches the opposite desync (data stream short, sizes intact) but runs after the resize, so it does not help when the sizes stream is the corrupt one.

Bound the per-row size at the accumulation point and throw TOO_LARGE_STRING_SIZE on a corrupt part, mirroring the bound the single-stream path already enforces in deserializeBinaryImpl (max_string_size = 16_GiB). A corrupt part now fails loudly at the point of deserialization instead of aborting.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

groeneai · 2026-06-26T05:49:27Z

Pre-PR validation gate

#	Question	Answer
a	Deterministic repro?	Yes. Unit test `WithSizeStreamCorruptSizeStreamThrows` corrupts the first `.size` UInt64 to `0x8000000000000001`; the unfixed binary aborts (SIGABRT, exit 134) in `deserializeBinaryBulkWithSizeStream` -> `Allocator::checkSize` every run.
b	Root cause explained?	Per-row lengths come from a separate `.size` sub-stream; `appendStringSizesToColumnStringOffsets` accumulates them with no bound, so `bytes_to_read = offsets.back() - prev_last_offset >= 2^63`; `data.resize(...)` reaches `Allocator::checkSize` (`size >= 0x8000000000000000`) -> LOGICAL_ERROR -> abort under assert/sanitizer.
c	Fix matches root cause?	Yes. Bounds the per-row size at the accumulation point (before it can grow `bytes_to_read`), throwing `TOO_LARGE_STRING_SIZE`.
d	Test intent preserved / new test added?	New test added (symmetric to the existing `WithSizeStreamShortDataStreamThrows`). No existing test weakened; all 4 `StringSerialization.*` tests pass.
e	Demonstrated both directions?	Yes. Aborts (exit 134) without the fix; throws `TOO_LARGE_STRING_SIZE` and the suite passes with the fix.
f	Fix is general, not a narrow patch?	Yes. Checked siblings: the single-stream path already guards this (`deserializeBinaryImpl`, `16_GiB`); `SerializationStringSize` resizes by row-count, not by a stream-derived byte value, so it is not vulnerable. This was the only unguarded variable-length-resize-from-stream-value site.
g	Generalizes across inputs (params/datatypes/wrappers)?	Yes. The guard is per-row and catches both a single bit-63 length and any individual size > 16_GiB; cumulative overflow is bounded away (`<= rows * 16_GiB` cannot reach 2^63 in one call). `Nullable(String)` / `LowCardinality(String)` use the same `SerializationString` underneath, so the same guard covers them.
h	Backward compatible?	Yes. Pure hardening. Any legitimately serialized part has per-row sizes far below 16_GiB; the rejected range (`> 16_GiB`) was always an allocator abort before. No format / setting / default change.
i	Invariants and contracts preserved?	Yes. On the throw path nothing is committed (exception propagates before offsets/chars are finalized); on the valid path behavior is byte-identical. The `offsets.back() == chars.size()` consistency invariant is unaffected.
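
The arithmetic behind row g can be checked at compile time. This is a sketch of the bound only: the constant names are hypothetical, and the premise that a single call reads far fewer than 2^29 rows is taken from the claim in the table, not verified here.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kMaxStringSize = 16ULL << 30;  // 16 GiB == 2^34 bytes
constexpr uint64_t kAllocatorLimit = 1ULL << 63;  // threshold quoted for Allocator::checkSize

static_assert(kMaxStringSize == (1ULL << 34), "16 GiB is 2^34 bytes");

// Smallest row count whose worst-case byte total reaches the allocator limit.
constexpr uint64_t kRowsToReachLimit = kAllocatorLimit / kMaxStringSize;
static_assert(kRowsToReachLimit == (1ULL << 29), "2^63 / 2^34 == 2^29 rows");

// Worst-case bytes_to_read for one call that reads `rows` guarded rows.
constexpr uint64_t worstCaseBytes(uint64_t rows)
{
    return rows * kMaxStringSize;
}

static_assert(worstCaseBytes(65536) < kAllocatorLimit, "a 65536-row read stays far below 2^63");
```

So the per-row guard keeps the cumulative total below 2^63 as long as one call handles fewer than 2^29 (536,870,912) rows.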

Session id: cron:clickhouse-worker-slot-6:20260626-052400

groeneai · 2026-06-26T05:49:52Z

cc @Avogar @Ergus — could you review this? It adds the missing per-row size bound on the WITH_SIZE_STREAM String read path so a corrupt/desynced .size sub-stream throws TOO_LARGE_STRING_SIZE instead of aborting in the allocator. This is the complement to the readBigStrict hardening in #107196 (which covers the data-stream-short desync, after the resize); here the sizes stream is the corrupt one and the abort happens at the pre-read data.resize().

clickhouse-gh · 2026-06-26T10:30:51Z

Workflow [PR], commit [c387724]

Summary: ❌

Performance Comparison: Performance dashboard

job_name	test_name	status	info
Stress test (amd_tsan)		FAIL
	Hung check failed, possible deadlock found	FAIL	cidb, issue
Stress test (amd_msan)		FAIL
	Cannot start clickhouse-server	FAIL	cidb
	Logical error: 'Unexpected exception in refresh scheduling' (STID: 2508-3e7b)	FAIL	cidb, issue
	Check failed	FAIL	cidb
Performance Comparison (arm_release, master_head, 4/6)		FAIL	Performance dashboard
	grace_hash_join #1::old	FAIL	query history
	grace_hash_join #1::new	FAIL	query history
	grace_hash_join #5::old	FAIL	query history
	grace_hash_join #5::new	FAIL	query history
	group_by_sundy_li #1::old	FAIL	query history
	group_by_sundy_li #1::new	FAIL	query history
	like_perfect_affix_rewrite #1::old	FAIL	query history
	like_perfect_affix_rewrite #1::new	FAIL	query history
	like_perfect_affix_rewrite #3::old	FAIL	query history
	like_perfect_affix_rewrite #3::new	FAIL	query history
	18 more test cases not shown
Stateless tests (amd_llvm_coverage, ParallelReplicas, s3 storage, parallel)		ERROR
LLVM Coverage		DROPPED

AI Review

Summary

This PR hardens WITH_SIZE_STREAM String deserialization by rejecting oversized per-row lengths before they can drive ColumnString char-buffer allocation. The direction matches the single-stream path, and the PR metadata is appropriate, but the same untrusted size stream is still consumed unchecked for seeked reads, so the corrupt-size-stream fix is incomplete.

Missing context / blind spots

⚠️ I could not run the new gtest locally because this checkout has no build directory or test binary. The Praktika CI report for c387724d currently shows no failed checks.

Findings

⚠️ Majors

[src/DataTypes/Serializations/SerializationString.cpp:675] The new TOO_LARGE_STRING_SIZE guard only covers sizes appended for result rows. In deserializeBinaryBulkWithSizeStream, rows_offset sizes are accumulated into bytes_to_skip before this helper and are never checked. If two skipped corrupt sizes are both 1ULL << 63, the skip sum wraps to zero on 64-bit; the guarded helper then validates only the requested rows, ignore(0) leaves the data stream at row 0, and the seeked read can return bytes from the wrong position instead of rejecting the corrupt size stream. Validate skipped sizes with the same per-row bound and checked addition before ignore.
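
The wraparound in this finding is plain unsigned 64-bit arithmetic and can be reproduced in isolation. The function below is a hypothetical mirror of the described `bytes_to_skip += sizes_data[i]` loop, not the actual source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical mirror of the unchecked skip loop: sum the sizes of the first
// rows_offset rows to learn how many data bytes to ignore.
inline uint64_t bytesToSkipUnchecked(const std::vector<uint64_t> & sizes_data, size_t rows_offset)
{
    uint64_t bytes_to_skip = 0;
    for (size_t i = 0; i < rows_offset && i < sizes_data.size(); ++i)
        bytes_to_skip += sizes_data[i]; // wraps modulo 2^64
    return bytes_to_skip;
}
```

Two skipped entries of `1ULL << 63` sum to 2^64, which is 0 modulo 2^64, so the subsequent `ignore` would be asked to skip nothing.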

Tests

⚠️ Add a focused WITH_SIZE_STREAM regression that corrupts one or more skipped .size entries and reads with rows_offset > 0, proving the deserializer throws TOO_LARGE_STRING_SIZE instead of throwing EOF or returning misaligned data.

Final Verdict

Status: ⚠️ Request changes

Minimum required actions: apply the oversized-size validation to skipped rows as well, guard bytes_to_skip against overflow, and add the seeked-read regression.

Ergus

The change looks good to me. The limit is pretty high, so not sure if it could be hit in a realistic use case. @Avogar WDYT?

clickhouse-gh · 2026-06-26T13:30:13Z

+        /// would otherwise reach `data.resize()` and abort in Allocator::checkSize. Same bound as the
+        /// single-stream path (deserializeBinaryImpl).
+        static constexpr size_t max_string_size = 16_GiB;
+        if (size > max_string_size)


The new bound only applies to rows passed into appendStringSizesToColumnStringOffsets, but deserializeBinaryBulkWithSizeStream consumes the skipped prefix first as bytes_to_skip += sizes_data[i] without the same validation. A seeked read can therefore skip over the corrupt sizes. For example, if the first two skipped UInt64 sizes are both 1ULL << 63 and the read uses rows_offset = 2, the skip sum wraps to zero on 64-bit; this helper only validates the requested rows, ignore(0) leaves the data stream at row 0, and the result row is read from the wrong byte position instead of throwing TOO_LARGE_STRING_SIZE. With one huge skipped size it will usually throw EOF from ignore, so the promised error is still bypassed.

Please validate the skipped sizes with the same per-row bound, and use checked addition for bytes_to_skip before using it to seek the data stream. A focused regression with rows_offset > 0 would close this path.
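
One possible shape for the requested validation, as a sketch under assumptions: `CorruptSizeStream` stands in for a `DB::Exception` with `TOO_LARGE_STRING_SIZE`, `bytesToSkipChecked` is a hypothetical helper, and the overflow test uses a portable comparison rather than any particular ClickHouse utility.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <vector>

// Stand-in for DB::Exception with ErrorCodes::TOO_LARGE_STRING_SIZE (assumption).
struct CorruptSizeStream : std::runtime_error
{
    using std::runtime_error::runtime_error;
};

constexpr uint64_t kSkipMaxStringSize = 16ULL << 30; // same 16 GiB per-row bound

// Hypothetical checked variant of the skip loop: apply the per-row bound to
// skipped rows too, and refuse any addition that would overflow.
inline uint64_t bytesToSkipChecked(const std::vector<uint64_t> & sizes_data, size_t rows_offset)
{
    uint64_t bytes_to_skip = 0;
    for (size_t i = 0; i < rows_offset && i < sizes_data.size(); ++i)
    {
        const uint64_t size = sizes_data[i];
        if (size > kSkipMaxStringSize)
            throw CorruptSizeStream("Too large string size in skipped rows");
        if (size > std::numeric_limits<uint64_t>::max() - bytes_to_skip)
            throw CorruptSizeStream("Skipped sizes overflow bytes_to_skip");
        bytes_to_skip += size;
    }
    return bytes_to_skip;
}
```

With the per-row bound in place the overflow check is only reachable for very large `rows_offset`, but keeping it makes the skip sum safe independently of how many rows are skipped.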

groeneai · 2026-06-26T17:49:12Z

CI finish ledger — `c387724`

Every failure below has an owner; none is in this PR's diff (SerializationString.cpp, gtest_string_serialization.cpp), so none is PR-caused.

Session id: cron:our-pr-ci-monitor:20260626-173000

Ergus self-assigned this Jun 26, 2026

clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Jun 26, 2026

Ergus reviewed Jun 26, 2026

nikitamikhaylov added the can be tested Allows running workflows for external contributors label Jun 26, 2026

clickhouse-gh Bot reviewed Jun 26, 2026

groeneai mentioned this pull request Jun 27, 2026

Fix flaky test 04402_map_functions_lowcardinality #108677

Open

Throw TOO_LARGE_STRING_SIZE instead of aborting on a corrupt String size stream #108572

groeneai wants to merge 1 commit into ClickHouse:master from groeneai:fix-string-size-stream-allocator-abort
