Throw TOO_LARGE_STRING_SIZE instead of aborting on a corrupt String size stream#108572
Throw TOO_LARGE_STRING_SIZE instead of aborting on a corrupt String size stream#108572groeneai wants to merge 1 commit into
Conversation
…ize stream
SerializationString::deserializeBinaryBulkWithSizeStream reads per-row
lengths from a separate `.size` sub-stream (WITH_SIZE_STREAM format) and
accumulates them into ColumnString offsets in
appendStringSizesToColumnStringOffsets, then computes
`bytes_to_read = offsets.back() - prev_last_offset` and calls
`data.resize(...)` BEFORE reading any character bytes.
The accumulation had no bound on the per-row size. A corrupt or desynced
sizes sub-stream (the sizes and data streams are stored separately, so a
bad granule / version skew / seek can desync them) can deliver a garbage
length with bit 63 set, making bytes_to_read >= 2^63. The pre-read resize
then reaches Allocator::checkSize, which throws a LOGICAL_ERROR
("Too large size ... passed to allocator") that aborts the server under
assert/sanitizer builds.
The existing readBigStrict hardening (ClickHouse#107196) catches the opposite
desync (data stream short, sizes intact) but runs after the resize, so it
does not help when the sizes stream is the corrupt one.
Bound the per-row size at the accumulation point and throw
TOO_LARGE_STRING_SIZE on a corrupt part, mirroring the bound the
single-stream path already enforces in deserializeBinaryImpl
(max_string_size = 16_GiB). A corrupt part now fails loudly at the point
of deserialization instead of aborting.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
cc @Avogar @Ergus — could you review this? It adds the missing per-row size bound on the |
|
Workflow [PR], commit [c387724] Summary: ❌
AI ReviewSummaryThis PR hardens Missing context / blind spots
Findings
Tests
Final VerdictStatus: Minimum required actions: apply the oversized-size validation to skipped rows as well, guard |
| /// would otherwise reach `data.resize()` and abort in Allocator::checkSize. Same bound as the | ||
| /// single-stream path (deserializeBinaryImpl). | ||
| static constexpr size_t max_string_size = 16_GiB; | ||
| if (size > max_string_size) |
There was a problem hiding this comment.
The new bound only applies to rows passed into appendStringSizesToColumnStringOffsets, but deserializeBinaryBulkWithSizeStream consumes the skipped prefix first as bytes_to_skip += sizes_data[i] without the same validation. A seeked read can therefore skip over the corrupt sizes. For example, if the first two skipped UInt64 sizes are both 1ULL << 63 and the read uses rows_offset = 2, the skip sum wraps to zero on 64-bit; this helper only validates the requested rows, ignore(0) leaves the data stream at row 0, and the result row is read from the wrong byte position instead of throwing TOO_LARGE_STRING_SIZE. With one huge skipped size it will usually throw EOF from ignore, so the promised error is still bypassed.
Please validate the skipped sizes with the same per-row bound, and use checked addition for bytes_to_skip before using it to seek the data stream. A focused regression with rows_offset > 0 would close this path.
CI finish ledger — c387724Every failure below has an owner; none is in this PR's diff ( Session id: cron:our-pr-ci-monitor:20260626-173000 |

Related: #107196
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fix a server abort when reading a
Stringcolumn stored with the separate size-stream format (WITH_SIZE_STREAM) from a corrupt or truncated part: a garbage per-row length in the.sizesub-stream could trigger an out-of-bounds allocation (Too large size ... passed to allocator) and abort the server under assert/sanitizer builds. Such a corrupt part now raisesTOO_LARGE_STRING_SIZEat the point of deserialization instead.Description
SerializationString::deserializeBinaryBulkWithSizeStreamreads per-row lengths from a separate.sizesub-stream, accumulates them intoColumnStringoffsets, then sizes the character buffer withdata.resize(...)before reading any character bytes. The accumulation had no bound on the per-row size.The size and data sub-streams are stored separately, so a corrupt part (or a desynced/seeked sizes stream) can deliver a garbage length with bit 63 set, making
bytes_to_read >= 2^63. The pre-readresizereachesAllocator::checkSize(size >= 0x8000000000000000), which throws aLOGICAL_ERRORand aborts under assert/sanitizer builds. Top of the observed stack:The existing
readBigStricthardening (#107196) catches the opposite desync (data stream short, sizes intact) but runs after the resize, so it does not help when the sizes stream is the corrupt one.The fix bounds the per-row size at the accumulation point in
appendStringSizesToColumnStringOffsetsand throwsTOO_LARGE_STRING_SIZE, mirroring the bound the single-stream path already enforces indeserializeBinaryImpl(max_string_size = 16_GiB).A unit test
WithSizeStreamCorruptSizeStreamThrowswas added (symmetric to the existingWithSizeStreamShortDataStreamThrows): it corrupts the first.sizeentry with a bit-63 length and asserts the deserializer throws instead of aborting. Verified the test aborts (SIGABRT) on the unfixed code and passes with the fix; all existingStringSerialization.*unit tests still pass.