Guard ColumnString::filter against offsets inconsistent with chars array#109184
Guard ColumnString::filter against offsets inconsistent with chars array#109184groeneai wants to merge 1 commit into
Conversation
ColumnString::filter (both the const and in-place overloads) handed
`offsets` and `chars` straight to filterArraysImpl / filterArraysImplInPlace
without checking that they are consistent. The SIMD fast path in
filterArraysImplGeneric (ColumnsCommon.cpp) trusts `offsets` to compute chunk
sizes: for a full-passing 64-row chunk it does
chunk_size = offsets_pos[63] - offsets_pos[-1], then resize + memcpy that many
bytes. If a corrupt column arrives with offsets.back() != chars.size() (offsets
overshoot chars, or non-monotonic), chunk_size is bogus and the memcpy runs off
the end of the destination/source buffer, an out-of-bounds write.
Observed in CI as a SEGV ("address not mapped") with the stack
__asan_memcpy
ColumnsCommon.cpp ResultOffsetsBuilder::insertChunk<64> / filterArraysImpl<char8_t>
ColumnString::filter -> ColumnNullable::filter -> FilterDescription::filter
-> FilterTransform::doTransform
when filtering a nullable String column produced by the parquet reader on a
metadata-cache-hit path. The corrupt column is produced upstream; this change
does not fix the upstream producer, it converts the silent out-of-bounds write
into a clean INCORRECT_DATA exception so the corruption fails safely and is
diagnosable instead of crashing (or silently corrupting memory) in release
builds.
This mirrors the invariant the ColumnString copy constructor already enforces
(offsets.back() == chars.size()) and the guard recently added to the sibling
ColumnString::insertRangeFrom path.
Adds unit tests that build a ColumnString whose offsets overshoot chars and
assert filter() throws INCORRECT_DATA instead of reading/writing out of bounds
(the large-overshoot case is an ASan heap-buffer-overflow without the guard),
plus a positive test that a consistent column still filters correctly.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
cc @Avogar @CurtizJ @davenger could you review this? It adds an O(1) consistency guard to |
|
Workflow [PR], commit [7ce9c3a] Summary: ❌
AI ReviewSummaryThis PR adds invariant checks in both Findings❌ Blockers
Final Verdict❌ Changes requested. |
| /// Enforce the ColumnString invariant (offsets.back() == chars.size(), same as the copy | ||
| /// constructor) before filterArraysImpl trusts `offsets` to size its memcpys; a violated | ||
| /// invariant would otherwise read/write out of bounds instead of failing cleanly. | ||
| if (offsets.back() != chars.size()) |
There was a problem hiding this comment.
filterArraysImpl and filterArraysImplInPlace rely on more than offsets.back() == chars.size(): copy_array subtracts every offset from the previous one. A corrupt column like offsets = [4, 2, 12] still passes this new check, but filtering the second row makes arr_size = 2 - 4, which underflows and drives a huge resize/memcpy in both paths. Since the upstream producer is still unknown, the guard here needs to reject non-monotonic offsets too, and the new tests should cover the decreasing/dipping cases the same way insertRangeFrom already does.
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered: 52/53 (98.11%) · Uncovered code |
CI finish ledger — 7ce9c3aEvery failure below has an owner: a fixing PR (ours or external), or a full-effort fix task whose fixing-PR link will be posted here when it opens. Only
Note: Session id: cron:our-pr-ci-monitor:20260702-173000 |
CI finish ledger — 7ce9c3aEvery failure below has an owner: a fixing PR (ours or external), or a full-effort fix task whose fixing-PR link will be posted here when it opens. Only Session id: cron:our-pr-ci-monitor:20260703-043000 |

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fixed a possible out-of-bounds write in
ColumnString::filterwhen aStringcolumn arrives withoffsetsinconsistent with itscharsarray (for exampleoffsets.back()not equal tochars.size()). Such a column now raises anINCORRECT_DATAexception instead of reading and writing out of bounds.Description
ColumnString::filter(both theconstand in-place overloads) passedoffsetsandcharsstraight tofilterArraysImpl/filterArraysImplInPlacewithout checking that they are consistent. The SIMD fast path infilterArraysImplGeneric(ColumnsCommon.cpp) trustsoffsetsto size its copies: for a full-passing 64-row chunk it computeschunk_size = offsets_pos[63] - offsets_pos[-1], then doesresize + memcpyof that many bytes. If a corrupt column arrives withoffsets.back() != chars.size(),chunk_sizeis bogus and the memcpy runs off the end of the buffer, an out-of-bounds write.This was observed in CI as a SEGV ("address not mapped to object") with the stack:
while filtering a nullable
Stringcolumn produced by the Parquet reader on a metadata-cache-hit path. The inconsistent column is produced upstream and the exact producer path is not deterministically reproducible, so this change does not touch the upstream producer. Instead it guards the point where the corruption turns into memory unsafety:ColumnString::filternow validates the same invariant theColumnStringcopy constructor already enforces (offsets.back() == chars.size()), matching the guard recently added to the siblingColumnString::insertRangeFrompath (#107196). A violated invariant becomes a cleanINCORRECT_DATAexception, which is handled normally in release builds, rather than a silent out-of-bounds write.The guard is O(1) (a single comparison on the already-loaded last offset), so there is no hot-path cost for well-formed columns.
Unit tests in
gtest_column_string.cppbuild aColumnStringwhose offsets overshootcharsand assert thatfilter()(both overloads) throwsINCORRECT_DATAinstead of reading/writing out of bounds. Without the guard, the large-overshoot case is an AddressSanitizer heap-buffer-overflow. A positive test confirms a consistent column still filters correctly over both the SIMD and the tail paths.No related open issue found.