Tests for changes in #107196#107355
Conversation
…rom` The consistency guard added for inconsistent source offsets computed `nested_length` before validating it. With unsigned arithmetic, a source whose offsets decrease (`offsetAt(start + length) < offsetAt(start)`) underflows `nested_length`, and `nested_offset + nested_length` wraps back to the smaller end offset, so the `> chars.size()` check silently passes and the following `resize`/`memcpy` operate on the huge wrapped length. Validate the end offset directly (`nested_end < nested_offset || nested_end > chars.size()`) before computing `nested_length`, and add a regression test with decreasing source offsets that asserts `INCORRECT_DATA`. Addresses the AI review blocker on ClickHouse#107355 Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
| /// (the same invariant the copy constructor enforces). Validate the end offset before computing | ||
| /// the length: the offsets must be monotonic (nested_end >= nested_offset) and must stay within | ||
| /// chars, otherwise the memcpy below would read out of bounds. Computing nested_length first would | ||
| /// underflow for decreasing offsets and wrap nested_offset + nested_length back below chars.size(), |
There was a problem hiding this comment.
The new endpoint check still lets malformed intermediate offsets corrupt the destination column. For example, with source offsets [4, 2, 12], chars.size() == 12, and insertRangeFrom(src, 1, 2), the guard sees nested_offset == 4 and nested_end == 12, so the byte copy is bounded. But the offset rewrite below then computes src.offsets[1] - nested_offset, i.e. 2 - 4, which underflows and stores a huge destination offset. The result is a ColumnString whose offsets no longer describe its chars, so later reads can go out of bounds. Please validate every copied offset is monotonic and within [nested_offset, nested_end] before copying, or build offsets in a way that rejects any intermediate offset below the previous one.
| // Intermediate offsets can still be non-monotonic | ||
| /// A copied offset that dips below the previous one (in particular below nested_offset) would underflow the subtraction and | ||
| /// store a corrupt destination offset, so reject it here. | ||
| if (src_offset < prev_src_offset) |
There was a problem hiding this comment.
This validation still runs too late and does not cover the fast path above. With source offsets [4, 2, 12], chars.size() == 12, and insertRangeFrom(src, 0, 3) into an empty destination, the endpoint guard passes, the bytes are copied, and the start == 0 && offsets.empty() branch assigns the non-monotonic offsets unchanged because this loop is skipped. For a non-zero start, this check can also throw only after chars has been resized/copied and offsets has been resized, so a caught INCORRECT_DATA exception leaves the destination partially mutated. Please validate every copied source offset is within [nested_offset, nested_end] and monotonic before mutating chars or offsets, then perform the copy/assign.
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered by tests: 134/135 (99.26%) | Lost baseline coverage: none · Uncovered code |
bcb49a5
|
Hi @Avogar Since these changes I have been wondering how these changes (specially the last commit) could affect performance (by inhibiting vectorization due to the if-throw inside the look) Also the checked path is only the "slow" one, the direct one is not checked and similarly with the other copy. I actually added this commit to the branch: 69b03b7 after it was merged in order to always check monotonicity but only on debug and asan builds... There is a trade off here... WDYT? |

Related: #107196
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
...
Version info
26.6.1.923