fix possible crash with filesystem_cache_boundary_alignment by bharatnc · Pull Request #92579 · ClickHouse/ClickHouse · GitHub
Skip to content

fix possible crash with filesystem_cache_boundary_alignment#92579

Merged
alexey-milovidov merged 21 commits into
masterfrom
ncb/fix-cache-boundary-algn
Jun 8, 2026
Merged

fix possible crash with filesystem_cache_boundary_alignment#92579
alexey-milovidov merged 21 commits into
masterfrom
ncb/fix-cache-boundary-algn

Conversation

@bharatnc

@bharatnc bharatnc commented Dec 18, 2025

Copy link
Copy Markdown
Contributor

Handle a potential size_t overflow in FileCacheUtils::roundUpToMultiple when num + multiple - 1 would exceed SIZE_MAX, and make FileCacheUtils.h self-contained by including <cstddef> / <cstdint> directly instead of relying on transitive includes from Core/Types.h.

Ref: #85552

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fix a potential overflow in roundUpToMultiple when computing filesystem cache boundary alignment for very large offsets.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Note

Medium Risk
Touches filesystem cache range alignment math and introduces a new overflow-throwing path, which could change behavior for extreme inputs/configs.

Overview
Fixes a potential size_t overflow in FileCacheUtils::roundUpToMultiple by replacing the num + multiple - 1 trick with remainder-based rounding and an explicit overflow check that throws std::overflow_error when the rounded value can’t fit.

Makes FileCacheUtils.h self-contained and header-friendly by switching helpers to inline and including <cstddef>, <cstdint>, and <stdexcept> instead of relying on transitive Core/Types.h includes.

Reviewed by Cursor Bugbot for commit 3a1b284. Bugbot is set up for automated code reviews on this repo. Configure here.

Version info

  • Merged into: 26.6.1.490

@clickhouse-gh

clickhouse-gh Bot commented Dec 18, 2025

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Dec 18, 2025
@bharatnc bharatnc force-pushed the ncb/fix-cache-boundary-algn branch from f110148 to 70706ef Compare December 19, 2025 07:49
@kssenii kssenii self-assigned this Dec 19, 2025
@bharatnc bharatnc force-pushed the ncb/fix-cache-boundary-algn branch 2 times, most recently from 2a08cc5 to 515c457 Compare December 21, 2025 17:50
@bharatnc bharatnc force-pushed the ncb/fix-cache-boundary-algn branch from 515c457 to 9129ff2 Compare December 21, 2025 17:51
Comment thread src/Interpreters/FileCache/FileCacheUtils.h Outdated
@bharatnc bharatnc force-pushed the ncb/fix-cache-boundary-algn branch from ee98161 to 29e757d Compare January 2, 2026 10:36
Comment thread src/Interpreters/Cache/FileCache.cpp Outdated
@kssenii

kssenii commented Jan 5, 2026

Copy link
Copy Markdown
Member

C++ exception with description "Setting 'boundary_alignment' (4194304) cannot be greater than 'max_file_segment_size' (5)" thrown in the test body.
seems unit tests need to be adjusted...

@bharatnc

Copy link
Copy Markdown
Contributor Author

I'm not able to reproduce this anymore in master.

@bharatnc

Copy link
Copy Markdown
Contributor Author

I'm not able to reproduce this anymore in master.

But it might still be nice to have these checks, I'll try to keep some of the changes and remove bug fix label.

alexey-milovidov and others added 2 commits April 6, 2026 22:52
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ary-algn

# Conflicts:
#	src/Interpreters/Cache/FileCacheSettings.cpp
@alexey-milovidov alexey-milovidov self-assigned this Apr 6, 2026
Comment thread src/Interpreters/FileCache/FileCacheUtils.h Outdated
alexey-milovidov and others added 2 commits April 6, 2026 23:49
…ble test

Add `#include <cstddef>` and `#include <cstdint>` to `FileCacheUtils.h` to
replace the removed `#include <Core/Types.h>`, ensuring `size_t` and `SIZE_MAX`
are properly declared.

Remove test `03734_filesystem_cache_boundary_alignment` because the underlying
bug is no longer reproducible on master, causing bugfix validation to fail.
The `boundary_alignment` validation was already added to master independently.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@clickhouse-gh clickhouse-gh Bot added pr-improvement Pull request with some product improvements and removed pr-bugfix Pull request with bugfix, not backported by default labels Apr 6, 2026
@alexey-milovidov

Copy link
Copy Markdown
Member

The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.

… smaller value

The previous overflow guard returned `roundDownToMultiple(SIZE_MAX, multiple)`,
which can be strictly less than `num` (e.g. `num = SIZE_MAX - 1`, `multiple = 4`
yields `SIZE_MAX - 3`). This violated the post-condition `result >= num` that
callers rely on (`chassert(desired_size >= downloaded_size)` in
`FileSegment.cpp` and `chassert(aligned_end_offset >= initial_range.right)` in
`FileCache.cpp`), and in release builds could underflow subsequent size
arithmetic.

`FileCacheSettings::validate` already rejects absurdly large
`boundary_alignment` values, so the overflow branch is defensive and unreachable
on valid configurations. Throw `std::overflow_error` instead of returning a
silently corrupted value.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/Interpreters/FileCache/FileCacheUtils.h Outdated
alexey-milovidov and others added 2 commits May 23, 2026 13:54
…w check

The previous `num > SIZE_MAX - multiple + 1` guard detects overflow of the
`num + multiple - 1` implementation trick, not overflow of the mathematical
result. For `multiple = SIZE_MAX, num = 2` it computes `limit = 1` and throws,
but the correct rounded-up multiple `SIZE_MAX` is representable.

Switch to a remainder-based computation: `delta = multiple - (num % multiple)`,
throw only when `num + delta` would overflow `size_t`. This keeps the
post-condition `result >= num` without rejecting representable cases.

Make both helpers `inline` instead of `static` so translation units that include
the header without calling one of them (e.g. `FileSegment.cpp` no longer
exercises `roundDownToMultiple` after the new implementation removes the call)
do not hit `-Werror,-Wunused-function`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/Interpreters/FileCache/FileCacheUtils.h
Adds focused gtests that pin the contract of `roundUpToMultiple` at the
`size_t` extremes:

- Representable boundaries (e.g. `num = 2`, `multiple = SIZE_MAX`
  returns `SIZE_MAX`) are not rejected by the new remainder-based
  rounding (the old `num + multiple - 1` trick would have falsely
  overflowed).
- True overflows (e.g. `num = SIZE_MAX`, `multiple = SIZE_MAX - 1`)
  throw `std::overflow_error` rather than wrapping.

Also covers the basic and zero-multiple cases for `roundDownToMultiple`
and `roundUpToMultiple` so a future refactor of the rounding formula
cannot silently regress the post-condition `result >= num`.
@alexey-milovidov

Copy link
Copy Markdown
Member

@groeneai, please fix this deadlock Hung check failed, possible deadlock found and send a PR.

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov investigated. The Hung check on Stress test (arm_debug) here is not caused by this PR.

Where it hangs. Single stuck query in hung_check.log: an AsyncInsertFlush of INSERT INTO test_l8brwlta.t (c1) FORMAT Values from 03164_analyzer_validate_tree_size.sql, elapsed 553s, blocked on:

DB::CachePriorityGuard::writeLock()
DB::FileCache::doTryReserve  -- FileCache.cpp:1212  (auto lock = cache_guard.writeLock();)
DB::FileCache::tryReserve
DB::FileSegment::reserve
DB::FileSegmentRangeWriter::write
DB::CachedOnDiskWriteBufferFromFile::cacheData
...
DB::MergeTreeSink::consume

So the thread is waiting on the global cache_guard (a SharedMutex) exclusive lock that protects the priority queue, on a cached_azure disk under stress.

Why this PR cannot cause it. PR #92579 changes one inline helper in src/Interpreters/FileCache/FileCacheUtils.h (roundUpToMultiple) plus a gtest. Its four callers are:

  • FileCache.cpp:822-825 (read-range alignment in getCacheReadRanges)
  • FileSegment.cpp:720 (shrinkFileSegmentSize)
  • FileSegment.cpp:770,772 (getBackgroundDownloadDesiredSize)

None of these appear in the hung stack. The hang is at cache_guard.writeLock(), which is a SharedMutex acquisition for priority-queue mutation, never reached from roundUpToMultiple and unaffected by the new overflow check.

Cross-PR / master data (30d). Hung check on this exact lane:

check_name hits PRs master
Stress test (arm_debug) 17 15 0

Across all stress lanes the chronic Hung check family has 565 master hits in 30d, dominated by Stress test (experimental, serverfuzz, ...) variants. This sighting matches the established PR-only flaky shape for the regular Stress test (arm_debug) lane.

Recommendation. Retrigger Stress test (arm_debug) to unblock the PR. The underlying FileCache priority writeLock contention under heavy async-insert + cached_azure stress is owned by kssenii and recent FileCache.cpp work (c52f7c23, e7321d6d, 9ed6da55) is in that area, but it is not actionable from this PR.

I have not opened a separate fix PR for the cache_guard.writeLock contention because I cannot reproduce it locally and the FileCache priority queue lock design needs the component owner's review. Happy to dig deeper if you have a deterministic repro or want me to file an issue with the CIDB data above.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=92579&sha=cc53b300b1fbbfbf7cf87655c1e5d4789e8516cc&name_0=PR&name_1=Stress+test+%28arm_debug%29

@alexey-milovidov

Copy link
Copy Markdown
Member

Updated the branch with master (clean merge, net diff still just the two files: FileCacheUtils.h + the new gtest_file_cache_utils.cpp).

All four review threads are resolved, and #101239 (the arm_msan SimSIMD fix Alexey referenced) was merged on 2026-04-07 and is already included.

Built unit_tests_dbms on the merged branch — the whole FileCache tree (including the roundUpToMultiple callers in FileCache.cpp/FileSegment.cpp) compiles, and the new gtest passes all 4 cases (representable SIZE_MAX edge returns SIZE_MAX; true overflow throws).

The four CI failures are all unrelated chronic master flakes, none touching FileCacheUtils or cache range math:

  • Server died — 1979 failures across 752 PRs in 30d.
  • Assertion 'px != 0' failed (STID: 3631-2e69) in BackgroundJobsAssignee::threadFunc (pure-virtual-call abort) — also seen on unrelated PRs #100706, #105375, #105157.
  • test_kafka_consumer_failover — Kafka integration flake.
  • Hung check — already shown unrelated by @groeneai (stuck on cache_guard.writeLock(), never reached from roundUpToMultiple).

@clickhouse-gh

clickhouse-gh Bot commented Jun 3, 2026

Copy link
Copy Markdown
Contributor

📊 Cloud Performance Report

✅ AI verdict: no_change — no significant changes across 39 queries analysed

This PR only changes a file-cache size-rounding overflow check (and tightens thread query-id handling); for normal inputs the rounding result is identical and neither change sits on the query execution path that any of the flagged queries run. The two ClickBench improvements (Q4 -14%, Q15 -13%) and two TPC-H regressions (Q2 +15%, Q4 +19%) all land on code these queries never execute, so they read as run-to-run variance rather than real effects of this change — the TPC-H queries were also noisy. All four flagged verdicts are downgraded to not-sure; no actionable performance change is attributable to this PR.

clickbench

⚠️ 2 inconclusive

Flagged queries (2 of 43)
Query Verdict Baseline med (ms) PR med (ms) Change q-value Hint
⚠️ 4 not_sure 265 227 -14.3% <0.0001 File-cache size-rounding overflow fix is off this SELECT's execution path; -14% is run-to-run variance.
⚠️ 15 not_sure 240 210 -12.7% <0.0001 Same off-path rounding/query-id change; this scan-filter SELECT is untouched, -12.7% is noise.

q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.

tpch_adapted_1_official

⚠️ 2 inconclusive

Flagged queries (2 of 22)
Query Verdict Baseline med (ms) PR med (ms) Change q-value Hint
⚠️ 2 not_sure 73 84 +15.1% <0.0001 Overflow-safe rounding fix is off this query's path; +15% is run-to-run variance, not a PR effect.
⚠️ 4 not_sure 2868 3418 +19.2% 0.0001 Off-path correctness fix can't add 550ms to this query; measurements were noisy, reads as variance.

q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.

Debug info
  • StressHouse run: 715284e1-d27d-4292-9a47-be8ef6ef1327
  • MIRAI run: b285d114-a2c7-4b0c-9f91-0d1cef9d9193
  • PR check IDs:
    • clickbench_364482_1780559057
    • clickbench_364503_1780559057
    • tpch_adapted_1_official_364508_1780559057
    • tpch_adapted_1_official_364511_1780559057
    • tpch_adapted_1_official_364521_1780559057

@alexey-milovidov alexey-milovidov left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't accept performance degradations.

@alexey-milovidov

Copy link
Copy Markdown
Member

Edited manually on GitHub - if it has mistakes, they will be corrected on the next commit.

Comment thread src/Interpreters/FileCache/FileCacheUtils.h Outdated
The previous commits left `FileCacheUtils::roundUpToMultiple` in a
non-compiling and incorrect state:

- `<common/arithmeticOverflow.h>` is not a valid include path; the header
  is `<base/arithmeticOverflow.h>`.
- `common::addOverflow` takes the result as a third out-parameter
  (`addOverflow(x, y, res)`), but it was called with two arguments, which
  failed to compile in the Fast test build.
- The body called `roundDownToMultiple(num, multiple)`, which rounds
  down, dropping the rounding-up entirely.

Implement the remainder-based rounding that was requested in review: add
`multiple - (num % multiple)` and signal overflow only when the actual
rounded-up value does not fit in `size_t`. This is exact at the extreme
top of the `size_t` range (the `num + multiple - 1` trick would falsely
overflow there) and does a single division plus an add rather than a
division and a multiply.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 4, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.40% 84.40% +0.00%
Functions 92.40% 92.40% +0.00%
Branches 77.00% 77.00% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 33/33 (100.00%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jun 8, 2026
Merged via the queue into master with commit 5abd1c9 Jun 8, 2026
166 checks passed
@alexey-milovidov alexey-milovidov deleted the ncb/fix-cache-boundary-algn branch June 8, 2026 02:02
@robot-clickhouse robot-clickhouse added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 8, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-improvement Pull request with some product improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants