Make filesystem() honor the memory limit when loading file content#104956
Conversation
The content-reading loop in `StorageFilesystem` staged each file's bytes through a temporary `std::string` before moving them into the output column. Allocations made by `std::string` go through the global `operator new`, which is intercepted via `CurrentMemoryTracker::allocNoThrow`: the amount is counted but the hard limit is *never* enforced. With several `filesystem()` streams reading large files in parallel, the per-thread buffers could grow well past `max_server_memory_usage` (or `max_memory_usage`) and the cgroup OOM-killer fired before any check could raise `MEMORY_LIMIT_EXCEEDED`. Read the file directly into the `ColumnString::chars` `PaddedPODArray`, which goes through `CurrentMemoryTracker::alloc` and throws on the hard limit. Push one offset and one null-map byte after each file, so the column ends up in the same state as before. The `else` branch and the existing schema (`Nullable(String)`) are unchanged. Adds 04238_filesystem_content_memory_limit which writes a 32 MiB file and asserts that selecting its content with `--max_memory_usage=1MiB` raises `MEMORY_LIMIT_EXCEEDED` rather than OOM-killing the process. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
filesystem() honor the memory limit when loading file content
The test asserted `grep -c -F MEMORY_LIMIT_EXCEEDED` returned exactly `1`,
but `shell_config.sh` passes `--send_logs_level=warning`, so a server-side
warning line that mirrors the exception code can appear alongside the
client's own error message. The Fast test report observed `2` matches:
https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104956&sha=69d278e3024ea27db74ca0afc6e43ebf1edd6758&name_0=PR&name_1=Fast%20test
The behavior we're asserting on is presence, not multiplicity, so switch
to `grep -F -q ... && echo 1 || echo 0`. The reference file is unchanged.
Original PR: #104956
The reviewer (@clickhouse-gh) pointed out that the single-file `max_memory_usage=1 MiB` scenario could still raise `MEMORY_LIMIT_EXCEEDED` on master, because the final `ColumnString::Chars` allocation went through the throwing `Allocator` path regardless of the intermediate `std::string`. So both pre-fix and post-fix code peaked at ~97 MiB and the test passed either way -- Bugfix validation correctly flagged it as not reproducing the bug. Switch to a three-file scenario sized so that: * the column's chars buffer crosses a power-of-two capacity boundary while reading the third file (128 MiB -> 256 MiB), and * with `max_threads=1` the work is pinned to a single stream so the peaks are deterministic. Empirical peaks on aarch64: * post-fix (stream directly into chars): ~385 MiB * pre-fix (stage through `std::string`): ~449 MiB A 440 MB (~419 MiB) limit lies between them: the pre-fix path raises `MEMORY_LIMIT_EXCEEDED` (now caught by `grep` and reported as `1`), while the post-fix path completes (reported as `0`, then the sanity query prints the total size `201326592`). Bugfix validation will see the test pass on this branch and fail on master, which is what the check expects. Addresses review on PR #104956
…ontent-into-column
`clickhouse_fuzzer` runs the full `clickhouse-local` binary (see
`programs/local/fuzzers/clickhouse_fuzzer.cpp`) which spins up a complete
clickhouse-server-grade runtime: JeMalloc arenas, OpenSSL initializer,
static initializers, and full query execution. Baseline RSS is close to
the libFuzzer default `rss_limit_mb=2048`, so fuzzed inputs that trigger
even moderate query-time allocations push the process past the limit.
CIDB shows 7 unrelated PRs in 7 days all hitting OOM at 2049-2056 Mb —
just barely over the 2048 Mb default. Failure mode is uniform:
libFuzzer out-of-memory (used: 2049-2056Mb; limit: 2048Mb) oom-<hash>
Bumping to 4096 Mb follows existing precedent for fuzzers with similar
workloads:
- `execute_query_fuzzer.options`: rss_limit_mb = 4096 (also runs query
execution; same author bumped this in commit bafb7d7)
- `data_type_deserialization_fuzzer.options`: rss_limit_mb = 4096
- `create_parser_fuzzer.options`: rss_limit_mb = 6144
The limit still bounds the target (it does not disable the check via
`rss_limit_mb=0`), so genuine memory regressions above 4 GB will still
be caught.
CIDB evidence (last 30 days, all 10 hits — 0 on master because libFuzzer
does not run on master regularly):
PR ClickHouse#99740 (4 hits, Apr 19-21)
PR ClickHouse#104231, ClickHouse#104849, ClickHouse#104956, ClickHouse#96844, ClickHouse#104510, ClickHouse#104492 (1 hit each,
May 13-15)
…ontent-into-column
|
@groeneai, the CI failures on this PR are all unrelated to the changes (which only touch
Master has been merged into the branch (commit ed96e0e). Could you retry the failing jobs, or investigate these as separate issues if they persist? Reports:
|
|
@alexey-milovidov confirmed — both failures are chronic and unrelated to this PR. CIDB evidence below. 1.
|
| Day | Master | PR |
|---|---|---|
| 2026-05-16 | 5 | 62 |
| 2026-05-15 | 17 | 89 |
| 2026-05-14 | 4 | 74 |
| 2026-05-13 | 7 | 69 |
| 2026-05-12 | 1 | 14 |
| 2026-05-11 | 6 | 78 |
That matches your "~67–106/day on master" attribution — chronic master pattern, not introduced by this PR. Fix workstreams in flight:
- PR Fix more non-deterministic function results found by the functions stress test #104858 (@Algunenano, open) —
stringVectorConstantFixedStringinFunctionsComparison.hcovers theisDistinctFrom/isNotDistinctFrom/if/nullIfString↔FixedString shapes - PR Fix
transformdefault-branch conversion bug (FixedString/Enum) #104804 (mine, open) — wrapper-layercastColumnfix insrc/Functions/transform.cppfor thecaseWithExpression/transformfamily
The substringUTF8 negative-offset shape is part of the umbrella; nothing actionable on this PR. StorageFilesystem.cpp is untouched by either fix path.
2. BuzzHouse (amd_debug) Segfault (STID 2779-2d69)
Confirmed chronic across unrelated PRs over 17 days. 3 distinct PRs hit in the last 30 days, zero master hits:
| PR | Hit time | Build | Scope (sanity check) |
|---|---|---|---|
| #104956 | 2026-05-15 16:47:54 | BuzzHouse (amd_debug) |
StorageFilesystem.cpp (this PR) |
| #104478 | 2026-05-10 10:52:31 | BuzzHouse (amd_msan) |
(unrelated, as you flagged) |
| #101428 | 2026-04-29 14:58:55 | BuzzHouse (arm_asan_ubsan) |
(unrelated) |
Three different scopes, three different sanitizer builds, and the crash is in RemoteQueryExecutor.cpp:538 (null IConnections in the distributed path) — nothing in any of those PRs touches the remote-query stack. We have a P0 trunk task tracking this STID separately (2026-04-29-ci-buzzhouse-ubsan-stid-2779-2d69).
Master merge (ed96e0e)
Worth noting that the master merge does not clear these — they're chronic on master itself for #104877, and STID 2779-2d69 is a rare distributed-path nullptr that can happen on any BuzzHouse run regardless of base.
Retrying the failing jobs as you suggested is the right call here; both are flaky CI failures, not regressions from this PR.
|
This was fixed by #105146. Let's update the branch. |
|
@groeneai, what is with |
|
Root cause: a Tracking issue: #104219 On this PR the failure is just the chronic surfacing; it doesn't block your |
…ontent-into-column
LLVM Coverage ReportChanged lines: 100.00% (8/8) | lost baseline coverage: 1 line(s) · Uncovered code |

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Make the
filesystemtable function honormax_memory_usage/max_server_memory_usagewhen loading file content. Previously, large or numerous parallel content reads could push past the limit without raisingMEMORY_LIMIT_EXCEEDEDand end up OOM-killed instead.Motivation
Running
clickhouse-localin a cgroup with a small memory limit (e.g. 8 GiB) and issuingSELECT * FROM filesystem()against a directory of large files reliably ends in an OOM-kill rather than a cleanMEMORY_LIMIT_EXCEEDED. The hard limit is correctly detected at startup (getMemoryAmountreads cgroup v2memory.maxand the limit is installed ontotal_memory_tracker), but the path that loads file content does not enforce it.StorageFilesystemwas staging each file's bytes through a temporarystd::stringbefore moving them into the output column:std::stringallocations go through the globaloperator new, which is intercepted insrc/Common/AllocationInterceptors.cppviaMemory::trackMemory->CurrentMemoryTracker::allocNoThrow->allocImpl(size, throw_if_memory_exceeded=false). When the limit is exceeded on that path,MemoryTracker::allocImpltakes the no-throw branch (incrementAllocationWithoutCheck) instead of raisingMEMORY_LIMIT_EXCEEDED. The amount is counted, but no exception fires - the process keeps allocating until the kernel / cgroup OOM-killer steps in. Withnum_streamsparallelFilesystemSources each growing their own multi-GiBstd::string, an 8 GiB limit is easy to blow past.In contrast,
Allocator-backed buffers (PODArray,ColumnString::chars, ...) go throughCurrentMemoryTracker::alloc, which does throw on the limit.Change
Read file content directly into the destination
ColumnString'scharsbuffer:readStringUntilEOFIntois already specialized forPaddedPODArray<UInt8>, so the bytes land inColumnString::Chars, which usesAllocatorand tripsMEMORY_LIMIT_EXCEEDEDon the limit. The schema (Nullable(String)forcontent) is unchanged.Test plan
04238_filesystem_content_memory_limit.shwrites a 32 MiB file underuser_filesand asserts thatSELECT length(content) FROM filesystem('...')with--max_memory_usage=1MiBproducesMEMORY_LIMIT_EXCEEDED, then re-runs without the limit and checkslength(content) = 33554432.build/programs/clickhouse local:--max_memory_usage=1MiB->Code: 241. MEMORY_LIMIT_EXCEEDED(clean throw, no kill).cityHash64(content)matches the hash produced byfile(..., RawBLOB, 'content String').content IS NULLfor directories) still produce the expected output.Documentation entry for user-facing changes