Make `filesystem()` honor the memory limit when loading file content by alexey-milovidov · Pull Request #104956 · ClickHouse/ClickHouse · GitHub
Skip to content

Make filesystem() honor the memory limit when loading file content#104956

Merged
alexey-milovidov merged 7 commits into
masterfrom
stream-filesystem-content-into-column
May 19, 2026
Merged

Make filesystem() honor the memory limit when loading file content#104956
alexey-milovidov merged 7 commits into
masterfrom
stream-filesystem-content-into-column

Conversation

@alexey-milovidov

Copy link
Copy Markdown
Member

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Make the filesystem table function honor max_memory_usage / max_server_memory_usage when loading file content. Previously, large or numerous parallel content reads could push past the limit without raising MEMORY_LIMIT_EXCEEDED and end up OOM-killed instead.

Motivation

Running clickhouse-local in a cgroup with a small memory limit (e.g. 8 GiB) and issuing SELECT * FROM filesystem() against a directory of large files reliably ends in an OOM-kill rather than a clean MEMORY_LIMIT_EXCEEDED. The hard limit is correctly detected at startup (getMemoryAmount reads cgroup v2 memory.max and the limit is installed on total_memory_tracker), but the path that loads file content does not enforce it.

StorageFilesystem was staging each file's bytes through a temporary std::string before moving them into the output column:

String content;
ReadBufferFromFile in(file.path().string());
readStringUntilEOF(content, in);
columns_map[\"content\"]->insert(std::move(content));

std::string allocations go through the global operator new, which is intercepted in src/Common/AllocationInterceptors.cpp via Memory::trackMemory -> CurrentMemoryTracker::allocNoThrow -> allocImpl(size, throw_if_memory_exceeded=false). When the limit is exceeded on that path, MemoryTracker::allocImpl takes the no-throw branch (incrementAllocationWithoutCheck) instead of raising MEMORY_LIMIT_EXCEEDED. The amount is counted, but no exception fires - the process keeps allocating until the kernel / cgroup OOM-killer steps in. With num_streams parallel FilesystemSources each growing their own multi-GiB std::string, an 8 GiB limit is easy to blow past.

In contrast, Allocator-backed buffers (PODArray, ColumnString::chars, ...) go through CurrentMemoryTracker::alloc, which does throw on the limit.

Change

Read file content directly into the destination ColumnString's chars buffer:

auto & nullable = assert_cast<ColumnNullable &>(*columns_map[\"content\"]);
auto & col_str = assert_cast<ColumnString &>(nullable.getNestedColumn());

ReadBufferFromFile in(file.path().string());
readStringUntilEOFInto(col_str.getChars(), in);
col_str.getOffsets().push_back(col_str.getChars().size());
nullable.getNullMapData().push_back(UInt8{0});

readStringUntilEOFInto is already specialized for PaddedPODArray<UInt8>, so the bytes land in ColumnString::Chars, which uses Allocator and trips MEMORY_LIMIT_EXCEEDED on the limit. The schema (Nullable(String) for content) is unchanged.

Test plan

  • New test 04238_filesystem_content_memory_limit.sh writes a 32 MiB file under user_files and asserts that SELECT length(content) FROM filesystem('...') with --max_memory_usage=1MiB produces MEMORY_LIMIT_EXCEEDED, then re-runs without the limit and checks length(content) = 33554432.
  • Manual verification with build/programs/clickhouse local:
    • With --max_memory_usage=1MiB -> Code: 241. MEMORY_LIMIT_EXCEEDED (clean throw, no kill).
    • Without the limit -> reads back the right bytes; cityHash64(content) matches the hash produced by file(..., RawBLOB, 'content String').
    • Existing 04041 scenarios (5-6 byte files, content IS NULL for directories) still produce the expected output.

Documentation entry for user-facing changes

  • I have considered changes to the docs - no behavioral change for users beyond the new error class on overflow.

Information about CI checks: https://clickhouse.com/docs/en/development/continuous-integration/

The content-reading loop in `StorageFilesystem` staged each file's bytes
through a temporary `std::string` before moving them into the output
column. Allocations made by `std::string` go through the global `operator
new`, which is intercepted via `CurrentMemoryTracker::allocNoThrow`: the
amount is counted but the hard limit is *never* enforced. With several
`filesystem()` streams reading large files in parallel, the per-thread
buffers could grow well past `max_server_memory_usage` (or
`max_memory_usage`) and the cgroup OOM-killer fired before any check
could raise `MEMORY_LIMIT_EXCEEDED`.

Read the file directly into the `ColumnString::chars` `PaddedPODArray`,
which goes through `CurrentMemoryTracker::alloc` and throws on the hard
limit. Push one offset and one null-map byte after each file, so the
column ends up in the same state as before. The `else` branch and the
existing schema (`Nullable(String)`) are unchanged.

Adds 04238_filesystem_content_memory_limit which writes a 32 MiB file
and asserts that selecting its content with `--max_memory_usage=1MiB`
raises `MEMORY_LIMIT_EXCEEDED` rather than OOM-killing the process.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented May 14, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label May 14, 2026
@alexey-milovidov alexey-milovidov changed the title Make filesystem() honor the memory limit when loading file content Make filesystem() honor the memory limit when loading file content May 14, 2026
@alexey-milovidov alexey-milovidov added the memory When memory usage is higher than expected label May 14, 2026
The test asserted `grep -c -F MEMORY_LIMIT_EXCEEDED` returned exactly `1`,
but `shell_config.sh` passes `--send_logs_level=warning`, so a server-side
warning line that mirrors the exception code can appear alongside the
client's own error message. The Fast test report observed `2` matches:

    https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=104956&sha=69d278e3024ea27db74ca0afc6e43ebf1edd6758&name_0=PR&name_1=Fast%20test

The behavior we're asserting on is presence, not multiplicity, so switch
to `grep -F -q ... && echo 1 || echo 0`. The reference file is unchanged.

Original PR: #104956
Comment thread tests/queries/0_stateless/04238_filesystem_content_memory_limit.sh
The reviewer (@clickhouse-gh) pointed out that the single-file
`max_memory_usage=1 MiB` scenario could still raise `MEMORY_LIMIT_EXCEEDED`
on master, because the final `ColumnString::Chars` allocation went through
the throwing `Allocator` path regardless of the intermediate `std::string`.
So both pre-fix and post-fix code peaked at ~97 MiB and the test passed
either way -- Bugfix validation correctly flagged it as not reproducing the
bug.

Switch to a three-file scenario sized so that:

* the column's chars buffer crosses a power-of-two capacity boundary while
  reading the third file (128 MiB -> 256 MiB), and
* with `max_threads=1` the work is pinned to a single stream so the peaks
  are deterministic.

Empirical peaks on aarch64:

* post-fix (stream directly into chars): ~385 MiB
* pre-fix  (stage through `std::string`): ~449 MiB

A 440 MB (~419 MiB) limit lies between them: the pre-fix path raises
`MEMORY_LIMIT_EXCEEDED` (now caught by `grep` and reported as `1`), while
the post-fix path completes (reported as `0`, then the sanity query prints
the total size `201326592`). Bugfix validation will see the test pass on
this branch and fail on master, which is what the check expects.

Addresses review on PR #104956
BadLiveware pushed a commit to BadLiveware/ClickHouse that referenced this pull request May 16, 2026
`clickhouse_fuzzer` runs the full `clickhouse-local` binary (see
`programs/local/fuzzers/clickhouse_fuzzer.cpp`) which spins up a complete
clickhouse-server-grade runtime: JeMalloc arenas, OpenSSL initializer,
static initializers, and full query execution. Baseline RSS is close to
the libFuzzer default `rss_limit_mb=2048`, so fuzzed inputs that trigger
even moderate query-time allocations push the process past the limit.

CIDB shows 7 unrelated PRs in 7 days all hitting OOM at 2049-2056 Mb —
just barely over the 2048 Mb default. Failure mode is uniform:

    libFuzzer  out-of-memory (used: 2049-2056Mb; limit: 2048Mb)  oom-<hash>

Bumping to 4096 Mb follows existing precedent for fuzzers with similar
workloads:
  - `execute_query_fuzzer.options`: rss_limit_mb = 4096 (also runs query
    execution; same author bumped this in commit bafb7d7)
  - `data_type_deserialization_fuzzer.options`: rss_limit_mb = 4096
  - `create_parser_fuzzer.options`: rss_limit_mb = 6144

The limit still bounds the target (it does not disable the check via
`rss_limit_mb=0`), so genuine memory regressions above 4 GB will still
be caught.

CIDB evidence (last 30 days, all 10 hits — 0 on master because libFuzzer
does not run on master regularly):
    PR ClickHouse#99740  (4 hits, Apr 19-21)
    PR ClickHouse#104231, ClickHouse#104849, ClickHouse#104956, ClickHouse#96844, ClickHouse#104510, ClickHouse#104492  (1 hit each,
    May 13-15)
@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai, the CI failures on this PR are all unrelated to the changes (which only touch StorageFilesystem.cpp and one stateless test):

Master has been merged into the branch (commit ed96e0e). Could you retry the failing jobs, or investigate these as separate issues if they persist? Reports:

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov confirmed — both failures are chronic and unrelated to this PR. CIDB evidence below.

1. FunctionsStress.stress / AllTests (substringUTF8 non-determinism)

Chronic master + PR failure family tracked in #104877 (umbrella) and #105040 (specific shape). Last 6 days of master+PR hits:

Day Master PR
2026-05-16 5 62
2026-05-15 17 89
2026-05-14 4 74
2026-05-13 7 69
2026-05-12 1 14
2026-05-11 6 78

That matches your "~67–106/day on master" attribution — chronic master pattern, not introduced by this PR. Fix workstreams in flight:

The substringUTF8 negative-offset shape is part of the umbrella; nothing actionable on this PR. StorageFilesystem.cpp is untouched by either fix path.

2. BuzzHouse (amd_debug) Segfault (STID 2779-2d69)

Confirmed chronic across unrelated PRs over 17 days. 3 distinct PRs hit in the last 30 days, zero master hits:

PR Hit time Build Scope (sanity check)
#104956 2026-05-15 16:47:54 BuzzHouse (amd_debug) StorageFilesystem.cpp (this PR)
#104478 2026-05-10 10:52:31 BuzzHouse (amd_msan) (unrelated, as you flagged)
#101428 2026-04-29 14:58:55 BuzzHouse (arm_asan_ubsan) (unrelated)

Three different scopes, three different sanitizer builds, and the crash is in RemoteQueryExecutor.cpp:538 (null IConnections in the distributed path) — nothing in any of those PRs touches the remote-query stack. We have a P0 trunk task tracking this STID separately (2026-04-29-ci-buzzhouse-ubsan-stid-2779-2d69).

Master merge (ed96e0e)

Worth noting that the master merge does not clear these — they're chronic on master itself for #104877, and STID 2779-2d69 is a rare distributed-path nullptr that can happen on any BuzzHouse run regardless of base.

Retrying the failing jobs as you suggested is the right call here; both are flaky CI failures, not regressions from this PR.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

This was fixed by #105146. Let's update the branch.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai, what is with 01710_projection_additional_filters?

@groeneai

Copy link
Copy Markdown
Contributor

01710_projection_additional_filters is a chronic master flake — unrelated to this PR's filesystem() changes.

Root cause: a trivial count() optimization interaction with additional_table_filters under parallel_replicas + analyzer. With parallel replicas each replica independently runs the optimized count and the coordinator sums the partial results, so the test sees count() = 13 instead of 3 on Stateless tests (amd_llvm_coverage, ParallelReplicas, s3 storage, parallel). Surfaced after #102921 was merged 2026-05-04.

Tracking issue: #104219
Fix PR in flight: #104296 by @azat — "Fix trivial count optimization with additional_table_filters with parallel replicas and analyzer" — OPEN, awaiting review. Once that lands, this should drop off the PR CI noise.

On this PR the failure is just the chronic surfacing; it doesn't block your filesystem() memory-limit work.

@clickhouse-gh

clickhouse-gh Bot commented May 18, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.20% 84.10% -0.10%
Functions 91.40% 91.40% +0.00%
Branches 76.50% 76.60% +0.10%

Changed lines: 100.00% (8/8) | lost baseline coverage: 1 line(s) · Uncovered code

Full report · Diff report

@alexey-milovidov alexey-milovidov self-assigned this May 18, 2026

@alexey-milovidov alexey-milovidov left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good.

@alexey-milovidov alexey-milovidov added this pull request to the merge queue May 18, 2026
Merged via the queue into master with commit 6405020 May 19, 2026
166 checks passed
@alexey-milovidov alexey-milovidov deleted the stream-filesystem-content-into-column branch May 19, 2026 00:00
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label May 19, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

memory When memory usage is higher than expected pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants