Fix data race on MemoryReservation release at query finish by serxa · Pull Request #108391 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix data race on MemoryReservation release at query finish#108391

Merged
serxa merged 4 commits into
masterfrom
fix-memory-reservation-release-race
Jun 25, 2026
Merged

Fix data race on MemoryReservation release at query finish#108391
serxa merged 4 commits into
masterfrom
fix-memory-reservation-release-race

Conversation

@serxa

@serxa serxa commented Jun 24, 2026

Copy link
Copy Markdown
Member

Fixes a ThreadSanitizer data race on master, found by the Stateless tests (amd_tsan) job.

BlockIO::onFinish released the query's workload resources — which resets the MemoryReservationbefore finalizing the pipeline. Pipeline executor threads hold raw pointers to that MemoryReservation (WorkloadResources in PipelineExecutor reads QueryStatus::getMemoryReservation() at thread startup and uses it for syncWithMemoryTracker), so resetting it while the pipeline is still being finalized races with — and can outlive — those threads.

TSan trace: unique_ptr<MemoryReservation>::reset (via releaseWorkloadResources from BlockIO::onFinish) vs getMemoryReservation (in a PipelineExecutor::spawnThreads worker).

onException / onCancelOrConnectionLoss already do the right thing (stop the pipeline, then release); onFinish and BlockIO::reset did the opposite.

The query slot, in contrast, is intentionally released early: until it is released the query keeps occupying a concurrency slot even though the client already considers the query finished, which would needlessly block the next query. Pipeline threads do not touch the query slot, so this is safe. Only the memory reservation needs to wait.

So the release is split:

  • QueryStatus/BlockIO/Context gain releaseQuerySlot(), and QueryStatus/BlockIO also gain releaseMemoryReservation(). releaseWorkloadResources() still releases both and is used by the error/cancel paths (pipeline already stopped).
  • BlockIO::onFinish: release the query slot early, finalize the pipeline (which joins the executor threads — PipelineExecutor::pool is wait()-ed and joined on destruction), then release the memory reservation.
  • BlockIO::reset: reset the pipeline before releasing resources.
  • executeQuery: release only the query slot early; the reservation is released later by streams.onFinish().
  • Removed the now-unused Context::releaseWorkloadResources.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=cd3c529cd3a8002caa1a4ba585a39da636836784&name_0=MasterCI&name_1=Stateless%20tests%20%28amd_tsan%2C%20parallel%2C%202%2F2%29

Changelog category (leave one):

  • Not for changelog (changelog entry is not required)

Fixes: #108393

Version info

  • Merged into: 26.7.1.89

`BlockIO::onFinish` released the memory reservation before finalizing the
pipeline, racing with executor threads that hold raw pointers to it. Release
the query slot early (as before, for slot reuse) but the memory reservation
only after the pipeline is finalized, matching `onException`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 24, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-not-for-changelog This PR should not be mentioned in the changelog label Jun 24, 2026
Comment thread src/Interpreters/executeQuery.cpp
@serxa

serxa commented Jun 24, 2026

Copy link
Copy Markdown
Member Author

The race was introduced together with the MemoryReservation / reserve_memory feature in #82414.

@azat azat self-assigned this Jun 24, 2026
Comment on lines -28 to +32
releaseWorkloadResources();
/// Reset the pipeline before releasing workload resources: pipeline threads hold raw pointers
/// to `MemoryReservation` (see `WorkloadResources` in `PipelineExecutor`), so the reservation
/// must outlive them.
resetPipeline(/*cancel=*/false);
releaseWorkloadResources();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this is the fix? (plus releasing query slot separatelly earlier)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exactly. I blindly followed the query slot logic in the original PR, but memory reservations are actually a bit different

@serxa serxa Jun 24, 2026

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably can run sync with MemoryTracker here as well when we release the slot to make sure memory consumption is reduced to as little as possible. But I fear we will introduce unnecessary complexity. Not worth it IMO

@serxa

serxa commented Jun 25, 2026

Copy link
Copy Markdown
Member Author

The failed test is indeed related. Working on it. It's pretty obvious that we moved the memory reservation release to a later point, and we now can't rely on the MemoryReservationDecreases profile event to be triggered after the query stops because the memory reservation now exists longer.

E   AssertionError: Profile event MemoryReservationDecreases check failed for query test_production, got 0

The previous commit released the `MemoryReservation` only at the very end of
`BlockIO::onFinish`, after `finalize_query_pipeline` (and therefore after
`CurrentThread::finalizePerformanceCounters`) had already run. The
`MemoryReservationDecreases` profile event emitted by the reservation's
destructor on the query thread then landed after the `query_log` snapshot and
was lost, so `test_scheduler_memory/test.py::test_reserve_memory` failed with
`MemoryReservationDecreases ... got 0`.

Release the memory reservation inside `finalizeQueryPipelineBeforeLogging`,
between `query_pipeline.reset()` and `finalizePerformanceCounters`. This is the
only point that satisfies every constraint at once: the pipeline (and its
threads) have already been torn down so there is no data race and the query's
real memory is already freed (no under-reporting to the memory scheduler), while
the event is still emitted before the `query_log` snapshot so it stays
attributed to the query. The reservation also stays alive through
`query_finish_callback`, which runs before `streams.onFinish`.

`BlockIO::onFinish` keeps releasing the memory reservation only on the
non-logging path; the logging path releases it inside the finalize step.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
/// `MemoryReservationDecreases` profile event is recorded in `query_log`.
context->releaseMemoryReservation();

/// Update performance counters before logging to query_log

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only releases the reservation attached to the outer Context, but BlockIO can also own process-list entries inherited from an inner executeQuery result. For example, rewritten queries such as SHOW TABLES return the internal query's BlockIO, then the outer executeQueryImpl appends its own process_list_entry; the pipeline being reset here belongs to the inner query, while this call releases only the outer query's reservation. Since the logging branch of BlockIO::onFinish no longer calls BlockIO::releaseMemoryReservation, the inner reservation survives until BlockIO is destroyed or cleared, after CurrentThread::finalizePerformanceCounters and after the query_log snapshot. That loses the promised MemoryReservationDecreases attribution for those queries and keeps MemoryReservationApproved charged through finish callbacks/final protocol send. Please release every entry in BlockIO::process_list_entries at this point, e.g. by having finalize_query_pipeline take a release callback and invoking BlockIO::releaseMemoryReservation after query_pipeline.reset but before CurrentThread::finalizePerformanceCounters.

serxa and others added 2 commits June 25, 2026 17:16
…memory

The reservation is released at query teardown (BlockIO::onFinish), after the query's ProfileEvents are snapshotted into query_log, so the decrease is not reliably attributed to the query. Asserting it per-query is race-prone; the test now checks only MemoryReservationIncreases/Killed/Failed.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.40% +0.00%
Functions 92.60% 92.60% +0.00%
Branches 77.60% 77.60% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 24/24 (100.00%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 2 line(s) · Uncovered code

Full report · Diff report

@serxa

serxa commented Jun 25, 2026

Copy link
Copy Markdown
Member Author

@groeneai, investigate the failure: https://github.com/ClickHouse/ClickHouse/actions/runs/28187956117/job/83517213709 and provide a fix in a separate PR. If the fix is already in progress, link it here.

Integration tests (amd_asan_ubsan, db disk, old analyzer, 2/6)test_quorum_inserts_parallel/test.py::test_parallel_quorum_actually_quorum fails with helpers.client.QueryTimeoutExceedException: Client timed out! on a plain INSERT INTO q VALUES(3, 'Hi'). There is no server crash or logical error, and it is unrelated to this PR: this PR only changes the MemoryReservation release point at query finish (Context/ProcessList/BlockIO), and that path is dormant unless a memory RESOURCE + query-level reserve_memory are configured, which this test does not do. Looks like an infra/timing flake; I couldn't find an existing tracking issue.

For completeness, the other red — AST fuzzer (amd_debug): Block structure mismatch in UnionStep stream (STID 0993-27f0) — was already auto-marked flaky and matched to #108142, so no action is needed there.

@groeneai

Copy link
Copy Markdown
Contributor

@serxa serxa added this pull request to the merge queue Jun 25, 2026
Merged via the queue into master with commit 474a3c1 Jun 25, 2026
164 of 167 checks passed
@serxa serxa deleted the fix-memory-reservation-release-race branch June 25, 2026 23:49
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 26, 2026
@robot-clickhouse-ci-2 robot-clickhouse-ci-2 added the pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR label Jun 26, 2026
JTCunning added a commit to JTCunning/ClickHouse that referenced this pull request Jun 26, 2026
…ouse#99475)

Master (ClickHouse#108391, "Fix data race on MemoryReservation release at query
finish") split BlockIO::releaseWorkloadResources() into releaseQuerySlot()
(safe while the pipeline runs) and releaseMemoryReservation() (safe only
after the pipeline has been finalized), because pipeline threads hold raw
pointers to the MemoryReservation. BlockIO::onFinish() and executeQuery()
now release the query slot early and the memory reservation later, inside
onFinish, after the pipeline is finalized.

The Prometheus query_log fix released full workload resources (including the
memory reservation) before query_finish_callback / io.onFinish, which after
that change would re-introduce the same data race in the Prometheus path.
Switch the early release to io.releaseQuerySlot() to match the updated
executeQuery: the slot is still freed before the slow HTTP final flush, while
the memory reservation is released by io.onFinish() once the pipeline has
been finalized.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-must-backport-synced The `*-must-backport` labels are synced into the cloud Sync PR pr-not-for-changelog This PR should not be mentioned in the changelog pr-synced-to-cloud The PR is synced to the cloud repo v26.6-must-backport

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ThreadSanitizer: data race in QueryStatus memory reservation (releaseWorkloadResources vs getMemoryReservation) (STID: 4071-3348)

6 participants