Fix data race on MemoryReservation release at query finish#108391
Conversation
`BlockIO::onFinish` released the memory reservation before finalizing the pipeline, racing with executor threads that hold raw pointers to it. Release the query slot early (as before, for slot reuse) but the memory reservation only after the pipeline is finalized, matching `onException`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
The race was introduced together with the |
| releaseWorkloadResources(); | ||
| /// Reset the pipeline before releasing workload resources: pipeline threads hold raw pointers | ||
| /// to `MemoryReservation` (see `WorkloadResources` in `PipelineExecutor`), so the reservation | ||
| /// must outlive them. | ||
| resetPipeline(/*cancel=*/false); | ||
| releaseWorkloadResources(); |
There was a problem hiding this comment.
So this is the fix? (plus releasing query slot separatelly earlier)
There was a problem hiding this comment.
Exactly. I blindly followed the query slot logic in the original PR, but memory reservations are actually a bit different
There was a problem hiding this comment.
We probably can run sync with MemoryTracker here as well when we release the slot to make sure memory consumption is reduced to as little as possible. But I fear we will introduce unnecessary complexity. Not worth it IMO
|
The failed test is indeed related. Working on it. It's pretty obvious that we moved the memory reservation release to a later point, and we now can't rely on the |
The previous commit released the `MemoryReservation` only at the very end of `BlockIO::onFinish`, after `finalize_query_pipeline` (and therefore after `CurrentThread::finalizePerformanceCounters`) had already run. The `MemoryReservationDecreases` profile event emitted by the reservation's destructor on the query thread then landed after the `query_log` snapshot and was lost, so `test_scheduler_memory/test.py::test_reserve_memory` failed with `MemoryReservationDecreases ... got 0`. Release the memory reservation inside `finalizeQueryPipelineBeforeLogging`, between `query_pipeline.reset()` and `finalizePerformanceCounters`. This is the only point that satisfies every constraint at once: the pipeline (and its threads) have already been torn down so there is no data race and the query's real memory is already freed (no under-reporting to the memory scheduler), while the event is still emitted before the `query_log` snapshot so it stays attributed to the query. The reservation also stays alive through `query_finish_callback`, which runs before `streams.onFinish`. `BlockIO::onFinish` keeps releasing the memory reservation only on the non-logging path; the logging path releases it inside the finalize step. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| /// `MemoryReservationDecreases` profile event is recorded in `query_log`. | ||
| context->releaseMemoryReservation(); | ||
|
|
||
| /// Update performance counters before logging to query_log |
There was a problem hiding this comment.
This only releases the reservation attached to the outer Context, but BlockIO can also own process-list entries inherited from an inner executeQuery result. For example, rewritten queries such as SHOW TABLES return the internal query's BlockIO, then the outer executeQueryImpl appends its own process_list_entry; the pipeline being reset here belongs to the inner query, while this call releases only the outer query's reservation. Since the logging branch of BlockIO::onFinish no longer calls BlockIO::releaseMemoryReservation, the inner reservation survives until BlockIO is destroyed or cleared, after CurrentThread::finalizePerformanceCounters and after the query_log snapshot. That loses the promised MemoryReservationDecreases attribution for those queries and keeps MemoryReservationApproved charged through finish callbacks/final protocol send. Please release every entry in BlockIO::process_list_entries at this point, e.g. by having finalize_query_pipeline take a release callback and invoking BlockIO::releaseMemoryReservation after query_pipeline.reset but before CurrentThread::finalizePerformanceCounters.
…ery_log" This reverts commit 2db56a1.
…memory The reservation is released at query teardown (BlockIO::onFinish), after the query's ProfileEvents are snapshotted into query_log, so the decrease is not reliably attributed to the query. Asserting it per-query is race-prone; the test now checks only MemoryReservationIncreases/Killed/Failed. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered by tests: 24/24 (100.00%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 2 line(s) · Uncovered code |
|
@groeneai, investigate the failure: https://github.com/ClickHouse/ClickHouse/actions/runs/28187956117/job/83517213709 and provide a fix in a separate PR. If the fix is already in progress, link it here.
For completeness, the other red — |
…ouse#99475) Master (ClickHouse#108391, "Fix data race on MemoryReservation release at query finish") split BlockIO::releaseWorkloadResources() into releaseQuerySlot() (safe while the pipeline runs) and releaseMemoryReservation() (safe only after the pipeline has been finalized), because pipeline threads hold raw pointers to the MemoryReservation. BlockIO::onFinish() and executeQuery() now release the query slot early and the memory reservation later, inside onFinish, after the pipeline is finalized. The Prometheus query_log fix released full workload resources (including the memory reservation) before query_finish_callback / io.onFinish, which after that change would re-introduce the same data race in the Prometheus path. Switch the early release to io.releaseQuerySlot() to match the updated executeQuery: the slot is still freed before the slow HTTP final flush, while the memory reservation is released by io.onFinish() once the pipeline has been finalized.

Fixes a ThreadSanitizer data race on master, found by the
Stateless tests (amd_tsan)job.BlockIO::onFinishreleased the query's workload resources — which resets theMemoryReservation— before finalizing the pipeline. Pipeline executor threads hold raw pointers to thatMemoryReservation(WorkloadResourcesinPipelineExecutorreadsQueryStatus::getMemoryReservation()at thread startup and uses it forsyncWithMemoryTracker), so resetting it while the pipeline is still being finalized races with — and can outlive — those threads.TSan trace:
unique_ptr<MemoryReservation>::reset(viareleaseWorkloadResourcesfromBlockIO::onFinish) vsgetMemoryReservation(in aPipelineExecutor::spawnThreadsworker).onException/onCancelOrConnectionLossalready do the right thing (stop the pipeline, then release);onFinishandBlockIO::resetdid the opposite.The query slot, in contrast, is intentionally released early: until it is released the query keeps occupying a concurrency slot even though the client already considers the query finished, which would needlessly block the next query. Pipeline threads do not touch the query slot, so this is safe. Only the memory reservation needs to wait.
So the release is split:
QueryStatus/BlockIO/ContextgainreleaseQuerySlot(), andQueryStatus/BlockIOalso gainreleaseMemoryReservation().releaseWorkloadResources()still releases both and is used by the error/cancel paths (pipeline already stopped).BlockIO::onFinish: release the query slot early, finalize the pipeline (which joins the executor threads —PipelineExecutor::pooliswait()-ed and joined on destruction), then release the memory reservation.BlockIO::reset: reset the pipeline before releasing resources.executeQuery: release only the query slot early; the reservation is released later bystreams.onFinish().Context::releaseWorkloadResources.CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=cd3c529cd3a8002caa1a4ba585a39da636836784&name_0=MasterCI&name_1=Stateless%20tests%20%28amd_tsan%2C%20parallel%2C%202%2F2%29
Changelog category (leave one):
Fixes: #108393
Version info
26.7.1.89