Add PartialAggregateCache for part-level aggregate caching. by zex-hyd · Pull Request #93757 · ClickHouse/ClickHouse · GitHub
Skip to content

Add PartialAggregateCache for part-level aggregate caching.#93757

Open
zex-hyd wants to merge 20 commits into
ClickHouse:masterfrom
zex-hyd:feature/partial-query-cache
Open

Add PartialAggregateCache for part-level aggregate caching.#93757
zex-hyd wants to merge 20 commits into
ClickHouse:masterfrom
zex-hyd:feature/partial-query-cache

Conversation

@zex-hyd

@zex-hyd zex-hyd commented Jan 9, 2026

Copy link
Copy Markdown

Add experimental part-level partial aggregate cache for GROUP BY on MergeTree tables, with planning-stage and execution-stage integration. Motivation: avoid re-reading unchanged parts by reusing mergeable intermediate aggregate states keyed by query semantics and part identity.

Related issue: #93756

Changelog category (leave one):

  • Experimental Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add experimental setting use_partial_aggregate_cache for MergeTree GROUP BY queries to cache per-part intermediate aggregate states and reuse them either at planning stage in ReadFromMergeTree or at execution stage in AggregatingTransform; include a correctness fix so planning-stage hit chunks carrying PartialAggregatePlanHitInfo are preserved through pre-aggregation FilterTransform and are not dropped before aggregation.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

How it works

Two integration points:

  • Planning stage (ReadFromMergeTree): probe PartialAggregateCache by part before readers are built. On hit, emit PartialAggregatePlanHitChunkSource and exclude the part from the read pipeline.
  • Execution stage (AggregatingTransform): on miss, buffer rows per part and put mergeable states into cache when complete; on hit, merge cached states via mergeOnBlock and skip subsequent chunks from that part.

Cache stores mergeable intermediate states, not finalized results.

Cache key:
(query_hash, table_uuid, part_name, part_mutation_version)

Invalidation is key-based: merged/mutated parts naturally produce new keys.

Current limits / behavior

  • Planning-stage probing is enabled for legacy interpreter flow; analyzer-side planning support is follow-up.
  • GROUPING SETS use execution-stage caching only.
  • Not used with in-order aggregation (AggregatingInOrderTransform).
  • With execution-stage cache enabled, aggregation pipeline is resized to one stream for per-part ownership; partial_aggregate_cache_allow_parallel_aggregation_streams=1 keeps parallelism and disables only execution-stage cache (planning-stage hits still apply).

Tests

  • 03800_partial_aggregate_cache.sql
  • 04064_partial_aggregate_cache_analyzer.sql
  • 04065_partial_aggregate_cache_grouping_sets.sql
  • Existing CI integration failure 04131_prometheus_query_parser is reproducible without random settings and appears unrelated to this change set.

Note

High Risk
Touches core query planning/execution paths (ReadFromMergeTree, AggregatingStep, AggregatingTransform) and adds new caching behavior and invalidation logic, which risks subtle correctness/performance regressions despite being gated behind experimental settings.

Overview
Adds an experimental partial aggregate cache for repeated GROUP BY queries on MergeTree, caching per-part intermediate aggregate states and reusing them across semantically identical queries.

Integrates the cache at both planning time (probing in ReadFromMergeTree to skip reading parts on cache hits via PartialAggregatePlanHitChunkSource) and execution time (buffering per-part misses in AggregatingTransform, then merging and put-ing completed intermediate states). Cache keys incorporate query semantics (incl. current_database, apply_deleted_mask, and guards for subqueries/row policies/additional filters) plus part identity (table_uuid, part_name, part_mutation_version), with hit/miss metrics/events added.

Introduces new settings (use_partial_aggregate_cache, partial_aggregate_cache_allow_parallel_aggregation_streams), server config knobs/defaults for cache sizing, SYSTEM CLEAR|DROP AGGREGATE CACHE support with a new privilege, and documentation/tests (including analyzer and grouping-sets coverage); also tightens function-factory query logging and makes query-cache nondeterminism checks robust to gated/deprecated functions.

Reviewed by Cursor Bugbot for commit 4f92f09. Bugbot is set up for automated code reviews on this repo. Configure here.

@zex-hyd zex-hyd force-pushed the feature/partial-query-cache branch from ffeb3a4 to 8a4aebe Compare January 9, 2026 00:53
@CLAassistant

CLAassistant commented Jan 9, 2026

Copy link
Copy Markdown

@zex-hyd

zex-hyd commented Jan 9, 2026

Copy link
Copy Markdown
Author

Hi @alexey-milovidov @rschu1ze,

This PR is related to my feature request #93756 for partial aggregate caching (from Intern Tasks 2025/2026).

I've started with a skeleton implementation to show the proposed cache structure. Would appreciate feedback on the approach! Happy to adjust based on your suggestions.

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label Jan 9, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jan 9, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [4f92f09]

Summary:

job_name test_name status info comment
Fast test FAIL
03710_array_join_in_map_bug FAIL cidb
Docs check DROPPED
Fast test (arm_darwin) DROPPED
Build (amd_debug) DROPPED
Build (amd_asan_ubsan) DROPPED
Build (amd_tsan) DROPPED
Build (amd_msan) DROPPED
Build (amd_binary) DROPPED
Build (arm_debug) DROPPED
Build (arm_asan_ubsan) DROPPED

AI Review

Summary

This PR introduces an experimental PartialAggregateCache path through planning and execution for MergeTree GROUP BY queries. The current implementation still has correctness-critical gaps in cache keying and plan-hit transport, plus cache-size accounting that can violate configured limits. Because these issues can produce incorrect query results or memory limit violations, this cannot be approved yet.

Findings

❌ Blockers

  • [src/Processors/QueryPlan/ReadFromMergeTree.cpp:3721] Plan-time cache hits are emitted as zero-row chunks with PartialAggregatePlanHitInfo, but pre-aggregation JOIN paths can drop these chunks before AggregatingTransform. The part is removed from read (miss_parts path), while its cached contribution is never merged, violating the "each selected part contributes exactly once" contract and returning incorrect aggregates.
    Suggested fix: disable plan-time probing when pre-aggregation JOIN exists (or guarantee PartialAggregatePlanHitInfo propagation through all pre-aggregation transforms, not only FilterTransform).

  • [src/Interpreters/Cache/PartialAggregateCache.cpp:25] estimateBlockWeightBytes uses block.allocatedBytes/block.bytes, which does not fully account for aggregate-state arena memory referenced by ColumnAggregateFunction. PartialAggregateCache entry weights are therefore underestimated, so LRU capacity is not a real upper bound and memory usage can exceed configured cache limits.
    Suggested fix: include aggregate-state arena allocations in entry weight (or switch to a conservative serialized/deep-size accounting that upper-bounds resident memory for cached aggregate states).

  • [src/Interpreters/Aggregator.cpp:4051] [dismissed by author -- https://github.com/ClickHouse/ClickHouse/pull/93757#discussion_r3262957509] partialAggregateCacheSemanticKey still derives from calculateCacheKey, which does not include select.with. Queries that differ only in WITH alias definitions can share a key while producing different row sets (e.g. different alias values used in WHERE), so incompatible per-part states can be reused.
    Suggested fix: include WITH in the semantic key or conservatively disable use_partial_aggregate_cache when WITH is present.

Tests
  • ⚠️ Add a stateless regression that exercises plan-time cache hits with pre-aggregation JOIN and verifies hit parts are still counted correctly.
  • ⚠️ Add a stateless regression for WITH alias-sensitive predicates under use_partial_aggregate_cache, proving key separation (or explicit cache disable).
Final Verdict

Status: ❌ Block
Minimum required actions:

  1. Prevent plan-time hit loss through pre-aggregation JOIN paths.
  2. Fix PartialAggregateCache weight accounting so configured size limits are enforceable.
  3. Address WITH-sensitive key collisions (or prove safety with a focused regression and explicit gating).

@clickhouse-gh clickhouse-gh Bot added the pr-experimental Experimental Feature label Jan 9, 2026
@zex-hyd zex-hyd force-pushed the feature/partial-query-cache branch from 7bce5fe to 9fa7ee0 Compare January 11, 2026 23:38
@zex-hyd zex-hyd changed the title Add PartialAggregateCache skeleton for part-level aggregate caching. Add PartialAggregateCache for part-level aggregate caching. Jan 12, 2026
@zex-hyd zex-hyd force-pushed the feature/partial-query-cache branch from 9fa7ee0 to 9a03c03 Compare January 13, 2026 21:08
@zex-hyd

zex-hyd commented Jan 14, 2026

Copy link
Copy Markdown
Author

Hi ClickHouse Team (@alexey-milovidov @rschu1ze @Avogar @clickhouse-admin), PR is ready for review.

The failing checks are unrelated to this PR. All relevant checks pass: builds (amd/arm), style, stateless tests (including 03800_partial_aggregate_cache), and unit tests.

Failures are in integration test infrastructure, fuzzer edge cases, and unrelated features. Stack traces show nothing related to aggregation or caching. The partial aggregate cache test passes.

PR is ready for review. Thanks in advance for any comments, suggestions, or approval.

@zex-hyd zex-hyd force-pushed the feature/partial-query-cache branch from 9a03c03 to 66170ba Compare January 18, 2026 22:56
Comment thread tests/queries/0_stateless/03800_partial_aggregate_cache.sql Outdated
Comment thread tests/queries/0_stateless/03800_partial_aggregate_cache.sql Outdated
/// produce the same partial aggregates on every execution, so we cache and reuse them.
///
/// Enable with: SET use_partial_aggregate_cache = 1
class PartialAggregateCache

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The new cache will live side-by-side to the regular query cache which isn't really great. I would rather like to see the normal query cache develop into an intermediate result cache.

Would you please check if the existing query result cache can be extended into this direction? Perhaps we can have just another key/entry type.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your review! I spent some time looking into this.

So bascially the QueryResultCache and PartialAggregateCache both use CacheBase, but the key/entry and invalidation story is quite different: the query cache is AST + user/roles with TTL, partial aggregates are semantic hash + part + mutation version, and they hook into the pipeline in different places. Folding that into QueryResultCache as “just another entry kind” would need a proper design pass, not a small tweak.

I suggest kept them split here to keep the PR focused. And possible in the future open a tracking issue for a unified intermediate-result cache if that direction still makes sense.

Comment thread programs/local/LocalServer.cpp Outdated
Comment thread programs/server/Server.cpp Outdated
Comment thread src/Processors/QueryPlan/AggregatingStep.cpp Outdated
Comment thread src/Processors/QueryPlan/AggregatingStep.cpp Outdated
Comment thread src/Processors/QueryPlan/AggregatingStep.cpp Outdated
Comment thread src/Processors/QueryPlan/AggregatingStep.cpp Outdated
Comment thread src/Processors/Transforms/PartialAggregateCachingTransform.cpp Outdated
@zex-hyd

zex-hyd commented Jan 22, 2026

Copy link
Copy Markdown
Author

@rschu1ze Thank you so much for the code review! This really helps a lot! I will be working on them based on the feedback in the next few days and continuously making contribution to ClickHouse!

@zex-hyd zex-hyd force-pushed the feature/partial-query-cache branch 3 times, most recently from 005c8f2 to d17e874 Compare February 2, 2026 07:19
@@ -264,6 +265,16 @@ MergeTreeSelectProcessor::readCurrentTask(MergeTreeReadTask & current_task, IMer
res.read_mark_ranges));
}

if (reader_settings.use_partial_aggregate_cache)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can know the set of parts to read on the planning stage. This logic is both weird and inefficient, because we shouldn't spend time reading this data at all.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

plan-time decisions now happen in ReadFromMergeTree using the known part set, and MergeTreeSelectProcessor::buildPartialAggregateInfoFromCurrentTask only attaches part identity to chunks for the execution-time path.

@nickitat nickitat left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please outline the entire design, and let's first make sure it will work. Then we can talk about the implementation.

@zex-hyd

zex-hyd commented Feb 12, 2026

Copy link
Copy Markdown
Author

Please outline the entire design, and let's first make sure it will work. Then we can talk about the implementation.

Thank you for reviewing this! I am currently working on it and will upload the design soon!

@nickitat nickitat self-assigned this Mar 26, 2026
Comment thread src/Processors/Transforms/AggregatingTransform.cpp Outdated
Comment thread src/Processors/Transforms/AggregatingTransform.cpp Outdated
Comment thread src/Processors/QueryPlan/AggregatingStep.cpp Outdated
Comment thread src/Processors/Transforms/AggregatingTransform.cpp Outdated
Comment thread src/Interpreters/Context.cpp Outdated
Comment thread src/Processors/QueryPlan/AggregatingStep.cpp Outdated
Comment thread src/Processors/Transforms/AggregatingTransform.cpp
Comment thread src/Processors/QueryPlan/AggregatingStep.cpp
@zex-hyd

zex-hyd commented May 14, 2026

Copy link
Copy Markdown
Author

Please outline the entire design, and let's first make sure it will work. Then we can talk about the implementation.

@nickitat @rschu1ze @alexey-milovidov
Sorry for the delay on the design outline you asked for before.

I agree that making the reuse decision in MergeTreeSelectProcessor was the wrong level: at that point we may already have built readers or started reading data that could have been skipped. The current design moves the reuse decision to ReadFromMergeTree, where the selected part set is already known.

Goal

The original goal for this PR is to reuse per-part intermediate GROUP BY states for eligible MergeTree queries. For the same semantic query key, immutable parts produce the same intermediate aggregate states, so cached states can be reused instead of re-scanning those parts.

Design

There are two places involved in my design:

  • ReadFromMergeTree: probes the cache per selected part before readers are created. If a part hits, it is removed from the normal read pipeline and represented by a zero-row chunk carrying PartialAggregatePlanHitInfo. If it misses, it stays in the normal read path and is marked to skip the execution-time cache lookup.

  • AggregatingTransform: consumes plan-time hits and populates cache entries for misses. For plan-time hits, it merges the cached state via mergeOnBlock. For misses, it accumulates rows in per-part local buffers and writes complete per-part states to cache at end-of-input.

MergeTreeSelectProcessor no longer decides whether a part should be served from cache. It attaches part identity to chunks on the normal read path (buildPartialAggregateInfoFromCurrentTask); MergeTreeSource handles identity attachment for the in-order / fixed-part scenario. Both serve the miss path so AggregatingTransform can populate the cache.

Correctness rules

  • Each selected part contributes exactly once: either through a plan-time cache hit or through raw-row aggregation for a miss.
  • Plan-time hit chunks are zero-row metadata chunks and must not be dropped by pre-aggregation transforms. FilterTransform now passes chunks carrying PartialAggregatePlanHitInfo instead of treating them as empty data chunks.
  • Plan-time misses set skip_execution_time_cache_lookup, so we do not do a second cache get in AggregatingTransform.
  • A part is inserted into the cache only after its complete input was seen. Parts that are early-flushed due to memory pressure or that belong to a cancelled query are excluded from put.

Cache key

(query_hash, table_uuid, part_name, part_mutation_version)

query_hash is computed by computePartialAggregateCacheQueryHash and covers GROUP BY keys, aggregate functions, overflow behavior, apply_deleted_mask, and group_by_use_nulls. If the query has row-level filters or additional table filters that are not safely represented by the key, the cache is disabled fail-close.

Invalidation is key-based: merges and mutations produce new part names or mutation versions, so stale entries are not reused for changed data.

Parallelism

When execution-time cache population is enabled, the aggregation pipeline is resized to one stream so one transform observes the complete input for each part before writing it to cache.

If partial_aggregate_cache_allow_parallel_aggregation_streams = 1, we preserve parallelism but disable execution-time cache population. Plan-time hits still work because they do not depend on per-stream part ownership.

Disabled cases

Currently disabled for in-order aggregation, grouping sets at plan-time, max_rows_to_group_by != 0 with non-throw overflow, row-level filters, additional table filters, and parallel execution-time cache population. Plan-time probing also has additional pipeline-shape guards in ReadFromMergeTree (e.g. FINAL, sampling, index build context, parallel replicas).

Tests

The tests cover the legacy path, analyzer path, plan-time hits, execution-time population, grouping sets disabling, parallel-stream behavior, early-flush / skip-cache-put for incomplete states, and the FilterTransform regression for zero-row plan-hit chunks.

Relevant tests: 03800, 04064, 04065, 04078, 04079, 04080, 04081, 04082, 04083, 04084.

We would really appreciate a check on whether this design matches your expectations!

@zex-hyd zex-hyd marked this pull request as ready for review May 14, 2026 19:01
Comment thread src/Interpreters/Aggregator.cpp
Comment thread src/Interpreters/Aggregator.cpp
@clickhouse-gh

clickhouse-gh Bot commented May 19, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.20% 84.10% -0.10%
Functions 91.40% 91.40% +0.00%
Branches 76.60% 76.50% -0.10%

Changed lines: 92.90% (746/803) | lost baseline coverage: 22 line(s) · Uncovered code

Full report · Diff report

Comment thread src/Interpreters/InterpreterSelectQuery.cpp Outdated
Comment thread src/Interpreters/Aggregator.cpp
Comment thread src/Interpreters/InterpreterSelectQuery.cpp
logging_context = context;
/// Skip internal functions (e.g. `_CAST` from the analyzer); user query lists `CAST` and logging both is noisy.
if (logging_context && !logging_context->isGlobalContext() && logging_context->getSettingsRef()[Setting::log_queries]
&& !name.starts_with('_'))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FunctionFactory logging skips all underscore-prefixed functions

Low Severity

The new filter !name.starts_with('_') suppresses addQueryFactoriesInfo logging for ALL functions whose resolved name starts with underscore, not just the internal _CAST. This changes existing query log behavior — previously all resolved functions were logged. Any user-defined or built-in function beginning with _ will silently disappear from query_log.used_functions, affecting observability and audit trails.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 5e94dd5. Configure here.

Comment thread src/Functions/FunctionFactory.cpp

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.

There are 3 total unresolved issues (including 1 from previous review).

Fix All in Cursor

Reviewed by Cursor Bugbot for commit 4f92f09. Configure here.

key.part_mutation_version = part_info->part_mutation_version;

if (partial_aggregate_cache_parts_served.contains(key))
return;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Execution-time hit silently drops chunk data without tracking

Low Severity

When partial_aggregate_cache_parts_served.contains(key) is true (duplicate chunk from an already-served part), consume returns immediately without updating src_rows, src_bytes, or rows_before_aggregation. This makes the aggregation progress/logging counters inaccurate, under-reporting the volume of data processed. The same issue applies to the plan-hit deduplication path.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 4f92f09. Configure here.

}

partial_aggregate_miss_buffers.clear();
partial_aggregate_miss_buffered_input_bytes = 0;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early flush puts incomplete per-part states into global variants separately

Medium Severity

tryFlushPartialAggregateMissBuffersIfNeeded calls flushPartialAggregateMissBuffers(false) which clears partial_aggregate_miss_buffers. Subsequent chunks from the same part create a new buffer entry. While the skip-cache-put guard prevents caching incomplete states, this means a single part's data is split across multiple mergeOnBlock calls with separately-aggregated intermediate states. For non-commutative aggregation edge cases or if mergeOnBlock triggers is_consume_finished on the first partial merge, the remaining chunks for that part still arrive and create a new local aggregator, potentially producing subtly wrong final results when no_more_keys becomes true mid-part.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 4f92f09. Configure here.

const auto & settings = context->getSettingsRef();
const bool partial_cache_compatible_with_group_by_limits
= settings[Setting::max_rows_to_group_by] == 0 || settings[Setting::group_by_overflow_mode] == OverflowMode::THROW;
const bool can_plan_partial_aggregate_cache_hits = build_pipeline_settings.partial_aggregate_cache_query_hash.has_value()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ Plan-time partial-cache hits are unsafe when pre-aggregation JOIN is present.

ReadFromMergeTree can remove hit parts and replace them with zero-row PartialAggregatePlanHitInfo chunks, but JoiningTransform::transform drops non-virtual zero-row chunks (it sets output_chunk only when res.rows() > 0, and only preserves ChunkInfos on the has_virtual_row branch).

So for queries like SELECT ... FROM t JOIN d ... GROUP BY ..., a hit part can be excluded from reads while its cached state metadata is lost before AggregatingTransform, which drops that part's contribution from the result.

Can we gate plan-time probing to query shapes without pre-aggregation joins (or otherwise guarantee PartialAggregatePlanHitInfo survives all pre-aggregation transforms, not only FilterTransform)?

@clickhouse-gh

clickhouse-gh Bot commented Jun 1, 2026

Copy link
Copy Markdown
Contributor

Dear @nickitat, you haven't been active on this PR for 30 days. You will be unassigned. Will you continue working on it? If so, please feel free to reassign yourself.

@nickitat nickitat self-assigned this Jun 1, 2026
@zex-hyd

zex-hyd commented Jun 4, 2026

Copy link
Copy Markdown
Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors manual approve Manual approve required to run CI pr-experimental Experimental Feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants