Add query plan cache for complex queries (views, joins, subqueries) by alexey-milovidov · Pull Request #107125 · ClickHouse/ClickHouse · GitHub
Skip to content

Add query plan cache for complex queries (views, joins, subqueries)#107125

Open
alexey-milovidov wants to merge 28 commits into
masterfrom
query-plan-cache-complex
Open

Add query plan cache for complex queries (views, joins, subqueries)#107125
alexey-milovidov wants to merge 28 commits into
masterfrom
query-plan-cache-complex

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented Jun 11, 2026

Copy link
Copy Markdown
Member

Changelog category (leave one):

  • Experimental Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add an experimental query plan cache supporting complex SELECT queries: views (including nested views), joins, UNION, IN subqueries and multiple tables over any table engine. When allow_experimental_query_plan_cache = 1 and enable_query_plan_cache = 1, repeated identical queries skip query analysis and logical planning (the query is still parsed) while still executing against current data. SYSTEM DROP QUERY PLAN CACHE clears the cache; events QueryPlanCacheHits/QueryPlanCacheMisses/QueryPlanCacheValidationMisses/QueryPlanCacheStaleMisses and metrics QueryPlanCacheBytes/QueryPlanCacheEntries provide observability.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Description

An experimental query plan cache for SELECT queries that supports complex queries — views (including nested views), joins, UNION, IN subqueries and multiple tables over any table engine. It complements #99309, which covers single-table MergeTree queries with a universalize/materialize design; this PR takes a different approach to reach query shapes where planning cost is actually concentrated (deep view chains, many joins).

Unlike the query result cache, a plan cache hit still executes the query: only query analysis and logical planning are skipped (the query is still parsed, since the cache key is built from the parsed AST), so hits are transactionally consistent and always see current data.

Design

On a miss, the plan is built in the planner's existing build_logical_plan mode (the machinery used by distributed plan shipping), with a new SelectQueryOptions::cacheable_logical_plan mode that makes the logical plan self-contained:

  • Leaf reads are storage-agnostic ReadFromTable placeholders for any engine.
  • Views are expanded at plan time, recursively in logical mode (propagated through StorageView::readImpl via SelectQueryInfo::build_logical_plan), so the cached plan embeds the analyzed view bodies instead of deferring the expensive expansion to execution.
  • Key-value direct-join lookup steps (JoinStepLogicalLookup), which bind live storages into the plan, are not used; a regular join is planned instead and the physical join algorithm is chosen at materialization. Distributed and parallel-replicas logical plans do not set this mode and are unaffected.
  • The plan is serialized with the standard query plan serialization and built with compile_expressions = 0 (JIT-fused function nodes cannot be serialized; the JIT still applies when pipelines are built from the plan).

Both hits and misses execute through QueryPlan::resolveStorages, which re-binds every leaf to a fresh storage snapshot.

Cache key and invalidation

The key is the normalized AST hash, a hash of plan-affecting settings, the current database, the user and the sorted role set. Each entry stores a dependency fingerprint over every referenced storage, discovered from the plan's ReadFromTable leaves plus an AST closure through view definitions (which also catches tables referenced only from scalar subqueries): UUID, metadata version (or a schema content hash including the view body), and the row policy hash. Every hit revalidates all dependencies and re-checks SELECT access, so DROP/CREATE, ALTER, view redefinition (including nested views), row policy changes and permission revocation are handled transparently.

Eligibility

  • Non-deterministic functions anywhere — including inside expanded view bodies — make a query uncacheable (arrayJoin is exempt: it is multi-valued but pure).
  • Scalar subqueries are evaluated during analysis and baked into the plan as constants, so they are gated behind the opt-in setting query_plan_cache_allow_scalar_subqueries.
  • Table functions, temporary tables, system tables (except system.one), remote/Merge storages and SQL SECURITY DEFINER views are rejected.
  • Plans containing steps without serialization support (e.g. window functions) execute normally and are simply not stored.

Performance, proven on DOOMbench

DOOMbench is a DOOM-like raycasting engine implemented entirely in SQL; its screen query renders a frame through a deep chain of views over joins, unions, array joins and scalar subqueries. Ported to ClickHouse SQL, the workload is query-analysis-bound: profiling showed IQueryTreeNode::getTreeHash and analyzer resolution dominating, and SELECT ... LIMIT 0 was as slow as the full query. With the plan cache (local server, single connection):

DOOMbench frame render median speedup
cache off 551-577 ms
cache on (hit) 227-246 ms 2.4x
cache on, max_threads = 4 148-176 ms 3.3x

On hits the analyzer cost disappears from profiles (getTreeHash: 25.8k samples → 174); the remaining time is genuine execution. Correctness validated on the same workload: frames are byte-identical between cache off/miss/hit, game-state mutations are visible on every cached render (HTAP freshness), nested view redefinition is picked up immediately, and rand() inside a view body refuses to cache.

Notes

  • This PR includes the ActionsDAG input-constant serialization fix Preserve constant columns of ActionsDAG input nodes in serialization #107124 as its first commit; it will be rebased once that PR is merged.
  • The miss path executes through the same logical-plan machinery as hits, so hit and miss behavior is identical by construction; storage-specific planning shortcuts (e.g. trivial count()) are not applied to cache-enabled queries.

Related: #99309
Related: #107124

…ries)

A new experimental query plan cache for `SELECT` queries that, unlike the
single-table design of #99309,
supports complex queries: views (including nested views), joins, `UNION`,
`IN` subqueries and multiple tables. Unlike the query result cache, a plan
cache hit still executes the query against current data; only parsing, query
analysis and logical planning are skipped.

Design: instead of universalizing storage-bound steps after planning, the
miss path builds the plan in the planner's existing `build_logical_plan` mode
(battle-tested by distributed plan shipping), where leaf reads are
storage-agnostic `ReadFromTable` placeholders for any engine. On every
execution - both hit and miss - `QueryPlan::resolveStorages` re-binds the
leaves to fresh storage snapshots, so cached plans are transactionally
consistent. A new `SelectQueryOptions::cacheable_logical_plan` mode makes the
logical plan self-contained: views are expanded at plan time (recursively in
logical mode, propagated via `SelectQueryInfo::build_logical_plan` through
`StorageView::readImpl`), and key-value direct-join lookup steps
(`JoinStepLogicalLookup`), which bind live storages, fall back to regular
joins. Distributed and parallel-replicas logical plans are unaffected.

The cache key is the normalized AST hash, a hash of plan-affecting settings,
the current database, the user and the sorted role set. Each entry stores a
dependency fingerprint over every referenced storage - discovered from the
plan's `ReadFromTable` leaves plus an AST closure through view bodies (which
also catches tables referenced only from scalar subqueries) - with UUID,
metadata version or schema content hash (including the view body), and the
row policy hash. Every hit revalidates all dependencies and re-checks
`SELECT` access; `DROP`/`CREATE`, `ALTER`, view redefinition (including
nested views) and row policy changes invalidate transparently.

Eligibility: non-deterministic functions anywhere (including expanded view
bodies) make a query uncacheable; `arrayJoin` is exempt since it is pure.
Scalar subqueries are evaluated during analysis and baked into the plan as
constants, so they are gated behind the new opt-in setting
`query_plan_cache_allow_scalar_subqueries`. Table functions, temporary
tables, system tables (except `system.one`), remote/`Merge` storages and
`SQL SECURITY DEFINER` views are rejected. Plans with unserializable steps
(e.g. window functions) execute normally and are simply not stored.

Also fixes a pre-existing gap in `ActionsDAG` serialization exposed by this
work: lambda capture DAGs reference enclosing constants as input nodes
carrying a constant column (`addInputConstantColumnIfNecessary`), but the
serializer dropped columns for all input nodes, so functions requiring
constant arguments (e.g. `tupleElement` in `arrayMap(t -> t.2, ...)`) failed
to re-resolve on deserialization. Input-node constant columns are now
serialized; the format change is self-describing via the per-node flag.
Cacheable plans are built with `compile_expressions = 0` because JIT-fused
function nodes cannot be serialized; the JIT still applies when pipelines are
built from the plan.

User interface: settings `allow_experimental_query_plan_cache`,
`enable_query_plan_cache`, `query_plan_cache_allow_scalar_subqueries`,
`query_plan_cache_size_in_bytes_quota`; server settings
`query_plan_cache.max_size_in_bytes` and `query_plan_cache.max_entries`;
`SYSTEM DROP QUERY PLAN CACHE`; profile events `QueryPlanCacheHits` (counted
only when a cached plan is actually used), `QueryPlanCacheMisses`,
`QueryPlanCacheValidationMisses`, `QueryPlanCacheStaleMisses`; metrics
`QueryPlanCacheBytes`, `QueryPlanCacheEntries`.

Proven on DOOMbench (clickhouse/doombench, a raycasting engine in pure SQL
whose `screen` view is a deep chain of views over joins, unions, array joins
and scalar subqueries; planning dominated its render cost). Local server,
single connection, 64x32 map, 128x64 viewport:

  render (frame) cache OFF:        551-577 ms median
  render (frame) cache ON:         227-246 ms median   (2.4x)
  render cache ON, max_threads=4:  148-176 ms median   (3.3x)

with frames byte-identical between cache off/miss/hit, state mutations from
the game tick visible on every cached render (HTAP freshness), nested view
redefinition picked up immediately, and `rand()` inside a view body refusing
to cache. The remaining hit-side cost is dominated by actual execution
(aggregation of the pixel grid), not planning: the analyzer/tree-hash storm
visible in profiles before (25.8k samples in `IQueryTreeNode::getTreeHash`)
drops to noise (174 samples).

Related: #99309

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@alexey-milovidov alexey-milovidov force-pushed the query-plan-cache-complex branch from 3629ad4 to 9d3a72e Compare June 12, 2026 19:50
@clickhouse-gh

clickhouse-gh Bot commented Jun 12, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [e12f472]

Summary:

job_name test_name status info comment
Stateless tests (arm_binary, parallel) FAIL
04061_spilling_hash_join_overflow_limits FAIL cidb

AI Review

Summary

This PR adds an experimental query plan cache for analyzer-based SELECT queries, including dependency revalidation, access rechecks, per-user quotas, SYSTEM DROP QUERY PLAN CACHE, documentation, and focused stateless coverage. The overall design is close, but I still see one correctness blocker in the cache-hit path, one new concurrency bug in quota accounting, and two remaining query-log metadata gaps.

Findings

❌ Blockers

  • [src/Interpreters/executeQuery.cpp:1174] [dismissed by author -- https://github.com/Add query plan cache for complex queries (views, joins, subqueries) #107125#discussion_r3470391281] Validation and execution are still not tied to the same storage identity. validateQueryPlanCacheEntry checks dep.uuid, but materializeCachedQueryPlan deserializes ReadFromTable steps that only store table_name, and QueryPlan::resolveStorages resolves that name again. A concurrent Atomic DROP/CREATE can therefore validate UUID A and then execute the cached plan against UUID B, including an engine/schema that the cache never validated or would normally reject. Minimum fix: carry the validated storage identity into materialization, or make resolveStorages reject when the resolved UUID no longer matches the dependency fingerprint.

⚠️ Majors

  • [src/Interpreters/Cache/QueryPlanCache.cpp:234] clear can erase per_user_bytes / entry_weights after a concurrent set has pre-charged them but before Base::set inserts the entry. Because Context::clearQueryPlanCache intentionally runs without holding the context mutex, SYSTEM DROP QUERY PLAN CACHE can leave a resident entry with no tracked weight, so query_plan_cache_size_in_bytes_quota becomes understated until the key is reinserted or the cache is cleared again. Minimum fix: synchronize clear with the pre-charge/insert path, or teach set to notice a concurrent clear and restore/remove the accounting before returning.
  • [src/Interpreters/Cache/QueryPlanCacheUtils.cpp:128] [dismissed by author -- https://github.com/Add query plan cache for complex queries (views, joins, subqueries) #107125#discussion_r3470391278] Cache hits still replay stale row-policy names into system.query_log. The dependency fingerprint hashes only the effective filter expression, while entry.used_row_policies is persisted verbatim and later replayed on hits. Replacing or renaming a policy with another one that has the same effective expression keeps validation green but leaves used_row_policies reporting the old names. This is audit metadata, not execution semantics, but it is still incorrect user-visible logging.
  • [src/Interpreters/Cache/QueryPlanCacheUtils.cpp:729] [dismissed by author -- https://github.com/Add query plan cache for complex queries (views, joins, subqueries) #107125#discussion_r3501144202] Cache-hit access replay still under-reports accessed objects in system.query_log. columns_unknown dependencies have their column list cleared during dependency merge, so hits log the table name but not the actually read columns, and view dependencies are routed to query_views only even though the miss path records them through addQueryAccessInfo. The result is that query_columns / query_tables on a hit can be materially weaker than on a miss for the same query.
Final Verdict

Status: ❌ Block

Minimum required actions:

  • Bind cache-hit validation and materialization to the same storage identity so a concurrent DROP/CREATE cannot execute a cached plan against a different table UUID.
  • Fix the clear / set accounting race so query_plan_cache_size_in_bytes_quota remains correct across SYSTEM DROP QUERY PLAN CACHE.

@clickhouse-gh clickhouse-gh Bot added the pr-experimental Experimental Feature label Jun 12, 2026
Comment thread src/Storages/StorageView.cpp Outdated
Comment thread src/Interpreters/Cache/QueryPlanCacheUtils.cpp Outdated
Comment thread src/Interpreters/Cache/QueryPlanCacheUtils.cpp Outdated
Comment thread src/Interpreters/Cache/QueryPlanCache.cpp Outdated
alexey-milovidov and others added 3 commits June 13, 2026 07:32
…n setting

Two related fixes to query plan cache eligibility and keying for scalar
subqueries:

- `query_plan_cache_allow_scalar_subqueries` is no longer ignored when
  computing the cache key. An entry stored while it was enabled could
  otherwise be reused by a later execution that explicitly disabled it,
  reusing a stale baked scalar value. The setting now participates in the
  key, so the two modes do not share an entry.

- In the view-body eligibility walk (`ASTDependencyCollector`), only the
  right-hand argument of `x IN (subquery)` is the set (re-executed on every
  run). The left operand is an ordinary expression, so a subquery there is a
  scalar subquery folded into a constant and must be gated like any other
  scalar subquery. Previously every child of a set function was marked
  set-position, which let a left-operand scalar subquery escape the
  `query_plan_cache_allow_scalar_subqueries` gate inside view bodies.

Also remove the unused `QueryPlanCacheHits` event extern in
`QueryPlanCache.cpp` (the event is incremented by the caller in
`executeQuery.cpp`), which the style check flagged as defined but not used.

Adds `04337_query_plan_cache_scalar_subquery_gating`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…naffected

`StorageView::readImpl` derived `options.cacheable_logical_plan` from
`query_info.build_logical_plan`, but `build_logical_plan` is also set for the
existing distributed and parallel-replica logical-plan paths. That made view
subqueries use the cacheable mode (recursive view expansion, no
storage-bound `JoinStepLogicalLookup`) even outside the experimental query
plan cache, changing behavior for those paths.

Propagate a separate `cacheable_logical_plan` bit through `SelectQueryInfo`,
set it only on the cache path in `buildQueryPlanForTableExpression`, and read
it in `StorageView::readImpl`. Distributed and parallel-replica logical plans
set `build_logical_plan` but not `cacheable_logical_plan`, so their view
expansion keeps its previous, non-cacheable behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/Interpreters/executeQuery.cpp
Comment thread src/Interpreters/Cache/QueryPlanCacheUtils.cpp Outdated
alexey-milovidov and others added 4 commits June 17, 2026 13:14
The query plan cache is a single, server-wide cache. `04324` and `04337`
inspect it through the global `SYSTEM DROP QUERY PLAN CACHE` and assert exact
`QueryPlanCacheHits` counts between two runs of the same query. That only holds
when the test runs in isolation: concurrent queries (including the flaky check,
which runs several copies of the same test in parallel) drop or evict the
shared entries, turning an expected hit into a miss. The same assertions depend
on the exact logical plan, so random plan-affecting settings change whether a
query is cacheable, and the cache only engages with the analyzer.

Tag both tests `no-parallel`, `no-random-settings`,
`no-random-merge-tree-settings` and `no-old-analyzer`. This fixes the
`Server died` failures in the flaky checks (the flaky check aborted after
`--max-failures` and force-killed the running test processes, which is reported
as `Server died`), the old-analyzer result diff, and the random-settings result
diffs in the regular runs.

#107125
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The requested `QueryProcessingStage` was not part of the cache key, but the
cache could be used for any stage the client requests. A non-`Complete` stage
such as `with_mergeable_state` changes the logical plan contract and the output
header (intermediate aggregate states vs. finalized values), so an entry first
cached under one stage could be reused for an identical query later executed
under another, returning a plan built for the wrong contract.

Only enable the query plan cache when `stage == QueryProcessingStage::Complete`.

#107125
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After merging master, prefixes 04324 and 04337 collide with tests added on
master (the 04324_* and 04337_* groups). Renamed the query plan cache tests to
the next free prefixes:

- 04324_query_plan_cache_complex_queries -> 04356_query_plan_cache_complex_queries
- 04337_query_plan_cache_scalar_subquery_gating -> 04357_query_plan_cache_scalar_subquery_gating

#107125
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/Interpreters/Cache/QueryPlanCacheUtils.cpp Outdated
Comment thread src/Interpreters/Cache/QueryPlanCacheUtils.cpp
alexey-milovidov and others added 2 commits June 24, 2026 19:10
After merging master, the query plan cache tests `04356`/`04357` collided
with new tests added on master at those prefixes (master now reaches
`04408`). Renumber them to the next free prefixes so they no longer clash.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/Interpreters/Cache/QueryPlanCache.cpp
IASTHash getRowPolicyHash(const ContextPtr & context, const String & database, const String & table)
{
IASTHash row_policy_hash{};
auto row_policy = context->getRowPolicyFilter(database, table, RowPolicyFilterType::SELECT_FILTER);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

row_policy_hash only covers the combined filter expression, but cache hits later replay entry.used_row_policies into system.query_log. If a policy is replaced or renamed with another policy that has the same effective expression, validation passes and hits keep logging the old policy names. That leaves audit/query-log metadata stale even though the row filter is equivalent.

Please include the policy identities/full names in the dependency fingerprint, or recompute the used row-policy names on each hit when the expression hash matches.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The row filter itself stays correct on a hit (the filter-expression hash still matches), so this is an audit-metadata accuracy issue: a hit replays the stored used_row_policies names into system.query_log, which can be stale after a same-expression rename/replace. The fix is to fold the sorted policy full names into the dependency fingerprint (or recompute the used-policy names on each hit when the expression hash matches). Since it affects only logged policy names, not results or access, I have left it as a follow-up. Leaving open for your decision.

/// falling through to normal planning.
checkAccessForQueryPlanCacheHit(*cached_entry, context);

auto plan = materializeCachedQueryPlan(cached_entry->serialized_plan, context);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validation and materialization are not tied to the same storage snapshot. validateQueryPlanCacheEntry resolves dep.database/dep.table and checks the UUID/metadata, but materializeCachedQueryPlan then deserializes the plan and resolveStorages resolves the table name again and takes the lock there. If an Atomic table is dropped and recreated between these calls, validation can pass for UUID A and the plan can execute against UUID B, including an engine/schema that was never validated or eligible for caching. That breaks the DROP/CREATE invalidation contract.

Please either carry the validated StoragePtr/snapshot/lock into materialization, or include the expected UUID in the ReadFromTable placeholder and have resolveStorages reject/replan if the resolved storage no longer matches.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a genuine but narrow TOCTOU: a concurrent Atomic DROP/CREATE between validateQueryPlanCacheEntry and resolveStorages could validate UUID A and then execute against UUID B. The robust fix is to bind resolution to the validated identity - serialize the expected UUID into the ReadFromTable placeholder and have QueryPlan::resolveStorages reject/replan on a UUID mismatch. That changes the plan serialization format, which is also used by distributed plan shipping, so I have left it for a dedicated change rather than risk it in this round. Leaving this thread open for your decision.


/// Unresolvable names are not necessarily errors: the identifier may refer to a CTE.
/// Skipping them is safe - a CTE is not a storage, so it cannot go stale.
auto storage = DatabaseCatalog::instance().tryGetTable(table_id, context);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This raw AST dependency walk resolves every table-looking identifier through DatabaseCatalog, but it does not track in-scope CTE names. For example, WITH cte AS (SELECT 1) SELECT * FROM cte while a real default.cte table exists will add default.cte as a dependency even though the analyzed plan reads the CTE, not that table. On hits this can require SELECT on an unrelated table, invalidate on unrelated metadata changes, or reject caching because the shadowed table uses an unsupported engine.

Please track CTE aliases, including view-local CTEs, and skip those identifiers before catalog lookup, or derive these extra dependencies from the analyzed query tree instead of raw AST names.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the current behavior here is fail-close (over-strict), not fail-open: a CTE name that shadows a real table adds a spurious dependency, which can trigger an unnecessary access check or invalidation, or refuse to cache - but it never drops a real dependency. A precise fix must track CTE names per scope, because a global "skip names that match a WITH alias" would risk the opposite failure (skipping an identifier that is actually a real table read = a missed dependency, i.e. a real access/staleness hole). Given that risk and that the current direction is safe, I have left the precise scope-tracking fix for your decision rather than introduce an imprecise one.

…ProfileEvents conflict (keep both sides' events)
Comment thread src/Interpreters/Cache/QueryPlanCacheUtils.cpp Outdated
Comment thread src/Interpreters/executeQuery.cpp Outdated
Comment thread src/Interpreters/Cache/QueryPlanCache.cpp
Comment thread src/Core/Settings.cpp
alexey-milovidov and others added 2 commits June 26, 2026 15:30
…story.cpp

The `26.6` release was cut, so `02995_new_settings_history` now compares the
current settings against the `02995_settings_26_6_1.tsv` reference and requires
every setting added after `26.6.1` to be recorded under a version block greater
than `26.6`. The four query plan cache settings
(`allow_experimental_query_plan_cache`, `enable_query_plan_cache`,
`query_plan_cache_allow_scalar_subqueries`, `query_plan_cache_size_in_bytes_quota`)
were still under the `26.6` block, so the `Fast test` job reported
`PLEASE ADD THE NEW SETTING TO SettingsChangesHistory.cpp: ... WAS ADDED`.
Move them to the `26.7` block.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=107125&sha=fd0fd283b9bbf13a6e90f4e68627c5f95628650e&name_0=PR&name_1=Fast%20test

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The new `docs/en/operations/query-plan-cache.md` was a "floating" page that
no sidebar referenced, which fails the `Build docusaurus` check. Link it from
`docs/en/operations/caches.md` (mirroring `query-condition-cache.md`) and add
the `SYSTEM CLEAR|DROP QUERY PLAN CACHE` statement to the system docs, which
also gives the page an inbound reference.
…idation

Fixes several correctness and security findings on the experimental query plan
cache (PR review of #107125):

- Fall back to the normal interpreter for ineligible queries. The cacheable
  logical-plan execution path expands view bodies at plan time and resolves
  their leaf reads under the invoker context; that is wrong for `SQL SECURITY
  DEFINER`/`NONE` views. Previously an ineligible query (or one with an
  unsupported dependency) was not stored but was still executed through the
  cacheable plan, which could change its result or security context. Now
  `tryInterpretWithQueryPlanCache` returns `nullptr` so the normal interpreter
  runs, and the cacheable plan is executed only for queries that are actually
  eligible.

- Reject `SQL SECURITY NONE` views, not only `DEFINER`. Both run the view body
  under an overridden context the cache cannot replay, so only `INVOKER` views
  (and views with no explicit security) are eligible. The engine/view-security
  eligibility check is extracted into `isStorageEligibleForPlanCache`.

- Re-check storage eligibility during validation. In UUID-less databases a
  `DROP`/`CREATE` can swap in an unsupported engine (`Distributed`/`Merge`) or
  change a view's SQL security while the schema content hash stays equal; a hit
  could then run against a storage a miss would have refused. Validation now
  re-runs `isStorageEligibleForPlanCache`.

- Extend the fallback schema fingerprint (`computeSchemaHash`) with table
  constraints (an `ASSUME` constraint can rewrite predicates during analysis)
  and, for views, the SQL security type and definer, so changing any of them
  invalidates a stale entry.

- Fix the access-control dependency model. Duplicate dependencies for one
  storage (self-joins, a table read in both the outer plan and a subquery) are
  now merged by unioning their column sets instead of keeping one occurrence,
  so revoking `SELECT` on any read column denies a later hit. A table reached
  only through a scalar subquery is folded into a constant and has no plan leaf,
  so its columns are unknown (`columns_unknown`); such a dependency now requires
  table-level `SELECT` on a hit, because a column-level recheck could pass while
  the baked plan still reads a revoked column. Regular view-body tables remain
  plan leaves with exact columns and are unaffected.
…urity

- `04492_query_plan_cache_access_control`: a self-join's column requirements
  are unioned across both plan leaves (revoking `SELECT` on a column read by
  either leaf denies the next hit), and a scalar-subquery table requires
  table-level `SELECT` on a hit (column grants alone are not enough).
- `04493_query_plan_cache_view_security`: a `SQL SECURITY NONE` view query is
  never cached (falls back to normal execution), and changing a view from
  `INVOKER` to `SQL SECURITY NONE` invalidates a cached entry.
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed 230ba5365c9. Merged current master (resolves the conflicting state) and addressed the review.

Docs build (Build docusaurus) — the new query-plan-cache.md was a floating page with no sidebar reference. Linked it from caches.md and documented SYSTEM CLEAR|DROP QUERY PLAN CACHE in the system statements (commit f5a3e4df667).

AI review findings — addressed (commit ec109e09b50, with regression tests 04492/04493; arrayJoin and the parsing wording were already addressed in 5e1f87a999f/a7101abcafc with test 04491):

  • Ineligible queries now fall back to the normal interpreter (return nullptr) instead of executing the cacheable logical plan, so cache-enabled execution can't change a query's result or security context.
  • SQL SECURITY NONE views are rejected (not only DEFINER); only INVOKER views are eligible.
  • Validation re-runs the engine/view-security eligibility check, so a UUID-less DROP/CREATE into an unsupported engine or a view security change is caught even when the schema hash is unchanged.
  • computeSchemaHash now includes table constraints and, for views, the SQL security type and definer.
  • Access model: duplicate dependencies are merged by unioning column sets (self-joins, tables read in multiple places); scalar-subquery tables are folded out of the plan so they require table-level SELECT on a hit.

Left open for your decision (replied on the threads, not resolved):

  • Validation→materialization UUID TOCTOU — the robust fix serializes the expected UUID into ReadFromTable and re-checks in resolveStorages, which touches the plan serialization format used by distributed execution; deferred to a dedicated change.
  • CTE-name shadowing — current behavior is fail-close (over-strict, never drops a real dependency); a precise per-scope fix is needed and an imprecise one would risk fail-open.
  • Row-policy identity in the fingerprint — affects only logged policy names in query_log, not results or access.

CI — the Hung check failed, possible deadlock found (AST fuzzer) is a master-wide flake tracked by #107941 (40–150 failures/day across dozens of unrelated PRs); the plan cache is off by default, so it is not caused by this PR.

Comment thread docs/en/operations/query-plan-cache.md Outdated
…heable

The eligibility check in `isStorageEligibleForPlanCache` rejects any view
whose `sql_security_type` is set to anything other than `INVOKER`, so both
`SQL SECURITY DEFINER` and `SQL SECURITY NONE` views are ineligible. The docs
listed only `SQL SECURITY DEFINER` as excluded, which incorrectly implied
`SQL SECURITY NONE` views could use the plan cache. State the rule directly:
only `SQL SECURITY INVOKER` views are eligible.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Status update on the AI-review "⚠️ Request changes" verdict (head d66a486e92c):

Addressed

  • Docs nit — SQL SECURITY NONE eligibility mismatch (docs/en/operations/query-plan-cache.md): fixed in d66a486e92c. The eligibility text now says only SQL SECURITY INVOKER views are eligible (and that both SQL SECURITY DEFINER and SQL SECURITY NONE views are not cached), matching isStorageEligibleForPlanCache, which rejects any view whose sql_security_type is set to anything other than INVOKER. This was the one open thread not in the earlier "left open" list.

Row-policy audit metadata (Major) — the verdict asks to "fix or explicitly scope out". This was explicitly scoped out on the thread: a same-expression policy rename/replace only makes the used_row_policies names replayed into system.query_log stale; it never affects results or access (the filter-expression hash still revalidates). Folding the sorted policy full names into the dependency fingerprint is the follow-up.

UUID validation→materialization TOCTOU (Blocker) — left as your decision. The window is strictly between validateQueryPlanCacheEntry (checks UUID A) and resolveStorages inside materializeCachedQueryPlan, which re-resolves each leaf by name and pins a StorageSnapshot + lockForShare. Once resolveStorages binds, the snapshot is pinned, so the residual exposure is only that single window.

I did not push a fix for this here: the robust fix touches the query-plan serialization format shared with distributed plan shipping (your stated reason for deferring), and the alternative is a real change to the shared resolveStorages hot path that I cannot build or test in this environment — pushing untested concurrency code to a correctness-sensitive shared path against an explicit maintainer deferral would be the wrong call. For when you do the dedicated change, a non-format-changing option that stays fail-close:

  • Have QueryPlan::resolveStorages return (or expose via an out-parameter) the (StorageID, uuid) set it actually bound, then re-validate those bound identities against entry.dependencies after materializeCachedQueryPlan. On any mismatch, treat it as a stale miss (QueryPlanCacheStaleMisses) and fall through to normal planning. Because the bound storages are exactly the ones whose snapshots are pinned, re-checking them fully closes the window without changing the serialized plan format. The serialize-the-expected-UUID-into-ReadFromTable approach remains the more direct fix if the format change is acceptable in that dedicated PR.

CI — the only red checks are Hung check failed, possible deadlock found on Stress (amd_tsan / arm_tsan). The CI report links them to #107941, and the stacks are AST-fuzzer queries (executeASTFuzzerQueries, e.g. SELECT arrayZip('')) with no plan-cache frames; the cache is off by default, so these are the known master-wide flake, not caused by this PR. No re-merge of master was done: the merge-base is from earlier today, the branch is MERGEABLE with no conflicts, and the reds are not fixable by merging master.

Comment thread src/Interpreters/Cache/QueryPlanCacheUtils.cpp Outdated
… on a hit

A view is expanded into the cached logical plan and leaves no `ReadFromTable`
leaf for the view storage itself, so `collectQueryPlanCacheDependencies` could
not record the columns selected from the view. The view dependency therefore
had an empty `columns` set with `columns_unknown = false`, and on a hit
`checkAccessForQueryPlanCacheHit` only required `SELECT` on any one column of
the view. That is weaker than a miss, which checks `SELECT` on the precise
selected view columns in `prepareBuildQueryPlanForTableExpression` before
expanding the view: after warming `SELECT a FROM v` while granted `SELECT(a)`
on `v`, revoking `SELECT(a)` while keeping `SELECT(b)` would let the hit still
return `a`.

Mark view-only AST dependencies `columns_unknown = true`, exactly as already
done for scalar-subquery tables (whose columns are likewise unrecoverable from
the plan): a hit now requires table-level `SELECT` on the view, which is at
least as strict as the per-column access a non-cached query checks. The view's
body tables are still recorded with their precise columns as ordinary plan
leaves, so only the recheck of the view storage itself is tightened.

Adds test `04494_query_plan_cache_view_column_access` (revoke a selected view
column between the miss and the hit) and documents the conservative behavior.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Status update on the AI-review ❌ Block verdict (now head 66e3e3eee42).

Fixed — view-column privilege recheck on a hit (the new blocker, QueryPlanCacheUtils.cpp:340)

This was a genuine fail-open access gap and was not one of the previously-deferred items, so I fixed it in 66e3e3eee42. A view is expanded into the cached plan and leaves no ReadFromTable leaf for the view storage itself, so the view dependency had an empty columns set with columns_unknown = false; on a hit checkAccessForQueryPlanCacheHit then only required SELECT on any one column of the view. A miss checks the precise selected view columns in prepareBuildQueryPlanForTableExpression, so a user could warm SELECT a FROM v with SELECT(a) on v, lose SELECT(a) while keeping SELECT(b), and still get a on the hit.

I took the reviewer's second option (require table-level SELECT for view-only dependencies) rather than the first (record the exact selected view columns), because:

  • It is consistent with how scalar-subquery tables are already handled — they have the same "no plan leaf, columns unrecoverable" property and already set columns_unknown = true (see case 2 of 04492). Recording precise columns would be the natural follow-up for both cases at once.
  • It is the minimal, strictly fail-close change: table-level SELECT is at least as strict as the per-column access a non-cached query checks, so it can never serve a hit with weaker access.

Note this is the deny-on-hit direction, identical to the accepted scalar-subquery behavior: a query that only a column-level grant on a view makes runnable is now denied on a hit (rather than served). I documented this in query-plan-cache.md. If you would prefer column-restricted-view users to still get hits, the precise-column-recording approach is the follow-up (and would cover the scalar-subquery case symmetrically).

Regression: 04494_query_plan_cache_view_column_access — warms SELECT a FROM v under per-column grants on the view, asserts the hit is denied and stays denied after revoking the selected column (another remains), then that granting table-level SELECT makes the hit succeed.

Still open by your decision (unchanged from the previous round)

  • UUID validation→materialization TOCTOU (executeQuery.cpp:1174) — left for the dedicated change you described; the robust fix touches the plan serialization format shared with distributed plan shipping. I again did not push speculative concurrency code to that shared path against your explicit deferral.
  • Row-policy audit-metadata (Major, QueryPlanCacheUtils.cpp:106) — explicitly scoped out as a follow-up (affects only used_row_policies names replayed into system.query_log, never results or access).
  • CTE-shadowing (QueryPlanCacheUtils.cpp:319) — current behavior is fail-close (over-strict), as you noted; a precise per-scope fix is the follow-up.

CI — the only reds on the prior commit were Hung check failed on Stress (amd_tsan / arm_tsan), which the CI report links to #107941 (AST-fuzzer stacks, no plan-cache frames; cache is off by default) — the known master-wide flake, not caused by this PR. A fresh run is now in progress on 66e3e3eee42. No master re-merge: the merge-base is from earlier today, the branch is MERGEABLE with no conflicts, and the reds are not fixable by merging master.

}
in_set_or_table_position = false;
}
else if (!ast.as<ASTWithElement>())

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ASTWithElement represents a table CTE, but this branch preserves the caller's positional context before visiting the CTE subquery. At the top level that context is false, so WITH cte AS (SELECT ... FROM t) SELECT ... FROM cte is treated as if the CTE body were a scalar subquery: with the default query_plan_cache_allow_scalar_subqueries = 0 the query is never cached, and with the setting enabled the tables inside the CTE are marked columns_unknown and later require table-level SELECT even though the plan leaves know the actual columns.

Please traverse ASTWithElement::subquery in table/set position, the same way FROM (subquery) is handled, and add a regression that a table CTE can hit the plan cache without enabling scalar subqueries and without upgrading column-level grants to table-level access.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The observation is correct: a table CTE currently reaches the scalar-subquery branch, so with the default query_plan_cache_allow_scalar_subqueries = 0 such a query is not cached. But the suggested fix (visit ASTWithElement::subquery in table/set position) is not safe on its own, for the same reason the CTE-name walk is left open in #107125 (comment).

The raw AST walk records no column list — visitTableIdentifier sets only UUID/metadata/row-policy and the columns_unknown flag, not dep.columns. On a hit, checkAccessForQueryPlanCacheHit then picks the recheck from that flag:

  • columns_unknown = true → table-level SELECT recheck (fail-close).
  • columns_unknown = false with an empty column list → falls through to the weakest "SELECT on at least one column" path.

Consider a table CTE referenced only in scalar position, e.g. WITH cte AS (SELECT max(x) FROM t) SELECT (SELECT * FROM cte). Here t is folded into a constant during analysis and has no plan leaf, so its only dependency comes from this AST walk. Visiting the CTE body in table position would set columns_unknown = false, downgrading the hit recheck from table-level SELECT to "at least one column". A user granted SELECT(a) but not SELECT(x) would then pass on a hit, while the miss path (which checks SELECT(x) on the folded column) correctly denies — a missed-access hole, exactly the opposite failure flagged on the CTE-name walk.

A correct fix has to distinguish table-position from scalar-position CTE references (per-scope CTE usage tracking), which is the same reserved work as r3470391284. Leaving this open for the same decision.

if (dep.is_view)
query_context->addViewAccessInfo(storage_id.getFullTableName());
else
query_context->addQueryAccessInfo(storage_id, dep.columns);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For columns_unknown dependencies, collectQueryPlanCacheDependencies clears dep.columns, and this replay path then records the table with an empty column list. Context::addQueryAccessInfo only inserts the columns it is given, so a hit for a query such as SELECT (SELECT b FROM t) logs t in query_tables but omits t.b from query_columns, whereas the miss path records the actual scalar-subquery column during analysis.

Since this is audit/query-log metadata, under-reporting the columns read by a cached hit is unsafe. Please either keep precise columns for logging separately from the stricter access recheck, or log a conservative set such as all columns from the validated metadata whenever columns_unknown is true.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct that a columns_unknown hit logs the table into query_tables without its columns in query_columns, so a column read only through a scalar subquery (or a view) is under-reported relative to the miss path. This is audit/query-log metadata only: it does not affect results or the access recheck, which for a columns_unknown dependency is already the stricter table-level SELECT (checkAccessForQueryPlanCacheHit), not a column-level one.

It is the same class as the row-policy audit-metadata gap in #107125 (comment), which is left as a follow-up. The conservative fix suggested here (log all columns from the validated metadata whenever columns_unknown is true) fits that same follow-up. Leaving open for your decision.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed 9e72ebfaf47: merged current master into the branch. The merge is conflict-free, so the GitHub CONFLICTING/DIRTY state was stale (the branch had fallen ~560 commits behind a fast-moving master); the new merge brings it up to date and should flip the PR back to MERGEABLE. The merge touched only non-overlapping additions in Settings.cpp and SettingsChangesHistory.cpp; all plan-cache settings, history entries, profile events and new files are intact, and there are no leftover conflict markers or duplicate version keys.

CI on the previous head 66e3e3ee — both reds are unrelated, pre-existing flakes:

  • Unit tests (tsan, function_prop_fuzzer)FunctionsStress.stress / AllTests: the function-property fuzzer hit SELECT reinterpret(a, 'Decimal256(21)') ... ARRAY JOIN CAST(... AS Array(Decimal(76, 35))) returning Decimal256 instead of Decimal(76, 21) (INCORRECT_DATA). This is a pre-existing bug in reinterpret's result-scale handling found by the fuzzer, tracked by the CI auto-matching issue FunctionsStress #107242. CIDB shows FunctionsStress.stress failing across 52 distinct PRs on 2026-06-30 alone and every day for the past month — a master-wide flake. This PR touches no Functions/ code, so it cannot be the cause.
  • Stress test (amd_tsan) / Stress test (arm_tsan)Hung check failed, possible deadlock found: the chronic AST-fuzzer Hung-check flake Hung check failed, possible deadlock found #107941 (the plan cache is off by default and the stacks are unrelated BackupsWorker/fuzzer frames).

Open AI-review "⚠️ Request changes" items (left for your decision):

The two remaining "minimum required actions" are, in my reading, both fail-close / audit-only — the same class as the items you already deferred (:106, :319, executeQuery.cpp:1174) — and I did not push speculative fixes for them (I cannot build or test this branch in this environment, and the obvious CTE fix is fail-open; see below):

  • QueryPlanCacheUtils.cpp:293 (CTE traversal). The literal suggestion — "traverse ASTWithElement::subquery in table position, like FROM (subquery)" — is fail-open as written. For a CTE referenced only in scalar position, e.g. WITH cte AS (SELECT a FROM t) SELECT (SELECT sum(a) FROM cte), t is folded into a constant and has no plan leaf, yet that change would record it with columns_unknown = false; on a hit checkAccessForQueryPlanCacheHit then only requires "SELECT on any one column", so a user who lost SELECT(a) but kept SELECT(b) would still get the baked a. A correct fix needs per-scope CTE tracking — the same work you deferred on the related thread :319. The current behavior is fail-close: a table CTE is just not cached by default (its body is seen as a scalar subquery) or is over-strict (table-level SELECT on a hit). A safe partial fix that only removes the over-rejection (cache the query, but keep the body's tables columns_unknown = true, i.e. table-level recheck) is possible without scope tracking, but it does not address the over-strictness the reviewer also flags.
  • QueryPlanCacheUtils.cpp:686 (cached-hit column logging). This is audit-metadata only — the same class as the row-policy-name audit gap at :106 you scoped out. The precise-column version is not implementable for scalar-subquery tables without re-running the analysis the cache exists to skip (the AST walk only sees raw FROM t, never the resolved column t.b). The only implementable option is the reviewer's conservative fallback: log all columns from the validated metadata for columns_unknown dependencies (over-report rather than under-report). Happy to do that if you want it in this PR rather than as the follow-up that also covers :106.

New CI is running on 9e72ebfaf47.

Comment thread src/Interpreters/Cache/QueryPlanCacheUtils.cpp Outdated
Comment thread src/Interpreters/Cache/QueryPlanCache.cpp Outdated
alexey-milovidov and others added 2 commits July 1, 2026 02:52
A `SELECT` row policy filter may itself contain a subquery, e.g.
`USING x <= (SELECT max(v) FROM policy_limits)`. `buildFilterInfo`
analyzes such a filter into the plan, folding the subquery result (a
scalar boundary, or an `IN` set) into a baked constant. But the plan
cache dependency collector walks only the user query AST and the plan
leaves: the table read by the policy subquery never becomes a
`ReadFromTable` leaf, nor an entry of the AST closure walked by
`ASTDependencyCollector`. Its data therefore is not part of the entry's
dependency fingerprint, so a cache hit could keep enforcing a stale row
boundary after that table changes, exposing or denying rows incorrectly.
The policy filter also escaped `query_plan_cache_allow_scalar_subqueries`,
which is applied only to the user query AST.

Make any plan over a table whose effective `SELECT` row policy contains a
subquery uncacheable. The check lives in `fillDependency`, the single
choke point used by both the plan-leaf and AST-closure dependency walks,
and is repeated in `validateQueryPlanCacheEntry` as defense in depth. A
plain row policy without a subquery stays cacheable.

Adds `04495_query_plan_cache_row_policy_subquery`, which changes the
table read by a row-policy subquery between runs and asserts the query is
never cached (and stays fresh), while a plain policy is still cached.

Addresses review: #107125 (comment)...

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t charges

`QueryPlanCache::set` added the new entry's weight to `per_user_bytes`
and recorded `entry_weights` only after `Base::set` returned. But a
newly inserted entry lands in the SLRU probationary queue and can be
evicted synchronously inside `Base::set` (through `removeOverflow`) when
the cache is full of protected entries. That eviction runs
`onEntryRemoval`, which had nothing to subtract yet, so the subsequent
post-`set` accounting left a charge for a key that was no longer
resident. Such a ghost charge could make
`query_plan_cache_size_in_bytes_quota` reject future entries for the user
until the same key was inserted again or the cache was cleared.

Charge the entry (and record `entry_weights`) BEFORE `Base::set`, so a
synchronous eviction's `onEntryRemoval` finds and undoes the charge and
the accounting stays balanced whether or not the entry survives
insertion. Roll back the pre-charge if `Base::set` throws. The three
copies of the "subtract a weight from a user's bucket" logic
(same-key replacement, rollback, `onEntryRemoval`) are folded into a
`decrementUserBytes` helper. The oversized-entry early return is kept as
an optimization to avoid pointless insert/evict churn.

Addresses review: #107125 (comment)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed 9e72ebfaf47..d11fc3fbdc6 addressing the two new AI-review findings on the current head (both now resolved):

  • Blocker — QueryPlanCacheUtils.cpp:177 (row-policy subquery): a SELECT row policy such as USING x <= (SELECT max(v) FROM policy_limits) bakes the subquery result into the plan, but the table it reads never becomes a plan leaf or an AST-closure dependency, so a hit could keep enforcing a stale boundary and it escaped query_plan_cache_allow_scalar_subqueries. fillDependency now makes any plan over a table whose effective row policy filter contains a subquery uncacheable (repeated in validateQueryPlanCacheEntry as defense in depth); a plain policy stays cacheable. New test 04495_query_plan_cache_row_policy_subquery changes the table read by the policy subquery between runs and asserts the query is never cached and stays fresh.
  • Major — QueryPlanCache.cpp:193 (per-user quota ghost charge): the per-user quota was charged after Base::set, but a just-inserted probationary entry can be evicted synchronously inside removeOverflow, whose onEntryRemoval then had nothing to subtract, leaving a ghost charge that shrank the user's query_plan_cache_size_in_bytes_quota. The charge is now recorded before Base::set (rolled back if it throws), so a synchronous eviction balances it; the bucket-decrement logic is folded into a decrementUserBytes helper.

Both C++ files pass a clang -fsyntax-only check including thread-safety analysis.

The remaining unresolved threads are your explicit deferrals/dismissals, unchanged: the validate/materialize UUID TOCTOU (executeQuery.cpp:1174), the audit-only row_policy_hash/used_row_policies identity gap (:106), the CTE-scope items (:319, :293), and the columns_unknown column-logging under-report (:686).

@alexey-milovidov alexey-milovidov marked this pull request as ready for review July 2, 2026 07:04
Comment thread src/Interpreters/Cache/QueryPlanCache.cpp
`QueryPlanCache::canStoreForUser` compared the user's total resident
bytes against `query_plan_cache_size_in_bytes_quota` without accounting
for the entry it is about to replace. Because a same-key insertion is a
replacement (`set` releases the old entry's weight before charging the
new one), a user already at the quota could never refresh an invalidated
plan: after e.g. an `ALTER` fails validation and the query replans, the
identical-key `set` was rejected while the stale entry stayed resident,
so every later run re-paid validation and replanning without updating the
cache.

Subtract the existing key's tracked weight before the quota comparison,
mirroring the same-key bookkeeping already done in `set`.

Regression test `04496_query_plan_cache_quota_refresh`: with the quota
sized to one plan, an invalidated entry refreshes and the next run hits.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed `18e02892983`: fixed the per-user quota admission bug flagged in `QueryPlanCache.cpp` (the new bot finding). canStoreForUser was comparing the user's resident bytes against query_plan_cache_size_in_bytes_quota without releasing the weight of the entry it is about to replace, so a user at the quota could never refresh an invalidated plan — after an ALTER failed validation and the query replanned, the identical-key set was rejected, the stale entry stayed resident, and every later run re-paid validation + replanning. Now the existing key's tracked weight is subtracted before the comparison (mirroring the same-key bookkeeping in set). Added regression test 04496_query_plan_cache_quota_refresh. Verified the change with clang -fsyntax-only (TSA-clean under -Werror).

The remaining 5 unresolved threads are the design/scope calls you reserved (QueryPlanCacheUtils.cpp:128/:332/:358/:729, executeQuery.cpp:1174) — left untouched for your decision.

CI: the sole red is AST fuzzer (amd_debug, targeted)Logical error: 'Inconsistent AST formatting' on a fuzzed CREATE ... Array(Array(...)) query, which is the known master-wide fuzzer flake (#108372 / #109158) and unrelated to this PR (the plan cache is off by default and the failing query never touches it).

…edundant --database

The regression test added `--database="$CLICKHOUSE_DATABASE"` to the
`clickhouse-client` invocations that run as the test user, but `$CLICKHOUSE_CLIENT`
already sets `--database`, so the client rejected every such query with
`option '--database' cannot be specified more than once`. As a result `W` and
`QUOTA` were both measured as `0` and every subsequent run errored out, leaving
empty `-- hits:` lines that did not match the reference.

The queries are fully qualified (`${CLICKHOUSE_DATABASE}.t_quota`), so the extra
`--database` is unnecessary. Drop it to match the sibling tests
`04494`/`04495`, which run as a user with `$CLICKHOUSE_CLIENT --user="$user" $SETTINGS`.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=107125&sha=18e0289298368d5ce1223b97434e02c5548e155b&name_0=PR&name_1=Fast%20test
PR: #107125

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

void QueryPlanCache::clear()
{
Base::clear();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clear can race with a concurrent set and leave a resident entry completely unaccounted for.

A concrete interleaving is:

  1. Query A gets past canStoreForUser, records per_user_bytes / entry_weights at lines 173-195, and then blocks in Base::set.
  2. SYSTEM DROP QUERY PLAN CACHE runs Base::clear() here and then erases both maps below.
  3. Query A resumes Base::set and inserts the new entry, but its accounting was just wiped.

After that, the entry stays resident with no entry_weights record and no per-user charge, so query_plan_cache_size_in_bytes_quota can be undercounted until the same key is inserted again or the whole cache is cleared. Context::clearQueryPlanCache explicitly allows this to happen concurrently, so this is not just theoretical.

Please synchronize clear with the pre-charge/insert path, or make set detect that a clear happened between charging and insertion and restore/remove the accounting before returning.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed, this is a real race. set charges per_user_bytes/entry_weights inside a per_user_mutex block that it releases before Base::set, while clear wipes those maps under per_user_mutex after Base::clear. When clear's Base::clear runs before our Base::set, and our pre-charge runs before clear's map-wipe, the entry ends up resident with no entry_weights record and no per-user charge, so query_plan_cache_size_in_bytes_quota gets undercounted until that key is re-inserted or the cache is cleared again. The cached plan itself stays valid — no wrong results and no access impact — and the drift is bounded by the cache size and self-heals, but the hole is genuine.

Both obvious fixes are traps here:

  • Holding per_user_mutex across Base::set (or across Base::clear) inverts the lock order and deadlocks: onEntryRemoval acquires per_user_mutex while CacheBase already holds its own mutex (synchronous eviction inside set), so the discipline is CacheBase::mutexper_user_mutex.
  • A post-set "generation changed → re-charge" check cannot tell the orphan ordering (Base::clear before our Base::set, entry resident) from the opposite one (Base::clear after our Base::set, entry already removed, since CacheBase::clear does not run onEntryRemoval per entry). Distinguishing them needs a residency check, which needs CacheBase::mutex under per_user_mutex — deadlock again — and re-charging blindly reintroduces exactly the ghost charge the pre-charge structure was added to avoid.

A correct fix needs charge+insert to be atomic under the same lock that guards residency — e.g. a CacheBase on-insert counterpart to onEntryRemoval, so admission accounting happens inside the CacheBase-locked region rather than in a separate per_user_mutex critical section around Base::set. That is a design change to the accounting model.

Since the feature is off by default and the impact is a bounded, self-healing quota undercount, leaving this open for your decision alongside the other lock/format items rather than rushing an untestable concurrency change.

@clickhouse-gh

clickhouse-gh Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.60% 85.60% +0.00%
Functions 92.70% 92.70% +0.00%
Branches 77.80% 77.80% +0.00%

Changed lines: Changed C/C++ lines covered: 592/694 (85.30%) · Uncovered code

Full report · Diff report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-experimental Experimental Feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant