Add query plan cache for complex queries (views, joins, subqueries)#107125
Add query plan cache for complex queries (views, joins, subqueries)#107125alexey-milovidov wants to merge 28 commits into
Conversation
…ries) A new experimental query plan cache for `SELECT` queries that, unlike the single-table design of #99309, supports complex queries: views (including nested views), joins, `UNION`, `IN` subqueries and multiple tables. Unlike the query result cache, a plan cache hit still executes the query against current data; only parsing, query analysis and logical planning are skipped. Design: instead of universalizing storage-bound steps after planning, the miss path builds the plan in the planner's existing `build_logical_plan` mode (battle-tested by distributed plan shipping), where leaf reads are storage-agnostic `ReadFromTable` placeholders for any engine. On every execution - both hit and miss - `QueryPlan::resolveStorages` re-binds the leaves to fresh storage snapshots, so cached plans are transactionally consistent. A new `SelectQueryOptions::cacheable_logical_plan` mode makes the logical plan self-contained: views are expanded at plan time (recursively in logical mode, propagated via `SelectQueryInfo::build_logical_plan` through `StorageView::readImpl`), and key-value direct-join lookup steps (`JoinStepLogicalLookup`), which bind live storages, fall back to regular joins. Distributed and parallel-replicas logical plans are unaffected. The cache key is the normalized AST hash, a hash of plan-affecting settings, the current database, the user and the sorted role set. Each entry stores a dependency fingerprint over every referenced storage - discovered from the plan's `ReadFromTable` leaves plus an AST closure through view bodies (which also catches tables referenced only from scalar subqueries) - with UUID, metadata version or schema content hash (including the view body), and the row policy hash. Every hit revalidates all dependencies and re-checks `SELECT` access; `DROP`/`CREATE`, `ALTER`, view redefinition (including nested views) and row policy changes invalidate transparently. Eligibility: non-deterministic functions anywhere (including expanded view bodies) make a query uncacheable; `arrayJoin` is exempt since it is pure. Scalar subqueries are evaluated during analysis and baked into the plan as constants, so they are gated behind the new opt-in setting `query_plan_cache_allow_scalar_subqueries`. Table functions, temporary tables, system tables (except `system.one`), remote/`Merge` storages and `SQL SECURITY DEFINER` views are rejected. Plans with unserializable steps (e.g. window functions) execute normally and are simply not stored. Also fixes a pre-existing gap in `ActionsDAG` serialization exposed by this work: lambda capture DAGs reference enclosing constants as input nodes carrying a constant column (`addInputConstantColumnIfNecessary`), but the serializer dropped columns for all input nodes, so functions requiring constant arguments (e.g. `tupleElement` in `arrayMap(t -> t.2, ...)`) failed to re-resolve on deserialization. Input-node constant columns are now serialized; the format change is self-describing via the per-node flag. Cacheable plans are built with `compile_expressions = 0` because JIT-fused function nodes cannot be serialized; the JIT still applies when pipelines are built from the plan. User interface: settings `allow_experimental_query_plan_cache`, `enable_query_plan_cache`, `query_plan_cache_allow_scalar_subqueries`, `query_plan_cache_size_in_bytes_quota`; server settings `query_plan_cache.max_size_in_bytes` and `query_plan_cache.max_entries`; `SYSTEM DROP QUERY PLAN CACHE`; profile events `QueryPlanCacheHits` (counted only when a cached plan is actually used), `QueryPlanCacheMisses`, `QueryPlanCacheValidationMisses`, `QueryPlanCacheStaleMisses`; metrics `QueryPlanCacheBytes`, `QueryPlanCacheEntries`. Proven on DOOMbench (clickhouse/doombench, a raycasting engine in pure SQL whose `screen` view is a deep chain of views over joins, unions, array joins and scalar subqueries; planning dominated its render cost). Local server, single connection, 64x32 map, 128x64 viewport: render (frame) cache OFF: 551-577 ms median render (frame) cache ON: 227-246 ms median (2.4x) render cache ON, max_threads=4: 148-176 ms median (3.3x) with frames byte-identical between cache off/miss/hit, state mutations from the game tick visible on every cached render (HTAP freshness), nested view redefinition picked up immediately, and `rand()` inside a view body refusing to cache. The remaining hit-side cost is dominated by actual execution (aggregation of the pixel grid), not planning: the analyzer/tree-hash storm visible in profiles before (25.8k samples in `IQueryTreeNode::getTreeHash`) drops to noise (174 samples). Related: #99309 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
3629ad4 to
9d3a72e
Compare
|
Workflow [PR], commit [e12f472] Summary: ❌
AI ReviewSummaryThis PR adds an experimental query plan cache for analyzer-based Findings❌ Blockers
Final VerdictStatus: ❌ Block Minimum required actions:
|
…n setting Two related fixes to query plan cache eligibility and keying for scalar subqueries: - `query_plan_cache_allow_scalar_subqueries` is no longer ignored when computing the cache key. An entry stored while it was enabled could otherwise be reused by a later execution that explicitly disabled it, reusing a stale baked scalar value. The setting now participates in the key, so the two modes do not share an entry. - In the view-body eligibility walk (`ASTDependencyCollector`), only the right-hand argument of `x IN (subquery)` is the set (re-executed on every run). The left operand is an ordinary expression, so a subquery there is a scalar subquery folded into a constant and must be gated like any other scalar subquery. Previously every child of a set function was marked set-position, which let a left-operand scalar subquery escape the `query_plan_cache_allow_scalar_subqueries` gate inside view bodies. Also remove the unused `QueryPlanCacheHits` event extern in `QueryPlanCache.cpp` (the event is incremented by the caller in `executeQuery.cpp`), which the style check flagged as defined but not used. Adds `04337_query_plan_cache_scalar_subquery_gating`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…naffected `StorageView::readImpl` derived `options.cacheable_logical_plan` from `query_info.build_logical_plan`, but `build_logical_plan` is also set for the existing distributed and parallel-replica logical-plan paths. That made view subqueries use the cacheable mode (recursive view expansion, no storage-bound `JoinStepLogicalLookup`) even outside the experimental query plan cache, changing behavior for those paths. Propagate a separate `cacheable_logical_plan` bit through `SelectQueryInfo`, set it only on the cache path in `buildQueryPlanForTableExpression`, and read it in `StorageView::readImpl`. Distributed and parallel-replica logical plans set `build_logical_plan` but not `cacheable_logical_plan`, so their view expansion keeps its previous, non-cacheable behavior. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The query plan cache is a single, server-wide cache. `04324` and `04337` inspect it through the global `SYSTEM DROP QUERY PLAN CACHE` and assert exact `QueryPlanCacheHits` counts between two runs of the same query. That only holds when the test runs in isolation: concurrent queries (including the flaky check, which runs several copies of the same test in parallel) drop or evict the shared entries, turning an expected hit into a miss. The same assertions depend on the exact logical plan, so random plan-affecting settings change whether a query is cacheable, and the cache only engages with the analyzer. Tag both tests `no-parallel`, `no-random-settings`, `no-random-merge-tree-settings` and `no-old-analyzer`. This fixes the `Server died` failures in the flaky checks (the flaky check aborted after `--max-failures` and force-killed the running test processes, which is reported as `Server died`), the old-analyzer result diff, and the random-settings result diffs in the regular runs. #107125 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The requested `QueryProcessingStage` was not part of the cache key, but the cache could be used for any stage the client requests. A non-`Complete` stage such as `with_mergeable_state` changes the logical plan contract and the output header (intermediate aggregate states vs. finalized values), so an entry first cached under one stage could be reused for an identical query later executed under another, returning a plan built for the wrong contract. Only enable the query plan cache when `stage == QueryProcessingStage::Complete`. #107125 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After merging master, prefixes 04324 and 04337 collide with tests added on master (the 04324_* and 04337_* groups). Renamed the query plan cache tests to the next free prefixes: - 04324_query_plan_cache_complex_queries -> 04356_query_plan_cache_complex_queries - 04337_query_plan_cache_scalar_subquery_gating -> 04357_query_plan_cache_scalar_subquery_gating #107125 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After merging master, the query plan cache tests `04356`/`04357` collided with new tests added on master at those prefixes (master now reaches `04408`). Renumber them to the next free prefixes so they no longer clash. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| IASTHash getRowPolicyHash(const ContextPtr & context, const String & database, const String & table) | ||
| { | ||
| IASTHash row_policy_hash{}; | ||
| auto row_policy = context->getRowPolicyFilter(database, table, RowPolicyFilterType::SELECT_FILTER); |
There was a problem hiding this comment.
row_policy_hash only covers the combined filter expression, but cache hits later replay entry.used_row_policies into system.query_log. If a policy is replaced or renamed with another policy that has the same effective expression, validation passes and hits keep logging the old policy names. That leaves audit/query-log metadata stale even though the row filter is equivalent.
Please include the policy identities/full names in the dependency fingerprint, or recompute the used row-policy names on each hit when the expression hash matches.
There was a problem hiding this comment.
The row filter itself stays correct on a hit (the filter-expression hash still matches), so this is an audit-metadata accuracy issue: a hit replays the stored used_row_policies names into system.query_log, which can be stale after a same-expression rename/replace. The fix is to fold the sorted policy full names into the dependency fingerprint (or recompute the used-policy names on each hit when the expression hash matches). Since it affects only logged policy names, not results or access, I have left it as a follow-up. Leaving open for your decision.
| /// falling through to normal planning. | ||
| checkAccessForQueryPlanCacheHit(*cached_entry, context); | ||
|
|
||
| auto plan = materializeCachedQueryPlan(cached_entry->serialized_plan, context); |
There was a problem hiding this comment.
Validation and materialization are not tied to the same storage snapshot. validateQueryPlanCacheEntry resolves dep.database/dep.table and checks the UUID/metadata, but materializeCachedQueryPlan then deserializes the plan and resolveStorages resolves the table name again and takes the lock there. If an Atomic table is dropped and recreated between these calls, validation can pass for UUID A and the plan can execute against UUID B, including an engine/schema that was never validated or eligible for caching. That breaks the DROP/CREATE invalidation contract.
Please either carry the validated StoragePtr/snapshot/lock into materialization, or include the expected UUID in the ReadFromTable placeholder and have resolveStorages reject/replan if the resolved storage no longer matches.
There was a problem hiding this comment.
This is a genuine but narrow TOCTOU: a concurrent Atomic DROP/CREATE between validateQueryPlanCacheEntry and resolveStorages could validate UUID A and then execute against UUID B. The robust fix is to bind resolution to the validated identity - serialize the expected UUID into the ReadFromTable placeholder and have QueryPlan::resolveStorages reject/replan on a UUID mismatch. That changes the plan serialization format, which is also used by distributed plan shipping, so I have left it for a dedicated change rather than risk it in this round. Leaving this thread open for your decision.
|
|
||
| /// Unresolvable names are not necessarily errors: the identifier may refer to a CTE. | ||
| /// Skipping them is safe - a CTE is not a storage, so it cannot go stale. | ||
| auto storage = DatabaseCatalog::instance().tryGetTable(table_id, context); |
There was a problem hiding this comment.
This raw AST dependency walk resolves every table-looking identifier through DatabaseCatalog, but it does not track in-scope CTE names. For example, WITH cte AS (SELECT 1) SELECT * FROM cte while a real default.cte table exists will add default.cte as a dependency even though the analyzed plan reads the CTE, not that table. On hits this can require SELECT on an unrelated table, invalidate on unrelated metadata changes, or reject caching because the shadowed table uses an unsupported engine.
Please track CTE aliases, including view-local CTEs, and skip those identifiers before catalog lookup, or derive these extra dependencies from the analyzed query tree instead of raw AST names.
There was a problem hiding this comment.
Note that the current behavior here is fail-close (over-strict), not fail-open: a CTE name that shadows a real table adds a spurious dependency, which can trigger an unnecessary access check or invalidation, or refuse to cache - but it never drops a real dependency. A precise fix must track CTE names per scope, because a global "skip names that match a WITH alias" would risk the opposite failure (skipping an identifier that is actually a real table read = a missed dependency, i.e. a real access/staleness hole). Given that risk and that the current direction is safe, I have left the precise scope-tracking fix for your decision rather than introduce an imprecise one.
…ProfileEvents conflict (keep both sides' events)
…story.cpp The `26.6` release was cut, so `02995_new_settings_history` now compares the current settings against the `02995_settings_26_6_1.tsv` reference and requires every setting added after `26.6.1` to be recorded under a version block greater than `26.6`. The four query plan cache settings (`allow_experimental_query_plan_cache`, `enable_query_plan_cache`, `query_plan_cache_allow_scalar_subqueries`, `query_plan_cache_size_in_bytes_quota`) were still under the `26.6` block, so the `Fast test` job reported `PLEASE ADD THE NEW SETTING TO SettingsChangesHistory.cpp: ... WAS ADDED`. Move them to the `26.7` block. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=107125&sha=fd0fd283b9bbf13a6e90f4e68627c5f95628650e&name_0=PR&name_1=Fast%20test Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The new `docs/en/operations/query-plan-cache.md` was a "floating" page that no sidebar referenced, which fails the `Build docusaurus` check. Link it from `docs/en/operations/caches.md` (mirroring `query-condition-cache.md`) and add the `SYSTEM CLEAR|DROP QUERY PLAN CACHE` statement to the system docs, which also gives the page an inbound reference.
…idation Fixes several correctness and security findings on the experimental query plan cache (PR review of #107125): - Fall back to the normal interpreter for ineligible queries. The cacheable logical-plan execution path expands view bodies at plan time and resolves their leaf reads under the invoker context; that is wrong for `SQL SECURITY DEFINER`/`NONE` views. Previously an ineligible query (or one with an unsupported dependency) was not stored but was still executed through the cacheable plan, which could change its result or security context. Now `tryInterpretWithQueryPlanCache` returns `nullptr` so the normal interpreter runs, and the cacheable plan is executed only for queries that are actually eligible. - Reject `SQL SECURITY NONE` views, not only `DEFINER`. Both run the view body under an overridden context the cache cannot replay, so only `INVOKER` views (and views with no explicit security) are eligible. The engine/view-security eligibility check is extracted into `isStorageEligibleForPlanCache`. - Re-check storage eligibility during validation. In UUID-less databases a `DROP`/`CREATE` can swap in an unsupported engine (`Distributed`/`Merge`) or change a view's SQL security while the schema content hash stays equal; a hit could then run against a storage a miss would have refused. Validation now re-runs `isStorageEligibleForPlanCache`. - Extend the fallback schema fingerprint (`computeSchemaHash`) with table constraints (an `ASSUME` constraint can rewrite predicates during analysis) and, for views, the SQL security type and definer, so changing any of them invalidates a stale entry. - Fix the access-control dependency model. Duplicate dependencies for one storage (self-joins, a table read in both the outer plan and a subquery) are now merged by unioning their column sets instead of keeping one occurrence, so revoking `SELECT` on any read column denies a later hit. A table reached only through a scalar subquery is folded into a constant and has no plan leaf, so its columns are unknown (`columns_unknown`); such a dependency now requires table-level `SELECT` on a hit, because a column-level recheck could pass while the baked plan still reads a revoked column. Regular view-body tables remain plan leaves with exact columns and are unaffected.
…urity - `04492_query_plan_cache_access_control`: a self-join's column requirements are unioned across both plan leaves (revoking `SELECT` on a column read by either leaf denies the next hit), and a scalar-subquery table requires table-level `SELECT` on a hit (column grants alone are not enough). - `04493_query_plan_cache_view_security`: a `SQL SECURITY NONE` view query is never cached (falls back to normal execution), and changing a view from `INVOKER` to `SQL SECURITY NONE` invalidates a cached entry.
|
Pushed Docs build ( AI review findings — addressed (commit
Left open for your decision (replied on the threads, not resolved):
CI — the |
…heable The eligibility check in `isStorageEligibleForPlanCache` rejects any view whose `sql_security_type` is set to anything other than `INVOKER`, so both `SQL SECURITY DEFINER` and `SQL SECURITY NONE` views are ineligible. The docs listed only `SQL SECURITY DEFINER` as excluded, which incorrectly implied `SQL SECURITY NONE` views could use the plan cache. State the rule directly: only `SQL SECURITY INVOKER` views are eligible. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Status update on the AI-review " Addressed
Row-policy audit metadata (Major) — the verdict asks to "fix or explicitly scope out". This was explicitly scoped out on the thread: a same-expression policy rename/replace only makes the UUID validation→materialization TOCTOU (Blocker) — left as your decision. The window is strictly between I did not push a fix for this here: the robust fix touches the query-plan serialization format shared with distributed plan shipping (your stated reason for deferring), and the alternative is a real change to the shared
CI — the only red checks are |
… on a hit A view is expanded into the cached logical plan and leaves no `ReadFromTable` leaf for the view storage itself, so `collectQueryPlanCacheDependencies` could not record the columns selected from the view. The view dependency therefore had an empty `columns` set with `columns_unknown = false`, and on a hit `checkAccessForQueryPlanCacheHit` only required `SELECT` on any one column of the view. That is weaker than a miss, which checks `SELECT` on the precise selected view columns in `prepareBuildQueryPlanForTableExpression` before expanding the view: after warming `SELECT a FROM v` while granted `SELECT(a)` on `v`, revoking `SELECT(a)` while keeping `SELECT(b)` would let the hit still return `a`. Mark view-only AST dependencies `columns_unknown = true`, exactly as already done for scalar-subquery tables (whose columns are likewise unrecoverable from the plan): a hit now requires table-level `SELECT` on the view, which is at least as strict as the per-column access a non-cached query checks. The view's body tables are still recorded with their precise columns as ordinary plan leaves, so only the recheck of the view storage itself is tightened. Adds test `04494_query_plan_cache_view_column_access` (revoke a selected view column between the miss and the hit) and documents the conservative behavior. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Status update on the AI-review ❌ Block verdict (now head Fixed — view-column privilege recheck on a hit (the new blocker, This was a genuine fail-open access gap and was not one of the previously-deferred items, so I fixed it in I took the reviewer's second option (require table-level
Note this is the deny-on-hit direction, identical to the accepted scalar-subquery behavior: a query that only a column-level grant on a view makes runnable is now denied on a hit (rather than served). I documented this in Regression: Still open by your decision (unchanged from the previous round)
CI — the only reds on the prior commit were |
| } | ||
| in_set_or_table_position = false; | ||
| } | ||
| else if (!ast.as<ASTWithElement>()) |
There was a problem hiding this comment.
ASTWithElement represents a table CTE, but this branch preserves the caller's positional context before visiting the CTE subquery. At the top level that context is false, so WITH cte AS (SELECT ... FROM t) SELECT ... FROM cte is treated as if the CTE body were a scalar subquery: with the default query_plan_cache_allow_scalar_subqueries = 0 the query is never cached, and with the setting enabled the tables inside the CTE are marked columns_unknown and later require table-level SELECT even though the plan leaves know the actual columns.
Please traverse ASTWithElement::subquery in table/set position, the same way FROM (subquery) is handled, and add a regression that a table CTE can hit the plan cache without enabling scalar subqueries and without upgrading column-level grants to table-level access.
There was a problem hiding this comment.
The observation is correct: a table CTE currently reaches the scalar-subquery branch, so with the default query_plan_cache_allow_scalar_subqueries = 0 such a query is not cached. But the suggested fix (visit ASTWithElement::subquery in table/set position) is not safe on its own, for the same reason the CTE-name walk is left open in #107125 (comment).
The raw AST walk records no column list — visitTableIdentifier sets only UUID/metadata/row-policy and the columns_unknown flag, not dep.columns. On a hit, checkAccessForQueryPlanCacheHit then picks the recheck from that flag:
columns_unknown = true→ table-levelSELECTrecheck (fail-close).columns_unknown = falsewith an empty column list → falls through to the weakest "SELECT on at least one column" path.
Consider a table CTE referenced only in scalar position, e.g. WITH cte AS (SELECT max(x) FROM t) SELECT (SELECT * FROM cte). Here t is folded into a constant during analysis and has no plan leaf, so its only dependency comes from this AST walk. Visiting the CTE body in table position would set columns_unknown = false, downgrading the hit recheck from table-level SELECT to "at least one column". A user granted SELECT(a) but not SELECT(x) would then pass on a hit, while the miss path (which checks SELECT(x) on the folded column) correctly denies — a missed-access hole, exactly the opposite failure flagged on the CTE-name walk.
A correct fix has to distinguish table-position from scalar-position CTE references (per-scope CTE usage tracking), which is the same reserved work as r3470391284. Leaving this open for the same decision.
| if (dep.is_view) | ||
| query_context->addViewAccessInfo(storage_id.getFullTableName()); | ||
| else | ||
| query_context->addQueryAccessInfo(storage_id, dep.columns); |
There was a problem hiding this comment.
For columns_unknown dependencies, collectQueryPlanCacheDependencies clears dep.columns, and this replay path then records the table with an empty column list. Context::addQueryAccessInfo only inserts the columns it is given, so a hit for a query such as SELECT (SELECT b FROM t) logs t in query_tables but omits t.b from query_columns, whereas the miss path records the actual scalar-subquery column during analysis.
Since this is audit/query-log metadata, under-reporting the columns read by a cached hit is unsafe. Please either keep precise columns for logging separately from the stricter access recheck, or log a conservative set such as all columns from the validated metadata whenever columns_unknown is true.
There was a problem hiding this comment.
Correct that a columns_unknown hit logs the table into query_tables without its columns in query_columns, so a column read only through a scalar subquery (or a view) is under-reported relative to the miss path. This is audit/query-log metadata only: it does not affect results or the access recheck, which for a columns_unknown dependency is already the stricter table-level SELECT (checkAccessForQueryPlanCacheHit), not a column-level one.
It is the same class as the row-policy audit-metadata gap in #107125 (comment), which is left as a follow-up. The conservative fix suggested here (log all columns from the validated metadata whenever columns_unknown is true) fits that same follow-up. Leaving open for your decision.
|
Pushed CI on the previous head
Open AI-review " The two remaining "minimum required actions" are, in my reading, both fail-close / audit-only — the same class as the items you already deferred (
New CI is running on |
A `SELECT` row policy filter may itself contain a subquery, e.g. `USING x <= (SELECT max(v) FROM policy_limits)`. `buildFilterInfo` analyzes such a filter into the plan, folding the subquery result (a scalar boundary, or an `IN` set) into a baked constant. But the plan cache dependency collector walks only the user query AST and the plan leaves: the table read by the policy subquery never becomes a `ReadFromTable` leaf, nor an entry of the AST closure walked by `ASTDependencyCollector`. Its data therefore is not part of the entry's dependency fingerprint, so a cache hit could keep enforcing a stale row boundary after that table changes, exposing or denying rows incorrectly. The policy filter also escaped `query_plan_cache_allow_scalar_subqueries`, which is applied only to the user query AST. Make any plan over a table whose effective `SELECT` row policy contains a subquery uncacheable. The check lives in `fillDependency`, the single choke point used by both the plan-leaf and AST-closure dependency walks, and is repeated in `validateQueryPlanCacheEntry` as defense in depth. A plain row policy without a subquery stays cacheable. Adds `04495_query_plan_cache_row_policy_subquery`, which changes the table read by a row-policy subquery between runs and asserts the query is never cached (and stays fresh), while a plain policy is still cached. Addresses review: #107125 (comment)... Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…t charges `QueryPlanCache::set` added the new entry's weight to `per_user_bytes` and recorded `entry_weights` only after `Base::set` returned. But a newly inserted entry lands in the SLRU probationary queue and can be evicted synchronously inside `Base::set` (through `removeOverflow`) when the cache is full of protected entries. That eviction runs `onEntryRemoval`, which had nothing to subtract yet, so the subsequent post-`set` accounting left a charge for a key that was no longer resident. Such a ghost charge could make `query_plan_cache_size_in_bytes_quota` reject future entries for the user until the same key was inserted again or the cache was cleared. Charge the entry (and record `entry_weights`) BEFORE `Base::set`, so a synchronous eviction's `onEntryRemoval` finds and undoes the charge and the accounting stays balanced whether or not the entry survives insertion. Roll back the pre-charge if `Base::set` throws. The three copies of the "subtract a weight from a user's bucket" logic (same-key replacement, rollback, `onEntryRemoval`) are folded into a `decrementUserBytes` helper. The oversized-entry early return is kept as an optimization to avoid pointless insert/evict churn. Addresses review: #107125 (comment) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Pushed
Both C++ files pass a clang The remaining unresolved threads are your explicit deferrals/dismissals, unchanged: the validate/materialize UUID TOCTOU ( |
`QueryPlanCache::canStoreForUser` compared the user's total resident bytes against `query_plan_cache_size_in_bytes_quota` without accounting for the entry it is about to replace. Because a same-key insertion is a replacement (`set` releases the old entry's weight before charging the new one), a user already at the quota could never refresh an invalidated plan: after e.g. an `ALTER` fails validation and the query replans, the identical-key `set` was rejected while the stale entry stayed resident, so every later run re-paid validation and replanning without updating the cache. Subtract the existing key's tracked weight before the quota comparison, mirroring the same-key bookkeeping already done in `set`. Regression test `04496_query_plan_cache_quota_refresh`: with the quota sized to one plan, an invalidated entry refreshes and the next run hits. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Pushed `18e02892983`: fixed the per-user quota admission bug flagged in `QueryPlanCache.cpp` (the new bot finding). The remaining 5 unresolved threads are the design/scope calls you reserved ( CI: the sole red is |
…edundant --database
The regression test added `--database="$CLICKHOUSE_DATABASE"` to the
`clickhouse-client` invocations that run as the test user, but `$CLICKHOUSE_CLIENT`
already sets `--database`, so the client rejected every such query with
`option '--database' cannot be specified more than once`. As a result `W` and
`QUOTA` were both measured as `0` and every subsequent run errored out, leaving
empty `-- hits:` lines that did not match the reference.
The queries are fully qualified (`${CLICKHOUSE_DATABASE}.t_quota`), so the extra
`--database` is unnecessary. Drop it to match the sibling tests
`04494`/`04495`, which run as a user with `$CLICKHOUSE_CLIENT --user="$user" $SETTINGS`.
CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=107125&sha=18e0289298368d5ce1223b97434e02c5548e155b&name_0=PR&name_1=Fast%20test
PR: #107125
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
|
||
| void QueryPlanCache::clear() | ||
| { | ||
| Base::clear(); |
There was a problem hiding this comment.
clear can race with a concurrent set and leave a resident entry completely unaccounted for.
A concrete interleaving is:
- Query A gets past
canStoreForUser, recordsper_user_bytes/entry_weightsat lines 173-195, and then blocks inBase::set. SYSTEM DROP QUERY PLAN CACHErunsBase::clear()here and then erases both maps below.- Query A resumes
Base::setand inserts the new entry, but its accounting was just wiped.
After that, the entry stays resident with no entry_weights record and no per-user charge, so query_plan_cache_size_in_bytes_quota can be undercounted until the same key is inserted again or the whole cache is cleared. Context::clearQueryPlanCache explicitly allows this to happen concurrently, so this is not just theoretical.
Please synchronize clear with the pre-charge/insert path, or make set detect that a clear happened between charging and insertion and restore/remove the accounting before returning.
There was a problem hiding this comment.
Confirmed, this is a real race. set charges per_user_bytes/entry_weights inside a per_user_mutex block that it releases before Base::set, while clear wipes those maps under per_user_mutex after Base::clear. When clear's Base::clear runs before our Base::set, and our pre-charge runs before clear's map-wipe, the entry ends up resident with no entry_weights record and no per-user charge, so query_plan_cache_size_in_bytes_quota gets undercounted until that key is re-inserted or the cache is cleared again. The cached plan itself stays valid — no wrong results and no access impact — and the drift is bounded by the cache size and self-heals, but the hole is genuine.
Both obvious fixes are traps here:
- Holding
per_user_mutexacrossBase::set(or acrossBase::clear) inverts the lock order and deadlocks:onEntryRemovalacquiresper_user_mutexwhileCacheBasealready holds its own mutex (synchronous eviction insideset), so the discipline isCacheBase::mutex→per_user_mutex. - A post-
set"generation changed → re-charge" check cannot tell the orphan ordering (Base::clearbefore ourBase::set, entry resident) from the opposite one (Base::clearafter ourBase::set, entry already removed, sinceCacheBase::cleardoes not runonEntryRemovalper entry). Distinguishing them needs a residency check, which needsCacheBase::mutexunderper_user_mutex— deadlock again — and re-charging blindly reintroduces exactly the ghost charge the pre-charge structure was added to avoid.
A correct fix needs charge+insert to be atomic under the same lock that guards residency — e.g. a CacheBase on-insert counterpart to onEntryRemoval, so admission accounting happens inside the CacheBase-locked region rather than in a separate per_user_mutex critical section around Base::set. That is a design change to the accounting model.
Since the feature is off by default and the impact is a bounded, self-healing quota undercount, leaving this open for your decision alongside the other lock/format items rather than rushing an untestable concurrency change.
LLVM Coverage ReportChanged lines: Changed C/C++ lines covered: 592/694 (85.30%) · Uncovered code |

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Add an experimental query plan cache supporting complex
SELECTqueries: views (including nested views), joins,UNION,INsubqueries and multiple tables over any table engine. Whenallow_experimental_query_plan_cache = 1andenable_query_plan_cache = 1, repeated identical queries skip query analysis and logical planning (the query is still parsed) while still executing against current data.SYSTEM DROP QUERY PLAN CACHEclears the cache; eventsQueryPlanCacheHits/QueryPlanCacheMisses/QueryPlanCacheValidationMisses/QueryPlanCacheStaleMissesand metricsQueryPlanCacheBytes/QueryPlanCacheEntriesprovide observability.Documentation entry for user-facing changes
Description
An experimental query plan cache for
SELECTqueries that supports complex queries — views (including nested views), joins,UNION,INsubqueries and multiple tables over any table engine. It complements #99309, which covers single-tableMergeTreequeries with a universalize/materialize design; this PR takes a different approach to reach query shapes where planning cost is actually concentrated (deep view chains, many joins).Unlike the query result cache, a plan cache hit still executes the query: only query analysis and logical planning are skipped (the query is still parsed, since the cache key is built from the parsed AST), so hits are transactionally consistent and always see current data.
Design
On a miss, the plan is built in the planner's existing
build_logical_planmode (the machinery used by distributed plan shipping), with a newSelectQueryOptions::cacheable_logical_planmode that makes the logical plan self-contained:ReadFromTableplaceholders for any engine.StorageView::readImplviaSelectQueryInfo::build_logical_plan), so the cached plan embeds the analyzed view bodies instead of deferring the expensive expansion to execution.JoinStepLogicalLookup), which bind live storages into the plan, are not used; a regular join is planned instead and the physical join algorithm is chosen at materialization. Distributed and parallel-replicas logical plans do not set this mode and are unaffected.compile_expressions = 0(JIT-fused function nodes cannot be serialized; the JIT still applies when pipelines are built from the plan).Both hits and misses execute through
QueryPlan::resolveStorages, which re-binds every leaf to a fresh storage snapshot.Cache key and invalidation
The key is the normalized AST hash, a hash of plan-affecting settings, the current database, the user and the sorted role set. Each entry stores a dependency fingerprint over every referenced storage, discovered from the plan's
ReadFromTableleaves plus an AST closure through view definitions (which also catches tables referenced only from scalar subqueries): UUID, metadata version (or a schema content hash including the view body), and the row policy hash. Every hit revalidates all dependencies and re-checksSELECTaccess, soDROP/CREATE,ALTER, view redefinition (including nested views), row policy changes and permission revocation are handled transparently.Eligibility
arrayJoinis exempt: it is multi-valued but pure).query_plan_cache_allow_scalar_subqueries.system.one), remote/Mergestorages andSQL SECURITY DEFINERviews are rejected.Performance, proven on DOOMbench
DOOMbench is a DOOM-like raycasting engine implemented entirely in SQL; its
screenquery renders a frame through a deep chain of views over joins, unions, array joins and scalar subqueries. Ported to ClickHouse SQL, the workload is query-analysis-bound: profiling showedIQueryTreeNode::getTreeHashand analyzer resolution dominating, andSELECT ... LIMIT 0was as slow as the full query. With the plan cache (local server, single connection):max_threads = 4On hits the analyzer cost disappears from profiles (
getTreeHash: 25.8k samples → 174); the remaining time is genuine execution. Correctness validated on the same workload: frames are byte-identical between cache off/miss/hit, game-state mutations are visible on every cached render (HTAP freshness), nested view redefinition is picked up immediately, andrand()inside a view body refuses to cache.Notes
ActionsDAGinput-constant serialization fix Preserve constant columns ofActionsDAGinput nodes in serialization #107124 as its first commit; it will be rebased once that PR is merged.count()) are not applied to cache-enabled queries.Related: #99309
Related: #107124