Speed up the analyzer / planner for SELECT count() and similar lightweight queries#104513
Conversation
|
Workflow [PR], commit [cedbaba] Summary: ❌
AI ReviewSummaryThis PR reduces analyzer/planner overhead for lightweight queries by making table-expression column structures lazy, memoizing Final Verdict
|
Address review feedback on PR ClickHouse#104513: the per-Counters `any_trace_in_chain` short-circuit in `Counters::increment` was inherited from the immediate parent's flag at attach time. If `setTraceProfileEvents` is later called on an ancestor, the ancestor walks up via `markChainTracing` but never reaches already-attached descendants, so they keep `any_trace_in_chain == false` and skip the per-event trace-bit check. Make `inheritTracingFromParent` walk the full parent chain at attach time and copy the flag from the first ancestor that has it set, so any tracing state present at the moment of attach is captured. We accept the remaining race -- a child whose attach happens before its ancestor calls `setTraceProfileEvents` will still miss events on that thread until the next attach -- because the only caller is `ProcessList::insert` during query setup, before worker threads attach, so worker threads (where the hot increments happen) always see the correct flag. Closing the remaining window would require children-tracking on every `Counters` (≈24 extra bytes plus a mutex), which we want to avoid. Move the trace-toggle helpers (`setUserCounters`, `setParent`, `setTraceAllProfileEvents`, `setTraceProfileEvent`, `markChainTracing`, `inheritTracingFromParent`) out of the header and into `ProfileEvents.cpp` so editing them does not trigger a wide rebuild of every TU that includes `ProfileEvents.h`.
Per review on PR ClickHouse#104513: `inheritTracingFromParent` only ever flips the per-counter flag to `true`, never back to `false`, so descendants kept a stale `true` after the first traced query and permanently took the slow path. Refreshing the flag at the right moment fixed correctness but the optimization only saved ~3% overall, and the chain-of-flags bookkeeping was a source of correctness bugs (this is the third try). Counter packing (`COUNTER_TRACE_BIT` in bit 63 of the atomic counter) stays — it halves counter array size and lets us read both the post-fetch value and the trace bit out of the same `fetch_add`. `increment` is now a single chain walk that does the `fetch_add`, OR-accumulates `(prev & COUNTER_TRACE_BIT)` across parents, and emits one trace at the end if any ancestor had the bit set. Removed: - `Counters::any_trace_in_chain` field - `markChainTracing()`, `inheritTracingFromParent()`, `refreshTracingFromParent()`, `anyTraceInChain()` - `ProcessList::insert` no longer refreshes the flag after `setTraceProfileEvents` (the workaround that closed the race is no longer needed since there's no flag to be stale) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per review on PR ClickHouse#104513: `std::memset` over an array of `std::atomic<Count>` is technically UB per the standard, even though it works on every platform we ship (where the atomic is lock-free with the same layout as `Count`). Wrap the `memset` in `if constexpr (std::atomic<Count>::is_always_lock_free && same size/alignment)` and provide a relaxed-store loop fallback so the code is well-defined on any future target where that no longer holds. The fast path is unchanged on x86_64 / AArch64 / RISC-V64.
Reverts the changes from PR ClickHouse#104513: 3ed1678 Guard resetCounters memset behind a constexpr lock-free check 285c33c Drop the any_trace_in_chain gate, keep the counter packing 70241ae Refresh tracing flag on the calling thread after setTraceProfileEvents 4f549c2 Walk parent chain when inheriting any_trace_in_chain ea92033 Whitelist ProfileEvents::COUNTER_TRACE_BIT and COUNTER_VALUE_MASK in style check 1b0194c Pack ProfileEvents counter into 8 bytes and gate trace check on a chain bit Packing the trace flag into bit 63 of each counter required every external reader of `Counters::operator[]` to mask the value with `COUNTER_VALUE_MASK`. Reviewers found more than one site that still read the raw packed value (e.g. `ExecutionSpeedLimits::throttle` and `MetricsTransmitter::transmit`), and the failure mode is silent: the trace bit only flips on when `setTraceProfileEvent(s)` is called, so the bug stays dormant until tracing is enabled. The optimization saved only ~3% on lightweight queries and was not worth the long tail of correctness traps it created. Restores `src/Common/ProfileEvents.{h,cpp}`, `src/Interpreters/ProcessList.cpp`, the six external readers that were updated to apply the mask, and the style-check whitelist to their state on master.
In `deserializeOffsets`, hoist `num_trailing_defaults` and `has_value_after_defaults` into locals so the inner loop stops writing two state fields to memory every iteration in the steady state where they remain 0/false. Drop the now-redundant `!offsets.empty()` guard when computing `start_of_group` (it can only be empty while `first` is still true). In `readOrGetCachedSparseOffsets`, compute the cache key directly from `settings.path` and skip the heavy `Substream` construction in `path.push_back/pop_back` on cache hits, and reuse the same key for `cache->insert_or_assign` on cache misses instead of recomputing it inside `addElementToSubstreamsCache`. Both changes preserve existing behaviour; the cache key matches what `getSubcolumnNameForStream` would produce after the original push. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`StorageSnapshot::getColumns(All|Subcolumns|...)` is called from `QueryAnalyzer::initializeTableExpressionData` for every table in every query. For a wide table such as `hits` it spent ~3% of the per-query TCPHandler CPU on two avoidable things: * Building a `NameSet` over every regular column of the table just to filter out a handful of virtuals. With ~110 columns and ~5 virtuals, a linear `find` per virtual is cheaper than hashing all 110 names. * Calling `VirtualColumnsDescription::getSampleBlock` to obtain names/types of virtuals. That constructs a full `Block`, which calls `IDataType::createColumn()` per virtual just so the result can be destructured into a `NamesAndTypesList` immediately after. Add `VirtualColumnsDescription::getNamesAndTypes(kind, place)` that emits the same names/types directly, fast-path the `metadata->virtuals.empty()` case in `StorageSnapshot::getColumns`, and do a linear de-dup against `all_columns` instead of building a hash set. Measured on clickbench Q0 `SELECT count() FROM hits`: * ~995 -> ~1019 QPS (+2.4%) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For wide tables (e.g. ClickBench `hits`, ~100 columns), populating the `name -> ColumnNode` map for every column on every query during `initializeTableExpressionData` is the dominant analyzer cost when the query touches no column identifiers (`SELECT count() FROM t`). Build only the cheap membership-check sets (`column_names`, `column_identifier_first_parts`) eagerly and defer the `ColumnNode`-allocating map to first use via a `populate_column_node_map` closure. The closure also takes over ALIAS column resolution that the old eager path performed inline, so it captures the storage snapshot and the parent `IdentifierResolveScope` and runs only when an identifier actually requires a `ColumnNode`. `column_name_to_column_node` becomes `std::optional<map>` so `has_value()` doubles as the populated flag. `ensureColumnNodeMapIsPopulated` emplaces the empty map up front so re-entrant lookups during alias expression resolution (e.g. `y ALIAS x + x*2` looking up `x`) see the placeholder ColumnNodes the populator has already inserted. Subqueries / unions still populate eagerly: their projection lists are small and the map is needed immediately by the surrounding resolution. ClickBench Q0 (`SELECT count() FROM hits`): ~1019 -> ~1216 QPS (+19%). Wider impact wherever a query references few columns of a wide table.
After making `column_name_to_column_node` lazy, perf showed `initializeTableExpressionData` still spending ~1.4% inserting `column_names` strings and ~1.0% constructing `Identifier` objects to populate `column_identifier_first_parts` for every column of the table. These two sets are membership-check helpers used only when an identifier is being bound to a table expression. Trivial queries like `SELECT count() FROM t` never consult them, so move their construction into `ensureColumnMembershipSetsArePopulated()`, called on demand from `hasFullIdentifierName`, `canBindIdentifier`, `tryGetSubcolumnInfo`, and the JOIN USING parent-scope check. `ensureColumnNodeMapIsPopulated` calls the membership-set populator first so any code path that needs the `ColumnNode` map continues to see a fully built table expression data. ClickBench Q0 (`SELECT count() FROM hits`): ~1216 -> ~1350 QPS (+11%).
Analyzer and planner each call `StorageSnapshot::getColumns` once per query with the same option set (`All|Virtuals|Subcolumns` from the analyzer, `AllPhysical|Subcolumns` from the planner). Each call iterates the columns multi-index and runs `addSubcolumnsToList` (one hash lookup in `columns` and one `equal_range` on `subcolumns` per row) — O(N + N*M) work on every query, ~2.7% of CPU on `SELECT count() FROM hits` for a wide table. Move the memoization to `ColumnsDescription::get`, where it lives for the lifetime of the descriptor (i.e. one StorageInMemoryMetadata version). Two queries on the same table version reuse the same cached `NamesAndTypesList`s. The cache is keyed by the four scalar fields of `GetColumnsOptions`, packed into a 4-byte struct, and a small linear search is fast enough for the handful of distinct option sets a query mix produces. Every mutator that touches `columns` or `subcolumns` (`add`, `remove`, `rename`, `modifyColumnOrder`, `flattenNested`, `addSubcolumns`, `removeSubcolumns`, copy/move assignment) calls `invalidateGetCache` so ALTER and similar paths see consistent data. ClickBench Q0 (`SELECT count() FROM hits`): ~1352 -> ~1397 QPS (+3%).
`CurrentMemoryTracker::allocImpl` and `free` are called on every allocation. Three small wins on the hot path: * Read `current_thread` once. The previous flow went through a local `getMemoryTracker()` that already loaded `current_thread`, then a follow-up `if (!current_thread)` that issued a second TLS load and branch. * Skip the unconditional store to `untracked_memory_blocker_level`. The blocker level only changes when a `MemoryTrackerBlockerInThread` is pushed/popped, which is rare; the common case wrote the same value back every allocation. * Move the global-tracker fallback into out-of-line `[[gnu::cold]]` helpers (`allocWithoutCurrentThread`, `freeWithoutCurrentThread`) so the cold path doesn't bloat the hot one. Tagged the rare branches with `[[unlikely]]`. ClickBench Q0 shows `allocImpl` self-time drop from ~0.86% to ~0.70% on the perf profile; roughly 0.15% of total CPU on the new/delete fast path.
`buildQueryPlanForTableExpression` set up the trivial-count plan and then ran an entire `Planner` over the same query (with `only_analyze`) just to read back the expected header at `WithMergeableState` and add a rename `ExpressionStep` if the trivial-count block disagreed. On ClickBench Q0 (`SELECT count() FROM hits`) that recursive planner was ~4.6% of total CPU, and the rename it produced was always the same: "alias the smallest-column name to the aggregate's identifier." Construct the trivial-count block with the aggregate's `calculateActionNodeName` directly, so its header already matches the one the outer planner expects. Then skip the recursive `Planner` whenever trivial-count was applied -- nothing else can change the header at that point, since the only steps gated on `till_stage == FetchColumns` (alias columns, where filters) don't fire after the trivial-count branch. Plan goes from `Expression(Project) -> MergingAggregated -> Expression(Change column names) -> ReadFromPreparedSource` to `Expression(Project) -> MergingAggregated -> ReadFromPreparedSource`. Three reference files updated to reflect the dropped Expression step: * `03448_trivial_count_single_threaded_merge.reference` -- one less level in `EXPLAIN PIPELINE`. * `04061_trivial_count_aggregate_function_argument_types.reference` -- the second `Header:` line came from the rename step's output and is no longer present. * `04201_trivial_count_with_additional_filter.reference` -- one less indentation level around `(ReadFromPreparedSource)`. ClickBench Q0: ~1383 -> ~1444 QPS (+4.4%).
Two clang-tidy errors reported by the arm_tidy build: 1. `src/Storages/ColumnsDescription.h`: `bugprone-copy-constructor-init`: the explicit copy / move constructors did not initialize the `IHints<>` base, leaving it default-constructed. 2. `src/Planner/PlannerJoinTree.cpp`: `performance-move-const-arg`: `std::move` on `trivial_count_column_name` had no effect because `ColumnWithTypeAndName`'s constructor takes `const String &`.
`IHints<>` is stateless, but `IHints<>(std::move(other))` plus subsequent `std::move(other.columns)` triggers `bugprone-use-after-move`. The implicit base move-construction is sufficient.
…n change Follow-up to 4c85254 (Skip the recursive Planner after trivial-count optimization), which dropped one `Expression` / `ExpressionTransform` step from the trivial-count plan in the analyzer path. Three other references were updated in that commit; this one was missed. Only the `enable_analyzer = 1` half changes (lines 9-16): one nested `Expression` layer above `(ReadFromPreparedSource)` is removed, reducing the indentation by two spaces. The `enable_analyzer = 0` half is unchanged -- the old interpreter wasn't touched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6e11028 to
1c4661d
Compare
| if (state.num_trailing_defaults >= max_rows_to_read) | ||
| /// Hoist state into locals so the inner loop avoids two redundant memory writes | ||
| /// per iteration in the common case where both fields stay 0/false. | ||
| size_t num_trailing_defaults = state.num_trailing_defaults; |
There was a problem hiding this comment.
TBH, such optimisations don't make sense to me.
There was a problem hiding this comment.
This was pure assembly analysis.
From the session with Claude:
Found the full picture. Summary of the assembly analysis from the May 7 session:
Baseline (tmp/clickhouse_baseline)
analyze-assembly.py on readOrGetCachedSparseOffsets (which has deserializeOffsets inlined into it):
- Spill density: 19.1% [high] — flagged finding: "Spills inside loop bodies multiply memory traffic by iteration count."
- The smoking gun in the inner loop body (the non-end_of_granule path):
0x1656b576: xor ecx, ecx
0x1656b578: mov byte ptr [r14 + 0x18], 0x0 ; state.has_value_after_defaults = false
0x1656b57d: mov qword ptr [r14 + 0x10], 0x0 ; state.num_trailing_defaults = 0
0x1656b585: inc r15
0x1656b588: mov r14, qword ptr [rbp + 0x18] ; reload &state
0x1656b58c: mov qword ptr [rbp - 0x158], r13 ; SPILL
0x1656b593: mov qword ptr [rbp - 0x150], rcx ; SPILL
r14 is the pointer to state. Those two stores at +0x10 and +0x18 are literally the per-iteration state.num_trailing_defaults = 0; state.has_value_after_defaults = false; lines (184–185 in the baseline source). The reload of r14 from [rbp+0x18] shows the compiler couldn't even keep &state in a register across the loop body.
Root cause
DeserializeStateSparse & state is a reference parameter. The compiler can't prove writes through it don't alias the PODArray::push_back, the ReadBuffer reads in readVarUInt, or ++skipped_values_rows. So every source-level assignment to state.foo materializes as a store, and the address itself gets reloaded.
After hoisting
tmp/clickhouse_optimized:
- Inner loop no longer touches [r14+0x10]/[r14+0x18].
- The single pair of writes happens once at function exit:
0x1656b99e: mov qword ptr [rax + 0x10], rcx ; state.num_trailing_defaults
0x1656b9a2: setae byte ptr [rax + 0x18] ; state.has_value_after_defaults
There was a problem hiding this comment.
ok, but this analysis doesn't tell us
- whether these couple of instructions were on the critical path - otherwise, we just saved a bit of throughput and no latency
- even if they were, how does it compare to the total amount of cycles each iteration takes
But anyway, I just express my opinion and don't insist on it.
There was a problem hiding this comment.
But anyway, I just express my opinion and don't insist on it.
Don't worry, it's good to push back.
whether these couple of instructions were on the critical path - otherwise, we just saved a bit of throughput and no latency
This loop to deserialize sparse columns is often one of the most, if not the most, hot function in small queries like SELECT count() FROM hits WHERE AdvEngineID != 0.
even if they were, how does it compare to the total amount of cycles each iteration takes
Comparing the steady-state non-end-of-granule path (the dominant case):
no_hoist hot-iteration sequence (0x16a789a3..0x16a789c2):
add rcx, r13 ; compute push value
mov qword ptr [rax], rcx ; offsets.push_back
add rax, 0x8
mov qword ptr [rdx + 0x18], rax ; update offsets.size_
mov byte ptr [rbp + 0x18], 0x0 ; state.has_value_after_defaults = false <-- store
mov qword ptr [rbp + 0x10], 0x0 ; state.num_trailing_defaults = 0 <-- store
inc r13 ; total_rows++
mov qword ptr [rsp + 0x38], rax ; spill rax (offsets.end_)
hoist hot-iteration sequence (0x16a789db..0x16a789f4):
add rcx, r12 ; compute push value
mov qword ptr [rax], rcx ; offsets.push_back
add rax, 0x8
mov qword ptr [rdx + 0x18], rax ; update offsets.size_
mov r14, r12
inc r14 ; total_rows++
mov qword ptr [rsp + 0x58], rax ; spill rax (offsets.end_)
Delta on this iteration of the hot loop:
- −2 store instructions (state.has_value_after_defaults = false, state.num_trailing_defaults = 0 gone)
- Iteration instruction count: 9 → 7
Whole-function counts also match: no_hoist has 7 zero-stores to state.*; hoist has 3 (the writebacks at function exit + the end-of-granule reset).
Cycle interpretation on Zen3: each removed store is ~1 µop dispatched to the store-data and store-address ports; under store-buffer pressure they cost roughly 1 cycle each, but the loop is more likely IPC-bound by the VarInt decode and PODArray-append dependency chain. So the saving is ~2 µops/iter, theoretically ~2 cycles, but in practice closer to <1 cycle of wall — consistent with the ~1% benchmark improvement.
There was a problem hiding this comment.
To clarify what I was asking: my question was specifically whether the saved instructions sit on the critical path, not whether the loop is hot. Those are different things. It very well might be that we saved instructions, but we didn't change the number of cycles per loop iteration. Because as long as there are idle ports and buffers, you can add a few instructions here and there, essentially for free. So, if you have a full asm for this loop, you can pipe it into a tool like https://uica.uops.info and see whether the cycles-per-iteration metric actually decreased. It'd be quite interesting to know.
There was a problem hiding this comment.
Assembly (generated per file, not from LTO build):
- Before: https://pastila.nl/?01d3fbfd/fa38d2fa0aa9e8fd20fdcdc7c9300b4d#OTvoSCza+2GAgZYigrgV8Q==GCM
- After: https://pastila.nl/?000d7e52/ef8fe8c022f76e8ee53b769cc9b531c5#vPIIa05X6eGJa/Z3JCzatw==GCM
Predicted Throughput: (Generated urls are too big)
- Before:
Tool | Broadwell | Skylake | Coffee Lake
-- | -- | -- | --
uiCA | 31.00 | 42.00 | 42.00
- After:
Tool | Broadwell | Skylake | Coffee Lake
-- | -- | -- | --
uiCA | 28.00 | 33.00 | 33.00
| void invalidateGetCache() const; | ||
|
|
||
| mutable std::mutex get_cache_mutex; | ||
| mutable std::vector<std::pair<GetCacheKey, std::shared_ptr<const NamesAndTypesList>>> get_cache; |
There was a problem hiding this comment.
It was a deliberate choice. It's tiny and has only a handful of entries at most, so having and iterating a small vector is faster. I'll add a comment
There was a problem hiding this comment.
Decided to test it and both with concurrency 1 and concurrency 16 and the difference is slightly better towards vector but it's too small to trust, as is within noise.
| if (state.num_trailing_defaults >= max_rows_to_read) | ||
| /// Hoist state into locals so the inner loop avoids two redundant memory writes | ||
| /// per iteration in the common case where both fields stay 0/false. | ||
| size_t num_trailing_defaults = state.num_trailing_defaults; |
This reverts commit 744a3d9.
`ColumnsDescription.h` had a stale `#include <mutex>` from when `get_cache_mutex` was `std::mutex`; it's now `DB::SharedMutex` and no inline body in the header uses `std::lock_guard` / `std::mutex`. The .cpp uses `std::lock_guard<SharedMutex>` on the exclusive sites, so include `<mutex>` explicitly there.
LLVM Coverage Report
Changed lines: 97.19% (311/320) | lost baseline coverage: 9 line(s) · Uncovered code |
2f448a2
There was a problem hiding this comment.
This looks very bad to me. It replaces common usage of the cache ISerialization::getElementFromSubstreamsCache/ISerialization::addElementToSubstreamsCache to some custom logic that can be easily become broken if common logic of cache usage is changed in those methods that are used in all other places.
the heavy
Substreamconstruction inpath.push_backon cache hits
What is so heavy in construction of this struct? Per granule it should be negligable.


Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Reduce per-query overhead for simple SELECT queries (parsing, analysis and planning). For example,
SELECT count() FROM hitsfrom a single connection is roughly 50% faster.Documentation entry for user-facing changes
Per-commit Q0 (
SELECT count() FROM hits, single connection) on a local build:Pack ProfileEvents counter into 8 bytes and gate trace check on a chain bit995+3.9%