Speed up the analyzer / planner for SELECT count() and similar lightweight queries by Algunenano · Pull Request #104513 · ClickHouse/ClickHouse · GitHub
Skip to content

Speed up the analyzer / planner for SELECT count() and similar lightweight queries#104513

Merged
Algunenano merged 14 commits into
ClickHouse:masterfrom
Algunenano:optimize-sparse-offsets-cache
Jun 1, 2026
Merged

Speed up the analyzer / planner for SELECT count() and similar lightweight queries#104513
Algunenano merged 14 commits into
ClickHouse:masterfrom
Algunenano:optimize-sparse-offsets-cache

Conversation

@Algunenano

@Algunenano Algunenano commented May 10, 2026

Copy link
Copy Markdown
Member

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Reduce per-query overhead for simple SELECT queries (parsing, analysis and planning). For example, SELECT count() FROM hits from a single connection is roughly 50% faster.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Per-commit Q0 (SELECT count() FROM hits, single connection) on a local build:

Commit QPS Δ
baseline ~958
Trim per-iteration overhead in sparse offset deserialization ~958 (Q1-targeted)
Reverted: Pack ProfileEvents counter into 8 bytes and gate trace check on a chain bit 995 +3.9%
Trim StorageSnapshot::getColumns: drop NameSet and Block round-trip ~1019 +2.4%
Defer ColumnNode map construction in initializeTableExpressionData ~1216 +19%
Defer column_names and column_identifier_first_parts construction ~1352 +11%
Memoize ColumnsDescription::get on the descriptor itself ~1397 +3%
Trim CurrentMemoryTracker hot path ~1383 within noise
Skip the recursive Planner after trivial-count optimization ~1444 +4.4%

@clickhouse-gh

clickhouse-gh Bot commented May 10, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [cedbaba]

Summary:

job_name test_name status info comment
AST fuzzer (amd_debug, targeted) FAIL
Logical error: Column identifier A is already registered (STID: 4697-4326) FAIL cidb IGNORED

AI Review

Summary

This PR reduces analyzer/planner overhead for lightweight queries by making table-expression column structures lazy, memoizing ColumnsDescription::get, simplifying virtual-column collection, and removing an unnecessary recursive planner pass in the trivial-count path; corresponding stateless references were updated. After reviewing the current diff and prior discussion threads, I found no new correctness, safety, or compatibility issues requiring changes.

Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label May 10, 2026
Comment thread src/Common/ProfileEvents.h Outdated
Algunenano added a commit to Algunenano/ClickHouse that referenced this pull request May 10, 2026
Address review feedback on PR ClickHouse#104513: the per-Counters `any_trace_in_chain`
short-circuit in `Counters::increment` was inherited from the immediate
parent's flag at attach time. If `setTraceProfileEvents` is later called on
an ancestor, the ancestor walks up via `markChainTracing` but never reaches
already-attached descendants, so they keep `any_trace_in_chain == false`
and skip the per-event trace-bit check.

Make `inheritTracingFromParent` walk the full parent chain at attach time
and copy the flag from the first ancestor that has it set, so any tracing
state present at the moment of attach is captured. We accept the remaining
race -- a child whose attach happens before its ancestor calls
`setTraceProfileEvents` will still miss events on that thread until the
next attach -- because the only caller is `ProcessList::insert` during
query setup, before worker threads attach, so worker threads (where the
hot increments happen) always see the correct flag. Closing the remaining
window would require children-tracking on every `Counters` (≈24 extra
bytes plus a mutex), which we want to avoid.

Move the trace-toggle helpers (`setUserCounters`, `setParent`,
`setTraceAllProfileEvents`, `setTraceProfileEvent`, `markChainTracing`,
`inheritTracingFromParent`) out of the header and into `ProfileEvents.cpp`
so editing them does not trigger a wide rebuild of every TU that includes
`ProfileEvents.h`.
Comment thread src/Common/ProfileEvents.cpp
@Algunenano Algunenano mentioned this pull request May 11, 2026
12 tasks
Algunenano added a commit to Algunenano/ClickHouse that referenced this pull request May 11, 2026
Per review on PR ClickHouse#104513: `inheritTracingFromParent` only ever
flips the per-counter flag to `true`, never back to `false`, so
descendants kept a stale `true` after the first traced query and
permanently took the slow path. Refreshing the flag at the right
moment fixed correctness but the optimization only saved ~3%
overall, and the chain-of-flags bookkeeping was a source of
correctness bugs (this is the third try).

Counter packing (`COUNTER_TRACE_BIT` in bit 63 of the atomic
counter) stays — it halves counter array size and lets us read
both the post-fetch value and the trace bit out of the same
`fetch_add`. `increment` is now a single chain walk that does the
`fetch_add`, OR-accumulates `(prev & COUNTER_TRACE_BIT)` across
parents, and emits one trace at the end if any ancestor had the
bit set.

Removed:
  - `Counters::any_trace_in_chain` field
  - `markChainTracing()`, `inheritTracingFromParent()`,
    `refreshTracingFromParent()`, `anyTraceInChain()`
  - `ProcessList::insert` no longer refreshes the flag after
    `setTraceProfileEvents` (the workaround that closed the race
    is no longer needed since there's no flag to be stale)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread src/Common/ProfileEvents.cpp Outdated
Algunenano added a commit to Algunenano/ClickHouse that referenced this pull request May 12, 2026
Per review on PR ClickHouse#104513: `std::memset` over an array of
`std::atomic<Count>` is technically UB per the standard, even though
it works on every platform we ship (where the atomic is lock-free
with the same layout as `Count`). Wrap the `memset` in
`if constexpr (std::atomic<Count>::is_always_lock_free && same
size/alignment)` and provide a relaxed-store loop fallback so the
code is well-defined on any future target where that no longer
holds. The fast path is unchanged on x86_64 / AArch64 / RISC-V64.
Comment thread src/Common/ProfileEvents.h Outdated
Algunenano added a commit to Algunenano/ClickHouse that referenced this pull request May 12, 2026
Reverts the changes from PR ClickHouse#104513:
  3ed1678  Guard resetCounters memset behind a constexpr lock-free check
  285c33c  Drop the any_trace_in_chain gate, keep the counter packing
  70241ae  Refresh tracing flag on the calling thread after setTraceProfileEvents
  4f549c2  Walk parent chain when inheriting any_trace_in_chain
  ea92033  Whitelist ProfileEvents::COUNTER_TRACE_BIT and COUNTER_VALUE_MASK in style check
  1b0194c  Pack ProfileEvents counter into 8 bytes and gate trace check on a chain bit

Packing the trace flag into bit 63 of each counter required every
external reader of `Counters::operator[]` to mask the value with
`COUNTER_VALUE_MASK`. Reviewers found more than one site that still
read the raw packed value (e.g. `ExecutionSpeedLimits::throttle` and
`MetricsTransmitter::transmit`), and the failure mode is silent: the
trace bit only flips on when `setTraceProfileEvent(s)` is called, so
the bug stays dormant until tracing is enabled. The optimization
saved only ~3% on lightweight queries and was not worth the long
tail of correctness traps it created.

Restores `src/Common/ProfileEvents.{h,cpp}`,
`src/Interpreters/ProcessList.cpp`, the six external readers that
were updated to apply the mask, and the style-check whitelist to
their state on master.
Comment thread src/Common/ProfileEvents.cpp Outdated
Comment thread src/Common/ProfileEvents.cpp Outdated
Comment thread src/Common/ProfileEvents.cpp Outdated
@Algunenano

Copy link
Copy Markdown
Member Author
image

Algunenano and others added 10 commits May 20, 2026 13:03
In `deserializeOffsets`, hoist `num_trailing_defaults` and
`has_value_after_defaults` into locals so the inner loop stops writing
two state fields to memory every iteration in the steady state where
they remain 0/false. Drop the now-redundant `!offsets.empty()` guard
when computing `start_of_group` (it can only be empty while `first` is
still true).

In `readOrGetCachedSparseOffsets`, compute the cache key directly from
`settings.path` and skip the heavy `Substream` construction in
`path.push_back/pop_back` on cache hits, and reuse the same key for
`cache->insert_or_assign` on cache misses instead of recomputing it
inside `addElementToSubstreamsCache`.

Both changes preserve existing behaviour; the cache key matches what
`getSubcolumnNameForStream` would produce after the original push.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`StorageSnapshot::getColumns(All|Subcolumns|...)` is called from
`QueryAnalyzer::initializeTableExpressionData` for every table in every
query. For a wide table such as `hits` it spent ~3% of the per-query
TCPHandler CPU on two avoidable things:

* Building a `NameSet` over every regular column of the table just to
  filter out a handful of virtuals. With ~110 columns and ~5 virtuals,
  a linear `find` per virtual is cheaper than hashing all 110 names.

* Calling `VirtualColumnsDescription::getSampleBlock` to obtain
  names/types of virtuals. That constructs a full `Block`, which calls
  `IDataType::createColumn()` per virtual just so the result can be
  destructured into a `NamesAndTypesList` immediately after.

Add `VirtualColumnsDescription::getNamesAndTypes(kind, place)` that
emits the same names/types directly, fast-path the
`metadata->virtuals.empty()` case in `StorageSnapshot::getColumns`, and
do a linear de-dup against `all_columns` instead of building a hash set.

Measured on clickbench Q0 `SELECT count() FROM hits`:
* ~995 -> ~1019 QPS (+2.4%)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
For wide tables (e.g. ClickBench `hits`, ~100 columns), populating the
`name -> ColumnNode` map for every column on every query during
`initializeTableExpressionData` is the dominant analyzer cost when the
query touches no column identifiers (`SELECT count() FROM t`).

Build only the cheap membership-check sets (`column_names`,
`column_identifier_first_parts`) eagerly and defer the
`ColumnNode`-allocating map to first use via a `populate_column_node_map`
closure. The closure also takes over ALIAS column resolution that the
old eager path performed inline, so it captures the storage snapshot and
the parent `IdentifierResolveScope` and runs only when an identifier
actually requires a `ColumnNode`.

`column_name_to_column_node` becomes `std::optional<map>` so
`has_value()` doubles as the populated flag. `ensureColumnNodeMapIsPopulated`
emplaces the empty map up front so re-entrant lookups during alias
expression resolution (e.g. `y ALIAS x + x*2` looking up `x`) see the
placeholder ColumnNodes the populator has already inserted.

Subqueries / unions still populate eagerly: their projection lists are
small and the map is needed immediately by the surrounding resolution.

ClickBench Q0 (`SELECT count() FROM hits`): ~1019 -> ~1216 QPS (+19%).
Wider impact wherever a query references few columns of a wide table.
After making `column_name_to_column_node` lazy, perf showed
`initializeTableExpressionData` still spending ~1.4% inserting
`column_names` strings and ~1.0% constructing `Identifier` objects to
populate `column_identifier_first_parts` for every column of the table.

These two sets are membership-check helpers used only when an identifier
is being bound to a table expression. Trivial queries like
`SELECT count() FROM t` never consult them, so move their construction
into `ensureColumnMembershipSetsArePopulated()`, called on demand from
`hasFullIdentifierName`, `canBindIdentifier`, `tryGetSubcolumnInfo`, and
the JOIN USING parent-scope check.

`ensureColumnNodeMapIsPopulated` calls the membership-set populator
first so any code path that needs the `ColumnNode` map continues to see
a fully built table expression data.

ClickBench Q0 (`SELECT count() FROM hits`): ~1216 -> ~1350 QPS (+11%).
Analyzer and planner each call `StorageSnapshot::getColumns` once per query
with the same option set (`All|Virtuals|Subcolumns` from the analyzer,
`AllPhysical|Subcolumns` from the planner). Each call iterates the
columns multi-index and runs `addSubcolumnsToList` (one hash lookup in
`columns` and one `equal_range` on `subcolumns` per row) — O(N + N*M) work
on every query, ~2.7% of CPU on `SELECT count() FROM hits` for a wide
table.

Move the memoization to `ColumnsDescription::get`, where it lives for the
lifetime of the descriptor (i.e. one StorageInMemoryMetadata version).
Two queries on the same table version reuse the same cached
`NamesAndTypesList`s. The cache is keyed by the four scalar fields of
`GetColumnsOptions`, packed into a 4-byte struct, and a small linear
search is fast enough for the handful of distinct option sets a query
mix produces.

Every mutator that touches `columns` or `subcolumns` (`add`, `remove`,
`rename`, `modifyColumnOrder`, `flattenNested`, `addSubcolumns`,
`removeSubcolumns`, copy/move assignment) calls `invalidateGetCache` so
ALTER and similar paths see consistent data.

ClickBench Q0 (`SELECT count() FROM hits`): ~1352 -> ~1397 QPS (+3%).
`CurrentMemoryTracker::allocImpl` and `free` are called on every
allocation. Three small wins on the hot path:

* Read `current_thread` once. The previous flow went through a local
  `getMemoryTracker()` that already loaded `current_thread`, then a
  follow-up `if (!current_thread)` that issued a second TLS load and
  branch.
* Skip the unconditional store to `untracked_memory_blocker_level`. The
  blocker level only changes when a `MemoryTrackerBlockerInThread` is
  pushed/popped, which is rare; the common case wrote the same value
  back every allocation.
* Move the global-tracker fallback into out-of-line `[[gnu::cold]]`
  helpers (`allocWithoutCurrentThread`, `freeWithoutCurrentThread`) so
  the cold path doesn't bloat the hot one.

Tagged the rare branches with `[[unlikely]]`. ClickBench Q0 shows
`allocImpl` self-time drop from ~0.86% to ~0.70% on the perf profile;
roughly 0.15% of total CPU on the new/delete fast path.
`buildQueryPlanForTableExpression` set up the trivial-count plan and
then ran an entire `Planner` over the same query (with `only_analyze`)
just to read back the expected header at `WithMergeableState` and add
a rename `ExpressionStep` if the trivial-count block disagreed. On
ClickBench Q0 (`SELECT count() FROM hits`) that recursive planner was
~4.6% of total CPU, and the rename it produced was always the same:
"alias the smallest-column name to the aggregate's identifier."

Construct the trivial-count block with the aggregate's
`calculateActionNodeName` directly, so its header already matches the
one the outer planner expects. Then skip the recursive `Planner`
whenever trivial-count was applied -- nothing else can change the
header at that point, since the only steps gated on
`till_stage == FetchColumns` (alias columns, where filters) don't
fire after the trivial-count branch.

Plan goes from `Expression(Project) -> MergingAggregated -> Expression(Change column names) -> ReadFromPreparedSource`
to `Expression(Project) -> MergingAggregated -> ReadFromPreparedSource`.

Three reference files updated to reflect the dropped Expression step:
* `03448_trivial_count_single_threaded_merge.reference` -- one less
  level in `EXPLAIN PIPELINE`.
* `04061_trivial_count_aggregate_function_argument_types.reference` --
  the second `Header:` line came from the rename step's output and is
  no longer present.
* `04201_trivial_count_with_additional_filter.reference` -- one less
  indentation level around `(ReadFromPreparedSource)`.

ClickBench Q0: ~1383 -> ~1444 QPS (+4.4%).
Two clang-tidy errors reported by the arm_tidy build:

1. `src/Storages/ColumnsDescription.h`:
   `bugprone-copy-constructor-init`: the explicit copy / move
   constructors did not initialize the `IHints<>` base, leaving it
   default-constructed.

2. `src/Planner/PlannerJoinTree.cpp`:
   `performance-move-const-arg`: `std::move` on
   `trivial_count_column_name` had no effect because
   `ColumnWithTypeAndName`'s constructor takes `const String &`.
`IHints<>` is stateless, but `IHints<>(std::move(other))` plus subsequent
`std::move(other.columns)` triggers `bugprone-use-after-move`. The implicit
base move-construction is sufficient.
…n change

Follow-up to 4c85254 (Skip the recursive Planner after trivial-count
optimization), which dropped one `Expression` / `ExpressionTransform`
step from the trivial-count plan in the analyzer path. Three other
references were updated in that commit; this one was missed.

Only the `enable_analyzer = 1` half changes (lines 9-16): one nested
`Expression` layer above `(ReadFromPreparedSource)` is removed,
reducing the indentation by two spaces. The `enable_analyzer = 0`
half is unchanged -- the old interpreter wasn't touched.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Algunenano Algunenano force-pushed the optimize-sparse-offsets-cache branch from 6e11028 to 1c4661d Compare May 20, 2026 13:07
@nickitat nickitat self-assigned this May 26, 2026
Comment thread src/Analyzer/Resolve/IdentifierResolver.cpp Outdated
Comment thread src/Analyzer/Resolve/TableExpressionData.h
Comment thread src/Storages/VirtualColumnsDescription.cpp
if (state.num_trailing_defaults >= max_rows_to_read)
/// Hoist state into locals so the inner loop avoids two redundant memory writes
/// per iteration in the common case where both fields stay 0/false.
size_t num_trailing_defaults = state.num_trailing_defaults;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TBH, such optimisations don't make sense to me.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with ^^

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was pure assembly analysis.

From the session with Claude:

Found the full picture. Summary of the assembly analysis from the May 7 session:

Baseline (tmp/clickhouse_baseline)

analyze-assembly.py on readOrGetCachedSparseOffsets (which has deserializeOffsets inlined into it):

- Spill density: 19.1% [high] — flagged finding: "Spills inside loop bodies multiply memory traffic by iteration count."
- The smoking gun in the inner loop body (the non-end_of_granule path):

0x1656b576: xor ecx, ecx
0x1656b578: mov byte ptr [r14 + 0x18], 0x0   ; state.has_value_after_defaults = false
0x1656b57d: mov qword ptr [r14 + 0x10], 0x0  ; state.num_trailing_defaults  = 0
0x1656b585: inc r15
0x1656b588: mov r14, qword ptr [rbp + 0x18]  ; reload &state
0x1656b58c: mov qword ptr [rbp - 0x158], r13 ; SPILL
0x1656b593: mov qword ptr [rbp - 0x150], rcx ; SPILL

r14 is the pointer to state. Those two stores at +0x10 and +0x18 are literally the per-iteration state.num_trailing_defaults = 0; state.has_value_after_defaults = false; lines (184–185 in the baseline source). The reload of r14 from [rbp+0x18] shows the compiler couldn't even keep &state in a register across the loop body.

Root cause

DeserializeStateSparse & state is a reference parameter. The compiler can't prove writes through it don't alias the PODArray::push_back, the ReadBuffer reads in readVarUInt, or ++skipped_values_rows. So every source-level assignment to state.foo materializes as a store, and the address itself gets reloaded.

After hoisting

tmp/clickhouse_optimized:
- Inner loop no longer touches [r14+0x10]/[r14+0x18].
- The single pair of writes happens once at function exit:
0x1656b99e: mov qword ptr [rax + 0x10], rcx     ; state.num_trailing_defaults
0x1656b9a2: setae byte ptr [rax + 0x18]         ; state.has_value_after_defaults

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, but this analysis doesn't tell us

  1. whether these couple of instructions were on the critical path - otherwise, we just saved a bit of throughput and no latency
  2. even if they were, how does it compare to the total amount of cycles each iteration takes

But anyway, I just express my opinion and don't insist on it.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But anyway, I just express my opinion and don't insist on it.

Don't worry, it's good to push back.

whether these couple of instructions were on the critical path - otherwise, we just saved a bit of throughput and no latency

This loop to deserialize sparse columns is often one of the most, if not the most, hot function in small queries like SELECT count() FROM hits WHERE AdvEngineID != 0.

even if they were, how does it compare to the total amount of cycles each iteration takes

Comparing the steady-state non-end-of-granule path (the dominant case):

no_hoist hot-iteration sequence (0x16a789a3..0x16a789c2):
add  rcx, r13                          ; compute push value
mov  qword ptr [rax], rcx              ; offsets.push_back
add  rax, 0x8
mov  qword ptr [rdx + 0x18], rax       ; update offsets.size_
mov  byte  ptr [rbp + 0x18], 0x0       ; state.has_value_after_defaults = false   <-- store
mov  qword ptr [rbp + 0x10], 0x0       ; state.num_trailing_defaults  = 0         <-- store
inc  r13                               ; total_rows++
mov  qword ptr [rsp + 0x38], rax       ; spill rax (offsets.end_)

hoist hot-iteration sequence (0x16a789db..0x16a789f4):
add  rcx, r12                          ; compute push value
mov  qword ptr [rax], rcx              ; offsets.push_back
add  rax, 0x8
mov  qword ptr [rdx + 0x18], rax       ; update offsets.size_
mov  r14, r12
inc  r14                               ; total_rows++
mov  qword ptr [rsp + 0x58], rax       ; spill rax (offsets.end_)

Delta on this iteration of the hot loop:
- −2 store instructions (state.has_value_after_defaults = false, state.num_trailing_defaults = 0 gone)
- Iteration instruction count: 9 → 7

Whole-function counts also match: no_hoist has 7 zero-stores to state.*; hoist has 3 (the writebacks at function exit + the end-of-granule reset).

Cycle interpretation on Zen3: each removed store is ~1 µop dispatched to the store-data and store-address ports; under store-buffer pressure they cost roughly 1 cycle each, but the loop is more likely IPC-bound by the VarInt decode and PODArray-append dependency chain. So the saving is ~2 µops/iter, theoretically ~2 cycles, but in practice closer to <1 cycle of wall — consistent with the ~1% benchmark improvement.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To clarify what I was asking: my question was specifically whether the saved instructions sit on the critical path, not whether the loop is hot. Those are different things. It very well might be that we saved instructions, but we didn't change the number of cycles per loop iteration. Because as long as there are idle ports and buffers, you can add a few instructions here and there, essentially for free. So, if you have a full asm for this loop, you can pipe it into a tool like https://uica.uops.info and see whether the cycles-per-iteration metric actually decreased. It'd be quite interesting to know.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assembly (generated per file, not from LTO build):

Predicted Throughput: (Generated urls are too big)

  • Before:
Tool | Broadwell | Skylake | Coffee Lake
-- | -- | -- | --
uiCA | 31.00 | 42.00 | 42.00
  • After:
Tool | Broadwell | Skylake | Coffee Lake
-- | -- | -- | --
uiCA | 28.00 | 33.00 | 33.00

Comment thread src/Common/CurrentMemoryTracker.h Outdated
Comment thread src/Storages/VirtualColumnsDescription.h
Comment thread src/Storages/ColumnsDescription.h Outdated
Comment thread src/Storages/ColumnsDescription.cpp Outdated
void invalidateGetCache() const;

mutable std::mutex get_cache_mutex;
mutable std::vector<std::pair<GetCacheKey, std::shared_ptr<const NamesAndTypesList>>> get_cache;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

std::unordered_map?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a deliberate choice. It's tiny and has only a handful of entries at most, so having and iterating a small vector is faster. I'll add a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Decided to test it and both with concurrency 1 and concurrency 16 and the difference is slightly better towards vector but it's too small to trust, as is within noise.

Comment thread src/Storages/ColumnsDescription.h Outdated
Comment thread src/Storages/ColumnsDescription.cpp Outdated
Comment thread src/Storages/ColumnsDescription.h Outdated
if (state.num_trailing_defaults >= max_rows_to_read)
/// Hoist state into locals so the inner loop avoids two redundant memory writes
/// per iteration in the common case where both fields stay 0/false.
size_t num_trailing_defaults = state.num_trailing_defaults;

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with ^^

Comment thread src/Analyzer/Resolve/TableExpressionData.h Outdated
Comment thread src/Analyzer/Resolve/TableExpressionData.h Outdated
`ColumnsDescription.h` had a stale `#include <mutex>` from when
`get_cache_mutex` was `std::mutex`; it's now `DB::SharedMutex` and no
inline body in the header uses `std::lock_guard` / `std::mutex`. The
.cpp uses `std::lock_guard<SharedMutex>` on the exclusive sites, so
include `<mutex>` explicitly there.
@novikd novikd self-assigned this May 27, 2026
@clickhouse-gh

clickhouse-gh Bot commented May 27, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 84.10% 84.10% +0.00%
Functions 91.40% 91.40% +0.00%
Branches 76.50% 76.50% +0.00%

Changed lines: 97.19% (311/320) | lost baseline coverage: 9 line(s) · Uncovered code

Full report · Diff report

@Algunenano Algunenano added this pull request to the merge queue Jun 1, 2026
Merged via the queue into ClickHouse:master with commit 2f448a2 Jun 1, 2026
164 of 166 checks passed
@Algunenano Algunenano deleted the optimize-sparse-offsets-cache branch June 1, 2026 14:36
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 1, 2026
Comment on lines +212 to +216

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks very bad to me. It replaces common usage of the cache ISerialization::getElementFromSubstreamsCache/ISerialization::addElementToSubstreamsCache to some custom logic that can be easily become broken if common logic of cache usage is changed in those methods that are used in all other places.

the heavy Substream construction in path.push_back on cache hits

What is so heavy in construction of this struct? Per granule it should be negligable.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants