iframe-proxy

Algunenano · 2026-05-10T11:24:32Z

Changelog category (leave one):

Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Reduce per-query overhead for simple SELECT queries (parsing, analysis and planning). For example, SELECT count() FROM hits from a single connection is roughly 50% faster.

Documentation entry for user-facing changes

Documentation is written (mandatory for new features)

Per-commit Q0 (SELECT count() FROM hits, single connection) on a local build:

clickhouse-gh · 2026-05-10T11:25:06Z

Workflow [PR], commit [cedbaba]

Summary: ❌

job_name	test_name	status	info	comment
AST fuzzer (amd_debug, targeted)		FAIL
	Logical error: Column identifier A is already registered (STID: 4697-4326)	FAIL	cidb	IGNORED

AI Review

Summary

This PR reduces analyzer/planner overhead for lightweight queries by making table-expression column structures lazy, memoizing ColumnsDescription::get, simplifying virtual-column collection, and removing an unnecessary recursive planner pass in the trivial-count path; corresponding stateless references were updated. After reviewing the current diff and prior discussion threads, I found no new correctness, safety, or compatibility issues requiring changes.

Final Verdict

Status: ✅ Approve

Address review feedback on PR ClickHouse#104513: the per-Counters `any_trace_in_chain` short-circuit in `Counters::increment` was inherited from the immediate parent's flag at attach time. If `setTraceProfileEvents` is later called on an ancestor, the ancestor walks up via `markChainTracing` but never reaches already-attached descendants, so they keep `any_trace_in_chain == false` and skip the per-event trace-bit check. Make `inheritTracingFromParent` walk the full parent chain at attach time and copy the flag from the first ancestor that has it set, so any tracing state present at the moment of attach is captured. We accept the remaining race -- a child whose attach happens before its ancestor calls `setTraceProfileEvents` will still miss events on that thread until the next attach -- because the only caller is `ProcessList::insert` during query setup, before worker threads attach, so worker threads (where the hot increments happen) always see the correct flag. Closing the remaining window would require children-tracking on every `Counters` (≈24 extra bytes plus a mutex), which we want to avoid. Move the trace-toggle helpers (`setUserCounters`, `setParent`, `setTraceAllProfileEvents`, `setTraceProfileEvent`, `markChainTracing`, `inheritTracingFromParent`) out of the header and into `ProfileEvents.cpp` so editing them does not trigger a wide rebuild of every TU that includes `ProfileEvents.h`.

Per review on PR ClickHouse#104513: `inheritTracingFromParent` only ever flips the per-counter flag to `true`, never back to `false`, so descendants kept a stale `true` after the first traced query and permanently took the slow path. Refreshing the flag at the right moment fixed correctness but the optimization only saved ~3% overall, and the chain-of-flags bookkeeping was a source of correctness bugs (this is the third try). Counter packing (`COUNTER_TRACE_BIT` in bit 63 of the atomic counter) stays — it halves counter array size and lets us read both the post-fetch value and the trace bit out of the same `fetch_add`. `increment` is now a single chain walk that does the `fetch_add`, OR-accumulates `(prev & COUNTER_TRACE_BIT)` across parents, and emits one trace at the end if any ancestor had the bit set. Removed: - `Counters::any_trace_in_chain` field - `markChainTracing()`, `inheritTracingFromParent()`, `refreshTracingFromParent()`, `anyTraceInChain()` - `ProcessList::insert` no longer refreshes the flag after `setTraceProfileEvents` (the workaround that closed the race is no longer needed since there's no flag to be stale) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Per review on PR ClickHouse#104513: `std::memset` over an array of `std::atomic<Count>` is technically UB per the standard, even though it works on every platform we ship (where the atomic is lock-free with the same layout as `Count`). Wrap the `memset` in `if constexpr (std::atomic<Count>::is_always_lock_free && same size/alignment)` and provide a relaxed-store loop fallback so the code is well-defined on any future target where that no longer holds. The fast path is unchanged on x86_64 / AArch64 / RISC-V64.

Reverts the changes from PR ClickHouse#104513: 3ed1678 Guard resetCounters memset behind a constexpr lock-free check 285c33c Drop the any_trace_in_chain gate, keep the counter packing 70241ae Refresh tracing flag on the calling thread after setTraceProfileEvents 4f549c2 Walk parent chain when inheriting any_trace_in_chain ea92033 Whitelist ProfileEvents::COUNTER_TRACE_BIT and COUNTER_VALUE_MASK in style check 1b0194c Pack ProfileEvents counter into 8 bytes and gate trace check on a chain bit Packing the trace flag into bit 63 of each counter required every external reader of `Counters::operator[]` to mask the value with `COUNTER_VALUE_MASK`. Reviewers found more than one site that still read the raw packed value (e.g. `ExecutionSpeedLimits::throttle` and `MetricsTransmitter::transmit`), and the failure mode is silent: the trace bit only flips on when `setTraceProfileEvent(s)` is called, so the bug stays dormant until tracing is enabled. The optimization saved only ~3% on lightweight queries and was not worth the long tail of correctness traps it created. Restores `src/Common/ProfileEvents.{h,cpp}`, `src/Interpreters/ProcessList.cpp`, the six external readers that were updated to apply the mask, and the style-check whitelist to their state on master.

Algunenano · 2026-05-15T09:24:27Z

In `deserializeOffsets`, hoist `num_trailing_defaults` and `has_value_after_defaults` into locals so the inner loop stops writing two state fields to memory every iteration in the steady state where they remain 0/false. Drop the now-redundant `!offsets.empty()` guard when computing `start_of_group` (it can only be empty while `first` is still true). In `readOrGetCachedSparseOffsets`, compute the cache key directly from `settings.path` and skip the heavy `Substream` construction in `path.push_back/pop_back` on cache hits, and reuse the same key for `cache->insert_or_assign` on cache misses instead of recomputing it inside `addElementToSubstreamsCache`. Both changes preserve existing behaviour; the cache key matches what `getSubcolumnNameForStream` would produce after the original push. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`StorageSnapshot::getColumns(All|Subcolumns|...)` is called from `QueryAnalyzer::initializeTableExpressionData` for every table in every query. For a wide table such as `hits` it spent ~3% of the per-query TCPHandler CPU on two avoidable things: * Building a `NameSet` over every regular column of the table just to filter out a handful of virtuals. With ~110 columns and ~5 virtuals, a linear `find` per virtual is cheaper than hashing all 110 names. * Calling `VirtualColumnsDescription::getSampleBlock` to obtain names/types of virtuals. That constructs a full `Block`, which calls `IDataType::createColumn()` per virtual just so the result can be destructured into a `NamesAndTypesList` immediately after. Add `VirtualColumnsDescription::getNamesAndTypes(kind, place)` that emits the same names/types directly, fast-path the `metadata->virtuals.empty()` case in `StorageSnapshot::getColumns`, and do a linear de-dup against `all_columns` instead of building a hash set. Measured on clickbench Q0 `SELECT count() FROM hits`: * ~995 -> ~1019 QPS (+2.4%) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

For wide tables (e.g. ClickBench `hits`, ~100 columns), populating the `name -> ColumnNode` map for every column on every query during `initializeTableExpressionData` is the dominant analyzer cost when the query touches no column identifiers (`SELECT count() FROM t`). Build only the cheap membership-check sets (`column_names`, `column_identifier_first_parts`) eagerly and defer the `ColumnNode`-allocating map to first use via a `populate_column_node_map` closure. The closure also takes over ALIAS column resolution that the old eager path performed inline, so it captures the storage snapshot and the parent `IdentifierResolveScope` and runs only when an identifier actually requires a `ColumnNode`. `column_name_to_column_node` becomes `std::optional<map>` so `has_value()` doubles as the populated flag. `ensureColumnNodeMapIsPopulated` emplaces the empty map up front so re-entrant lookups during alias expression resolution (e.g. `y ALIAS x + x*2` looking up `x`) see the placeholder ColumnNodes the populator has already inserted. Subqueries / unions still populate eagerly: their projection lists are small and the map is needed immediately by the surrounding resolution. ClickBench Q0 (`SELECT count() FROM hits`): ~1019 -> ~1216 QPS (+19%). Wider impact wherever a query references few columns of a wide table.

After making `column_name_to_column_node` lazy, perf showed `initializeTableExpressionData` still spending ~1.4% inserting `column_names` strings and ~1.0% constructing `Identifier` objects to populate `column_identifier_first_parts` for every column of the table. These two sets are membership-check helpers used only when an identifier is being bound to a table expression. Trivial queries like `SELECT count() FROM t` never consult them, so move their construction into `ensureColumnMembershipSetsArePopulated()`, called on demand from `hasFullIdentifierName`, `canBindIdentifier`, `tryGetSubcolumnInfo`, and the JOIN USING parent-scope check. `ensureColumnNodeMapIsPopulated` calls the membership-set populator first so any code path that needs the `ColumnNode` map continues to see a fully built table expression data. ClickBench Q0 (`SELECT count() FROM hits`): ~1216 -> ~1350 QPS (+11%).

Analyzer and planner each call `StorageSnapshot::getColumns` once per query with the same option set (`All|Virtuals|Subcolumns` from the analyzer, `AllPhysical|Subcolumns` from the planner). Each call iterates the columns multi-index and runs `addSubcolumnsToList` (one hash lookup in `columns` and one `equal_range` on `subcolumns` per row) — O(N + N*M) work on every query, ~2.7% of CPU on `SELECT count() FROM hits` for a wide table. Move the memoization to `ColumnsDescription::get`, where it lives for the lifetime of the descriptor (i.e. one StorageInMemoryMetadata version). Two queries on the same table version reuse the same cached `NamesAndTypesList`s. The cache is keyed by the four scalar fields of `GetColumnsOptions`, packed into a 4-byte struct, and a small linear search is fast enough for the handful of distinct option sets a query mix produces. Every mutator that touches `columns` or `subcolumns` (`add`, `remove`, `rename`, `modifyColumnOrder`, `flattenNested`, `addSubcolumns`, `removeSubcolumns`, copy/move assignment) calls `invalidateGetCache` so ALTER and similar paths see consistent data. ClickBench Q0 (`SELECT count() FROM hits`): ~1352 -> ~1397 QPS (+3%).

`CurrentMemoryTracker::allocImpl` and `free` are called on every allocation. Three small wins on the hot path: * Read `current_thread` once. The previous flow went through a local `getMemoryTracker()` that already loaded `current_thread`, then a follow-up `if (!current_thread)` that issued a second TLS load and branch. * Skip the unconditional store to `untracked_memory_blocker_level`. The blocker level only changes when a `MemoryTrackerBlockerInThread` is pushed/popped, which is rare; the common case wrote the same value back every allocation. * Move the global-tracker fallback into out-of-line `[[gnu::cold]]` helpers (`allocWithoutCurrentThread`, `freeWithoutCurrentThread`) so the cold path doesn't bloat the hot one. Tagged the rare branches with `[[unlikely]]`. ClickBench Q0 shows `allocImpl` self-time drop from ~0.86% to ~0.70% on the perf profile; roughly 0.15% of total CPU on the new/delete fast path.

`buildQueryPlanForTableExpression` set up the trivial-count plan and then ran an entire `Planner` over the same query (with `only_analyze`) just to read back the expected header at `WithMergeableState` and add a rename `ExpressionStep` if the trivial-count block disagreed. On ClickBench Q0 (`SELECT count() FROM hits`) that recursive planner was ~4.6% of total CPU, and the rename it produced was always the same: "alias the smallest-column name to the aggregate's identifier." Construct the trivial-count block with the aggregate's `calculateActionNodeName` directly, so its header already matches the one the outer planner expects. Then skip the recursive `Planner` whenever trivial-count was applied -- nothing else can change the header at that point, since the only steps gated on `till_stage == FetchColumns` (alias columns, where filters) don't fire after the trivial-count branch. Plan goes from `Expression(Project) -> MergingAggregated -> Expression(Change column names) -> ReadFromPreparedSource` to `Expression(Project) -> MergingAggregated -> ReadFromPreparedSource`. Three reference files updated to reflect the dropped Expression step: * `03448_trivial_count_single_threaded_merge.reference` -- one less level in `EXPLAIN PIPELINE`. * `04061_trivial_count_aggregate_function_argument_types.reference` -- the second `Header:` line came from the rename step's output and is no longer present. * `04201_trivial_count_with_additional_filter.reference` -- one less indentation level around `(ReadFromPreparedSource)`. ClickBench Q0: ~1383 -> ~1444 QPS (+4.4%).

Two clang-tidy errors reported by the arm_tidy build: 1. `src/Storages/ColumnsDescription.h`: `bugprone-copy-constructor-init`: the explicit copy / move constructors did not initialize the `IHints<>` base, leaving it default-constructed. 2. `src/Planner/PlannerJoinTree.cpp`: `performance-move-const-arg`: `std::move` on `trivial_count_column_name` had no effect because `ColumnWithTypeAndName`'s constructor takes `const String &`.

`IHints<>` is stateless, but `IHints<>(std::move(other))` plus subsequent `std::move(other.columns)` triggers `bugprone-use-after-move`. The implicit base move-construction is sufficient.

…n change Follow-up to 4c85254 (Skip the recursive Planner after trivial-count optimization), which dropped one `Expression` / `ExpressionTransform` step from the trivial-count plan in the analyzer path. Three other references were updated in that commit; this one was missed. Only the `enable_analyzer = 1` half changes (lines 9-16): one nested `Expression` layer above `(ReadFromPreparedSource)` is removed, reducing the indentation by two spaces. The `enable_analyzer = 0` half is unchanged -- the old interpreter wasn't touched. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

nickitat · 2026-05-26T17:54:07Z

-    if (state.num_trailing_defaults >= max_rows_to_read)
+    /// Hoist state into locals so the inner loop avoids two redundant memory writes
+    /// per iteration in the common case where both fields stay 0/false.
+    size_t num_trailing_defaults = state.num_trailing_defaults;


TBH, such optimisations don't make sense to me.

Agree with ^^

This was pure assembly analysis.

From the session with Claude:

Found the full picture. Summary of the assembly analysis from the May 7 session: Baseline (tmp/clickhouse_baseline) analyze-assembly.py on readOrGetCachedSparseOffsets (which has deserializeOffsets inlined into it): - Spill density: 19.1% [high] — flagged finding: "Spills inside loop bodies multiply memory traffic by iteration count." - The smoking gun in the inner loop body (the non-end_of_granule path): 0x1656b576: xor ecx, ecx 0x1656b578: mov byte ptr [r14 + 0x18], 0x0 ; state.has_value_after_defaults = false 0x1656b57d: mov qword ptr [r14 + 0x10], 0x0 ; state.num_trailing_defaults = 0 0x1656b585: inc r15 0x1656b588: mov r14, qword ptr [rbp + 0x18] ; reload &state 0x1656b58c: mov qword ptr [rbp - 0x158], r13 ; SPILL 0x1656b593: mov qword ptr [rbp - 0x150], rcx ; SPILL r14 is the pointer to state. Those two stores at +0x10 and +0x18 are literally the per-iteration state.num_trailing_defaults = 0; state.has_value_after_defaults = false; lines (184–185 in the baseline source). The reload of r14 from [rbp+0x18] shows the compiler couldn't even keep &state in a register across the loop body. Root cause DeserializeStateSparse & state is a reference parameter. The compiler can't prove writes through it don't alias the PODArray::push_back, the ReadBuffer reads in readVarUInt, or ++skipped_values_rows. So every source-level assignment to state.foo materializes as a store, and the address itself gets reloaded. After hoisting tmp/clickhouse_optimized: - Inner loop no longer touches [r14+0x10]/[r14+0x18]. - The single pair of writes happens once at function exit: 0x1656b99e: mov qword ptr [rax + 0x10], rcx ; state.num_trailing_defaults 0x1656b9a2: setae byte ptr [rax + 0x18] ; state.has_value_after_defaults

ok, but this analysis doesn't tell us

whether these couple of instructions were on the critical path - otherwise, we just saved a bit of throughput and no latency

even if they were, how does it compare to the total amount of cycles each iteration takes

But anyway, I just express my opinion and don't insist on it.

But anyway, I just express my opinion and don't insist on it.

Don't worry, it's good to push back.

whether these couple of instructions were on the critical path - otherwise, we just saved a bit of throughput and no latency

This loop to deserialize sparse columns is often one of the most, if not the most, hot function in small queries like SELECT count() FROM hits WHERE AdvEngineID != 0.

even if they were, how does it compare to the total amount of cycles each iteration takes

Comparing the steady-state non-end-of-granule path (the dominant case): no_hoist hot-iteration sequence (0x16a789a3..0x16a789c2): add rcx, r13 ; compute push value mov qword ptr [rax], rcx ; offsets.push_back add rax, 0x8 mov qword ptr [rdx + 0x18], rax ; update offsets.size_ mov byte ptr [rbp + 0x18], 0x0 ; state.has_value_after_defaults = false <-- store mov qword ptr [rbp + 0x10], 0x0 ; state.num_trailing_defaults = 0 <-- store inc r13 ; total_rows++ mov qword ptr [rsp + 0x38], rax ; spill rax (offsets.end_) hoist hot-iteration sequence (0x16a789db..0x16a789f4): add rcx, r12 ; compute push value mov qword ptr [rax], rcx ; offsets.push_back add rax, 0x8 mov qword ptr [rdx + 0x18], rax ; update offsets.size_ mov r14, r12 inc r14 ; total_rows++ mov qword ptr [rsp + 0x58], rax ; spill rax (offsets.end_) Delta on this iteration of the hot loop: - −2 store instructions (state.has_value_after_defaults = false, state.num_trailing_defaults = 0 gone) - Iteration instruction count: 9 → 7 Whole-function counts also match: no_hoist has 7 zero-stores to state.*; hoist has 3 (the writebacks at function exit + the end-of-granule reset). Cycle interpretation on Zen3: each removed store is ~1 µop dispatched to the store-data and store-address ports; under store-buffer pressure they cost roughly 1 cycle each, but the loop is more likely IPC-bound by the VarInt decode and PODArray-append dependency chain. So the saving is ~2 µops/iter, theoretically ~2 cycles, but in practice closer to <1 cycle of wall — consistent with the ~1% benchmark improvement.

To clarify what I was asking: my question was specifically whether the saved instructions sit on the critical path, not whether the loop is hot. Those are different things. It very well might be that we saved instructions, but we didn't change the number of cycles per loop iteration. Because as long as there are idle ports and buffers, you can add a few instructions here and there, essentially for free. So, if you have a full asm for this loop, you can pipe it into a tool like https://uica.uops.info and see whether the cycles-per-iteration metric actually decreased. It'd be quite interesting to know.

Assembly (generated per file, not from LTO build):

Before: https://pastila.nl/?01d3fbfd/fa38d2fa0aa9e8fd20fdcdc7c9300b4d#OTvoSCza+2GAgZYigrgV8Q==GCM

After: https://pastila.nl/?000d7e52/ef8fe8c022f76e8ee53b769cc9b531c5#vPIIa05X6eGJa/Z3JCzatw==GCM

Predicted Throughput: (Generated urls are too big)

Before:

Tool | Broadwell | Skylake | Coffee Lake -- | -- | -- | -- uiCA | 31.00 | 42.00 | 42.00

After:

Tool | Broadwell | Skylake | Coffee Lake -- | -- | -- | -- uiCA | 28.00 | 33.00 | 33.00

rschu1ze · 2026-05-26T22:33:29Z

+    void invalidateGetCache() const;
+
+    mutable std::mutex get_cache_mutex;
+    mutable std::vector<std::pair<GetCacheKey, std::shared_ptr<const NamesAndTypesList>>> get_cache;


std::unordered_map?

It was a deliberate choice. It's tiny and has only a handful of entries at most, so having and iterating a small vector is faster. I'll add a comment

Decided to test it and both with concurrency 1 and concurrency 16 and the difference is slightly better towards vector but it's too small to trust, as is within noise.

rschu1ze · 2026-05-26T22:43:14Z

-    if (state.num_trailing_defaults >= max_rows_to_read)
+    /// Hoist state into locals so the inner loop avoids two redundant memory writes
+    /// per iteration in the common case where both fields stay 0/false.
+    size_t num_trailing_defaults = state.num_trailing_defaults;


Agree with ^^

This reverts commit 744a3d9.

…fsets-cache

`ColumnsDescription.h` had a stale `#include <mutex>` from when `get_cache_mutex` was `std::mutex`; it's now `DB::SharedMutex` and no inline body in the header uses `std::lock_guard` / `std::mutex`. The .cpp uses `std::lock_guard<SharedMutex>` on the exclusive sites, so include `<mutex>` explicitly there.

clickhouse-gh · 2026-05-27T18:27:18Z

LLVM Coverage Report

Metric	Baseline	Current	Δ
Lines	84.10%	84.10%	+0.00%
Functions	91.40%	91.40%	+0.00%
Branches	76.50%	76.50%	+0.00%

Changed lines: 97.19% (311/320) | lost baseline coverage: 9 line(s) · Uncovered code

Full report · Diff report

Avogar · 2026-06-19T14:20:14Z

+    /// `getSubcolumnNameForStream(path + [SparseOffsets], encode_sparse_stream=true)` equals
+    /// `getSubcolumnNameForStream(path, true) + ".sparse.idx"` (or "sparse.idx" if path is empty),
+    /// so we can compute the cache key without mutating settings.path.
+    /// This lets us skip the heavy `Substream` construction in `path.push_back` on cache hits,
+    /// and reuse the same key for `cache->insert_or_assign` on cache misses.


This looks very bad to me. It replaces common usage of the cache ISerialization::getElementFromSubstreamsCache/ISerialization::addElementToSubstreamsCache to some custom logic that can be easily become broken if common logic of cache usage is changed in those methods that are used in all other places.

the heavy Substream construction in path.push_back on cache hits

What is so heavy in construction of this struct? Per granule it should be negligable.

clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label May 10, 2026

clickhouse-gh Bot reviewed May 10, 2026

View reviewed changes

Comment thread src/Common/ProfileEvents.h Outdated

clickhouse-gh Bot reviewed May 11, 2026

View reviewed changes

Comment thread src/Common/ProfileEvents.cpp

Algunenano mentioned this pull request May 11, 2026

ClickBench Q1 analysis #81043

Open

12 tasks

Algunenano mentioned this pull request May 12, 2026

Flaky test: 03710_parallel_alter_comment_rename_selects #104692

Closed

clickhouse-gh Bot reviewed May 12, 2026

View reviewed changes

Comment thread src/Common/ProfileEvents.cpp Outdated

clickhouse-gh Bot reviewed May 12, 2026

View reviewed changes

Comment thread src/Common/ProfileEvents.h Outdated

clickhouse-gh Bot reviewed May 13, 2026

View reviewed changes

Comment thread src/Common/ProfileEvents.cpp Outdated

clickhouse-gh Bot reviewed May 13, 2026

View reviewed changes

Comment thread src/Common/ProfileEvents.cpp Outdated

clickhouse-gh Bot reviewed May 13, 2026

View reviewed changes

Comment thread src/Common/ProfileEvents.cpp Outdated

Algunenano mentioned this pull request May 15, 2026

Optimize per-event trace flags: move into separate array and lazy allocation #105030

Merged

alexey-milovidov mentioned this pull request May 17, 2026

Stop the bleeding in function_prop_fuzzer #105146

Merged

1 task

groeneai mentioned this pull request May 18, 2026

Stop clickhouse_fuzzer from self-inflicting timeouts via SIGALRM signals #105176

Merged

1 task

Algunenano and others added 10 commits May 20, 2026 13:03

Drop std::move from base call in move ctor

1200b83

`IHints<>` is stateless, but `IHints<>(std::move(other))` plus subsequent `std::move(other.columns)` triggers `bugprone-use-after-move`. The implicit base move-construction is sufficient.

Algunenano force-pushed the optimize-sparse-offsets-cache branch from 6e11028 to 1c4661d Compare May 20, 2026 13:07

nickitat self-assigned this May 26, 2026

nickitat reviewed May 26, 2026

View reviewed changes

rschu1ze reviewed May 26, 2026

View reviewed changes

Algunenano added 4 commits May 27, 2026 16:02

Revert "Trim CurrentMemoryTracker hot path"

cb077e6

This reverts commit 744a3d9.

Improvements based on review feedback

de0f40e

Merge remote-tracking branch 'blessed/master' into optimize-sparse-of…

e6c9afd

…fsets-cache

novikd self-assigned this May 27, 2026

nickitat approved these changes Jun 1, 2026

View reviewed changes

Algunenano added this pull request to the merge queue Jun 1, 2026

Merged via the queue into ClickHouse:master with commit 2f448a2 Jun 1, 2026
164 of 166 checks passed

Algunenano deleted the optimize-sparse-offsets-cache branch June 1, 2026 14:36

robot-clickhouse-ci-1 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 1, 2026

Avogar reviewed Jun 19, 2026

View reviewed changes

Avogar mentioned this pull request Jun 19, 2026

Fix performance regression for Map subcolumns with PREWHERE #107988

Merged

alexey-milovidov mentioned this pull request Jul 1, 2026

Enable snappy compression in HTTP interface #100752

Open

1 task

groeneai mentioned this pull request Jul 1, 2026

Fix logical error in automatic parallel replicas with min_number_of_rows_per_replica #109011

Open

Commit	QPS	Δ
baseline	~958	—
Trim per-iteration overhead in sparse offset deserialization	~958	(Q1-targeted)
Reverted: ~~Pack ProfileEvents counter into 8 bytes and gate trace check on a chain bit~~	~~995~~	~~+3.9%~~
Trim StorageSnapshot::getColumns: drop NameSet and Block round-trip	~1019	+2.4%
Defer ColumnNode map construction in initializeTableExpressionData	~1216	+19%
Defer column_names and column_identifier_first_parts construction	~1352	+11%
Memoize ColumnsDescription::get on the descriptor itself	~1397	+3%
Trim CurrentMemoryTracker hot path	~1383	within noise
Skip the recursive Planner after trivial-count optimization	~1444	+4.4%

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

Algunenano commented May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Changelog category (leave one):

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Documentation entry for user-facing changes

Uh oh!

clickhouse-gh Bot commented May 10, 2026 • edited by Algunenano Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

AI Review

Summary

Final Verdict

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Algunenano commented May 15, 2026

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

clickhouse-gh Bot commented May 27, 2026

LLVM Coverage Report

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

6 participants

Algunenano commented May 10, 2026 •

edited

Loading

clickhouse-gh Bot commented May 10, 2026 •

edited by Algunenano

Loading