Optimize filtered constant columns by niyue · Pull Request #105541 · ClickHouse/ClickHouse · GitHub
Skip to content

Optimize filtered constant columns#105541

Open
niyue wants to merge 39 commits into
ClickHouse:masterfrom
niyue:perf/const-column-filter
Open

Optimize filtered constant columns#105541
niyue wants to merge 39 commits into
ClickHouse:masterfrom
niyue:perf/const-column-filter

Conversation

@niyue

@niyue niyue commented May 21, 2026

Copy link
Copy Markdown
Contributor

Optimizes FilterTransform to replace physical input columns with ColumnConst after filters that prove the column has a single value, such as WHERE x = 'hello'. This reduces work for later pipeline steps and lets downstream expressions observe the filtered column as constant.

The optimization currently handles equality predicates through AND conditions and preserves conservative behavior for OR, NOT, and multi-value predicates.

This PR partially fixes issue: #34326. It covers the ColumnConst optimization for columns constrained by equality filters, while the remaining LowCardinality part is left out because changing a filtered column to LowCardinality may alter the column type/representation contract which seems more risky.

Changelog category (leave one):

  • Performance Improvement

Changelog entry:

  • Optimize filtering by replacing columns constrained to a single value by WHERE equality conditions with ColumnConst.

Performance benchmark

I measured this locally with a RelWithDebInfo build. Each case uses clickhouse benchmark --iterations 10; the table reports the median (50%) latency.

Case Query Before fix median After fix median Improvement
group_by_const_keys_after_filter SELECT count(), sum(v) FROM constant_column_after_filter WHERE k1 = 42 AND k2 = 43 AND k3 = 44 AND k4 = 45 AND k5 = 46 AND k6 = 47 AND k7 = 48 AND k8 = 49 GROUP BY k1, k2, k3, k4, k5, k6, k7, k8 FORMAT Null SETTINGS max_threads = 1, optimize_move_to_prewhere = 0, query_plan_optimize_prewhere = 0, query_plan_optimize_lazy_materialization = 0 0.941 sec 0.792 sec 15.8% faster
expressions_on_const_columns_after_filter SELECT sum(length(toString(k1))), sum(length(toString(k2))), sum(length(toString(k3))), sum(length(toString(k4))), sum(length(toString(k5))), sum(length(toString(k6))), sum(length(toString(k7))), sum(length(toString(k8))) FROM constant_column_after_filter WHERE k1 = 42 AND k2 = 43 AND k3 = 44 AND k4 = 45 AND k5 = 46 AND k6 = 47 AND k7 = 48 AND k8 = 49 FORMAT Null SETTINGS max_threads = 1, optimize_move_to_prewhere = 0, query_plan_optimize_prewhere = 0, query_plan_optimize_lazy_materialization = 0 0.768 sec 0.434 sec 43.5% faster

Note

Medium Risk
Touches core query execution filtering and mutates output column representations, which could subtly affect type/nullable/LC behavior; additional guards reduce risk but correctness/perf impacts span many queries.

Overview
Filter optimization: FilterTransform now analyzes the filter ActionsDAG and, for and-combined equals predicates against constants (e.g. x = 'hello'), records columns that become single-valued and rewrites those output columns to ColumnConst after filtering.

The rewrite is applied on fast paths (all-rows-pass / constant-true filter) and after partial filtering, with safeguards to skip float-related cases and to avoid unsafe conversions (notably float/decimal comparisons), plus handling for nullable/low-cardinality-nullable columns.

Adds a new stateless test (04259_filter_constant_column_after_where) covering aliases, multiple AND equalities, non-handled OR/NOT, nullable/low-cardinality cases, and float/decimal edge conditions.

Reviewed by Cursor Bugbot for commit 9a6d76a. Bugbot is set up for automated code reviews on this repo. Configure here.

@alexey-milovidov alexey-milovidov added the can be tested Allows running workflows for external contributors label May 21, 2026
@clickhouse-gh

clickhouse-gh Bot commented May 21, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [13cc95c]

Summary:

job_name test_name status info comment
Stateless tests (amd_llvm_coverage, old analyzer, s3 storage, DatabaseReplicated, WasmEdge, parallel, 1/3) FAIL
04259_filter_constant_column_after_where FAIL cidb

AI Review

Summary

This PR teaches FilterTransform to replace post-filter columns with ColumnConst when equality predicates prove a single surviving value, and then patches several downstream consumers to tolerate runtime const columns. The remaining blocker is that the rewrite does not preserve the same settings-sensitive parsing semantics that equals uses for string literals, so some queries can now return the wrong filtered column value.

Findings
❌ Blockers
  • src/Processors/Transforms/FilterTransform.cpp:202 The rewrite converts string RHS literals with default FormatSettings, but comparison semantics already depend on per-query settings such as bool_true_representation, bool_false_representation, and cast_string_to_date_time_mode. A concrete wrong-result case is WHERE b = '0' SETTINGS bool_true_representation='0', bool_false_representation='1': the predicate keeps b = true rows, while the rewrite parses '0' with default settings and replaces the surviving column with false.
    Suggested fix: thread the same comparison/format settings into tryConvertFieldToType, or conservatively disable the rewrite for string RHS constants until the exact comparison parsing path can be reused.
Final Verdict
  • Status: ❌ Block
  • Minimum required actions:
    • Make the constant rewrite use the same settings-aware string parsing as equals, or skip rewrite for string literals.
    • Add a focused regression covering a settings-dependent comparison, at least the Bool case above.

@clickhouse-gh clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label May 21, 2026
Comment thread src/Processors/Transforms/FilterTransform.cpp Outdated
@niyue niyue force-pushed the perf/const-column-filter branch from f18c917 to ae0a620 Compare May 22, 2026 13:31
niyue added a commit to niyue/ClickHouse that referenced this pull request May 22, 2026
Avoid replacing floating-point columns after `equals`, because `+0.0` and `-0.0` compare equal but have different bit representations. Materialize `Nullable` constant columns before passing them downstream to avoid invalid keys in grouping sets with `group_by_use_nulls`.

PR: ClickHouse#105541

Tests: ninja -C build programs/clickhouse

Tests: tests/clickhouse-test --no-random-settings --no-random-merge-tree-settings 04259_filter_constant_column_after_where

Tests: local `03305_mergine_aggregated_filter_push_down` smoke with `127.0.0.{1,1}` because this macOS host has no `127.0.0.2` loopback alias
Comment thread src/Processors/Transforms/FilterTransform.cpp Outdated
niyue added 4 commits May 22, 2026 22:34
Cover additional `FilterTransform` cases for filtered `ColumnConst` columns, including convertible constants, non-input expressions, `Nullable`, `LowCardinality`, omitted filtered columns, partially omitted equality columns, and empty results.
Avoid replacing floating-point columns after `equals`, because `+0.0` and `-0.0` compare equal but have different bit representations. Materialize `Nullable` constant columns before passing them downstream to avoid invalid keys in grouping sets with `group_by_use_nulls`.

PR: ClickHouse#105541

Tests: ninja -C build programs/clickhouse

Tests: tests/clickhouse-test --no-random-settings --no-random-merge-tree-settings 04259_filter_constant_column_after_where

Tests: local `03305_mergine_aggregated_filter_push_down` smoke with `127.0.0.{1,1}` because this macOS host has no `127.0.0.2` loopback alias
Skip the constant-column rewrite when an `equals` predicate constrains a `Decimal` column with a floating-point constant. Such comparisons can keep multiple distinct `Decimal` values after conversion through floating-point comparison paths, so replacing the result with one converted constant changes query-visible output.
@niyue niyue force-pushed the perf/const-column-filter branch from ae0a620 to 34f9cd9 Compare May 22, 2026 15:45
Comment thread src/Processors/Transforms/FilterTransform.cpp Outdated
niyue added 2 commits May 23, 2026 15:31
Make the `equals` safety check recursive over constrained and constant types before replacing filtered columns with `ColumnConst`. This skips rewrites when the constrained type contains `Float`, or when it contains `Decimal` and the constant side contains `Float`.

Add stateless coverage for nested `Tuple(Float64)` and nested `Tuple(Decimal128)` predicates that must preserve distinct surviving values.

Review: ClickHouse#105541 (comment)
Comment thread src/Processors/Transforms/FilterTransform.cpp Outdated
niyue added 2 commits May 24, 2026 14:14
Skip the constant-column rewrite when the constrained column type contains `Dynamic`, because runtime variants are not visible through static child-type traversal and can carry non-representation-unique values.
Comment thread src/Processors/Transforms/FilterTransform.cpp Outdated
`equals` between `String` and `FixedString` can keep multiple distinct `String` payloads because of zero-padded `FixedString` comparison semantics. Do not replace the filtered `String` column with a single `ColumnConst` in that case
Comment thread src/Interpreters/Aggregator.cpp Outdated
`MergingAggregated` can receive partially aggregated blocks after expression or filter steps. Numeric aggregation merge methods read key columns through raw data, so a  key can be read past its single nested value and produce wrong grouping keys.

Materialize only key columns before merge and keep aggregate state columns unchanged.
@niyue niyue force-pushed the perf/const-column-filter branch from d6403a1 to 73f7811 Compare May 24, 2026 15:59
Comment thread src/Processors/Transforms/FilterTransform.cpp Outdated
niyue added 4 commits May 25, 2026 11:57
Fix `04259_filter_constant_column_after_where` by fixing `query_plan_merge_filters` and `query_plan_optimize_lazy_materialization` after the explicit filter-merge coverage queries.

The test checks `dumpColumnStructure`, so randomized filter splitting and lazy materialization can change whether a column is observed as `ColumnConst` or a full column without changing query semantics.
The `equals` function for `Variant` can keep rows with different active variants for the same literal value, such as `Int8(1)` and `UInt8(1)` in `Variant(Int8, UInt8)`. Rewriting the constrained column to one `ColumnConst` would collapse those rows to a single active variant.

Skip the optimization when the constrained result type recursively contains `Variant`, and cover both direct `Variant` and nested `Tuple(Variant)` cases.

Review: ClickHouse#105541 (comment)

Tests: `ninja -C build src/CMakeFiles/dbms.dir/Processors/Transforms/FilterTransform.cpp.o > build/build_filter_transform_variant_guard_object_20260525.log 2>&1`
Comment thread src/Interpreters/Aggregator.cpp
Comment thread src/Processors/Transforms/FilterTransform.cpp
Comment thread src/Processors/Transforms/FilterTransform.cpp Outdated
Comment thread src/Processors/Transforms/FilterTransform.cpp Outdated
Comment thread src/Processors/Transforms/FilterTransform.cpp Outdated
@clickhouse-gh

clickhouse-gh Bot commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

📊 Cloud Performance Report

🟢 AI verdict: improvement1 query(s) improved out of 38 analysed

This PR adds optimize_constant_columns_after_filter, which replaces a column with a constant after an equality WHERE predicate proves all surviving values are equal, cutting downstream materialization and aggregation work. ClickBench Q28 (-5.58%) is a clean, consistent win and sits squarely on that path: it filters on several equality predicates ahead of a heavy GROUP BY. The remaining flags are off-path and were downgraded: ClickBench Q4/Q15 are plain aggregations with no equality filter for the optimization to act on, and TPC-H Q4 only has date ranges and a semi-join, so its +8.79% is borderline and noisy rather than a real PR effect.

clickbench

🟢 1 improved · ⚠️ 2 inconclusive

Flagged queries (3 of 43)
Query Verdict Baseline median (ms) PR median (ms) Change q-value Hint
🟢 28 improvement 7021 6629 -5.6% 0.0003 cpu: Heavy GROUP BY with several equality predicates (e.g. CounterID/IsRefresh) - exactly what const-after-filter speeds up
⚠️ 4 not_sure 265 218 -17.7% <0.0001 aggregation: uniq(UserID), no equality WHERE filter, so the const-after-filter path can't apply; -18% reads as variance
⚠️ 15 not_sure 249 196 -21.3% <0.0001 aggregation: GROUP BY UserID with no WHERE equality predicate; PR can't drive -21% here, treat as run-to-run variance

q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.

tpch_adapted_1_official

⚠️ 1 inconclusive

Flagged queries (1 of 22)
Query Verdict Baseline median (ms) PR median (ms) Change q-value Hint
⚠️ 4 not_sure 2390 2600 +8.8% 0.0035 aggregation: Only date ranges + semi-join, no equality-constant filter; PR can't slow this path. +8.79% is noisy, borderline

q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.

Debug info
  • StressHouse run: 39d082aa-f668-4265-bfca-f390641a3968
  • MIRAI run: 6bc287e4-3d00-4751-b27e-6f8a4bc88838
  • PR check IDs:
    • clickbench_67607_1782154641
    • clickbench_67616_1782154641
    • clickbench_67625_1782154642
    • tpch_adapted_1_official_67632_1782154641
    • tpch_adapted_1_official_67645_1782154641
    • tpch_adapted_1_official_67654_1782154641

… and process enum missing value correctly.
Comment thread src/Processors/Transforms/FilterTransform.cpp Outdated
Comment thread src/Core/Settings.cpp
Comment thread src/Processors/Transforms/FilterTransform.cpp
Comment thread src/Processors/Transforms/FilterTransform.cpp
niyue added 3 commits June 12, 2026 13:45
Keep one semantic check for the `AND` case while preserving the
existing type-specific coverage.
Materialize runtime `ColumnConst` keys before `DistinctSortedTransform`
chooses a hash method. Numeric hash methods expect row-addressable key
storage, while filtered constants can be represented by a single nested
value.

Add an in-order `DISTINCT` regression that drives a const key produced
after `WHERE` through this path.
Comment thread src/Processors/Transforms/FilterTransform.cpp
niyue added 2 commits June 12, 2026 21:41
Avoid replacing filtered date/time-family columns with `ColumnConst`
when the equality constant uses a different date/time-family type.
Those comparisons use common-type promotion, so keeping the original
full column avoids depending on one-way constant conversion.

Add a stateless regression covering mixed `Date`, `Date32`, `Time`,
`Time64`, `DateTime`, and `DateTime64` equality filters.
Comment thread src/Processors/Transforms/FilterTransform.cpp
niyue and others added 5 commits June 13, 2026 20:03
Runtime `equals` converts `Enum` values to their names when comparing
against `String`, while `tryConvertFieldToType` would stringify the
underlying enum number. Skip this rewrite so filtered `String` columns are
not replaced with a different `ColumnConst` value.
`FilterTransform` can now leave filtered columns as `ColumnConst`. Convert
them to full columns at the `MergeTree` part writer boundary so projection
part serialization receives physical columns such as `ColumnString`.
Comment thread src/Storages/MergeTree/MergeTreeDataPartWriterOnDisk.cpp
niyue added 2 commits July 4, 2026 14:00
Add a focused regression for post-filter constant columns entering
`MergeTree` part writes, including filtered projection part serialization.
if (!canReplaceColumnWithConstantAfterFilter(result_column.type, constant_node->result_type))
return {};

auto converted = tryConvertFieldToType(*constant_field, *result_column.type, constant_node->result_type.get());

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

equals on string literals is settings-sensitive, but this rewrite recomputes the RHS with default FormatSettings instead of the same parser the comparison used. That is a wrong-result bug, not just a missed optimization.

One concrete repro is Bool: WHERE b = '0' SETTINGS bool_true_representation='0', bool_false_representation='1' keeps rows where b = true, because SerializationBool checks the custom true/false spellings before the built-in special values. Here tryConvertFieldToType runs with default settings, so it parses '0' as the built-in false and rewrites the surviving b column to ColumnConst(false).

03803_const_string_comparison_cast_settings.sql already shows that comparison semantics depend on per-query format settings (bool_*_representation, cast_string_to_date_time_mode, input_format_null_as_default). Please either pass the query FormatSettings/comparison settings into this conversion or skip the rewrite for string RHS constants unless you can reuse the exact same parsing path.

@clickhouse-gh

clickhouse-gh Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.60% 85.60% +0.00%
Functions 92.70% 92.70% +0.00%
Branches 77.80% 77.80% +0.00%

Changed lines: Changed C/C++ lines covered: 170/173 (98.27%) · Uncovered code

Full report · Diff report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-performance Pull request with some performance improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants