Parallelize `dictGetKeys` constant path and constant fold inverse dictionary lookup with `dictGetKeys` by nihalzp · Pull Request #91164 · ClickHouse/ClickHouse · GitHub
Skip to content

Parallelize dictGetKeys constant path and constant fold inverse dictionary lookup with dictGetKeys#91164

Merged
alexey-milovidov merged 60 commits into
ClickHouse:masterfrom
nihalzp:optimize-rev-dict-lookup-dictgetkeys
Jun 30, 2026
Merged

Parallelize dictGetKeys constant path and constant fold inverse dictionary lookup with dictGetKeys#91164
alexey-milovidov merged 60 commits into
ClickHouse:masterfrom
nihalzp:optimize-rev-dict-lookup-dictgetkeys

Conversation

@nihalzp

@nihalzp nihalzp commented Nov 30, 2025

Copy link
Copy Markdown
Member

Optimization of the constant path:

See #91164 (comment).

Constant folding:

This query:

SELECT
    table1.color_id AS color_id,
    table1.payload AS payload
FROM default.t AS table1
WHERE dictGet('colors', 'name', color_id) = 'red'

Previously became:

SELECT
    table1.color_id AS color_id,
    table1.payload AS payload
FROM default.t AS table1
WHERE table1.color_id IN (SELECT table1.id AS id
FROM dictionary('colors') AS table1
WHERE table1.name = 'red')

Now:

SELECT
    table1.color_id AS color_id,
    table1.payload AS payload
FROM default.t AS table1
WHERE table1.color_id IN [1, 3]

If the rhs set of IN has size 1, it becomes:

SELECT
    table1.color_id AS color_id,
    table1.payload AS payload
FROM default.t AS table1
WHERE table1.color_id = 1

If rhs set of IN is empty, then:

SELECT
    table1.color_id AS color_id,
    table1.payload AS payload
FROM default.t AS table1
WHERE 0
image

Changelog category (leave one):

  • Performance Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Optimize inverse dictionary lookups. A constant equality predicate such as WHERE dictGet(dict, attr, key_expr) = value is now constant-folded directly into a key filter (key_expr = const, key_expr IN [...], or WHERE 0) instead of being rewritten to an IN (SELECT ... FROM dictionary(...)) subquery, and the constant path of dictGetKeys now executes in parallel.

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Note

Medium Risk
Changes query-tree rewrite logic for dictGet predicates and makes dictGetKeys constant execution multi-threaded, which can affect query planning/semantics and performance under various dictionary layouts and type-cast edge cases.

Overview
Improves inverse dictionary lookup optimization by constant-folding eligible dictGet equality predicates into a precomputed key list using dictGetKeys, rewriting to key = <const>, key IN [..], or WHERE 0 (while preserving original nullable result types and skipping cases where dictGet-family casting would change semantics).

Refactors the fallback rewrite path to gate dictionary()-subquery rewrites on CREATE TEMPORARY TABLE access only when needed, and replaces analyzer-pass based resolving with direct StorageDictionary/table-function resolution.

Speeds up dictGetKeys for constant inputs by switching the constant path from single-threaded pulling to a parallel pipeline with per-stream sinks that filter matching rows and aggregate keys, plus adds a performance test and updates multiple stateless references/RBAC regression coverage to reflect the new constant-folded plans.

Reviewed by Cursor Bugbot for commit 0f7565c. Bugbot is set up for automated code reviews on this repo. Configure here.

Version info

  • Merged into: 26.7.1.293

@clickhouse-gh

clickhouse-gh Bot commented Nov 30, 2025

Copy link
Copy Markdown
Contributor

@nihalzp nihalzp changed the title Optimize inverse dictionary lookup with dictGetKeys and parallelize dictGetKeys constant path Optimize inverse dictionary lookup with dictGetKeys by constant folding and parallelize dictGetKeys constant path Nov 30, 2025
@clickhouse-gh clickhouse-gh Bot added the pr-performance Pull request with some performance improvements label Nov 30, 2025
@nihalzp

nihalzp commented Dec 1, 2025

Copy link
Copy Markdown
Member Author

For big dictionary:

AMD:

image

ARM:

image

@nihalzp nihalzp changed the title Optimize inverse dictionary lookup with dictGetKeys by constant folding and parallelize dictGetKeys constant path Parallelize dictGetKeys constant path and constant fold inverse dictionary lookup with dictGetKeys Dec 1, 2025
@nihalzp nihalzp marked this pull request as ready for review December 1, 2025 13:08
@novikd novikd self-assigned this Dec 1, 2025
Comment thread src/Analyzer/Passes/InverseDictionaryLookupPass.cpp Outdated
Comment thread src/Analyzer/Passes/InverseDictionaryLookupPass.cpp Outdated
Comment thread src/Functions/FunctionDictGetKeys.cpp Outdated
@nihalzp nihalzp requested a review from novikd January 17, 2026 07:24
@alexey-milovidov

Copy link
Copy Markdown
Member

Merged master (the branch was 318 commits behind and red) and verified the two affected translation units (InverseDictionaryLookupPass.cpp, FunctionDictGetKeys.cpp) still compile cleanly.

The only CI failure is BuzzHouse (amd_debug): Logical error: 'Inconsistent AST formatting: the original AST: (STID: 1941-1bfa). This is unrelated to this PR — it is a parser/formatter roundtrip inconsistency on a fuzzer-generated query that contains no dictGet/dictGetKeys (it involves kql_array_sort_asc, tupleElement, nested numeric/emoji identifiers like t0d0.c0.c2.2.3). It is already tracked with the exact same STID: 1941-1bfa in #105396.

@groeneai, could you take a look at #105396 and provide a fix in a separate PR if one isn't already in progress? Latest CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=91164&sha=1102a6dbcb0f2cd3788e5c04bbdc1b1a78b76121&name_0=PR&name_1=BuzzHouse%20%28amd_debug%29

@groeneai

groeneai commented Jun 7, 2026

Copy link
Copy Markdown
Contributor

Picked up the BuzzHouse failure (STID 1941-1bfa, issue #105396). The crash on this PR is a sub-variant of the same chronic round-trip family that PR #104991 (CODEC/BACKUP_NAME, merged) and PR #105839 (query parameter substitution, open) cover.

This sub-variant is parser-side: the kql_array_sort subscript hack in ParserExpression was rewriting a sibling arrayElement(kql_array_sort_*, ...) operand to tupleElement in place. When the inner arrayElement was built via call syntax (is_operator = false), the destructive rewrite broke the format/parse/format round-trip check in debug builds and raised LOGICAL_ERROR Inconsistent AST formatting (STID: 1941-1bfa).

Fix is in PR #106691 — gates the rewrite on is_operator. Operator-syntax kql[i][j] still folds to arrayElement(tupleElement(kql, i), j). Call-syntax shapes round-trip cleanly, regression test 04325_stid_1941_1bfa_kql_array_sort_arrayelement_roundtrip covers both shapes plus the BuzzHouse-failing pattern.

Once #106691 is merged the BuzzHouse run for #91164 should be unblocked. I will keep monitoring this PR's CI.

@alexey-milovidov

Copy link
Copy Markdown
Member

Merged master to refresh the base (merge-base was 2026-06-07) and trigger a clean CI run.

The only CI red was the AST fuzzer logical error Column identifier A is already registered (STID 4697-4326), which is unrelated to this PR: the same error hits master (pull_request_number = 0) and dozens of other PRs every day in CIDB. It is already tracked by #106649 and fixes are in progress in #106025 and #106414.

@groeneai, this is the known mutations IN-subquery analyzer regression — no separate action needed here, just flagging that the red on this PR is that pre-existing issue.

alexey-milovidov and others added 2 commits June 16, 2026 18:08
…ctionaryLookupPass`

The merge of `master` into this branch dropped the `#include <Access/ContextAccess.h>`
and `#include <Access/Common/AccessType.h>` includes from
`InverseDictionaryLookupPass.cpp`. As a result the call
`getContext()->getAccess()->isGranted(AccessType::CREATE_TEMPORARY_TABLE)` failed to
compile with:

    error: member access into incomplete type 'element_type'
    (aka 'const DB::ContextAccessWrapper')

because `getAccess` returns a `shared_ptr` to the only-forward-declared
`ContextAccessWrapper` (forward-declared in `Interpreters/Context.h`), whose member
`isGranted` requires the full definition from `Access/ContextAccess.h`.

This broke both the `Fast test` and `Build (arm_tidy)` jobs. Re-add both Access
includes (matching `master`), since `ContextAccessWrapper::isGranted` and
`AccessType::CREATE_TEMPORARY_TABLE` are both used directly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ictgetkeys

# Conflicts:
#	tests/queries/0_stateless/03701_optimize_inverse_dictionary_lookup_basic.reference
#	tests/queries/0_stateless/03702_optimize_inverse_dictionary_lookup_composite_and_layouts.reference
#	tests/queries/0_stateless/03703_optimize_inverse_dictionary_lookup_dictget_family.reference
#	tests/queries/0_stateless/03713_optimize_inverse_dictionary_lookup_setting_rewrite_in_to_join.reference
#	tests/queries/0_stateless/04201_inverse_dictionary_lookup_edge_cases.reference
Comment thread src/Analyzer/Passes/InverseDictionaryLookupPass.cpp
alexey-milovidov and others added 5 commits June 18, 2026 11:48
Addresses review feedback on `InverseDictionaryLookupPass`: the
`isCreateTemporaryTableGranted()` check ran before the new `dictGetKeys`
constant-fold path, so a user with dictionary access but without the
`CREATE TEMPORARY TABLE` grant never reached the fast path, even when the
rewrite is only `key = const`, `key IN [..]`, or `0` and no `dictionary()`
table function is built.

Only the fallback `IN (SELECT ... FROM dictionary(...))` rewrite uses the
`dictionary()` table function, so move the grant check down to that branch.
The constant-fold path needs only the normal dictionary access of
`dictGetKeys`, so it is no longer gated by the grant.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ARY TABLE

Proves, via `EXPLAIN`, that a user with `dictGet` access but without the
`CREATE TEMPORARY TABLE` grant still reaches the `dictGetKeys` constant fold
(`key IN [..]` and `key = const`): the folded plan contains neither `dictGet`
nor a `dictionary()` subquery, the `dictionary()` table function is rejected
with `ACCESS_DENIED`, and the optimized results match the unoptimized ones.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…master

Master PR ClickHouse#94681 made `FunctionNode::toASTImpl` always format operators as
function calls in `EXPLAIN SYNTAX` / `toAST` (e.g. `a IN b` renders as
`in(a, b)`, `a < b` as `less(a, b)`, `a AND b` as `and(a, b)`, `a NOT LIKE b`
as `notLike(a, b)`). After merging `master`, the inverse-dictionary-lookup and
`dictGetKeys` plan dumps render in the new function-call form.

Regenerated the references. The change is purely cosmetic plan formatting:
every query result and data row is byte-identical to the previous references;
only the rendering of operators inside `EXPLAIN` plans changed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member

Merged master (the branch was ~1180 commits behind and red) and pushed fixes for the red CI and the access-gating finding.

Red Fast test — references after operators-as-functions

The 6 failing dictGetKeys / inverse-dictionary tests (03701, 03702, 03703, 03713, 04201, 04276) failed only because master PR #94681 made FunctionNode::toASTImpl always format operators as function calls in EXPLAIN SYNTAX / toAST (a IN bin(a, b), a < bless(a, b), a AND band(a, b), a NOT LIKE bnotLike(a, b)). Regenerated the references; verified the change is purely cosmetic plan formatting — every query result and data row is byte-identical, only operator rendering inside EXPLAIN plans changed. The full 15-test inverse_dictionary_lookup + dict_get_keys stateless suite passes locally.

Access-gating finding (InverseDictionaryLookupPass.cpp:414) — fixed

Moved isCreateTemporaryTableGranted() below the constant-fold path so it gates only the fallback IN (SELECT ... FROM dictionary(...)) rewrite. A user with dictGet access but without CREATE TEMPORARY TABLE now reaches the dictGetKeys constant fold (key = const / key IN [..] / 0). Added 04401_inverse_dictionary_lookup_rbac_constant_fold, which proves via EXPLAIN that the folded plan contains neither dictGet nor a dictionary() subquery for such a user, that dictionary() is still ACCESS_DENIED, and that results match the unoptimized path.

Remaining AI-review items — left for review

  • Nested parallel executor in the constant path (FunctionDictGetKeys.cpp): the inline thread @novikd opened is resolved — the original ThreadPool was replaced with stream-attached sinks, and dictGetKeys is isDeterministic() = false, so the parallel scan only runs once at analysis time for the pass-driven fold; the inner executor runs with use_concurrency_control, so its CPU slots come from the query's shared pool. This is the PR's core parallelization, so I left it as the resolved thread stands. @novikd, please confirm whether you want any additional capping here.
  • Bounded large-hit fold (InverseDictionaryLookupPass.cpp:437): this remains a deliberate trade-off (analysis-time execution is bounded by max_memory_usage; the cost is wasted work only in the rare large-hit case, in exchange for the common-case win). A bounded dictionary-scan helper would be a larger, separate change; leaving the thread open for a decision.
  • The changelog entry could mention both surfaces (constant dictGetKeys and the inverse dictGet equality fold).

Comment thread src/Functions/FunctionDictGetKeys.cpp
alexey-milovidov and others added 2 commits June 30, 2026 05:56
The constant path of `dictGetKeys` reads the dictionary with several
streams that pull blocks from a shared counter in
`DictionarySourceCoordinator`. Each `DictGetKeysMatchingRowsSink`
accumulated the matching keys of the blocks its stream happened to
consume, and the merge appended all chunks of sink 0, then sink 1, and
so on. With a dictionary spanning more than one block this grouped the
result by worker, so the user-visible array order became
scheduler-dependent and could differ from the single-stream vector path
(`dictGetKeys(d, a, 'x')` vs `dictGetKeys(d, a, materialize('x'))`).

Tag every chunk produced by `DictionarySource` with its sequential
block number (a `DictionaryBlockNumber` chunk info) and merge the
matched chunks in ascending block-number order in `executeConstPath`.
Block `n` always covers the key range
`[n * max_block_size, (n + 1) * max_block_size)`, so this reproduces the
single-stream scan order and keeps the result deterministic and
consistent with the vector path, while preserving the parallel scan.

Add `04413_dict_get_keys_const_path_order_consistency`, which forces
many one-row blocks (`max_block_size = 1`) and `max_threads = 8` so the
constant path is genuinely parallel, then pins its result (order
included) to the vector path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member

Addressed the AI review Request changes verdict (deterministic dictGetKeys constant-path order) and refreshed the branch.

Fix (d7288222579)dictGetKeys constant-path ordering determinism:

  • DictionarySource now tags every chunk it produces with its sequential block number (a DictionaryBlockNumber chunk info). Block n always covers the key range [n * max_block_size, (n + 1) * max_block_size).
  • executeConstPath gathers the matched chunks from all per-stream sinks and merges them in ascending block-number order, reproducing the single-stream scan order. The result is now deterministic and identical to the vector path (dictGetKeys(d, a, 'x') vs dictGetKeys(d, a, materialize('x'))) regardless of how blocks are spread across reading streams. The parallel scan/filter is preserved.
  • New test 04413_dict_get_keys_const_path_order_consistency forces many one-row blocks (max_block_size = 1) and max_threads = 8 so the constant path is genuinely parallel, then pins its result (order included) to the vector path for values matching several keys.

Also merged current master into the branch (clean, no conflicts) to refresh CI.

The remaining unresolved thread on InverseDictionaryLookupPass.cpp (bounded dictGetKeys fold to avoid the large-hit double-scan) is left as the design follow-up @alexey-milovidov flagged for himself; it is not part of the verdict's required actions.

@groeneai, the only red on the previous run was AST fuzzer (amd_debug) failing with Logical error: Block structure mismatch in A stream: different columns (STID 0993-27f0), which is the known flake tracked in #108142 (matched across unrelated PRs and master, not specific to this change). Could you confirm/track? It should re-run on the freshly triggered CI.

Comment thread src/Functions/FunctionDictGetKeys.cpp Outdated
@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov Confirmed: 0993-27f0 is the chronic trunk flake (tracked in #108142), not caused by this PR.

CIDB over the last 30 days: 10 master hits plus 104 distinct unrelated PRs (131 total events), from 2026-06-06 to 2026-06-30 05:02 UTC, across AST fuzzer (amd_debug / arm_asan_ubsan) and Stress test (amd_msan / arm_asan_ubsan / serverfuzz). On this PR it appeared once: AST fuzzer (amd_debug) on 756da4d. The logical error is in the UnionStep / IntersectOrExceptStep block-structure path, unrelated to the dictGetKeys and inverse-dictionary-lookup changes here.

Master example, same check: https://s3.amazonaws.com/clickhouse-test-reports/json.html?REF=master&sha=cd3c529cd3a8002caa1a4ba585a39da636836784&name_0=MasterCI&name_1=AST%20fuzzer%20%28amd_debug%29&name_1=AST%20fuzzer%20%28amd_debug%29

In-flight fix: #107719. Expect it to clear on the re-run.

… signed-zero

Addresses the two blocking findings of the AI review (`❌ Block` verdict).

1. Non-coordinator dictionary layouts (`DIRECT`, `COMPLEX_KEY_DIRECT`,
   `POLYGON`, `REGEXP_TREE`, ...) read in a single stream and do not tag their
   chunks with a `DictionaryBlockNumber`. The constant path previously asserted
   the tag with `chassert` and then dereferenced it unconditionally, so a
   release build (where `chassert` does not evaluate) would dereference a null
   chunk info for a valid `dictGetKeys(dict, attr, const)` call. The block
   number is now optional: it is only read when present, and the block-number
   merge sort runs only for multi-stream reads (`num_streams > 1`), which come
   exclusively from `DictionarySourceCoordinator` and always attach the tag.
   Single-stream layouts keep the sink's insertion (scan) order.

2. The vector path used the raw `sipHash128AtRow` as a mandatory prefilter
   before the `equals` confirmation. `ColumnVector<T>::updateHashWithValue`
   hashes the stored bytes, so `0.0` and `-0.0` hash differently even though
   SQL `equals(0.0, -0.0)` is true, which made
   `dictGetKeys(d, 'f', CAST(-0.0 AS Float64))` and its `materialize(...)`
   form disagree. Floating-point attributes (optionally `Nullable`) now
   normalize signed zero in the bucket / cache hash on both the query-value
   side and the dictionary side, so SQL-equal values always share a bucket and
   the `equals` confirmation then decides the match.

New regressions:
- `04490_dict_get_keys_float_signed_zero_consistency` pins the constant and
  vector paths together for `Float32` / `Float64` with both zero signs in the
  dictionary.
- `04491_dict_get_keys_const_path_non_coordinator_layout` exercises the
  constant path on `DIRECT` / `COMPLEX_KEY_DIRECT` (must not crash and must
  agree with the vector path).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member

Addressed the AI review ❌ Block verdict — both required actions are now fixed in 68b5feb7320 (built and tested locally).

1. Blocker — non-coordinator dictionary layouts (FunctionDictGetKeys.cpp:114).
The constant path asserted DictionaryBlockNumber with chassert and then dereferenced it unconditionally, so a release build (where chassert does not evaluate) would dereference a null chunk info for a valid dictGetKeys(dict, attr, const) call on layouts that read in a single stream (DIRECT, COMPLEX_KEY_DIRECT, POLYGON, REGEXP_TREE, ...). The block number is now optional — read only when present — and the block-number merge sort runs only for multi-stream reads (num_streams > 1), which come exclusively from DictionarySourceCoordinator and always attach the tag. Single-stream layouts keep the sink's insertion (scan) order.
New regression 04491_dict_get_keys_const_path_non_coordinator_layout exercises DIRECT / COMPLEX_KEY_DIRECT; it does not crash on a RelWithDebInfo build (where chassert is a no-op) and the constant path agrees with the vector path for every value.

2. Major — vector-path floating zero (FunctionDictGetKeys.cpp:657).
ColumnVector<T>::updateHashWithValue hashes the stored bytes, so 0.0 and -0.0 hash differently even though SQL equals(0.0, -0.0) is true. The hash is a mandatory prefilter before equals, so dictGetKeys(d, 'f', CAST(-0.0 AS Float64)) and its materialize(...) form could disagree. Floating-point attributes (optionally Nullable) now normalize signed zero in the bucket / cache hash on both the query-value side and the dictionary side, so SQL-equal values always share a bucket and the equals confirmation decides the match.
New regression 04490_dict_get_keys_float_signed_zero_consistency pins the constant and vector paths together for Float32 / Float64 with both zero signs stored in the dictionary, using explicit literal-vs-materialize(literal) comparisons.

Locally re-ran the full dict_get_keys + inverse_dictionary_lookup stateless suite (13 .sql tests) — all pass; the two new tests pass too.

The only remaining unresolved thread (InverseDictionaryLookupPass.cpp — bounded large-hit fold to avoid the double-scan) is the design follow-up @alexey-milovidov flagged for himself; it is not part of the verdict's required actions.

@groeneai, the sole CI red is Stress test (amd_msan)Hung check failed, possible deadlock found, the chronic master-wide flake tracked in #107941. The hung stack is AST-fuzzer query analysis (executeASTFuzzerQueriesInterpreterSelectQueryAnalyzer) with zero dictGet / dictGetKeys frames, so it is unrelated to this change. Could you confirm/track? It should re-run on the freshly triggered CI.

@clickhouse-gh

clickhouse-gh Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.40% +0.00%
Functions 92.70% 92.70% +0.00%
Branches 77.60% 77.70% +0.10%

Changed lines: Changed C/C++ lines covered: 318/332 (95.78%) · Uncovered code

Full report · Diff report

@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jun 30, 2026
Merged via the queue into ClickHouse:master with commit 287154e Jun 30, 2026
174 checks passed
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 30, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-performance Pull request with some performance improvements pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants