Guard against empty projection crash in PlannerJoinTree#101005
Guard against empty projection crash in PlannerJoinTree#101005alexey-milovidov merged 7 commits into
Conversation
|
Workflow [PR], commit [8739dcc] Summary: ✅ AI ReviewSummaryThis PR updates Final Verdict
|
pufit
left a comment
There was a problem hiding this comment.
If there is a reproducing query from fuzzer - it should be used as a regression test in stateless tests.
| @@ -925,7 +925,14 @@ NameAndTypePair ExpressionActions::getSmallestColumn(const NamesAndTypesList & c | |||
| } | |||
|
|
|||
| if (!min_size) | |||
| { | |||
| /// If all columns are subcolumns, fall back to the first one. | |||
7ddd18a to
4904020
Compare
|
@pufit Thank you for the review. I've reworked the fix based on your feedback (force-pushed clean branch from latest master). Changes from v1 → v2:
Session: cron:clickhouse-ci-task-worker:20260328-191500 |
|
cc @pufit — reworked per your feedback. The fix is now at the call site in PlannerJoinTree.cpp (empty projection guard), with the getSmallestColumn() fallback removed. |
4904020 to
d79c424
Compare
v3 — Revised fix per @pufit's reviewProblem in v2: The regression test Root cause (refined): v3 fix: Replace Changes v2 → v3:
Verification:
Session: |
Pre-PR Validation Gate (v3)a) Deterministic repro: Fast test CI deterministically reproduces the crash on every run — test b) Root cause: c) Fix matches root cause: Replace d) Test intent preserved: No existing tests modified. New regression test e) Both directions: Without fix: Fast test deterministically fails with LOGICAL_ERROR (CIDB confirmed). With fix: 50/50 passes locally with randomization + all 9 count_distinct tests pass + original trigger test 04039 passes. Session: |
d79c424 to
3923cdf
Compare
|
v4 (3923cdf9207): Added v3 had a DatabaseReplicated check failure — same pattern as tests 04059 and 04062 where DatabaseReplicated mode's query rewriting breaks the test. This is a tag-only change; the C++ fix and test logic are identical to v3. @pufit — would appreciate a re-review when you get a chance. v3 addressed all your v1 feedback (stateless regression test instead of unit test, root cause fix at call site instead of band-aid fallback). v4 just adds the tag to skip DatabaseReplicated mode. |
|
Please check bot's comments, they look legit. |
Add skip_subcolumns parameter to getSmallestColumn() so callers can disable the isSubcolumn() filter when working with subquery projection columns instead of storage column lists. The root cause: getSmallestColumn() skips entries where isSubcolumn() returns true, which correctly avoids storage meta-subcolumns (.size0, .keys). But when called from PlannerJoinTree for subquery/union table expressions, projection columns like tup.a (from CountDistinctPass rewrite) carry subcolumn metadata and get incorrectly skipped, causing "No available columns" LOGICAL_ERROR crash. The fix passes skip_subcolumns=false at the subquery call site, keeping the "pick cheapest column" optimization while correctly including all projection entries.
3923cdf to
3fb50a3
Compare
v5 — Addresses performance regression concern@pufit — thanks for the pointer to the bot comments. They were spot-on. What changed v4 → v5: The v4 fix used
Bot feedback addressed:
Verification: 50/50 passes, original trigger test 04039 passes, all count_distinct tests pass. Session: |
pufit
left a comment
There was a problem hiding this comment.
LGTM, but need someone else to verify it
|
The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix. |
…mn-crash-STID-3938
Adds `src/Planner/tests/gtest_planner_empty_projection.cpp` covering two cases that the existing SQL test `04066_smallest_column_subquery_projection` does not reach: 1. `ExpressionActionsGetSmallestColumn.AllSubcolumnsWithoutSkipReturnsColumn` exercises the new `skip_subcolumns=false` parameter on `ExpressionActions::getSmallestColumn`. A list of nothing-but-subcolumn entries (modelling `tup.a` produced by `FunctionToSubcolumnsPass`) is returned instead of being silently skipped — the original STID 3938-33a6 reproducer in Fast-test / release-like builds. 2. `PlannerJoinTreeEmptyProjectionGuard.SubqueryWithEmptyProjectionThrowsUnsupportedMethod` is the focused regression that the bot review on this PR asked for. It constructs a fully analyzed query tree for `SELECT 1 FROM (SELECT x FROM t)` and simulates the AST-fuzzer-induced corruption by calling `clearProjectionColumns` on the inner subquery, then plans the tampered tree via the `InterpreterSelectQueryAnalyzer(query_tree, ...)` constructor (no passes re-applied). With the guard in place the planner surfaces `UNSUPPORTED_METHOD`; without it `ExpressionActions::getSmallestColumn` throws `LOGICAL_ERROR: No available columns` and `abortOnFailedAssertion` terminates the process in debug/sanitizer builds — the same stack trace as STID 3938-33a6. Verified locally on a Debug build (`ENABLE_TESTS=ON`): - Both tests pass with the guard. - With the guard temporarily disabled, the planner-level test process aborts with SIGABRT (exit code 134); the captured stack trace matches STID 3938-33a6 (`ExpressionActions.cpp:940` → `PlannerJoinTree.cpp:486` → `Planner.cpp:2118` → ... → `InterpreterSelectQueryAnalyzer::getQueryPlan`). The gtest is discovered automatically via the `gtest_*.cpp` GLOB_RECURSE in `src/CMakeLists.txt`; no CMake changes needed.
|
Added a focused regression test in
Validation in both directions (Debug build,
The companion test The gtest is discovered automatically by the |
The gtest fixture in `gtest_planner_empty_projection.cpp` registered a database called `test` in the process-wide `DatabaseCatalog` via its `State` singleton. `gtest_transform_query_for_external_database.cpp` does the same. Alphabetically the `Planner/` test runs first, so it grabs the name and the `Storages/` fixture's `State::instance` then throws "Database test already exists" before the test body executes, breaking 16 existing tests across the `asan_ubsan`, `tsan`, `msan` and `amd_llvm_coverage` builds. Rename the database in the new gtest to `planner_empty_projection_test_db` so the two fixtures can coexist in the same `unit_tests_dbms` process. The existing `Storages/` fixture is untouched. Verified locally by running the full `unit_tests_dbms` binary with filter `PlannerJoinTreeEmptyProjectionGuard*:ExpressionActionsGetSmallestColumn*:TransformQueryForExternalDatabase*` — all 17 tests pass in the same process.
|
Fixes the gtest isolation regression introduced by the previous commit SymptomAfter Root causeBoth FixRename the database in the new gtest to Verified locally by running the full |
LLVM Coverage Report
Changed lines: 100.00% (64/64) · Uncovered code |
|
Hi @groeneai @alexey-milovidov @pufit — the changelog category for this PR might need a look. Current: Why: Reproduced the bug with a production-valid query: Could you verify this is correct? Ignore if the current category is intentional. |
There was a problem hiding this comment.
This was wrong. There is only one row, so returning 2 is a bug -> #105439

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Guard against empty subquery projection that caused server crash (LOGICAL_ERROR: No available columns) during AST fuzzer stress tests.
Description
Problem: In stress tests with AST fuzzer enabled (
ast_fuzzer_runs=5), the server crashes withLOGICAL_ERROR: No available columnsatExpressionActions::getSmallestColumn()called fromPlannerJoinTree.cpp:471. The crash occurs when the planner tries to pick a column from a subquery that has an empty projection — a state that can be created by rare AST fuzzer mutations.STID: 3938-33a6 — 10+ hits across stress tests (amd_debug, amd_msan, amd_tsan, arm_asan_ubsan).
Root cause analysis: When
columns_namesis empty for a subquery table expression (the outer query doesn't reference any specific column from the subquery), the planner callsgetSmallestColumn()on the subquery's projection columns. NormallyRemoveUnusedProjectionColumnsPassensures at least 1 projection column, but passes that create subqueries after it runs (likeCountDistinctPass) or rare AST fuzzer mutations can produce subqueries with empty projections. TheLOGICAL_ERRORexception fromgetSmallestColumn()callsabort(), crashing the server.Fix: Guard the call site in
PlannerJoinTree.cppagainst empty projection by checking before callinggetSmallestColumn(). If the projection is empty, throw a non-fatalUNSUPPORTED_METHODerror instead of propagating togetSmallestColumn()which would throwLOGICAL_ERRORand abort. This converts the server crash into a normal query error that the AST fuzzer handles gracefully.What changed:
src/Planner/PlannerJoinTree.cpp: Added empty-projection guard beforegetSmallestColumn()calltests/queries/0_stateless/04066_smallest_column_subquery_projection.sql: Regression test exercising thegetSmallestColumn()code path throughCountDistinctPassrewrites on tuple subcolumnsPrevious approach (v1): Added a fallback in
getSmallestColumn()to return the first column when all are subcolumns. Per @pufit's review, this masked the real issue — the bug is at the call site (empty projection), not ingetSmallestColumn(). This v2 fixes the call site instead.Version info
26.5.1.730