AST JSON serialization by alexey-milovidov · Pull Request #100412 · ClickHouse/ClickHouse · GitHub
Skip to content

AST JSON serialization#100412

Open
alexey-milovidov wants to merge 277 commits into
masterfrom
ast-json-serialization
Open

AST JSON serialization#100412
alexey-milovidov wants to merge 277 commits into
masterfrom
ast-json-serialization

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented Mar 23, 2026

Copy link
Copy Markdown
Member

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Add new functions to serialize and deserialize AST as JSON.


Note

Medium Risk
Introduces a new query parsing path that deserializes ASTs from JSON in both server and clients; although gated by an experimental setting, it touches core parsing/execution and adds substantial new serialization/deserialization logic that could impact correctness and resource limits.

Overview
Adds JSON round-trippable AST support: new parseQueryToJSON() and formatQueryFromJSON() SQL functions convert between SQL and a JSON AST form (with optional best-effort preservation of original comments/whitespace when formatting).

Introduces an experimental dialect = clickhouse_json mode (guarded by allow_experimental_json_ast_dialect) that accepts JSON AST directly; executeQuery, ClientBase, and LocalConnection route parsing through IAST::createFromJSON with new max_ast_depth/max_ast_elements limits and a SET-query escape hatch to switch dialects back.

Implements JSON (de)serialization (writeJSON/readJSON) for many AST node types and adds a central JSON AST factory/deserializer in ASTFromJSON, including depth/element counting and improved validation for malformed JSON ASTs.

Reviewed by Cursor Bugbot for commit 16e5854. Bugbot is set up for automated code reviews on this repo. Configure here.

@clickhouse-gh

clickhouse-gh Bot commented Mar 23, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-feature Pull request with new product feature label Mar 23, 2026
Comment thread src/Parsers/ASTJSONHelpers.cpp
Comment thread src/Parsers/ASTJSONReadHelpers.cpp Outdated
Comment thread src/Parsers/IAST.cpp Outdated
Comment thread src/Functions/parseQueryToJSON.cpp Outdated
Comment thread src/Client/ClientBase.cpp Outdated
Comment thread src/Client/LocalConnection.cpp Outdated
Comment thread src/Functions/formatQueryFromJSON.cpp Outdated
Comment thread src/Functions/parseQueryToJSON.cpp
Comment thread src/Parsers/ASTFromJSON.cpp
Comment thread src/Parsers/ASTAssignment.h Outdated
Comment thread src/Parsers/ASTSQLSecurity.cpp Outdated
Comment thread tests/queries/0_stateless/04054_ast_json_serialization.sql Outdated
Comment thread src/Parsers/SelectUnionMode.cpp Outdated
Comment thread src/Parsers/ASTDictionary.cpp
Comment thread src/Parsers/ASTSetQuery.cpp
Comment thread src/Parsers/ASTSetQuery.cpp
Comment thread src/Parsers/ASTRenameQuery.cpp
Comment thread src/Parsers/ASTTTLElement.cpp
Comment thread src/Parsers/ASTViewTargets.cpp Outdated
Comment thread src/Parsers/ASTSelectWithUnionQuery.cpp
Comment thread src/Parsers/ASTJSONReadHelpers.h
alexey-milovidov and others added 2 commits March 29, 2026 00:18
…dd validation

- Fix `toString(SelectUnionMode::INTERSECT_DISTINCT)` returning `"INTERSECT_DEFAULT"`
  instead of `"INTERSECT_DISTINCT"`, which broke JSON round-trip semantics.
- Validate `parametrised_alias` cast in `readAlias` — throw `BAD_ARGUMENTS` when
  the AST node is present but is not an `ASTQueryParameter`.
- Add null checks for `getObject` dereferences in `ASTTTLElement::readJSON`
  (`group_by_key` and `group_by_assignments` arrays).
- Throw `BAD_ARGUMENTS` instead of silently skipping non-object items in
  `ASTViewTargets::readJSON`.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread src/Parsers/ASTAlterQuery.cpp
…JSON` to prevent use-after-free

`readRawChild` stores raw pointers returned by `children.emplace_back(...).get()`.
Subsequent `emplace_back` calls can reallocate the vector and invalidate all
previously stored raw pointers (`col_decl`, `column`, `order_by`, etc.).
Reserve capacity upfront so no reallocation occurs during the sequence.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread src/Interpreters/executeQuery.cpp Outdated
…executeQuery`

The `checkASTSizeLimits` call was only reached much later (after several
AST traversals like `ReplaceQueryParameterVisitor`, `ApplyWithGlobalVisitor`,
etc.), allowing deeply nested JSON ASTs to trigger expensive processing
before limit enforcement. Add the check right after deserialization to be
consistent with other entry points.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread src/Parsers/ASTFromJSON.cpp Outdated
Comment thread src/Parsers/Access/ASTUserNameWithHost.cpp
…STUserNameWithHost::readJSON`

`ParserUserNameWithHost` produces the user name only as an identifier
(optionally a query parameter) or a string literal, and stores the host
pattern as a string literal. `ASTUserNameWithHost::getStringFromAST`
therefore handles only those shapes and otherwise throws `LOGICAL_ERROR`.

`readJSON` previously accepted any AST node for `username` and
`host_pattern`. A malformed `SQLSecurity` definer could thus deserialize,
say, a `Function` as the user: it would format successfully, but a later
`processSQLSecurityOption` calling `ASTUserNameWithHost::toString` would
reach the internal `LOGICAL_ERROR` branch in `getStringFromAST` instead of
failing as user-input validation.

Validate at the JSON boundary that `username` is an `ASTIdentifier` or a
string `ASTLiteral`, and that `host_pattern` is a string `ASTLiteral`,
rejecting anything else with `BAD_ARGUMENTS`.

Adds `04408_ast_json_user_name_with_host_child_typing`: an identifier user
name round-trips, while a `Function` or non-string-literal `username` and a
non-literal or non-string-literal `host_pattern` are rejected.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai, investigate Hung check failed, possible deadlock found

@groeneai

Copy link
Copy Markdown
Contributor

Investigated. The only red on ac582af1 is Stress test (amd_tsan) / Hung check failed, possible deadlock found, and it is the chronic trunk hung-check family, not caused by this PR.

The hung entry is a single cancelled INSERT (is_cancelled: 1, elapsed 1382s, well past max_execution_time: 60) stuck inside an Azure blob upload during MergeTree part finalization. Captured stack:

MergeTreeSink::consume
 -> MergeTreeDataWriter::writeTempPart
 -> MergedBlockOutputStream::finalizePartAsync -> finalizePartOnDisk
 -> WriteBufferWithFinalizeCallback::preFinalize -> ForkWriteBuffer::preFinalize
 -> WriteBufferFromAzureBlobStorage::preFinalize -> execWithRetry
 -> Azure SDK BlockBlobClient::Upload -> ... -> PocoAzureHTTPClient::Send

Mechanism: the AST fuzzer mutated 01593_insert_settings to max_partitions_per_insert_block = 1, so the INSERT produced 21,310 tiny compact parts (InsertedCompactParts: 21310, AzureWriteRequestsCount: 235118). The query was cancelled, but the in-flight Azure Send does not unblock, so it lingers in the processlist past the hung-check window. Symbolization then failed in the harness (lldb -p timed out at 30s, system.stack_trace exited 36), which is why the report shows "suspiciously small stacktraces".

This is distinct from the leaked-local-ThreadPool-at-GlobalThreadPool::shutdown variant that the diagnostic instrument #105905 targets; it is the Azure-object-storage finalization stall sub-pattern of the same umbrella.

Evidence it is chronic and unrelated to this PR (CIDB, 30d):

  • Hung check failed, possible deadlock found fires 5-28 times/day on master, every day.
  • The Azure-upload sub-signature alone: 226 hits / 120 distinct PRs / 96 on master, last seen 2026-06-25 20:49Z.
  • On amd_tsan stress specifically: 113 hits / 60 PRs / 52 master.
  • This PR's diff is 216 files, all AST parsing/serialization (src/Parsers/AST*, parseQueryToJSON/formatQueryFromJSON, SettingsEnums); zero storage/MergeTree/Azure code, and the feature is experimental and gated.

No new actionable signature here, so no separate fix PR. I cannot reproduce the stall locally (no Azure stress environment), so I am not making a speculative change. Tracked under the chronic hung-check umbrella.

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100412&sha=ac582af1ef90443c867a2afaa3589e067cae13cc&name_0=PR&name_1=Stress%20test%20%28amd_tsan%29

alexey-milovidov and others added 4 commits June 28, 2026 23:00
… block

The `02995_new_settings_history` fast-test check requires every new setting
to be registered in a `SettingsChangesHistory` version block newer than the
released baseline (currently `> 26.6`). While this branch was open `master`
opened the `26.7` block, but the new `allow_experimental_json_ast_dialect`
entry stayed in `26.6`, so the check reported
`PLEASE ADD THE NEW SETTING TO SettingsChangesHistory.cpp: allow_experimental_json_ast_dialect WAS ADDED`.

Move the entry to the `26.7` block so the check passes.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100412&sha=b0b2284704ea80978a410f3bdf3382bf0b01da85&name_0=PR&name_1=Fast%20test

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…_required_fields`

`formatQueryFromJSON` serializes `ASTSystemQuery::Type` as its numeric enum
position, and the test hard-codes those numbers to build malformed
`SystemQuery` payloads that must be rejected with `BAD_ARGUMENTS`. While this
branch was open `master` inserted new `SYSTEM` query types before
`SCHEDULE_MERGE` and `REFRESH_VIEW`, shifting their values:

- `SCHEDULE_MERGE`: `98` -> `99` (`98` is now `START_CLEANUP`)
- `REFRESH_VIEW`: `102` -> `103` (`102` is now `SET_COVERAGE_TEST`)

The stale numbers no longer pointed at the types whose `readJSON` requires
`table`/`scheduled_merge_parts`, so the payloads deserialized into valid
`SYSTEM START CLEANUP` / `SYSTEM SET COVERAGE TEST` queries and the expected
`BAD_ARGUMENTS` was never thrown. Update the values to the current enum
positions and note in the test that they track the enum order.

CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100412&sha=b0b2284704ea80978a410f3bdf3382bf0b01da85&name_0=PR&name_1=Fast%20test

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread src/Parsers/ASTRenameQuery.cpp Outdated
The `readJSON` paths for several AST nodes read nested non-AST struct
scalar fields directly through `Poco::getValue`, which coerces scalar
types: the string `"yes"` becomes a real `true`, and a non-string value
is stringified into a name. That let malformed `clickhouse_json`
deserialize into a *different* valid AST instead of failing closed with
`BAD_ARGUMENTS` at the boundary.

The reviewer flagged `RenameQuery` elements (`from_database`,
`from_table`, `to_database`, `to_table`, `if_exists`); this also applies
the same strict reads to the sibling non-AST structs that read scalars
the same way:

- `ASTRenameQuery` element fields
- `ASTBackupQuery` element `type`/name fields and `except_tables`
- `ASTSetQuery` change/`query_parameters` names and values
- `ASTDictionarySettings` and `ASTCreateWasmFunctionQuery` change names
- `ASTViewTargets` `kind`/`table_*`/`inner_uuid`
- `ASTSystemQuery` `tables` and `server_type`
- the complex-type `value` dump in `ASTJSONReadHelpers`

Each field is now read through `JSONObjectReader`, which validates the
exact JSON scalar type and rejects a wrong type with `BAD_ARGUMENTS`.
Valid round-trips are unaffected (the writer always emits the correct
scalar type). For the enum discriminators (`type`) that previously relied
on `getValue` throwing on a missing key, an explicit presence check keeps
a missing field a controlled `BAD_ARGUMENTS` instead of a default value.

Adds `04489_ast_json_rename_scalar_coercion` covering the rejected
malformed shapes (including the `"if_exists":"yes"` case from review) and
the valid round-trips that must still pass.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed ba34ee4b08b addressing the AI Review Request changes verdict (the nested non-AST scalar validation gap).

Fix. RenameQuery element scalars (from_database/from_table/to_database/to_table/if_exists) are now read through JSONObjectReader (getString/getBool) instead of Poco::getValue, so a wrong JSON scalar type — e.g. the "if_exists":"yes" shape called out in review — is rejected with BAD_ARGUMENTS at the boundary instead of being coerced into a different valid AST. I applied the same strict reads to the sibling non-AST structs that still called Poco::getValue directly: ASTBackupQuery, ASTSetQuery, ASTDictionarySettings, ASTCreateWasmFunctionQuery, ASTViewTargets, ASTSystemQuery, and the complex-type value dump in ASTJSONReadHelpers. Valid round-trips are unaffected (the writer always emits the correct scalar type). The single unresolved review thread is resolved.

Test. 04489_ast_json_rename_scalar_coercion — the rejected malformed shapes (incl. "if_exists":"yes", a number into a name, a bool into a name) and the valid round-trips that must still pass.

CI. The 3 reds are all pre-existing fuzzer flakes in code paths this PR does not touch (the feature is gated behind allow_experimental_json_ast_dialect, not used by any failing query):

Verified the 8 changed TUs with -fsyntax-only (clean); the new test's exact JSON replace targets are confirmed against the serialization in writeJSON and the existing 04054/04339 references.

Comment thread src/Parsers/ASTAlterQuery.cpp
Comment thread src/Parsers/ASTAlterQuery.cpp Outdated
Comment thread src/Parsers/ASTTTLElement.cpp
Clears the AI Review "Request changes" verdict (the AST JSON round-trip
and boundary-validation gaps).

- `ASTAlterCommand::formatImpl`: emit `FROM '<path>'` for the
  `ATTACH PART '...' FROM '...'` form. The path is stored in `from`
  (the parser only sets it for the PART form) and consumed by
  `PartitionCommand::parse` as `from_path`, but `formatImpl` dropped it,
  so the `parseQueryToJSON` -> `formatQueryFromJSON` round trip hid the
  source path.
- `ASTAlterCommand::readJSON`: allow a null `statistics_decl` for the
  parser-produced `MATERIALIZE STATISTICS ALL` and `CLEAR STATISTICS ALL`
  forms (the reader previously rejected its own writer's JSON for them),
  while still requiring the declaration for plain `DROP STATISTICS` and
  the non-`ALL` forms, and keeping the no-`TYPE`-list check.
- `ASTAlterCommand::readJSON`: validate the per-command scalar fields the
  parser always sets: `FETCH PARTITION` requires `from`,
  `[ATTACH|REPLACE] PARTITION ... FROM` requires `from_table`, and
  `UNFREEZE PARTITION` / `UNFREEZE` require `with_name`. Malformed
  `clickhouse_json` now fails with `BAD_ARGUMENTS` at the boundary
  instead of formatting a parser-impossible `FROM `/`WITH NAME`.
- `ASTTTLElement::readJSON`: for `TTL MOVE` require a `DISK`/`VOLUME`
  destination with a non-empty name (mirroring the sibling
  `MOVE_PARTITION` check) and reject stray destination state on
  non-`MOVE` modes; otherwise a non-`DISK`/`VOLUME` destination reaches
  the `LOGICAL_ERROR` branch in `formatImpl` and a missing name formats
  an empty target.
- `ASTSelectQuery::readJSON`: restore `cte_aliases` with the same typed
  read as the column `aliases` (an `ASTExpressionList` whose children are
  all `ASTIdentifier`), so malformed JSON cannot reach a later analyzer
  cast.
- `LocalConnection`: capture `max_query_size` / `max_ast_depth` /
  `max_ast_elements` together with the dialect/gate, before the query's
  own `SETTINGS` are applied, and use the captured values in the
  `input()` initializer's JSON reparse. A JSON
  `INSERT ... FROM input(...) SETTINGS max_ast_depth = 1` can no longer
  make the second parse fail under limits the first parse never saw.
- `parseQueryToJSON` / `formatQueryFromJSON`: bump the
  `FunctionDocumentation` version to `{26, 7}` to match the `26.7`
  settings-history bucket of `allow_experimental_json_ast_dialect`.

The non-array TTL `group_by_*` finding is already enforced by the strict
`JSONObjectReader::getArray` (it throws `BAD_ARGUMENTS` on a present
non-array value); a guard case is added to the test below.

Test: `04490_ast_json_alter_ttl_select_validation` covers the new valid
round-trips and the malformed shapes that must fail closed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed bc7c3d73b9b addressing the AI Review Request changes verdict (the remaining AST JSON round-trip / boundary-validation gaps). All eight inline threads are addressed and resolved.

Majors

  • ATTACH PART '...' FROM '...'ASTAlterCommand::formatImpl now emits the FROM '<path>' clause for the PART form (part && !from.empty()), so the source path that PartitionCommand::parse consumes as from_path survives the parseQueryToJSONformatQueryFromJSON round trip instead of being silently dropped.
  • MATERIALIZE STATISTICS ALL / CLEAR STATISTICS ALLreadJSON now allows a null statistics_decl for these parser-produced ALL forms (the reader was rejecting its own writer's JSON), while still requiring the declaration for plain DROP STATISTICS / non-ALL forms and keeping the no-TYPE-list check.
  • ALTER scalar fields — FETCH PARTITION now requires from, [ATTACH|REPLACE] PARTITION ... FROM requires from_table, and UNFREEZE PARTITION / UNFREEZE require with_name; malformed JSON fails with BAD_ARGUMENTS at the boundary instead of formatting a parser-impossible FROM /WITH NAME. (The per-Type child requirements were already enforced by the existing switch (type) block.)
  • cte_aliases — restored with the same typed read as the column aliases (an ASTExpressionList whose children are all ASTIdentifier), via a shared setIdentifierList helper.
  • TTL MOVEreadJSON now requires a DISK/VOLUME destination with a non-empty name (mirroring the sibling MOVE_PARTITION check) and rejects stray destination state on non-MOVE modes, so malformed JSON can no longer reach the LOGICAL_ERROR branch in formatImpl or format an empty target.
  • LocalConnection input() reparse — max_query_size / max_ast_depth / max_ast_elements are now captured into LocalQueryState together with the dialect/gate (before query-local SETTINGS apply) and the JSON reparse uses the captured values, so an INSERT ... FROM input(...) SETTINGS max_ast_depth = 1 cannot make the second parse fail under limits the first parse never saw.

Nit

  • parseQueryToJSON / formatQueryFromJSON FunctionDocumentation versions bumped to {26, 7} to match the 26.7 settings-history bucket of allow_experimental_json_ast_dialect.

The non-array TTL group_by_* finding is already enforced by the strict JSONObjectReader::getArray (it throws BAD_ARGUMENTS on a present non-array value); a guard case is included in the new test.

Test. 04490_ast_json_alter_ttl_select_validation adds the valid round-trips (ATTACH PART ... FROM, MATERIALIZE/CLEAR STATISTICS ALL, FETCH/REPLACE/UNFREEZE, TTL MOVE, CTE aliases) and the malformed shapes that must fail closed (invalid/empty TTL MOVE destination, non-array group_by_key, missing FETCH from / REPLACE from_table / UNFREEZE with_name, malformed cte_aliases).

Verification. The seven changed translation units that exist on master were checked with -fsyntax-only (clean). The positive round-trip references were generated with formatQuerySingleLine on a local build (verified byte-identical to the existing formatQueryFromJSON(parseQueryToJSON(...)) references), and every negative replace() target was confirmed against the writeJSON serialization. A full build of this branch was not run locally (the shared build dir was occupied building another branch).

CI. The only red on the previous commit was 04204_global_in_join_max_rows_to_transfer_103333Unexpected packet from server (expected TablesStatusResponse, got ProfileInfo) (UNEXPECTED_PACKET_FROM_SERVER). The randomized-settings diagnosis re-ran it 145× with 0 failures, and the test exercises distributed GLOBAL IN (unrelated to the AST JSON paths, which are gated behind allow_experimental_json_ast_dialect). It is the distributed TablesStatus desync recurrence of #93018, with a fix already in flight in #108854.

Comment thread src/Parsers/ASTCreateQuery.cpp
Comment thread src/Parsers/ASTSystemQuery.cpp Outdated
Comment thread src/Parsers/ASTCreateQuery.cpp
Comment thread src/Parsers/ASTIdentifier.cpp
alexey-milovidov and others added 3 commits June 30, 2026 21:01
…rget

Master independently added 04489_max_threads_auto_parsing_compat,
04490_dict_get_keys_*/04490_table_readonly_*, and 04491_* tests, colliding
with this PR's 04489_ast_json_rename_scalar_coercion and
04490_ast_json_alter_ttl_select_validation. Renumber the AST JSON tests to
04492/04493.

Also fix the FETCH PARTITION negative case in the alter/ttl/select test: the
serializer escapes the forward slash (`escape_forward_slashes` is on by
default), so the FROM path is emitted as `"from":"\\/zk"`. The old
`replace` target `,"from":"/zk"` never matched, so the field was not
stripped and the query round-tripped instead of failing with `BAD_ARGUMENTS`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… SYSTEM server_type element typing, empty-name identifier

Address the AST JSON review findings that let `clickhouse_json` build
parser-impossible ASTs whose formatting hides state the executor still uses:

* `ASTCreateQuery::readJSON` now rejects the attach-only fields
  (`attach_short_syntax`, `attach_from_path` / `has_attach_from_path`,
  `attach_as_replicated`) when `attach` is false — the parser gates them behind
  `attach`, and `formatImpl` asserts `attach || !has_attach_from_path`. It also
  requires `has_attach_from_path` to match whether `attach_from_path` is
  non-empty, and requires `has_uuid` to match whether a non-`Nil` `uuid` is
  present (every parser path derives `has_uuid` from `uuid != Nil`; a mismatched
  payload would enable `{uuid}` macro expansion while the formatted SQL shows no
  `UUID` clause).

* `ASTSystemQuery::readJSON` validates each `server_type.exclude_types`
  element as a non-boolean integer and each `server_type.exclude_custom_names`
  element as a string, instead of letting Poco coerce a string `"1"` or a
  number `123` into a real `SYSTEM START/STOP LISTEN` filter.

* `ASTIdentifier::readJSON` / `ASTTableIdentifier::readJSON` now require the
  single child of an empty-name identifier (`{x:Identifier}`) to be an
  `ASTQueryParameter`, mirroring the compound-placeholder branch;
  query-parameter visitors downcast it unconditionally.

New test `04494_ast_json_create_system_identifier_validation` covers the
positive round-trips and the malformed `clickhouse_json` shapes that must fail
with `BAD_ARGUMENTS`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Continued work on this PR (pushed bc7c3d73b9b..7d91d061847):

Merge + test renumbering

  • Merged master into the branch (clean, 0 conflicts; the branch was 659 commits behind).
  • master independently added 04489_max_threads_auto_parsing_compat, 04490_dict_get_keys_* / 04490_table_readonly_*, and 04491_*, colliding with this PR's AST JSON tests. Renumbered them to 04492_ast_json_rename_scalar_coercion and 04493_ast_json_alter_ttl_select_validation.

Fixed the Fast-test failure (04490/now 04493)

  • The FETCH PARTITION negative case used replace(..., ',"from":"/zk"', ''), but the serializer escapes the forward slash (escape_forward_slashes is on by default), so the path is emitted as "from":"\/zk". The old target never matched, the field was not stripped, and the query round-tripped instead of failing with BAD_ARGUMENTS. The replace target now uses the escaped form (verified the SQL escaping on a binary: replace('…"from":"\/zk"…', ',"from":"\\/zk"', '') strips the field).

Addressed the AI "Request changes" Majors (all 4 inline threads resolved), in 7d91d061847:

  • ASTCreateQuery::readJSON rejects the attach-only fields (attach_short_syntax, attach_from_path / has_attach_from_path, attach_as_replicated) when attach is false, requires has_attach_from_path to match a non-empty attach_from_path, and rejects has_uuid != (uuid != Nil).
  • ASTSystemQuery::readJSON strictly types server_type.exclude_types (non-boolean integer) and exclude_custom_names (string) via Poco::Dynamic::Var checks instead of coercive getElement.
  • ASTIdentifier::readJSON / ASTTableIdentifier::readJSON require the single child of an empty-name identifier ({x:Identifier}) to be an ASTQueryParameter.
  • New test 04494_ast_json_create_system_identifier_validation covers the positive round-trips and the malformed clickhouse_json shapes that must fail with BAD_ARGUMENTS.

ASTFunction::kind masking finding (src/Parsers/ASTFunction.cpp:204)

  • This was previously dismissed by the author and the thread is resolved. The AI re-raised it as a real risk (a clickhouse_json payload setting "kind":"CODEC"/"LAMBDA_FUNCTION" on a secret-bearing ordinary function could skip the ORDINARY_FUNCTION secret-masking path in FunctionSecretArgumentsFinderAST). Leaving the design call to the owner, @alexey-milovidov: either validate special kind values against their parser-produced contexts in readJSON, or move secret masking so it does not trust kind for user-provided JSON, plus a focused negative test. I did not change it to respect the existing dismissal.

Verification note: a full feature build is infeasible in this worktree (the build dir is configured for another branch and reconfiguring + rebuilding the 659-commit delta exceeds the foreground limit), so I verified statically and against a binary: all three changed TUs pass -fsyntax-only; the C++ uses the same Poco::Dynamic::Var API as ASTBackupQuery/ASTJSONReadHelpers; the positive references were generated with formatQuerySingleLine (byte-identical to formatQueryFromJSON(parseQueryToJSON(...)) for unchanged formatImpl); and every negative replace target was checked against the exact writeJSON output (a mismatched target surfaces as a reference mismatch in CI, not a silent pass).

Comment thread src/Parsers/ASTIdentifier.cpp
alexey-milovidov and others added 2 commits July 1, 2026 00:18
The `Fast test` `04494_ast_json_create_system_identifier_validation` failed
because its reference expected `FROM '/p'` and `UUID '...'`, but the `.sql`
test's default `TabSeparated` output escapes a single quote in a string value
as `\'`. `formatQueryFromJSON` correctly returns plain `'/p'` / `'...'`; the
client renders them as `\'/p\'` / `\'...\'`, matching every other AST JSON
round-trip reference (e.g. `04307`'s `\'r\'`). Only the reference was missing
the escaping; no code change.

CI: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100412&sha=7d91d06184716e1e9842d871eb25908b6a7421b7&name_0=PR&name_1=Fast%20test

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address the AI review blocker on hidden UUID state. `ASTTableIdentifier` and
`ASTViewTargets` persist a `UUID` on nested table references (a
`REFRESH DEPENDS ON src UUID '...'` dependency, a materialized-view `TO`
target's `table_id`), and execution honours it (`ASTTableIdentifier::getTableId`
and the target `StorageID`), but the formatters never emit it:
`ASTIdentifier::formatImplWithoutAlias` prints only the name, and
`ASTViewTargets::formatTarget` prints only `db.table` for a `table_id` target.
`formatQueryFromJSON` would therefore print a name-only reference while the JSON
AST still resolves the table by `UUID`, so the shown SQL can refer to a
different table than the AST executes against.

Fail closed at the deserialization boundary: `ASTTableIdentifier::readJSON`
rejects a `uuid` and `ASTViewTargets::readJSON` rejects a `table_uuid`, both with
`BAD_ARGUMENTS`. `inner_uuid` (emitted faithfully as `INNER UUID '...'`) is
unaffected, and the `CREATE` / dictionary-name `UUID` (lifted onto the query
object and formatted as a `UUID` clause) round-trips as before.

Test: `04495_ast_json_nested_uuid_rejection`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed 7d91d061847..94d6f534230:

Fixed the Fast test failure (04494_ast_json_create_system_identifier_validation, the only red check). The reference expected FROM '/p' and UUID '...', but the .sql test's default TabSeparated output escapes a single quote in a string value as \', so the stdout is FROM \'/p\' / UUID \'…\'. formatQueryFromJSON returns the correct plain-quoted SQL; only the reference was missing the client-side escaping (every other AST JSON round-trip reference already has it, e.g. 04307's \'r\'). Reference-only change, no code.

Addressed the AI Review blocker on hidden UUID table-reference state (ASTIdentifier.cpp:326 / ASTViewTargets.cpp:449; the last unresolved thread, now resolved). Took the review's reject option. ASTTableIdentifier::readJSON now rejects a uuid, and ASTViewTargets::readJSON rejects a table_uuid, both with BAD_ARGUMENTS, because the nested-reference formatters (ASTIdentifier::formatImplWithoutAlias, ASTViewTargets::formatTarget for a table_id target) never emit a UUID clause — so a payload carrying one would format a name-only reference while the JSON AST still resolves the table by UUID. inner_uuid (emitted faithfully as INNER UUID '...') is untouched; the CREATE / dictionary-name UUID (lifted onto the query object via getTable(), which returns a plain ASTIdentifier, and formatted as a UUID clause) round-trips as before — verified no existing test round-trips a nested-ref uuid or table_uuid. New test 04495_ast_json_nested_uuid_rejection (positive round-trips + the two injected-UUID rejections). Both TUs pass -fsyntax-only.

Left for your decision — the ASTFunction::kind secret-masking blocker (ASTFunction.cpp:204). This is the security-contract call you already reserved to yourself in the previous comment, and I did not want to pre-empt it. readJSON accepts any special kind (CODEC/STATISTICS/…) on any function name, and FunctionSecretArgumentsFinderAST skips ordinary-function secret detection for those kinds, so a clickhouse_json payload could mark a secret-bearing ordinary function as CODEC and get it logged unmasked. Of the two options, the JSON-boundary one (validate/reject a special kind against the function's parser-produced context in readJSON) is more contained than changing the global secret-masking path, but it needs the per-kind legitimacy rules that only you should define. The AI verdict will stay ❌ Block until this is resolved.

I did not re-merge master (branch is 44 commits behind, merge-base 2026-06-30 19:17Z < 1 day, and the red was a PR-internal reference bug, not master drift).

@alexey-milovidov

Copy link
Copy Markdown
Member Author

@groeneai, fix the Block structure mismatch error.

/// Capture the parser-affecting settings now, before the query's own `SETTINGS` clause is applied
/// during execution. The `input()` initializer below reparses `state->query`, and must use the
/// dialect/gate the query was originally accepted with rather than the (possibly mutated) live ones.
state->parsed_as_json_dialect = query_context->getSettingsRef()[Setting::dialect] == Dialect::clickhouse_json;

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only captures the JSON-specific subset of parser state. The input() initializer still chooses the non-JSON parser and limits from the live context->getSettingsRef() after InterpreterSetQuery::applySettingsFromQuery has already applied the query-local SETTINGS.

Concrete repro: INSERT INTO FUNCTION null('x UInt8') SELECT * FROM input('x UInt8') SETTINGS max_query_size = 1 FORMAT TSV is accepted by the initial parse under the session defaults, but the callback later reparses state->query with max_query_size = 1 and fails before reading any input. SETTINGS dialect = 'kusto' has the same shape and switches the callback to ParserKQLStatement for a query that was accepted as SQL.

Please either carry the already parsed ASTInsertQuery / resolved input format into the initializer, or capture the full parse-time parser configuration for the second parse instead of only the JSON-specific subset.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified this against master: the non-JSON else branch flagged here is byte-for-byte identical to master — same parser construction from the live context->getSettingsRef() and the same settings[max_query_size] / settings[max_parser_depth] / settings[max_parser_backtracks] (and settings[dialect]) passed into parseQueryAndMovePosition / parseKQLQueryAndMovePosition inside the input() initializer callback. So whatever the repro describes for that path (SETTINGS max_query_size = 1, SETTINGS dialect = 'kusto'), it is not introduced or changed by this PR — it applies equally to master today, since the callback has always reparsed with the post-SETTINGS live settings.

What this PR adds is the clickhouse_json branch, and that path is protected: the JSON-affecting settings (dialect gate, max_query_size, max_ast_depth, max_ast_elements) are captured into LocalQueryState at sendQuery time — before the query-local SETTINGS are applied — and used in the callback.

Making the pre-existing non-JSON path robust the same way is a separate, pre-existing-behavior fix (not required for the JSON feature), and the two options have different blast radius:

  • Capturing the full parse-time parser configuration for the else branch would also change the SET dialect = 'clickhouse' mid-input escape-hatch semantics that the JSON branch comment relies on — needs a deliberate decision.
  • Carrying the already-parsed ASTInsertQuery / resolved input format into the initializer avoids the reparse entirely but changes the IServerConnection::sendQuery contract.

Both touch pre-existing clickhouse-local behavior beyond the scope of this PR, so leaving the call to @alexey-milovidov — happy to split either into a separate PR.

@groeneai

groeneai commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@alexey-milovidov This is the chronic trunk UnionStep constant-fold bug, not caused by this PR. A separate fix PR is already open: #107719 (closes #106956).

Latest hit on this PR (STID 0993-27f0, AST fuzzer (amd_debug, targeted), 32889d60b24):

Block structure mismatch in UnionStep stream: different columns:
__table1.j String Const(size = 0, String(size = 1))
__table1.j String String(size = 0)
SELECT DISTINCT JSONExtractUIntCaseInsensitive('(\0', j, materialize(toNullable(NULL)))
FROM (SELECT DISTINCT parseQueryToJSON('SELECT DISTINCT a FROM t') AS j GROUP BY toUInt128(1) WITH ROLLUP LIMIT 28
      UNION ALL
      SELECT DISTINCT parseQueryToJSON('SELECT DISTINCT a FROM t') AS j GROUP BY 1 WITH ROLLUP LIMIT 28)
WHERE NOT (toFixedString(NULL, 1048576))

Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100412&sha=32889d60b24cb0c73c1a6547c99afc3e360efd18&name_0=PR&name_1=AST%20fuzzer%20%28amd_debug%2C%20targeted%29

Root cause: set-operation branches are optimized independently, so filter push-down can constant-fold a column in one branch but not its sibling. The plan-time checkHeaders in UnionStep::updateOutputHeader (UnionStep.cpp:28) uses strict assertBlocksHaveEqualStructure and rejects the Const meeting a full column of the same type, even though the execution-time makeConvertingActions path already tolerates it. #107719 makes the plan-time check const-tolerant at that exact site. Trunk-wide: 208 hits across 154 distinct PRs and 17 on master in the last 30 days.

@clickhouse-gh

clickhouse-gh Bot commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.40% +0.00%
Functions 92.60% 92.60% +0.00%
Branches 77.60% 77.60% +0.00%

Changed lines: Changed C/C++ lines covered: 4240/5183 (81.81%) · Uncovered code

Full report · Diff report

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-feature Pull request with new product feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants