AST JSON serialization#100412
Conversation
…dd validation - Fix `toString(SelectUnionMode::INTERSECT_DISTINCT)` returning `"INTERSECT_DEFAULT"` instead of `"INTERSECT_DISTINCT"`, which broke JSON round-trip semantics. - Validate `parametrised_alias` cast in `readAlias` — throw `BAD_ARGUMENTS` when the AST node is present but is not an `ASTQueryParameter`. - Add null checks for `getObject` dereferences in `ASTTTLElement::readJSON` (`group_by_key` and `group_by_assignments` arrays). - Throw `BAD_ARGUMENTS` instead of silently skipping non-object items in `ASTViewTargets::readJSON`. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…JSON` to prevent use-after-free `readRawChild` stores raw pointers returned by `children.emplace_back(...).get()`. Subsequent `emplace_back` calls can reallocate the vector and invalidate all previously stored raw pointers (`col_decl`, `column`, `order_by`, etc.). Reserve capacity upfront so no reallocation occurs during the sequence. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…executeQuery` The `checkASTSizeLimits` call was only reached much later (after several AST traversals like `ReplaceQueryParameterVisitor`, `ApplyWithGlobalVisitor`, etc.), allowing deeply nested JSON ASTs to trigger expensive processing before limit enforcement. Add the check right after deserialization to be consistent with other entry points. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…STUserNameWithHost::readJSON` `ParserUserNameWithHost` produces the user name only as an identifier (optionally a query parameter) or a string literal, and stores the host pattern as a string literal. `ASTUserNameWithHost::getStringFromAST` therefore handles only those shapes and otherwise throws `LOGICAL_ERROR`. `readJSON` previously accepted any AST node for `username` and `host_pattern`. A malformed `SQLSecurity` definer could thus deserialize, say, a `Function` as the user: it would format successfully, but a later `processSQLSecurityOption` calling `ASTUserNameWithHost::toString` would reach the internal `LOGICAL_ERROR` branch in `getStringFromAST` instead of failing as user-input validation. Validate at the JSON boundary that `username` is an `ASTIdentifier` or a string `ASTLiteral`, and that `host_pattern` is a string `ASTLiteral`, rejecting anything else with `BAD_ARGUMENTS`. Adds `04408_ast_json_user_name_with_host_child_typing`: an identifier user name round-trips, while a `Function` or non-string-literal `username` and a non-literal or non-string-literal `host_pattern` are rejected. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
@groeneai, investigate |
|
Investigated. The only red on The hung entry is a single cancelled INSERT ( Mechanism: the AST fuzzer mutated This is distinct from the leaked-local-ThreadPool-at- Evidence it is chronic and unrelated to this PR (CIDB, 30d):
No new actionable signature here, so no separate fix PR. I cannot reproduce the stall locally (no Azure stress environment), so I am not making a speculative change. Tracked under the chronic hung-check umbrella. |
… block The `02995_new_settings_history` fast-test check requires every new setting to be registered in a `SettingsChangesHistory` version block newer than the released baseline (currently `> 26.6`). While this branch was open `master` opened the `26.7` block, but the new `allow_experimental_json_ast_dialect` entry stayed in `26.6`, so the check reported `PLEASE ADD THE NEW SETTING TO SettingsChangesHistory.cpp: allow_experimental_json_ast_dialect WAS ADDED`. Move the entry to the `26.7` block so the check passes. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100412&sha=b0b2284704ea80978a410f3bdf3382bf0b01da85&name_0=PR&name_1=Fast%20test Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…_required_fields` `formatQueryFromJSON` serializes `ASTSystemQuery::Type` as its numeric enum position, and the test hard-codes those numbers to build malformed `SystemQuery` payloads that must be rejected with `BAD_ARGUMENTS`. While this branch was open `master` inserted new `SYSTEM` query types before `SCHEDULE_MERGE` and `REFRESH_VIEW`, shifting their values: - `SCHEDULE_MERGE`: `98` -> `99` (`98` is now `START_CLEANUP`) - `REFRESH_VIEW`: `102` -> `103` (`102` is now `SET_COVERAGE_TEST`) The stale numbers no longer pointed at the types whose `readJSON` requires `table`/`scheduled_merge_parts`, so the payloads deserialized into valid `SYSTEM START CLEANUP` / `SYSTEM SET COVERAGE TEST` queries and the expected `BAD_ARGUMENTS` was never thrown. Update the values to the current enum positions and note in the test that they track the enum order. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100412&sha=b0b2284704ea80978a410f3bdf3382bf0b01da85&name_0=PR&name_1=Fast%20test Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The `readJSON` paths for several AST nodes read nested non-AST struct scalar fields directly through `Poco::getValue`, which coerces scalar types: the string `"yes"` becomes a real `true`, and a non-string value is stringified into a name. That let malformed `clickhouse_json` deserialize into a *different* valid AST instead of failing closed with `BAD_ARGUMENTS` at the boundary. The reviewer flagged `RenameQuery` elements (`from_database`, `from_table`, `to_database`, `to_table`, `if_exists`); this also applies the same strict reads to the sibling non-AST structs that read scalars the same way: - `ASTRenameQuery` element fields - `ASTBackupQuery` element `type`/name fields and `except_tables` - `ASTSetQuery` change/`query_parameters` names and values - `ASTDictionarySettings` and `ASTCreateWasmFunctionQuery` change names - `ASTViewTargets` `kind`/`table_*`/`inner_uuid` - `ASTSystemQuery` `tables` and `server_type` - the complex-type `value` dump in `ASTJSONReadHelpers` Each field is now read through `JSONObjectReader`, which validates the exact JSON scalar type and rejects a wrong type with `BAD_ARGUMENTS`. Valid round-trips are unaffected (the writer always emits the correct scalar type). For the enum discriminators (`type`) that previously relied on `getValue` throwing on a missing key, an explicit presence check keeps a missing field a controlled `BAD_ARGUMENTS` instead of a default value. Adds `04489_ast_json_rename_scalar_coercion` covering the rejected malformed shapes (including the `"if_exists":"yes"` case from review) and the valid round-trips that must still pass. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Pushed Fix. Test. CI. The 3 reds are all pre-existing fuzzer flakes in code paths this PR does not touch (the feature is gated behind
Verified the 8 changed TUs with |
Clears the AI Review "Request changes" verdict (the AST JSON round-trip
and boundary-validation gaps).
- `ASTAlterCommand::formatImpl`: emit `FROM '<path>'` for the
`ATTACH PART '...' FROM '...'` form. The path is stored in `from`
(the parser only sets it for the PART form) and consumed by
`PartitionCommand::parse` as `from_path`, but `formatImpl` dropped it,
so the `parseQueryToJSON` -> `formatQueryFromJSON` round trip hid the
source path.
- `ASTAlterCommand::readJSON`: allow a null `statistics_decl` for the
parser-produced `MATERIALIZE STATISTICS ALL` and `CLEAR STATISTICS ALL`
forms (the reader previously rejected its own writer's JSON for them),
while still requiring the declaration for plain `DROP STATISTICS` and
the non-`ALL` forms, and keeping the no-`TYPE`-list check.
- `ASTAlterCommand::readJSON`: validate the per-command scalar fields the
parser always sets: `FETCH PARTITION` requires `from`,
`[ATTACH|REPLACE] PARTITION ... FROM` requires `from_table`, and
`UNFREEZE PARTITION` / `UNFREEZE` require `with_name`. Malformed
`clickhouse_json` now fails with `BAD_ARGUMENTS` at the boundary
instead of formatting a parser-impossible `FROM `/`WITH NAME`.
- `ASTTTLElement::readJSON`: for `TTL MOVE` require a `DISK`/`VOLUME`
destination with a non-empty name (mirroring the sibling
`MOVE_PARTITION` check) and reject stray destination state on
non-`MOVE` modes; otherwise a non-`DISK`/`VOLUME` destination reaches
the `LOGICAL_ERROR` branch in `formatImpl` and a missing name formats
an empty target.
- `ASTSelectQuery::readJSON`: restore `cte_aliases` with the same typed
read as the column `aliases` (an `ASTExpressionList` whose children are
all `ASTIdentifier`), so malformed JSON cannot reach a later analyzer
cast.
- `LocalConnection`: capture `max_query_size` / `max_ast_depth` /
`max_ast_elements` together with the dialect/gate, before the query's
own `SETTINGS` are applied, and use the captured values in the
`input()` initializer's JSON reparse. A JSON
`INSERT ... FROM input(...) SETTINGS max_ast_depth = 1` can no longer
make the second parse fail under limits the first parse never saw.
- `parseQueryToJSON` / `formatQueryFromJSON`: bump the
`FunctionDocumentation` version to `{26, 7}` to match the `26.7`
settings-history bucket of `allow_experimental_json_ast_dialect`.
The non-array TTL `group_by_*` finding is already enforced by the strict
`JSONObjectReader::getArray` (it throws `BAD_ARGUMENTS` on a present
non-array value); a guard case is added to the test below.
Test: `04490_ast_json_alter_ttl_select_validation` covers the new valid
round-trips and the malformed shapes that must fail closed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Pushed Majors
Nit
The non-array TTL Test. Verification. The seven changed translation units that exist on master were checked with CI. The only red on the previous commit was |
…rget Master independently added 04489_max_threads_auto_parsing_compat, 04490_dict_get_keys_*/04490_table_readonly_*, and 04491_* tests, colliding with this PR's 04489_ast_json_rename_scalar_coercion and 04490_ast_json_alter_ttl_select_validation. Renumber the AST JSON tests to 04492/04493. Also fix the FETCH PARTITION negative case in the alter/ttl/select test: the serializer escapes the forward slash (`escape_forward_slashes` is on by default), so the FROM path is emitted as `"from":"\\/zk"`. The old `replace` target `,"from":"/zk"` never matched, so the field was not stripped and the query round-tripped instead of failing with `BAD_ARGUMENTS`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… SYSTEM server_type element typing, empty-name identifier
Address the AST JSON review findings that let `clickhouse_json` build
parser-impossible ASTs whose formatting hides state the executor still uses:
* `ASTCreateQuery::readJSON` now rejects the attach-only fields
(`attach_short_syntax`, `attach_from_path` / `has_attach_from_path`,
`attach_as_replicated`) when `attach` is false — the parser gates them behind
`attach`, and `formatImpl` asserts `attach || !has_attach_from_path`. It also
requires `has_attach_from_path` to match whether `attach_from_path` is
non-empty, and requires `has_uuid` to match whether a non-`Nil` `uuid` is
present (every parser path derives `has_uuid` from `uuid != Nil`; a mismatched
payload would enable `{uuid}` macro expansion while the formatted SQL shows no
`UUID` clause).
* `ASTSystemQuery::readJSON` validates each `server_type.exclude_types`
element as a non-boolean integer and each `server_type.exclude_custom_names`
element as a string, instead of letting Poco coerce a string `"1"` or a
number `123` into a real `SYSTEM START/STOP LISTEN` filter.
* `ASTIdentifier::readJSON` / `ASTTableIdentifier::readJSON` now require the
single child of an empty-name identifier (`{x:Identifier}`) to be an
`ASTQueryParameter`, mirroring the compound-placeholder branch;
query-parameter visitors downcast it unconditionally.
New test `04494_ast_json_create_system_identifier_validation` covers the
positive round-trips and the malformed `clickhouse_json` shapes that must fail
with `BAD_ARGUMENTS`.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Continued work on this PR (pushed Merge + test renumbering
Fixed the Fast-test failure (
Addressed the AI "Request changes" Majors (all 4 inline threads resolved), in
Verification note: a full feature build is infeasible in this worktree (the build dir is configured for another branch and reconfiguring + rebuilding the 659-commit delta exceeds the foreground limit), so I verified statically and against a binary: all three changed TUs pass |
The `Fast test` `04494_ast_json_create_system_identifier_validation` failed because its reference expected `FROM '/p'` and `UUID '...'`, but the `.sql` test's default `TabSeparated` output escapes a single quote in a string value as `\'`. `formatQueryFromJSON` correctly returns plain `'/p'` / `'...'`; the client renders them as `\'/p\'` / `\'...\'`, matching every other AST JSON round-trip reference (e.g. `04307`'s `\'r\'`). Only the reference was missing the escaping; no code change. CI: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=100412&sha=7d91d06184716e1e9842d871eb25908b6a7421b7&name_0=PR&name_1=Fast%20test Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Address the AI review blocker on hidden UUID state. `ASTTableIdentifier` and `ASTViewTargets` persist a `UUID` on nested table references (a `REFRESH DEPENDS ON src UUID '...'` dependency, a materialized-view `TO` target's `table_id`), and execution honours it (`ASTTableIdentifier::getTableId` and the target `StorageID`), but the formatters never emit it: `ASTIdentifier::formatImplWithoutAlias` prints only the name, and `ASTViewTargets::formatTarget` prints only `db.table` for a `table_id` target. `formatQueryFromJSON` would therefore print a name-only reference while the JSON AST still resolves the table by `UUID`, so the shown SQL can refer to a different table than the AST executes against. Fail closed at the deserialization boundary: `ASTTableIdentifier::readJSON` rejects a `uuid` and `ASTViewTargets::readJSON` rejects a `table_uuid`, both with `BAD_ARGUMENTS`. `inner_uuid` (emitted faithfully as `INNER UUID '...'`) is unaffected, and the `CREATE` / dictionary-name `UUID` (lifted onto the query object and formatted as a `UUID` clause) round-trips as before. Test: `04495_ast_json_nested_uuid_rejection`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Pushed Fixed the Addressed the AI Review blocker on hidden UUID table-reference state ( Left for your decision — the I did not re-merge |
|
@groeneai, fix the |
| /// Capture the parser-affecting settings now, before the query's own `SETTINGS` clause is applied | ||
| /// during execution. The `input()` initializer below reparses `state->query`, and must use the | ||
| /// dialect/gate the query was originally accepted with rather than the (possibly mutated) live ones. | ||
| state->parsed_as_json_dialect = query_context->getSettingsRef()[Setting::dialect] == Dialect::clickhouse_json; |
There was a problem hiding this comment.
This only captures the JSON-specific subset of parser state. The input() initializer still chooses the non-JSON parser and limits from the live context->getSettingsRef() after InterpreterSetQuery::applySettingsFromQuery has already applied the query-local SETTINGS.
Concrete repro: INSERT INTO FUNCTION null('x UInt8') SELECT * FROM input('x UInt8') SETTINGS max_query_size = 1 FORMAT TSV is accepted by the initial parse under the session defaults, but the callback later reparses state->query with max_query_size = 1 and fails before reading any input. SETTINGS dialect = 'kusto' has the same shape and switches the callback to ParserKQLStatement for a query that was accepted as SQL.
Please either carry the already parsed ASTInsertQuery / resolved input format into the initializer, or capture the full parse-time parser configuration for the second parse instead of only the JSON-specific subset.
There was a problem hiding this comment.
Verified this against master: the non-JSON else branch flagged here is byte-for-byte identical to master — same parser construction from the live context->getSettingsRef() and the same settings[max_query_size] / settings[max_parser_depth] / settings[max_parser_backtracks] (and settings[dialect]) passed into parseQueryAndMovePosition / parseKQLQueryAndMovePosition inside the input() initializer callback. So whatever the repro describes for that path (SETTINGS max_query_size = 1, SETTINGS dialect = 'kusto'), it is not introduced or changed by this PR — it applies equally to master today, since the callback has always reparsed with the post-SETTINGS live settings.
What this PR adds is the clickhouse_json branch, and that path is protected: the JSON-affecting settings (dialect gate, max_query_size, max_ast_depth, max_ast_elements) are captured into LocalQueryState at sendQuery time — before the query-local SETTINGS are applied — and used in the callback.
Making the pre-existing non-JSON path robust the same way is a separate, pre-existing-behavior fix (not required for the JSON feature), and the two options have different blast radius:
- Capturing the full parse-time parser configuration for the
elsebranch would also change theSET dialect = 'clickhouse'mid-input escape-hatch semantics that the JSON branch comment relies on — needs a deliberate decision. - Carrying the already-parsed
ASTInsertQuery/ resolved input format into the initializer avoids the reparse entirely but changes theIServerConnection::sendQuerycontract.
Both touch pre-existing clickhouse-local behavior beyond the scope of this PR, so leaving the call to @alexey-milovidov — happy to split either into a separate PR.
|
@alexey-milovidov This is the chronic trunk Latest hit on this PR (STID 0993-27f0, SELECT DISTINCT JSONExtractUIntCaseInsensitive('(\0', j, materialize(toNullable(NULL)))
FROM (SELECT DISTINCT parseQueryToJSON('SELECT DISTINCT a FROM t') AS j GROUP BY toUInt128(1) WITH ROLLUP LIMIT 28
UNION ALL
SELECT DISTINCT parseQueryToJSON('SELECT DISTINCT a FROM t') AS j GROUP BY 1 WITH ROLLUP LIMIT 28)
WHERE NOT (toFixedString(NULL, 1048576))Root cause: set-operation branches are optimized independently, so filter push-down can constant-fold a column in one branch but not its sibling. The plan-time |
LLVM Coverage ReportChanged lines: Changed C/C++ lines covered: 4240/5183 (81.81%) · Uncovered code |

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Add new functions to serialize and deserialize AST as JSON.
Note
Medium Risk
Introduces a new query parsing path that deserializes ASTs from JSON in both server and clients; although gated by an experimental setting, it touches core parsing/execution and adds substantial new serialization/deserialization logic that could impact correctness and resource limits.
Overview
Adds JSON round-trippable AST support: new
parseQueryToJSON()andformatQueryFromJSON()SQL functions convert between SQL and a JSON AST form (with optional best-effort preservation of original comments/whitespace when formatting).Introduces an experimental
dialect = clickhouse_jsonmode (guarded byallow_experimental_json_ast_dialect) that accepts JSON AST directly;executeQuery,ClientBase, andLocalConnectionroute parsing throughIAST::createFromJSONwith newmax_ast_depth/max_ast_elementslimits and aSET-query escape hatch to switch dialects back.Implements JSON (de)serialization (
writeJSON/readJSON) for many AST node types and adds a central JSON AST factory/deserializer inASTFromJSON, including depth/element counting and improved validation for malformed JSON ASTs.Reviewed by Cursor Bugbot for commit 16e5854. Bugbot is set up for automated code reviews on this repo. Configure here.