Map MySQL geometry types to ClickHouse geometric types by alexey-milovidov · Pull Request #108944 · ClickHouse/ClickHouse · GitHub
Skip to content

Map MySQL geometry types to ClickHouse geometric types#108944

Open
alexey-milovidov wants to merge 13 commits into
masterfrom
feat/mysql-geometry-types
Open

Map MySQL geometry types to ClickHouse geometric types#108944
alexey-milovidov wants to merge 13 commits into
masterfrom
feat/mysql-geometry-types

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented Jun 30, 2026

Copy link
Copy Markdown
Member

The MySQL integrations (the MySQL database engine, table engine and table function) previously mapped only POINT to the ClickHouse Point type and converted every other MySQL spatial type (LINESTRING, POLYGON, MULTILINESTRING, MULTIPOLYGON, generic GEOMETRY) to String, exposing the raw WKB bytes.

This adds a new geometry flag to the mysql_datatypes_support_level setting, enabled by default, that maps MySQL's spatial column types to the corresponding ClickHouse geometric types:

MySQL ClickHouse
POINT Point (always, regardless of the flag)
LINESTRING LineString
POLYGON Polygon
MULTILINESTRING MultiLineString
MULTIPOLYGON MultiPolygon
GEOMETRY (generic) Geometry (the umbrella Variant)
MULTIPOINT, GEOMETRYCOLLECTION String

A concrete spatial column type (LINESTRING, POLYGON, MULTILINESTRING, MULTIPOLYGON) is mapped to its ClickHouse counterpart, because MySQL enforces a single subtype per such column, so a read can never encounter a value of a different subtype.

The generic GEOMETRY column type is mapped to the umbrella Geometry type (a Variant over Point/LineString/Polygon/MultiLineString/MultiPolygon/Ring). Such a column can store values of any subtype, including ones that have no ClickHouse counterpart (MULTIPOINT, GEOMETRYCOLLECTION); this incompatibility is accepted: a value of a representable subtype reads back correctly, while a value of a subtype with no ClickHouse counterpart throws an exception at read time, because parseWKBFormat rejects the WKB type and the Geometry Variant has no alternative for it.

POINT keeps being converted to Point unconditionally for backward compatibility (it has been supported since version 24.1). Columns declared as MULTIPOINT or GEOMETRYCOLLECTION themselves have no ClickHouse counterpart and, as all unknown types, fall back to String. When the geometry flag is disabled, the previous behavior is restored (everything except POINT becomes String).

MySQLSource reads the geometric values (for the concrete spatial columns, and for columns declared as a geometric or Geometry type) by stripping the 4-byte SRID prefix and parsing the standard WKB payload with parseWKBFormat, then inserting the result into the concrete geometric column or into the Geometry Variant by discriminator.

Motivation. This was raised in the review of #107740 (discussion r3496279299): mapping all spatial values to Point is wrong. The concrete spatial types map to their exact ClickHouse counterparts; the generic GEOMETRY maps to the umbrella Geometry type, accepting that the rarely used MULTIPOINT and GEOMETRYCOLLECTION subtypes are not representable and throw at read time.

PostgreSQL uses different (non-WKB) text representations for its built-in geometric types and EWKB for PostGIS, and currently has no geometric type mapping at all, so it is left for a follow-up.

Changelog category (leave one):

  • Improvement

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

The MySQL database engine, table engine and table function now map MySQL's spatial column types (LINESTRING, POLYGON, MULTILINESTRING, MULTIPOLYGON, and the generic GEOMETRY) to the corresponding ClickHouse geometric types instead of String. This is controlled by the new geometry flag of the mysql_datatypes_support_level setting, enabled by default. POINT is still always converted to Point. The generic GEOMETRY column maps to the umbrella Geometry type; reading a value whose subtype has no ClickHouse counterpart (MULTIPOINT, GEOMETRYCOLLECTION) throws an exception at read time.

alexey-milovidov and others added 2 commits June 30, 2026 13:55
Previously the MySQL integrations (the `MySQL` database engine, table
engine and table function) mapped only `POINT` to the ClickHouse `Point`
type and converted every other spatial type (`LINESTRING`, `POLYGON`,
`MULTILINESTRING`, `MULTIPOLYGON`, generic `GEOMETRY`) to `String`,
exposing the raw WKB bytes.

This adds a new `geometry` flag to the `mysql_datatypes_support_level`
setting, enabled by default, that maps MySQL spatial types to the
corresponding ClickHouse geometric types:

- `LINESTRING`     -> `LineString`
- `POLYGON`        -> `Polygon`
- `MULTILINESTRING`-> `MultiLineString`
- `MULTIPOLYGON`   -> `MultiPolygon`
- `GEOMETRY`       -> `Geometry` (a `Variant` over all the geometric types,
                      used when the concrete subtype is not known from the
                      column metadata)

`POINT` keeps being converted to `Point` unconditionally, for backward
compatibility (it has been supported since version 24.1). `MULTIPOINT`
and `GEOMETRYCOLLECTION` have no ClickHouse counterpart and, as all
unknown types, fall back to `String`.

`MySQLSource` reads these values by stripping the 4-byte SRID prefix and
parsing the standard WKB payload with `parseWKBFormat`, then inserting
the result into the concrete geometric column or into the `Geometry`
`Variant` by discriminator.

This was raised in the review of
#107740 (discussion
r3496279299): the query-passing path could only report the generic
`MYSQL_TYPE_GEOMETRY` and would otherwise have to guess `Point`. Mapping
to `Geometry` removes that guess. PostgreSQL uses different (non-WKB)
representations for its built-in geometric types and EWKB for PostGIS, so
it is left for a follow-up.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 30, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [7e84273]

Summary:

job_name test_name status info comment
Stateless tests (amd_tsan, parallel, 2/2) FAIL
Server died FAIL cidb, issue
AST fuzzer (amd_debug) ERROR

AI Review

Summary

This PR adds default-on MySQL geometry mappings, a WKB read path for ClickHouse geometric types, and setting plumbing for schema inference. The table-backed paths are mostly wired up, but the current head still leaves one query-backed inference bug unresolved and introduces a new precedence bug for named collections in the table-engine path, so the advertised mysql_datatypes_support_level contract is not consistent across supported MySQL entrypoints.

Findings
  • ⚠️ Majors
    • [src/Storages/StorageMySQL.cpp:124-132, src/Storages/StorageMySQL.cpp:738-762, src/DataTypes/convertMySQLDataType.cpp:259-260] Query-backed MySQL schema inference still ignores the effective mysql_datatypes_support_level and still guesses every MYSQL_TYPE_GEOMETRY result as Point. A query-backed source such as mysql(..., query('SELECT ls FROM t')) therefore infers a MySQL LINESTRING as Point, then fails at read time with Only Point data type is supported; the same path still drops per-call, per-engine, and named-collection overrides for geometry, decimal, datetime64, and date2Date32. Suggested fix: thread type_support into doQueryResultStructure and stop mapping MYSQL_TYPE_GEOMETRY to Point when the concrete subtype is not knowable from MYSQL_FIELD.
    • [src/Storages/StorageMySQL.cpp:460-468] The new query-context bridge now gives the session SET higher precedence than mysql_datatypes_support_level coming from a named collection. getConfiguration loads named-collection settings first, then loadFromQueryContext overwrites them, and loadFromQuery(*args.storage_def) only restores explicit SETTINGS, not named-collection values. A table created from a named collection can therefore silently ignore its own geometry opt-out and persist the session value into metadata. Suggested fix: apply the query-context default before loading named-collection values, or make loadFromQueryContext skip settings that were already supplied by the engine arguments.
Final Verdict

Status: ⚠️ Request changes

Minimum required actions:

  • Fix the query-backed inference path so it uses the effective mysql_datatypes_support_level and does not force MYSQL_TYPE_GEOMETRY to Point.
  • Fix the new named-collection precedence bug in the table-engine registration path.

@clickhouse-gh clickhouse-gh Bot added the pr-improvement Pull request with some product improvements label Jun 30, 2026
Comment thread src/DataTypes/convertMySQLDataType.cpp
A MySQL column declared with the generic `GEOMETRY` type can store values
of any subtype, including `MULTIPOINT` and `GEOMETRYCOLLECTION`, which have
no ClickHouse counterpart. `parseWKBFormat` only accepts the WKB type ids
`1`, `2`, `3`, `5`, `6`, and the umbrella `Geometry` type is a `Variant`
over `Point`, `LineString`, `Polygon`, `MultiLineString`, `MultiPolygon`,
`Ring` only - it cannot represent those subtypes. So mapping a generic
`GEOMETRY` column to `Geometry` (which was enabled by default) made
previously-readable rows throw `BAD_ARGUMENTS` at `SELECT` time, and
contradicted this PR's own stated intent that `MULTIPOINT` and
`GEOMETRYCOLLECTION` fall back to `String`.

Keep the generic `GEOMETRY` column type mapped to `String` (its raw WKB,
the previous behavior). The concrete spatial column types (`LINESTRING`,
`POLYGON`, `MULTILINESTRING`, `MULTIPOLYGON`) are still mapped to the
corresponding ClickHouse geometric types under the `geometry` flag,
because MySQL enforces a single subtype per such column, so the read can
never encounter an unrepresentable value. A user who knows a generic
`GEOMETRY` column only holds representable subtypes can still read it as
`Geometry` by declaring the column type explicitly (e.g. via the MySQL
table engine); that read path is unchanged.

The integration test now asserts the generic `geometry` column (including
a row holding a `MULTIPOINT`) is read as `String` without error, which is
the regression the review asked to cover.

Addresses the review finding on src/DataTypes/convertMySQLDataType.cpp.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Status update (continue-pr): addressed the blocking AI Review finding.

The AI Review requested changes on the generic GEOMETRY mapping (one unresolved thread, now resolved). I confirmed it is a real, default-on regression: a MySQL column declared with the generic GEOMETRY type can hold values of any subtype, including MULTIPOINT (WKB id 4) and GEOMETRYCOLLECTION (id 7); parseWKBFormat only accepts ids 1/2/3/5/6 and the Geometry Variant has no alternative for those, so mapping it to Geometry made previously-readable rows throw BAD_ARGUMENTS at SELECT time — and it contradicted this PR's own stated "fall back to String" intent.

Fix (f515b85):

  • Generic GEOMETRY column type now stays mapped to String (raw WKB — the previous, safe behavior).
  • The concrete spatial types (LINESTRING/POLYGON/MULTILINESTRING/MULTIPOLYGON) are still mapped to ClickHouse geometric types under the geometry flag — these are safe because MySQL enforces one subtype per such column.
  • The explicit-Geometry-column read path (e.g. MySQL table engine) is unchanged, so a user who knows the data is representable can still opt in.
  • Updated docs, the setting description, the settings-changes-history note, and the integration test, which now adds a generic geometry column holding a MULTIPOINT and asserts it reads as String without error (the requested regression test).
  • Verified convertMySQLDataType.cpp compiles (-fsyntax-only, -Werror); Python ruff clean.

This reverses one design element of the PR (generic GEOMETRYGeometry), so flagging it for your call — it is an isolated, easily-reverted commit if you prefer to keep the umbrella mapping and instead document the unsupported-subtype limitation.

The only CI failure was 03360_bool_remote (Unexpected packet from server (expected TablesStatusResponse, got ProfileInfo)) — the known distributed TablesStatus desync flake being fixed by #108854, unrelated to this change. The push triggered a fresh CI run on f515b851b34.

Comment thread tests/integration/test_storage_mysql/test.py
alexey-milovidov and others added 2 commits June 30, 2026 20:48
This addresses the review finding on the MySQL table engine and fixes the
`test_mysql_database_engine/test.py::test_backup_database` integration test.

Two related problems, both rooted in the same omission: when this PR added
`geometry` to the default of the `mysql_datatypes_support_level` server
setting, it left the engine-level `MySQLSettings` default unchanged at
`decimal,datetime64,date2Date32`.

1. The engine and server defaults no longer matched, so `loadFromQueryContext`
   considered the value "changed" for every freshly created MySQL database (and
   table) engine, persisting a redundant
   `SETTINGS mysql_datatypes_support_level = 'decimal,datetime64,date2Date32,geometry'`
   clause into the `CREATE` statement. That is what broke `test_backup_database`,
   which asserts the round-tripped `SHOW CREATE DATABASE` has no `SETTINGS`
   clause. Fixed by giving `MySQLSettings` the same default as the server
   setting, so the default case persists nothing again.

2. `StorageMySQL::getTableStructureFromData` read the type-mapping level from the
   global query context instead of the engine settings, so an explicit
   per-engine opt-out such as
   `CREATE TABLE t ENGINE = MySQL(...) SETTINGS mysql_datatypes_support_level = 'decimal,datetime64,date2Date32'`
   (with omitted columns) was silently ignored: schema inference still mapped
   `LINESTRING` to `LineString`, so the documented "disable `geometry` to restore
   `String`" behavior was false for the persistent table engine.

   Fixed by threading the effective type-support level into
   `getTableStructureFromData`. The table engine now passes its own engine
   setting; for that value to also reflect a query-context `SET`, the table
   engine bridges the context value into the engine settings via
   `loadFromQueryContext`, the same way the MySQL database engine already does.
   The table function keeps reading the query context, since it has no
   persistent engine settings.

The integration test now also covers the per-engine opt-out: a schema-inferred
`MySQL` table engine with `SETTINGS mysql_datatypes_support_level` disabling
`geometry` reads `LINESTRING` as `String`, and the default-on case reads it as
`LineString`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Status update (continue-pr): fixed the failing integration test and addressed the blocking AI Review finding. Pushed f515b851b34..7ef7d7ac244 (one fix commit + a fresh merge of master).

Both problems shared one root cause: this PR added geometry to the default of the mysql_datatypes_support_level server setting but left the engine-level MySQLSettings default at decimal,datetime64,date2Date32.

  1. CI failure test_mysql_database_engine/test.py::test_backup_database — with the defaults mismatched, MySQLSettings::loadFromQueryContext considered the value "changed" for every freshly created MySQL database engine and persisted a redundant SETTINGS mysql_datatypes_support_level = 'decimal,datetime64,date2Date32,geometry' clause into the CREATE statement, which the test asserts is absent. Fixed by giving MySQLSettings the same default as the server setting, so the default case persists nothing again.

  2. AI Review ⚠️ Major (table-engine opt-out)StorageMySQL::getTableStructureFromData read the type-mapping level from the global query context instead of the engine settings, so CREATE TABLE t ENGINE = MySQL(...) SETTINGS mysql_datatypes_support_level = 'decimal,datetime64,date2Date32' (with omitted columns) was silently ignored — schema inference still mapped LINESTRING to LineString. Fixed by threading the effective type-support level into getTableStructureFromData: the table engine passes its own engine setting and bridges the query-context value into it via loadFromQueryContext (mirroring the MySQL database engine); the table function keeps reading the query context.

Added a regression test covering the per-engine opt-out (LINESTRING reads as String when geometry is disabled via table-engine SETTINGS; reads as LineString by default).

Design note for your call: item 2 mirrors the database engine by calling loadFromQueryContext in registerStorageMySQL, which also freezes a non-default mysql_datatypes_support_level into the table's CREATE (only when it differs from the default — the default case persists nothing). This makes the table engine consistent with the database engine and also makes a query-context SET take effect for table-engine schema inference (previously the table engine read the global context and ignored session SET). If you prefer the table engine to stay strictly per-SETTINGS (no SET bridging / no freezing), I can drop the loadFromQueryContext line and keep only the getTableStructureFromData threading — that still fixes the AI finding.

Verified: -fsyntax-only -Werror on all changed/related MySQL translation units, ruff clean on the test. 02995_new_settings_history is unaffected (the server-default change is recorded in SettingsChangesHistory.cpp under 26.7; the engine MySQLSettings default has no history mechanism). Fresh CI is running on 7ef7d7ac244.

Comment thread src/TableFunctions/TableFunctionMySQL.cpp Outdated
The earlier commit that threaded the type-mapping level into
`StorageMySQL::getTableStructureFromData` and the `StorageMySQL`
constructor changed the constructor's fallback schema inference to read
the engine `MySQLSettings` instead of the query context. The `mysql(...)`
table function creates its `StorageMySQL` with a default `MySQLSettings{}`,
so when it inferred the structure itself the default-on `geometry` (and
`decimal`/`datetime64`) flags were always applied, silently ignoring both
the per-call `SETTINGS` and the query-level `mysql_datatypes_support_level`.

This made
`SELECT toTypeName(ls) FROM mysql(...) SETTINGS mysql_datatypes_support_level = 'decimal,datetime64,date2Date32'`
infer a MySQL `LINESTRING` as `LineString` instead of `String`, and
regressed the existing `test_mysql_types` table-function assertions (e.g.
a MySQL `DATETIME(6)` read back as `DateTime64(6)` instead of `DateTime`).

`TableFunctionMySQL::parseArguments` now computes the effective
`mysql_datatypes_support_level` (the query-context value, overridden by a
function-local `SETTINGS` clause or named collection) and both
`getActualTableStructure` and `executeImpl` use it, so the table function
honors the configured level regardless of which path performs the
inference. This also addresses the review point that the per-call
`SETTINGS` opt-out was silently ignored.

Adds regression tests for the per-call `SETTINGS` opt-out and for the
per-call override of a query-level value.

CI report: #108944

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Status update (continue-pr): fixed the integration-test regression and the blocking AI Review finding. Pushed 7ef7d7ac244..6d70f47544d.

Root cause (one bug behind all the failures). The commit that threaded the type-mapping level into StorageMySQL::getTableStructureFromData and the StorageMySQL constructor changed the constructor's fallback schema inference to read the engine MySQLSettings instead of the query context. The mysql(...) table function creates its StorageMySQL with a default MySQLSettings{}, so when it inferred the structure itself the default-on geometry (and decimal/datetime64) flags were always applied — silently ignoring both the per-call SETTINGS and the query-level mysql_datatypes_support_level.

This broke exactly the table-function code path (getActualTableStructure/executeImpl), which is why every failure was at the mysql(...) assertion and the database/table-engine paths stayed green:

  • test_storage_mysql/test.py::test_mysql_geometry (new): SELECT toTypeName(ls) FROM mysql(...) SETTINGS mysql_datatypes_support_level = 'decimal,datetime64,date2Date32' returned LineString instead of String (confirmed in the server log — inference ran with the default level).
  • test_mysql_database_engine/test_mysql57_database_engine ::test_mysql_types[...] (datetime_6_2/3, decimal_18_6_1/2, common_types_21/23): the table function sub-assertion (line 939/961) regressed, e.g. a MySQL DATETIME(6) read back as DateTime64(6) instead of DateTime. I verified via play.clickhouse.com that these type-mismatch AssertionErrors appear only on this PR — the same tests' failures on other PRs over the last 40 days are infra/cluster crashes, a different failure mode.

Fix. TableFunctionMySQL::parseArguments now computes the effective mysql_datatypes_support_level — the query-context value, overridden by a function-local SETTINGS clause or named collection — and both getActualTableStructure and executeImpl use it, so the table function honors the configured level regardless of which path performs the inference. This is also the AI Review's Minimum required action (the per-call SETTINGS opt-out was silently ignored); resolved the corresponding thread.

Tests. Added regression assertions to test_mysql_geometry for mysql(..., SETTINGS mysql_datatypes_support_level = '...') honoring the per-call opt-out, and for the per-call SETTINGS overriding a query-level value.

Verification. convertMySQLDataType.cpp already gates the mapping on type_support.isSet(GEOMETRY), so it was always receiving the wrong level — not mis-mapping. The changed translation unit compiles clean with the full production flags (-Weverything, ARM); ruff clean on the test. I could not run the integration tests locally (this build dir's mariadb-connector-c submodule is unpopulated, so CMake cannot reconfigure), so the fresh CI run on 6d70f47544d is the end-to-end check.

Comment thread src/Processors/Sources/MySQLSource.cpp Outdated
Addresses the AI Review "Request changes" finding on
`src/Processors/Sources/MySQLSource.cpp`: the new MySQL spatial read path
called `parseWKBFormat` with the default `0`, so a remote `LINESTRING`,
`POLYGON` or `MULTI*` value used only the hard-coded `100'000'000` element
cap instead of the configurable `max_wkb_geometry_elements` query setting
(which defaults to `1'000'000` specifically to bound WKB point/ring/polygon
counts before the parser reserves memory). `external_storage_max_read_bytes`
is checked only after a row has been parsed, so it could not prevent the
oversized allocation path.

Carry the (clamped) `max_wkb_geometry_elements` value in `StreamSettings`,
built from the query settings exactly as `readWkb` does, and thread it
through `insertValue`/`insertGeometryValue` into `parseWKBFormat`.

Added a focused regression test to `test_mysql_geometry` that reads a MySQL
`LINESTRING` of three points with `max_wkb_geometry_elements = 1` and asserts
the read is rejected with `TOO_LARGE_ARRAY_SIZE`, and reads back normally
with a sufficiently large limit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Status update (continue-pr): addressed the blocking AI Review finding. Pushed 6d70f47544d..3410a1d35a8.

The AI Review requested changes because the new MySQL geometry read path called parseWKBFormat with the default 0, so a remote LINESTRING/POLYGON/MULTI* value used only the hard-coded 100'000'000 element cap instead of the configurable max_wkb_geometry_elements query setting (which defaults to 1'000'000 specifically to bound WKB point/ring/polygon counts before the parser reserves memory). external_storage_max_read_bytes is checked only after a row has been parsed, so it could not prevent the oversized allocation path.

Fix (3410a1d):

  • StreamSettings now carries the effective max_wkb_geometry_elements, computed from the query settings exactly as readWkb does (static_cast<UInt32>(std::min<UInt64>(settings[max_wkb_geometry_elements], MAX_WKB_GEOMETRY_ELEMENTS_HARD_LIMIT))). It is built from context_->getSettingsRef() in StorageMySQL::read, so a per-query SETTINGS max_wkb_geometry_elements = ... reaches the reader.
  • Threaded that value through insertValue/insertGeometryValue into parseWKBFormat.
  • Added a focused regression test to test_mysql_geometry: reading a MySQL LINESTRING of three points with max_wkb_geometry_elements = 1 is now rejected with TOO_LARGE_ARRAY_SIZE (before the fix it parsed the value), and reads back normally with a sufficiently large limit.

Note: the Point path parses the two coordinates directly (no element count), so it is unaffected. The other parseWKBFormat callers (Parquet, Arrow) still pass the default and are left out of scope for this MySQL PR.

Verification. MySQLSource.cpp compiles clean under the full production flags (-fsyntax-only, ARM, -Weverything/-Werror); ruff check clean on the test. The push triggered a fresh CI run on 3410a1d35a8.

…-types

# Conflicts:
#	src/Storages/StorageMySQL.cpp
#	src/Storages/StorageMySQL.h
#	src/TableFunctions/TableFunctionMySQL.cpp
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Status update (continue-pr): merged master, resolved a 3-file conflict. Pushed 3410a1d35a8..11b59a1a5b0.

The PR had gone CONFLICTING/DIRTY against master because master landed the TableNameOrQuery refactor and the new query-backed structure inference (doQueryResultStructure), which touched the same three files this PR modifies. GitHub now reports MERGEABLE again.

Conflicts resolved by combining both sides (no logic dropped):

  • StorageMySQL.h / StorageMySQL.cppgetTableStructureFromData now takes master's const TableNameOrQuery & table_or_query and this PR's MultiEnum<MySQLDataTypesSupport> type_support. The constructor's fallback inference passes the renamed member remote_table_or_query with (*mysql_settings)[mysql_datatypes_support_level].
  • TableFunctionMySQL.cppgetActualTableStructure uses configuration->table_or_query with the effective type_support; executeImpl keeps both master's query-backed-insert rejection and this PR's effective-MySQLSettings construction.

Verified: both changed TUs pass -fsyntax-only with the full production flags (declaration/definition/call-sites all consistent, 5-arg getTableStructureFromData, fetchTablesColumnsList still takes type_support); all geometry integration tests survived the test.py merge; 0 unresolved review threads; AI Review was ✅ Approve on the prior head and this merge changes no logic.

One design note for your call (left as master's behavior, not changed). master's new doQueryResultStructure (the schema-inference path taken when a MySQL source is defined over a query rather than a table name) maps types using the query-context mysql_datatypes_support_level, not the effective per-engine/per-call value threaded in by this PR. So for a query-backed MySQL storage or mysql(...) call, a per-engine/per-call SETTINGS mysql_datatypes_support_level = '...' geometry opt-out would not reach that inference. No test covers this intersection and it is master's unrelated new feature, so I did not speculatively change it. If you want the opt-out honored there too, threading type_support into doQueryResultStructure is a one-line follow-up.

const TableNameOrQuery & table_or_query,
const ContextPtr & context_)
const ContextPtr & context_,
MultiEnum<MySQLDataTypesSupport> type_support)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Query-backed MySQL schema inference still drops the effective mysql_datatypes_support_level.

This new type_support parameter is only used by the table-name path. When table_or_query.isQuery is true, doQueryResultStructure re-reads context_->getSettingsRef, and the MYSQL_FIELD converter still maps any MYSQL_TYPE_GEOMETRY result to Point. As a result, a call such as mysql(..., query('SELECT ls FROM test_mysql_geometry'), SETTINGS mysql_datatypes_support_level = 'decimal,datetime64,date2Date32') ignores the function-local opt-out; with the default setting, selecting a MySQL LINESTRING through the query-backed form is inferred as Point, then the reader takes the vtPoint path and throws Only Point data type is supported for WKB type 2. The same dropped value affects decimal, datetime64, and date2Date32 overrides for query-backed table engines and table functions.

Please thread type_support into doQueryResultStructure and use it in the MYSQL_FIELD conversion. For MYSQL_TYPE_GEOMETRY query-result metadata, avoid guessing Point for non-Point spatial values; infer String unless the subtype can be proven or declared explicitly. A focused regression test with query('SELECT ls FROM ...') and a per-call mysql_datatypes_support_level override would cover this path.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Status update (continue-pr): merged master, resolved the CONFLICTING/DIRTY state. Pushed 11b59a1a5b0..8262da8fa51.

GitHub reported the PR as CONFLICTING (branch was 71 commits behind, merge-base at 07:59Z today), but the actual git merge origin/master was clean — the status was stale. The merge integrated only master's additive changes into the three high-churn settings files (Settings.cpp, SettingsEnums.cpp, SettingsChangesHistory.cpp); the MySQL translation units (StorageMySQL.cpp, MySQLSource.cpp, convertMySQLDataType.cpp, TableFunctionMySQL.cpp) are byte-identical to the already-CI-tested head 11b59a1a5b0.

Verified:

  • master did not independently change the mysql_datatypes_support_level default (it is still decimal,datetime64,date2Date32 there), so the merge correctly kept this PR's decimal,datetime64,date2Date32,geometry value — no silent revert.
  • The SettingsChangesHistory.cpp entry stayed under the 26.7 block; geometry enum and max_wkb_geometry_elements declarations intact; no conflict markers, no manual edits.

GitHub now reports MERGEABLE. Fresh CI is running on 8262da8fa51.

On the one remaining unresolved thread (StorageMySQL.cpp — query-backed doQueryResultStructure dropping the effective mysql_datatypes_support_level): left unchanged, per your prior call. This is master's pre-existing behavior, not a regression introduced here — the MYSQL_FIELD overload of convertMySQLDataType maps MYSQL_TYPE_GEOMETRY to Point identically on master and this branch, and this PR only adds the concrete-type mapping (linestringLineString, etc.) to the string-typename overload used by table-name inference. Threading type_support into doQueryResultStructure is the one-line follow-up you noted; it is an enhancement to master's orthogonal query-backed-inference feature, so I did not make that scope decision on your behalf.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Status update (continue-pr): the two red CI checks are unrelated master-wide flakes; no code change.

Head 8262da8f is unchanged from the last merge. The fresh CI run has two failures, both confirmed pre-existing flakes with open tracking issues and zero MySQL frames in their stacks:

  • Stress test (amd_msan)Hung check failed, possible deadlock foundHung check failed, possible deadlock found #107941. Master-wide: 62 occurrences across 39 PRs today and 528 across 180 PRs yesterday (test_status IN ('FAIL','ERROR') on the checks table). The stack is an analyzer/AST-fuzzer deadlock with no MySQL storage involved.
  • AST fuzzer (arm_asan_ubsan)Logical error: Block structure mismatch in A stream: different columns: (STID: 0993-27f0)Logical error: Block structure mismatch in A stream: different columns: (STID: 0993-27f0) #108142 (exact STID match; the CI summary already links it). The failing query is a random fuzzer query hitting UnionStep::updateOutputHeader; the same signature appears on master and ~14 other unrelated PRs today alone.

Neither is reachable from a MySQL type-mapping change (the AST fuzzer never creates a MySQL storage, and the geometry code is on neither stack).

The single remaining unresolved review thread is the query-backed doQueryResultStructure inference (StorageMySQL.cpp:129). This is pre-existing master behavior — the query path and the MYSQL_FIELD GEOMETRYPoint mapping are byte-identical to master; this PR's convertMySQLDataType.cpp diff only touches the table-name overload, so the PR strictly improves that path and neither introduces nor worsens the query path. Threading type_support into doQueryResultStructure is reserved as a follow-up (thread dismissed in r3504570103).

So the PR is effectively green; it remains BLOCKED only on human approval and the still-running CH Inc sync/PR gates.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

The AI Review requested changes on the generic GEOMETRY mapping (one unresolved thread, now resolved). I confirmed it is a real, default-on regression: a MySQL column declared with the generic GEOMETRY type can hold values of any subtype, including MULTIPOINT (WKB id 4) and GEOMETRYCOLLECTION (id 7); parseWKBFormat only accepts ids 1/2/3/5/6 and the Geometry Variant has no alternative for those, so mapping it to Geometry made previously-readable rows throw BAD_ARGUMENTS at SELECT time — and it contradicted this PR's own stated "fall back to String" intent.

The AI review was right, but we made a change in the wrong direction.
Here is what I want:

Map GEOMETRY to Geometry, and accept an incompatibility - that we can throw an exception if the data in MySQL contains rarely used MULTIPOINT or GEOMETRYCOLLECTION.

Per maintainer request, map a MySQL column declared with the generic
`GEOMETRY` type to the ClickHouse umbrella `Geometry` type (a `Variant`
over the concrete geometric types) instead of falling back to `String`.

A generic `GEOMETRY` column can hold a value of any subtype, including ones
that have no ClickHouse counterpart (`MULTIPOINT`, `GEOMETRYCOLLECTION`).
Reading such a value now throws an exception at read time (`parseWKBFormat`
rejects the WKB type). This incompatibility is intentionally accepted in
exchange for exposing the column as a proper geometric type: values of the
representable subtypes (`Point`, `LineString`, `Polygon`, `MultiLineString`,
`MultiPolygon`) are read through the matching `Geometry` variant.

The read path already handles the umbrella `Geometry` type: it is mapped to
`vtGeometry` in `ExternalResultDescription`, and `insertGeometryValue`
routes the parsed object to the matching variant discriminator. Only the
schema inference in the string-based `convertMySQLDataType` overload (the
table-name path) needed the mapping; columns declared as `MULTIPOINT` or
`GEOMETRYCOLLECTION` still fall back to `String` (the raw WKB).

Updated the `test_mysql_geometry` integration test: the generic `geometry`
columns are now inferred as `Geometry`; a `geometry` column holding a
`LINESTRING` reads back as a geometric value, and one holding a `MULTIPOINT`
is rejected with `Incorrect geometry type`. Updated the setting description,
the `MySQLDataTypesSupport` enum comment, and the MySQL engine docs.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Status update (continue-pr): mapped the generic MySQL GEOMETRY type to the umbrella Geometry type, per your request. Pushed 8262da8fa51..5873568d1e7.

Reverting the earlier "wrong direction" change: the string-based (table-name) convertMySQLDataType overload now maps a MySQL column declared with the generic GEOMETRY type to the ClickHouse umbrella Geometry type (a Variant over the concrete geometric types), gated on the geometry flag, instead of falling back to String. As you asked, the incompatibility is accepted: reading a value whose subtype has no ClickHouse counterpart (MULTIPOINT, GEOMETRYCOLLECTION) throws at read time (parseWKBFormat rejects the WKB type). Columns declared as MULTIPOINT/GEOMETRYCOLLECTION themselves still fall back to String.

The read side already supports this end-to-end: Geometry maps to vtGeometry in ExternalResultDescription, and insertGeometryValue routes each parsed WKB object to the matching Variant discriminator. Only the schema inference needed to change.

Updated test_mysql_geometry (the generic geometry columns now infer as Geometry; a geometry column holding a LINESTRING reads back as [(5,5),(6,6)], and one holding a MULTIPOINT is rejected with Incorrect geometry type), the setting description, the MySQLDataTypesSupport enum comment, and the MySQL engine docs. Verified -fsyntax-only (full ARM production flags) on convertMySQLDataType.cpp and ruff/py_compile on the test.

One decision left for you — the query-backed path (the remaining AI thread on src/Storages/StorageMySQL.cpp:129). The above applies to the table-name path only. For a query-backed source (mysql(..., query('SELECT ...'))), doQueryResultStructure uses the MYSQL_FIELD overload, where MySQL reports every spatial column as MYSQL_TYPE_GEOMETRY (it can't distinguish POINT from LINESTRING), and that overload still maps it to Point. Applying the same principle there (map MYSQL_TYPE_GEOMETRYGeometry) would fix the current LINESTRING-via-query throw and only MULTIPOINT/GEOMETRYCOLLECTION would throw — but it would also change the surface type of a POINT column read via the query form from Point to Geometry. That AI thread also notes a separate point: the effective per-call/per-engine mysql_datatypes_support_level is not threaded into doQueryResultStructure (it re-reads the query context), which affects decimal/datetime64/date2Date32 too. Both are the follow-up you reserved; say the word and I'll thread type_support into doQueryResultStructure and map query-backed MYSQL_TYPE_GEOMETRYGeometry in the same way.

Comment thread src/Core/SettingsChangesHistory.cpp Outdated
The `26.7` `SettingsChangesHistory` entry for `mysql_datatypes_support_level`
still described the intermediate behavior where the generic `GEOMETRY` type fell
back to `String`. Since commit `5873568d1e7` the generic `GEOMETRY` column maps to
the umbrella `Geometry` type, and reading a value whose subtype has no ClickHouse
counterpart (`MULTIPOINT`, `GEOMETRYCOLLECTION`) throws at read time. Update the
reason string to match the final behavior, matching the setting description in
`Settings.cpp` and the `MySQLDataTypesSupport` enum comment.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Status update (continue-pr): addressed the stale SettingsChangesHistory note and refreshed the PR description. Pushed 5873568d1e7..7e8427366fc.

  • SettingsChangesHistory.cpp (AI thread, now resolved): the 26.7 entry for mysql_datatypes_support_level still said the generic GEOMETRY type "still maps to String", which was the intermediate behavior before 5873568d1e7. Updated the reason string to the final behavior (generic GEOMETRY → umbrella Geometry; MULTIPOINT/GEOMETRYCOLLECTION values throw at read time), matching the setting description in Settings.cpp and the MySQLDataTypesSupport enum comment.
  • PR description / changelog entry: the body table and changelog entry still described the old "generic GEOMETRY stays String" contract. Updated both to match the committed behavior so the CHANGELOG entry is accurate.

Still reserved for you — the query-backed path (remaining AI thread on src/Storages/StorageMySQL.cpp:129). Unchanged from the prior note: for mysql(..., query('SELECT ...')), doQueryResultStructure uses the MYSQL_FIELD overload, where MySQL reports every spatial column as MYSQL_TYPE_GEOMETRY and that overload still maps it to Point (so a LINESTRING read via the query form is inferred as Point and throws), and the effective per-call/per-engine mysql_datatypes_support_level is not threaded into doQueryResultStructure. Mapping query-backed MYSQL_TYPE_GEOMETRYGeometry would fix the LINESTRING throw but would also change a POINT column's surface type from Point to Geometry on that path. Say the word and I'll thread type_support into doQueryResultStructure and apply the same mapping.

CI is green so far (0 failures; coverage/stateless/integration jobs still running).

/// (and freeze it into the table definition) so that it is honored during schema inference,
/// the same way the MySQL database engine does. An explicit per-engine SETTINGS value, loaded
/// right after, takes precedence over it.
mysql_settings.loadFromQueryContext(args.getLocalContext(), *args.storage_def);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This new loadFromQueryContext call comes after getConfiguration, so on the named-collection path it overwrites mysql_datatypes_support_level that processNamedCollectionResult has already loaded. loadFromQuery(*args.storage_def) only replays explicit SETTINGS, not values that came from the named collection, so a table such as ENGINE = MySQL(my_nc, table='geom') created under a conflicting session SET mysql_datatypes_support_level = 'decimal,datetime64,date2Date32,geometry' now ignores my_nc's opt-out during schema inference and also persists the session value into SHOW CREATE.

The table-function path gives named collections higher precedence than the query context, which is the behavior users will expect here too. Please either apply the query-context bridge before loading named-collection values, or make loadFromQueryContext skip settings that were already supplied by the engine arguments, and add a regression test for a named collection conflicting with a session SET.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Status update (continue-pr): triaged the new AI review (2 Majors) and the sole CI red on 7e842736. No code pushed — both Majors are precedence/design decisions I'm leaving to you.

CI: the only failure is Stateless tests (amd_tsan, parallel, 2/2)Server died, which is the known master-wide flake #107487 (a TSan/fuzz crash, auto-linked by the CI comment). play confirms it fires overwhelmingly on the master side — 9/9 failures today, 27/28 yesterday, across many unrelated PRs — and there are no MySQL/geometry frames, so it is unrelated to this type-mapping change. Not re-running it yet, since the merge is blocked on the decisions below and your next push will trigger fresh CI anyway.

AI Major #1 — query-backed path (src/Storages/StorageMySQL.cpp:129, doQueryResultStructure + the MYSQL_FIELD convertMySQLDataType overload): unchanged from the prior notes and still reserved for you. For mysql(..., query('SELECT ...')), MySQL reports every spatial column as MYSQL_TYPE_GEOMETRY (it cannot distinguish POINT from LINESTRING), and that overload still maps it to Point; type_support is not threaded into doQueryResultStructure. Mapping it → Geometry would fix the LINESTRING-via-query throw but change a POINT column's query-form surface type. Say the word and I'll thread type_support in and apply the mapping.

AI Major #2 — named-collection precedence (src/Storages/StorageMySQL.cpp:460-468, new): the AI is technically right that on the named-collection path, getConfiguration (line 460) loads the named-collection value first, then loadFromQueryContext (line 466) overwrites it with the session mysql_datatypes_support_level, and loadFromQuery (line 468) only replays explicit SETTINGS — so a table created from a named collection can have its own geometry opt-out overridden by the session default. But this is not a new behavior invented here — it faithfully mirrors the pre-existing DatabaseMySQL convention (src/Databases/MySQL/DatabaseMySQL.cpp on master: processNamedCollectionResult at 611 → loadFromQueryContext at 640 → loadFromQuery at 642, identical ordering; that file is untouched by this PR). So making the storage engine let the named collection win would make it disagree with the database engine on named-collection precedence — a different cross-entrypoint inconsistency — and a consistent fix would have to change DatabaseMySQL too, which is out of scope for a geometry PR. This is the same loadFromQueryContext bridge I flagged earlier for your call (offered to drop it and keep only the type_support threading). Two clean options: (a) drop loadFromQueryContext from the storage engine so table-engine schema inference honors only explicit SETTINGS/named-collection values (simplest, removes the finding, but loses the session-SET bridging you asked to mirror from the DB engine); or (b) make both engines skip loadFromQueryContext when the value was already supplied by the named collection. Say which and I'll implement it.

Both review threads left unresolved pending your decision.

@clickhouse-gh

clickhouse-gh Bot commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.50% 85.50% +0.00%
Functions 92.60% 92.60% +0.00%
Branches 77.70% 77.70% +0.00%

Changed lines: Changed C/C++ lines covered: 140/151 (92.72%) · Uncovered code

Full report · Diff report

@alexey-milovidov

Copy link
Copy Markdown
Member Author

Status update (continue-pr): completed the CI triage on 7e842736 — both reds are unrelated chronic flakes. No code pushed; the two AI Majors remain the precedence/design decisions left to you.

My previous triage (06:54Z) only covered the amd_tsan red, because the second failure was still running at that time. Both reds are now in and both are known master-wide flakes:

  • AST fuzzer (amd_debug)MEMORY_LIMIT_EXCEEDED (server RSS reached ~58 GiB against the 46.32 GiB limit after the full 60-minute fuzz run; the watchdog gdb got No stack — the server was stuck, not crashed — and the job was then canceled). This is unrelated to this PR: the AST fuzzer runs generated SQL against a local server with no MySQL connection, so MySQLSource/convertMySQLDataType/the mysql(...) path (the only code this PR touches) is never exercised — a fuzzer mysql(...) query fails to connect, it cannot allocate 58 GiB. The fuzzer log contains no geometry/WKB/MySQL frames. play confirms AST fuzzer (amd_debug) fails on master itself (pull_request_number = 0) on 7 of the last 11 days and across dozens of unrelated PRs daily (24/163 today, 42/357 yesterday, 56/446 on 06-30).
  • Stateless tests (amd_tsan, parallel, 2/2)Server died — the known TSan/fuzz flake #107487 (OPEN), firing today on unrelated PRs (101841, 100185) as well.

Mergeable Check / PR are derived from those two. The PR is MERGEABLE (no conflicts, ~1 day behind master), so no re-merge is needed, and I did not re-run the flakes since your next push will trigger fresh CI anyway.

Both AI Majors remain reserved for you (unchanged from the 06:54Z note):

  • Repair subtree2 #1 query-backed path (doQueryResultStructure + the MYSQL_FIELD convertMySQLDataType overload): threading type_support in and mapping query-backed MYSQL_TYPE_GEOMETRYGeometry would fix the LINESTRING-via-query('…') throw, but would change a POINT column's query-form surface type from Point to Geometry (MySQL reports every spatial column as MYSQL_TYPE_GEOMETRY on that path). Say the word and I'll implement it.
  • Repair subtree2 #2 named-collection precedence (StorageMySQL.cpp:460-468): confirmed the ordering getConfigurationloadFromQueryContextloadFromQuery lets a session SET override a named-collection mysql_datatypes_support_level; this faithfully mirrors the pre-existing DatabaseMySQL convention, so a consistent fix would also touch DatabaseMySQL (out of scope). Two options as before: (a) drop loadFromQueryContext from the storage engine, or (b) make both engines skip settings already supplied by the named collection. Say which and I'll implement it.

The PR is blocked only on these two decisions (and human merge).

alexey-milovidov and others added 2 commits July 3, 2026 03:11
Addresses the AI Review "Request changes" verdict on
#108944 (first required
action): query-backed `MySQL` schema inference ignored the effective
`mysql_datatypes_support_level` and always guessed every
`MYSQL_TYPE_GEOMETRY` result as `Point`.

A MySQL source defined over a query rather than a table name
(`mysql(..., query('SELECT ls FROM t'))`, or the `query('...')`/subquery
forms of the `MySQL` table engine/function) infers its columns via
`doQueryResultStructure`, which read the type-mapping level straight from
the global query context. So a per-call, per-engine, or named-collection
override for `geometry`, `decimal`, `datetime64` or `date2Date32` was
silently dropped on this path, while it was honored on the table-name path.

`doQueryResultStructure` now receives the effective
`MultiEnum<MySQLDataTypesSupport>` (`type_support`) threaded down from
`StorageMySQL::getTableStructureFromData`, the same value the table-name
path already uses, so the query-backed path honors the configured level.

In addition, the `MYSQL_FIELD` overload of `convertMySQLDataType` no longer
maps `MYSQL_TYPE_GEOMETRY` to `Point`. A query result set reports
`MYSQL_TYPE_GEOMETRY` for a value of any spatial subtype (`POINT`,
`LINESTRING`, `POLYGON`, ...) without exposing which one it is, so the
concrete geometric type cannot be inferred here. Guessing `Point` made
reading any non-`Point` value throw `Only Point data type is supported` at
read time; the value now falls back to the raw WKB `String` instead. The
concrete-type mapping (`linestring` -> `LineString`, etc.) is applied only
where MySQL exposes the concrete column type - the table-name path, via the
string overload - so a user who knows the subtype can still declare the
column type explicitly.

Note: this changes the pre-existing behavior of reading a query-backed
`POINT` column, which previously inferred `Point` and now infers `String`
(raw WKB). This is an isolated, easily reverted change if the umbrella
`Point` guess is preferred.

Added a regression test to `test_mysql_geometry`: a query-backed
`LINESTRING` is inferred as `String` (not `Point`) and reads back without a
read-time exception.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mapping

Addresses the AI Review "Request changes" verdict on
#108944 (second required
action): the new query-context bridge in `registerStorageMySQL` gave a
session `SET mysql_datatypes_support_level` higher precedence than the value
loaded from a named collection.

`registerStorageMySQL` (and the pre-existing `DatabaseMySQL` registration)
load named-collection settings first, then call
`MySQLSettings::loadFromQueryContext` to bridge the query-context value into
the engine settings, and only afterwards replay the explicit `SETTINGS` via
`loadFromQuery`. Because `loadFromQueryContext` overwrote the value whenever
the session value differed, a table such as
`ENGINE = MySQL(my_nc, table='geom')` created under a conflicting session
`SET mysql_datatypes_support_level = '...'` ignored `my_nc`'s opt-out during
schema inference and also persisted the session value into `SHOW CREATE`.

`loadFromQueryContext` now returns early when the value was already supplied
explicitly for this engine instance (`impl->isChanged(...)`, set by a named
collection or the engine's own `SETTINGS`). The query-context bridge only
fills in the value when the engine did not set it, so the intended
precedence holds: engine `SETTINGS` > named collection > session `SET` >
default. The default case (no explicit value, session equals the default)
still persists nothing.

Added a regression test to `test_mysql_geometry` using the new
`mysql_no_geo_types` named collection (which disables `geometry`): under a
session that enables `geometry`, a `LINESTRING` column is still inferred as
`String`, and `SHOW CREATE` does not freeze the session value.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-improvement Pull request with some product improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant