Map MySQL geometry types to ClickHouse geometric types#108944
Map MySQL geometry types to ClickHouse geometric types#108944alexey-milovidov wants to merge 13 commits into
Conversation
Previously the MySQL integrations (the `MySQL` database engine, table
engine and table function) mapped only `POINT` to the ClickHouse `Point`
type and converted every other spatial type (`LINESTRING`, `POLYGON`,
`MULTILINESTRING`, `MULTIPOLYGON`, generic `GEOMETRY`) to `String`,
exposing the raw WKB bytes.
This adds a new `geometry` flag to the `mysql_datatypes_support_level`
setting, enabled by default, that maps MySQL spatial types to the
corresponding ClickHouse geometric types:
- `LINESTRING` -> `LineString`
- `POLYGON` -> `Polygon`
- `MULTILINESTRING`-> `MultiLineString`
- `MULTIPOLYGON` -> `MultiPolygon`
- `GEOMETRY` -> `Geometry` (a `Variant` over all the geometric types,
used when the concrete subtype is not known from the
column metadata)
`POINT` keeps being converted to `Point` unconditionally, for backward
compatibility (it has been supported since version 24.1). `MULTIPOINT`
and `GEOMETRYCOLLECTION` have no ClickHouse counterpart and, as all
unknown types, fall back to `String`.
`MySQLSource` reads these values by stripping the 4-byte SRID prefix and
parsing the standard WKB payload with `parseWKBFormat`, then inserting
the result into the concrete geometric column or into the `Geometry`
`Variant` by discriminator.
This was raised in the review of
#107740 (discussion
r3496279299): the query-passing path could only report the generic
`MYSQL_TYPE_GEOMETRY` and would otherwise have to guess `Point`. Mapping
to `Geometry` removes that guess. PostgreSQL uses different (non-WKB)
representations for its built-in geometric types and EWKB for PostGIS, so
it is left for a follow-up.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Workflow [PR], commit [7e84273] Summary: ❌
AI ReviewSummaryThis PR adds default-on MySQL geometry mappings, a WKB read path for ClickHouse geometric types, and setting plumbing for schema inference. The table-backed paths are mostly wired up, but the current head still leaves one query-backed inference bug unresolved and introduces a new precedence bug for named collections in the table-engine path, so the advertised Findings
Final VerdictStatus: Minimum required actions:
|
A MySQL column declared with the generic `GEOMETRY` type can store values of any subtype, including `MULTIPOINT` and `GEOMETRYCOLLECTION`, which have no ClickHouse counterpart. `parseWKBFormat` only accepts the WKB type ids `1`, `2`, `3`, `5`, `6`, and the umbrella `Geometry` type is a `Variant` over `Point`, `LineString`, `Polygon`, `MultiLineString`, `MultiPolygon`, `Ring` only - it cannot represent those subtypes. So mapping a generic `GEOMETRY` column to `Geometry` (which was enabled by default) made previously-readable rows throw `BAD_ARGUMENTS` at `SELECT` time, and contradicted this PR's own stated intent that `MULTIPOINT` and `GEOMETRYCOLLECTION` fall back to `String`. Keep the generic `GEOMETRY` column type mapped to `String` (its raw WKB, the previous behavior). The concrete spatial column types (`LINESTRING`, `POLYGON`, `MULTILINESTRING`, `MULTIPOLYGON`) are still mapped to the corresponding ClickHouse geometric types under the `geometry` flag, because MySQL enforces a single subtype per such column, so the read can never encounter an unrepresentable value. A user who knows a generic `GEOMETRY` column only holds representable subtypes can still read it as `Geometry` by declaring the column type explicitly (e.g. via the MySQL table engine); that read path is unchanged. The integration test now asserts the generic `geometry` column (including a row holding a `MULTIPOINT`) is read as `String` without error, which is the regression the review asked to cover. Addresses the review finding on src/DataTypes/convertMySQLDataType.cpp. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Status update (continue-pr): addressed the blocking AI Review finding. The AI Review requested changes on the generic Fix (f515b85):
This reverses one design element of the PR (generic The only CI failure was |
This addresses the review finding on the MySQL table engine and fixes the `test_mysql_database_engine/test.py::test_backup_database` integration test. Two related problems, both rooted in the same omission: when this PR added `geometry` to the default of the `mysql_datatypes_support_level` server setting, it left the engine-level `MySQLSettings` default unchanged at `decimal,datetime64,date2Date32`. 1. The engine and server defaults no longer matched, so `loadFromQueryContext` considered the value "changed" for every freshly created MySQL database (and table) engine, persisting a redundant `SETTINGS mysql_datatypes_support_level = 'decimal,datetime64,date2Date32,geometry'` clause into the `CREATE` statement. That is what broke `test_backup_database`, which asserts the round-tripped `SHOW CREATE DATABASE` has no `SETTINGS` clause. Fixed by giving `MySQLSettings` the same default as the server setting, so the default case persists nothing again. 2. `StorageMySQL::getTableStructureFromData` read the type-mapping level from the global query context instead of the engine settings, so an explicit per-engine opt-out such as `CREATE TABLE t ENGINE = MySQL(...) SETTINGS mysql_datatypes_support_level = 'decimal,datetime64,date2Date32'` (with omitted columns) was silently ignored: schema inference still mapped `LINESTRING` to `LineString`, so the documented "disable `geometry` to restore `String`" behavior was false for the persistent table engine. Fixed by threading the effective type-support level into `getTableStructureFromData`. The table engine now passes its own engine setting; for that value to also reflect a query-context `SET`, the table engine bridges the context value into the engine settings via `loadFromQueryContext`, the same way the MySQL database engine already does. The table function keeps reading the query context, since it has no persistent engine settings. The integration test now also covers the per-engine opt-out: a schema-inferred `MySQL` table engine with `SETTINGS mysql_datatypes_support_level` disabling `geometry` reads `LINESTRING` as `String`, and the default-on case reads it as `LineString`. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Status update (continue-pr): fixed the failing integration test and addressed the blocking AI Review finding. Pushed Both problems shared one root cause: this PR added
Added a regression test covering the per-engine opt-out ( Design note for your call: item 2 mirrors the database engine by calling Verified: |
The earlier commit that threaded the type-mapping level into
`StorageMySQL::getTableStructureFromData` and the `StorageMySQL`
constructor changed the constructor's fallback schema inference to read
the engine `MySQLSettings` instead of the query context. The `mysql(...)`
table function creates its `StorageMySQL` with a default `MySQLSettings{}`,
so when it inferred the structure itself the default-on `geometry` (and
`decimal`/`datetime64`) flags were always applied, silently ignoring both
the per-call `SETTINGS` and the query-level `mysql_datatypes_support_level`.
This made
`SELECT toTypeName(ls) FROM mysql(...) SETTINGS mysql_datatypes_support_level = 'decimal,datetime64,date2Date32'`
infer a MySQL `LINESTRING` as `LineString` instead of `String`, and
regressed the existing `test_mysql_types` table-function assertions (e.g.
a MySQL `DATETIME(6)` read back as `DateTime64(6)` instead of `DateTime`).
`TableFunctionMySQL::parseArguments` now computes the effective
`mysql_datatypes_support_level` (the query-context value, overridden by a
function-local `SETTINGS` clause or named collection) and both
`getActualTableStructure` and `executeImpl` use it, so the table function
honors the configured level regardless of which path performs the
inference. This also addresses the review point that the per-call
`SETTINGS` opt-out was silently ignored.
Adds regression tests for the per-call `SETTINGS` opt-out and for the
per-call override of a query-level value.
CI report: #108944
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Status update (continue-pr): fixed the integration-test regression and the blocking AI Review finding. Pushed Root cause (one bug behind all the failures). The commit that threaded the type-mapping level into This broke exactly the table-function code path (
Fix. Tests. Added regression assertions to Verification. |
Addresses the AI Review "Request changes" finding on `src/Processors/Sources/MySQLSource.cpp`: the new MySQL spatial read path called `parseWKBFormat` with the default `0`, so a remote `LINESTRING`, `POLYGON` or `MULTI*` value used only the hard-coded `100'000'000` element cap instead of the configurable `max_wkb_geometry_elements` query setting (which defaults to `1'000'000` specifically to bound WKB point/ring/polygon counts before the parser reserves memory). `external_storage_max_read_bytes` is checked only after a row has been parsed, so it could not prevent the oversized allocation path. Carry the (clamped) `max_wkb_geometry_elements` value in `StreamSettings`, built from the query settings exactly as `readWkb` does, and thread it through `insertValue`/`insertGeometryValue` into `parseWKBFormat`. Added a focused regression test to `test_mysql_geometry` that reads a MySQL `LINESTRING` of three points with `max_wkb_geometry_elements = 1` and asserts the read is rejected with `TOO_LARGE_ARRAY_SIZE`, and reads back normally with a sufficiently large limit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Status update (continue-pr): addressed the blocking AI Review finding. Pushed The AI Review requested changes because the new MySQL geometry read path called Fix (3410a1d):
Note: the Verification. |
…-types # Conflicts: # src/Storages/StorageMySQL.cpp # src/Storages/StorageMySQL.h # src/TableFunctions/TableFunctionMySQL.cpp
|
Status update (continue-pr): merged The PR had gone Conflicts resolved by combining both sides (no logic dropped):
Verified: both changed TUs pass One design note for your call (left as |
| const TableNameOrQuery & table_or_query, | ||
| const ContextPtr & context_) | ||
| const ContextPtr & context_, | ||
| MultiEnum<MySQLDataTypesSupport> type_support) |
There was a problem hiding this comment.
MySQL schema inference still drops the effective mysql_datatypes_support_level.
This new type_support parameter is only used by the table-name path. When table_or_query.isQuery is true, doQueryResultStructure re-reads context_->getSettingsRef, and the MYSQL_FIELD converter still maps any MYSQL_TYPE_GEOMETRY result to Point. As a result, a call such as mysql(..., query('SELECT ls FROM test_mysql_geometry'), SETTINGS mysql_datatypes_support_level = 'decimal,datetime64,date2Date32') ignores the function-local opt-out; with the default setting, selecting a MySQL LINESTRING through the query-backed form is inferred as Point, then the reader takes the vtPoint path and throws Only Point data type is supported for WKB type 2. The same dropped value affects decimal, datetime64, and date2Date32 overrides for query-backed table engines and table functions.
Please thread type_support into doQueryResultStructure and use it in the MYSQL_FIELD conversion. For MYSQL_TYPE_GEOMETRY query-result metadata, avoid guessing Point for non-Point spatial values; infer String unless the subtype can be proven or declared explicitly. A focused regression test with query('SELECT ls FROM ...') and a per-call mysql_datatypes_support_level override would cover this path.
|
Status update (continue-pr): merged GitHub reported the PR as Verified:
GitHub now reports On the one remaining unresolved thread ( |
|
Status update (continue-pr): the two red CI checks are unrelated master-wide flakes; no code change. Head
Neither is reachable from a MySQL type-mapping change (the AST fuzzer never creates a The single remaining unresolved review thread is the query-backed So the PR is effectively green; it remains |
The AI review was right, but we made a change in the wrong direction. Map GEOMETRY to Geometry, and accept an incompatibility - that we can throw an exception if the data in MySQL contains rarely used MULTIPOINT or GEOMETRYCOLLECTION. |
Per maintainer request, map a MySQL column declared with the generic `GEOMETRY` type to the ClickHouse umbrella `Geometry` type (a `Variant` over the concrete geometric types) instead of falling back to `String`. A generic `GEOMETRY` column can hold a value of any subtype, including ones that have no ClickHouse counterpart (`MULTIPOINT`, `GEOMETRYCOLLECTION`). Reading such a value now throws an exception at read time (`parseWKBFormat` rejects the WKB type). This incompatibility is intentionally accepted in exchange for exposing the column as a proper geometric type: values of the representable subtypes (`Point`, `LineString`, `Polygon`, `MultiLineString`, `MultiPolygon`) are read through the matching `Geometry` variant. The read path already handles the umbrella `Geometry` type: it is mapped to `vtGeometry` in `ExternalResultDescription`, and `insertGeometryValue` routes the parsed object to the matching variant discriminator. Only the schema inference in the string-based `convertMySQLDataType` overload (the table-name path) needed the mapping; columns declared as `MULTIPOINT` or `GEOMETRYCOLLECTION` still fall back to `String` (the raw WKB). Updated the `test_mysql_geometry` integration test: the generic `geometry` columns are now inferred as `Geometry`; a `geometry` column holding a `LINESTRING` reads back as a geometric value, and one holding a `MULTIPOINT` is rejected with `Incorrect geometry type`. Updated the setting description, the `MySQLDataTypesSupport` enum comment, and the MySQL engine docs. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Status update (continue-pr): mapped the generic MySQL Reverting the earlier "wrong direction" change: the string-based (table-name) The read side already supports this end-to-end: Updated One decision left for you — the query-backed path (the remaining AI thread on |
The `26.7` `SettingsChangesHistory` entry for `mysql_datatypes_support_level` still described the intermediate behavior where the generic `GEOMETRY` type fell back to `String`. Since commit `5873568d1e7` the generic `GEOMETRY` column maps to the umbrella `Geometry` type, and reading a value whose subtype has no ClickHouse counterpart (`MULTIPOINT`, `GEOMETRYCOLLECTION`) throws at read time. Update the reason string to match the final behavior, matching the setting description in `Settings.cpp` and the `MySQLDataTypesSupport` enum comment. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Status update (continue-pr): addressed the stale
Still reserved for you — the query-backed path (remaining AI thread on CI is green so far (0 failures; coverage/stateless/integration jobs still running). |
| /// (and freeze it into the table definition) so that it is honored during schema inference, | ||
| /// the same way the MySQL database engine does. An explicit per-engine SETTINGS value, loaded | ||
| /// right after, takes precedence over it. | ||
| mysql_settings.loadFromQueryContext(args.getLocalContext(), *args.storage_def); |
There was a problem hiding this comment.
This new loadFromQueryContext call comes after getConfiguration, so on the named-collection path it overwrites mysql_datatypes_support_level that processNamedCollectionResult has already loaded. loadFromQuery(*args.storage_def) only replays explicit SETTINGS, not values that came from the named collection, so a table such as ENGINE = MySQL(my_nc, table='geom') created under a conflicting session SET mysql_datatypes_support_level = 'decimal,datetime64,date2Date32,geometry' now ignores my_nc's opt-out during schema inference and also persists the session value into SHOW CREATE.
The table-function path gives named collections higher precedence than the query context, which is the behavior users will expect here too. Please either apply the query-context bridge before loading named-collection values, or make loadFromQueryContext skip settings that were already supplied by the engine arguments, and add a regression test for a named collection conflicting with a session SET.
|
Status update (continue-pr): triaged the new AI review (2 Majors) and the sole CI red on CI: the only failure is AI Major #1 — query-backed path ( AI Major #2 — named-collection precedence ( Both review threads left unresolved pending your decision. |
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered: 140/151 (92.72%) · Uncovered code |
|
Status update (continue-pr): completed the CI triage on My previous triage (06:54Z) only covered the
Both AI Majors remain reserved for you (unchanged from the 06:54Z note):
The PR is blocked only on these two decisions (and human merge). |
Addresses the AI Review "Request changes" verdict on #108944 (first required action): query-backed `MySQL` schema inference ignored the effective `mysql_datatypes_support_level` and always guessed every `MYSQL_TYPE_GEOMETRY` result as `Point`. A MySQL source defined over a query rather than a table name (`mysql(..., query('SELECT ls FROM t'))`, or the `query('...')`/subquery forms of the `MySQL` table engine/function) infers its columns via `doQueryResultStructure`, which read the type-mapping level straight from the global query context. So a per-call, per-engine, or named-collection override for `geometry`, `decimal`, `datetime64` or `date2Date32` was silently dropped on this path, while it was honored on the table-name path. `doQueryResultStructure` now receives the effective `MultiEnum<MySQLDataTypesSupport>` (`type_support`) threaded down from `StorageMySQL::getTableStructureFromData`, the same value the table-name path already uses, so the query-backed path honors the configured level. In addition, the `MYSQL_FIELD` overload of `convertMySQLDataType` no longer maps `MYSQL_TYPE_GEOMETRY` to `Point`. A query result set reports `MYSQL_TYPE_GEOMETRY` for a value of any spatial subtype (`POINT`, `LINESTRING`, `POLYGON`, ...) without exposing which one it is, so the concrete geometric type cannot be inferred here. Guessing `Point` made reading any non-`Point` value throw `Only Point data type is supported` at read time; the value now falls back to the raw WKB `String` instead. The concrete-type mapping (`linestring` -> `LineString`, etc.) is applied only where MySQL exposes the concrete column type - the table-name path, via the string overload - so a user who knows the subtype can still declare the column type explicitly. Note: this changes the pre-existing behavior of reading a query-backed `POINT` column, which previously inferred `Point` and now infers `String` (raw WKB). This is an isolated, easily reverted change if the umbrella `Point` guess is preferred. Added a regression test to `test_mysql_geometry`: a query-backed `LINESTRING` is inferred as `String` (not `Point`) and reads back without a read-time exception. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mapping Addresses the AI Review "Request changes" verdict on #108944 (second required action): the new query-context bridge in `registerStorageMySQL` gave a session `SET mysql_datatypes_support_level` higher precedence than the value loaded from a named collection. `registerStorageMySQL` (and the pre-existing `DatabaseMySQL` registration) load named-collection settings first, then call `MySQLSettings::loadFromQueryContext` to bridge the query-context value into the engine settings, and only afterwards replay the explicit `SETTINGS` via `loadFromQuery`. Because `loadFromQueryContext` overwrote the value whenever the session value differed, a table such as `ENGINE = MySQL(my_nc, table='geom')` created under a conflicting session `SET mysql_datatypes_support_level = '...'` ignored `my_nc`'s opt-out during schema inference and also persisted the session value into `SHOW CREATE`. `loadFromQueryContext` now returns early when the value was already supplied explicitly for this engine instance (`impl->isChanged(...)`, set by a named collection or the engine's own `SETTINGS`). The query-context bridge only fills in the value when the engine did not set it, so the intended precedence holds: engine `SETTINGS` > named collection > session `SET` > default. The default case (no explicit value, session equals the default) still persists nothing. Added a regression test to `test_mysql_geometry` using the new `mysql_no_geo_types` named collection (which disables `geometry`): under a session that enables `geometry`, a `LINESTRING` column is still inferred as `String`, and `SHOW CREATE` does not freeze the session value. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The MySQL integrations (the
MySQLdatabase engine, table engine and table function) previously mapped onlyPOINTto the ClickHousePointtype and converted every other MySQL spatial type (LINESTRING,POLYGON,MULTILINESTRING,MULTIPOLYGON, genericGEOMETRY) toString, exposing the raw WKB bytes.This adds a new
geometryflag to themysql_datatypes_support_levelsetting, enabled by default, that maps MySQL's spatial column types to the corresponding ClickHouse geometric types:POINTPoint(always, regardless of the flag)LINESTRINGLineStringPOLYGONPolygonMULTILINESTRINGMultiLineStringMULTIPOLYGONMultiPolygonGEOMETRY(generic)Geometry(the umbrellaVariant)MULTIPOINT,GEOMETRYCOLLECTIONStringA concrete spatial column type (
LINESTRING,POLYGON,MULTILINESTRING,MULTIPOLYGON) is mapped to its ClickHouse counterpart, because MySQL enforces a single subtype per such column, so a read can never encounter a value of a different subtype.The generic
GEOMETRYcolumn type is mapped to the umbrellaGeometrytype (aVariantoverPoint/LineString/Polygon/MultiLineString/MultiPolygon/Ring). Such a column can store values of any subtype, including ones that have no ClickHouse counterpart (MULTIPOINT,GEOMETRYCOLLECTION); this incompatibility is accepted: a value of a representable subtype reads back correctly, while a value of a subtype with no ClickHouse counterpart throws an exception at read time, becauseparseWKBFormatrejects the WKB type and theGeometryVarianthas no alternative for it.POINTkeeps being converted toPointunconditionally for backward compatibility (it has been supported since version 24.1). Columns declared asMULTIPOINTorGEOMETRYCOLLECTIONthemselves have no ClickHouse counterpart and, as all unknown types, fall back toString. When thegeometryflag is disabled, the previous behavior is restored (everything exceptPOINTbecomesString).MySQLSourcereads the geometric values (for the concrete spatial columns, and for columns declared as a geometric orGeometrytype) by stripping the 4-byte SRID prefix and parsing the standard WKB payload withparseWKBFormat, then inserting the result into the concrete geometric column or into theGeometryVariantby discriminator.Motivation. This was raised in the review of #107740 (discussion r3496279299): mapping all spatial values to
Pointis wrong. The concrete spatial types map to their exact ClickHouse counterparts; the genericGEOMETRYmaps to the umbrellaGeometrytype, accepting that the rarely usedMULTIPOINTandGEOMETRYCOLLECTIONsubtypes are not representable and throw at read time.PostgreSQL uses different (non-WKB) text representations for its built-in geometric types and EWKB for PostGIS, and currently has no geometric type mapping at all, so it is left for a follow-up.
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
The
MySQLdatabase engine, table engine and table function now map MySQL's spatial column types (LINESTRING,POLYGON,MULTILINESTRING,MULTIPOLYGON, and the genericGEOMETRY) to the corresponding ClickHouse geometric types instead ofString. This is controlled by the newgeometryflag of themysql_datatypes_support_levelsetting, enabled by default.POINTis still always converted toPoint. The genericGEOMETRYcolumn maps to the umbrellaGeometrytype; reading a value whose subtype has no ClickHouse counterpart (MULTIPOINT,GEOMETRYCOLLECTION) throws an exception at read time.