Native CREATE TABLE / DROP TABLE for DataLakeCatalog; optionally prune on DROP#98670
Native CREATE TABLE / DROP TABLE for DataLakeCatalog; optionally prune on DROP#98670zvonand wants to merge 15 commits into
CREATE TABLE / DROP TABLE for DataLakeCatalog; optionally prune on DROP#98670Conversation
5133bc3 to
a81e99e
Compare
b3846c8 to
b381484
Compare
b381484 to
97a60e7
Compare
|
The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix. |
CREATE TABLE / DROP TABLE for DataLakeCatalog; optionally prune on DROP
a0bf1f2 to
b3cb15b
Compare
b3cb15b to
ce5a19c
Compare
- Port \`IcebergPathResolver::reverseResolve\` from upstream (predates the PR upstream but was missing on this branch). - Handle the Altinity-only \`S3_TABLES\` catalog type in \`getLocationSchemeForTableCreation\` (it is always S3-backed). - Drop the \`unique_key\` check in \`DatabaseDataLake::createTable\`: \`ASTStorage\` has no \`UNIQUE KEY\` clause on this branch. - \`getInMemoryMetadataPtr\` does not take a context argument on this branch. ClickHouse#98670 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This is not a valid concern; see #98670 (comment) |
`getPartitionField` serialized `icebergBucket(0, id)`, `icebergBucket(-1, id)`, and `icebergTruncate(0, id)` into `bucket[0]`, `bucket[-1]`, and `truncate[0]`. The Iceberg spec requires a positive number of buckets and a positive truncate width; an invalid transform produces metadata that some catalogs (e.g. Glue) register, leaving an unreadable Iceberg table. Validate the argument before metadata generation so such a `CREATE TABLE ... PARTITION BY ...` is rejected. Addresses an AI Review finding on ClickHouse#98670 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
In `writeMetadataFileAndVersionHint`, when another writer already committed an equal-or-newer version, we remove the just-written `vN-<uuid>.metadata.json` so the version-hint resolver does not later pick this uncommitted file. Previously a failure to remove it was logged and swallowed, and the function returned `false` - the same signal a clean loss returns - so the caller retried as if the file were gone while the uncommitted metadata could still be resolved as the latest version. Let the removal exception propagate so the operation fails closed. Addresses an AI Review finding on ClickHouse#98670 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`RestCatalog::createTable` passed the table location (base/namespace/table) to `createNamespaceIfNotExists`, which stored it as the namespace's `properties.location`. That made the namespace default warehouse path point at the first table's directory, so another client creating a table in the same namespace without an explicit location could be placed under the first table's path. Strip the trailing table-name segment so the namespace location is base/namespace. Addresses an AI Review finding on ClickHouse#98670 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A catalog-side `DROP TABLE` removes remote catalog metadata and, when `data_lake_delete_data_on_drop` is set, can request deletion of the underlying data, but the only success log was `TRACE`. Raise it to `INFO` so accidental drops or catalog-side data purges leave an operationally visible audit trail. Addresses an AI Review finding on ClickHouse#98670 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`test_create_table_invalid_partition_transforms` checks that `icebergBucket(0, id)`, `icebergBucket(-1, id)`, and `icebergTruncate(0, id)` are rejected and leave no table registered in the catalog, while a valid `icebergBucket(8, id)` succeeds. `test_create_table_namespace_location` checks that the REST namespace `location` property points at the namespace base rather than the first table's directory. Tests for AI Review findings on ClickHouse#98670 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…erg-table # Conflicts: # src/Databases/DataLake/DatabaseDataLake.h # src/Storages/ObjectStorage/DataLakes/Iceberg/IcebergMetadata.cpp # src/Storages/ObjectStorage/DataLakes/Iceberg/Mutations.cpp
Both this branch and master independently added a `virtual bool isDatalakeCatalog() const` method to `IDatabase`, at different locations, so the merge left two identical declarations and `class member cannot be redeclared`. Keep the one master placed next to `isExternal`/`isRemoteDatabase` and drop the duplicate.
The setting was recorded in the 26.6 bucket, but 26.6 is already released and does not contain it; the current development version is 26.7. Keeping it in 26.6 makes the settings-history consistency check (`02995_new_settings_history`) disagree with the released 26.6 binary.
|
Pushed updates addressing the outstanding AI-review findings, plus a fresh AI-review blockers/majors (all four addressed):
The changelog entry already uses the correct Merge with
Follow-ups from the merge: removed a duplicate All 16 touched translation units verified with |
`test_drop_table_purge` constructs a `Minio` client but the `minio` package was never imported, so `ruff check` failed with `F821 Undefined name Minio`. Add `from minio import Minio` to match the import used by `tests/integration/helpers/cluster.py`. CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=98670&sha=60c079ba48738ecaf5eb567865aea45cb048a4e6&name_0=PR&name_1=Style%20check Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| return fmt::format("abfss://{}@{}/{}/{}/{}", container, account_host, extra_path, namespace_name, table_name); | ||
| } | ||
|
|
||
| if (location_scheme == "s3") |
There was a problem hiding this comment.
constructTableLocation now derives the s3://... table location from uri.getPath() only. That breaks configurations using storage_uri_style = 'virtual_hosted' (which DatabaseDataLakeSettings already documents), for example https://warehouse-rest.minio.example.com/prefix: here the bucket lives in the host, so path is just prefix (or empty) and this code either creates s3://prefix/ns/table or throws does not contain a bucket.
Existing DataLakeCatalog reads work with that endpoint style, so CREATE TABLE would now be unavailable or generate the wrong table location for a supported configuration. Please derive the bucket from the host when virtual-hosted style is configured, or reject that endpoint style explicitly before advertising CREATE TABLE support.
| throw Exception(ErrorCodes::BAD_ARGUMENTS, "Mutations with ON CLUSTER are not allowed for KeeperMap tables"); | ||
|
|
||
| if (table_id) | ||
| checkDatabaseSupportsOnClusterDDL(DatabaseCatalog::instance().tryGetDatabase(table_id.database_name)); |
There was a problem hiding this comment.
checkDatabaseSupportsOnClusterDDL only runs here when the initiator resolved the table locally (if (table_id)). If ALTER TABLE dl.\ns.t` ... ON CLUSTERis issued from a node that does not have thedl DataLakeCatalogdatabase/table locally,table_id` stays empty, this guard is skipped, and the query is still fanned out to every worker.
Each worker then commits the same shared-catalog mutation, which is the same duplicate-mutation race the new ON CLUSTER ban is supposed to prevent for catalog-backed tables. Please reject DataLakeCatalog ON CLUSTER by database name before the local lookup, or re-check on the worker before executing the mutation.
| order_by, | ||
| context_); | ||
|
|
||
| catalog->createTable(namespace_name, table_name, /* metadata_path */ "", metadata_content); |
There was a problem hiding this comment.
This path is now generic for every DataLakeCatalog, but only RestCatalog and GlueCatalog currently override ICatalog::createTable / dropTable. hive, unity, onelake, biglake, and paimon_rest still hit the base ICatalog implementation and throw NOT_IMPLEMENTED, even though the new docs and changelog entry describe CREATE TABLE / DROP TABLE as DataLakeCatalog functionality in general.
Please either gate these statements to the catalog types that actually implement catalog mutations, or narrow the user-facing contract to REST / Glue only.
| columns_declare_list->set(columns_declare_list->columns, columns_expression_list); | ||
| create_query.set(create_query.columns_list, columns_declare_list); | ||
|
|
||
| for (const auto & column : properties.columns) |
There was a problem hiding this comment.
CREATE TABLE ... AS source can still silently drop source-column COMMENT, CODEC, column TTL, SETTINGS, and STATISTICS. properties.columns already carries those fields here, but this AST rebuild only copies name, type, and defaults, so DatabaseDataLake::createTable never sees the unsupported modifiers and the Iceberg table is created with weaker semantics than the source.
Could we reuse formatColumns(properties.columns) (or otherwise copy these fields into the synthesized ASTColumnDeclaration) so the existing rejection path fires for CTAS too?
LLVM Coverage ReportChanged lines: Changed C/C++ lines covered: 337/598 (56.35%) · Uncovered code |

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Supports
CREATE TABLE,CREATE TABLE … AS source, andDROP TABLEforDataLakeCatalog;DROP TABLEcan request catalog-side data purge via the newdata_lake_delete_data_on_dropsetting.Documentation entry for user-facing changes