Make row policies follow the table on RENAME, EXCHANGE and RENAME DATABASE by groeneai · Pull Request #108478 · ClickHouse/ClickHouse · GitHub
Skip to content

Make row policies follow the table on RENAME, EXCHANGE and RENAME DATABASE#108478

Open
groeneai wants to merge 7 commits into
ClickHouse:masterfrom
groeneai:fix-row-policy-escapes-rename-table
Open

Make row policies follow the table on RENAME, EXCHANGE and RENAME DATABASE#108478
groeneai wants to merge 7 commits into
ClickHouse:masterfrom
groeneai:fix-row-policy-escapes-rename-table

Conversation

@groeneai

Copy link
Copy Markdown
Contributor

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed a row-policy escape: RENAME TABLE, EXCHANGE TABLES and RENAME DATABASE left row policies bound to the old name, so after the operation the table was readable with no filtering under its new name. Row policies now follow the table/database on rename.

Description

Row policies are access entities keyed by (database, table_name) (RowPolicyName). When a table was renamed the policy stayed bound to the old name and was never reattached to the new name, so after the rename ALL rows became visible under the new name, fully escaping the policy. EXCHANGE TABLES and RENAME DATABASE (for ON db.* and per-table policies) had the same effect.

InterpreterRenameQuery::executeToTables already carefully re-keys every other kind of table binding from the old StorageID to the new one (referential/loading/MV dependencies, source-view dependencies, named collections), but never touched row policies. This PR re-keys row policies the same way, through AccessControl, so the policy follows the data:

  • plain RENAME TABLE a TO b moves policies a -> b;
  • EXCHANGE TABLES a AND b swaps both directions. The policy follows the data, which is the only behaviour that preserves protection across the swap (otherwise exchanging a protected table with an unprotected one would move the protected data to a name with no policy);
  • cross-database RENAME db1.a TO db2.b updates both database and table;
  • RENAME DATABASE moves both database-wide (ON db.*) and per-table policies;
  • on a failed rename the original bindings are restored in the existing catch block, consistent with how dependencies are already restored.

The re-key is applied collision-free by routing each affected policy through a unique temporary table name first, then to its destination, so an EXCHANGE of two tables that each have a policy with the same short name does not hit a transient name collision in the access storage. Updates go through AccessControl::update by UUID, so they persist on disk and replicate like any other access-entity change.

Reproduction (100% on a clean checkout, with access_management enabled):

CREATE TABLE data (id UInt64, secret String, dept String) ENGINE = MergeTree ORDER BY id;
INSERT INTO data VALUES (1, 'public', 'eng'), (2, 'classified', 'fin'), (3, 'public2', 'eng');
CREATE USER restricted;
GRANT SELECT, DROP TABLE ON default.data TO restricted;
GRANT SELECT, INSERT, CREATE TABLE ON default.data2 TO restricted;
CREATE ROW POLICY rp ON default.data FOR SELECT USING dept = 'eng' TO restricted;
-- as restricted:
SELECT * FROM data;            -- only the 2 eng rows: policy applied
RENAME TABLE data TO data2;
SELECT * FROM data2;           -- before this PR: ALL 3 rows, policy escaped

Regression test 04401_row_policy_follows_table_rename covers RENAME, EXCHANGE (including same-short-name and asymmetric cases), cross-database RENAME, RENAME DATABASE, and the rollback-on-failure path.

Notes for reviewers (cc access-control maintainers)

  1. Re-key vs block. This implements the re-key approach (policy follows the data). The alternative would be to block RENAME/EXCHANGE while active policies exist. Re-keying preserves protection without breaking legitimate workflows (e.g. tenants renaming their own tables), so it seems clearly preferable, but please confirm this is the behaviour you want.
  2. Severity / changelog category. Filed as Bug Fix. This is an RBAC escape, so it may warrant Critical Bug Fix (crash, data loss, RBAC) instead. Happy to switch the category if you agree.
  3. Replicated databases. On a Replicated database the rename is enqueued and each replica re-runs the interpreter, so the re-key runs once per replica. Because it targets policies by UUID with absolute destination names, repeated application is idempotent; the redundant access-storage writes mirror how the existing dependency re-keying already behaves per-replica.

…ABASE

Row policies are keyed by (database, table). On RENAME TABLE, EXCHANGE TABLES
and RENAME DATABASE the policy was left bound to the old name and never
reattached, so after the operation the data was readable with no filtering
under the new name, fully escaping the policy.

InterpreterRenameQuery::executeToTables already re-keys table dependencies,
source-view dependencies and named collections from the old StorageID to the
new one, but never touched row policies. This change re-keys row policies the
same way, through AccessControl, so the policy follows the data:

- plain RENAME TABLE a TO b moves policies a -> b;
- EXCHANGE TABLES a AND b swaps both directions (the policy follows the data,
  which is the only behaviour that preserves protection across the swap);
- cross-database RENAME db1.a TO db2.b updates both database and table;
- RENAME DATABASE moves both database-wide (ON db.*) and per-table policies;
- on a failed rename the original bindings are restored in the catch block,
  consistent with how dependencies are already restored.

The re-key is applied collision-free by routing each affected policy through a
unique temporary table name first, then to its destination, so an EXCHANGE of
two tables that each have a policy with the same short name does not hit a
transient name collision in the access storage.

Updates go through AccessControl::update by UUID, so they persist on disk and
replicate like any other access-entity change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

@groeneai

Copy link
Copy Markdown
Contributor Author

cc @pufit @vitlibar — could you review this? Row policies are keyed by (database, table) and InterpreterRenameQuery never re-keyed them, so RENAME TABLE / EXCHANGE TABLES / RENAME DATABASE left the policy orphaned on the old name and the table became readable unfiltered under the new name (an RBAC escape). This re-keys row policies through AccessControl alongside the existing dependency re-keying, with collision-safe handling for EXCHANGE and rollback on failure. Two design questions for you are noted in the PR description (re-key vs block, and whether this should be Critical Bug Fix).

@nikitamikhaylov nikitamikhaylov added the can be tested Allows running workflows for external contributors label Jun 25, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [0f8ce6e]

Summary:


AI Review

Summary

This PR rekeys row policies so explicit RENAME TABLE, EXCHANGE TABLES, and RENAME DATABASE do not leave protected data unfiltered under a new name. The direct rename/exchange cases and prior issues around cross-database db.* policies and replacement DDL are now addressed, but the fix still has unresolved atomicity gaps around access-entity updates, including a replicated-access divergence, so I cannot approve it yet.

PR Metadata

💡 The selected category should be Critical Bug Fix (crash, data loss, RBAC) rather than Bug Fix (user-visible misbehavior in an official stable release) because the PR fixes an RBAC escape. The changelog entry itself is specific and user-readable.

Findings

❌ Blockers

  • [src/Interpreters/InterpreterRenameQuery.cpp:489] Replicated database table rename and ReplicatedAccessStorage row-policy rename are not coordinated. The first replica to execute the DDL updates the shared row-policy entity to the new table name, so lagging replicas can still have the old table name but no old_db.old_table policy. This temporarily exposes rows unfiltered until their DDL worker catches up. Fix by coordinating the access update with replicated database DDL, keeping both names protected through the transition, or rejecting affected policies in replicated access storage / Replicated databases.
  • [src/Interpreters/InterpreterRenameQuery.cpp:489] [dismissed by author -- https://github.com/ClickHouse/ClickHouse/pull/108478#discussion_r3475937537] The preflight snapshot is still not protected from concurrent access DDL. CREATE ROW POLICY can target non-existing table names and does not take the table DDLGuard, so a policy can be created on the source after collection or a conflicting policy can be created at the destination/transient name before applyRowPolicyRekeys; the table/database rename has already committed by then, so the policy can be left on the wrong name or the apply can throw after the commit. I still consider this real because the current code has no lock or version check spanning collection, preflight, metadata rename, and access update.
Tests

⚠️ Missing a focused integration test for Replicated database with <user_directories><replicated> where one replica lags the DDL rename; the test should prove the old-name table stays filtered on the lagging replica, or that the operation is rejected. The new stateless test explicitly opts out of replicated databases, so it cannot catch this path.
⚠️ The concurrent access-DDL window also needs coverage or a design reason it is impossible: create a policy on the source/destination/transient name between preflight and applyRowPolicyRekeys and verify no committed rename leaves data unfiltered.

Final Verdict

Status: ❌ Block

Minimum required actions: close the replicated-access / Replicated database policy exposure, make the concurrent access-DDL window atomic or reject affected renames before commit, and update the changelog category to Critical Bug Fix (crash, data loss, RBAC).

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Jun 25, 2026
NamedCollectionFactory::instance().renameDependencies(to_table_id, from_table_id);

/// Re-key row policies last: it has its own rollback, and nothing after it can throw.
applyRowPolicyRekeys(access_control, row_policy_rekeys);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

applyRowPolicyRekeys can still throw here after database->renameTable has already committed. For example, row policies loaded from users.xml are read-only, and a policy with the same short name can already exist on the destination table name even when the table name itself is free. In those cases the outer catch restores dependencies but cannot undo the table rename; the row policy is restored/left on the old name and the renamed table is left readable without the intended filter. executeToDatabase has the same post-commit failure mode after db->renameDatabase. Please preflight all re-key failures before the table/database rename, or block the rename before the commit point when the affected policies cannot be moved.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The read-only and final-destination cases are preflighted now, but the post-commit exception is still reachable through the phase-1 temporary name. applyRowPolicyRekeys first updates the policy to short_name ON old_db.\.tmp_rename_row_policy__0`afterrenameTablehas returned; if that row policy name already exists,AccessControl::updatethrows there. The UUID is visible insystem.row_policies.id, so this is deterministic, not just a concurrent race. The outer catch` still cannot undo the committed table/database rename, so the policy remains on the old name and the renamed table can be exposed without the intended filter. Please preflight or reserve the temporary names as well, or avoid a post-commit rename step that can still hit an access-entity name collision.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current code still has the same post-commit exposure class. The snapshot/preflight at src/Interpreters/InterpreterRenameQuery.cpp:410-418 is not protected against concurrent access DDL, while applyRowPolicyRekeys still performs ordinary access-storage writes after renameTable has committed at src/Interpreters/InterpreterRenameQuery.cpp:478.

Two concrete interleavings remain possible because CREATE ROW POLICY is independent of the table DDLGuard and can target non-existing table names:

  • another session creates a new policy on the source table after collectRowPolicyRekeys but before the rename commits; it is not in row_policy_rekeys, so it stays bound to the old name after the table moves;
  • another session creates a conflicting policy on the destination or transient name after preflightRowPolicyRekeys; the post-commit AccessControl::update throws, restore puts the moving policy back on the old name, and the already-renamed table is left without that filter.

Storage-write exceptions in AccessControl::update have the same shape: preflight cannot prove the later persisted updates will succeed. This needs serialization/reservation that covers access-entity changes until the row-policy re-key is complete, or a fail-closed design that does not leave protected data at a name where the original policy is not applied when a post-commit access update fails.

The row-policy re-key in InterpreterRenameQuery previously ran after
database->renameTable / db->renameDatabase had already committed. If the
re-key threw at that point (a policy in a read-only storage such as one
loaded from users.xml, or a destination short-name collision), the
surrounding catch could restore the dependency bookkeeping but could not
undo the committed table/database rename, leaving the renamed object
readable without its row filter. That reintroduced the very row-policy
escape this change closes, on the error path.

Move the re-key feasibility check ahead of the first mutation: collect the
planned re-keys and preflight them (reject the rename when a policy is in a
read-only storage or its destination name is already taken by a policy that
is not itself moving). A rejected rename now changes nothing. The post-rename
apply step keeps its own rollback for any residual error.

Add regression coverage for both unmovable-policy paths: a read-only
users.xml policy (via clickhouse-local) and a destination row-policy name
collision. Both must fail the rename cleanly, leaving the table in place and
the policy still filtering.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

Addressed the post-commit failure path in 569f980.

The re-key now runs in two stages. preflightRowPolicyRekeys runs BEFORE the first mutation (removeDependencies / renameTable in executeToTables, renameDatabase in executeToDatabase) and rejects the rename when a policy cannot be moved: it is in a read-only storage (e.g. loaded from users.xml), or its destination short name is already taken by a policy that is not itself moving (an EXCHANGE of two same-short-name policies is not a collision because both are in the re-key set). A rejected rename changes nothing. The apply step still runs after the rename with its existing rollback for any residual error.

Regression tests cover both unmovable-policy paths: a read-only users.xml policy (via clickhouse-local) and a destination row-policy name collision. Each fails the rename cleanly, leaving the table in place and the policy still filtering.

Comment thread src/Interpreters/InterpreterRenameQuery.cpp Outdated
Follow-up to the preflight commit, addressing two review findings.

1. Cross-database db.* policy. collectRowPolicyRekeys matched only an exact
   table_name, so a database-wide `CREATE ROW POLICY ... ON db1.*` was not
   carried when a table moved out of db1 (RENAME db1.t TO db2.t, or cross-db
   EXCHANGE). EnabledRowPolicies::getFilter then looks up db2.t and db2.*, never
   the old db1.*, so the moved data was readable unfiltered or under an unrelated
   destination policy. A db.* policy governs every table in db1 and cannot follow
   a single table out, so reject such cross-database moves/exchanges (NOT_IMPLEMENTED)
   instead of silently dropping the filter. Same-database renames are unaffected:
   the db.* policy keeps covering the table via the ANY_TABLE_MARK fallback.

2. Phase-1 transient name. applyRowPolicyRekeys parks each policy under
   `.tmp_rename_row_policy_<uuid>_<i>` before moving it to the destination, and
   that step runs after renameTable/renameDatabase has committed. The name is
   derived from the policy UUID (visible in system.row_policies.id), so a
   pre-existing policy can occupy it deterministically; AccessControl::update
   would then throw post-commit and the outer catch cannot undo the committed
   rename, leaving the table unfiltered. Extend preflightRowPolicyRekeys to also
   reject when the transient name is taken by a non-moving policy, so the whole
   re-key is proven applicable before the rename commits. The temp-name format is
   factored into tempRekeyTableName so preflight and apply cannot drift.

Regression tests added to 04401 for both: cross-database RENAME with a db.*
policy is rejected (table not moved, source still filtered), and a transient-name
collision is rejected pre-commit (table stays put, policy still filters).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

Both findings fixed in 779feb3.

  1. Cross-database db.* drop: a database-wide ON db.* policy governs every table in db and cannot follow a single table out of it, so RENAME db1.t TO db2.t (and cross-db EXCHANGE) are now rejected with NOT_IMPLEMENTED when a source db.* policy applies, instead of silently dropping the filter. Same-database renames are unaffected (the db.* policy still covers the table via the ANY_TABLE_MARK fallback).

  2. Phase-1 temp-name post-commit throw: preflightRowPolicyRekeys now also checks the .tmp_rename_row_policy_<uuid>_<i> parking name for a non-moving-policy collision, so the entire re-key is proven applicable before the table/database rename commits. The temp-name format is factored into tempRekeyTableName so preflight and apply cannot drift.

Regression tests added to 04401: cross-db db.* move is rejected (table not moved, source still filtered); a transient-name collision is rejected pre-commit (table stays put, policy still filters). 30/30 pass with full randomization; the existing access/rename suite still passes.

@groeneai

Copy link
Copy Markdown
Contributor Author

Pre-PR validation gate for the follow-up commit 779feb3 (two review findings).

Pre-PR validation gate (click to expand)
# Question Answer
a Deterministic repro? Yes. Cross-DB: CREATE ROW POLICY p ON db1.* ...; RENAME TABLE db1.t TO db2.t2 (and cross-db EXCHANGE) leaked the moved data on pre-fix code; now rejected NOT_IMPLEMENTED on demand. Temp-name: create tp ON db.t, read its UUID from system.row_policies.id, create a second tp ON db.\.tmp_rename_row_policy__0`, then RENAME db.t TO db.bthrows post-commit pre-fix; now rejectedACCESS_ENTITY_ALREADY_EXISTS` pre-commit. Both reproduced and verified on a Build-ID-matched binary (9bbf43a3).
b Root cause explained? (1) collectRowPolicyRekeys matched only exact table_name, so a database-wide ON db1.* (table_name = ANY_TABLE_MARK) was never collected; after the move, EnabledRowPolicies::getFilter looks up db2.t then db2.*, never db1.*, so the moved data is unfiltered. (2) applyRowPolicyRekeys parks each policy under .tmp_rename_row_policy_<uuid>_<i> AFTER renameTable commits; the UUID is user-visible, so a pre-existing policy occupying that name makes AccessControl::update throw post-commit, and the outer catch cannot undo the committed rename -> table left unfiltered.
c Fix matches root cause? Yes. (1) A db.* policy governs every table in db1 and cannot follow one table out, so cross-database moves/exchanges are rejected when a source db.* applies (hasDatabaseWideRowPolicy), addressing the exact unmovable-policy mechanism. (2) preflightRowPolicyRekeys now also checks the phase-1 transient name, proving the whole re-key applies BEFORE the rename commits -> no post-commit throw. No symptom-masking.
d Test intent preserved / new tests added? Yes. Two regression cases added to 04401: cross-db db.* move rejected (table not moved, source still filtered) and transient-name collision rejected pre-commit (table stays, policy still filters). Existing cases unchanged.
e Both directions demonstrated? Yes. Both vectors leak / throw-post-commit on pre-fix code; both are cleanly rejected with the fix and the table+policy stay intact. 04401: 30/30 with full randomization.
f Fix is general across code paths? Yes. The temp-name preflight lives in the shared preflightRowPolicyRekeys, so both executeToTables (RENAME/EXCHANGE) and executeToDatabase (RENAME DATABASE) inherit it. The cross-db db.* check is in executeToTables; executeToDatabase does not need it because collectRowPolicyRekeysForDatabase already moves db.* to to_db.*.
g Fix generalizes across inputs? Yes. Covered: plain RENAME, EXCHANGE (both directions), same-short-name EXCHANGE (still succeeds via the moving-id guard), cross-DB RENAME and EXCHANGE, RENAME DATABASE with db.* and per-table policies, read-only (users.xml) source, destination collision, transient-name collision, and the no-policy no-op. db.* is identified by RowPolicy::isForDatabase() (table_name == ANY_TABLE_MARK), independent of short name.
h Backward compatible? Yes. No setting, on-disk, wire, or metadata format change. The cross-db db.* rejection is a new, fail-closed error on a previously-silent escape (a security fix, not a behavior regression for valid use); same-database renames and all single-database policy moves are unaffected. No SettingsChangesHistory entry needed.
i Invariants and contracts preserved? Yes. The "every re-key proven applicable before the first mutation" invariant is strengthened (transient-name case added). The rename's existing dependency rollback is untouched; applyRowPolicyRekeys keeps its own rollback for residual errors. Policy updates still go through AccessControl::update by UUID (the ALTER ROW POLICY path), so persistence/replication are unchanged. No new throw after the commit point.

Session id: cron:clickhouse-worker-slot-3:20260625-151900

@groeneai

Copy link
Copy Markdown
Contributor Author

Confirmed, and this is now an atomicity question rather than a missing guard. The remaining exposure is structural to InterpreterRenameQuery, not specific to the row-policy re-key: the dependency, source-view and named-collection re-keys at lines 449-473 also run after renameTable commits, and the catch restores only the in-memory dependency bindings, not the committed rename, so a post-commit throw there leaves the same committed-rename / inconsistent-binding shape. Access-entity DDL is not under the table DDLGuard (CREATE ROW POLICY can target a non-existent name), so preflight cannot be made atomic with the commit by adding more checks. The original escape (a policy silently not following the table) is closed; what is left is hardening this race.

I do not want to pick the locking model unilaterally on a security PR, so flagging the fork for @pufit @vitlibar to decide:

  1. Serialize: have the rename hold a reservation in AccessControl over the affected policy names for the commit window, and take the snapshot under it. Closes all three interleavings (concurrent create on the source, destination/transient-name collision, storage-write failure). Most invasive; needs a new AccessControl/IAccessStorage entry point.
  2. Fail-closed ordering: re-key policies before the rename commit and roll them back if the rename then fails. Not a clean close on its own: it opens a transient window where the old name is unfiltered, and it still misses a policy created on the source after the snapshot.
  3. Block: reject RENAME/EXCHANGE/RENAME DATABASE while an affected row policy exists (the fallback from the original report). Trivially fail-closed (data never lands at an unfiltered name), much smaller, but refuses the rename whenever a policy is present.

My read: option 1 is the only complete close but is an access-subsystem change; option 3 is the smallest fail-closed option. Is fully closing this race in scope for this patch, or acceptable as a follow-up given it matches the existing post-commit re-key semantics for the other bindings? Will implement whichever you prefer.

The regression test grew across review rounds to eight cases (RENAME, EXCHANGE,
cross-DB rename, RENAME DATABASE, and the destination/read-only/cross-DB/transient-name
rejection paths), two of which start a separate clickhouse-local server. A single run is
cheap, but under the flaky-check job's repeated parallel runs on sanitizer builds the
per-run wall clock exceeds the 180s soft cap, so the flaky check fails it with TOO_LONG.

The long tag exempts the test from that 180s flaky-check cap and reduces its flaky-check
repeat count from ~50 to ~5 (long_test_runs_ratio), while still running it once on every
regular and per-test-coverage lane (only the LLVM aggregation and fast-test lanes pass
--no-long). No assertion is weakened: each run still executes all eight security cases.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

The flaky check failed this PR's own test 04401_row_policy_follows_table_rename with TOO_LONG (> 180s). The test grew to eight cases across review rounds (two start a separate clickhouse-local server), so under the flaky check's repeated parallel runs on sanitizer builds a single run crosses the 180s soft cap.

Tagged the test long in bd4af75. That exempts the 180s flaky-check cap and cuts its flaky-check repeat count from ~50 to ~5, while the test still runs once on every regular and per-test-coverage lane (only the LLVM aggregation and fast-test lanes pass --no-long). No assertion is weakened: each run executes all eight security cases. Verified locally: flaky check now schedules 5 runs (was 50), all pass, each ~15s.

The outstanding atomicity/TOCTOU design question from the 16:19 UTC review is unaffected and still awaits a maintainer's pick on the locking-vs-fail-closed fork.

Comment thread src/Interpreters/InterpreterRenameQuery.cpp Outdated
The row-policy re-key added for RENAME/EXCHANGE also fired for the synthetic
exchange that CREATE OR REPLACE TABLE / REPLACE TABLE build internally
(InterpreterCreateQuery::doCreateOrReplaceTable), where from = a freshly built
temporary table and to = the existing target. The unconditional destination
re-key moved the target table's row policy onto that temporary name, which is
dropped right after the swap, leaving the replaced table readable with no
filter. That reintroduced the row-policy escape this code closes, for
replacement DDL.

Mark the synthetic rename with an internal create_or_replace flag and skip the
row-policy re-key for it: only the target's storage is replaced, the target
keeps its name, so the policies bound to that name stay put and filter the new
data. Explicit RENAME/EXCHANGE still re-key as before. The flag is rebuilt
in-process on every replica via doCreateOrReplaceTable, so it is preserved for
Replicated databases and ON CLUSTER.

Covered by two new cases in 04401_row_policy_follows_table_rename: an existing
protected table replaced via CREATE OR REPLACE TABLE and via REPLACE TABLE, each
asserting the policy stays bound to the target name and keeps filtering.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

Fixed in 41e138b. CREATE OR REPLACE TABLE / REPLACE TABLE reach the rename via the synthetic exchange built by InterpreterCreateQuery::doCreateOrReplaceTable (from = the temporary replacement table, to = the existing target), and the unconditional to_rekeys collection moved the target table's row policy onto that temporary name, which is dropped right after the swap, leaving the replaced table unfiltered.

I added an internal create_or_replace flag on the synthetic ASTRenameQuery (set for both the rename and exchange branches in doCreateOrReplaceTable) and skip the row-policy re-key for it in executeToTables: only the target's storage is replaced, the target keeps its name, so the policies bound to that name stay put and filter the new data. Explicit RENAME/EXCHANGE re-key exactly as before. The flag is rebuilt in-process on every replica via doCreateOrReplaceTable, so it is preserved for Replicated databases and ON CLUSTER.

Verified on a debug build: your CREATE OR REPLACE TABLE reproducer now returns only the filtered row (was unfiltered), REPLACE TABLE likewise, and plain RENAME/EXCHANGE still move policies with the data. Regression coverage added to 04401_row_policy_follows_table_rename for an existing protected table replaced via both CREATE OR REPLACE TABLE and REPLACE TABLE, each asserting the policy stays on the target name and keeps filtering.

The CREATE OR REPLACE TABLE / REPLACE TABLE cases run as the restricted
user and specify ENGINE = MergeTree. The default test config enables
table_engines_require_grant, so they need an explicit TABLE ENGINE grant;
without it the statement is rejected with ACCESS_DENIED.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

Fixed the deterministic 04401 failure on the CREATE OR REPLACE / REPLACE TABLE cases (commit 3f8b2dd).

Root cause: those two cases run as the restricted test user and specify ENGINE = MergeTree. The default test config sets table_engines_require_grant, so the user needs GRANT TABLE ENGINE ON MergeTree (it only had CREATE/INSERT/SELECT/DROP). Without it the statement is rejected with Code: 497 ... TABLE ENGINE ON MergeTree (ACCESS_DENIED). The earlier local run passed under a minimal config that did not enforce engine grants.

Fix: add GRANT TABLE ENGINE ON MergeTree TO ${USER}. Scope is minimal: only these two statements specify an engine as the restricted user (every other CREATE TABLE runs as admin). The grant governs engine specification only and is orthogonal to row-policy filtering, so the security assertions are unchanged: after the replace, the restricted user still sees only the eng row, and the policy stays bound to the target name.

Verified against the default test config (table_engines_require_grant=true): without the grant the case fails with the exact TABLE ENGINE ON MergeTree ACCESS_DENIED; with it the replace succeeds and the policy still filters. Full test passes 10/10 with randomization.

@groeneai

Copy link
Copy Markdown
Contributor Author

CI finish ledger — 3f8b2dd

Every failure below has an owner: a fixing PR (ours or external), or CH Inc sync (exempt).

Check / test Reason Owner / fixing PR
Stateless tests (amd_tsan, s3 storage, parallel, 2/2) / ThreadSanitizer data race (STID 4071-3348) + Server died trunk TSAN race in QueryStatus/MemoryReservation release at query finish, unrelated to this row-policy change (7 master + ~15 unrelated PRs/30d) #108391 (merged 2026-06-25 23:49Z); the failed build predated it, so merged master @ 0f8ce6e to pick it up
Stress test (arm_debug) / Logical error "Unexpected exception in refresh scheduling" (STID 2508-34af) + Cannot start + Check failed chronic RefreshTask::doScheduling shutdown race (29 master + 20+ unrelated PRs/30d), unrelated to row-policy rename #105588 (ours, open)
Mergeable Check / PR aggregate of the above resolves once the merged-master CI run is green

Session id: cron:our-pr-ci-monitor:20260626-050000

/// Re-key row policies last. Preflight above already rejected the unrecoverable cases
/// (read-only storage, destination or transient-name collision); this still has its own
/// rollback for any residual error, and nothing after it can throw.
applyRowPolicyRekeys(access_control, row_policy_rekeys);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not safe for Replicated databases when the affected row policies live in ReplicatedAccessStorage. The table rename is applied by each replica's replicated-DDL worker, but this AccessControl update writes the row-policy name to the shared access storage as soon as the first replica reaches this line. A lagging replica can then receive rp ON new_name while its local table is still old_name; queries there lookup old_db.old_table and then old_db.*, so the still-old table is temporarily unfiltered until that replica processes the DDL.

Please either coordinate the row-policy move with the replicated database DDL so all replicas stay protected, keep both old and new names protected until every replica has processed the rename, or reject this operation for affected policies in replicated access storage / Replicated databases. A focused integration test with <user_directories><replicated> and a lagging database replica would cover the invariant.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confirmed. ReplicatedAccessStorage propagates the re-key on its own ZooKeeper path (global, immediate), while the table rename propagates through the Replicated database DDL queue (per-replica, async); the two channels are not synchronized, so a lagging replica briefly sees rp ON new_name with its local table still at old_name.

Scoping note: on master the policy does not follow the table at all, so a renamed table in a Replicated database stays unfiltered permanently. This PR reduces that to a transient window on a lagging replica only, so it does not introduce the exposure; fully closing the window is the multi-node face of the atomicity question already open at #108478 (comment).

On the fix: a fail-closed reject has to be pre-enqueue (DatabaseReplicated::checkQueryValid, next to the existing DETACH reject). A throw at this line runs inside replicated DDL execution after the entry is committed to ZooKeeper, so it cannot cleanly reject on a lagging replica. Of the three options, coordinate and dual-protect are architectural because ReplicatedAccessStorage policies are globally shared (there is no per-replica re-key and no all-replicas-caught-up barrier today); reject is the only option needing no new infrastructure but is backward-incompatible. Folding this into the same pending design decision rather than shipping a backward-incompatible reject unilaterally on a security PR.

@clickhouse-gh

clickhouse-gh Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.40% 85.40% +0.00%
Functions 92.60% 92.60% +0.00%
Branches 77.60% 77.60% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 112/152 (73.68%) | Lost baseline coverage: none · Uncovered code

Full report · Diff report

@clickhouse-gh

clickhouse-gh Bot commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

📊 Cloud Performance Report

✅ AI verdict: no_change — no significant changes across 38 queries analysed

This PR only changes RENAME TABLE / EXCHANGE / CREATE OR REPLACE DDL handling so that row policies follow the table (plus a new stateless test); the public and private diffs are identical and touch no query-execution code. The three flagged ClickBench improvements (Q4 -15.4%, Q15 -12.2%, Q28 -6.5%) exercise read-path SELECTs that never run any of the modified code, so these consistent-looking deltas are run-to-run variance, not a real effect of this change. All three have been downgraded to not-sure. No performance impact is expected from this PR.

clickbench

⚠️ 3 inconclusive

Flagged queries (3 of 43)
Query Verdict Baseline median (ms) PR median (ms) Change q-value Hint
⚠️ 4 not_sure 263 222 -15.4% <0.0001 PR only touches RENAME/CREATE-OR-REPLACE DDL + row-policy rekeying; Q4 SELECT path untouched, -15.4% is variance
⚠️ 15 not_sure 237 208 -12.2% <0.0001 DDL/row-policy-only change cannot speed up a ClickBench SELECT; -12.2% off-path, run-to-run variance
⚠️ 28 not_sure 6694 6262 -6.5% <0.0001 Rename/access-entity code does not run in this aggregation query; -6.5% unrelated to the diff

q-value = BH-FDR adjusted p; smaller is stronger evidence. MIRAI flags a query when q < fdr_q (default 0.10) — the value the verdict is based on.

tpch_adapted_1_official

🟢 No significant changes

Debug info
  • StressHouse run: 52302b65-c788-4a3c-a749-2247e507435f
  • MIRAI run: b5b40abf-4bf9-4ebf-9d41-ddeda543b17f
  • PR check IDs:
    • clickbench_287582_1782628168
    • clickbench_287596_1782628168
    • clickbench_287602_1782628168
    • tpch_adapted_1_official_287609_1782628168
    • tpch_adapted_1_official_287620_1782628168
    • tpch_adapted_1_official_287634_1782628168

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-bugfix Pull request with bugfix, not backported by default

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants