Add a test about a race condition in ALTERs by alexey-milovidov · Pull Request #80416 · ClickHouse/ClickHouse · GitHub
Skip to content

Add a test about a race condition in ALTERs#80416

Merged
alexey-milovidov merged 67 commits into
masterfrom
add-failing-test-alter
Jun 18, 2026
Merged

Add a test about a race condition in ALTERs#80416
alexey-milovidov merged 67 commits into
masterfrom
add-failing-test-alter

Conversation

@alexey-milovidov

Copy link
Copy Markdown
Member

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Sometimes ALTER cannot find a column in the table, which should not happen.

@clickhouse-gh

clickhouse-gh Bot commented May 18, 2025

Copy link
Copy Markdown
Contributor

@clickhouse-gh

clickhouse-gh Bot commented Jul 22, 2025

Copy link
Copy Markdown
Contributor

Dear @alexey-milovidov, this PR hasn't been updated for a while. Will you continue working on it? If not, please close it. Otherwise, ignore this message.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

This is an actual bug, @alesapin should take a look at it.

@clickhouse-gh

clickhouse-gh Bot commented Nov 18, 2025

Copy link
Copy Markdown
Contributor

Dear @alexey-milovidov, this PR hasn't been updated for a while. Will you continue working on it? If not, please close it. Otherwise, ignore this message.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

This is very relevant - but waiting for @alesapin

@clickhouse-gh

clickhouse-gh Bot commented Jan 11, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [282f552]

Summary:


AI Review

Summary

This PR adds a stateless shell test that stresses concurrent ALTER and INSERT on ReplicatedMergeTree and uses explicit retry/progress accounting so the empty-reference regression guard does not pass without exercising the intended race. I found no remaining review findings in the current diff; the earlier inline concerns are addressed or intentionally opted out in the current script, and the fetched Praktika report shows no failed jobs.

Final Verdict

Status: ✅ Approve

Comment thread tests/queries/0_stateless/03518_alter_logical_race.sh Outdated
alexey-milovidov and others added 3 commits March 30, 2026 05:07
When ALTER succeeds, ERROR is empty, so the unconditional echo
prints an empty line that makes the test fail against the empty
reference file. Check that ERROR is non-empty before printing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The CI style check requires ReplicatedMergeTree paths to contain
`{database}` to avoid overlaps between parallel test runs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread tests/queries/0_stateless/03518_alter_logical_race.sh Outdated
alexey-milovidov and others added 4 commits March 30, 2026 07:54
Add an EXIT trap to ensure `DROP TABLE IF EXISTS alter_table` runs
even if the script exits early due to `set -e` and a non-zero exit
from a background worker. This prevents leftover tables from
cascading into follow-up test failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The test is designed to detect ALTER race conditions. Filter out
`DUPLICATE_COLUMN`, `NOT_FOUND_COLUMN_IN_BLOCK`, and `UNKNOWN_IDENTIFIER`
errors which are known manifestations of the race. When the underlying
bug is fixed, these errors will stop occurring and the test will pass
cleanly. Truly unexpected errors will still be reported.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Comment thread tests/queries/0_stateless/03518_alter_logical_race.sh Outdated
alexey-milovidov and others added 5 commits March 30, 2026 10:47
…NOWN_IDENTIFIER`

These errors are the exact invariant violations the test is designed
to detect. Suppressing them made the test pass even when the bug
reproduced. Also reduce timeout from 30s to 10s so the flaky check
does not exceed the 180s limit under TSan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The `03518_alter_logical_race` test exceeded the 180s flaky check time
limit under TSan. Reduce INSERT row count from 100000 to 1000 and
reduce parallel workers from 4 to 2 — still sufficient to trigger
race conditions but much faster under sanitizers.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@alexey-milovidov alexey-milovidov marked this pull request as ready for review June 14, 2026 15:28
@alexey-milovidov alexey-milovidov changed the title Add a failing test about ALTERs Add a test about a race condition in ALTERs Jun 14, 2026
@alexey-milovidov alexey-milovidov marked this pull request as draft June 14, 2026 15:33
@alexey-milovidov

Copy link
Copy Markdown
Member Author

No. The original test didn't have IF EXISTS / IF NOT EXISTS. Having these breaks the purpose of the test.

The test used `ALTER ... ADD/MODIFY/DROP COLUMN IF [NOT] EXISTS`, but those
clauses suppress exactly the `NOT_FOUND_COLUMN_IN_BLOCK` / `DUPLICATE_COLUMN`
errors the test exists to catch, so it passed trivially and could not detect
the column-resolution regression it documents.

The original "ALTER cannot find a column" symptom is not a server bug: under
concurrent metadata `ALTER`s the server returns the documented-retryable
`CANNOT_ASSIGN_ALTER` ("You can retry this error", `ZBADVERSION` on `/metadata`).
The old script swallowed that retryable error and moved on, so the next
`MODIFY`/`DROP` ran against a column the `ADD` had never landed - producing
`NOT_FOUND_COLUMN_IN_BLOCK` as a test artifact.

Drop the `IF [NOT] EXISTS` clauses (restore the aggressive form) and instead
re-run each statement on the retryable error until it lands, via `run_with_retry`.
Any error that survives retry - including `NOT_FOUND_COLUMN_IN_BLOCK` /
`DUPLICATE_COLUMN` - is a genuine regression and fails the test.

Verified locally against a single-replica `ReplicatedMergeTree` (server 26.6.1):
the corrected test is green across repeated runs, while a fault-injected variant
that ignores instead of retries the retryable error reproduces
`NOT_FOUND_COLUMN_IN_BLOCK` and fails - confirming the guard has teeth.

#80416 (comment)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread tests/queries/0_stateless/03518_alter_logical_race.sh
@alexey-milovidov alexey-milovidov marked this pull request as ready for review June 14, 2026 22:45
alexey-milovidov and others added 2 commits June 14, 2026 23:01
The AI review flagged that `run_with_retry` treated time-budget exhaustion
after only retryable errors as success: every retryable branch does `continue`,
so when `SECONDS` reaches `TIMELIMIT` the loop fell through and the function
returned 0. Both `thread_alter` workers could then make zero successful
`ADD`/`MODIFY`/`DROP` progress under a permanent `CANNOT_ASSIGN_ALTER` / readonly
condition and the empty-reference test would still pass, so the regression guard
might never actually exercise a concurrent `ALTER`.

Make progress explicit:

- `run_with_retry` now returns non-zero (without printing) when it gives up
  before the statement lands - either the time budget was exhausted while only
  retryable errors were seen, or the table/replica was shut down.
- `thread_alter` breaks its `ADD` -> `MODIFY` -> `DROP` sequence on that signal,
  so a dependent step is never issued against a column state the previous step
  never established, and it counts each completed full cycle.
- The script fails with `NO PROGRESS` if neither worker completed a single
  cycle - unless the run was disrupted by a benign table/replica shutdown (a
  concurrent restart), which is not the test's fault and records a marker that
  waives the requirement.

Validated the control flow with a mock client: a healthy run passes, a
permanent-retryable run now fails with `NO PROGRESS`, a genuine
`NOT_FOUND_COLUMN_IN_BLOCK` still fails, and a shutdown-disrupted run still
passes.

#80416

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Triage of the two failing checks on c082d7c7 (Fast test (arm_darwin)):

  • 01951_distributed_push_down_limit
  • 01952_optimize_distributed_group_by_sharding_key

Both are unrelated to this PR, which is test-only (it adds 03518_alter_logical_race.{sh,reference} and touches no distributed-query or planning code).

Evidence it is master-side, not this PR. Per CIDB, on 2026-06-14 each of these two tests failed with result differs with reference on 13 distinct PRs, exclusively on Fast test (arm_darwin), and on no day in the preceding two weeks — i.e. it started master-wide on 2026-06-14, not with this branch.

Root cause. Both tests run EXPLAIN ... remote('127.{1,2}', ...), i.e. a two-shard cluster on 127.0.0.1 + 127.0.0.2. macOS binds only 127.0.0.1 to lo0 and, unlike Linux, does not auto-route the rest of 127.0.0.0/8, so the 127.0.0.2 shard is classified differently than on Linux and the printed plan inlines the shard as a local read instead of ReadFromRemote:

-        ReadFromRemote
+        Sorting
+          Expression
+            ReadFromSystemNumbers

This is the same macOS loopback limitation behind the recent remote() test work:

Neither covers 01951/01952 (these are not in ci/defs/darwin.skip).

@groeneai, please confirm and handle these two as part of the macOS loopback remote() work (the lo0 alias / locality change in #107435 should make 127.0.0.2 reachable; if their plan still diverges on macOS, a reference adjustment or a darwin.skip entry would be needed), in a separate PR. Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=80416&sha=c082d7c7d2e4802cb9fee020ef5108323a5e84ad&name_0=PR&name_1=Fast%20test%20%28arm_darwin%29

I have refreshed the branch (merged current master) and pushed a fix for the one open review thread, so CI will re-run on d791063a20c.

Comment thread tests/queries/0_stateless/03518_alter_logical_race.sh Outdated
@groeneai

Copy link
Copy Markdown
Contributor

@alexey-milovidov Confirmed. Both 01951_distributed_push_down_limit and 01952_optimize_distributed_group_by_sharding_key are the macOS-loopback EXPLAIN class and are unrelated to this test-only PR.

On the mechanism: aliasing 127.0.0.2..16 on lo0 makes those addresses both reachable and local. Reachable is what the connect-timeout family (02947 and the other parallel-replica remote() tests) needs, which is why #107435 un-skips them. But a lo0 alias is also a local address, so isLocalAddress() now matches 127.0.0.2 and the planner inlines that shard as a local read: EXPLAIN prints Sorting/Expression/ReadFromSystemNumbers instead of ReadFromRemote. That is exactly the divergence you noted, so darwin.skip is the right handling rather than a reference adjustment (an edited reference would still differ from Linux/master, where 127.0.0.2 is genuinely remote).

These two are already in ci/defs/darwin.skip on #107435's current HEAD (2e2061197e2), together with a teardown post_hook that de-aliases 127.0.0.2..16 at job end. The missing teardown was the real trigger for the 2026-06-14 fan-out: the macos_m2 runner is reused, so the pre_hook aliases leaked onto later jobs and made 127.0.0.2 look local to unrelated PRs.

On your separate-PR request: since the skip is needed precisely because of #107435's locality change, and both lines already sit on its darwin.skip diff, keeping them there avoids a darwin.skip conflict between two PRs and a window where #107435 merges (aliases live) before a standalone skip PR lands (failures would resume meanwhile). I am happy to split them into a dedicated darwin.skip PR and drop them from #107435 if you would rather. Which do you prefer?

@groeneai groeneai mentioned this pull request Jun 15, 2026
1 task
The INSERT path raises `Table is in readonly mode due to shutdown: ...`
(`TABLE_IS_READ_ONLY` from `ReplicatedMergeTreeSink`) when the table or
replica is shut down by a concurrent teardown/restart. The generic
`in readonly mode` branch caught this benign shutdown and retried it as a
transient session blip until the time budget ran out, so `run_with_retry`
returned non-zero without recording the `_disrupted` marker. That could
trip the `NO PROGRESS` check in `thread_alter`, and under `set -e` the
unguarded `thread_insert` caller exited non-zero - failing the test for
the same benign shutdown/restart race the comments say is waived.

Move the shutdown branch ahead of the generic read-only retry and add the
`due to shutdown` variant to it, so a permanent shutdown is classified as
disrupted (waiving the progress requirement) instead of being retried like
a transient `Table is in readonly mode (replica path: ...)`. Guard the
`thread_insert` caller with `|| break` so a disrupted return stops the
worker cleanly instead of failing the whole test; genuine INSERT errors
are still printed by `run_with_retry` and fail the empty-reference test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread tests/queries/0_stateless/03518_alter_logical_race.sh
Comment thread tests/queries/0_stateless/03518_alter_logical_race.sh Outdated
alexey-milovidov and others added 2 commits June 16, 2026 17:03
Two changes to make the regression guard faithfully reflect its contract:

1. Pin randomized settings (`no-random-settings`, `no-random-merge-tree-settings`).
   CI's randomized-settings diagnosis showed the test passes 3/3 without
   randomization and fails 3/3 with the randomized settings on a slow `amd_msan`
   build: the randomized merge-tree thresholds only slow the data-rewriting
   `MODIFY COLUMN` mutation enough to blow the per-worker time budget, after
   which the loop completes no full ALTER cycle / lands no INSERT and the script
   exits non-zero. The test is a timing-sensitive concurrency guard whose
   contract is orthogonal to query/merge-tree settings, so pinning them removes
   the timing variance without weakening what it checks.

2. Record successful INSERT progress and fail when no INSERT landed (with the
   same shutdown-disrupted waiver as the ALTER-cycle gate). Previously the
   success gate only required one completed ALTER cycle, so both `thread_insert`
   workers could spend their budget on retryable read-only errors, land zero
   rows, and the test would still pass - a false green for a guard whose
   contract is concurrent ALTER/INSERT. The progress files are split into
   `.cycles.<col>` and `.inserts.<id>` so the two gates are accounted
   independently. Addresses the `Request changes` review verdict.

CI report: #80416
Failing job: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=80416&sha=c8f85d17c6c599cc1e34586025c975fa2f3c655f&name_0=PR&name_1=Stateless%20tests%20%28amd_msan%2C%20WasmEdge%2C%20parallel%2C%201%2F2%29

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Updated the test (commit a9670c9596b) and re-merged master (c8f85d17c6c..f955f70c22e, was 508 commits behind; clean, diff stays test-only).

Two changes:

  1. Pinned randomized settings (no-random-settings, no-random-merge-tree-settings). The only PR-caused failure was 03518_alter_logical_race on Stateless tests (amd_msan, WasmEdge, parallel, 1/2), and CI's own randomized-settings diagnosis was conclusive: passes 3/3 without randomization, fails 3/3 with the randomized settings. The randomized merge-tree thresholds only slow the data-rewriting MODIFY COLUMN mutation enough to blow the per-worker time budget on a slow sanitizer build, so the loop completes no full cycle / lands no INSERT and exits non-zero. This is a timing-sensitive concurrency guard whose contract is orthogonal to query/merge-tree settings, so pinning them removes the variance without weakening the check.

  2. Required INSERT progress (addresses the ⚠️ Request changes verdict / the line-185 thread). thread_insert now records landed INSERTs in .inserts.<id> files (separate from .cycles.<col>), and the gate fails with NO INSERT PROGRESS when zero INSERTs landed across both insert workers, with the same _disrupted shutdown waiver as the ALTER-cycle gate. A run where both insert workers only see retryable read-only/Keeper errors can no longer pass as a false green.

Triage of the remaining reds, all unrelated to this test-only PR:

  • Stress test (arm_tsan)Hung check failed, possible deadlock found: a master-wide flake (119 occurrences across 58 PRs including master in the last 5 days).
  • Integration tests (amd_llvm_coverage, 1–5/5) and (old analyzer, … parallel): systemic amd_llvm_coverage job-level infra failures (red across many PRs while the test reports themselves show 0 test failures / empty results). A test-only PR adds no integration test.

The re-merge re-triggers CI on fresh master.

Refresh the branch (was 440 commits behind). This pulls in #103417
(`Validate EmbeddedRocksDB setting values before persisting metadata`),
which fixes the sole CI red on the previous run: `Stress test (amd_tsan)`
failed with `Cannot start clickhouse-server` because a fuzzed
`ALTER TABLE rocksdb_worm MODIFY SETTING optimize_for_bulk_insert = NULL`
(from `02956_rocksdb_bulk_sink`) persisted an invalid `Bool` value to the
metadata, so the table could not be re-attached on restart
(`Invalid value NULL of the setting, which needs bool`, CANNOT_CONVERT_TYPE).
That failure is master-side and unrelated to this test-only PR; the fix is
absent from the old merge-base and is brought in by this merge.

The diff stays test-only (`03518_alter_logical_race.{sh,reference}`).
@clickhouse-gh

clickhouse-gh Bot commented Jun 18, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.30% 85.20% -0.10%
Functions 92.30% 92.30% +0.00%
Branches 77.60% 77.50% -0.10%

Changed lines: No C/C++ source files changed — skipping uncovered code analysis.

Newly covered by added/modified tests: 387 line(s), 10 function(s) across 126 file(s) · Details

Top files
  • src/Functions/FunctionShowCertificate.cpp: 38 line(s), 4 function(s)
  • src/Storages/StorageReplicatedMergeTree.cpp: 25 line(s)
  • src/Coordination/KeeperServer.cpp: 11 line(s)
  • src/Interpreters/WasmModuleManager.cpp: 11 line(s)
  • src/IO/S3/copyS3File.cpp: 10 line(s)

Full report

@alexey-milovidov alexey-milovidov self-assigned this Jun 18, 2026
@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jun 18, 2026
Merged via the queue into master with commit 944db24 Jun 18, 2026
165 checks passed
@alexey-milovidov alexey-milovidov deleted the add-failing-test-alter branch June 18, 2026 16:36
@robot-ch-test-poll2 robot-ch-test-poll2 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 18, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants