Add a test about a race condition in ALTERs#80416
Conversation
|
Dear @alexey-milovidov, this PR hasn't been updated for a while. Will you continue working on it? If not, please close it. Otherwise, ignore this message. |
|
This is an actual bug, @alesapin should take a look at it. |
|
Dear @alexey-milovidov, this PR hasn't been updated for a while. Will you continue working on it? If not, please close it. Otherwise, ignore this message. |
|
This is very relevant - but waiting for @alesapin |
|
Workflow [PR], commit [282f552] Summary: ✅ AI ReviewSummaryThis PR adds a stateless shell test that stresses concurrent Final VerdictStatus: ✅ Approve |
When ALTER succeeds, ERROR is empty, so the unconditional echo prints an empty line that makes the test fail against the empty reference file. Check that ERROR is non-empty before printing. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The CI style check requires ReplicatedMergeTree paths to contain
`{database}` to avoid overlaps between parallel test runs.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add an EXIT trap to ensure `DROP TABLE IF EXISTS alter_table` runs even if the script exits early due to `set -e` and a non-zero exit from a background worker. This prevents leftover tables from cascading into follow-up test failures. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The test is designed to detect ALTER race conditions. Filter out `DUPLICATE_COLUMN`, `NOT_FOUND_COLUMN_IN_BLOCK`, and `UNKNOWN_IDENTIFIER` errors which are known manifestations of the race. When the underlying bug is fixed, these errors will stop occurring and the test will pass cleanly. Truly unexpected errors will still be reported. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…NOWN_IDENTIFIER` These errors are the exact invariant violations the test is designed to detect. Suppressing them made the test pass even when the bug reproduced. Also reduce timeout from 30s to 10s so the flaky check does not exceed the 180s limit under TSan. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The `03518_alter_logical_race` test exceeded the 180s flaky check time limit under TSan. Reduce INSERT row count from 100000 to 1000 and reduce parallel workers from 4 to 2 — still sufficient to trigger race conditions but much faster under sanitizers. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
No. The original test didn't have IF EXISTS / IF NOT EXISTS. Having these breaks the purpose of the test. |
The test used `ALTER ... ADD/MODIFY/DROP COLUMN IF [NOT] EXISTS`, but those
clauses suppress exactly the `NOT_FOUND_COLUMN_IN_BLOCK` / `DUPLICATE_COLUMN`
errors the test exists to catch, so it passed trivially and could not detect
the column-resolution regression it documents.
The original "ALTER cannot find a column" symptom is not a server bug: under
concurrent metadata `ALTER`s the server returns the documented-retryable
`CANNOT_ASSIGN_ALTER` ("You can retry this error", `ZBADVERSION` on `/metadata`).
The old script swallowed that retryable error and moved on, so the next
`MODIFY`/`DROP` ran against a column the `ADD` had never landed - producing
`NOT_FOUND_COLUMN_IN_BLOCK` as a test artifact.
Drop the `IF [NOT] EXISTS` clauses (restore the aggressive form) and instead
re-run each statement on the retryable error until it lands, via `run_with_retry`.
Any error that survives retry - including `NOT_FOUND_COLUMN_IN_BLOCK` /
`DUPLICATE_COLUMN` - is a genuine regression and fails the test.
Verified locally against a single-replica `ReplicatedMergeTree` (server 26.6.1):
the corrected test is green across repeated runs, while a fault-injected variant
that ignores instead of retries the retryable error reproduces
`NOT_FOUND_COLUMN_IN_BLOCK` and fails - confirming the guard has teeth.
#80416 (comment)
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The AI review flagged that `run_with_retry` treated time-budget exhaustion after only retryable errors as success: every retryable branch does `continue`, so when `SECONDS` reaches `TIMELIMIT` the loop fell through and the function returned 0. Both `thread_alter` workers could then make zero successful `ADD`/`MODIFY`/`DROP` progress under a permanent `CANNOT_ASSIGN_ALTER` / readonly condition and the empty-reference test would still pass, so the regression guard might never actually exercise a concurrent `ALTER`. Make progress explicit: - `run_with_retry` now returns non-zero (without printing) when it gives up before the statement lands - either the time budget was exhausted while only retryable errors were seen, or the table/replica was shut down. - `thread_alter` breaks its `ADD` -> `MODIFY` -> `DROP` sequence on that signal, so a dependent step is never issued against a column state the previous step never established, and it counts each completed full cycle. - The script fails with `NO PROGRESS` if neither worker completed a single cycle - unless the run was disrupted by a benign table/replica shutdown (a concurrent restart), which is not the test's fault and records a marker that waives the requirement. Validated the control flow with a mock client: a healthy run passes, a permanent-retryable run now fails with `NO PROGRESS`, a genuine `NOT_FOUND_COLUMN_IN_BLOCK` still fails, and a shutdown-disrupted run still passes. #80416 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Triage of the two failing checks on
Both are unrelated to this PR, which is test-only (it adds Evidence it is master-side, not this PR. Per CIDB, on 2026-06-14 each of these two tests failed with Root cause. Both tests run This is the same macOS loopback limitation behind the recent
Neither covers @groeneai, please confirm and handle these two as part of the macOS loopback I have refreshed the branch (merged current |
|
@alexey-milovidov Confirmed. Both On the mechanism: aliasing These two are already in On your separate-PR request: since the skip is needed precisely because of #107435's locality change, and both lines already sit on its |
The INSERT path raises `Table is in readonly mode due to shutdown: ...` (`TABLE_IS_READ_ONLY` from `ReplicatedMergeTreeSink`) when the table or replica is shut down by a concurrent teardown/restart. The generic `in readonly mode` branch caught this benign shutdown and retried it as a transient session blip until the time budget ran out, so `run_with_retry` returned non-zero without recording the `_disrupted` marker. That could trip the `NO PROGRESS` check in `thread_alter`, and under `set -e` the unguarded `thread_insert` caller exited non-zero - failing the test for the same benign shutdown/restart race the comments say is waived. Move the shutdown branch ahead of the generic read-only retry and add the `due to shutdown` variant to it, so a permanent shutdown is classified as disrupted (waiving the progress requirement) instead of being retried like a transient `Table is in readonly mode (replica path: ...)`. Guard the `thread_insert` caller with `|| break` so a disrupted return stops the worker cleanly instead of failing the whole test; genuine INSERT errors are still printed by `run_with_retry` and fail the empty-reference test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Two changes to make the regression guard faithfully reflect its contract: 1. Pin randomized settings (`no-random-settings`, `no-random-merge-tree-settings`). CI's randomized-settings diagnosis showed the test passes 3/3 without randomization and fails 3/3 with the randomized settings on a slow `amd_msan` build: the randomized merge-tree thresholds only slow the data-rewriting `MODIFY COLUMN` mutation enough to blow the per-worker time budget, after which the loop completes no full ALTER cycle / lands no INSERT and the script exits non-zero. The test is a timing-sensitive concurrency guard whose contract is orthogonal to query/merge-tree settings, so pinning them removes the timing variance without weakening what it checks. 2. Record successful INSERT progress and fail when no INSERT landed (with the same shutdown-disrupted waiver as the ALTER-cycle gate). Previously the success gate only required one completed ALTER cycle, so both `thread_insert` workers could spend their budget on retryable read-only errors, land zero rows, and the test would still pass - a false green for a guard whose contract is concurrent ALTER/INSERT. The progress files are split into `.cycles.<col>` and `.inserts.<id>` so the two gates are accounted independently. Addresses the `Request changes` review verdict. CI report: #80416 Failing job: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=80416&sha=c8f85d17c6c599cc1e34586025c975fa2f3c655f&name_0=PR&name_1=Stateless%20tests%20%28amd_msan%2C%20WasmEdge%2C%20parallel%2C%201%2F2%29 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Updated the test (commit Two changes:
Triage of the remaining reds, all unrelated to this test-only PR:
The re-merge re-triggers CI on fresh |
Refresh the branch (was 440 commits behind). This pulls in #103417 (`Validate EmbeddedRocksDB setting values before persisting metadata`), which fixes the sole CI red on the previous run: `Stress test (amd_tsan)` failed with `Cannot start clickhouse-server` because a fuzzed `ALTER TABLE rocksdb_worm MODIFY SETTING optimize_for_bulk_insert = NULL` (from `02956_rocksdb_bulk_sink`) persisted an invalid `Bool` value to the metadata, so the table could not be re-attached on restart (`Invalid value NULL of the setting, which needs bool`, CANNOT_CONVERT_TYPE). That failure is master-side and unrelated to this test-only PR; the fix is absent from the old merge-base and is brought in by this merge. The diff stays test-only (`03518_alter_logical_race.{sh,reference}`).
LLVM Coverage ReportChanged lines: No C/C++ source files changed — skipping uncovered code analysis. Newly covered by added/modified tests: 387 line(s), 10 function(s) across 126 file(s) · Details Top files
|

Changelog category (leave one):
Sometimes ALTER cannot find a column in the table, which should not happen.