Allowlist `throwIf` upgrade-check errors from `04341_broken_mutation_part_log_flush` by alexey-milovidov · Pull Request #108856 · ClickHouse/ClickHouse · GitHub
Skip to content

Allowlist throwIf upgrade-check errors from 04341_broken_mutation_part_log_flush#108856

Merged
alexey-milovidov merged 7 commits into
masterfrom
fix-upgrade-check-04341-throwif
Jul 1, 2026
Merged

Allowlist throwIf upgrade-check errors from 04341_broken_mutation_part_log_flush#108856
alexey-milovidov merged 7 commits into
masterfrom
fix-upgrade-check-04341-throwif

Conversation

@alexey-milovidov

@alexey-milovidov alexey-milovidov commented Jun 29, 2026

Copy link
Copy Markdown
Member

The Upgrade check fails on many unrelated pull requests with Error message in clickhouse-server.log (see upgrade_error_messages.txt). The leaked error is:

Code: 395. DB::Exception: Value passed to 'throwIf' function is non-zero ... (FUNCTION_THROW_IF_VALUE_IS_NON_ZERO)

It originates from stateless tests that deliberately run a failing async mutation ALTER TABLE ... UPDATE <col> = <col> <op> throwIf(1):

  • 04341_broken_mutation_part_log_flush — plain MergeTree, so the error is emitted by MutatePlainMergeTreeTask, which is not covered by the existing MutateFromLogEntryTask allowlist entry. (This is the variant seen in the example report below.)
  • 02597_column_{update,delete,update_tricky_expression}_and_replicationReplicatedMergeTree, where MutateFromLogEntryTask catches only the first line and the wrapping MergeTreeBackgroundExecutor line leaks.

After the upgrade restart the broken mutation is retried in the background and logged to clickhouse-server.upgrade.log. This is the expected test-induced error, not a backward incompatibility — exactly the same class as the CANNOT_PARSE_TEXT mutation errors already allowlisted (issue #39174, tests 00834/01414/04338). throwIf only raises when a query explicitly calls it with a truthy argument, so matching the message substring can only match such a test, never an upgrade incompatibility; matching the message rather than the task type also covers the wrapping MergeTreeBackgroundExecutor line.

Master-wide on ~19 distinct pull requests over 2026-06-26..2026-06-29 (confirmed via play.clickhouse.com checks history). Example report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108657&sha=70df14b1ec107f06eb401ecb1bda859af68e52a6&name_0=PR&name_1=Upgrade%20check%20%28amd_release%29

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Documentation entry for user-facing changes

  • Documentation is written (mandatory for new features)

Version info

  • Merged into: 26.7.1.381

…part_log_flush`

The stateless test `04341_broken_mutation_part_log_flush` runs an
intentionally-failing async mutation
`ALTER TABLE ... UPDATE x = x + throwIf(1) WHERE 1` (expecting
`serverError UNFINISHED`) to assert that the failed `MutatePart`
`part_log` entry is flushed before the failure is reported to the client.

After the upgrade restart the broken mutation is retried in the
background and logs `Value passed to 'throwIf' function is non-zero`
(`FUNCTION_THROW_IF_VALUE_IS_NON_ZERO`, Code: 395) to
`clickhouse-server.upgrade.log`, which fails the `Upgrade check` with
`Error message in clickhouse-server.log (see upgrade_error_messages.txt)`.

This is the expected test-induced error, not a backward incompatibility
or a regression - exactly the same class as the `CANNOT_PARSE_TEXT`
mutation errors already allowlisted (issue #39174). Note the table uses
a plain `MergeTree`, so the error is emitted by `MutatePlainMergeTreeTask`
and is not covered by the existing `MutateFromLogEntryTask` exclusion.
`throwIf` only raises when a query explicitly calls it with a truthy
argument, so the message can only come from such a test, never from an
upgrade incompatibility; the narrow message substring also covers the
wrapping `MergeTreeBackgroundExecutor` line.

The failure is not caused by any product change - it surfaces on any
pull request that runs the Upgrade check. Observed master-wide on ~19
distinct pull requests over the four days 2026-06-26..2026-06-29
(confirmed via play.clickhouse.com `checks` history), e.g. PR #108657.

This mirrors the earlier allowlist additions for `04338` ('x'),
`01414` ('Hello') and `00834` ('a'/'b').

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-ci label Jun 29, 2026
…icated family

The narrow allowlisted message matches the `throwIf` mutation error from
both `04341_broken_mutation_part_log_flush` (plain `MergeTree`,
`MutatePlainMergeTreeTask`) and the
`02597_column_{update,delete,update_tricky_expression}_and_replication`
tests (`ReplicatedMergeTree`, where only the first `MutateFromLogEntryTask`
line was already excluded and the wrapping `MergeTreeBackgroundExecutor`
line leaked). Matching the message rather than the task type covers all
leaking lines. No behavior change beyond the comment; the rg entry is
unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread tests/docker_scripts/upgrade_runner.sh Outdated
alexey-milovidov and others added 2 commits June 30, 2026 00:45
Address review feedback from `clickhouse-gh`: the previous fixed-string
exclusion `-e "Value passed to 'throwIf' function is non-zero"` dropped every
`<Error>` line containing that default message globally, which is broader than
the invariant the filter protects. The default `throwIf` message is generic, so
a real upgrade regression in any other background or persisted object that uses
`throwIf` as an assertion would silently disappear from
`upgrade_error_messages.txt`.

Move the exclusion from the primary fixed-string list into the secondary regex
pipe, pairing the message with the mutation-task logger that emits it, exactly
like the other scoped exclusions in this file:

- `MutatePlainMergeTreeTask` + the message (plain `MergeTree`, test
  `04341_broken_mutation_part_log_flush`);
- `MergeTreeBackgroundExecutor` + the message (the wrapping background-executor
  line, present for both the plain and `ReplicatedMergeTree` cases).

The replicated first line is already covered by the existing
`MutateFromLogEntryTask` exclusion, so all leaking lines remain suppressed while
`throwIf` errors from any other context now surface.

Verified against the real leaking log from the example report
(https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=108657&sha=70df14b1ec107f06eb401ecb1bda859af68e52a6&name_0=PR&name_1=Upgrade%20check%20%28amd_release%29):
all 273 `MutatePlainMergeTreeTask`/`MergeTreeBackgroundExecutor` retry lines are
dropped, while a synthetic unrelated `throwIf` error (e.g. from a materialized
view assertion) still reaches `upgrade_error_messages.txt` - whereas the old
global filter masked it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Addressed the clickhouse-gh review (the ⚠️ Request changes verdict) in cff0cc1: narrowed the throwIf allowlist from a global fixed-string match to scoped regex pairs in the secondary pipe, requiring both the mutation-task logger (MutatePlainMergeTreeTask / MergeTreeBackgroundExecutor) AND the message together. The replicated first line stays covered by the existing MutateFromLogEntryTask exclusion.

I verified the change against the real leaking log from the example report (PR=108657, Upgrade check (amd_release)), which contains 273 throwIf retry lines:

  • All 273 MutatePlainMergeTreeTask/MergeTreeBackgroundExecutor retry lines are still suppressed.
  • A synthetic unrelated throwIf regression (<Error> StorageMaterializedView (db.mv_assert): ... Value passed to 'throwIf' function is non-zero ...) now reaches upgrade_error_messages.txt, whereas the old global filter masked it.

Also merged master (was 147 commits behind; no conflicts, master had not touched this file).

The only red checks are Stress test (amd_tsan) and Stress test (arm_tsan) — both the chronic master-wide "Hung check failed, possible deadlock found" flake (#107941, which the CI summary itself links), unrelated to this CI-script-only change. The Upgrade check (amd_release) that this PR targets passed.

Comment thread tests/docker_scripts/upgrade_runner.sh Outdated
@azat azat self-assigned this Jun 30, 2026
Per @azat's review, collapse the two scoped `throwIf` exclusions
(`MutatePlainMergeTreeTask.*` and `MergeTreeBackgroundExecutor.*`) into a
single global `grep -av -e "Value passed to 'throwIf' function is non-zero"`
entry in the secondary pipe.

`throwIf` is a user-level SQL function that only raises when a query
explicitly calls it with a truthy argument, so the default
`FUNCTION_THROW_IF_VALUE_IS_NON_ZERO` message can only ever originate from
such a test query, never from a background or internal assertion - the
ClickHouse internals never use `throwIf` for their own checks. Matching the
message alone is therefore safe to suppress globally, exactly like the
already-global `Code: 236 ... Cancelled mutating parts` message that the same
cancelled test mutations emit. Matching the message rather than the task type
also keeps the wrapping `MergeTreeBackgroundExecutor` line of the replicated
case covered in a single entry.

Rewrote the comment block to explain the global match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@alexey-milovidov

Copy link
Copy Markdown
Member Author

Pushed be05d1c addressing @azat's review: collapsed the two scoped throwIf exclusions (MutatePlainMergeTreeTask.* / MergeTreeBackgroundExecutor.*) into the single global grep -av -e "Value passed to 'throwIf' function is non-zero" he suggested, and rewrote the comment block to match.

The global match is safe because throwIf is a user-level SQL function that only raises when a query explicitly passes a truthy argument — the FUNCTION_THROW_IF_VALUE_IS_NON_ZERO default message can therefore only ever come from such a test query, never from a background or internal assertion. This is consistent with the already-global Code: 236 ... Cancelled mutating parts exclusion (line above) that the same cancelled test mutations emit. Verified locally that the single entry still drops the mutation-retry lines (both MutatePlainMergeTreeTask and MergeTreeBackgroundExecutor contexts) while a non-throwIf upgrade error still surfaces.

Resolved the review thread. The only red on the previous commit was Stress testHung check failed, possible deadlock found (#107941, the chronic master-wide flake the CI summary itself links), which is unrelated to this Upgrade check log-allowlist change.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

continue-pr pass (head be05d1c7): no code changes needed.

  • Change is correct and final. CI-script-only (tests/docker_scripts/upgrade_runner.sh, +16): a documented comment block plus a single global grep -av -e "Value passed to 'throwIf' function is non-zero" in the secondary allowlist pipe — the exact form @azat requested. The line sits just before the final grep -Fa "<Error>" (correct pipe), and the comment's analogy to the existing global Code: 236 ... Cancelled mutating parts exclusion is accurate. The safety argument holds: FUNCTION_THROW_IF_VALUE_IS_NON_ZERO (Code 395) is only ever raised by a SQL query that explicitly calls throwIf with a truthy argument — never by an internal/background path — so a global substring match cannot mask a real upgrade incompatibility.
  • AI Review: ✅ Approve at this head; no unresolved review threads (the clickhouse-gh Request-changes and @azat's review are both addressed and resolved).
  • The PR's target check Upgrade check (amd_release) passed — direct confirmation the throwIf retry lines no longer leak into upgrade_error_messages.txt.
  • Only red checks are Stress test (amd_tsan) and Stress test (arm_tsan), both Hung check failed, possible deadlock found = the chronic master-wide flake #107941 that the CI summary itself links — unrelated to this log-allowlist change. Triggered a rerun of the two failed stress jobs.
  • No re-merge: merge-base is 2026-06-29 23:39 (<1 day), MERGEABLE with no conflicts, and the only reds fail on master too. The author already merged fresh master at cff0cc1177e.

Blocked on the stress-flake rerun and human merge.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

continue-pr pass (head be05d1c7): no code changes; the PR is ready for merge.

  • Change is correct and final — CI-script-only (tests/docker_scripts/upgrade_runner.sh, +16): one global grep -av -e "Value passed to 'throwIf' function is non-zero" in the secondary allowlist pipe (the exact form @azat requested) plus its documenting comment. FUNCTION_THROW_IF_VALUE_IS_NON_ZERO (Code 395) is only ever raised by a SQL query that explicitly calls throwIf with a truthy argument, so a global substring match cannot mask a real upgrade incompatibility — consistent with the existing global Code: 236 ... Cancelled mutating parts exclusion.
  • AI Review: ✅ Approve at this head; no unresolved review threads; CH Inc sync ✅, Mergeable Check ✅, Style check ✅; and the PR's target Upgrade check (amd_release) passed.
  • Only reds are Stress test (amd_tsan) and (arm_tsan), both Hung check failed, possible deadlock found = the chronic master-wide flake #107941 the CI summary itself links. This change is provably off their code path: the Stress test jobs run stress_runner.sh, while upgrade_runner.sh (the only file this PR touches) is invoked solely by the Upgrade check job (ci/jobs/stress_job.py:161-164). Only the two TSan variants failed (all other stress variants are green) and the log tails show server-side QueryAnalyzer/executeQuery TSan stacks — a bash log-allowlist filter cannot produce a deadlock. They failed again on the previous rerun, so I re-triggered the two failed jobs once more.
  • No re-merge: merge-base 2026-06-29 23:39 (<1 day), MERGEABLE with no conflicts, and the only reds fail on master too.

Blocked solely on human merge.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

continue-pr pass — re-merged master (pushed 024655111fe).

Since the last pass, master gained new changes to the exact file this PR touches — tests/docker_scripts/upgrade_runner.sh — via the "Remove strange trash (again)" / "Narrow the upgrade-check filter to only the auto(N) quote rendering difference" commits (they landed on master's mainline after the previous cff0cc1177e merge). So the earlier "no re-merge, master hasn't touched this file" no longer held. I merged fresh master:

  • Clean merge, no conflicts. master's change is in the settings-changes-history section (~line 174, the auto(N) quote-rendering suppression), a different region from this PR's allowlist pipe (~line 502), so both changes coexist.
  • Net diff vs master is unchanged — still exactly the intended +16 lines: the documenting comment plus the single global grep -av -e "Value passed to 'throwIf' function is non-zero" in the secondary allowlist pipe.

State on the new head:

  • AI Review: ✅ Approve; 0 unresolved review threads (both threads resolved).
  • The PR's target check Upgrade check is what this fix addresses; the only reds on the prior head were Stress test (amd_tsan)/(arm_tsan)Hung check failed, possible deadlock found = the chronic master-wide flake Hung check failed, possible deadlock found #107941 the CI summary itself links. This change is provably off their code path: upgrade_runner.sh is invoked only by the Upgrade check job, never by the Stress test jobs, and a bash log-allowlist filter cannot produce a deadlock.

Fresh CI is running on 024655111fe. Blocked solely on human merge.

…4341-throwif

# Conflicts:
#	tests/docker_scripts/upgrade_runner.sh
@alexey-milovidov

Copy link
Copy Markdown
Member Author

continue-pr pass — re-merged master to resolve a conflict (pushed 883ff065381).

The PR had gone CONFLICTING/DIRTY: since the previous merge (024655111fe), master gained two more allowlist entries in the same secondary pipe this PR edits — the PostgreSQL background-reconnect filters (8b3cf74966f, #108998) and the 01155_old_mutation_parts_to_do Cannot parse string 'fail' as Int8 comment/entry (44a4c5a5c92, #108835). Both master and this branch insert a new grep -av line right after the DEADLOCK_AVOIDED line, so git flagged an overlap.

  • Resolved by keeping all three additions in the pipe (throwIf + the two PostgreSQL entries); order is irrelevant for independent grep -av filters. master's 01155 comment block is preserved and this PR's throwIf comment now follows it.
  • Net diff vs master is unchanged — still exactly the intended addition: the documenting comment block plus the single global grep -av -e "Value passed to 'throwIf' function is non-zero" in the secondary allowlist pipe (the form @azat requested). bash -n on the merged script passes.

State on the new head:

  • Now MERGEABLE (was CONFLICTING); fresh CI running on 883ff065381.
  • AI Review: ✅ Approve; 0 unresolved review threads (both clickhouse-gh and @azat threads resolved).
  • The PR's target check is Upgrade check; the only reds on prior heads were Stress test (amd_tsan)/(arm_tsan)Hung check failed, possible deadlock found = the chronic master-wide flake Hung check failed, possible deadlock found #107941 the CI summary itself links, provably off this code path (upgrade_runner.sh is invoked only by the Upgrade check job, never by Stress test).

Blocked on fresh CI and human merge.

@alexey-milovidov

Copy link
Copy Markdown
Member Author

continue-pr pass (head 883ff065381): no code changes; the PR is correct and ready. The only remaining red is CH Inc sync, and it is provably off this PR's code path.

  • Public CI is green, including the PR's target check Upgrade check (amd_release) (the check this fix addresses), Mergeable Check, Style check, Code Review, all Build/Stress test variants. MERGEABLE, 0 unresolved review threads, AI Review ✅. The prior TSan Hung-check flake (Hung check failed, possible deadlock found #107941) no longer recurs.

  • CH Inc sync = tests failed on sync PR ClickHouse/clickhouse-private#62669 (merges cleanly, MERGEABLE). The sole failing job is Stress test (arm_asan_ubsan, SharedCatalog, meta in keeper), which fails with a LOGICAL_ERROR in the Cloud coordination path, not a test assertion:

    Logical error: Current component is empty, please set it for your scope using Coordination::setCurrentComponent (STID: 2884-409a)
    

    This is the enforce_keeper_component_tracking server setting (added in 5abb7bee938, default false upstream but enabled under the SharedCatalog, meta in keeper config) firing because some ZooKeeper-request path in that config lacks a Coordination::setCurrentComponent guard — a trunk coverage gap in Cloud/SharedCatalog code.

  • This cannot be caused by this PR. The entire diff is +16 lines in one bash file, tests/docker_scripts/upgrade_runner.sh, which is invoked only by the Upgrade check job (never by Stress test, which runs stress_runner.sh). A log-allowlist grep -av filter cannot produce a Keeper coordination LOGICAL_ERROR, and the public Upgrade check passed.

  • No re-merge: merge-base is today (e9db786beec, ~05:42Z, < 1 day), MERGEABLE, no conflicts.

  • Action taken: re-triggered the failed private stress job (the SharedCatalog stress runs a random query mix, so a rerun may not re-hit the unguarded path).

@groeneai, please investigate the failure https://github.com/ClickHouse/clickhouse-private/actions/runs/28506310479/job/84499776097 and provide a fix in a separate PR — it looks like a missing Coordination::setCurrentComponent guard on a ZooKeeper-request path exercised under the SharedCatalog, meta in keeper stress config (setting enforce_keeper_component_tracking). I did not find an existing tracking issue; if a fix is already in progress, please link it here.

Blocked on the private stress rerun / a separate clickhouse-private fix, and human merge.

@groeneai

groeneai commented Jul 1, 2026

Copy link
Copy Markdown
Contributor

@alexey-milovidov Investigated STID 2884-409a. Summary of what I found and where I'm blocked.

Mechanism (as you diagnosed): Coordination::ZooKeeper::pushRequest (src/Common/ZooKeeper/ZooKeeperImpl.cpp:~1524) throws this LOGICAL_ERROR when enforce_keeper_component_tracking=true and the thread issuing the ZK request has no active ComponentGuard (thread-local current_component empty). So the gap is a ZK-request path reached with no Coordination::setCurrentComponent(...) anywhere above it on the stack.

Public surface is already guarded. The setting is enabled in public CI too (tests/config/install.sh for stateless, tests/integration/helpers/cluster.py for integration), so any commonly-hit public path would already fail publicly. I built master (2a3a502d2ed) with the setting on + embedded Keeper and probed 35+ ZK-touching paths: system.{replicas,replication_queue,part_moves_between_shards,mutations,zookeeper,zookeeper_connection,zookeeper_info,distributed_ddl_queue}, SYSTEM SYNC/RESTART/RESTORE REPLICA, OPTIMIZE, ALTER ... UPDATE/DELETE, lightweight DELETE, DatabaseReplicated create/insert/rename/drop/exchange, background mutation finalize, BACKUP, CREATE USER/NAMED COLLECTION. Zero un-guarded component-tracking failures. (system.part_moves_between_shards is guarded via PartMovesBetweenShardsOrchestrator::getEntries.)

Only public reproduction of this error in CIDB (60d) is on an unrelated open PR adding system.s3(azure)_queue_metadata (#108522). Its new fillData calls tryGetChildren inside a ZooKeeperRetriesControl::retryLoop on the TCPHandler query thread with no guard, which is the same class of gap. Left for that PR's author since it is that PR's own new code, but it confirms the fix pattern.

Where I'm blocked. I can't reach the private run (clickhouse-private/actions/runs/28506310479 returns 404 for me), and SharedCatalog is Cloud-only so I can't reproduce the failing config in an open-source build. Given the public surface is clean, STID 2884-409a is most likely an un-guarded ZK path in the private SharedCatalog/meta-in-keeper code rather than in public code.

Ask: could you paste the server-side stack trace for the LOGICAL_ERROR from that job (the frames below ZooKeeper::pushRequest)? With it I can pinpoint the exact method:

  • if the un-guarded method is in shared public code (a ZK helper SharedCatalog also calls), I will open a public fix PR right away;
  • if it is in private SharedCatalog code, the one-line fix is to add at the top of the method (or the nearest common ancestor on that stack that owns the ZK operation):
    auto component_guard = Coordination::setCurrentComponent("SharedCatalog::<method>");

No existing tracking issue or fix PR found for this STID.

Unrelated finding while probing: ALTER TABLE x MOVE PART 'p' TO SHARD '/path' aborts a debug server with Inconsistent AST formatting (the query formats back without the SHARD keyword, failing the round-trip check in executeQuery.cpp:1247), reproducible on 2a3a502d2ed. I'll track that separately.

Session id: cron:clickhouse-worker-slot-0:20260701-105800

@alexey-milovidov

Copy link
Copy Markdown
Member Author

@alexey-milovidov alexey-milovidov left a comment

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trivial.

@alexey-milovidov alexey-milovidov added this pull request to the merge queue Jul 1, 2026
Merged via the queue into master with commit 91d6e70 Jul 1, 2026
174 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants