Allowlist remote database engine connection errors in upgrade check by groeneai · Pull Request #108560 · ClickHouse/ClickHouse · GitHub
Skip to content

Allowlist remote database engine connection errors in upgrade check#108560

Merged
alexey-milovidov merged 3 commits into
ClickHouse:masterfrom
groeneai:groeneai/upgrade-check-postgresql-cleaner-noise
Jul 2, 2026
Merged

Allowlist remote database engine connection errors in upgrade check#108560
alexey-milovidov merged 3 commits into
ClickHouse:masterfrom
groeneai:groeneai/upgrade-check-postgresql-cleaner-noise

Conversation

@groeneai

@groeneai groeneai commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

...

Description

The Upgrade check asserts the server log contains no <Error> lines after the upgrade restart. It started failing with five lines from a single test, originating from two remote database engines:

<Error> mysqlxx::Pool: Failed to connect to MySQL (fake_db@192.0.2.1:3306 as user user): mysqlxx::ConnectionFailed
<Error> Application: Connection to mysql failed 1 times
<Error> DatabaseMySQL: Code: 279. DB::Exception: Connections to mysql failed: fake_db@192.0.2.1:3306 as user user, ERROR 2002 ... (ALL_CONNECTION_TRIES_FAILED)
<Error> PostgreSQLConnectionPool: Connection error: connection to server at "192.0.2.1", port 5432 failed: timeout expired
<Error> void DB::DatabasePostgreSQL::removeOutdatedTables(): Code: 614. DB::Exception: Try 2. Connection to `192.0.2.1:5432` failed with error: connection to server at "192.0.2.1", port 5432 failed: timeout expired

Root cause. Test 04210_show_remote_databases_in_system_tables creates two remote database engines pointing at the deliberately unreachable host 192.0.2.1 (RFC5737 TEST-NET-1, a documentation address):

  • CREATE DATABASE ... ENGINE = PostgreSQL('192.0.2.1:5432', ...)
  • ATTACH DATABASE ... ENGINE = MySQL('192.0.2.1:3306', ...) SETTINGS connect_timeout = 1, connection_max_tries = 1

For PostgreSQL, DatabasePostgreSQL::loadStoredObjects activates the removeOutdatedTables cleaner task, whose periodic pool->get() connects unconditionally and times out against the unreachable host, logging <Error> for each retry, including right after the upgrade restart reloads the persisted engine.

For MySQL, the engine probes the server while loading the persisted object after the upgrade restart, logging <Error> from mysqlxx::Pool, the PoolWithFailover retry logger, and the DatabaseMySQL ALL_CONNECTION_TRIES_FAILED exception.

The upgrade-test environment runs the stateless suite and has no real PostgreSQL or MySQL server, so any connection error against this fixture host is noise rather than a compatibility regression. Genuine PostgreSQL/MySQL regressions are covered by integration tests with a live server.

Fix. Allowlist these lines in tests/docker_scripts/upgrade_runner.sh. Each matcher requires the component AND the connection-failure symptom together, so other DatabasePostgreSQL/DatabaseMySQL errors (logic errors, parsing, a different error code) still surface. This follows the established pattern for the same class of test-fixture-against-unreachable-resource noise already in the allowlist (for example the librdkafka and StorageKafka2 broker-transport entries).

The errors themselves are correct engine behavior and are intentionally not changed.

Validation. Running the exact upgrade_runner.sh filter pipeline against the real upgrade-check clickhouse-server.upgrade.log from the failing run (CI report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=107927&sha=fd3cb89f6afe8c19648b2cab9a32bb74ed8dc9c3&name_0=PR&name_1=Upgrade%20check%20%28amd_release%29 ):

  • Without the change (master): all 5 <Error> lines (3 MySQL + 2 PostgreSQL) leak through -> check FAILs.
  • With the change: 0 lines remain -> check passes (verified over the full 237645-line log).
  • The only delta is exactly those fixture lines; nothing else is masked.
  • Negative test: genuine DatabaseMySQL/mysqlxx::Pool logic, parse, and different-code errors, plus a non-mysql Application: Connection to ... line, all still surface.

Version info

  • Merged into: 26.7.1.407

@groeneai

Copy link
Copy Markdown
Contributor Author

@groeneai

Copy link
Copy Markdown
Contributor Author

cc @maxknv — could you review this? It allowlists two background-cleaner <Error> lines in the upgrade-check runner (upgrade_runner.sh): test 04210_show_remote_databases_in_system_tables creates a DatabasePostgreSQL engine pointing at the deliberately-unreachable RFC5737 host 192.0.2.1, whose periodic removeOutdatedTables cleaner times out and logs <Error>, tripping the upgrade check's no-error-lines assertion. Each matcher requires the component plus the connection-failure symptom together, so genuine DatabasePostgreSQL errors still surface.

The upgrade check asserts that the server log contains no <Error> lines.
Test 04210_show_remote_databases_in_system_tables creates two remote database
engines pointing at the deliberately unreachable host 192.0.2.1 (RFC5737
TEST-NET-1):
  - ENGINE = PostgreSQL('192.0.2.1:5432', ...)
  - ENGINE = MySQL('192.0.2.1:3306', ...)

PostgreSQL: DatabasePostgreSQL::loadStoredObjects activates the
removeOutdatedTables cleaner task, whose periodic pool->get() times out against
the unreachable host and logs <Error> for each retry, including right after the
upgrade restart reloads the persisted engine.

MySQL: the engine probes the server while loading the persisted object after the
upgrade restart, logging <Error> from mysqlxx::Pool, the PoolWithFailover retry
logger, and the DatabaseMySQL ALL_CONNECTION_TRIES_FAILED exception.

The upgrade-test environment has no real PostgreSQL or MySQL server, so these
connection errors are fixture noise, not a compatibility regression (real
PostgreSQL/MySQL regressions are covered by integration tests with a live server).

Allowlist them in upgrade_runner.sh by requiring the component AND the
connection-failure symptom together, so genuine DatabasePostgreSQL/DatabaseMySQL
errors (logic, parsing, a different error code) still surface.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai groeneai force-pushed the groeneai/upgrade-check-postgresql-cleaner-noise branch from 7ea15b6 to fe57785 Compare June 26, 2026 01:54
@groeneai groeneai changed the title Allowlist DatabasePostgreSQL background-cleaner connection errors in upgrade check Allowlist remote database engine connection errors in upgrade check Jun 26, 2026
@groeneai

Copy link
Copy Markdown
Contributor Author

Updated the fix to also cover the MySQL engine that test 04210_show_remote_databases_in_system_tables creates at the same unreachable 192.0.2.1 host. The PostgreSQL-only version left three MySQL <Error> lines leaking, so the upgrade check stayed red. This PR now allowlists both engines (5 matchers total). Updated against the authentic upgrade.log from the failing Upgrade check (amd_release) run on commit fd3cb89f6afe.

Pre-PR validation gate (click to expand)
# Question Answer
a Deterministic repro? Yes. Ran the exact upgrade_runner.sh filter pipeline against the real failing upgrade.log (PR #107927 commit fd3cb89f6afe, 237645 lines). Unpatched: 5 <Error> lines leak (3 MySQL + 2 PostgreSQL) → check FAILs. The MySQL subset leaks deterministically with the PG-only matchers.
b Root cause explained? Yes. Test 04210 creates ENGINE = PostgreSQL('192.0.2.1:5432',...) and ENGINE = MySQL('192.0.2.1:3306',...) at RFC5737 TEST-NET-1 (deliberately unreachable). On the upgrade restart, reloading the persisted engines triggers connection attempts: PostgreSQL via the removeOutdatedTables cleaner task, MySQL via the engine's server probe. Both log expected <Error> connection failures (mysqlxx::Pool, PoolWithFailover retry logger, DatabaseMySQL ALL_CONNECTION_TRIES_FAILED; PostgreSQLConnectionPool, DatabasePostgreSQL::removeOutdatedTables). The upgrade check's "zero <Error> lines" assertion then fails.
c Fix matches root cause? Yes. The errors are correct engine behavior against an intentionally-unreachable fixture host, so the proper fix is to allowlist that specific fixture noise in the upgrade-check log assertion (established precedent: librdkafka, StorageKafka2 broker-transport, and our merged metric_log/Kafka2 entries). Each matcher requires the engine component AND the connection-failure symptom together. The engine log levels are intentionally not changed.
d Test intent preserved / new tests added? Yes. Negative-tested: genuine DatabaseMySQL/mysqlxx::Pool logic/parse/different-code errors, a non-mysql Application: Connection to ... line, and genuine DatabasePostgreSQL errors all still surface. The upgrade check retains full sensitivity to real regressions. No new test file needed (this is a CI log-allowlist fix; real PG/MySQL regressions are covered by integration tests with live servers).
e Both directions demonstrated? Yes. Master (no matchers): 5 lines leak → FAIL. PG-only (prior state of this PR): 3 MySQL lines leak → still red. Patched (PG+MySQL): 0 lines leak over the full 237645-line authentic log → PASS.
f Fix is general across code paths? Yes. Test 04210 is the only fixture that points a remote DB engine at 192.0.2.1; covered both engines it creates (PG + MySQL). All three MySQL log sources (mysqlxx::Pool Pool.cpp:406, PoolWithFailover retry logger :252, DatabaseMySQL ALL_CONNECTION_TRIES_FAILED throw :281) are matched. No other unreachable-host fixture leaks remain in the log.
g Fix generalizes across inputs? N/A (CI log-allowlist fix, not a code bug). Matchers use no hardcoded IP, so a future TEST-NET fixture host is covered too.
h Backward compatible? N/A (CI script only; no server behavior, setting, or format change).
i Invariants and contracts preserved? Yes. The change only adds filter lines to the secondary grep -av pipe in upgrade_runner.sh; the assertion's invariant ("any non-allowlisted <Error> fails the check") is preserved. bash -n syntax check passes. Each matcher requires component + connection-failure symptom together, so no over-masking.

Session id: cron:clickhouse-worker-slot-0:20260626-014500

@nikitamikhaylov nikitamikhaylov added the can be tested Allows running workflows for external contributors label Jun 26, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [d6c123a]

Summary:


AI Review

Summary

This PR's remaining net change against master is the MySQL half of the upgrade-log allowlist in upgrade_runner.sh: three MySQL connection-failure patterns plus an explanatory comment. The change addresses the observed 04210_show_remote_databases_in_system_tables noise, but the allowlist is still broader than that fixture, so I cannot approve it as-is.

Findings

⚠️ Majors

  • [tests/docker_scripts/upgrade_runner.sh:510-512] [dismissed by author -- https://github.com/Allowlist remote database engine connection errors in upgrade check #108560#discussion_r3481695551] The new MySQL allowlist still drops every mysqlxx::Pool / DatabaseMySQL connection failure and the shared Application: Connection to mysql failed retry line without tying them to the known leftover fake_db@192.0.2.1:3306 fixture. That weakens the upgrade check from "ignore this one interrupted remote-database test" to "ignore MySQL connection-failure logs generally", so an upgrade regression in how a persisted DatabaseMySQL object reloads can disappear from upgrade_error_messages.txt as long as it still collapses to these shared failure messages. The maintainer's "arbitrary unreachable hostnames" rationale does not solve the hostless Application: line or the lack of per-fixture scoping.
  • Suggested fix: keep the host-bearing patterns tied to the fixture description (for example fake_db@.* as user user, or the exact test-net host if that is guaranteed), and suppress the hostless retry line only when it appears in the same MySQL error block as a fixture-matching mysqlxx::Pool / DatabaseMySQL line.
Final Verdict

Status: ⚠️ Request changes

Minimum required actions:

  • Narrow the MySQL allowlist so it only suppresses the known 04210 leftover-object error block instead of any MySQL connection failure that happens to use these shared messages.

@clickhouse-gh clickhouse-gh Bot added the pr-ci label Jun 26, 2026
Comment thread tests/docker_scripts/upgrade_runner.sh Outdated
@groeneai

Copy link
Copy Markdown
Contributor Author

CI finish ledger — fe57785

Every failure below has an owner: a fixing PR (ours or external), or a full-effort fix task whose fixing-PR link will be posted here when it opens. Only CH Inc sync is exempt.

Check / test Reason Owner / fixing PR
Stress test (arm_debug) / Logical error 'future_error ... promise has been destructed' (STID 2508-3dee) trunk teardown race (packaged_task destroyed before its promise is satisfied in threadPoolCallbackRunnerUnsafe / TaskTracker shutdown; chronic across 6 PRs + 1 master / 30d; #108560 touches only upgrade_runner.sh, no link to thread-pool shutdown) #107383 (ours, open)
Upgrade check (amd_release) / Error message in server.log (MutatePlainMergeTreeTask CANNOT_PARSE_TEXT + throwIf) benign background-mutation log noise from fuzzer-typed test data (distinct from the PG/MySQL connection errors this PR allowlists) #108563 (ours, open)

Session id: cron:our-pr-ci-monitor:20260626-160000

@alexey-milovidov alexey-milovidov self-assigned this Jun 28, 2026
@alexey-milovidov

Copy link
Copy Markdown
Member

@groeneai

Copy link
Copy Markdown
Contributor Author

@alexey-milovidov this logical error is not caused by this PR (it only changes upgrade_runner.sh). It is a chronic Stress-test finding: std::future_error broken_promise (STID 2508-3dee and siblings), 14+ days across all sanitizers, independent of the PR.

Root cause: TaskTracker::addFinal (the async S3 multipart-upload completion, added in #105487) stores the final task's future before scheduling it. When the thread fuzzer injects CANNOT_SCHEDULE_TASK, the final packaged_task is destroyed unrun, leaving the stored future with a broken promise; waitAll() then throws future_error (a std::logic_error), which aborts in debug/sanitizer instead of surfacing the real scheduling error.

Fix in #108730: run the final task inline if scheduling fails, so its future always carries the real result/exception. Reproduced locally (MinIO + fault injection) and added a unit test that fails without the fix.

@clickhouse-gh clickhouse-gh Bot added the manual approve Manual approve required to run CI label Jun 30, 2026
@alexey-milovidov

Copy link
Copy Markdown
Member

Continued via /continue-pr.

Merged fresh master (branch was 1306 commits / ~5 days behind and red). Clean merge — the net diff vs master is unchanged, only this PR's MySQL/PostgreSQL matchers in tests/docker_scripts/upgrade_runner.sh (bash -n passes). Pushed fe57785..fda7664.

The merge also pulls in master's existing Cannot parse string 'x' as UInt64 allowlist entry, which the stale branch was missing. That closes the part of the prior Upgrade check (amd_release) failure originating from 04338_on_fly_mutation_read_overwritten_lc_source (MODIFY COLUMN v UInt64 over String 'x', CANNOT_PARSE_TEXT).

The two prior CI reds are unrelated to this change (which only adds MySQL/PostgreSQL connection-failure matchers):

Review: the clickhouse-gh AI "⚠️ Request changes" (narrow the allowlist to the documented 192.0.2.1 fixture) restates the inline thread that @alexey-milovidov already adjudicated — "This is ok, because the stress test can generate arbitrary unreachable hostnames." Narrowing to the hardcoded IP would defeat fuzzer-generated-hostname coverage, and the upgrade-check environment has no real MySQL/PostgreSQL server (genuine regressions are covered by integration tests with a live server). Per the maintainer's decision the matchers stay broad-but-bounded (component and connection-failure symptom together); the thread is resolved.

@alexey-milovidov

Copy link
Copy Markdown
Member

Continued via /continue-pr.

The branch is already fully merged with master (0 commits behind; master is an ancestor of HEAD fda7664), MERGEABLE, and approved. The net diff vs master is exactly this PR's change — the 5 MySQL/PostgreSQL connection-failure matchers plus the explanatory comment in tests/docker_scripts/upgrade_runner.sh, nothing else. No new review threads, all resolved.

Action this run: the gated CI workflow run for the merge commit fda7664 was still in action_required (the prior merge push left it unapproved). Approved it — fresh CI is now running on fda7664. The previously reported reds were against the stale pre-merge commit fe57785.

Expected on the fresh run: the Cannot parse string 'x' as UInt64 (CANNOT_PARSE_TEXT) part of the prior Upgrade check failure is now covered — the master merge pulled in that allowlist entry. The two remaining flakes are unrelated to this one-file change and tracked by dedicated PRs:

The clickhouse-gh AI "⚠️ Request changes" (narrow the allowlist to the literal 192.0.2.1 host) was already adjudicated by the maintainer in a now-resolved inline thread — narrowing to the hardcoded IP would defeat fuzzer-generated-hostname coverage, and the upgrade-check environment has no real MySQL/PostgreSQL server (genuine regressions are covered by integration tests with a live server). Matchers stay broad-but-bounded (component and connection-failure symptom together).

Blocked only on fresh CI completing + maintainer merge.

@alexey-milovidov

Copy link
Copy Markdown
Member

Continued via /continue-pr.

Fresh CI on fda7664 is complete and the final state differs from the run I projected earlier — the upgrade-check fix landed cleanly:

  • Upgrade check (amd_release) is now green. This is the exact check this PR targets, so the MySQL/PostgreSQL connection-failure allowlist (plus the Cannot parse string 'x' as UInt64 entry pulled in via the master merge) is confirmed working. The previously-leaking <Error> lines no longer trip the assertion.

The only two reds are unrelated flakes — impossible for a CI-only change to tests/docker_scripts/upgrade_runner.sh to cause:

  • Stress test (amd_tsan) and Stress test (arm_tsan)Hung check failed, possible deadlock found. CI itself auto-marked both as flaky (matched ... issue: #107941). The hung query in both dumps is 04412_deep_nested_literal_format.sh (deep-nested literal formatting), with no relation to the upgrade-check runner. This is the chronic master-wide hung-check, tracked by Hung check failed, possible deadlock found #107941.

CH Inc sync is the exempt internal sync check.

No code change needed: APPROVED, AI Review ✅ Approve, 0 unresolved threads, MERGEABLE. Ready to merge pending maintainer.

…eck-postgresql-cleaner-noise

# Conflicts:
#	tests/docker_scripts/upgrade_runner.sh
@alexey-milovidov

Copy link
Copy Markdown
Member

Continued via /continue-pr.

The branch was re-merged with fresh master (head d6c123a), but its PR workflow run was stuck in action_required (the fork-CI gate) and never fired. Approved it — CI is now running on the fresh merge.

One scope change worth noting: master has since picked up the PostgreSQL half of this fix independently, via #108998 (Ignore benign PostgreSQL background-reconnection errors in the upgrade check, merged 2026-07-01 03:01Z). So the net diff of this PR is now just the 3 MySQL matchers plus their comment in tests/docker_scripts/upgrade_runner.sh:

mysqlxx::Pool.*Failed to connect to MySQL
Application: Connection to mysql failed
DatabaseMySQL.*Connections to mysql failed

master still has zero MySQL allowlist entries, so these remain the unique remaining gap for the DatabaseMySQL engine that 04210_show_remote_databases_in_system_tables creates at the unreachable RFC 5737 host 192.0.2.1.

Still approved, 0 unresolved review threads. The only prior red was the unrelated #107941 hung-check flake (still open); the fresh CI run will re-confirm the Upgrade check fix on top of #108998.

@alexey-milovidov

Copy link
Copy Markdown
Member

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors manual approve Manual approve required to run CI pr-ci pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants