Add loopback interface aliases on macOS to run parallel-replica remote() tests#107435
Conversation
…e() tests macOS only binds 127.0.0.1 to lo0 and, unlike Linux, does not auto-route the rest of 127.0.0.0/8. Tests that connect to 127.0.0.2+ via remote()/cluster() (parallel replicas, distributed, sharded) deterministically time out, so they were added to ci/defs/darwin.skip. Alias 127.0.0.2..127.0.0.16 on lo0 in the macOS Fast test pre-hook (the server already listens on 0.0.0.0, so the addresses become reachable once aliased) and un-skip the parallel-replica remote() tests that were skipped solely for this loopback limitation. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
cc @maxknv — could you review this? Follow-up to #107420 (requested by @alexey-milovidov): aliases 127.0.0.2..16 on lo0 in the macOS Fast test pre-hook so remote()/cluster() tests can reach them, and un-skips the 5 parallel-replica remote() tests that were in darwin.skip solely for this loopback limitation. |
|
Workflow [PR], commit [0c3d0c7] Summary: ✅ AI ReviewSummaryThis PR updates the Darwin fast-test job to run through Final VerdictStatus: ✅ Approve |
The previous form `sudo ifconfig lo0 alias 127.0.0.$i up 2>/dev/null || true` swallowed every error, so a genuine sudo/ifconfig failure (permissions, runner change) would pass the pre-hook and the un-skipped parallel-replica remote() tests would later fail with misleading socket timeouts instead of surfacing the real setup failure. Now each address is aliased only when absent (idempotent on reused runners), and a real alias failure aborts the pre-hook with a non-zero exit so the job fails loudly at setup. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pre-PR validation gate (fixup 8c63b5b — fail-closed pre-hook)
Session id: cron:clickhouse-worker-slot-1:20260614-102600 |
|
@alexey-milovidov The alias approach has a fundamental conflict I need your call on before proceeding.
That deterministically breaks two EXPLAIN-plan tests that assert
Both were green on arm_darwin for the last 10 days (414/415 OK, 0 FAIL) and first failed on this PR's own job. Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=107435&sha=a722a05793171d58559dcb16e13e6d2ba998b593&name_0=PR&name_1=Fast%20test%20%28arm_darwin%29 . They are the only 2 of the 9 EXPLAIN-distributed tests not already in So the two macOS test classes have contradictory requirements at Side effect: the aliases persist on the shared macos runner with no cleanup, so 4 unrelated PRs (including one of yours) have already inherited them and failed 01951/01952. Those runners likely need recycling. Two ways forward, your pick:
I will implement whichever you choose. |
The lo0 aliases for 127.0.0.2+ make those addresses appear as local
interface addresses on macOS. isLocalAddress() treats 127.0.0.{2..255}
as non-local by design (matched via getifaddrs), so 01951 and 01952,
which EXPLAIN remote('127.{1,2}', ...) and assert a distributed plan,
see ReadFromRemote collapse to a local read and fail. These two tests
cannot coexist with the aliases on macOS, so skip them there (they
still run on Linux, joining their seven ReadFromRemote-EXPLAIN
siblings already in darwin.skip).
Also add a post_hook that removes the aliases after the job. The
macos_m2 runner is reused, and a system-level alias with no teardown
leaked onto later jobs on the same runner, causing the same 01951/01952
failures on unrelated PRs. The teardown is best-effort.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pre-PR validation gate (follow-up commit 85fe945 - option A completion)
Session id: cron:clickhouse-worker-slot-1:20260614-182500 |
|
Thanks for the LGTM. I pushed one follow-up commit (85fe945) to complete the approved approach so it lands without a regression. While validating the un-skipped tests I found that the lo0 aliases make The follow-up commit therefore:
Runners already contaminated by the earlier commits likely need recycling to clear the leftover aliases. |
The CI run for 85fe945 was dropped because the praktika Config Workflow job hung and was killed (exit -15), a transient infra issue that hit several PRs simultaneously. No source change; this empty commit re-runs the workflow to validate the loopback-alias teardown and the 01951/01952 darwin.skip entries. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The lo0 alias setup was in pre_hooks. In ci/praktika/runner.py the pre-hook outcome is stored in prehook_result but res is never reassigned from it, so the `if res:` block still launches fast_test.py even when the alias setup fails. The job is reported failed, but the un-skipped remote()/cluster() tests run anyway and report misleading socket timeouts instead of surfacing the real setup failure. Move the alias loop into the job command, chained with `&& python3 ./ci/jobs/fast_test.py` and guarded by `|| exit 1`, so a genuine alias failure stops before the SQL tests start. This is fail-closed at the actual run boundary without changing shared praktika framework behavior (darwin_fast_test_jobs is the only job using job-level pre_hooks). The loop stays idempotent (skip an address already aliased). The de-alias cleanup remains in post_hooks, which run regardless of job pass/fail. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Pre-PR validation gate (update for commit 933ac96 — command-path alias setup)
Session id: cron:clickhouse-worker-slot-1:20260615-085300 |
Move the lo0 de-alias into the job command (after fast_test.py) so its exit code reaches the job status: praktika does not propagate pre/post hook exit codes, so a teardown in a post_hook can leak a host-level alias while the job still passes, poisoning later fast-test jobs on the reused macos_m2 runner. Teardown now tolerates only the already-absent case (check ifconfig lo0 first) and sets rc=1 on a real removal failure, while preserving the fast_test.py exit code. The post_hook de-alias is kept as a best-effort net for the timeout path (a timed-out command is killed before its teardown runs). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Algunenano
left a comment
There was a problem hiding this comment.
@groeneai This is failing in https://github.com/ClickHouse/ClickHouse/actions/runs/27560062050/job/81469550963?pr=107435#step:4:32
Traceback (most recent call last):
File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
exec(code, run_globals)
File "/home/ubuntu/actions-runner/_work/ClickHouse/ClickHouse/ci/praktika/__main__.py", line 422, in <module>
main()
File "/home/ubuntu/actions-runner/_work/ClickHouse/ClickHouse/ci/praktika/__main__.py", line 347, in main
workflows = _get_workflows(
File "/home/ubuntu/actions-runner/_work/ClickHouse/ClickHouse/ci/praktika/mangle.py", line 73, in _get_workflows
spec.loader.exec_module(foo)
File "<frozen importlib._bootstrap_external>", line 883, in exec_module
File "<frozen importlib._bootstrap>", line 241, in _call_with_frames_removed
File "/home/ubuntu/actions-runner/_work/ClickHouse/ClickHouse/./ci/workflows/merge_queue.py", line 4, in <module>
from ci.defs.job_configs import JobConfigs
File "/home/ubuntu/actions-runner/_work/ClickHouse/ClickHouse/ci/defs/job_configs.py", line 174, in <module>
class JobConfigs:
File "/home/ubuntu/actions-runner/_work/ClickHouse/ClickHouse/ci/defs/job_configs.py", line 239, in JobConfigs
).parametrize(
File "/home/ubuntu/actions-runner/_work/ClickHouse/ClickHouse/ci/praktika/job.py", line 115, in parametrize
obj.command = obj.command.format(PARAMETER=param_set.parameter)
KeyError: ' sudo ifconfig lo0 -alias 127'
|
CI fully finished on HEAD Session: cron:our-pr-ci-monitor:20260615-160000 |
parametrize() runs str.format(PARAMETER=...) on the job command, so the
literal { } in the teardown brace group { sudo ... || rc=1; } was parsed as a
format field and raised KeyError, breaking the whole workflow config load.
Rewrite the teardown as if/then/fi so the command has no literal braces. The
fail-closed semantics are unchanged: only present aliases are removed, a
failed removal sets rc=1, an absent alias is tolerated, and the fast_test.py
exit code is preserved.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
@Algunenano Fixed in 3665955. Root cause: parametrize() runs Fix: rewrote the teardown brace group as Verified |
`04308_remote_storage_engine` exercises the multi-shard case
`Remote('127.0.0.{1,2}', system, one)`, which connects to the second
loopback shard `127.0.0.2:9000`. On the `arm_darwin` Fast test runner,
macOS binds only `127.0.0.1` on `lo0` and does not auto-route the rest
of `127.0.0.0/8`, so `127.0.0.2` is unroutable and the query fails with
`ALL_CONNECTION_TRIES_FAILED`.
This is the same macOS shard-test infrastructure issue already handled
for the sibling tests `04327_rewrite_aggregate_function_with_if_distributed`
and `04336_parallel_blocks_marshalling_low_cardinality_native_format`,
which were added to `ci/defs/darwin.skip` in #107376. `04308` is a new
test introduced by this pull request, so it was not covered there. Add
it to `ci/defs/darwin.skip` for consistency until the runner-side fix
(#107435) aliases the additional loopback addresses and these tests can
run on macOS again.
The test itself is unchanged and keeps full coverage on Linux.
Report: https://s3.amazonaws.com/clickhouse-test-reports/json.html?PR=106189&sha=dbc0fc28753abffe3096e1064be57a6c440263b6&name_0=PR&name_1=Fast%20test%20%28arm_darwin%29
Related: #107376
Related: #107435
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The Fast test (arm_darwin) job inlined the lo0 alias setup, fast_test.py
invocation, and fail-closed teardown into the job command string. praktika
processes that string twice in ways that broke on shell syntax:
1. parametrize() runs str.format(PARAMETER=...) on the command, so a bash
brace group { ...; } was parsed as a format field -> KeyError.
2. the run-command validator splits on spaces and asserts the first token
containing '/' is a real path; 'python3 ./ci/jobs/fast_test.py;' carried
the trailing ';' -> AssertionError, failing Config Workflow and dropping
every downstream job.
Move the whole sequence into ci/jobs/scripts/fast_test_darwin.sh and make the
command the bare script path, mirroring the existing sqlancer_job.sh job. The
script keeps the fail-closed teardown (remove only present 127.0.0.2..16
aliases, fail the job on a failed removal, tolerate already-absent ones) and
preserves fast_test.py's exit code. Add the script to the Darwin fast test
digest so edits reschedule the job.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
@Algunenano The config-load breakage is fixed in 73a2728. After the parametrize() brace fix, the actual fork CI run hit a second praktika check on the same inlined command. The run-command validator (validator.py:296) splits the command on spaces and asserts the first token containing '/' is a real path, but that token was Both failures (the str.format brace KeyError and this) came from inlining shell into the job command string. Fix: moved the lo0 alias setup + fast_test.py + fail-closed teardown into ci/jobs/scripts/fast_test_darwin.sh and made the command the bare script path, mirroring the existing sqlancer_job.sh job. The command now has no braces and no trailing ';', so neither check can trip again. Verified locally: Validator().validate() passes over all 17 workflows (the Config Workflow step that failed). The fail-closed teardown is preserved (removes only present 127.0.0.2..16 aliases, fails the job on a failed removal, tolerates already-absent ones) and fast_test.py's exit code is kept. |
praktika's runner.py appends local-run selectors (--test, --param, ...) to job.command, and the previous inlined `python3 ./ci/jobs/fast_test.py` consumed them directly. The fast_test_darwin.sh wrapper invoked fast_test.py with no arguments, so a targeted Darwin run such as `python -m ci.praktika run fast --test 02947_parallel_replicas_remote --param test` ran the full suite instead of narrowing to the requested test/param. Pass "$@" through to fast_test.py so the wrapper preserves the original argument contract. The fail-closed lo0-alias setup/teardown and the fast_test.py exit code are unchanged. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Algunenano
left a comment
There was a problem hiding this comment.
LGTM. Although I think this should be handled in the machines themselves and they shouldn't be reused without control (cc @leshikus) but let's unblock this

Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
...
Description
Related: #107420, #106689
Follow-up to #107420 (requested by @alexey-milovidov).
macOS only binds
127.0.0.1tolo0and, unlike Linux, does not auto-route the rest of127.0.0.0/8. Tests that connect to127.0.0.2+viaremote()/cluster()(parallel replicas, distributed, sharded) deterministically time out, which is why a large set of them sit inci/defs/darwin.skip.This adds a pre-hook to the macOS Fast test job that aliases
127.0.0.2..127.0.0.16onlo0(the server already listens on0.0.0.0, so the addresses become reachable once aliased). The alias loop is idempotent on reused runners.With the aliases in place, the parallel-replica
remote()tests skipped solely for this loopback limitation can run again, so this also un-skips:02875_parallel_replicas_remote02947_parallel_replicas_remote03562_parallel_replicas_remote_with_cluster03572_pr_remote_in_subquery03595_parallel_replicas_join_remoteScope is kept conservative: only the parallel-replica
remote()family is un-skipped here. About 150 otherdarwin.skipentries connect to127.0.0.2+and should now pass too; they can be un-skipped incrementally once this lands and the runner behavior is confirmed.Version info
26.6.1.872