CI: add experimental serverfuzz stress test and BuzzHouse jobs by maxknv · Pull Request #101399 · ClickHouse/ClickHouse · GitHub

CI: add experimental serverfuzz stress test and BuzzHouse jobs #101399

Merged
maxknv merged 8 commits into master from serverfuzz-experimental-jobs on Apr 15, 2026

@maxknv maxknv commented Mar 31, 2026

Member

Re-enables the server-side AST fuzzer (ast_fuzzer_runs) that was disabled in #101274, but only for new dedicated CI jobs that carry the serverfuzz and experimental keywords in their names. Regular stress and BuzzHouse jobs are not affected.

New jobs added to master.py only (not run on PRs):

  • Stress test (experimental, serverfuzz, ...) — mirrors all existing stress test variants (11 build configs + 2 Azure)
  • BuzzHouse (experimental, serverfuzz, ...) — mirrors all existing BuzzHouse variants (4 build configs)

The setting is activated by checking Info().job_name in the Python job scripts:

  • stress_job.py passes ENABLE_SERVER_FUZZER=1 into docker; stress_tests.lib writes a separate XML profile when that env is set.
  • ast_fuzzer_job.py passes SERVER_FUZZER_ENABLED=1 into docker; run-fuzzer.sh writes a separate XML profile when that env is set. Enables both ast_fuzzer_runs and ast_fuzzer_any_query to cover write/DDL paths.
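The job-name check described above can be sketched roughly as follows. This is an illustration only, not the code from `stress_job.py` or `ast_fuzzer_job.py`: the helper names `server_fuzzer_env` and `docker_env_args` are hypothetical, and the job name is passed in as a plain string instead of being read from `Info().job_name`.

```python
import shlex

# The keyword comes from the PR description; the helpers below are hypothetical.
SERVERFUZZ_KEYWORD = "serverfuzz"


def server_fuzzer_env(job_name: str, env_var: str) -> dict:
    """Return the extra docker environment for serverfuzz jobs, else an empty dict."""
    if SERVERFUZZ_KEYWORD in job_name:
        return {env_var: "1"}
    return {}


def docker_env_args(env: dict) -> str:
    """Render an env mapping as `-e KEY=VALUE` flags for a `docker run` command line."""
    flags = []
    for key, value in env.items():
        pair = f"{key}={value}"
        flags.append(f"-e {shlex.quote(pair)}")
    return " ".join(flags)


# Stress jobs use ENABLE_SERVER_FUZZER, BuzzHouse jobs use SERVER_FUZZER_ENABLED.
stress_env = server_fuzzer_env(
    "Stress test (experimental, serverfuzz, debug)", "ENABLE_SERVER_FUZZER"
)
print(docker_env_args(stress_env))
```

Regular jobs whose names lack the keyword get an empty mapping, so no flag is passed and their behavior is unchanged.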

Additionally, when the server dies and a specific error is extracted from logs (e.g. Logical error, sanitizer finding), the redundant generic "Server died" test result is now excluded from the report.

Reverts the disable from: #101274

Changelog category (leave one):

  • CI Fix or Improvement (changelog entry is not required)

Version info

  • Merged into: 26.4.1.972

maxknv and others added 3 commits March 31, 2026 16:23
Add new `Stress test (experimental, serverfuzz, ...)` and
`BuzzHouse (experimental, serverfuzz, ...)` CI jobs that re-enable the
server-side AST fuzzer (`ast_fuzzer_runs`) which was disabled for regular
jobs in #101274.

The new jobs cover the same build variants as the existing stress and buzz
jobs. The `ast_fuzzer_runs` / `ast_fuzzer_any_query` settings are activated
only when `serverfuzz` is present in `Info().job_name`:
- `stress_job.py` passes `ENABLE_SERVER_FUZZER=1` to docker, and
  `stress_tests.lib` writes a separate XML config when that env is set.
- `ast_fuzzer_job.py` passes `SERVER_FUZZER_ENABLED=1` to docker, and
  `run-fuzzer.sh` writes a separate XML config when that env is set.
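For illustration, the "write a separate XML config when the env is set" step could look like the sketch below. The real logic lives in shell (`stress_tests.lib`, `run-fuzzer.sh`) and its exact file contents are not shown in this thread, so the function names, the `profiles/default` layout, and the setting values here are assumptions.

```python
from xml.etree import ElementTree as ET


def build_fuzzer_profile(settings: dict) -> str:
    """Build a ClickHouse-style settings profile (layout assumed: profiles/default)."""
    root = ET.Element("clickhouse")
    profile = ET.SubElement(ET.SubElement(root, "profiles"), "default")
    for name, value in settings.items():
        ET.SubElement(profile, name).text = str(value)
    return ET.tostring(root, encoding="unicode")


def maybe_write_profile(environ: dict, path: str, settings: dict) -> bool:
    """Write the profile only when the env flag is set; return whether it was written."""
    if environ.get("ENABLE_SERVER_FUZZER") != "1":
        return False
    with open(path, "w", encoding="utf-8") as f:
        f.write(build_fuzzer_profile(settings))
    return True


# Placeholder values: the PR text names the settings but not their values.
EXAMPLE_SETTINGS = {"ast_fuzzer_runs": 1, "ast_fuzzer_any_query": 1}
print(build_fuzzer_profile(EXAMPLE_SETTINGS))
```

Keeping the fuzzer settings in their own file means regular jobs, which never set the env variable, never see the profile at all.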

New jobs are added to both `master.py` and `pull_request.py` (for testing).

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@clickhouse-gh

clickhouse-gh Bot commented Mar 31, 2026

Contributor

@clickhouse-gh clickhouse-gh Bot added the pr-ci label Mar 31, 2026
@alexey-milovidov

Member

The Stress test (arm_msan) failure is fixed by #101239, which should be merged first. After it is merged, please update the branch to include the fix.

Comment thread ci/jobs/scripts/fuzzer/run-fuzzer.sh
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@maxknv maxknv marked this pull request as draft April 15, 2026 11:14
maxknv and others added 3 commits April 15, 2026 14:22
Keep them only in the master workflow.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When the server dies, the stress test reports both a generic "Server died"
from the test runner and a specific error parsed from server logs (e.g.
Logical error, sanitizer finding). Skip the generic "Server died" entry
since the log parsing always produces a more specific result.

#101399

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
If the server died but log parsing produced no specific result, re-add
"Server died" so the failure is never silently lost.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
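Taken together, the two commits above amount to a filter with a fallback. A minimal sketch, assuming results are simple name/status records (the `Result` type and `merge_results` function are hypothetical, not the actual stress job code):

```python
from typing import List, NamedTuple


class Result(NamedTuple):
    name: str
    status: str


GENERIC_SERVER_DIED = "Server died"


def merge_results(runner_results: List[Result], log_results: List[Result]) -> List[Result]:
    """Combine test-runner results with results parsed from server logs.

    The generic "Server died" entry is dropped when log parsing produced a
    more specific result, and kept when log parsing produced nothing.
    """
    generic = [r for r in runner_results if r.name == GENERIC_SERVER_DIED]
    others = [r for r in runner_results if r.name != GENERIC_SERVER_DIED]
    merged = others + list(log_results)
    if generic and not log_results:
        # Fallback: never lose the failure if no specific error was extracted.
        merged.extend(generic)
    return merged


died = Result("Server died", "FAIL")
specific = Result("Logical error (from server log)", "FAIL")  # illustrative name
print(merge_results([died], [specific]))
print(merge_results([died], []))
```

The fallback branch is what the second commit adds: without it, a server death with unparseable logs would disappear from the report.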
@maxknv maxknv marked this pull request as ready for review April 15, 2026 12:29
@maxknv maxknv enabled auto-merge April 15, 2026 12:40
@maxknv maxknv added this pull request to the merge queue Apr 15, 2026
Merged via the queue into master with commit 8a0628e Apr 15, 2026
161 checks passed
@maxknv maxknv deleted the serverfuzz-experimental-jobs branch April 15, 2026 15:41
@robot-clickhouse-ci-1 robot-clickhouse-ci-1 added the pr-synced-to-cloud label (The PR is synced to the cloud repo) Apr 15, 2026
@pufit

pufit commented Apr 15, 2026

Member

@groeneai please read this PR.
It introduces a new experimental stress test that can fail in master. You should fix these failures as P0 priority. Update your notes accordingly.

@groeneai

Contributor

Read the PR. Here is my understanding:

New jobs (MasterCI only):

  • Stress test (experimental, serverfuzz, <build>) — 13 variants (all build configs + 2 Azure)
  • BuzzHouse (experimental, serverfuzz, <build>) — 4 variants (debug, arm_asan, tsan, msan)
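As a small illustration of the naming scheme, the four BuzzHouse variants listed above can be generated from the build names. The helper `experimental_job_name` is hypothetical; the actual job definitions in `master.py` use the CI framework's own job objects, which are not shown here.

```python
# Build names as listed in this comment.
BUZZHOUSE_BUILDS = ["debug", "arm_asan", "tsan", "msan"]


def experimental_job_name(base: str, build: str) -> str:
    """Format a job name carrying both the experimental and serverfuzz keywords."""
    return f"{base} (experimental, serverfuzz, {build})"


buzzhouse_jobs = [experimental_job_name("BuzzHouse", b) for b in BUZZHOUSE_BUILDS]
for name in buzzhouse_jobs:
    print(name)
```

Because the fuzzer is switched on by a substring check on the job name, every generated name must contain `serverfuzz`.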

Mechanism: stress_job.py sets ENABLE_SERVER_FUZZER=1 when job name contains "serverfuzz", which enables ast_fuzzer_runs + ast_fuzzer_any_query server-side settings. This re-enables the server-side AST fuzzer that was disabled in #101274 — but only in these dedicated experimental jobs.

Monitoring updated:

  • CIDB scan now recognizes check_name LIKE '%experimental%serverfuzz%' patterns
  • Failures from these jobs will be classified as P0 (server crashes, LOGICAL_ERROR, sanitizer findings under active server fuzzing)
  • These fire on MasterCI only, so they will appear with pull_request_number = 0
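The matching rule in the list above can be approximated in Python as follows. This is only a sketch of the pattern semantics, not the actual CIDB query: `like_to_regex` and `is_master_serverfuzz` are hypothetical helpers, rows are assumed to be plain dicts, and backslash escapes in `LIKE` patterns are not handled.

```python
import re


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a simple SQL LIKE pattern: % matches any run, _ matches one character."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


SERVERFUZZ_PATTERN = like_to_regex("%experimental%serverfuzz%")


def is_master_serverfuzz(row: dict) -> bool:
    """True for experimental serverfuzz checks that ran on master (PR number 0)."""
    return (
        row["pull_request_number"] == 0
        and SERVERFUZZ_PATTERN.fullmatch(row["check_name"]) is not None
    )


row = {"check_name": "Stress test (experimental, serverfuzz, debug)", "pull_request_number": 0}
print(is_master_serverfuzz(row))
```

Note that the pattern requires `experimental` to appear before `serverfuzz` in the check name, which matches the job naming used in this PR.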

Will start monitoring for failures as they come in.

@nikitamikhaylov

nikitamikhaylov commented Apr 16, 2026

Member

@groeneai There are lots of crashes found by this job already. Investigate and fix them all: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

@groeneai

Contributor

@nikitamikhaylov Here's a comprehensive catalog of all crashes found by the experimental serverfuzz jobs on master. I've been tracking these since the jobs went live (April 15).

Serverfuzz Crash Catalog — Master Branch (last 30 days)

22 total failures across 7 master commits, mapping to 11 distinct crash types. All are tracked with fix PRs or under active investigation.

| # | Crash Type | STID | Hits | Fix PR | Status |
|---|---|---|---|---|---|
| 1 | ReplicatedMergeTree startup race: paranoidCheckForCoveredParts assertion + server startup failures | 2508-5dc3, 2508-6644 | 7 | #102864 (ours) | 🔄 Under review (@tavplubix, @alesapin) |
| 2 | Hung check / possible deadlock: chronic, low-rate (~1/day) | | 3 | No dedicated fix | 🔍 Chronic P2, monitoring |
| 3 | Aggregate function parameters assertion: parameters == nested_func->getParameters() | 2508-522c, 4870-4f21 | 2 | #102856 | Merged; monitoring for recurrence |
| 4 | isConst/isSparse/isReplicated type mismatch: getDataType() == rhs.getDataType() assertion | 2508-35cc, 2508-35de | 2 | #101036 (ours) | 🔄 Open, awaiting review |
| 5 | Column identifier already registered: GlobalPlannerContext UNION ALL edge case | 4697-4326, 4697-369d | 2 | #100770, #101104 (@alexey-milovidov) | 🔄 Two fix PRs open |
| 6 | Cannot fold actions for projection: join reorder + type-changing joins | 3673-56a1 | 2 | #102351, #102516 (external) | 🔄 Two fix PRs open |
| 7 | Not-ready Set passed to function: IN subquery moved to PREWHERE | 0250-41a5 | 1 | #100375 (external) | 🔄 Open |
| 8 | Unexpected number of rows in column subchunk: Parquet reader | 4002-3ca5 | 1 | #102628 (ours) | 🔄 Open, under review |
| 9 | std::length_error: schema inference crash from oversized allocation | 2508-3921 | 1 | #102859 (ours) | 🔄 Open, under review |
| 10 | No available columns: getSmallestColumn() with subquery projections | 3938-33a6 | 1 | #101005 (ours) | ✅ Approved by @pufit, awaiting 2nd reviewer |
| 11 | Block structure mismatch: different columns in stream | 0993-28b7 | 1 | No fix yet | 🆕 New, investigating |

Summary

I'll continue working through the queue. The highest priority remaining items are:

  1. Getting our fix PRs reviewed and merged: "Fix false-positive chassert in paranoidCheckForCoveredPartsInZooKeeperOnStart" (#102864), "Fix ColumnReplicated type mismatch crash in merge algorithms" (#101036), and "Fix server crash from std::length_error during schema inference" (#102859)
  2. Investigating the new block structure mismatch (STID 0993-28b7)
  3. Monitoring the external fix PRs for merge

@alexey-milovidov

Member


Labels

pr-ci, pr-synced-to-cloud (The PR is synced to the cloud repo)


6 participants