perf(ci): UI E2E tests on GHA by davdhacs · Pull Request #20345 · stackrox/stackrox · GitHub

perf(ci): UI E2E tests on GHA #20345

Draft
davdhacs wants to merge 273 commits into master from davdhacs/ui-e2e

@davdhacs davdhacs commented May 4, 2026

Contributor

Description

Run the full Cypress UI E2E test suite on GHA with 9 parallel shards. Supports four cluster modes: KinD (default), native GKE provisioning, provisioning of any infractl flavor (GKE, OCP, ROSA HCP), and re-use of a pre-provisioned infractl cluster.

New files:

  • .github/workflows/ui-e2e.yaml — main workflow (9 shards, KinD/GKE/infra modes)
  • .github/actions/connect-infra-cluster/action.yaml — connect to infractl-provisioned clusters with auto-detection of auth type (GKE gcloud plugin, OCP certificates, ROSA HCP OAuth token refresh)
  • .github/actions/create-gke-cluster/action.yaml — provision GKE cluster with spot VMs
  • .github/actions/create-kind-cluster/action.yaml — provision KinD cluster for per-shard testing
  • .github/actions/deploy-stackrox/action.yaml — deploy StackRox via roxie with configurable scanner version, Central env vars, and scanner-v4-matcher env vars
  • .github/workflows/build.yaml — trigger ui-e2e after image build
  • .github/workflows/OCP-UI-E2E-NOTES.md — test results and findings documentation

Cluster modes (workflow_dispatch):

| Input | Mode | Provision time | Example |
|---|---|---|---|
| (default) | KinD per-shard | ~2m | -f tag=... |
| gke=true | GKE native (gcloud) | ~5m | -f gke=true |
| infra-flavor=gke-default | GKE via infractl | ~5m | -f infra-flavor=gke-default |
| infra-flavor=rosahcp | ROSA HyperShift via infractl | ~15m | -f infra-flavor=rosahcp |
| infra-flavor=openshift-4 | OCP via infractl | ~45m | -f infra-flavor=openshift-4 |
| infra-cluster-name=... | Re-use any pre-provisioned cluster | 0m | -f infra-cluster-name=dh-06-05-... |

Additional dispatch inputs:

  • force-redeploy — delete existing StackRox and redeploy fresh
  • scanner — scanner version: v2 (default), v4, or both
  • central-env — extra env vars for Central (e.g. ROX_NODE_INDEX_ENABLED=true)
  • scanner-v4-env — extra env vars for scanner-v4-matcher
  • shard-filter — only run shards matching substring (e.g. vulnmanagement)
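
The dispatch inputs above combine freely. As a rough sketch, the helper below only prints a `gh workflow run` command so the input syntax is visible without triggering anything. The helper name `dispatch_cmd` is hypothetical, and the workflow file name is assumed from the "New files" list.

```shell
# Sketch: print (not run) a gh CLI dispatch command for the UI E2E workflow.
# Assumption: the workflow file is ui-e2e.yaml and inputs are passed as key=value.
dispatch_cmd() {
  printf 'gh workflow run ui-e2e.yaml'
  for kv in "$@"; do
    printf ' -f %s' "$kv"
  done
  printf '\n'
}

# Only the vulnmanagement shard, on a freshly provisioned ROSA HCP cluster, Scanner V2.
dispatch_cmd infra-flavor=rosahcp scanner=v2 shard-filter=vulnmanagement
```

Dropping the `printf` wrapper and running the printed command directly would start a real run, so the dry-run form is the safer one to experiment with.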

connect-infra-cluster auto-detection:

  • GKE kubeconfigs: installs gke-gcloud-auth-plugin via setup-gcloud, activates GCP service account
  • OCP kubeconfigs: certificate-based auth, no refresh needed
  • ROSA HCP kubeconfigs: detects expired sha256~ OAuth tokens, discovers OAuth server via .well-known, refreshes token using console credentials from artifacts
  • OpenShift console plugin: automatically enabled on OCP clusters after deploy
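
One way the auth-type detection described above could work is to look for tell-tale strings in the kubeconfig. The sketch below is an illustration only: the function name and the grep patterns are assumptions, not the action's actual checks, and deciding whether a token has expired would additionally need a call to the API server, which is not shown.

```shell
# Sketch: classify a kubeconfig by the auth material it contains.
# Assumption: these grep patterns are illustrative, not the action's real checks.
detect_auth_type() {
  kubeconfig="$1"
  if grep -q 'gke-gcloud-auth-plugin' "$kubeconfig"; then
    echo gke
  elif grep -q 'token: *sha256~' "$kubeconfig"; then
    echo rosa-hcp-oauth
  elif grep -q 'client-certificate-data' "$kubeconfig"; then
    echo ocp-cert
  else
    echo unknown
  fi
}

# Demo with a throwaway kubeconfig fragment holding an OAuth-style token.
tmp="$(mktemp)"
printf 'users:\n- name: admin\n  user:\n    token: sha256~abc123\n' > "$tmp"
detect_auth_type "$tmp"
rm -f "$tmp"
```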

Performance vs Prow (GKE mode):

| Metric | Prow | GHA | Change |
|---|---|---|---|
| Wall clock | 78-92 min | 14 min | -85% |
| GKE cost (spot) | 92 min-eq | ~7 min-eq | -92% |

Scanner V2 vs V4 findings:
Tested across 4 cluster types (OCP, ROSA HCP, GKE, KinD) x 4 scanner configs. Scanner V2 passes all vulnmanagement tests on all cluster types. Scanner V4 fails image-related tests due to incompatible data format with the legacy vulnmanagement UI. V2 is the correct choice for these tests.

Test changes (2 files):

  • networkDeploymentSidebar.test.js — increased assertion timeout for real data
  • deploymentTimeline.test.js — increased drill-down assertion timeout

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added e2e tests
  • modified existing tests

How I validated my change

Validated across all cluster modes:

Provisioning (infra-flavor):

  • gke-default: 9/9 shards passing (run 26975681251)
  • rosahcp: 9/9 shards passing (run 26995337893)
  • openshift-4: 9/9 shards passing (run 26995339880)

Cluster re-use (infra-cluster-name):

  • OCP re-use: 9/9 passing (run 27017983121)
  • ROSA HCP re-use with token refresh: 9/9 passing (run 27021374621)
  • GKE infra re-use with gcloud auth: 9/9 passing (run 27027801667)

Other modes:

  • GKE native: 9/9 passing (run 26904056080)
  • KinD: 9/9 passing (run 26930460627)

Scanner experiments (16 runs across 4 cluster types):

  • V2: passes on all cluster types
  • V4 (all configs): fails on OCP, ROSA HCP, GKE (passes on KinD only because KinD was hardcoded to V2 — since fixed)

davdhacs and others added 30 commits May 1, 2026 20:19
Central chart: internal/defaults.yaml (single file)
Sensor chart: internal/defaults/*.yaml (directory with multiple files)

Use find+sed scoped to internal/ to handle both structures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Gradle 9 skips test proto generation when --tests filter is used and
it determines no matching sources exist yet. Run testClasses first to
ensure proto generation and Groovy compilation happen, then run tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add cypress/integration/smoke.test.js that verifies 5 core pages load:
Dashboard, Clusters, Violations, System Health, Risk. Self-contained
login via /v1/auth/m/login session.

Use cypress-io/github-action@v6 (same as ui-component tests in this
repo) with Chrome headless. Groovy tests had Gradle 9 JUnit Platform
compatibility issues — Cypress runs directly without build system
dependencies.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cy.session localStorage persistence was flaky — some tests saw the
login page instead of the app. Set access_token directly in
beforeEach via cy.request. Also:
- Add API metadata test (pure API, no rendering)
- Use nav element as SPA-loaded indicator (present on all pages)
- Assert 'not contain Log in' to verify auth worked
- Remove overly specific selectors that don't exist in dev builds

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…OX_AUTH_TOKEN

/v1/auth/m/login returned 404 — that endpoint doesn't exist. The
existing Cypress tests use CYPRESS_ROX_AUTH_TOKEN generated via
/v1/apitokens/generate with basic auth, same as get-auth-token.sh.

Generate the token in a workflow step via curl, pass to Cypress via
env var. The test sets localStorage.access_token in beforeEach —
same pattern as helpers/basicAuth.js used by all other Cypress tests.

Also remove invalid 'headless' input from cypress-io/github-action.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
After the 5-test smoke test validates the deployment works, attempt
the full cypress/integration/**/*.test.{js,ts} suite (~900 tests
across 120 files). This first run will show us what percentage of
existing tests pass against a minimal KinD deployment.

- Smoke test runs first with install: true (installs Cypress)
- Full suite runs with install: false (reuses installed Cypress)
- Video recording disabled for the full suite (saves disk/time)
- Timeout extended to 120 minutes for the full suite
- CYPRESS_ORCHESTRATOR_FLAVOR=k8s set for tests that check cluster type

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Exclude vulnerabilities/ and vulnmanagement/ directories from the full
suite — these tests need scanner to populate CVE data (38 failures
from CVE table selectors alone). All other directories included.

First full run baseline: 718/813 passing (88%), 95 failing, 55 min.
With scanner-dependent tests excluded, expect ~95% pass rate on the
remaining ~640 tests.

Also disable screenshotOnRunFailure and video to avoid filling GHA
artifact storage on every run.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The comma-separated --spec glob list broke Cypress pattern matching
(YAML > scalar merged everything into one line with spaces). Use
--config excludeSpecPattern to exclude vulnerabilities/ and
vulnmanagement/ directories, keeping the default specPattern for
everything else.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ntrol)

Deploy all components to maximize test coverage:
- Scanner V4 (indexer + matcher + db) with minimal resources
- Collector with NO_COLLECTION (daemonset exists but no kernel module)
- Admission control (1 replica)
- Cluster name set to "remote" (systemHealth/bundle.test.js expects this)

Run ALL cypress specs with no exclusions — the full stack should
satisfy tests that were previously failing due to missing scanner
or network data.

Previous run without scanner: 552/558 passing (99%), 6 failures
from scanner/collector-dependent tests. This run tests whether
deploying the full stack resolves those and enables the vulnerability
test suites too.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
NO_COLLECTION means the collector daemonset exists but doesn't gather
runtime data (process events, network flows). Tests for network graph,
listening endpoints, and runtime violations see empty data.

CORE_BPF uses eBPF for collection — the GHA ubuntu runner kernel
supports it, and KinD shares the host kernel. This gives us real
runtime data without needing kernel modules.

If eBPF isn't available in KinD, collector will crash-loop but other
tests continue unaffected.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Each shard gets its own KinD cluster + full StackRox deploy and runs
a subset of test directories:
- shard 1-core: accessControl, clusters, collections, compliance
- shard 2-config: configmanagement, dashboard, integrations, systemHealth
- shard 3-policy: policies, risk, violations, networkGraph, top-level
- shard 4-vuln: vulnerabilities, vulnmanagement (scanner-dependent)

Tests whether 4 parallel deploys on separate runners bring wall-clock
from ~55 min to ~15 min. Each shard has full stack (scanner, collector,
admission control) with CORE_BPF.

The single-runner job continues to run in parallel for comparison.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
GITHUB_PATH only takes effect in subsequent steps, not within the
current run: block. roxctl commands later in the same step couldn't
find the binary. Add explicit export PATH alongside GITHUB_PATH.
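
A minimal sketch of the fix described here, assuming a helper named `add_to_path` (hypothetical) and a made-up install directory: the directory is appended to `GITHUB_PATH` for later steps and also exported into the current shell so commands in the same `run:` block can find the binary.

```shell
# Make a directory visible both to later steps and to the current one.
add_to_path() {
  dir="$1"
  echo "$dir" >> "$GITHUB_PATH"   # picked up by subsequent steps only
  PATH="$dir:$PATH"               # takes effect immediately in this shell
  export PATH
}

# Outside GitHub Actions GITHUB_PATH is unset, so use a temp file for the demo.
GITHUB_PATH="${GITHUB_PATH:-$(mktemp)}"
add_to_path "/tmp/roxctl-bin"     # hypothetical install directory
```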

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Both the single-runner and shard jobs now check peak memory from the
resource monitor log after tests complete. If peak exceeds 80% of
KIND_MEMORY_LIMIT (4.8GB of 6GB), the step fails with an error.

This catches:
- Memory leaks in StackRox components
- GC thrashing that eats CPU time
- Tests that trigger excessive memory allocation
- Regressions from code changes that increase memory footprint

Also adds background resource monitor to shard jobs (was missing)
and kills monitor in cleanup for both job types.

80% threshold chosen because:
- Current full-stack peak is 4.27GB (71%) — passes with headroom
- 80% = 4.8GB leaves 1.2GB for Cypress + k8s overhead
- Above 80%, OOM risk increases and performance degrades
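
The threshold check can be sketched as below. This is an illustration, not the workflow's actual step: the function name is hypothetical, values are integer megabytes, and 1 GB is treated as 1000 MB to match the 4.8GB-of-6GB figure above.

```shell
# Fail when peak memory exceeds a percentage of the limit (integer MB).
check_peak_memory() {
  peak_mb="$1"; limit_mb="$2"; pct="${3:-80}"
  threshold_mb=$(( limit_mb * pct / 100 ))
  if [ "$peak_mb" -gt "$threshold_mb" ]; then
    echo "::error::peak ${peak_mb}MB exceeds ${pct}% of ${limit_mb}MB (${threshold_mb}MB)"
    return 1
  fi
  echo "peak ${peak_mb}MB within ${pct}% of ${limit_mb}MB (${threshold_mb}MB)"
}

# The full-stack peak quoted above: 4.27GB against a 6GB limit.
check_peak_memory 4270 6000
```

With the numbers from the commit message the threshold works out to 4800MB, so a 4270MB peak passes and anything above 4800MB would fail the step.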

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Shard rebalance (was 8/17/17/37 min → target ~15 min each):
- Shard 1: added vulnmanagement (13 files, was in shard 4)
- Shard 3: added vulnerabilities/exceptionManagement, nodeCves,
  platformCves, snoozeWorkflow (moved from shard 4)
- Shard 4: now only workloadCves + VulnerabilityReporting

Seed test data: deploy nginx:1.12 (known vulnerable) after StackRox
is up. Scanner indexes it, providing CVE data needed by:
- vulnerabilities/exceptionManagement/* (defer/approve CVEs)
- vulnerabilities/workloadCves/* (CVE detail pages)
- vulnmanagement/* (legacy CVE tables)
Wait for deployment count > 5 to confirm scanner is indexing.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
8 shards (~15 files each) targeting ~10 min wall-clock:
1-access: accessControl, collections, compliance
2-clusters: clusters, administration
3-config: configmanagement
4-ui: dashboard, integrations, systemHealth, systemConfig, credentialExpiry
5-policy: policies, risk, violations, networkGraph, listeningEndpoints
6-vuln-mgmt: vulnmanagement
7-vuln-except: exceptionManagement, nodeCves, platformCves, VulnerabilityReporting
8-vuln-workload: workloadCves

Deploy parallelized in 4 phases:
Phase 1 (parallel): KinD create + docker pull main image + npm ci
Phase 2 (parallel): 4x ctr pull images into KinD + 2x roxctl chart gen
Phase 3 (sequential): helm install central (needs cluster + charts)
Phase 4 (overlap): port-forward + sensor deploy + nginx seed data

Key optimizations:
- npm ci during KinD creation (saves ~30s)
- 4 images pull concurrently (saves ~15s vs serial)
- install: false on cypress-io action (already installed)
- Seed data deploys in background during sensor wait

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Major deploy speedup — target <60s:
- Use in-repo helm charts directly (no roxctl helm output needed)
- sed tags on in-repo charts during KinD creation (parallel)
- Generate init bundle via Central API (curl) instead of roxctl CLI
- Skip all rollout status waits — only poll Central API readiness
- Sensor, scanner, collector start in background during tests

Eliminated:
- roxctl binary extraction (was ~13s)
- docker pull for roxctl image (was ~10s)
- roxctl chart generation (was ~5s)
- kubectl rollout status deploy/central-db (was ~10s)
- kubectl rollout status deploy/sensor (was ~15s)

Total saved: ~53s from critical path. Deploy should be ~80-90s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Can't modify checked-out files in-place with sed. Copy charts to
working dirs first. Also adds debug echoes for timing visibility
and error output for init bundle API call.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
5-second failure means the script errors at the very start. Add set -x
for full trace, use cp -rL to follow symlinks in chart directories.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
set -x would print passwords and registry credentials in the trace
before ::add-mask:: can register them. Disable tracing around:
- docker login (registry password)
- openssl rand password generation
- ctr pull --user (registry credentials)
- curl -u admin:password (API calls)

Re-enable set -x after each sensitive section for debugging.
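
A small sketch of the pattern, under the assumption that the secret arrives in an environment variable (here the hypothetical `SECRET_VALUE`): tracing is switched off before the secret is expanded and restored afterwards only if it was on. The `:` no-op stands in for the real `curl -u` or `docker login` call.

```shell
# Run a secret-handling section with xtrace off, then restore the previous state.
call_api_untraced() {
  case "$-" in *x*) was_tracing=1 ;; *) was_tracing=0 ;; esac
  { set +x; } 2>/dev/null            # stop tracing; hide the 'set +x' line too
  # Secret-bearing work goes here, e.g. curl -u "admin:$SECRET_VALUE" ...
  : "admin:$SECRET_VALUE"            # no-op stand-in that still expands the secret
  if [ "$was_tracing" -eq 1 ]; then set -x; fi
  return 0
}

SECRET_VALUE="example-password"      # hypothetical; normally injected as a CI secret
set -x
call_api_untraced                    # the trace shows the call, not the secret
{ set +x; } 2>/dev/null
```

Passing the secret as a function argument would defeat the purpose, because the call line itself is traced with its arguments.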

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ored

The helm charts at deploy/k8s/*/helm/chart are generated artifacts
(.gitignored), not tracked files. roxctl is needed to generate them.

Restored roxctl extraction + chart generation, but kept optimizations:
- Phase 1 (parallel): KinD create + docker pull main + npm ci
- Phase 2 (parallel): 4x ctr pull into KinD + 2x roxctl chart gen
- Phase 3: helm install central (no rollout wait)
- Phase 4: poll Central API only
- Phase 5: init bundle via curl API + helm install sensor

Removed set -x debug tracing (root cause found: missing chart files).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ation)

The curl-based init bundle via /v1/cluster-init/init-bundles was failing —
the helmValuesBundle field format needs investigation (base64 encoding,
JSON structure). Revert to roxctl which is proven to work.

The API approach can be revisited once we verify the response format.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bare 'wait' waits for ALL background processes including npm ci.
If npm fails (or is slow), 'wait' returns non-zero and kills the
script under set -e. Use explicit PID array to wait only for image
pulls + chart generation.
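
As an illustration of waiting on explicit PIDs, the sketch below uses a space-separated argument list rather than an array so it also runs under plain `sh`. The function name and the stand-in commands are assumptions. A bare `wait` would block on every background job, including the slow one that is not on the critical path.

```shell
# Wait only for the listed PIDs; return non-zero if any of them failed.
wait_for_pids() {
  rc=0
  for pid in "$@"; do
    wait "$pid" || rc=1
  done
  return "$rc"
}

true & pull_pid=$!            # stands in for an image pull
true & chart_pid=$!           # stands in for chart generation
sleep 30 & unrelated_pid=$!   # stands in for a slow npm ci

wait_for_pids "$pull_pid" "$chart_pid" && echo "critical jobs done"
kill "$unrelated_pid"
```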

Also disabled single-runner job (if: false) to unblock shard log
access — the 55-min single-runner was preventing us from reading
shard failure logs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Root cause of shard deploy failures: port-forward was attempted before
Central pod was running (status=Pending). Skipping rollout wait was too
aggressive — Central needs to be at least running for port-forward.

Added: kubectl wait --for=condition=ready pod -l app=central
Still skipping: scanner, sensor, collector rollout waits (not needed
for port-forward or init bundle generation).

Temporarily disabled shards 3-8 for faster iteration (2 shards only).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Deploy fix confirmed: both test shards passed (1-access: 5min,
2-clusters: 3min). Deploy step takes ~100s with parallelization.

Re-enable all 8 shards for full suite coverage.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
KinD's --wait flag polls for control-plane Ready status, adding 15-20s.
The cluster is usable immediately after creation — kubectl and helm work
fine. The kubectl wait --for=condition=ready on the Central pod already
handles readiness.

Local benchmark: 48s → 36s (with --wait removed).
GHA estimate: ~100s → ~85s.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two bugs:
1. kind create had '&' inside a comment, not as the command suffix —
   it ran in foreground, not background. Fixed.
2. npm ci in background + install: false on cypress-io action caused
   exit 127 (Cypress binary not found). Let cypress-io handle install
   with install: true and its built-in binary caching.

Removed background npm ci from deploy step — cypress-io/github-action
handles npm install + Cypress binary download with caching.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move Central API wait, auth token generation, and sensor deploy to a
background process that runs while cypress-io/github-action installs
npm dependencies + Cypress binary (~30s).

Previous flow (serial): deploy → wait for Central (20s) → auth token → Cypress install (30s) → tests
New flow (parallel):    deploy → [background: Central boot + auth + sensor] → Cypress install → tests

The ~20s Central boot is fully hidden behind the ~30s Cypress install.
Auth token written to /tmp/rox-auth-token by background process, read
by a subsequent step that polls until the file appears.

Expected deploy step: ~35s (helm submit + image pulls + chart gen).
Expected wall-clock to first test: ~45s (deploy + Cypress install).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Put the whole deploy (KinD + images + helm + sensor) into a background
subshell with ( ... ) &. GHA proceeds to the next step immediately
(Cypress install) while the deploy runs in parallel.

In regular GHA run: steps, ( ... ) & works — GHA does NOT wait for
background children (documented in memory from PR #19397).

Timeline:
  Deploy step: ~0s (kicks off background, returns immediately)
  Cypress install step: ~30s (runs WHILE deploy happens in background)
  Auth token step: polls /tmp/rox-auth-token file (ready when Central is up)
  Test step: Central + token ready, tests run immediately

Deploy output goes to /tmp/deploy.log for debugging.
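
The hand-off between the background deploy and the auth token step can be sketched as a file poll. The function name, the timeout values, and the demo path are assumptions; the commit itself writes the token to /tmp/rox-auth-token.

```shell
# Poll until a file exists and is non-empty, or give up after timeout_s seconds.
wait_for_file() {
  path="$1"; timeout_s="${2:-300}"; waited=0
  while [ ! -s "$path" ]; do
    if [ "$waited" -ge "$timeout_s" ]; then
      return 1
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
  return 0
}

token_file="$(mktemp -u)"                          # demo path only
( sleep 1; echo "demo-token" > "$token_file" ) &   # stands in for the background deploy
wait_for_file "$token_file" 10 && echo "token ready"
rm -f "$token_file"
```

Checking for a non-empty file rather than mere existence avoids reading the token in the short window after the file is created but before it is written.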

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Instead of ctr pull inside KinD (network I/O after KinD is ready),
docker pull all 4 images to the host Docker daemon in parallel with
KinD creation. Then docker save | ctr import moves them locally
(no network, just local pipe).

This overlaps ~20s of network image pull with ~8s of KinD creation,
saving ~12s from the deploy timeline inside the background subshell.

Previous: KinD (8s) → ctr pull 4 images (20s) → helm (2s) → boot (20s) = 50s
New:      KinD (8s) + docker pull (20s parallel) → docker save|ctr import (8s) → helm + boot (22s) = 38s

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
docker save | ctr import is slower because it pipes uncompressed
images (~1.55GB for main alone) vs ctr pull which downloads compressed
layers (~330MB). The 92s auth wait with docker save vs 62s with ctr
pull confirms this.

Reverted to ctr pull directly inside KinD node (proven faster).

Also investigating central-db init time: docker-entrypoint.sh runs
initdb + pg_ctl start + docker_setup_db on every fresh start (~12s).
Could pre-bake the initialized PGDATA into the image to skip this.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
davdhacs and others added 6 commits June 5, 2026 14:29
Each Cypress run is ~3 min + 3 min gap = ~6 min per attempt.
15 attempts = ~90 min max. This discovers exactly when vuln data
becomes available rather than guessing with a fixed upfront wait.

Reduced the CVE-specific wait to 30s (quick check only) since
the retry loop is the real wait mechanism.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Root cause: the 40-minute CVE wait was accidentally serving as time
for image scan data to populate. Without it, vulnmanagement tests
fail because they need image data (not cluster/node CVEs).

Fix: add image scan trigger + data wait to vulnmanagement shard
(same pattern as the vulnerabilities shard). Remove the Cypress
retry loop since the data wait handles it properly.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Image count alone isn't enough — tests need actual CVE data from
scanned images to populate tables and widgets. Poll imageCVECount
via GraphQL until CVEs appear (up to 30 min).

The successful runs all had the 40-min CVE wait which inadvertently
gave time for this data to populate. Now we explicitly wait for
the right signal: imageCVECount > 0.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
…ties

Both shards need the same image scan data. Merged their wait logic
into a single case so vulnmanagement works both in full 9-shard runs
(where vulnerabilities populates data in parallel) and in single-shard
testing (where it needs to trigger scans itself).

Root cause of vulnmanagement single-shard failures: the old 40-minute
CVE wait was accidentally stalling long enough for the vulnerabilities
shard to populate image data. Without it, no image CVE data exists.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Scanner pod may take longer than 5 minutes to start on re-used
clusters. Increase timeout to 10m and continue with a warning
instead of failing the entire step — the scan data check loop
below handles the actual readiness.
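
A possible shape for "continue with a warning", assuming a hypothetical wrapper named `warn_on_failure`: the wrapped command may fail or time out, but the step keeps going and leaves a GitHub Actions warning annotation behind. The real `kubectl wait` arguments are not given in the commit, so `false` stands in for a timed-out wait.

```shell
# Run a command; on failure emit a warning annotation and carry on.
warn_on_failure() {
  if "$@"; then
    return 0
  fi
  echo "::warning::'$*' did not succeed; continuing"
  return 0
}

# Stand-in for a scanner readiness wait that timed out.
warn_on_failure false
echo "continuing to the scan data check"
```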

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
55 unscanned images pass the image count check but have 0 CVEs.
Tests need actual scanned image CVE data. Wait for both conditions:
imageCount > 5 AND imageCVECount > 0.

This ensures the scanner has actually completed scans before
Cypress runs, regardless of how many auto-discovered images exist.
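
The combined readiness condition can be sketched as a small predicate. Everything besides the two thresholds is an assumption: `fetch_counts` is a hypothetical stand-in for the GraphQL query, hard-coded here to the "55 unscanned images, 0 CVEs" case from the commit message.

```shell
# True only when enough images exist AND at least one CVE has been reported.
scan_data_ready() {
  image_count="$1"; cve_count="$2"
  [ "$image_count" -gt 5 ] && [ "$cve_count" -gt 0 ]
}

# Hypothetical stand-in for the GraphQL query; prints "<imageCount> <imageCVECount>".
fetch_counts() { echo "55 0"; }

counts="$(fetch_counts)"
images="${counts% *}"
cves="${counts#* }"
if scan_data_ready "$images" "$cves"; then
  echo "scan data ready"
else
  echo "still waiting: images=$images cves=$cves"
fi
```

In a real step this check would sit inside a loop with a sleep and an overall deadline, so the image-count-only false positive described above cannot release the tests early.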

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
davdhacs and others added 16 commits June 18, 2026 10:03
Replace our custom create-gke-cluster with the version now in master:
- Uses ci_export for CLUSTER_NAME/ZONE via GITHUB_ENV (no outputs)
- Includes auto-cleanup via gacts/run-and-post-run
- No image pull secret (roxie handles via REGISTRY_USERNAME/PASSWORD)

Updated ui-e2e.yaml to match:
- Job outputs use env.CLUSTER_NAME/env.ZONE instead of step outputs
- GCP auth narrowed to pre-provisioned GKE only (action handles own auth)
- Removed tag/quay-user/quay-pass from create-gke-cluster call

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The action's gacts/run-and-post-run cleanup fires when the provision
job ends, which deletes the cluster before shard jobs can use it in
multi-job workflows.

Add auto-cleanup input (default true for backward compat). Set to
false in ui-e2e where cleanup-gke handles deletion after all shards.

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move disable_prompts before early-return in setup_gcp and add
setup_gcp call to teardown_gke_cluster. Cherry-picked from PR #21275.

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The "drill down on a pod" test was the only test in deploymentTimeline
that used real API data instead of fixture data. When the API returned
empty pods (no process events collected yet), the test failed because
the timeline list rendered without items — no amount of timeout increase
would help since the data simply wasn't there.

Switch to fixture data (deploymentEventTimeline.json for the deployment
view, podEventTimeline.json for the pod drill-down), consistent with
every other test in this file. This tests the UI navigation behavior
(drill down shows pod name + back button) without depending on live
cluster state.

AI-assisted change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The GCR test asserts that the "New integration" button does not exist,
but does so immediately after page visit without waiting for GCR-specific
content to render. On GKE with higher latency, the page may not have
fully settled before the assertion runs, causing a consistent failure.

Wait for the GCR-specific deprecation notice alert before asserting the
button's absence.

AI-assisted.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
All 16 tests in podTimeline.test.js fail on GKE because the
deployment-level event timeline renders empty (no process events
collected on a fresh cluster), so the drill-down button never
appears.

Pass deploymentEventTimeline.json fixture to openEventTimeline()
for every test, matching the pattern used in deploymentTimeline.test.js.
Also pass podEventTimeline.json fixture to the Legend test's
drill-down call for consistency.

AI-assisted change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The previous fix waited for a deprecation notice alert that may not be
rendered in time. Wait for the IntegrationsTable "results found" h2
instead, which confirms the table component (and its button visibility
logic) has rendered.

AI-assisted change.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test fixes belong in their own PRs (#21337, #21338, #21341) and
should not be merged into the ui-e2e workflow PR.

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Contributor


Actionable comments posted: 1

🧹 Nitpick comments (4)
.github/actions/deploy-stackrox/action.yaml (1)

103-118: 🔒 Security & Privacy | 🔵 Trivial | ⚖️ Poor tradeoff

Pass operator inputs via env: rather than ${{ }} interpolation into the shell/yq expressions.

inputs.cluster-name, inputs.central-env, and the parsed key/val are spliced directly into yq expression strings. Beyond GitHub template-injection risk, a value containing quotes or yq operators can corrupt the override document. Bind them to environment variables and reference $VAR in the script.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/actions/deploy-stackrox/action.yaml around lines 103 - 118, The
GitHub Actions inputs (cluster-name, central-env) and parsed variables (key,
val) are being directly interpolated into yq expressions using ${{ }} syntax,
which creates security and data corruption risks. Instead, bind these inputs to
environment variables at the beginning of the step using env: (with
cluster_name, central_env, key, and val as environment variable names), then
reference them in the script using $VARIABLE_NAME notation rather than ${{
inputs.xxx }} or direct variable expansion. This applies to the yq command
containing .securedCluster.spec.clusterName, the condition checking central-env,
and the loop processing PAIRS where key and val are extracted from the pair
split.

Source: Linters/SAST tools

.github/workflows/ui-e2e.yaml (1)

569-576: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick win

Avoid interpolating dispatch-controlled inputs directly into run scripts.

inputs.shard-filter (and similarly inputs.gke-cluster-name/inputs.gke-zone at Lines 459-469) are expanded via ${{ }} straight into shell, which static analysis flags as a template-injection vector. Bind them to env: and reference $VAR (and quote in case) instead.

Example
     - name: Check shard filter
       id: filter
+      env:
+        SHARD_FILTER: ${{ inputs.shard-filter }}
+        SHARD_NAME: ${{ matrix.name }}
       run: |
-        if [ -n "${{ inputs.shard-filter }}" ]; then
-          case "${{ matrix.name }}" in
-            *${{ inputs.shard-filter }}*)
+        if [ -n "$SHARD_FILTER" ]; then
+          case "$SHARD_NAME" in
+            *"$SHARD_FILTER"*)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ui-e2e.yaml around lines 569 - 576, The workflow file is
directly interpolating workflow inputs (inputs.shard-filter,
inputs.gke-cluster-name, and inputs.gke-zone) into shell scripts using ${{ }}
syntax, which creates a template-injection security vulnerability. To fix this,
add an env: section before the run: script that binds these inputs to
environment variables, then update all references in the if condition, case
statement pattern matching at lines 569-576, and the references at lines 459-469
to use $VAR_NAME syntax with proper quoting instead of ${{ inputs.xxx }}.

Source: Linters/SAST tools

.github/workflows/build.yaml (2)

530-541: 🔒 Security & Privacy | 🔵 Trivial | ⚡ Quick win

Use the existing job-level BUILD_TAG shell var instead of ${{ env.BUILD_TAG }} interpolation.

BUILD_TAG is already exported at job scope (Line 443), so referencing ${BUILD_TAG} avoids the GitHub-expression-into-shell interpolation that static analysis flags as a template-injection vector.

🛡️ Proposed change
       - name: Copy central-db to rhacs-eng registry
         if: |
           github.event_name == 'push' || !github.event.pull_request.head.repo.fork
         run: |
           skopeo copy --retry-times 5 --all \
-            "docker://quay.io/stackrox-io/central-db:${{ env.BUILD_TAG }}" \
-            "docker://quay.io/rhacs-eng/central-db:${{ env.BUILD_TAG }}"
+            "docker://quay.io/stackrox-io/central-db:${BUILD_TAG}" \
+            "docker://quay.io/rhacs-eng/central-db:${BUILD_TAG}"

       - name: Copy scanner-v4-db to rhacs-eng registry
         if: |
           github.event_name == 'push' || !github.event.pull_request.head.repo.fork
         run: |
           skopeo copy --retry-times 5 --all \
-            "docker://quay.io/stackrox-io/scanner-v4-db:${{ env.BUILD_TAG }}" \
-            "docker://quay.io/rhacs-eng/scanner-v4-db:${{ env.BUILD_TAG }}"
+            "docker://quay.io/stackrox-io/scanner-v4-db:${BUILD_TAG}" \
+            "docker://quay.io/rhacs-eng/scanner-v4-db:${BUILD_TAG}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/build.yaml around lines 530 - 541, The skopeo copy
commands in both the "Copy central-db to rhacs-eng registry" and "Copy
scanner-v4-db to rhacs-eng registry" sections are using GitHub expression
interpolation syntax (${{ env.BUILD_TAG }}) instead of the shell variable
syntax. Since BUILD_TAG is already exported at the job scope, replace all
instances of ${{ env.BUILD_TAG }} with ${BUILD_TAG} in both the central-db and
scanner-v4-db skopeo copy run steps to reference the shell variable directly and
avoid the template-injection vector flagged by static analysis.

Source: Linters/SAST tools


469-477: 🚀 Performance & Scalability | 🔵 Trivial

Use the standard is_in_PR_context helper to align with the rest of the file.

Lines 115, 882, and 1254 all base arch decisions on the is_in_PR_context and pr_has_label ci-build-all-arch helpers, but this step checks github.event_name != 'pull_request' directly. Since the file treats workflow_call as distinct in other contexts, this inconsistency could cause platform mismatches across jobs.

♻️ Align with the shared helper
      - name: Determine platforms
        id: platforms
        run: |
+          source ./scripts/ci/lib.sh
           PLATFORMS="linux/amd64,linux/arm64"
-          if [[ "${{ github.event_name }}" != "pull_request" ]] || \
-             [[ "${{ contains(github.event.pull_request.labels.*.name, 'ci-build-all-arch') }}" == "true" ]]; then
+          if ! is_in_PR_context || pr_has_label ci-build-all-arch; then
             PLATFORMS="linux/amd64,linux/arm64,linux/ppc64le,linux/s390x"
           fi
           echo "platforms=${PLATFORMS}" >> "$GITHUB_OUTPUT"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/build.yaml around lines 469 - 477, The "Determine
platforms" step uses a custom check with github.event_name instead of the
standard is_in_PR_context helper used elsewhere in the file (at lines 115, 882,
and 1254). Replace the if statement condition that checks github.event_name and
the contains() function with the helper-based condition from the proposed
change: ! is_in_PR_context || pr_has_label ci-build-all-arch. This will ensure consistency
in how PR context is determined across all platform selection logic and prevent
potential platform mismatches between jobs.
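
The helper-based condition can be exercised in isolation with a small sketch. The two helpers below are hypothetical stand-ins, not the real implementations from scripts/ci/lib.sh (which are not shown in this thread), and the EVENT_NAME and PR_LABELS variables are invented so the logic can run outside a workflow.

```shell
# Hypothetical stand-in: true only when the (made-up) EVENT_NAME variable
# says the run was triggered by a pull request.
is_in_PR_context() {
  [ "${EVENT_NAME:-}" = "pull_request" ]
}

# Hypothetical stand-in: true when the space-separated (made-up) PR_LABELS
# variable contains the given label.
pr_has_label() {
  case " ${PR_LABELS:-} " in
    *" $1 "*) return 0 ;;
    *) return 1 ;;
  esac
}

# select_platforms: print the platform list using the condition from the
# proposed change. "!" negates only is_in_PR_context.
select_platforms() {
  platforms="linux/amd64,linux/arm64"
  if ! is_in_PR_context || pr_has_label ci-build-all-arch; then
    platforms="linux/amd64,linux/arm64,linux/ppc64le,linux/s390x"
  fi
  echo "${platforms}"
}

# Each case runs in a subshell so the assignments do not leak.
( EVENT_NAME=pull_request; PR_LABELS=""; select_platforms )
( EVENT_NAME=pull_request; PR_LABELS="ci-build-all-arch"; select_platforms )
( EVENT_NAME=push; PR_LABELS=""; select_platforms )
```

Under these assumptions, only an unlabeled pull request gets the two-platform list; a labeled pull request and any non-PR run get all four.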

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 6c7bf915-bb8b-4e9a-95db-6fa521467433

📥 Commits

Reviewing files that changed from the base of the PR and between 078d91d and 9cce79a.

📒 Files selected for processing (8)
  • .github/actions/connect-infra-cluster/action.yaml
  • .github/actions/create-gke-cluster/action.yaml
  • .github/actions/create-kind-cluster/action.yaml
  • .github/actions/deploy-stackrox/action.yaml
  • .github/actions/job-preamble/action.yaml
  • .github/workflows/OCP-UI-E2E-NOTES.md
  • .github/workflows/build.yaml
  • .github/workflows/ui-e2e.yaml
✅ Files skipped from review due to trivial changes (1)
  • .github/workflows/OCP-UI-E2E-NOTES.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/actions/job-preamble/action.yaml

Comment on lines +408 to +410

🩺 Stability & Availability | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

Does the env context in a GitHub Actions job-level outputs mapping reflect environment variables written to $GITHUB_ENV at runtime by steps?

💡 Result:

No, the env context in a GitHub Actions job-level outputs mapping does not reflect environment variables written to $GITHUB_ENV at runtime. The env context is designed to hold variables defined in the workflow configuration (at the workflow, job, or step level) and variables made available via the $GITHUB_ENV mechanism within the same job [1][2][3]. However, job-level outputs (defined under the jobs.<job_id>.outputs mapping) are static definitions evaluated when the workflow is initialized, before the runner executes any steps [1][4]. Because job-level outputs are determined by the workflow configuration, they cannot dynamically capture or reflect environment variables that are created or modified during the execution of a job [5]. To pass information between jobs, you must explicitly define a job-level output and assign it a value from a step's output (using the $GITHUB_OUTPUT file), not from an environment variable [5][6].

In summary:

1. $GITHUB_ENV is for passing information between steps within the same job [7][8][6].
2. $GITHUB_OUTPUT is for passing information between steps within the same job and, if mapped to job-level outputs, between different jobs [2][5][6].
3. Job-level outputs cannot pull data from the env context or $GITHUB_ENV at runtime [5].

Job-level outputs cannot access environment variables written to $GITHUB_ENV at runtime—these fallbacks will resolve to empty.

Job-level outputs are evaluated at workflow initialization, before step execution begins. When create-gke-cluster writes CLUSTER_NAME and ZONE to $GITHUB_ENV, the job outputs (lines 408-410) cannot reflect those runtime values. The fallback expressions env.CLUSTER_NAME and env.ZONE will remain empty on the in-workflow-create path, causing get-credentials --zone (line 615) and cleanup-gke deletion (line 822) to fail.

Use step outputs from create-gke-cluster instead. Modify create-gke-cluster to expose cluster-name and zone as $GITHUB_OUTPUT, then reference steps.create-gke-cluster.outputs.cluster-name and steps.create-gke-cluster.outputs.zone in the job outputs mapping.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/ui-e2e.yaml around lines 408 - 410, The job-level outputs
at lines 408-410 are attempting to reference environment variables
(env.CLUSTER_NAME and env.ZONE) as fallbacks, but job outputs are evaluated at
workflow initialization before step execution, so these env variable fallbacks
will always be empty. Modify the create-gke-cluster step to write cluster-name
and zone to $GITHUB_OUTPUT instead of or in addition to $GITHUB_ENV, then update
the outputs mapping to reference steps.create-gke-cluster.outputs.cluster-name
and steps.create-gke-cluster.outputs.zone instead of the env variable fallbacks
to ensure the values are properly propagated to downstream jobs.
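
The step-output approach recommended in this comment could look roughly like the following inside the action's run script. This is a sketch only: write_step_output is a hypothetical helper, the fallback to a temporary file exists just so the snippet runs outside GitHub Actions, and the example cluster name and zone are placeholders rather than values from the PR.

```shell
# Outside GitHub Actions GITHUB_OUTPUT is unset; fall back to a temp file
# so the sketch can be run locally.
: "${GITHUB_OUTPUT:=$(mktemp)}"

# write_step_output NAME VALUE: hypothetical helper that appends one
# name=value line to the file GitHub Actions reads step outputs from.
write_step_output() {
  printf '%s=%s\n' "$1" "$2" >> "${GITHUB_OUTPUT}"
}

# Placeholder values; in the action these would be the real cluster name
# and zone chosen during provisioning.
write_step_output cluster-name "${CLUSTER_NAME:-example-cluster}"
write_step_output zone "${ZONE:-us-central1-a}"

cat "${GITHUB_OUTPUT}"
```

If the step has the id create-gke-cluster, the job's outputs: mapping could then reference steps.create-gke-cluster.outputs.cluster-name and steps.create-gke-cluster.outputs.zone, as the comment suggests.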

davdhacs and others added 3 commits June 24, 2026 11:28
…face

Remove stale inputs (tag, quay-user, quay-pass) and switch outputs
from step outputs to env vars (CLUSTER_NAME, ZONE via ci_export).

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Missed the third create-gke-cluster usage in the ui-e2e cluster job.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

ℹ️ Review info
⚙️ Run configuration

Configuration used: Central YAML (base), Organization UI (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 5e93130c-1993-41e0-8154-4aba0086cd97

📥 Commits

Reviewing files that changed from the base of the PR and between f5e6bb5 and 611b103.

📒 Files selected for processing (1)
  • .github/workflows/build.yaml

@coderabbitai coderabbitai Bot left a comment

Caution

Inline review comments failed to post. This is likely due to GitHub's internal server error or limits when posting large numbers of comments. If you are seeing this consistently it is likely a permissions issue. Please check "Moderation" -> "Code review limits" under your organization settings.

Actionable comments posted: 1

🛑 Comments failed to post (1)
.github/workflows/build.yaml (1)

354-354: 🔒 Security & Privacy | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

rg -n -C5 'pre_build_go_binaries|matrix\.arch|fromJson\(needs\.define-job-matrix\.outputs\.matrix\)' .github/workflows/build.yaml

Repository: stackrox/stackrox

Length of output: 20695


Correct the architecture validation allowlist in the build script.

The current proposed fix restricts architectures to amd64 and arm64, but the build matrix (pre_build_go_binaries) dynamically includes ppc64le and s390x for non-default contexts. The validation logic must account for all supported architectures. When passing ${{ matrix.arch }} to the shell command, use an env variable and ensure the validation regex/case statement accepts amd64|arm64|ppc64le|s390x.

🛡️ Corrected fix
       - name: Build Go Binaries
-        run: GOOS=linux GOARCH=${{ matrix.arch }} CGO_ENABLED=0 make build-prep main-build-nodeps roxctl_linux-${{ matrix.arch }}
+        env:
+          MATRIX_ARCH: ${{ matrix.arch }}
+        run: |
+          case "${MATRIX_ARCH}" in
+            amd64|arm64|ppc64le|s390x)
+              ;;
+            *)
+              echo "::error::Unsupported matrix.arch: ${MATRIX_ARCH}"
+              exit 1
+              ;;
+          esac
+
+          GOOS=linux GOARCH="${MATRIX_ARCH}" CGO_ENABLED=0 make build-prep main-build-nodeps "roxctl_linux-${MATRIX_ARCH}"
🧰 Tools
🪛 zizmor (1.26.1)

[warning] 354-354: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)


[warning] 354-354: code injection via template expansion (template-injection): may expand into attacker-controllable code

(template-injection)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/build.yaml at line 354, The architecture validation in the
build step is too restrictive and does not match the full set of values produced
by pre_build_go_binaries. Update the build logic around the roxctl_linux build
invocation to pass matrix.arch through an env variable instead of interpolating
it directly in the shell command, and make sure the shell validation allowlist
accepts all supported architectures: amd64, arm64, ppc64le, and s390x. Adjust
the regex or case statement used by that validation so non-default matrix
entries are handled correctly.

Source: Linters/SAST tools
