chore(AI): add /verify-in-cluster skill for live cluster testing by guzalv · Pull Request #20762 · stackrox/stackrox · GitHub
chore(AI): add /verify-in-cluster skill for live cluster testing#20762

Closed
guzalv wants to merge 25 commits into master from gualvare/verify-skill
@guzalv guzalv commented May 21, 2026

Contributor

Description

Add a Claude Code /verify-in-cluster skill that automates the build-deploy-test cycle for
StackRox code changes against a live Kubernetes/OpenShift cluster.

What the skill does

Phases 0–8 guide the agent through: cluster discovery, StackRox detection, authentication,
change analysis, binary cross-compilation, image mutation via crane mutate --append,
deployment patching, test execution, and result reporting.

Roxie integration

The skill delegates to Roxie wherever it can:

| Phase | What Roxie handles | Replaces |
|---|---|---|
| 1: Cluster discovery | `roxie env` detects cluster type, context, kubeconfig | Manual `oc cluster-info` + architecture detection |
| 2b: Authentication | `roxie shell -- bash -c 'echo ENDPOINT=$ROX_ENDPOINT ...'` retrieves endpoint and password from the saved manifest | ~20 lines of manual endpoint probing (route → LB → port-forward) and credential hunting (env var → files → secrets) |
| 2c: Fresh deployment | `roxie deploy both --tag <tag> --envrc <file>` handles operator install, CRs, readiness, and credential generation | ~130 lines of deploy.sh env-var setup, namespace creation, and manual password extraction |
| Teardown | `roxie teardown both` cleans up when needed | Manual namespace deletion |

Each Roxie step has a manual fallback for clusters not deployed by Roxie.

What is NOT delegated to Roxie (and why)

| Phase | What the skill does itself | Why |
|---|---|---|
| 3: Change analysis | `git diff`, component mapping table, CI fast-path detection | Novel to this skill; Roxie has no concept of "which binaries changed" |
| 4: Build | `go build` with cross-compilation for target arch | Roxie doesn't compile custom binaries |
| 5: Image push | `crane mutate --append` to overlay binaries onto the running image, push to ttl.sh | Roxie deploys whole releases; the skill patches individual binaries into an existing image for fast iteration |
| 6: Deploy patch | `kubectl set image` on specific deployments | Surgical single-deployment patches, not full re-deploys |
| 7: Test | Bug repro, fix verification (before/after), API testing, E2E execution | Domain-specific test orchestration |
| 8: Report | Structured summary with proof artifacts | Skill-specific output format |
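
As a rough illustration of the Phase 3 change analysis, the sketch below maps changed file paths to the components that would need rebuilding. The function name `components_for_paths` and the directory-to-component mapping are assumptions made for this example; the skill's actual mapping table is not reproduced in this conversation.

```bash
# Hypothetical helper: read changed file paths on stdin and print the
# affected components, one per line, de-duplicated.
# The mapping below is illustrative only, not the skill's real table.
components_for_paths() {
  while IFS= read -r path; do
    case "$path" in
      central/*)    echo central ;;
      migrator/*)   echo migrator ;;
      sensor/*)     echo sensor ;;
      compliance/*) echo compliance ;;
      *)            ;;  # unmapped path: no binary rebuilt in this sketch
    esac
  done | sort -u
}
```

It could be fed from something like `git diff --name-only origin/master...HEAD | components_for_paths`; an empty result would mean nothing needs rebuilding under this mapping.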

Other design decisions

  • --tag is mandatory for roxie deploy: without it, Roxie defaults to 4.9.2, which causes DB migration mismatches
  • Private registry fallback: Roxie deploys from quay.io/rhacs-eng/ (requires auth); when crane can't pull, the skill falls back to quay.io/stackrox-io/ (public) with the same tag
  • CI fast-path: When no local changes exist and a CI-built image matches HEAD, the skill skips build+push entirely and patches with the CI image
  • YOLO mode: Skips all user confirmations for unattended execution
  • No auto-rollback: The parent agent may want to inspect a failed deployment
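
The private registry fallback in the list above amounts to a repository rewrite that keeps the image name and tag. A minimal sketch, assuming the base image reference is available as a plain string; the function name `public_fallback_image` is hypothetical and not taken from the skill:

```bash
# Hypothetical helper: rewrite a private quay.io/rhacs-eng/ reference to the
# public quay.io/stackrox-io/ equivalent, keeping image name and tag.
# Any other reference is printed unchanged.
public_fallback_image() {
  case "$1" in
    quay.io/rhacs-eng/*)
      printf 'quay.io/stackrox-io/%s\n' "${1#quay.io/rhacs-eng/}"
      ;;
    *)
      printf '%s\n' "$1"
      ;;
  esac
}
```

The rewritten reference would only be tried after a pull from the original reference fails, so clusters with access to the private registry are unaffected.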

User-facing documentation

  • CHANGELOG.md is updated OR update is not needed
  • documentation PR is created and is linked above OR is not needed

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

This is a Claude Code skill (markdown instructions), not executable code — no unit tests apply.

How I validated my change

Tested the skill end-to-end against a live GKE/OpenShift cluster across multiple iterations:

  • Roxie shell credential retrieval: Deployed via roxie deploy both --tag, then used roxie shell to extract endpoint + password — confirmed working
  • Build + crane push: Made test changes to central/ping/service/service_impl.go, built central+migrator, pushed via crane mutate --append to ttl.sh with quay.io/stackrox-io/ as base (private registry fallback exercised)
  • Deploy + verify: Patched deployment/central, confirmed rollout, curled /v1/ping to verify the test change was live
  • CI fast-path: Tested with no local changes against a PR with CI-built images — correctly skipped build phases
  • Restore: Confirmed original image restored after verification
  • Scanner-v4-db fix: Validated guidance for ImagePullBackOff when scanner-v4-db tag mismatches

Add a skill that automates the build-deploy-test cycle against a live
cluster. It handles cluster discovery, StackRox auth, cross-compilation,
crane-based image mutation, deployment patching, and test execution.

Reviewed by two independent parallel agents; findings addressed.
Tested locally against an OpenShift cluster (central + migrator build,
ttl.sh push, deployment patch, API verification).

AI-assisted: code partially generated by AI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@openshift-ci

openshift-ci Bot commented May 21, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

guzalv and others added 2 commits May 21, 2026 19:54
- crane manifest ttl.sh/test:1h returns a valid manifest, not MANIFEST_UNKNOWN
- Remove incorrect claim that TLS errors are "common on macOS"
- Default to stackrox namespace, fall back to rhacs-operator instead of
  searching all namespaces

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

The x509 error is caused by Claude Code's sandbox network proxy
intercepting Go's crypto/tls connections. Tested: crane fails inside
sandbox, works outside; curl (different TLS impl) works in both.
--insecure workaround is correct.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions

github-actions Bot commented May 21, 2026

Contributor

🚀 Build Images Ready

Images are ready for commit f6b3c44. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.11.x-1130-gf6b3c44a6c

@coderabbitai

coderabbitai Bot commented May 21, 2026

Contributor

Walkthrough

A new Claude Code skill document provides end-to-end guidance for verifying StackRox changes on live Kubernetes/OpenShift clusters. The skill defines nine sequential phases (0–8): environment setup and tool validation, cluster discovery and architecture detection, StackRox Central discovery with authentication and optional deployment, change impact analysis to determine rebuild scope, cross-compilation, image creation and push to ttl.sh, deployment patching with health checks, context-driven test execution, and PR-ready reporting.

Changes

StackRox Verification Skill

| Layer / File(s) | Summary |
|---|---|
| **Skill overview and setup prerequisites**<br>`.claude/skills/verify/SKILL.md` | Skill metadata, YOLO-mode confirmation skipping, tool validation (crane vs Docker fallback), kubeconfig selection, cluster connectivity verification, and architecture detection. |
| **StackRox Central discovery and optional deployment**<br>`.claude/skills/verify/SKILL.md` | StackRox Central detection via route/LB/port-forward, credential sourcing (environment/password files/admin fallback), authentication validation, and optional deployment via roxctl when Central is not found. |
| **Build and image pipeline**<br>`.claude/skills/verify/SKILL.md` | Git-diff change analysis mapped to affected components and images, cross-compilation build with architecture defaults, TTL tag generation, tar layer creation, and image push via crane mutate (with Docker fallback). |
| **Deployment, testing, and reporting**<br>`.claude/skills/verify/SKILL.md` | Deployment patching with pod health checks and log capture, context-driven test execution (bug repro, fix verification, API testing, Go e2e, manual), and concise PR-ready summary including built components, image references, patched deployments, and test outcomes. |

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Title check | ✅ Passed | The PR title 'chore(AI): add /verify-in-cluster skill for live cluster testing' directly describes the main change: adding a new Claude Code skill for in-cluster verification testing. It is concise, specific, and accurately reflects the primary objective of the PR. |
| Description check | ✅ Passed | The PR description is comprehensive and well-structured, covering the skill's purpose, implementation phases, Roxie integration strategy, design decisions, and validation methodology. |

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 3

🧹 Nitpick comments (1)
.claude/skills/verify/SKILL.md (1)

289-292: 💤 Low value

Consider a fallback for uuidgen availability.

The uuidgen command may not be available on all systems. Consider checking for its availability or using a more portable alternative.

🔧 Portable alternative using date and random
Generate a unique tag:

```diff
-TAG="ttl.sh/$(uuidgen | tr '[:upper:]' '[:lower:]'):2h"
+if command -v uuidgen >/dev/null; then
+  TAG="ttl.sh/$(uuidgen | tr '[:upper:]' '[:lower:]'):2h"
+else
+  TAG="ttl.sh/stackrox-$(date +%s)-$RANDOM:2h"
+fi
```
<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/verify/SKILL.md:

  • Around line 165-169: The export block showing "export ROX_ADMIN_PASSWORD" and
    "export API_ENDPOINT" conflicts with earlier warnings that shell state does not
    persist between Bash tool calls; update the SKILL.md section containing these
    export commands so it does not imply persistence—either remove the export lines
    entirely or replace them with a clear, explicit one-line note (near the existing
    export ROX_ADMIN_PASSWORD / API_ENDPOINT text) stating these are reference
    values only and that the literal values must be substituted into each following
    command (e.g., "Reference only — substitute ROX_ADMIN_PASSWORD and API_ENDPOINT
    literal values into each command; shell exports do not persist across tool
    calls"). Ensure the text mentions the exact symbols ROX_ADMIN_PASSWORD and
    API_ENDPOINT so readers can find and update the subsequent examples.
  • Around line 316-324: The crane mutate calls always use a hardcoded platform of
    linux/amd64; change them to use the detected architecture from Phase 1 by
    substituting the GOARCH-derived platform (e.g., linux/arm64 or linux/amd64) into
    the --platform flag for the crane mutate commands that use CURRENT_CENTRAL_IMAGE
    and TAG (and the later crane mutate for CURRENT_MINIO_IMAGE); ensure the GOARCH
    value is preserved into the shell invocation that runs crane (export GOARCH or
    inline the literal platform string computed earlier) so the correct manifest is
    selected for the cluster architecture.
  • Around line 71-92: The docs warn that shell state/variables like $ORCH_CMD
    won't persist but then use those variables in examples; fix by replacing those
    examples with explicit literals or clear placeholders and a brief note: e.g.,
    show both commands ("oc" / "kubectl") instead of $ORCH_CMD, replace
    $API_ENDPOINT with "<endpoint>" and $ROX_ADMIN_PASSWORD with "<password>", and
    update every occurrence of $ORCH_CMD, $API_ENDPOINT, $ROX_ADMIN_PASSWORD (and
    similar shell vars) in the file so examples are consistent with the persistence
    warning and include the alternative literal commands where appropriate.

Nitpick comments:
In @.claude/skills/verify/SKILL.md:

  • Around line 289-292: The TAG generation currently assumes the uuidgen command
    is present; update the logic around TAG to first check availability of uuidgen
    (e.g., using command -v uuidgen >/dev/null) and if present keep
    TAG="ttl.sh/$(uuidgen | tr '[:upper:]' '[:lower:]'):2h", otherwise set a
    portable fallback (for example TAG="ttl.sh/stackrox-$(date +%s)-$RANDOM:2h") so
    TAG is always defined even on systems without uuidgen; modify the section that
    defines TAG to implement this conditional.

</details>

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Central YAML (base), Organization UI (inherited)

**Review profile**: CHILL

**Plan**: Enterprise

**Run ID**: `b5fcd7df-03e1-4abb-ba49-a51142bfaf66`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 92e58561a48f8981d92c3877bf8161b72992ac7f and fa027c0ad30e076cd7b29c87e77fbcdda3bec2e9.

</details>

<details>
<summary>📒 Files selected for processing (1)</summary>

* `.claude/skills/verify/SKILL.md`

</details>

</details>

Comment thread .claude/skills/verify/SKILL.md Outdated
Comment thread .claude/skills/verify/SKILL.md Outdated
Comment thread .claude/skills/verify/SKILL.md Outdated
guzalv and others added 4 commits May 21, 2026 23:15
- Remove $ORCH_CMD, use literal `oc` (with `kubectl` alternative noted)
- Replace ${ROX_ADMIN_PASSWORD}/${API_ENDPOINT} with <password>/<endpoint>
- Replace $TAG with <tag>, $CURRENT_*_IMAGE with <current-*-image>
- Remove misleading export block, replace with persistence note
- Use detected <arch> in --platform and GOARCH instead of hardcoded amd64
- Remove redundant "shell state reminder" paragraph (Phase 0 covers it)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
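
The architecture fix in the commit above comes down to deriving the platform string from the detected architecture rather than hardcoding `linux/amd64`. A minimal sketch, assuming the architecture arrives either as a Go-style name (`amd64`, `arm64`) or a `uname -m` style name (`x86_64`, `aarch64`); `platform_for_arch` is a hypothetical helper name:

```bash
# Hypothetical helper: turn a detected node architecture into the
# "linux/<arch>" string expected by a --platform flag.
# uname-style names are normalised to Go-style names; anything else is
# passed through unchanged.
platform_for_arch() {
  case "$1" in
    "")      echo "architecture is required" >&2; return 1 ;;
    x86_64)  arch=amd64 ;;
    aarch64) arch=arm64 ;;
    *)       arch=$1 ;;
  esac
  printf 'linux/%s\n' "$arch"
}
```

The result would be substituted literally into the `--platform` argument of each `crane mutate` call, and the same architecture into `GOARCH` for the build, since shell variables do not persist between tool calls.
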
Compliance runs as a container in the collector DaemonSet. The
DaemonSet controller handles rollout — sensor is not involved.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Only go, curl, jq are hard requirements. For cluster access, either
oc or kubectl suffices. For image push, either crane or docker suffices.
Stop only if neither alternative is available.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
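
The "either tool suffices" rule in the commit above could be sketched as follows. Both function names are hypothetical; the tool lists come from the commit message, and the error messages are illustrative:

```bash
# Hypothetical helper: print the first command from the argument list that
# is installed, or return non-zero if none is.
first_available() {
  for cmd in "$@"; do
    if command -v "$cmd" >/dev/null 2>&1; then
      printf '%s\n' "$cmd"
      return 0
    fi
  done
  return 1
}

# Hypothetical prerequisite check: go, curl and jq are hard requirements;
# oc/kubectl and crane/docker are either-or alternatives.
check_prereqs() {
  for tool in go curl jq; do
    command -v "$tool" >/dev/null 2>&1 || {
      echo "missing required tool: $tool" >&2
      return 1
    }
  done
  cluster_cli=$(first_available oc kubectl) || {
    echo "need oc or kubectl" >&2
    return 1
  }
  image_cli=$(first_available crane docker) || {
    echo "need crane or docker" >&2
    return 1
  }
  echo "cluster CLI: $cluster_cli, image tool: $image_cli"
}
```

Calling `check_prereqs` stops only when a hard requirement is missing or when neither alternative of a pair is installed, which matches the behaviour the commit describes.
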
@guzalv guzalv marked this pull request as ready for review May 21, 2026 21:51
@guzalv guzalv changed the title from "feat: add /verify Claude Code skill for cluster testing" to "AI: add /verify Claude Code skill for cluster testing" May 21, 2026
@guzalv

guzalv commented May 21, 2026

Contributor Author

[Claude writing on behalf of @gualvare]

Regarding the uuidgen nitpick: skipping this. uuidgen ships with macOS and virtually all Linux distros (part of util-linux). The code examples in this skill are illustrative, not copy-paste scripts — the agent executing the skill can handle tool availability on its own. Adding a fallback would be overengineering for a skill document.

All three major findings (shell variables, export block, crane --platform) were fixed in 95dfaae. CodeRabbit reviewed an older commit (fa027c0) — the current HEAD (001b2ac) has all these addressed.

@guzalv guzalv requested a review from a team May 21, 2026 21:53
guzalv added 4 commits May 22, 2026 00:05
Expand the "Fix verification" section to include a full regression
proof workflow: verify the fix works, then stash it, rebuild/redeploy
the old code, reproduce the bug, and present before/after evidence.

This gives PR reviewers concrete proof that the changes actually fix
the reported issue.

Partially generated by AI.

Instead of assuming `git stash`, tell the agent to figure out what
"the fix" is from context (uncommitted changes, single commit,
feature branch, etc.) and choose the appropriate revert strategy.
Ask the user when it can't be determined confidently.

Partially generated by AI.

When charmbracelet/freeze is available, render verification command
outputs as PNG images (saved to /tmp/verify-*.png) that callers can
attach to PR descriptions. freeze is optional — plain text evidence
is still captured when it's not installed.

Partially generated by AI.

Instead of rendering individual commands, accumulate a curated session
log throughout the verification (with section headers, stripped of
noise), then render the whole thing as one image via freeze at the end.

Produces a single cohesive proof image that tells the full story:
build, deploy, with-fix test, without-fix regression, result.

Partially generated by AI.
@guzalv guzalv changed the title from "AI: add /verify Claude Code skill for cluster testing" to "chore(AI): add /verify Claude Code skill for cluster testing" May 21, 2026
@guzalv guzalv changed the title from "chore(AI): add /verify Claude Code skill for cluster testing" to "chore(AI): add /verify Claude skill for cluster testing" May 21, 2026
guzalv and others added 3 commits May 25, 2026 09:47
For changes to existing logic (bug fixes, behavior changes), the skill
now instructs the agent to produce two separate proof logs/images:

- verify-before: deploy base branch, run verification, capture output
- verify-after: deploy fix, run same verification, capture output

Both are tested against the real cluster. For new features (no prior
behavior), only the AFTER proof is produced.

Partially generated by AI.

- Phase 2c: roxctl requires MainVersion ldflag — panics on empty version.
  Document the exact build command with -X flag, oc symlink workaround,
  helm/MONITORING_SUPPORT=false for environments missing helm.
- Phase 3: clarify YOLO + no changes exits cleanly instead of ambiguously
  trying to ask the user.
- Phase 4: note that roxctl is the exception requiring version ldflags;
  central/migrator/sensor run fine without them.

Partially generated by AI

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@guzalv guzalv changed the title from "chore(AI): add /verify Claude skill for cluster testing" to "chore(AI): add /verify-in-cluster Claude skill for live cluster testing" May 26, 2026
@guzalv guzalv changed the title from "chore(AI): add /verify-in-cluster Claude skill for live cluster testing" to "chore(AI): add /verify-in-cluster skill for live cluster testing" May 26, 2026
Guzman Alvarez and others added 6 commits May 26, 2026 09:33
…skill

Add notes about port 8000 conflicts in cloud workspace environments (use
alternative port like 18443) and document DB migration version mismatch
error when source tree has newer migrations than the base nightly image.
Both issues were discovered during live cluster testing of the skill.

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Phase 2c now fetches the latest CI-built image tag from quay.io/stackrox-io/
whose embedded commit is on origin/master, instead of using `make tag` which
only works with a personal registry. Also configures all supporting image repos
(central-db, scanner-v4, collector, etc.) from Quay.

Phase 5 adds guidance on base image compatibility — the base image for
crane mutate must have a compatible DB migration sequence to avoid crashes.

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

quay.io is ~3.4x faster than ttl.sh for pushing images (38s vs 2m10s
in testing), even when uploading all blobs vs only new ones. The skill
now checks for quay.io credentials first and falls back to ttl.sh
when unavailable.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Pipe `crane auth get quay.io` through jq to extract only the
username and authentication status, preventing the secret/token
from being displayed in tool output.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…erify skill

Three issues found during from-scratch deployment testing:

1. Registry: use quay.io/<user>/stackrox/main (existing public repo) instead of
   quay.io/<user>/stackrox-verify (auto-created private, cluster can't pull).
   Verify repo is public via Quay API before using; fall back to ttl.sh.

2. roxctl build: ScannerVersion ldflag is required — the embedded Helm chart
   uses `required "" .ScannerImageTag` which is populated from this value.
   Also must build for host platform (not GOOS=linux) since roxctl runs
   locally to generate deployment configs.

3. Tags: use short 8-char UUIDs for readability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
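
The short-tag idea from item 3 above can be sketched as below, here for the ttl.sh fallback path. The helper names `short_id` and `ttl_tag` and the `/dev/urandom` fallback are assumptions for illustration; the `2h` lifetime follows the tag shown earlier in the review.

```bash
# Hypothetical helper: print 8 lowercase hex characters, taken from a UUID
# when uuidgen is installed, otherwise from 4 random bytes.
short_id() {
  if command -v uuidgen >/dev/null 2>&1; then
    uuidgen | tr '[:upper:]' '[:lower:]' | cut -c1-8
  else
    od -An -N4 -tx1 /dev/urandom | tr -d ' \n'
    echo
  fi
}

# Hypothetical helper: build an ephemeral ttl.sh image reference whose tag
# is the lifetime (default 2h).
ttl_tag() {
  printf 'ttl.sh/%s:%s\n' "$(short_id)" "${1:-2h}"
}
```

Because shell state does not persist across tool calls, the generated value would be printed once and then substituted literally as `<tag>` in later commands.
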
…uidance

Root cause: roxctl central generate bakes MainVersion into ALL image tags
(central, scanner-v4, scanner-v4-db, etc.). Using `make tag` produces a
`-dirty` suffix that doesn't exist on Quay, causing ImagePullBackOff for
scanner-v4-db and scanner-v4 pods.

Fix: build roxctl with MainVersion set to the Quay-fetched MAIN_IMAGE_TAG
instead of `make tag`. Also add post-deploy instructions to patch scanner-v4
and scanner-v4-db images with their independently-versioned Quay tags, since
scanner-v4 is built on a separate CI pipeline from the main image.

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vjwilson

Contributor

@guzalv
Would it be too out-of-scope to include the option to give this skill a PR build reference, instead of having it build the image locally?

When verifying a PR with no local modifications, the skill now checks
for a CI-built image posted by github-actions[bot] on the PR. If the
image tag matches HEAD, Phases 4-5 (build + push) are skipped entirely
and the CI image is used directly — faster and more complete (includes
UI, proper ldflags, matching DB migrations).

The fast-path is only used when all conditions are met: no local changes,
CI tag matches HEAD, gh CLI available. Otherwise falls through to the
normal local build path.

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
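
The "CI tag matches HEAD" condition above can be checked with plain string handling, assuming the tag follows the `git describe` shape seen in this PR (for example `4.11.x-1130-gf6b3c44a6c`, where the part after the last `-g` is an abbreviated commit hash). The function name `ci_tag_matches_head` is hypothetical:

```bash
# Hypothetical helper: succeed when the abbreviated commit embedded in a
# CI image tag (after the last "-g") is a prefix of the given HEAD commit.
# Tags without a "-g<sha>" suffix, or with a trailing "-dirty", do not match.
ci_tag_matches_head() {
  tag=$1
  head=$2
  short=${tag##*-g}
  [ "$short" != "$tag" ] || return 1   # no "-g" marker in the tag
  [ -n "$short" ] || return 1          # nothing after the marker
  case "$head" in
    "$short"*) return 0 ;;
    *)         return 1 ;;
  esac
}
```

A caller might combine it with the other conditions, for example requiring that `git status --porcelain` prints nothing and that `ci_tag_matches_head "<ci-tag>" "$(git rev-parse HEAD)"` succeeds, before skipping the build and push phases.
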
@guzalv

guzalv commented May 27, 2026

Contributor Author

> @guzalv Would it be too out-of-scope to include the option to give this skill a PR build reference, instead of having it build the image locally?

@vjwilson thanks! That is a use case I had not thought of: "verify someone else's PR", or "verify something that AI generated before having verified it first".

The main focus of the skill is to verify before creating a PR at all, and in that way iterate faster, spend less time checking CI failures, and save CI costs. From that point of view, building locally is the right approach.

But what you suggest is also valid, and many steps are shared, so I have added support for using PR image builds in d7e10ce.

Test in ambient:

PASS: the CI fast-path worked. The agent:

  • Checked out the PR branch, found no local changes
  • Found the CI bot comment with MAIN_IMAGE_TAG=4.11.x-1099-gaad2d04064
  • Verified it matched HEAD and skipped build and crane push entirely
  • Deployed using the CI image directly
  • Ran before/after unit tests comparing base vs PR behavior
  • Confirmed the fix works: timeout and cancellation no longer delete compliance results

It even did a full fresh deploy from scratch and ran a proper before/after comparison. 66 tool calls, ~10 minutes. No user prompts needed.
The fast-path is working as designed: when CI has already built the image, there's no need to rebuild locally.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@vladbologa

Contributor

This looks great, but I think there's a lot of overlap with roxie, and you could remove ~300 lines from the skill by using it instead. The following is an AI generated report:

We have an internal developer tool (roxie) that already handles most of the cluster setup and deployment logic in this skill. We're also in the process of replacing CI shell scripts with roxie, so it would be good to align this skill with that direction rather than reimplementing the same logic in markdown.

Specifically, these phases could delegate to roxie instead of rolling their own shell sequences:

Phase 0 (Prerequisites Check) — ~40 of 62 lines

  • Cluster tool detection, registry auth probing, and image verification are all handled by roxie automatically.
  • The crane/ttl.sh fallback logic for the image-mutation path is novel and should stay.

Phase 1 (Discover Cluster) — all 33 lines

  • Roxie auto-detects cluster type (GKE, OpenShift, Kind, Minikube, K3s, CRC), context, and node architecture. A single roxie env replaces this entire phase.

Phase 2 (Find StackRox and Authenticate) — all 206 lines

  • 2a/2b (namespace detection, endpoint discovery, credential management): Roxie handles endpoint wiring and admin password generation/retrieval automatically.
  • 2c (deploy from scratch): This is the biggest concern: 130 lines of fragile shell that hardcodes Quay tag resolution, roxctl ldflags (including the ScannerVersion workaround), 15+ env vars, and post-deploy image fixups for scanner-v4-db. All of this is a single roxie deploy both. This section is the most likely to rot as deployment internals change.

Phase 6 (Deploy) — ~15 of 59 lines

  • Health checks, rollout waiting, and port-forward management overlap with roxie's built-in wait and port-forward logic.

Total: ~294 of 721 lines (41%) could be replaced by shelling out to roxie for cluster setup and deployment, while keeping the genuinely novel parts of this skill: change analysis, crane-based image mutation, test execution, and proof generation.

@mclasmeier

Contributor

I'd like to second what @vladbologa said.

The purpose of roxie is to provide modern, flexible, operator-based ACS deployment capabilities specifically designed for the use-cases of engineers. This will allow us to get rid of the deployment shell scripts -- work is underway to integrate roxie there as we speak.

Building something new on top of the deploy shell scripts nowadays, instead of just using what we have built over the last months, suggests that I didn't do a good job of sharing project status updates on roxie. :-(

Remaining features targeted at what ACS engineers expect from roxie are in the process of being implemented right now. If there is anything missing that you (or anyone from the team) needs, please tell us. Or if there is anything else holding you back from adopting roxie, please tell us.

Sharing the link again:
https://github.com/stackrox/roxie

@vikin91

vikin91 commented Jun 2, 2026

Contributor

@guzalv this discussion shows how many of us have different preferences regarding the art of everyday ACS developer work. That is why I only selectively publish my skills and I do that outside of the stackrox repo.

Guzman Alvarez and others added 2 commits June 3, 2026 10:51
… skill

Replace ~300 lines of manual cluster discovery, deployment scripting, and
credential management with Roxie CLI commands. Roxie handles cluster type
detection, operator deployment, Central+SecuredCluster CRs, readiness
waiting, and credential generation in a single command.

Key changes:
- Phase 0: Add roxie and roxctl as prerequisites
- Phase 1: Use `roxie env` for cluster detection
- Phase 2a: Check acs-central namespace (Roxie's default) alongside stackrox
- Phase 2b: Read credentials from Roxie manifest secret
- Phase 2c: Replace deploy.sh + 15 env vars + Quay tag resolution +
  scanner-v4-db fixups with `roxie deploy both --tag <tag> --envrc`
- Phase 5: Add note about private registry (rhacs-eng) fallback to
  public registry (stackrox-io) for crane base image pulls
- Remove all deploy.sh-specific workarounds (roxctl ldflags, helm,
  MONITORING_SUPPORT, post-deploy image patches)

Tested against live OpenShift cluster with both local-changes path
(crane mutate) and CI fast-path (PR image). Both pass.

Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Phase 2b now prefers `roxie shell -- bash -c 'echo ...'` to retrieve
endpoint and credentials from the saved manifest, with manual fallback
for non-Roxie deployments. This replaces ~20 lines of manual endpoint
discovery with a single command.

AI-generated.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@guzalv

guzalv commented Jun 3, 2026

Contributor Author

Thanks @vladbologa and @mclasmeier! I have refactored the skill to use roxie, tested it, and it works perfectly. I passed your analysis to Claude, and after several trials it settled on an approach that uses Roxie mostly as suggested but keeps certain existing bits. I asked it to update the PR description with details. Would you like to check whether that makes sense?

I think you’ve done a good job at marketing roxie, because I was aware it exists; it’s on my side that I didn’t “internalize” it, so to speak.

Some background in case you’re interested: when I joined I got a powerful machine which builds images faster than CI, so I looked into deploying those. I found the deploy/ scripts, and then ended up rolling my own set of scripts to build/push/deploy tailored to my workflow, in part because I thought that would provide insight into install methods, which it did.

Then I heard about roxie several times, and while I found it superior and wanted to look into it, I never actually did, and kept my workflow because “it works”.

When I started using agents to deploy stackrox, I told them to use those scripts because I’m familiar with them so I could babysit the agent and iterate on a better prompt/skill.

When we discussed sharing agentic workflows in the team, I thought of sharing this, which is serving me very well, because I didn’t see it anywhere else. But instead of having the skill use my own scripts I told it to use the deploy/ framework: the end result would be the same, and it would be easier on the eyes of anyone who bothered to read it.

At that point I should have gone with roxie, but it just didn’t come to mind, I guess because what my scripts replace is the deploy/ framework, not roxie.

@vikin91 I think this shows quite well the beauty of collaborating on aspects that are needed by many people: I proposed something, got feedback, adapted, and when it gets merged everyone will have the opportunity to automatically use the best solution transparently, and learn about it if they are interested. I see skills as equivalent to scripts in "the old times" :D : some make sense within projects and others outside.

@guzalv

guzalv commented Jun 15, 2026

Contributor Author

@guzalv guzalv closed this Jun 15, 2026
5 participants