chore(AI): add /verify-in-cluster skill for live cluster testing #20762
guzalv wants to merge 25 commits into
Add a skill that automates the build-deploy-test cycle against a live cluster. It handles cluster discovery, StackRox auth, cross-compilation, crane-based image mutation, deployment patching, and test execution. Reviewed by two independent parallel agents; findings addressed. Tested locally against an OpenShift cluster (central + migrator build, ttl.sh push, deployment patch, API verification). AI-assisted: code partially generated by AI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Skipping CI for Draft Pull Request.
- crane manifest ttl.sh/test:1h returns a valid manifest, not MANIFEST_UNKNOWN
- Remove incorrect claim that TLS errors are "common on macOS"
- Default to stackrox namespace, fall back to rhacs-operator instead of searching all namespaces

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The x509 error is caused by Claude Code's sandbox network proxy intercepting Go's crypto/tls connections. Tested: crane fails inside sandbox, works outside; curl (different TLS impl) works in both. --insecure workaround is correct. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
🚀 Build Images Ready
Images are ready for commit f6b3c44. To use with deploy scripts: export MAIN_IMAGE_TAG=4.11.x-1130-gf6b3c44a6c
📝 Walkthrough
A new Claude Code skill document provides end-to-end guidance for verifying StackRox changes on live Kubernetes/OpenShift clusters. The skill defines eight sequential operational phases: environment setup and tool validation, cluster discovery and architecture detection, StackRox Central discovery with authentication and optional deployment, change impact analysis to determine rebuild scope, cross-compilation, image creation and push to ttl.sh, deployment patching with health checks, context-driven test execution, and PR-ready reporting.
Changes
StackRox Verification Skill
🎯 3 (Moderate) | ⏱️ ~25 minutes
🚥 Pre-merge checks | ✅ Passed checks (5 passed)
Actionable comments posted: 3
🧹 Nitpick comments (1)
.claude/skills/verify/SKILL.md (1)
289-292: 💤 Low value

Consider a fallback for `uuidgen` availability.

The `uuidgen` command may not be available on all systems. Consider checking for its availability or using a more portable alternative.

🔧 Portable alternative using date and random

Generate a unique tag:

```diff
-TAG="ttl.sh/$(uuidgen | tr '[:upper:]' '[:lower:]'):2h"
+if command -v uuidgen >/dev/null; then
+  TAG="ttl.sh/$(uuidgen | tr '[:upper:]' '[:lower:]'):2h"
+else
+  TAG="ttl.sh/stackrox-$(date +%s)-$RANDOM:2h"
+fi
```

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.claude/skills/verify/SKILL.md around lines 289 - 292, The TAG generation
currently assumes the uuidgen command is present; update the logic around TAG to
first check availability of uuidgen (e.g., using command -v uuidgen >/dev/null)
and if present keep TAG="ttl.sh/$(uuidgen | tr '[:upper:]' '[:lower:]'):2h",
otherwise set a portable fallback (for example TAG="ttl.sh/stackrox-$(date
+%s)-$RANDOM:2h") so TAG is always defined even on systems without uuidgen;
modify the section that defines TAG to implement this conditional.

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.claude/skills/verify/SKILL.md:
- Around line 165-169: The export block showing "export ROX_ADMIN_PASSWORD" and
"export API_ENDPOINT" conflicts with earlier warnings that shell state does not
persist between Bash tool calls; update the SKILL.md section containing these
export commands so it does not imply persistence: either remove the export lines
entirely or replace them with a clear, explicit one-line note (near the existing
export ROX_ADMIN_PASSWORD / API_ENDPOINT text) stating these are reference
values only and that the literal values must be substituted into each following
command (e.g., "Reference only: substitute ROX_ADMIN_PASSWORD and API_ENDPOINT
literal values into each command; shell exports do not persist across tool
calls"). Ensure the text mentions the exact symbols ROX_ADMIN_PASSWORD and
API_ENDPOINT so readers can find and update the subsequent examples.
- Around line 316-324: The crane mutate calls always use a hardcoded platform of
linux/amd64; change them to use the detected architecture from Phase 1 by
substituting the GOARCH-derived platform (e.g., linux/arm64 or linux/amd64) into
the --platform flag for the crane mutate commands that use CURRENT_CENTRAL_IMAGE
and TAG (and the later crane mutate for CURRENT_MINIO_IMAGE); ensure the GOARCH
value is preserved into the shell invocation that runs crane (export GOARCH or
inline the literal platform string computed earlier) so the correct manifest is
selected for the cluster architecture.
- Around line 71-92: The docs warn that shell state/variables like $ORCH_CMD
won't persist but then use those variables in examples; fix by replacing those
examples with explicit literals or clear placeholders and a brief note: e.g.,
show both commands ("oc" / "kubectl") instead of $ORCH_CMD, replace
$API_ENDPOINT with "" and $ROX_ADMIN_PASSWORD with "", and
update every occurrence of $ORCH_CMD, $API_ENDPOINT, $ROX_ADMIN_PASSWORD (and
similar shell vars) in the file so examples are consistent with the persistence
warning and include the alternative literal commands where appropriate.

Nitpick comments:
In @.claude/skills/verify/SKILL.md:
- Around line 289-292: The TAG generation currently assumes the uuidgen command
is present; update the logic around TAG to first check availability of uuidgen
(e.g., using command -v uuidgen >/dev/null) and if present keep
TAG="ttl.sh/$(uuidgen | tr '[:upper:]' '[:lower:]'):2h", otherwise set a
portable fallback (for example TAG="ttl.sh/stackrox-$(date +%s)-$RANDOM:2h") so
TAG is always defined even on systems without uuidgen; modify the section that
defines TAG to implement this conditional.

ℹ️ Review info

⚙️ Run configuration
**Configuration used**: Central YAML (base), Organization UI (inherited)
**Review profile**: CHILL
**Plan**: Enterprise
**Run ID**: `b5fcd7df-03e1-4abb-ba49-a51142bfaf66`

📥 Commits
Reviewing files that changed from the base of the PR and between 92e58561a48f8981d92c3877bf8161b72992ac7f and fa027c0ad30e076cd7b29c87e77fbcdda3bec2e9.

📒 Files selected for processing (1)
* `.claude/skills/verify/SKILL.md`
- Remove $ORCH_CMD, use literal `oc` (with `kubectl` alternative noted)
- Replace ${ROX_ADMIN_PASSWORD}/${API_ENDPOINT} with <password>/<endpoint>
- Replace $TAG with <tag>, $CURRENT_*_IMAGE with <current-*-image>
- Remove misleading export block, replace with persistence note
- Use detected <arch> in --platform and GOARCH instead of hardcoded amd64
- Remove redundant "shell state reminder" paragraph (Phase 0 covers it)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
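The move from a hardcoded amd64 to the detected `<arch>` could be sketched as a small mapping from a node architecture string to a `--platform` value. This is a minimal illustration, not the skill's actual text: `platform_for_arch` is a hypothetical helper name, and the commented commands assume the architecture is read from the first node's `nodeInfo`.

```bash
# platform_for_arch ARCH: maps an architecture name (as reported by Kubernetes
# or by uname -m) to a value suitable for crane's --platform flag.
# Hypothetical helper for illustration only.
platform_for_arch() {
  case "$1" in
    amd64|arm64|ppc64le|s390x) echo "linux/$1" ;;
    x86_64)                    echo "linux/amd64" ;;
    aarch64)                   echo "linux/arm64" ;;
    *) echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Usage against a cluster (substitute literals, since shell state does not
# persist between tool calls):
#   oc get nodes -o jsonpath='{.items[0].status.nodeInfo.architecture}'
#   crane mutate <current-central-image> --platform linux/<arch> --append layer.tar -t <tag>
```

The same `<arch>` value would also be used for `GOARCH` when cross-compiling, so the binary and the selected image manifest agree.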
Compliance runs as a container in the collector DaemonSet. The DaemonSet controller handles rollout — sensor is not involved. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Only go, curl, jq are hard requirements. For cluster access, either oc or kubectl suffices. For image push, either crane or docker suffices. Stop only if neither alternative is available. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
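The "hard requirement versus either-or" rule described above might look roughly like the following check. It is a sketch under stated assumptions: the helper names `have`, `first_available`, and `check_prereqs` are invented for illustration and are not taken from the skill.

```bash
# have TOOL: succeeds if TOOL is on PATH.
have() { command -v "$1" >/dev/null 2>&1; }

# first_available TOOL...: prints the first tool found on PATH; fails if none is.
first_available() {
  for tool in "$@"; do
    if have "$tool"; then
      printf '%s\n' "$tool"
      return 0
    fi
  done
  return 1
}

# check_prereqs: go, curl and jq are mandatory; for cluster access and image
# push, one tool of each pair is enough. Stops only if neither is available.
check_prereqs() {
  for required in go curl jq; do
    have "$required" || { echo "missing required tool: $required" >&2; return 1; }
  done
  cluster_cli=$(first_available oc kubectl) || { echo "need oc or kubectl" >&2; return 1; }
  push_tool=$(first_available crane docker) || { echo "need crane or docker" >&2; return 1; }
  echo "cluster CLI: $cluster_cli, image push: $push_tool"
}
```

Calling `check_prereqs` once up front reports which alternative was picked, so later steps can use the literal tool name.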
[Claude writing on behalf of @gualvare] Regarding the All three major findings (shell variables, export block, crane --platform) were fixed in 95dfaae. CodeRabbit reviewed an older commit (fa027c0); the current HEAD (001b2ac) has all these addressed.
Expand the "Fix verification" section to include a full regression proof workflow: verify the fix works, then stash it, rebuild/redeploy the old code, reproduce the bug, and present before/after evidence. This gives PR reviewers concrete proof that the changes actually fix the reported issue. Partially generated by AI.
Instead of assuming `git stash`, tell the agent to figure out what "the fix" is from context (uncommitted changes, single commit, feature branch, etc.) and choose the appropriate revert strategy. Ask the user when it can't be determined confidently. Partially generated by AI.
When charmbracelet/freeze is available, render verification command outputs as PNG images (saved to /tmp/verify-*.png) that callers can attach to PR descriptions. freeze is optional — plain text evidence is still captured when it's not installed. Partially generated by AI.
Instead of rendering individual commands, accumulate a curated session log throughout the verification (with section headers, stripped of noise), then render the whole thing as one image via freeze at the end. Produces a single cohesive proof image that tells the full story: build, deploy, with-fix test, without-fix regression, result. Partially generated by AI.
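A curated session log of this kind could be accumulated with two small helpers and rendered once at the end. This is a hedged sketch: the function names and the default log path are assumptions, and the `freeze` invocation assumes its `--output` and `--language` flags behave as documented in its README.

```bash
# Default log location is an assumption; override LOG_FILE as needed.
LOG_FILE="${LOG_FILE:-/tmp/verify-session.log}"

# log_section TITLE: starts a new section in the session log.
log_section() { printf '\n=== %s ===\n' "$1" >> "$LOG_FILE"; }

# log_cmd CMD...: records the command line and its combined output,
# and returns the command's own exit status.
log_cmd() {
  printf '$ %s\n' "$*" >> "$LOG_FILE"
  "$@" >> "$LOG_FILE" 2>&1
}

# render_proof: renders the whole log as one PNG when freeze is installed;
# otherwise the plain-text log remains the evidence.
render_proof() {
  if command -v freeze >/dev/null 2>&1; then
    freeze "$LOG_FILE" --language text --output "${LOG_FILE%.log}.png"
  else
    echo "freeze not installed; plain-text evidence is in $LOG_FILE"
  fi
}
```

Keeping rendering in a single final step means the log can be trimmed of noise before the image is produced.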
For changes to existing logic (bug fixes, behavior changes), the skill now instructs the agent to produce two separate proof logs/images:

- verify-before: deploy base branch, run verification, capture output
- verify-after: deploy fix, run same verification, capture output

Both are tested against the real cluster. For new features (no prior behavior), only the AFTER proof is produced. Partially generated by AI.
- Phase 2c: roxctl requires MainVersion ldflag: panics on empty version. Document the exact build command with -X flag, oc symlink workaround, helm/MONITORING_SUPPORT=false for environments missing helm.
- Phase 3: clarify YOLO + no changes exits cleanly instead of ambiguously trying to ask the user.
- Phase 4: note that roxctl is the exception requiring version ldflags; central/migrator/sensor run fine without them.

Partially generated by AI
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…skill Add notes about port 8000 conflicts in cloud workspace environments (use alternative port like 18443) and document DB migration version mismatch error when source tree has newer migrations than the base nightly image. Both issues were discovered during live cluster testing of the skill. Partially generated by AI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 2c now fetches the latest CI-built image tag from quay.io/stackrox-io/ whose embedded commit is on origin/master, instead of using `make tag` which only works with a personal registry. Also configures all supporting image repos (central-db, scanner-v4, collector, etc.) from Quay. Phase 5 adds guidance on base image compatibility — the base image for crane mutate must have a compatible DB migration sequence to avoid crashes. Partially generated by AI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
quay.io is ~3.4x faster than ttl.sh for pushing images (38s vs 2m10s in testing), even when uploading all blobs vs only new ones. The skill now checks for quay.io credentials first and falls back to ttl.sh when unavailable. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pipe `crane auth get quay.io` through jq to extract only the username and authentication status, preventing the secret/token from being displayed in tool output. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…erify skill

Three issues found during from-scratch deployment testing:

1. Registry: use quay.io/<user>/stackrox/main (existing public repo) instead of quay.io/<user>/stackrox-verify (auto-created private, cluster can't pull). Verify repo is public via Quay API before using; fall back to ttl.sh.
2. roxctl build: ScannerVersion ldflag is required: the embedded Helm chart uses `required "" .ScannerImageTag` which is populated from this value. Also must build for host platform (not GOOS=linux) since roxctl runs locally to generate deployment configs.
3. Tags: use short 8-char UUIDs for readability.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
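Short tags like the ones described in item 3 could be generated as below. This is an illustrative sketch: `short_id`, `quay_ref`, and `ttl_ref` are hypothetical names, the `cksum` fallback is one arbitrary way to cope with a missing `uuidgen`, and the `:2h` TTL follows the earlier ttl.sh examples in this thread.

```bash
# short_id: prints 8 lowercase hex characters. Uses uuidgen when present,
# otherwise hashes the current time and PID (assumed fallback, not from the skill).
short_id() {
  if command -v uuidgen >/dev/null 2>&1; then
    uuidgen | tr '[:upper:]' '[:lower:]' | cut -c1-8
  else
    printf '%s-%s' "$(date +%s)" "$$" | cksum | awk '{ printf "%08x\n", $1 }'
  fi
}

# quay_ref USER: image reference in the user's public stackrox/main repo.
quay_ref() { printf 'quay.io/%s/stackrox/main:%s\n' "$1" "$(short_id)"; }

# ttl_ref: anonymous ttl.sh reference that expires after two hours.
ttl_ref() { printf 'ttl.sh/%s:2h\n' "$(short_id)"; }
```

On ttl.sh the tag carries the expiry, so the random part moves into the image name rather than the tag.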
…uidance Root cause: roxctl central generate bakes MainVersion into ALL image tags (central, scanner-v4, scanner-v4-db, etc.). Using `make tag` produces a `-dirty` suffix that doesn't exist on Quay, causing ImagePullBackOff for scanner-v4-db and scanner-v4 pods. Fix: build roxctl with MainVersion set to the Quay-fetched MAIN_IMAGE_TAG instead of `make tag`. Also add post-deploy instructions to patch scanner-v4 and scanner-v4-db images with their independently-versioned Quay tags, since scanner-v4 is built on a separate CI pipeline from the main image. Partially generated by AI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@guzalv
When verifying a PR with no local modifications, the skill now checks for a CI-built image posted by github-actions[bot] on the PR. If the image tag matches HEAD, Phases 4-5 (build + push) are skipped entirely and the CI image is used directly — faster and more complete (includes UI, proper ldflags, matching DB migrations). The fast-path is only used when all conditions are met: no local changes, CI tag matches HEAD, gh CLI available. Otherwise falls through to the normal local build path. Partially generated by AI. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
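The three fast-path conditions could be checked along these lines. This is a sketch under assumptions: the helper names are invented, and it assumes the CI tag embeds an abbreviated commit after `-g` in git-describe style, as in the `4.11.x-1130-gf6b3c44a6c` tag posted earlier in this thread.

```bash
# tag_commit TAG: prints the abbreviated commit embedded after the last "-g".
tag_commit() {
  case "$1" in
    *-g*) printf '%s\n' "${1##*-g}" ;;
    *)    return 1 ;;
  esac
}

# tag_matches_head TAG HEAD_SHA: succeeds if HEAD_SHA starts with the
# commit embedded in TAG.
tag_matches_head() {
  short=$(tag_commit "$1") || return 1
  case "$2" in
    "$short"*) return 0 ;;
    *)         return 1 ;;
  esac
}

# can_use_ci_image TAG: all conditions must hold; otherwise fall through
# to the normal local build path.
can_use_ci_image() {
  command -v gh >/dev/null 2>&1 || return 1          # gh CLI available
  [ -z "$(git status --porcelain)" ] || return 1     # no local changes
  tag_matches_head "$1" "$(git rev-parse HEAD)"      # CI tag matches HEAD
}
```

Keeping the tag comparison in a pure function makes it easy to check without a cluster or a GitHub token.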
@vjwilson thanks! That is a use case I had not thought of: "verify someone else's PR", or "verify something that AI generated before having verified it first". The main focus of the skill is to verify before creating a PR at all, which lets you iterate faster, save time spent checking CI failures, and save CI costs. From that point of view, building locally is the right approach. But what you suggest is also valid and many steps are shared, so I have added support for using PR image builds in d7e10ce. Test in ambient:
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This looks great, but I think there's a lot of overlap with roxie, and you could remove ~300 lines from the skill by using it instead. The following is an AI-generated report:

We have an internal developer tool (roxie) that already handles most of the cluster setup and deployment logic in this skill. We're also in the process of replacing CI shell scripts with roxie, so it would be good to align this skill with that direction rather than reimplementing the same logic in markdown.

Specifically, these phases could delegate to roxie instead of rolling their own shell sequences:

- Phase 0 (Prerequisites Check): ~40 of 62 lines
- Phase 1 (Discover Cluster): all 33 lines
- Phase 2 (Find StackRox and Authenticate): all 206 lines
- Phase 6 (Deploy): ~15 of 59 lines

Total: ~294 of 721 lines (41%) could be replaced by shelling out to roxie for cluster setup and deployment, while keeping the genuinely novel parts of this skill: change analysis, crane-based image mutation, test execution, and proof generation.
I'd like to second what @vladbologa said. The purpose of roxie is to provide modern, flexible, operator-based ACS deployment capabilities specifically designed for the use cases of engineers. This will allow us to get rid of the deployment shell scripts; work is underway to integrate roxie there as we speak. Building something new on top of the deploy shell scripts nowadays, instead of just using what we have built during the last months, suggests that I didn't do a good job of sharing project status updates on roxie. :-( The remaining features that ACS engineers expect from roxie are being implemented right now. If there is anything missing that you (or anyone from the team) need, please tell us. Or if there is anything else holding you back from adopting roxie, please tell us. Sharing the link again:
@guzalv this discussion shows how many of us have different preferences regarding the art of everyday ACS developer work. That is why I only selectively publish my skills and I do that outside of the stackrox repo.
… skill

Replace ~300 lines of manual cluster discovery, deployment scripting, and credential management with Roxie CLI commands. Roxie handles cluster type detection, operator deployment, Central+SecuredCluster CRs, readiness waiting, and credential generation in a single command.

Key changes:
- Phase 0: Add roxie and roxctl as prerequisites
- Phase 1: Use `roxie env` for cluster detection
- Phase 2a: Check acs-central namespace (Roxie's default) alongside stackrox
- Phase 2b: Read credentials from Roxie manifest secret
- Phase 2c: Replace deploy.sh + 15 env vars + Quay tag resolution + scanner-v4-db fixups with `roxie deploy both --tag <tag> --envrc`
- Phase 5: Add note about private registry (rhacs-eng) fallback to public registry (stackrox-io) for crane base image pulls
- Remove all deploy.sh-specific workarounds (roxctl ldflags, helm, MONITORING_SUPPORT, post-deploy image patches)

Tested against live OpenShift cluster with both local-changes path (crane mutate) and CI fast-path (PR image). Both pass. Partially generated by AI.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Phase 2b now prefers `roxie shell -- bash -c 'echo ...'` to retrieve endpoint and credentials from the saved manifest, with manual fallback for non-Roxie deployments. This replaces ~20 lines of manual endpoint discovery with a single command. AI-generated. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Thanks @vladbologa and @mclasmeier! I have refactored the skill to use roxie, tested it and it works perfectly. I passed your analysis to Claude, and after several trials it settled on an approach that uses Roxie mostly as suggested but keeps certain existing bits. I asked it to update the PR description with details. Would you like to check whether that makes sense?

I think you’ve done a good job at marketing roxie, because I was aware it exists; it’s on my side that I didn’t “internalize” it, so to say.

Some background in case you’re interested: when I joined I got a powerful machine which builds images faster than CI, so I looked into deploying those. I found the deploy/ scripts, and then ended up rolling my own set of scripts to build/push/deploy tailored to my workflow, in part because I thought that would provide insight into install methods, which it did. Then I heard about roxie several times, and while I found it superior and wanted to look into it, I never actually did, and kept my workflow because “it works”.

When I started using agents to deploy stackrox, I told them to use those scripts because I’m familiar with them, so I could babysit the agent and iterate on a better prompt/skill. When we discussed sharing agentic workflows in the team, I thought of sharing this, which is serving me very well, because I didn’t see it anywhere else. But instead of having the skill use my own scripts I told it to use the deploy/ framework: the end result would be the same, and it would be easier on the eyes of other people who bothered to read it. At that point I should have gone with roxie, but it just didn’t come to my mind, I guess because what my scripts replace is the deploy/ framework, not roxie.

@vikin91 I think this shows quite well the beauty of collaborating on aspects that are needed by many people: I proposed something, got feedback, adapted, and when it gets merged everyone will have the opportunity to automatically use the best solution transparently, and learn about it if they are interested. I see skills as equivalent to scripts in "the old times" :D : some make sense within projects and others outside.

Description
Add a Claude Code `/verify-in-cluster` skill that automates the build-deploy-test cycle for StackRox code changes against a live Kubernetes/OpenShift cluster.
What the skill does
Phases 0–8 guide the agent through: cluster discovery, StackRox detection, authentication, change analysis, binary cross-compilation, image mutation via `crane mutate --append`, deployment patching, test execution, and result reporting.
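The core build, mutate, and patch steps of that flow can be pictured as a short command plan. The sketch below only prints the commands rather than running them, and several details are assumptions made for illustration: the helper name `plan_verify`, the staging directory `rootfs/`, the in-image path `/stackrox/central`, and the deployment, container, and namespace names.

```bash
# plan_verify ARCH BASE_IMAGE NEW_IMAGE: prints (does not run) the core commands.
# Assumed for illustration: binary staged under rootfs/stackrox/, deployment and
# container both named "central", namespace "stackrox".
plan_verify() {
  arch=$1
  base=$2
  new=$3
  cat <<EOF
GOOS=linux GOARCH=$arch go build -o rootfs/stackrox/central ./central
tar -cf layer.tar -C rootfs stackrox
crane mutate $base --platform linux/$arch --append layer.tar -t $new
kubectl -n stackrox set image deployment/central central=$new
kubectl -n stackrox rollout status deployment/central
EOF
}

# Example: print the plan for an arm64 cluster with placeholder image names.
plan_verify arm64 '<current-central-image>' 'ttl.sh/<tag>:2h'
```

Appending one layer on top of the running image avoids a full image rebuild, which is why only the changed binary needs to be compiled.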
Roxie integration
The skill delegates to Roxie wherever it can:
- `roxie env` detects cluster type, context, kubeconfig
- `oc cluster-info` + architecture detection
- `roxie shell -- bash -c 'echo ENDPOINT=$ROX_ENDPOINT ...'` retrieves endpoint and password from the saved manifest
- `roxie deploy both --tag <tag> --envrc <file>` handles operator install, CRs, readiness, and credential generation
- `roxie teardown both` cleans up when needed

Each Roxie step has a manual fallback for clusters not deployed by Roxie.
What is NOT delegated to Roxie (and why)
- `git diff`, component mapping table, CI fast-path detection
- `go build` with cross-compilation for target arch
- `crane mutate --append` to overlay binaries onto the running image, push to ttl.sh
- `kubectl set image` on specific deployments

Other design decisions
- `--tag` is mandatory for `roxie deploy`: without it, Roxie defaults to `4.9.2`, which causes DB migration mismatches
- `quay.io/rhacs-eng/` (requires auth); when `crane` can't pull, the skill falls back to `quay.io/stackrox-io/` (public) with the same tag

User-facing documentation
Testing and quality
Automated testing
This is a Claude Code skill (markdown instructions), not executable code — no unit tests apply.
How I validated my change
Tested the skill end-to-end against a live GKE/OpenShift cluster across multiple iterations:
- `roxie deploy both --tag`, then used `roxie shell` to extract endpoint + password; confirmed working
- `central/ping/service/service_impl.go`, built central+migrator, pushed via `crane mutate --append` to ttl.sh with `quay.io/stackrox-io/` as base (private registry fallback exercised)
- `/v1/ping` to verify the test change was live