{{ message }}
fix(coderd): only send prebuild claim reinit for the claim build#26548
Merged
SasSwart merged 4 commits intoJun 23, 2026
Conversation
The durable claim check in workspaceAgentReinit pre-seeded a prebuild_claimed reinitialization event whenever a workspace's first build was created by the prebuilds system user and its latest build succeeded. For workspaces claimed long ago this re-fired on every /reinit reconnection, restarting the agent (dropping SSH and IDE sessions) each time, and could loop indefinitely when the restart took down the workspace container. Only seed the event while the latest build is the claim build itself, as recorded in the build job's input, mirroring the check the provisioner server uses when publishing the claim event. Workspaces with any later build now receive a 409 so agents stop polling, the same as regular workspaces.
Contributor
|
Thanks for the contribution @leonzh-ant! This looks good to me. Its a reasonable approach and it covers a valid gap in my original PR. If you don't mind, I'm going to mark this ready for review and let another colleague of mine have a second look. We do just need you to sign the CLA before we can bring this aboard. |
johnstcn
reviewed
Jun 22, 2026
Comment on lines
+1620
to
+1621
| slog.F("latest_build_id", latestBuild.ID), | ||
| slog.F("latest_build_number", latestBuild.BuildNumber)) |
Member
There was a problem hiding this comment.
suggestion: also include job_id for correlation as we log it in the other cases above
| // Claim build failed permanently. Return 409 so the | ||
| // agent treats this as terminal and stops retrying | ||
| // (WaitForReinitLoop exits on any 409). | ||
| // Claim build still in progress: fall through to the |
Member
There was a problem hiding this comment.
nit: "fall through" inside a switch statement has a very specific meaning.
Contributor
Author
|
I have read the CLA Document and I hereby sign the CLA |
cdrci2
added a commit
to coder/cla
that referenced
this pull request
Jun 22, 2026
Contributor
Author
SasSwart
approved these changes
Jun 23, 2026
Contributor
Author
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to subscribe to this conversation on GitHub.
Already have an account?
Sign in.
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Problem
#23108 made prebuild claim delivery durable: when an agent connects to
/api/v2/workspaceagents/me/reinit?wait=true, the handler checks whether the workspace's first build was created by the prebuilds system user and whether its latest build succeeded, and if so pre-seeds aprebuild_claimedreinitialization event in case the original pubsub event was missed.The check does not verify that the latest build is the claim build, so it keeps firing for the rest of the workspace's life. Any workspace that was claimed from a prebuild receives a spurious "prebuild claimed" reinit every time its agent (re)opens the
/reinitconnection: after every agent restart, every coderd deploy or replica restart, and every dropped SSE connection. Each one shuts the agent down and reinitializes it, killing SSH/IDE sessions and re-running startup scripts. In our deployment, where most workspaces are claimed from prebuilds, this caused fleet-wide "agent disconnected" blips whenever a coderd replica restarted, and a few workspaces whose container exits when the agent restarts went into a restart loop every 15-60 minutes. The agent-side dedup (lastOwnerIDincli/agent.go) only suppresses the second event within one agent process, so every new agent process takes at least one spurious restart.Fix
Only seed the reinitialization event while the latest build is the claim build itself, determined from the build job's input (
prebuilt_workspace_stage), the same signalprovisionerdserveruses when publishing the claim event:dbfakegains aMarkPrebuiltWorkspaceClaim()builder option so tests can model claim builds' job input, and the existingTestReinitclaim subtests now use it. A new subtest covers the long-claimed workspace case.One deliberate behavior change worth calling out: if a claim build fails and the owner retries with another start build, the handler now returns 409 for that retry build rather than seeding a reinit. This matches the existing treatment of failed claim builds as terminal for the reinit poller.
Verification
go test ./coderd/ -run TestReinitagainst Postgres 17: all subtests pass, including the newworkspace claimed in the past gets 409case.gofmt,go vet, andgolangci-lint(v1.64.8) are clean on the touched packages./reinit?wait=truereturned the seededprebuild_claimedevent on every connection before the change and a 409 afterwards.Note: this branch was prepared in an environment without the full local toolchain, so the repo's pre-commit hook (
make pre-commit) was not run locally; relying on CI for the full gen/fmt/lint suite. Opening as a draft mainly to report the issue and propose a fix; happy to rework it to the maintainers' preferred approach.