iframe-proxy

leonzh-ant · 2026-06-19T13:56:42Z

Problem

#23108 made prebuild claim delivery durable: when an agent connects to /api/v2/workspaceagents/me/reinit?wait=true, the handler checks whether the workspace's first build was created by the prebuilds system user and whether its latest build succeeded, and if so pre-seeds a prebuild_claimed reinitialization event in case the original pubsub event was missed.

The check does not verify that the latest build is the claim build, so it keeps firing for the rest of the workspace's life. Any workspace that was claimed from a prebuild receives a spurious "prebuild claimed" reinit every time its agent (re)opens the /reinit connection: after every agent restart, every coderd deploy or replica restart, and every dropped SSE connection. Each one shuts the agent down and reinitializes it, killing SSH/IDE sessions and re-running startup scripts. In our deployment, where most workspaces are claimed from prebuilds, this caused fleet-wide "agent disconnected" blips whenever a coderd replica restarted, and a few workspaces whose container exits when the agent restarts went into a restart loop every 15-60 minutes. The agent-side dedup (lastOwnerID in cli/agent.go) only suppresses the second event within one agent process, so every new agent process takes at least one spurious restart.

Fix

Only seed the reinitialization event while the latest build is the claim build itself, determined from the build job's input (prebuilt_workspace_stage), the same signal provisionerdserver uses when publishing the claim event:

Latest build is the claim build: behavior unchanged (seed when the job succeeded, 409 when it failed permanently, wait on pubsub while it is in progress).
Latest build is still a prebuilds-initiated build (claim build not created yet): fall through to the pubsub subscription, which delivers the claim event when the claim build completes.
Latest build is any later user-initiated build: the claim was already handled, so return 409 and the agent stops polling, the same as a regular workspace.

dbfake gains a MarkPrebuiltWorkspaceClaim() builder option so tests can model claim builds' job input, and the existing TestReinit claim subtests now use it. A new subtest covers the long-claimed workspace case.

One deliberate behavior change worth calling out: if a claim build fails and the owner retries with another start build, the handler now returns 409 for that retry build rather than seeding a reinit. This matches the existing treatment of failed claim builds as terminal for the reinit poller.

Verification

go test ./coderd/ -run TestReinit against Postgres 17: all subtests pass, including the new workspace claimed in the past gets 409 case.
gofmt, go vet, and golangci-lint (v1.64.8) are clean on the touched packages.
The fix mirrors behavior validated by hand against an affected deployment: for a long-claimed workspace, /reinit?wait=true returned the seeded prebuild_claimed event on every connection before the change and a 409 afterwards.

Note: this branch was prepared in an environment without the full local toolchain, so the repo's pre-commit hook (make pre-commit) was not run locally; relying on CI for the full gen/fmt/lint suite. Opening as a draft mainly to report the issue and propose a fix; happy to rework it to the maintainers' preferred approach.

The durable claim check in workspaceAgentReinit pre-seeded a prebuild_claimed reinitialization event whenever a workspace's first build was created by the prebuilds system user and its latest build succeeded. For workspaces claimed long ago this re-fired on every /reinit reconnection, restarting the agent (dropping SSH and IDE sessions) each time, and could loop indefinitely when the restart took down the workspace container. Only seed the event while the latest build is the claim build itself, as recorded in the build job's input, mirroring the check the provisioner server uses when publishing the claim event. Workspaces with any later build now receive a 409 so agents stop polling, the same as regular workspaces.

github-actions · 2026-06-19T13:56:53Z

datadog-coder · 2026-06-19T13:58:02Z

⚠️ Warnings

🚦 1 Pipeline job failed

ci | Storybook

_{This comment will be updated automatically if new data arrives.

🔗 Commit SHA: fbda2f7 | Docs | Give us feedback!}

SasSwart · 2026-06-22T10:13:34Z

Thanks for the contribution @leonzh-ant!

This looks good to me. Its a reasonable approach and it covers a valid gap in my original PR. If you don't mind, I'm going to mark this ready for review and let another colleague of mine have a second look.

We do just need you to sign the CLA before we can bring this aboard.

johnstcn · 2026-06-22T10:22:22Z

+				slog.F("latest_build_id", latestBuild.ID),
+				slog.F("latest_build_number", latestBuild.BuildNumber))


suggestion: also include job_id for correlation as we log it in the other cases above

johnstcn · 2026-06-22T10:26:35Z

-			// Claim build failed permanently. Return 409 so the
-			// agent treats this as terminal and stops retrying
-			// (WaitForReinitLoop exits on any 409).
+			// Claim build still in progress: fall through to the


nit: "fall through" inside a switch statement has a very specific meaning.

leonzh-ant · 2026-06-22T14:44:22Z

I have read the CLA Document and I hereby sign the CLA

leonzh-ant · 2026-06-22T15:17:18Z

Thanks for the review @SasSwart @johnstcn! Both comments addressed in ad010e0: added job_id to the debug log in the already-handled case, and reworded the two switch-case comments to "proceed to the transmitter below" to avoid confusion with Go's fallthrough.

recheck

…-build

leonzh-ant · 2026-06-23T07:20:08Z

github-actions Bot added the community Pull Requests and issues created by the community. label Jun 19, 2026

github-actions Bot assigned leonzh-ant Jun 19, 2026

johnstcn requested review from SasSwart, johnstcn and mafredri June 19, 2026 16:37

SasSwart marked this pull request as ready for review June 22, 2026 10:13

johnstcn reviewed Jun 22, 2026

View reviewed changes

cdrci2 added a commit to coder/cla that referenced this pull request Jun 22, 2026

@leonzh-ant has signed the CLA in coder/coder#26548

96c3a96

Include job_id in reinit log and clarify switch case comments

ad010e0

Merge branch 'upstream-main-tmp' into fix/workspaceagent-reinit-claim…

84e07d7

…-build

SasSwart approved these changes Jun 23, 2026

View reviewed changes

Merge branch 'main' into fix/workspaceagent-reinit-claim-build

fbda2f7

SasSwart merged commit e044a42 into coder:main Jun 23, 2026
28 of 29 checks passed

github-actions Bot locked and limited conversation to collaborators Jun 23, 2026

Sunbelt Computer Software

PL/B Language Development and Support

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(coderd): only send prebuild claim reinit for the claim build#26548

fix(coderd): only send prebuild claim reinit for the claim build#26548
SasSwart merged 4 commits into
coder:mainfrom
leonzh-ant:fix/workspaceagent-reinit-claim-build

leonzh-ant commented Jun 19, 2026

Uh oh!

github-actions Bot commented Jun 19, 2026 •

edited

Loading

Uh oh!

datadog-coder Bot commented Jun 19, 2026 •

edited by datadog-coder-official Bot

Loading

Uh oh!

SasSwart commented Jun 22, 2026 •

edited

Loading

Uh oh!

johnstcn Jun 22, 2026

Uh oh!

johnstcn Jun 22, 2026

Uh oh!

leonzh-ant commented Jun 22, 2026

Uh oh!

leonzh-ant commented Jun 22, 2026

Uh oh!

leonzh-ant commented Jun 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

		slog.F("latest_build_id", latestBuild.ID),
		slog.F("latest_build_number", latestBuild.BuildNumber))

Sunbelt Computer Software

PL/B Language Development and Support

Uh oh!

Conversation

leonzh-ant commented Jun 19, 2026

Problem

Fix

Verification

Uh oh!

github-actions Bot commented Jun 19, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

datadog-coder Bot commented Jun 19, 2026 • edited by datadog-coder-official Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ Warnings

Uh oh!

SasSwart commented Jun 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

johnstcn Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

johnstcn Jun 22, 2026

Choose a reason for hiding this comment

Uh oh!

leonzh-ant commented Jun 22, 2026

Uh oh!

leonzh-ant commented Jun 22, 2026

Uh oh!

leonzh-ant commented Jun 23, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

github-actions Bot commented Jun 19, 2026 •

edited

Loading

datadog-coder Bot commented Jun 19, 2026 •

edited by datadog-coder-official Bot

Loading

SasSwart commented Jun 22, 2026 •

edited

Loading