chore: CI/Windows hygiene — Node 24 action bumps + write-file-atomic EPERM retry by LangSensei · Pull Request #350 · LangSensei/emploke · GitHub
Skip to content
This repository was archived by the owner on Jun 13, 2026. It is now read-only.

chore: CI/Windows hygiene — Node 24 action bumps + write-file-atomic EPERM retry#350

Merged
LangSensei merged 1 commit into
mainfrom
chore/ci-windows-hygiene
Jun 9, 2026
Merged

chore: CI/Windows hygiene — Node 24 action bumps + write-file-atomic EPERM retry#350
LangSensei merged 1 commit into
mainfrom
chore/ci-windows-hygiene

Conversation

@LangSensei

Copy link
Copy Markdown
Owner

What

Two small CI/Windows hygiene items bundled into one PR.

  • Change A — Node 24 action bumps. actions/checkout, actions/setup-node, and pnpm/action-setup bumped from @v4 to @v5 across all three workflow files, ahead of the 2026-06-16 Node-20 deprecation/forced-upgrade cutover.
  • Change B — writeFileAtomic EPERM retry. Closes write-file-atomic: bounded retry on Windows EPERM/EBUSY (fix concurrent-write flakes) #269. Add a bounded-retry wrapper around write-file-atomic at the three known wrap sites (packages/cli/src/runtime-file.ts once + packages/runtime/src/copilot/trust.ts twice) to fix Windows EPERM/EBUSY flakes where the single MoveFileEx call inside the library loses to Defender scans or concurrent renames.

Why bundle the two

Both are small CI/Windows hygiene items with zero file-tree intersection (workflow YAML vs runtime/cli write paths). Splitting would double reviewer load with no isolation benefit.

Why these specific versions for Change A

Action v4 → vN Reason
actions/checkout @v4@v5 v5.0.0 is the minimum Node-24-compatible runtime (PR actions/checkout#2226); no input changes.
actions/setup-node @v4@v5 v5.0.0 is the minimum Node-24-compatible runtime (PR actions/setup-node#1325). v5 also introduced an opt-out auto-cache when packageManager is present; we already pass cache: 'pnpm' explicitly so behaviour is unchanged.
pnpm/action-setup @v4@v5 v5.0.0 = "Updated the action to use Node.js 24" (per release notes); version keeps reading from root package.json's packageManager: pnpm@10.33.4.

Decision: v5, not v6. All three actions still receive maintenance on v5.x (e.g. actions/checkout@v5.0.1, 2025-11-17). v6 is not unmaintained EOL territory for v5; per pilot brief default, pick the minimum Node-24-compatible major to minimise churn and blast radius.

Out of scope per brief: actions/configure-pages@v5 (already on v5), actions/upload-pages-artifact@v3, actions/deploy-pages@v4, and lycheeverse/lychee-action@v2 are NOT in the three named actions and were not touched. The two still on v3/v4 (upload-pages-artifact, deploy-pages) are candidates for a follow-up but are not part of this PR.

Why this specific shape for Change B

write-file-atomic@5.0.1 does a single fs.rename with zero retries; on Windows that maps to MoveFileEx(MOVEFILE_REPLACE_EXISTING), which returns EPERM/EBUSY while Defender scans the temp file or a sibling rename is in flight. Adds a 15-line wrapper:

  • Budget: 8 attempts (matches the comment in the pre-existing runtime-file.test.ts > "write is atomic" test that already assumed this budget).
  • Backoff: exponential 2^attempt + jitter, capped at 500 ms. Worst-case total under 1 second.
  • Retryable codes: EPERM, EBUSY, EACCES. Everything else (ENOSPC, EROFS, …) throws immediately — never paper over a real disk problem.
  • Observability: console.debug(...) fires on success-after-retry with the path and final attempt count. We do not pull in a logging dep just for this; @emploke/cli and @emploke/runtime already use plain console.* (the structured pino logger lives in @emploke/server).
  • Duplication: the helper is duplicated in runtime-file.ts and trust.ts rather than shared. There is no clean shared location — @emploke/cli (T_top) and @emploke/runtime (T0) don't depend on each other and adding a cross-tier dep for ~15 lines is overkill. Acknowledged trade-off; rolled with "duplicate" per brief's fallback ("verify there is a sensible shared location, otherwise duplicate").

Kept the pre-existing concurrent-write tests

Per brief default: kept, not deleted.

  • packages/cli/test/runtime-file.test.ts > "write is atomic"
  • packages/runtime/test/copilot/trust.test.ts > "serialises concurrent calls" (+ the lost-update sibling)

They still document the on-POSIX behaviour end-to-end and (on Windows) are now actually deterministic because the retry wrapper absorbs the very EPERM that used to flake them — verified locally on Windows: before this PR runtime-file.test.ts > "write is atomic" fails with EPERM ~50% of the time; after, it passes consistently. The new mocked-rename tests pin the retry decision tree deterministically on every OS.

Changes

File Change
.github/workflows/ci.yml bump 3× actions/checkout, 2× actions/setup-node, 2× pnpm/action-setup to @v5
.github/workflows/pages.yml bump 1× actions/checkout to @v5
.github/workflows/release.yml bump 1× actions/checkout, 1× actions/setup-node, 1× pnpm/action-setup to @v5
packages/cli/src/runtime-file.ts add writeFileAtomicWithRetry; route writeRuntimeFile through it
packages/runtime/src/copilot/trust.ts add writeFileAtomicWithRetry; route both wrap sites through it (touch + final config write)
packages/cli/test/runtime-file-retry.test.ts new — mocked-rename deterministic coverage for retry on EPERM/EBUSY/EACCES + bound + non-retryable + happy path
packages/runtime/test/copilot/trust-retry.test.ts new — mocked-rename deterministic coverage for both wrap sites + TrustRegistrationFailed propagation

How to Test

pnpm install --frozen-lockfile
pnpm lint          # 13 pre-existing warnings, no errors
pnpm typecheck     # all packages clean
pnpm build         # all packages clean
pnpm test          # all packages green, including the 6 + 4 new retry tests

On Windows specifically, pnpm --filter @emploke/cli test should pass consistently — prior to this PR it flaked ~50% on runtime-file.test.ts > "write is atomic" with EPERM: operation not permitted, rename.

Cross-cutting compliance

  • ✅ All @v4 references for the three named actions removed.
  • ✅ Single major picked (v5) — not mixed with v6.
  • node-version: '22' left unchanged everywhere (the bump is the action's own runtime, not the toolchain Node it sets up).
  • writeFileAtomicWithRetry used at all three wrap sites.
  • ✅ Issue-reference scan clean: git diff origin/main | grep -E '^\+' | grep -Ei '(#\d{2,4}\b|see issue|per the brief|per the spec|TODO\(.*#)' returns zero hits inside the code diff.
  • setTimeout is from node:timers/promises, not the global.
  • ✅ No write-file-atomic version bump; no cross-process locking added; no new logging dependency.

Note on the Node 24 action deprecation

There is no GitHub issue for the Node 24 action bump (it is calendared by GitHub's deprecation schedule, not tracked in this repo's issue tracker). PR body therefore does NOT close any issue for Change A; only Closes #269 for Change B.

@LangSensei

Copy link
Copy Markdown
Owner Author

Review verdict: APPROVE (with one minor observability suggestion)

Mergeable: ✅ MERGEABLE (clean against main @ 6e513d09)
Head reviewed: 000db2eeb86e71c9e607cd7bc5313b865e6c0cd5
Scope: 7 files (3 yml + 2 src ts + 2 new test ts) — matches the brief once you collapse "4 ts + 2 new tests" into "2 src + 2 new tests".

Both changes (Node-24 action bumps + write-file-atomic EPERM retry wrapper) land cleanly and the actual Windows EPERM flake reproduced live during the smoke run on this Windows box — the retry wrapper absorbed it and the pre-existing "write is atomic" test passed (debug log captured below). That's the BEFORE/AFTER evidence the brief asked for, observed live rather than just claimed.

Per-task results

# Task Result Notes
1 Canonical issue-ref scan (gh pr diff 350 | grep -E '^\+' | grep -Ei '(#\d{2,4}\b|see issue|per the brief|per the spec|TODO\(.*#)') PASS 0 hits across 308 added lines.
2 Action bumps (grep -E 'actions/checkout@|actions/setup-node@|pnpm/action-setup@' .github/workflows/*.yml) PASS All 11 sites of the three target actions on @v5, no mix with @v6. actions/deploy-pages@v4 remains in pages.yml:33 — explicitly out of the brief's bump list, so left alone correctly. pnpm/action-setup@v5 semantics preserved: root packageManager: pnpm@10.33.4 is honoured (same as v4), and cache: 'pnpm' on setup-node@v5 is still passed explicitly so auto-cache behaviour is unchanged.
3 write-file-atomic retry wrapper ⚠️ PASS w/ minor All wrap sites routed: packages/cli/src/runtime-file.ts:53, packages/runtime/src/copilot/trust.ts:115 + :156. MAX_ATTEMPTS = 8, backoff Math.min(2**attempt + Math.random()*50, 500), retryable EPERM/EBUSY/EACCES, non-retryable propagates, setTimeout from node:timers/promises. Duplicate copies differ only in log tag ([runtime-file] vs [copilot/trust]) — no semantic drift. Minor: see "Observability gap" below.
4 New tests PASS (cli: full 4-scenario coverage; runtime: 3 explicit + 1 implicit, see note) packages/cli/test/runtime-file-retry.test.ts covers all 4 scenarios (happy / retry-then-succeed on EPERM, EBUSY, EACCES / budget exhaustion / non-retryable). packages/runtime/test/copilot/trust-retry.test.ts covers retry-then-succeed at both wrap sites, budget exhaustion, non-retryable; the "first-try success / no debug log" scenario is only covered implicitly by the unchanged trust.test.ts. Both files use vi.mock("write-file-atomic", ...) — deterministic, no real fs races.
5 Pre-existing flake tests still pass PASS packages/cli/test/runtime-file.test.ts:42 "write is atomic — concurrent reads see a complete payload, never partial bytes" and packages/runtime/test/copilot/trust.test.ts:149 "serialises concurrent calls: every dir ends up trusted exactly once" are both present, contain no .skip / skipIf, and both pass.
6 Cross-cutting smoke PASS pnpm install --frozen-lockfile ✓ · pnpm build ✓ · pnpm lint ✓ exit 0, 13 warnings all in unrelated packages/server/test/workspaces.test.ts (matches dev's claim) · pnpm typecheck ✓ all 15 packages clean · pnpm test2581 tests passed across 217 files (dev's "236 tests" figure was the packages/cli package only; full workspace is higher). One pre-existing .skip in packages/session (untouched by this PR).
7 YAML syntax PASS All three workflows parse with PyYAML. with: blocks correctly indented under bumped actions, env: blocks unchanged, jobs/matrices/steps semantics preserved. (No actionlint available in this environment.)
8 Diff sanity (git diff 6e513d09..000db2ee --stat) PASS Exactly 7 files, 308+/14-, 1 commit (000db2e). No formatter sweep, no unrelated package.json bumps, no surprise renames.

Live evidence of the fix working

During pnpm test on this Windows box, the pre-existing concurrency test actually tripped EPERM and the wrapper absorbed it:

stdout | test/runtime-file.test.ts > runtime-file > write is atomic — concurrent reads see a complete payload, never partial bytes
[runtime-file] retried-write path=C:\Users\LANGCH~1\AppData\Local\Temp\emploke-cli-rf-f5cPTY\runtime.json attempts=2
 ✓ test/runtime-file.test.ts (8 tests) 148ms

That's the dev's BEFORE/AFTER claim landing in a third-party environment — exactly what the brief flagged as a praise candidate.

Observability gap (minor, non-blocking)

The brief asked for "debug log: fires on every retry" and the PR body itself claims "a new console.debug line fires on every retry so production flakes are observable", but the wrapper actually emits a single summary log on success after retries and nothing at all on budget exhaustion. The tests pin this exact behaviour:

  • packages/cli/test/runtime-file-retry.test.ts:97-98:
    expect(mockedWriteFileAtomic).toHaveBeenCalledTimes(8);
    expect(debugSpy).not.toHaveBeenCalled();   // ← exhaustion path is silent
  • packages/runtime/test/copilot/trust-retry.test.ts:105-106: same assertion.

Information-content-wise the success path is fine (one line per recovered call with attempts=N), but the exhaustion path — the case operators most need observability on — is silent until the upstream TrustRegistrationFailed / unhandled rejection surfaces. Suggest emitting one console.debug (or console.warn) on the final failing attempt before re-throwing, e.g.:

if (!retryable || attempt >= MAX_ATTEMPTS - 1) {
  if (retryable) {
    console.debug(`[runtime-file] giving-up path=${p} attempts=${attempt + 1} code=${code}`);
  }
  throw err;
}

Closes the observability hole without changing the retry contract or test count. Not blocking — happy to merge as-is and chase this in a follow-up.

Praise

  • 🟢 Genuine BEFORE/AFTER evidence: not just claimed, the flake actually reproduced on a third-party Windows machine and the retry absorbed it cleanly during this review's smoke run.
  • 🟢 Deterministic retry tests use vi.mock("write-file-atomic", ...) to inject the EPERM instead of trying to provoke a real fs race — exactly the right shape for the test pyramid.
  • 🟢 Clean file-tree disjointness between the two bundled changes (.github/workflows/* vs packages/{cli,runtime}/**) makes the "one PR, two changes" decision easy to review.
  • 🟢 PR body explicitly justifies v5 (Node 24 minimum) vs v6, the duplicate-helper decision, and the log-tag choice. Removes any "why?" questions on the back-and-forth.

— review automated; verdict APPROVE.

@LangSensei LangSensei left a comment

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict: APPROVE (submitted as a COMMENT review because GitHub forbids self-approval on PRs authored by the reviewing actor; the full APPROVE per-task breakdown is in the issue comment posted alongside this review.)

8 of 8 tasks pass on this Windows smoke run, including live reproduction of the EPERM flake ([runtime-file] retried-write … attempts=2 fired during runtime-file.test.ts > "write is atomic", and the test passed deterministically).

One minor non-blocking observability suggestion left inline — the budget-exhaustion path is silent, which diverges from the brief's wording ("fires on every retry") and from the PR body's own claim. Happy to merge as-is and chase in a follow-up.

Praise:

  • Genuine BEFORE/AFTER — the flake reproduced live on a third-party Windows box and the wrapper absorbed it.
  • Deterministic mocked retry tests (no real fs races).
  • Clean file-tree disjointness between the two bundled changes.
  • Decision notes in PR body justify v5-not-v6, dedup, and log-tag choices up front.

Comment thread packages/cli/src/runtime-file.ts Outdated
} catch (err) {
const code = (err as NodeJS.ErrnoException).code;
const retryable = code === "EPERM" || code === "EBUSY" || code === "EACCES";
if (!retryable || attempt >= MAX_ATTEMPTS - 1) throw err;

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor (non-blocking) — observability gap on budget-exhaustion path.

The brief asked for "debug log fires on every retry" and the PR description itself says "a new console.debug line fires on every retry so production flakes are observable." But this wrapper only emits a single summary log on success after retries — the exhaustion path is completely silent, as the new test asserts at runtime-file-retry.test.ts:97-98 (expect(debugSpy).not.toHaveBeenCalled() when persistent EPERM blows the budget).

Information-content-wise the success path is fine, but exhaustion is the case operators most need to see. Suggested fix:

if (!retryable || attempt >= MAX_ATTEMPTS - 1) {
  if (retryable) {
    console.debug(`[runtime-file] giving-up path=${p} attempts=${attempt + 1} code=${code}`);
  }
  throw err;
}

Same change in packages/runtime/src/copilot/trust.ts for symmetry. Happy to merge as-is and chase in a follow-up.

Comment thread packages/runtime/src/copilot/trust.ts Outdated
} catch (err) {
const code = (err as NodeJS.ErrnoException).code;
const retryable = code === "EPERM" || code === "EBUSY" || code === "EACCES";
if (!retryable || attempt >= MAX_ATTEMPTS - 1) throw err;

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor (non-blocking) — same observability gap as in packages/cli/src/runtime-file.ts:78: the budget-exhaustion branch is silent. trust-retry.test.ts:105-106 pins that behaviour (expect(debugSpy).not.toHaveBeenCalled() on 8-attempt persistent EPERM). Operators only see the resulting TrustRegistrationFailed, never the retry trail that preceded it. See the matching comment on runtime-file.ts:78 for a suggested one-liner.

return err;
};

describe("ensureDirTrusted retry behaviour", () => {

Copy link
Copy Markdown
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (non-blocking) — symmetry with runtime-file-retry.test.ts. The cli copy has an explicit "does not log on a first-try success (zero-overhead happy path)" assertion (runtime-file-retry.test.ts:101); this file doesn't. The happy path is covered implicitly by the unchanged trust.test.ts suite, so it isn't a gap — just an asymmetry that's cheap to close if you do another pass over this file.

…EPERM retry

Bundles two small CI/Windows hygiene items with zero file-tree
intersection (workflow YAML vs runtime/cli write paths) so reviewers
see one focused PR instead of two.

GitHub Actions Node 24 deprecation cuts over on 2026-06-16. Bump the
three named actions to the minimum Node-24-compatible major (v5):

- actions/checkout@v4 -> @v5 (Node 24 runtime; no input changes)
- actions/setup-node@v4 -> @v5 (Node 24 runtime; auto-cache changes
  are gated on omitting `cache:`, so our explicit `cache: 'pnpm'`
  preserves behaviour)
- pnpm/action-setup@v4 -> @v5 (Node 24 runtime; reads `packageManager`
  from package.json same as v4)

Add a bounded-retry wrapper around `write-file-atomic` at the three
known wrap sites (cli runtime-file plus the two trust.ts calls) to
fix the Windows EPERM/EBUSY flake where `MoveFileEx` loses to AV
scans or concurrent renames. Budget is 8 attempts with exponential
backoff capped at 500 ms; only EPERM/EBUSY/EACCES are retried, all
other errors propagate immediately. A new `console.debug` line fires
on every retry so production flakes are observable.

Closes #269.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@LangSensei LangSensei force-pushed the chore/ci-windows-hygiene branch from 000db2e to f212424 Compare June 9, 2026 14:46
@LangSensei

Copy link
Copy Markdown
Owner Author

Amended in f212424 (force-with-lease, single commit preserved).

Addresses the observability-gap minor: writeFileAtomicWithRetry now emits a [runtime-file] giving-up / [copilot/trust] giving-up debug log on the final failing retryable attempt before re-throwing, gated by if (retryable) so non-retryable codes (ENOSPC, ENOENT, …) still propagate silently. Both helper copies kept textually similar — same control flow, same message shape (path, attempts, code), only the tag differs.

Tests in packages/cli/test/runtime-file-retry.test.ts and packages/runtime/test/copilot/trust-retry.test.ts flipped to pin the new behaviour: budget-exhaustion test asserts debugSpy was called exactly once with giving-up, attempts=8, code=EPERM. The non-retryable (ENOSPC) and happy-path tests still pin silent behaviour — unchanged.

Verification: pnpm build, pnpm typecheck, pnpm lint, pnpm test all green locally (Windows).

@LangSensei

Copy link
Copy Markdown
Owner Author

Re-review — observability amend (focused) — APPROVE

Amended head: f212424a1e42822d0de958966754762fe0b7ea64
Original-review head: 000db2eeb86e71c9e607cd7bc5313b865e6c0cd5
Scope: focused verification of the observability give-up log only (the wide 8-task review at #350 (comment) stays the source of truth for the rest).

Note: GitHub blocks APPROVE on self-authored PRs, so this verdict is delivered as a comment rather than a formal review event. Treat this as APPROVE-equivalent.

Per-task verdict

# Task Result
1 Canonical issue-reference scan (#NNN, see issue, per the brief, per the spec, TODO(...#) on + lines of the PR diff PASS — zero hits
2 Give-up log gated on retryable, fires once at the give-up moment, per-package tag matches the existing retried-write line in the same file, includes attempts= and code= PASS — both packages/cli/src/runtime-file.ts and packages/runtime/src/copilot/trust.ts
3 Drift between the two helpers PASSdiff of the two writeFileAtomicWithRetry bodies shows exactly two changed lines, both the tag string ([runtime-file][copilot/trust]) on the retried-write log and the giving-up log; structure / formula / fields identical
4 Retry contract unchanged vs 000db2ee PASSMAX_ATTEMPTS = 8, retryable set {EPERM, EBUSY, EACCES}, backoff Math.min(2 ** attempt + Math.random() * 50, 500) all untouched; the amend-only delta in both src files is the gated log block
5 Test assertions — budget-exhaustion test asserts debugSpy called once with substring matches on giving-up / attempts=8 / code=EPERM; success-after-retry tests still assert retried-write; ENOSPC test still asserts silent behaviour PASS — both test files; ENOSPC regression net intact (expect(debugSpy).not.toHaveBeenCalled() preserved)
6 Scoped smoke (pnpm install --frozen-lockfile, pnpm build, pnpm -F @emploke/cli test, pnpm -F @emploke/runtime test) PASS — install + build clean; @emploke/cli 236/236 (16 files), @emploke/runtime 170/170 (14 files); test count delta is 0 (only assertions flipped on the existing budget-exhaustion it() blocks)
7 Diff sanity (git diff 000db2ee..f212424a --stat) PASS — exactly 4 files (2 src + 2 test, +24/−6), single commit on branch (git rev-list --count origin/main..f212424a → 1), no scope creep, no formatter sweep

Observability gap

Closed. A persistent retryable failure now leaves exactly one breadcrumb ([runtime-file] giving-up path=… attempts=8 code=EPERM / [copilot/trust] giving-up …) at the give-up moment, before the original error re-throws. Non-retryable codes (ENOSPC, etc.) still propagate silently — verified both by the source gating (if (retryable) inside the give-up branch) and by the unchanged ENOSPC test assertions.

LGTM.

@LangSensei LangSensei merged commit d1b3786 into main Jun 9, 2026
5 checks passed
@LangSensei LangSensei deleted the chore/ci-windows-hygiene branch June 9, 2026 15:03
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

write-file-atomic: bounded retry on Windows EPERM/EBUSY (fix concurrent-write flakes)

1 participant