fix(sandbox): retry rm -rf during delete to handle kill -9 file-handle race by renecannao · Pull Request #122 · ProxySQL/dbdeployer · GitHub
Skip to content

fix(sandbox): retry rm -rf during delete to handle kill -9 file-handle race#122

Open
renecannao wants to merge 1 commit into
masterfrom
fix/delete-sandbox-race-on-rm
Open

fix(sandbox): retry rm -rf during delete to handle kill -9 file-handle race#122
renecannao wants to merge 1 commit into
masterfrom
fix/delete-sandbox-race-on-rm

Conversation

@renecannao

@renecannao renecannao commented May 14, 2026

Copy link
Copy Markdown

Summary

Closes #121.

RemoveSandbox, RemoveCustomSandbox, and an earlier custom-delete path in sandbox/sandbox.go all do:

  1. Run the sandbox's stop / send_kill script (the destroy mode of send_kill does kill -9 $MYPID).
  2. Immediately rm -rf the sandbox directory.

kill -9 doesn't synchronously wait for the killed process to release its open file descriptors. During that brief window, the dying mysqld's last buffered I/O can land inside master/data/ after rm's readdir() walk has already passed — making the final rmdir(2) return ENOTEMPTY:

rm: cannot remove '<sandbox>/master/data': Directory not empty
error while deleting sandbox <sandbox>

This is an intermittent flake but persistent enough to bite:

Fix

Wrap the rm -rf in a small exponential-backoff retry loop (200ms + 400ms + 800ms + 1.6s + 3.2s ≈ 6.2s max wait). The race resolves in milliseconds in practice; retries handle it without changing the stop-script contract or templates across topologies.

A more thorough fix would be for send_kill to wait for the killed PIDs to disappear from /proc before returning, but that's a larger change touching templates for single/replication/multiple/tidb/etc. The localized retry in RemoveSandbox solves the user-visible symptom with one focused change.

Files

sandbox/sandbox.go — adds a removeDirWithRetry(target string) error helper, then routes all four synchronous rm -rf call-sites through it:

  • Early custom-delete path (around line 213) — both sandbox dir and log directory.
  • The synchronous branch of RemoveSandbox.
  • The synchronous branch of RemoveCustomSandbox.

The concurrent-mode branches that enqueue an ExecCommand into the parallel execution list are unchanged.

Test plan

  • go build ./... clean.
  • go test ./sandbox/... — the three failures (TestUseTemplate_MySQL, TestUseTemplate_MariaDB, TestReplicationStopAndUse_DelegateToSingleTemplates) are pre-existing on master HEAD without my change. They require real MySQL binaries in $SANDBOX_BINARY to pass and are auto-skipped by test/go-unit-tests.sh in CI. Nothing related to the rm/delete code I touched.
  • CI should run the full Integration Tests matrix; in particular MariaDB (10.11.9) is the most reliable repro of the race the new retry is supposed to handle.

Note

The retry is bounded (max ~6s) and the per-attempt delay is short, so this doesn't measurably slow down healthy deletes — the rm normally succeeds on the first attempt and we never hit a sleep.

Summary by CodeRabbit

  • Bug Fixes
    • Improved reliability of sandbox directory cleanup by adding retry logic with exponential backoff to handle transient deletion failures.

Review Change Stack

…e race

Closes #121.

`RemoveSandbox` (and `RemoveCustomSandbox`, plus an earlier custom-delete
path) runs the sandbox's stop / send_kill script and then immediately
`rm -rf`s the sandbox dir. The `destroy` mode of `send_kill` does
`kill -9 $MYPID`, which doesn't synchronously wait for the killed
process's open file descriptors to be released. During that brief
window, the dying mysqld's last buffered I/O can land inside
`master/data/` after `rm`'s readdir() walk has already passed —
making the final `rmdir(2)` return ENOTEMPTY:

    rm: cannot remove '<sandbox>/master/data': Directory not empty
    error while deleting sandbox <sandbox>

Recently observed on:
- master Integration Tests run 25839621095 (`MariaDB (10.11.9)`)
- PR #119 hit the same flake on 4 consecutive runs before clearing on
  the 5th retry
- PR #114 hit it during initial review

Fix: wrap the rm in a small exponential-backoff retry loop
(200ms + 400ms + 800ms + 1.6s + 3.2s ≈ 6.2s max wait). The race
resolves in milliseconds in practice; retries handle it without
changing the stop-script contract or template scripts across topologies.

A more thorough fix would be to have `send_kill` wait for the killed
PIDs to disappear from /proc before returning, but that's a larger
change touching template scripts across single, replication, multiple,
tidb, etc. The localized retry in `RemoveSandbox` solves the
user-visible symptom with one focused change.

Applies to all four `rm -rf` call-sites in sandbox/sandbox.go:
- The early custom-delete path (around line 213) used by sandboxDef
  recreation, both sandbox dir and log directory.
- The synchronous (non-concurrent) branch of `RemoveSandbox`.
- The synchronous branch of `RemoveCustomSandbox`.

The concurrent-mode branches (which enqueue an `ExecCommand` into the
parallel execution list) are unchanged — those run later and their
race window is already wider by virtue of scheduling.
@coderabbitai

coderabbitai Bot commented May 14, 2026

Copy link
Copy Markdown

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a removeDirWithRetry function to handle race conditions during directory deletion by implementing an exponential backoff retry mechanism. This helper replaces direct calls to the external rm command in checkDirectory, RemoveSandbox, and RemoveCustomSandbox. Feedback suggests using the more idiomatic os.RemoveAll function to avoid process overhead and points out a calculation error in the docstring regarding the total maximum wait time.

Comment thread sandbox/sandbox.go
Comment on lines +1191 to +1206

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

Using os.RemoveAll is more idiomatic and efficient in Go than spawning an external rm process, as it avoids the overhead of process creation and is more portable. Additionally, the loop logic can be simplified by removing the redundant else block. Note that the total sleep time with 5 attempts is actually 3 seconds (200+400+800+1600ms), which differs from the 6.2s mentioned in the docstring.

func removeDirWithRetry(target string) error {
	backoff := 200 * time.Millisecond
	var err error
	for attempt := 0; attempt < 5; attempt++ {
		if attempt > 0 {
			time.Sleep(backoff)
			backoff *= 2
		}
		if err = os.RemoveAll(target); err == nil {
			return nil
		}
	}
	return err
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

dbdeployer delete intermittently fails with 'Directory not empty' after kill -9

1 participant