Fix FileLog crash on delete+recreate reusing the same inode by groeneai · Pull Request #107617 · ClickHouse/ClickHouse · GitHub
Skip to content

Fix FileLog crash on delete+recreate reusing the same inode#107617

Merged
antaljanosbenjamin merged 4 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-filelog-delete-recreate-same-inode
Jun 25, 2026
Merged

Fix FileLog crash on delete+recreate reusing the same inode#107617
antaljanosbenjamin merged 4 commits into
ClickHouse:masterfrom
groeneai:groeneai/fix-filelog-delete-recreate-same-inode

Conversation

@groeneai

@groeneai groeneai commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Related: #101408

Changelog category (leave one):

  • Bug Fix (user-visible misbehavior in an official stable release)

Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):

Fixed a FileLog server crash (LOGICAL_ERROR: Last stored last_written_position in meta file ... is bigger than current last_written_pos) that could happen when a watched file was deleted and recreated reusing the same inode.

Description

Symptom

StorageFileLog could abort the server with a LOGICAL_ERROR:

Last stored last_written_position in meta file a.csv is bigger than current last_written_pos (N > 0)

Crash path: StorageFileLog::checkOffsetIsValid -> serialize -> openFilesAndSetPos.

Root cause

When a watched file is deleted and recreated under the same name within one
directory-watcher backoff window, the DW_ITEM_REMOVED and DW_ITEM_ADDED
events are drained together. updateFileInfos() processes DW_ITEM_ADDED
first: it flips the file status back to OPEN (so the later REMOVED-cleanup
loop skips it) and resets the in-memory FileMeta for the inode to offset 0.

onFileAppeared() only removes the stale on-disk meta when the inode
changes. If the recreated file reuses the same inode (deterministic on
filesystems that recycle freed inodes, e.g. ext4), that guard does not fire,
so the on-disk meta with the old, larger offset is left behind while the
in-memory offset is 0. The next serialize() reads the larger stored offset
and trips checkOffsetIsValid.

PR #101408 covered the inode-change case (a fresh inode). This PR covers
inode reuse, which is what surfaced under stress in CI.

Fix

A freshly added file is read from offset 0, so any on-disk meta under its
name is stale. Remove it in the DW_ITEM_ADDED branch so the on-disk and
in-memory metadata stay consistent. The DW_ITEM_MOVED_TO (rename) branch is
intentionally left unchanged: it preserves the consumed offset, and a
moved-out name's on-disk meta is already removed by the REMOVED-cleanup loop
before the new inode is observed.

Validation

New regression test 04342_file_log_delete_recreate_same_inode reproduces
the crash on an unfixed server (it aborts on the first delete+recreate round)
and passes after the fix. Existing FileLog regression tests
(04242_file_log_delete_recreate, 04257_file_log_rename_and_recreate) and
the FileLog streaming suite continue to pass, and after delete+recreate the
new content is still streamed correctly (no rows skipped).

No related open issue was found.

Version info

  • Merged into: 26.7.1.58

StorageFileLog could abort with a LOGICAL_ERROR when a watched file was
deleted and recreated reusing the same inode within one directory-watcher
backoff window:

  Last stored last_written_position in meta file a.csv is bigger than
  current last_written_pos (N > 0)

When the DW_ITEM_REMOVED and DW_ITEM_ADDED events for the same name are
drained together, updateFileInfos() processes DW_ITEM_ADDED first: it flips
the file status back to OPEN (so the later REMOVED-cleanup loop skips it) and
resets the in-memory FileMeta for the inode to offset 0. onFileAppeared()
only cleans up the on-disk meta when the inode CHANGES, so on inode REUSE
the stale on-disk meta (with the old, larger offset) is left behind. The
next serialize() then reads that larger stored offset and trips
checkOffsetIsValid().

PR ClickHouse#101408 covered the inode-change case (a fresh inode). This covers inode
reuse, which is deterministic on filesystems that recycle freed inodes
(e.g. ext4) and is what surfaced under stress in CI. A freshly added file is
read from offset 0, so any on-disk meta under its name is stale; remove it in
the DW_ITEM_ADDED branch so the on-disk and in-memory metadata stay
consistent. The DW_ITEM_MOVED_TO (rename) branch is intentionally left
unchanged: it preserves the consumed offset, and the moved-out name's on-disk
meta is already removed by the REMOVED-cleanup loop before the new inode is
seen.

Regression test 04342_file_log_delete_recreate_same_inode reproduces the
crash on an unfixed server (delete + immediate recreate, high watch backoff)
and passes after the fix.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

@groeneai

Copy link
Copy Markdown
Contributor Author

cc @antaljanosbenjamin @kssenii — could you review this? It fixes a FileLog server crash (LOGICAL_ERROR: ... last_written_position ... bigger than current (N > 0)) on delete+recreate that reuses the same inode within one watcher backoff window: the DW_ITEM_ADDED branch resets the in-memory offset to 0 but left the stale on-disk meta behind (onFileAppeared only cleans up on inode change, not inode reuse). Extends PR #101408 to the inode-reuse case; regression test included.

@PedroTadim PedroTadim added the can be tested Allows running workflows for external contributors label Jun 16, 2026
@clickhouse-gh

clickhouse-gh Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Workflow [PR], commit [963e74b]

Summary:

job_name test_name status info comment
BuzzHouse (amd_tsan) ERROR IGNORED

AI Review

Summary
  • The PR fixes FileLog stale metadata when a watched file is deleted and recreated with the same inode: the DW_ITEM_ADDED path now drops the stale on-disk meta before resetting the in-memory FileMeta to offset 0. The current regression test covers both required conditions, same-inode reuse and same updateFileInfos drain, and also verifies that recreated rows are streamed. I did not find unresolved correctness issues in the current diff.
Final Verdict
  • Status: ✅ Approve

@clickhouse-gh clickhouse-gh Bot added the pr-bugfix Pull request with bugfix, not backported by default label Jun 16, 2026
Comment thread tests/queries/0_stateless/04342_file_log_delete_recreate_same_inode.sh Outdated
The test relied on ext4 reusing the just-freed inode, so on tmpfs/xfs an
unfixed build took the already-covered inode-change path and could pass. Force
the same-inode recreate deterministically with a hard link held outside the
watched directory: pin the inode across the unlink, rewrite content through the
held link, then link it back under the original name. This produces a
DW_ITEM_REMOVED + DW_ITEM_ADDED batch for the watched name with an unchanged
inode regardless of filesystem allocation policy. Also assert the recreated
rows stream, proving FileLog both avoids the LOGICAL_ERROR and re-reads from
offset 0.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

Done in b07494e. The test no longer depends on filesystem inode-allocation policy.

It now forces the same-inode recreate deterministically with a hard link held outside the watched directory: pin the inode across the unlink, rewrite the content through the held link, then link the same inode back as a.csv. That yields a DW_ITEM_REMOVED + DW_ITEM_ADDED batch for the watched name with an unchanged inode, independent of whether the filesystem would have reused the freed inode. It also waits for the recreated rows and asserts the full stream 1,2,3,100,200, proving FileLog both avoids the exception and re-reads from offset 0.

Validated on xfs (which never reuses inodes, so the old test could not have reproduced here):

  • Unfixed binary (build id 106eddda): aborts with Last stored last_written_position in meta file a.csv is bigger than current last_written_pos (6 > 0), stack checkOffsetIsValid -> serialize -> openFilesAndSetPos -> streamToViews. The server log confirms onItemRemoved a.csv + onItemAdded a.csv in the same watcher drain (same inode).
  • Fixed binary (build id 2d56c397): survives and streams 1,2,3,100,200; no LOGICAL_ERROR in the log, and the REMOVED+ADDED batch is still processed (path exercised, not avoided).

The FileLog engine is Linux-only (USE_FILELOG is gated on OS_LINUX in
src/configure_config.cmake and the directory watcher uses inotify), so it
is not compiled into the Darwin / fast-test build. Without no-fasttest the
test runs there and CREATE TABLE ... ENGINE = FileLog fails with
UNKNOWN_STORAGE, breaking Fast test (arm_darwin). Matches the existing
no-fasttest FileLog tests 04031 and 04202.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The same-inode recreate was forced via a held hard link, but the
DW_ITEM_REMOVED + DW_ITEM_ADDED pair was not guaranteed to reach the
storage in one updateFileInfos drain: watchFunc wakes the storage task
right after reading the delete event, so on some interleavings an
unfixed server could process REMOVED alone (the cleanup loop erases the
stale on-disk meta) and then see ADDED as a clean new file, giving a
false pass.

getEventsAndReset is only called while a dependent view is attached, so
detach the materialized view across the delete/relink and re-attach
afterwards. The watcher keeps accumulating REMOVED + ADDED while the
consumer is detached and both are drained together on re-attach,
independent of watchFunc poll timing. Backoff is lowered to 500ms now
that batching no longer relies on a wide backoff window.

Validated both directions on xfs: unfixed binary (build id 106eddda)
aborts with 'last_written_position ... (6 > 0)' in
checkOffsetIsValid <- serialize <- openFilesAndSetPos <- streamToViews,
with onItemRemoved + onItemAdded for a.csv in one drain; fixed binary
(build id 2d56c397) survives and streams 1,2,3,100,200.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@groeneai

Copy link
Copy Markdown
Contributor Author

Done in 963e74b. The REMOVED + ADDED batching is now deterministic too, independent of watchFunc poll timing.

getEventsAndReset is only called from updateFileInfos, which only runs while a dependent view is attached. So the test now detaches file_log_mv before the delete/relink and re-attaches it afterwards: with the consumer detached the watcher keeps accumulating DW_ITEM_REMOVED + DW_ITEM_ADDED, and they are drained together by the first streaming cycle after re-attach. No streaming cycle can run between the two events regardless of how the scheduler interleaves them. Backoff is lowered to 500ms since batching no longer depends on a wide window.

Re-validated both directions on xfs (which never reuses inodes, so the hard-link trick is doing the inode-reuse work):

  • Unfixed (build id 106eddda): server aborts on re-attach with Last stored last_written_position in meta file a.csv is bigger than current last_written_pos (6 > 0), stack checkOffsetIsValid <- serialize <- openFilesAndSetPos <- streamToViews <- threadFunc. Log shows onItemRemoved a.csv + onItemAdded a.csv in one drain (3 us apart).
  • Fixed (build id 2d56c397): survives and streams 1,2,3,100,200.

@clickhouse-gh

clickhouse-gh Bot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

LLVM Coverage Report

Metric Baseline Current Δ
Lines 85.20% 85.30% +0.10%
Functions 92.30% 92.30% +0.00%
Branches 77.50% 77.50% +0.00%

Changed lines: Changed C/C++ lines covered by tests: 8/8 (100.00%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 3 line(s) · Uncovered code

Full report · Diff report

@groeneai

Copy link
Copy Markdown
Contributor Author

CI finished on 963e74b7c5ef. One job-level failure, not PR-caused:

  • BuzzHouse (amd_tsan): job-level ERROR, FileNotFoundError: .../status.tsv at ci/jobs/ast_fuzzer_job.py:240 (the fuzzer runner aborted before writing status.tsv). No STID, no crash stack, no ClickHouse frame; the other three BuzzHouse builds (amd_debug, amd_msan, arm_asan_ubsan) passed on this same commit. This is the chronic fuzzer-runner harness flake, addressed separately in Report fuzzer runner aborts instead of an opaque status.tsv FileNotFoundError #107676. Unrelated to this StorageFileLog change.
  • CH Inc sync: CIDB log-export sidecar, infrastructure, orthogonal to this PR.
  • Mergeable Check / PR: aggregators, red only because of the BuzzHouse job above.

All PR-own checks pass: Fast test, Fast test (arm_darwin), every Build, Integration, and the new 04342 regression test path. AI Review verdict on this commit is Approve.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@groeneai why not to remove it when the file is removed?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing it on DW_ITEM_REMOVED does not fire in the interleaving that triggers this crash.

Delete+recreate under the same name within one backoff window puts DW_ITEM_REMOVED and DW_ITEM_ADDED in the same getEventsAndReset() batch, walked in order:

  1. DW_ITEM_REMOVED sets context_by_name[name].status = REMOVED.
  2. DW_ITEM_ADDED (same reused inode) calls onFileAppeared(). The inode is unchanged, so its it->second.inode != inode guard is false (no meta cleanup), and it then resets the context to {OPEN, inode}, clearing REMOVED.

So when the REMOVED-cleanup loop runs, status is OPEN; its if (status == REMOVED) branch (with the removeFileIfExists) is skipped, the stale on-disk meta survives, and the ADDED branch has meanwhile reset the in-memory meta to pos 0. The next serialize() sees on-disk 122 vs in-memory 0 and raises the LOGICAL_ERROR.

DW_ITEM_ADDED is where the invariant actually breaks (the file is reassigned offset 0), so it is the right place to drop the now-stale meta. It mirrors the cleanup onFileAppeared already does for the inode-change case.

Eagerly deleting the meta in the REMOVED handler instead would break renames: DW_ITEM_MOVED_FROM also sets status = REMOVED, and there the meta must survive so DW_ITEM_MOVED_TO can rename it and preserve the offset.

@antaljanosbenjamin antaljanosbenjamin Jun 17, 2026

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the file shows up with same inode, then onFileAppeared won't clean up the file metas. Shouldn't they cleaned up here too?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

onFileAppeared's cleanup is specific to the inode-CHANGE case: it erases the meta of the name's previous inode (it->second.inode). That has to live in onFileAppeared, because line 931 overwrites it->second with the new inode right after, so this is the last point the orphaned old-inode entry is still identifiable.

The same-inode case is a different operation. There is no orphaned old inode to erase: the same meta_by_inode[inode] entry is reused and must be reset to offset 0 with its on-disk meta dropped. That reset-to-0 is correct only for DW_ITEM_ADDED (a genuinely new file). onFileAppeared is also called from DW_ITEM_MOVED_TO (line 1037), where the offset must be preserved: lines 1040-1046 rename the meta and keep last_writen_position. Cleaning the meta on same-inode inside onFileAppeared would wipe that; a rename onto an existing same-inode name would then fall through to the else at line 1048 and recreate the meta at offset 0, losing the moved file's position.

So the same-inode drop stays in the DW_ITEM_ADDED branch, next to the pos-0 reset (lines 1001-1004) it must stay consistent with. onFileAppeared keeps only the inode-change erase, which depends on reading the old inode before it is overwritten.

@antaljanosbenjamin antaljanosbenjamin self-assigned this Jun 24, 2026
@antaljanosbenjamin antaljanosbenjamin added this pull request to the merge queue Jun 25, 2026
Merged via the queue into ClickHouse:master with commit 0d9996b Jun 25, 2026
163 of 165 checks passed
@robot-ch-test-poll4 robot-ch-test-poll4 added the pr-synced-to-cloud The PR is synced to the cloud repo label Jun 25, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

can be tested Allows running workflows for external contributors pr-bugfix Pull request with bugfix, not backported by default pr-synced-to-cloud The PR is synced to the cloud repo

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants