Fix FileLog crash on delete+recreate reusing the same inode#107617
Conversation
StorageFileLog could abort with a LOGICAL_ERROR when a watched file was deleted and recreated reusing the same inode within one directory-watcher backoff window: Last stored last_written_position in meta file a.csv is bigger than current last_written_pos (N > 0) When the DW_ITEM_REMOVED and DW_ITEM_ADDED events for the same name are drained together, updateFileInfos() processes DW_ITEM_ADDED first: it flips the file status back to OPEN (so the later REMOVED-cleanup loop skips it) and resets the in-memory FileMeta for the inode to offset 0. onFileAppeared() only cleans up the on-disk meta when the inode CHANGES, so on inode REUSE the stale on-disk meta (with the old, larger offset) is left behind. The next serialize() then reads that larger stored offset and trips checkOffsetIsValid(). PR ClickHouse#101408 covered the inode-change case (a fresh inode). This covers inode reuse, which is deterministic on filesystems that recycle freed inodes (e.g. ext4) and is what surfaced under stress in CI. A freshly added file is read from offset 0, so any on-disk meta under its name is stale; remove it in the DW_ITEM_ADDED branch so the on-disk and in-memory metadata stay consistent. The DW_ITEM_MOVED_TO (rename) branch is intentionally left unchanged: it preserves the consumed offset, and the moved-out name's on-disk meta is already removed by the REMOVED-cleanup loop before the new inode is seen. Regression test 04342_file_log_delete_recreate_same_inode reproduces the crash on an unfixed server (delete + immediate recreate, high watch backoff) and passes after the fix. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
cc @antaljanosbenjamin @kssenii — could you review this? It fixes a FileLog server crash ( |
|
Workflow [PR], commit [963e74b] Summary: ❌
AI ReviewSummary
Final Verdict
|
The test relied on ext4 reusing the just-freed inode, so on tmpfs/xfs an unfixed build took the already-covered inode-change path and could pass. Force the same-inode recreate deterministically with a hard link held outside the watched directory: pin the inode across the unlink, rewrite content through the held link, then link it back under the original name. This produces a DW_ITEM_REMOVED + DW_ITEM_ADDED batch for the watched name with an unchanged inode regardless of filesystem allocation policy. Also assert the recreated rows stream, proving FileLog both avoids the LOGICAL_ERROR and re-reads from offset 0. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Done in b07494e. The test no longer depends on filesystem inode-allocation policy. It now forces the same-inode recreate deterministically with a hard link held outside the watched directory: pin the inode across the unlink, rewrite the content through the held link, then link the same inode back as Validated on xfs (which never reuses inodes, so the old test could not have reproduced here):
|
The FileLog engine is Linux-only (USE_FILELOG is gated on OS_LINUX in src/configure_config.cmake and the directory watcher uses inotify), so it is not compiled into the Darwin / fast-test build. Without no-fasttest the test runs there and CREATE TABLE ... ENGINE = FileLog fails with UNKNOWN_STORAGE, breaking Fast test (arm_darwin). Matches the existing no-fasttest FileLog tests 04031 and 04202. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The same-inode recreate was forced via a held hard link, but the DW_ITEM_REMOVED + DW_ITEM_ADDED pair was not guaranteed to reach the storage in one updateFileInfos drain: watchFunc wakes the storage task right after reading the delete event, so on some interleavings an unfixed server could process REMOVED alone (the cleanup loop erases the stale on-disk meta) and then see ADDED as a clean new file, giving a false pass. getEventsAndReset is only called while a dependent view is attached, so detach the materialized view across the delete/relink and re-attach afterwards. The watcher keeps accumulating REMOVED + ADDED while the consumer is detached and both are drained together on re-attach, independent of watchFunc poll timing. Backoff is lowered to 500ms now that batching no longer relies on a wide backoff window. Validated both directions on xfs: unfixed binary (build id 106eddda) aborts with 'last_written_position ... (6 > 0)' in checkOffsetIsValid <- serialize <- openFilesAndSetPos <- streamToViews, with onItemRemoved + onItemAdded for a.csv in one drain; fixed binary (build id 2d56c397) survives and streams 1,2,3,100,200. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
|
Done in 963e74b. The REMOVED + ADDED batching is now deterministic too, independent of
Re-validated both directions on xfs (which never reuses inodes, so the hard-link trick is doing the inode-reuse work):
|
LLVM Coverage Report
Changed lines: Changed C/C++ lines covered by tests: 8/8 (100.00%) | Lost baseline coverage (was covered on master, now uncovered in this PR): 3 line(s) · Uncovered code |
|
CI finished on
All PR-own checks pass: Fast test, Fast test (arm_darwin), every Build, Integration, and the new |
There was a problem hiding this comment.
@groeneai why not to remove it when the file is removed?
There was a problem hiding this comment.
Removing it on DW_ITEM_REMOVED does not fire in the interleaving that triggers this crash.
Delete+recreate under the same name within one backoff window puts DW_ITEM_REMOVED and DW_ITEM_ADDED in the same getEventsAndReset() batch, walked in order:
- DW_ITEM_REMOVED sets
context_by_name[name].status = REMOVED. - DW_ITEM_ADDED (same reused inode) calls
onFileAppeared(). The inode is unchanged, so itsit->second.inode != inodeguard is false (no meta cleanup), and it then resets the context to{OPEN, inode}, clearing REMOVED.
So when the REMOVED-cleanup loop runs, status is OPEN; its if (status == REMOVED) branch (with the removeFileIfExists) is skipped, the stale on-disk meta survives, and the ADDED branch has meanwhile reset the in-memory meta to pos 0. The next serialize() sees on-disk 122 vs in-memory 0 and raises the LOGICAL_ERROR.
DW_ITEM_ADDED is where the invariant actually breaks (the file is reassigned offset 0), so it is the right place to drop the now-stale meta. It mirrors the cleanup onFileAppeared already does for the inode-change case.
Eagerly deleting the meta in the REMOVED handler instead would break renames: DW_ITEM_MOVED_FROM also sets status = REMOVED, and there the meta must survive so DW_ITEM_MOVED_TO can rename it and preserve the offset.
There was a problem hiding this comment.
If the file shows up with same inode, then onFileAppeared won't clean up the file metas. Shouldn't they cleaned up here too?
There was a problem hiding this comment.
onFileAppeared's cleanup is specific to the inode-CHANGE case: it erases the meta of the name's previous inode (it->second.inode). That has to live in onFileAppeared, because line 931 overwrites it->second with the new inode right after, so this is the last point the orphaned old-inode entry is still identifiable.
The same-inode case is a different operation. There is no orphaned old inode to erase: the same meta_by_inode[inode] entry is reused and must be reset to offset 0 with its on-disk meta dropped. That reset-to-0 is correct only for DW_ITEM_ADDED (a genuinely new file). onFileAppeared is also called from DW_ITEM_MOVED_TO (line 1037), where the offset must be preserved: lines 1040-1046 rename the meta and keep last_writen_position. Cleaning the meta on same-inode inside onFileAppeared would wipe that; a rename onto an existing same-inode name would then fall through to the else at line 1048 and recreate the meta at offset 0, losing the moved file's position.
So the same-inode drop stays in the DW_ITEM_ADDED branch, next to the pos-0 reset (lines 1001-1004) it must stay consistent with. onFileAppeared keeps only the inode-change erase, which depends on reading the old inode before it is overwritten.
0d9996b

Related: #101408
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes into CHANGELOG.md):
Fixed a
FileLogserver crash (LOGICAL_ERROR: Last stored last_written_position in meta file ... is bigger than current last_written_pos) that could happen when a watched file was deleted and recreated reusing the same inode.Description
Symptom
StorageFileLogcould abort the server with aLOGICAL_ERROR:Crash path:
StorageFileLog::checkOffsetIsValid->serialize->openFilesAndSetPos.Root cause
When a watched file is deleted and recreated under the same name within one
directory-watcher backoff window, the
DW_ITEM_REMOVEDandDW_ITEM_ADDEDevents are drained together.
updateFileInfos()processesDW_ITEM_ADDEDfirst: it flips the file status back to
OPEN(so the later REMOVED-cleanuploop skips it) and resets the in-memory
FileMetafor the inode to offset 0.onFileAppeared()only removes the stale on-disk meta when the inodechanges. If the recreated file reuses the same inode (deterministic on
filesystems that recycle freed inodes, e.g. ext4), that guard does not fire,
so the on-disk meta with the old, larger offset is left behind while the
in-memory offset is 0. The next
serialize()reads the larger stored offsetand trips
checkOffsetIsValid.PR #101408 covered the inode-change case (a fresh inode). This PR covers
inode reuse, which is what surfaced under stress in CI.
Fix
A freshly added file is read from offset 0, so any on-disk meta under its
name is stale. Remove it in the
DW_ITEM_ADDEDbranch so the on-disk andin-memory metadata stay consistent. The
DW_ITEM_MOVED_TO(rename) branch isintentionally left unchanged: it preserves the consumed offset, and a
moved-out name's on-disk meta is already removed by the REMOVED-cleanup loop
before the new inode is observed.
Validation
New regression test
04342_file_log_delete_recreate_same_inodereproducesthe crash on an unfixed server (it aborts on the first delete+recreate round)
and passes after the fix. Existing FileLog regression tests
(
04242_file_log_delete_recreate,04257_file_log_rename_and_recreate) andthe FileLog streaming suite continue to pass, and after delete+recreate the
new content is still streamed correctly (no rows skipped).
No related open issue was found.
Version info
26.7.1.58