{{ message }}
update-index: add --refresh-stat-only#2125
Open
giorgidze wants to merge 2 commits into
Open
Conversation
When refresh_index() is invoked with REFRESH_REALLY (e.g. via "git update-index --really-refresh"), the documented behaviour is that the "assume unchanged" bit on cache entries is disregarded so that stale stat data on those entries is still refreshed. The preload pass runs before the single-threaded refresh loop and is intended to mark up-to-date entries quickly so the slow path only has to deal with the leftovers. However, preload_thread() unconditionally called ie_match_stat() with CE_MATCH_RACY_IS_DIRTY|CE_MATCH_IGNORE_FSMONITOR and never with CE_MATCH_IGNORE_VALID, so it honoured the "assume unchanged" bit. When a modified file's entry was marked assume-unchanged, preload would conclude the entry was clean and call ce_mark_uptodate(); the subsequent --really-refresh loop would then skip the entry (because ce_uptodate(ce) is true) and never report it as needing an update. This only manifests when preload is active, so it has been latent in default configurations. It is observable today via GIT_TEST_PRELOAD_INDEX=1. Plumb the refresh flags through to the preload threads via a new refresh_flags field on struct thread_data, and have preload_thread() add CE_MATCH_IGNORE_VALID to its match options when REFRESH_REALLY is in effect. Update refresh_index() to pass "flags & REFRESH_REALLY" to preload_index() instead of a bare 0. Add a regression test under t2106 that forces preload on and confirms that "update-index --really-refresh" reports a modified assume-unchanged entry as needing update. Signed-off-by: George Giorgidze <giorgidze@meta.com>
When a working tree is copied from another machine, or restored from
a tarball, container image, or CI cache on the same machine, the
files may be byte-for-byte identical while cached stat data in the
index no longer matches. Backup and sync tools can preserve mtimes,
but fields like inode and device numbers are filesystem-local, so
large repositories can still end up paying for expensive refresh
checks on every "git status".
Git already has runtime configuration for reducing which stat fields
are checked, such as core.checkStat=<minimal|default>. That affects
how future checks interpret cached stat data, but it does not provide
a one-shot way to update the index's cached stat data to match the
current filesystem without also rehashing file contents. Setting
core.checkStat=minimal is "sticky": it weakens every subsequent
operation in the repository for the duration of the configuration,
rather than performing a single, bounded correction at a well-defined
point.
A similar idea was discussed on the list in January 2017 under the
name "--assume-content-unchanged"; see the thread starting at
<20170105112359.GN8116@chrystal.oracle.com>. The concern raised there
was that exposing a way to update cached stat data without content
comparison opens the index to abuse: an interactive user could skip a
slow refresh, lie to Git about the worktree, then file a bug after a
later merge corrupts a file. That concern is taken seriously here,
and this proposal is deliberately narrower than the 2017 one:
* It is a one-shot action, not a sticky configuration or per-entry
bit. The name --refresh-stat-only reflects that: it describes
what the command does in a single invocation, not a trust state
attached to entries (contrast with --assume-unchanged).
* The trust assertion is intended for closed-loop callers (CI cache
restore, container provisioning, backup/restore tooling) where
the worktree and the index were produced or verified together by
the same process. It is not a knob for interactive users to reach
for when "git status" feels slow.
* The failure mode is named directly in the documentation: if the
worktree does not in fact match the index, affected entries will
appear clean while the recorded object ID remains stale. The user
must type the flag, having read the warning. This is a narrower
contract than core.checkStat=minimal, which silently affects
every subsequent operation.
Container-based CI has become the dominant deployment model in the
years since that 2017 discussion. The current workaround -- setting
core.checkStat=minimal in every job step, or accepting the cost of
full content rehashing -- is operationally fragile: it requires every
step in every pipeline to set and preserve the configuration, and it
permanently weakens stat semantics for every command those steps
run. A single explicit invocation at restore time is a tighter, more
local fix.
Teach git update-index --refresh-stat-only to refresh only cached
stat information. It follows the existing refresh machinery, but
skips ie_modified() and treats racy entries as dirty by stat instead
of resolving them by content. Like --really-refresh, it ignores the
"assume unchanged" setting, so stale stat data on those entries is
still updated; that behaviour is documented alongside the flag.
The preload pass is extended to recognise REFRESH_STAT_ONLY (on top
of REFRESH_REALLY, which was wired up in the preceding commit) so
that assume-unchanged entries are not marked uptodate before the main
refresh path can update them.
Add tests covering object ID preservation, missing-file handling with
and without --ignore-missing, assume-unchanged override, and quiet
output.
Signed-off-by: George Giorgidze <giorgidze@meta.com>
|
/allow |
Author
|
/preview |
|
Error: User giorgidze is not yet permitted to use GitGitGadget |
Member
|
/allow |
|
User giorgidze is now allowed to use GitGitGadget. WARNING: giorgidze has no public email address set on GitHub; GitGitGadget needs an email address to Cc: you on your contribution, so that you receive any feedback on the Git mailing list. Go to https://github.com/settings/profile to make your preferred email public to let GitGitGadget know which email address to use. |
Author
|
/preview |
|
Preview email sent as pull.2125.git.1779782272.gitgitgadget@gmail.com |
Author
|
/submit |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

This two-patch series adds "git update-index --refresh-stat-only", a
one-shot way to update the index's cached stat data to match the
current filesystem without rehashing file contents.
When a working tree is produced or restored by means other than a
normal checkout -- a CI cache restore, container provisioning, a
tarball extraction, or a copy from another machine -- the files may
be byte-for-byte identical while filesystem-local stat fields like
inode and device numbers no longer match. Today the available
workarounds are to (a) pay for full content rehashing on the next
"git status", or (b) set core.checkStat=minimal, which is sticky
and weakens every subsequent operation. Neither composes well with
modern container-based CI, where every job step would otherwise
need to set and preserve the configuration.
A similar idea ("--assume-content-unchanged") was discussed in
January 2017; see the thread starting at
20170105112359.GN8116@chrystal.oracle.com. The concern raised
there was that exposing a way to update cached stat data without
content comparison opens the index to abuse. The flag in this
series is deliberately narrower than the 2017 proposal:
attached to entries (contrast --assume-unchanged);
provisioning, backup/restore tooling) that produced or verified
the worktree atomically;
the next content check -- is named directly in the docs, and
the flag must be typed explicitly.
The series is organised so the bug fix is reviewable on its own:
1/2 preload-index: respect --really-refresh override of
assume-unchanged
2/2 update-index: add --refresh-stat-only
CC: Junio C Hamano gitster@pobox.com