clone 569505 -1
retitle -1 teach 'git gc' to run 'git fsck --no-full'
severity -1 wishlist
notforwarded -1
tags 569505 + fixed-upstream
thanks

Hi again,

It seems like a good time to revisit this bug.

Zygo Blaxell wrote:

> 'git add' will happily corrupt a git repo if it is run while files in
> the working directory are being modified.  A blob is added to the index
> with contents that do not match its SHA1 hash.

Nico Pitre’s 748af44c (sha1_file: be paranoid when creating loose
objects, 2010-02-21) was merged to master and will probably be
included in 1.7.1.  Its strategy:

        To do so we compute the SHA1 of the data being deflated
        _after_ the deflate operation has consumed that data, and
        make sure it matches with the expected SHA1.  This way we
        can rely on the CRC32 checked by the inflate operation to
        provide a good indication that the data is still coherent
        with its SHA1 hash.  One pathological case we ignore is
        when the data is modified before (or during) deflate
        call, but changed back before it is hashed.

Sounds reasonable to me.  Maybe it would be possible to try it out?
I can make a .deb from master for this if you would like.

> If the index is then
> committed, the corrupt blob cannot be checked out (or is checked out
> with incorrect contents, depending on which tool you use to try to get
> the file out of git) in the future.

Do you remember more details?  Some commands might be worth changing
to be more permissive (to more easily recover from corruption) or
strict (to more easily catch it).

> Surprisingly, it's possible to clone, fetch, push, pull, and sometimes
> even gc the corrupted repo several times before anyone notices the
> corruption.

For push, I do not think it is worth the slowdown; better to spend the
time needed to be robust in hash-object.

On the other hand, it does not seem overly burdensome to me to make
‘git gc’ imply a ‘git fsck --no-full’.  That wouldn’t be needed to
catch on-disk corruption (the CRC32 already does that), but it could
be a good idea anyway.  Cloning the bug.

> If the affected commit is included in a merge with history
> from other git users, the only way to fix it is to rebase

I assume you used rebase -f?  Clever.

If it does not fail,

 git filter-branch --tree-filter :

should accomplish the same thing without making the history linear,
though it does not scale well to deep histories.  It should be
possible to devise an appropriate index-filter for this, too, if that
is too slow.

> (or come up
> with a blob whose contents match the affected SHA1 hash somehow).

Yes, depending on how the file was changing this could be easy or hard
to do.  /usr/share/doc/git-doc/howto/recover-corrupted-blob-object.txt
is about something like this.

> I discovered this bug accidentally when I was using inotifywait (from
> the inotify-tools package) to automatically commit snapshots of a working
> directory triggered by write events.

Ah, so the commits happened precisely during the race window.  

“Something like git that can handle snapshots” tends to be something
people want every once in a while.  Of course, git has some problems
for these uses:

 - racy add, as you noticed;
 - checkout is not atomic or close to atomic;
 - large files are not supported well (but there is some work going on
   to change this);
 - uncompressible files are not supported well;
 - file metadata is not tracked;
 - directories with many entries are not supported well;
 - rename detection works poorly with binary files;
 - no quick way to throw away old history.

If the tree to be tracked is small (like /etc), one can try to work
around these things and get away with it for a while.

Still, when so many of git’s features are the opposite of useful for
the goal, it is a wonder it is so tempting to try anyway.  My guess is
it is the UI for browsing and manipulating history.

There are plenty of backup tools that are great for capturing
snapshots, but few of them make it easy to do much with the snapshots
once they’re captured.  The nicest I’ve experienced created a magic
.snapshots directory under the mount point for the file system whose
history was being tracked, with subdirectories for each time a
snapshot was taken.  Not nearly as pleasant to use as ‘git log’.

Something to think about.

> I tested this with a number of kernel versions from 2.6.27 to 2.6.31.
> All of them reproduced this problem.  I checked this because strace
> shows 'git add' doing a mmap(..., MAP_PRIVATE, ...) of its input file,

I think this is more of a “MAP_PRIVATE by default because that’s
easier on the kernel” kind of thing.

Cheers,
Jonathan



-- 
To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org
with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org

Reply via email to