clone 569505 -1 retitle -1 teach 'git gc' to run 'git fsck --no-full' severity -1 wishlist notforwarded -1 tags 569505 + fixed-upstream thanks
Hi again, It seems like a good time to revisit this bug. Zygo Blaxell wrote: > 'git add' will happily corrupt a git repo if it is run while files in > the working directory are being modified. A blob is added to the index > with contents that do not match its SHA1 hash. Nico Pitre’s 748af44c (sha1_file: be paranoid when creating loose objects, 2010-02-21) was merged to master and will probably be included in 1.7.1. Its strategy: To do so we compute the SHA1 of the data being deflated _after_ the deflate operation has consumed that data, and make sure it matches with the expected SHA1. This way we can rely on the CRC32 checked by the inflate operation to provide a good indication that the data is still coherent with its SHA1 hash. One pathological case we ignore is when the data is modified before (or during) deflate call, but changed back before it is hashed. Sounds reasonable to me. Maybe it would be possible to try it out? I can make a .deb from master for this if you would like. > If the index is then > committed, the corrupt blob cannot be checked out (or is checked out > with incorrect contents, depending on which tool you use to try to get > the file out of git) in the future. Do you remember more details? Some commands might be worth changing to be more permissive (to more easily recover from corruption) or strict (to more easily catch it). > Surprisingly, it's possible to clone, fetch, push, pull, and sometimes > even gc the corrupted repo several times before anyone notices the > corruption. For push, I do not think it is worth the slowdown; better to spend the time needed to be robust in hash-object. On the other hand, it does not seem overly burdensome to me to make ‘git gc’ imply a ‘git fsck --no-full’. That wouldn’t be needed to catch on-disk corruption (the CRC32 already does that), but it could be a good idea anyway. Cloning the bug. > If the affected commit is included in a merge with history > from other git users, the only way to fix it is to rebase I assume you used rebase -f? Clever. If it does not fail, git filter-branch --tree-filter : should accomplish the same thing without making the history linear, though it does not scale well to deep histories. It should be possible to devise an appropriate index-filter for this, too, if that is too slow. > (or come up > with a blob whose contents match the affected SHA1 hash somehow). Yes, depending on how the file was changing this could be easy or hard to do. /usr/share/doc/git-doc/howto/recover-corrupted-blob-object.txt is about something like this. > I discovered this bug accidentally when I was using inotifywait (from > the inotify-tools package) to automatically commit snapshots of a working > directory triggered by write events. Ah, so the commits happened precisely during the race window. “Something like git that can handle snapshots” tends to be something people want every once in a while. Of course, git has some problems for these uses: - racy add, as you noticed; - checkout is not atomic or close to atomic; - large files are not supported well (but there is some work going on to change this); - uncompressible files are not supported well; - file metadata is not tracked; - directories with many entries are not supported well; - rename detection works poorly with binary files; - no quick way to throw away old history. If the tree to be tracked is small (like /etc), one can try to work around these things and get away with it for a while. Still, when so many of git’s features are the opposite of useful for the goal, it is a wonder it is so tempting to try anyway. My guess is it is the UI for browsing and manipulating history. There are plenty of backup tools that are great for capturing snapshots, but few of them make it easy to do much with the snapshots once they’re captured. The nicest I’ve experienced created a magic .snapshots directory under the mount point for the file system whose history was being tracked, with subdirectories for each time a snapshot was taken. Not nearly as pleasant to use as ‘git log’. Something to think about. > I tested this with a number of kernel versions from 2.6.27 to 2.6.31. > All of them reproduced this problem. I checked this because strace > shows 'git add' doing a mmap(..., MAP_PRIVATE, ...) of its input file, I think this is more of a “MAP_PRIVATE by default because that’s easier on the kernel” kind of thing. Cheers, Jonathan -- To UNSUBSCRIBE, email to debian-bugs-dist-requ...@lists.debian.org with a subject of "unsubscribe". Trouble? Contact listmas...@lists.debian.org