Re: chunking (Re: [ANNOUNCEMENT] /Arch/ embraces `git')

2005-04-20 Thread C. Scott Ananian
On Wed, 20 Apr 2005, Linus Torvalds wrote: What are the disk usage results? I'm on ext3, for example, which means that even small files invariably take up 4.125kB on disk (with the inode). Even uncompressed, most source files tend to be small. Compressed, I'm seeing a median blob size of ~1.6kB. [...]
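For the record, the 4.125kB figure checks out if one assumes ext3's then-default 128-byte inodes and 4kB blocks: even a one-byte file occupies a 4096-byte data block plus its 128-byte inode, and 4096 + 128 = 4224 bytes = 4.125kB. Against a ~1.6kB median compressed blob, more than half of each allocated block is slack.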

Re: [ANNOUNCEMENT] /Arch/ embraces `git'

2005-04-20 Thread C. Scott Ananian
On Wed, 20 Apr 2005, Petr Baudis wrote: I think one thing git's object database is not very well suited for is network transport. You want to have something smart doing the transports, comparing trees so that it can do some delta compression; that could probably reduce the amount of data needed [...]

Blob chunking code. [Second look]

2005-04-20 Thread C. Scott Ananian
 * [...] chunk), and these chunks are put in the object store.  This way
 * similar files will be expected to share chunks, saving space.
 * Files less than one disk block long are expected to fit in a single
 * chunk, so there is no extra indirection overhead for this case.
 *
 * Copyright [...]

Re: [PATCH] Some documentation...

2005-04-20 Thread C. Scott Ananian
On Wed, 20 Apr 2005, David Greaves wrote: In doing this I noticed a couple of points: * update-cache won't accept ./file or fred/./file. The comment in update-cache.c reads:

/*
 * We fundamentally don't like some paths: we don't want
 * dot or dot-dot anywhere, and in fact, we don't even want
 * any [...]

Re: [PATCH] write-tree performance problems

2005-04-20 Thread C. Scott Ananian
On Wed, 20 Apr 2005, Linus Torvalds wrote: I was considering using a chunked representation for *all* files (not just blobs), which would avoid the original 'trees must reference other trees or they become too large' issue -- and maybe the performance issue you're referring to, as well? No. The most [...]

Re: [PATCH] write-tree performance problems

2005-04-20 Thread C. Scott Ananian
On Wed, 20 Apr 2005, Chris Mason wrote: With the basic changes I described before, the 100-patch time only goes down to 40s. Certainly not fast enough to justify the changes. In this case, the bulk of the extra time comes from write-tree writing the index file, so I split write-tree.c up into li[...]

Re: WARNING! Object DB conversion (was Re: [PATCH] write-tree performance problems)

2005-04-20 Thread C. Scott Ananian
On Wed, 20 Apr 2005, Martin Uecker wrote: You can (and my code demonstrates/will demonstrate) still use a whole-file hash together with chunking. With content prefixes, this takes O(N ln M) time (where N is the file size and M is the number of chunks) to compute all hashes; if subtrees can share the same [...]
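For the curious, the O(N ln M) figure follows (I believe) from assuming each internal node of the hash tree hashes the full uncompressed content of its subtree: the tree over M chunks is about log2(M) levels deep, and each level hashes every one of the N bytes once, for N * log2(M) bytes hashed in total. Hashing only the child digests at interior nodes would drop this to O(N + M), but then the root would no longer double as a whole-file hash.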

Re: WARNING! Object DB conversion (was Re: [PATCH] write-tree performance problems)

2005-04-20 Thread C. Scott Ananian
On Wed, 20 Apr 2005, Linus Torvalds wrote: - _keep_ the same compression format, but notice that we already have an object by looking at the uncompressed one. With a chunked file, you can also skip writing certain *subtrees* of the file as soon as you notice it's already present on disk. I can [...]
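A compile-and-run toy of that pruning, with a fake in-memory table standing in for the object database (have_object and write_object are stand-ins here, not git functions): the moment a subtree's hash is found in the store, the whole recursion below it is skipped.

#include <stdio.h>
#include <string.h>

/* Toy object store: a table of 20-byte hashes we have "written". */
static unsigned char store[64][20];
static int nstore;

static int have_object(const unsigned char *sha1)
{
    for (int i = 0; i < nstore; i++)
        if (!memcmp(store[i], sha1, 20))
            return 1;
    return 0;
}

static void write_object(const unsigned char *sha1)
{
    memcpy(store[nstore++], sha1, 20);
    printf("wrote one object\n");
}

struct node {
    unsigned char sha1[20];        /* hash of this subtree's content */
    struct node *left, *right;     /* NULL for leaf chunks */
};

/* Skip an entire subtree as soon as its root hash is already stored. */
static void write_subtree(struct node *n)
{
    if (!n || have_object(n->sha1))
        return;
    write_subtree(n->left);
    write_subtree(n->right);
    write_object(n->sha1);
}

int main(void)
{
    struct node a = { { 1 }, NULL, NULL };   /* fake hashes: 1, 2, 3 */
    struct node b = { { 2 }, NULL, NULL };
    struct node root = { { 3 }, &a, &b };

    write_subtree(&root);   /* writes a, b, root */
    write_subtree(&root);   /* prints nothing: pruned at the root */
    return 0;
}

Run twice on the same tree, the second call does no work at all, which is exactly the property that makes re-writing mostly-unchanged files cheap.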

Re: WARNING! Object DB conversion (was Re: [PATCH] write-tree performance problems)

2005-04-20 Thread C. Scott Ananian
On Wed, 20 Apr 2005, Martin Uecker wrote: The other thing I don't like is the use of a SHA1 for a complete file. Switching to some kind of hash tree would allow introducing chunks later. This has two advantages: You can (and my code demonstrates/will demonstrate) still use a whole-file hash together with chunking [...]

Blob chunking code. [First look.]

2005-04-20 Thread C. Scott Ananian
So I wrote up my ideas regarding blob chunking as code; see attached. This is against git-0.4 (I know, ancient, but I had to start somewhere). The idea here is that blobs are chunked using a rolling checksum (so the chunk boundaries are content-dependent and stay fixed even if you mutate pieces of [...]
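To make the mechanism concrete, here is a compile-and-run toy of content-defined chunking (illustrative only: WINDOW, MIN_CHUNK, and MASK are made-up parameters, and the rolling checksum in the actual patch may differ). A boundary falls wherever a rolling sum of the last WINDOW bytes matches a fixed bit pattern, so boundaries depend only on nearby content and survive edits elsewhere in the file.

#include <stdio.h>
#include <stddef.h>

#define WINDOW    32
#define MIN_CHUNK 64
#define MASK      0x0fff   /* ~4kB average chunk: one disk block */

/* Declare a chunk boundary wherever the rolling sum of the last
 * WINDOW bytes matches MASK.  Editing one part of the file leaves
 * the boundaries (and hence the chunks) elsewhere untouched. */
static void chunk(const unsigned char *buf, size_t len)
{
    unsigned sum = 0;
    size_t start = 0;

    for (size_t i = 0; i < len; i++) {
        sum += buf[i];
        if (i >= WINDOW)
            sum -= buf[i - WINDOW];          /* slide the window */
        if (((sum & MASK) == MASK && i + 1 - start >= MIN_CHUNK)
            || i + 1 == len) {
            printf("chunk at %zu, %zu bytes\n", start, i + 1 - start);
            start = i + 1;
        }
    }
}

int main(void)
{
    static unsigned char data[1 << 16];
    for (size_t i = 0; i < sizeof data; i++)
        data[i] = (unsigned char)((i * 2654435761u) >> 13);
    chunk(data, sizeof data);
    return 0;
}

With MASK = 0x0fff a boundary fires on average once per ~4kB of random input, which lines up with the design note above that files shorter than one disk block should fit in a single chunk.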

Re: SHA1 hash safety

2005-04-19 Thread C. Scott Ananian
On Tue, 19 Apr 2005, David Meybohm wrote: But doesn't this require assuming the distribution of MD5 is uniform, and don't the papers finding collisions in less time show it's not? So your birthday argument for calculating the probability wouldn't apply, because it rests on the assumption that MD5 is uniform. [...]
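For scale: with a uniform 128-bit hash over n documents, the birthday bound puts the collision probability at roughly n^2 / 2^129. For n = 10^6 documents that is 10^12 / (6.8 x 10^38), i.e. about 1.5 x 10^-27, so a real MD5 collision observed in a store of that size would be evidence of non-uniformity (or of the known attacks), not of bad luck.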

Re: [PATCH] write-tree performance problems

2005-04-19 Thread C. Scott Ananian
On Tue, 19 Apr 2005, Linus Torvalds wrote: (*) Actually, I think it's the compression that ends up being the most expensive part. You're also using the equivalent of '-9', too -- and *that's slow*. Changing to Z_NORMAL_COMPRESSION would probably help a lot (but would break all existing repositories [...]
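A side note on the constant: zlib itself defines Z_DEFAULT_COMPRESSION (which selects level 6) and Z_BEST_COMPRESSION (level 9, the '-9' equivalent), but no Z_NORMAL_COMPRESSION. A minimal sketch of the speed/size trade-off, assuming only stock zlib (link with -lz):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <zlib.h>

int main(void)
{
    /* 8MB of repetitive, source-like input. */
    uLong srclen = 8 << 20;
    Bytef *src = malloc(srclen);
    for (uLong i = 0; i < srclen; i++)
        src[i] = "int long char struct "[i % 21];

    /* Z_DEFAULT_COMPRESSION is -1, zlib's internal default (level 6);
     * Z_BEST_COMPRESSION is level 9. */
    int levels[] = { Z_BEST_SPEED, Z_DEFAULT_COMPRESSION, Z_BEST_COMPRESSION };
    for (int i = 0; i < 3; i++) {
        uLongf destlen = compressBound(srclen);
        Bytef *dest = malloc(destlen);
        clock_t t0 = clock();
        compress2(dest, &destlen, src, srclen, levels[i]);
        printf("level %2d: %lu -> %lu bytes in %.2fs\n", levels[i],
               (unsigned long)srclen, (unsigned long)destlen,
               (double)(clock() - t0) / CLOCKS_PER_SEC);
        free(dest);
    }
    free(src);
    return 0;
}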

Re: SHA1 hash safety

2005-04-18 Thread C. Scott Ananian
On Mon, 18 Apr 2005, Andy Isaacson wrote: If you had actual evidence of a collision, I'd love to see it - even if it's just the equivalent of:

% md5 foo
d3b07384d113edec49eaa6238ad5ff00 foo
% md5 bar
d3b07384d113edec49eaa6238ad5ff00 bar
% cmp foo bar
foo bar differ: byte 25, line 1

But in the absence [...]

Re: SHA1 hash safety

2005-04-18 Thread C. Scott Ananian
On Sun, 17 Apr 2005, Horst von Brand wrote: crypto-babble about collision whitepapers is uninteresting without a repo that has real collisions. git is far too cool as is - prove [...] I just copy over a file (might be the first step in splitting it, or a header file that is duplicated for convenience, [...]

Re: Re: SHA1 hash safety

2005-04-16 Thread C. Scott Ananian
On Sat, 16 Apr 2005, Petr Baudis wrote: I know the current state of the art here. It's going to take more than just hearsay to convince me that full 128-bit MD5 collisions are likely. http://cryptography.hyperlink.cz/MD5_collisions.html OK, OK, I spoke too sloppily. Let me rephrase: it's going [...]

Re: space compression (again)

2005-04-16 Thread C. Scott Ananian
On Sat, 16 Apr 2005, Martin Uecker wrote: The right thing (TM) is to switch from a SHA1 of the compressed content of the complete monolithic file to a Merkle hash tree of the uncompressed content. This would make the hash independent of the actual storage method (chunked or not). It would certainly be [...]
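A minimal two-level sketch of that idea, assuming OpenSSL's SHA1() (link with -lcrypto) and fixed 8kB leaves purely for brevity; a real version would use the content-defined chunk boundaries discussed elsewhere in this thread. The point it demonstrates: the root hash is computed purely from uncompressed bytes, so it comes out the same whether the store keeps the file monolithic-and-compressed or chunked.

#include <stdio.h>
#include <string.h>
#include <openssl/sha.h>

#define LEAF      8192
#define MAXLEAVES 1024

/* Root of a two-level hash tree over *uncompressed* content: hash
 * each leaf chunk, then hash the concatenated leaf digests. */
static void tree_hash(const unsigned char *buf, size_t len,
                      unsigned char root[20])
{
    unsigned char digest[MAXLEAVES][20];
    size_t n = 0;

    for (size_t off = 0; off < len; off += LEAF, n++) {
        size_t take = len - off < LEAF ? len - off : LEAF;
        SHA1(buf + off, take, digest[n]);
    }
    SHA1((const unsigned char *)digest, n * 20, root);
}

int main(void)
{
    static unsigned char file[100000];
    unsigned char root[20];

    memset(file, 'x', sizeof file);
    tree_hash(file, sizeof file, root);
    for (int i = 0; i < 20; i++)
        printf("%02x", root[i]);
    putchar('\n');
    return 0;
}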

Re: SHA1 hash safety

2005-04-16 Thread C. Scott Ananian
On Sat, 16 Apr 2005, Brian O'Mahoney wrote: (1) I _have_ seen real-life collisions with MD5, in the context of document management systems containing ~10^6 MS Word documents. Dude! You could have been *famous*! Why the aitch-ee-double-hockey-sticks didn't you publish this when you found it? [...]

Re: write-tree is pasky-0.4

2005-04-15 Thread C. Scott Ananian
On Fri, 15 Apr 2005, Junio C Hamano wrote: [...] to yours is no problem for me. Currently I see your HEAD is at 461aef08823a18a6c69d472499ef5257f8c7f6c8, so I will generate a set of patches against it. Have you considered using an s/key-like system to make these hashes more human-readable? Using the S[...]
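For reference, S/Key (RFC 1760) renders 64 bits as six words of 11 bits each, drawn from a fixed 2048-word dictionary (a 2-bit checksum pads 64 to 66 bits). A self-contained toy that fakes the dictionary with generated syllables, applied to the leading 64 bits of the HEAD hash quoted above:

#include <stdio.h>
#include <stdint.h>

/* Render one 11-bit group as a fake pronounceable "word".  RFC 1760
 * uses a real 2048-word English dictionary; syllables keep this
 * example self-contained.  14*5*14*5*14 = 68600 >= 2048. */
static void word_for(unsigned idx, char out[6])
{
    static const char c[] = "bdfgklmnprstvz", v[] = "aeiou";
    out[0] = c[idx % 14]; idx /= 14;
    out[1] = v[idx % 5];  idx /= 5;
    out[2] = c[idx % 14]; idx /= 14;
    out[3] = v[idx % 5];  idx /= 5;
    out[4] = c[idx % 14];
    out[5] = '\0';
}

int main(void)
{
    /* Leading 64 bits of 461aef08823a18a6c69d...; the real scheme
     * appends a 2-bit checksum so six 11-bit groups cover 66 bits,
     * while here the top group simply runs short. */
    uint64_t h = 0x461aef08823a18a6ULL;
    char w[6];

    for (int i = 5; i >= 0; i--) {
        word_for((unsigned)((h >> (11 * i)) & 0x7ff), w);
        printf("%s%c", w, i ? '-' : '\n');
    }
    return 0;
}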

Re: space compression (again)

2005-04-15 Thread C. Scott Ananian
On Fri, 15 Apr 2005, Linus Torvalds wrote: The problem with chunking is: - it complicates a lot of the routines. Things like "is this file unchanged" suddenly become "is this file still the same set of chunks", which is just a _lot_ more code and a lot more likely to have bugs. The blob still [...]

space compression (again)

2005-04-15 Thread C. Scott Ananian
I've been reading the archives (a bad idea, I know). Here's a concrete suggestion for GIT space compression which is (I believe) consistent with the philosophy of GIT. Why are blobs per-file? [After all, Linus insists that files are an illusion.] Why not just have 'chunks', and assemble the [...]
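A declaration-only sketch of what such a blob could look like (a hypothetical layout, not anything git actually does): the blob object names its chunks, each chunk lives once in the object store, and two large files differing in one region share every chunk outside it.

/* Hypothetical chunked-blob encoding: the blob stores only the
 * ordered list of chunk hashes; file content is the concatenation
 * of the chunks those hashes name in the object store. */
struct chunked_blob {
    unsigned long nchunks;
    unsigned char chunk_sha1[][20];   /* flexible array member (C99) */
};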

Re: Merge with git-pasky II.

2005-04-15 Thread C. Scott Ananian
On Fri, 15 Apr 2005, David Woodhouse wrote: [...] given piece of content. Also because we actually have the developer's attention at commit time, and we can get _real_ answers from the user about what she was doing, instead of having to guess. Yes, but it's still hard to get *accurate* information. [...]

Re: Merge with git-pasky II.

2005-04-15 Thread C. Scott Ananian
On Fri, 15 Apr 2005, Paul Jackson wrote: Um ah ... could you explain what you mean by inter and intra file diffs? Intra-file diffs: here are two versions of the same file; what changed? Inter-file diffs: here is a new file, and here are *all the files in the current committed version*. Where did [...]

Re: another perspective on renames.

2005-04-15 Thread C. Scott Ananian
On Thu, 14 Apr 2005, Paul Jackson wrote: To me, rename is a special case of the more general case of a big chunk of code (a portion of a file) being either moved or copied from one place to another. I wonder if there might be some way to use the tools that biologists use to analyze DNA [...]

another perspective on renames.

2005-04-14 Thread C. Scott Ananian
Perhaps our thinking is being clouded by 'how other SCMs do things' --- do we *really* need extra rename metadata? As Linus pointed out, as long as a commit is done immediately after a rename (i.e. before the renamed file is changed), the tree object contains all the information one needs: you can [...]