On Wed, 20 Apr 2005, Linus Torvalds wrote:
What are the disk usage results? I'm on ext3, for example, which means that
even small files invariably take up 4.125kB on disk (with the inode).
Even uncompressed, most source files tend to be small. Compressed, I'm
seeing a median blob size of ~1.6kB
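To make the arithmetic concrete: a file of 1 to 4096 bytes costs one 4kB data
block plus its inode, and 4096 + 128 = 4224 bytes = 4.125kB. A minimal sketch,
assuming ext3's common 4kB block and 128-byte inode of the era (both are
mkfs-time parameters, so treat the constants as assumptions):

#include <stdio.h>

#define BLOCK_SIZE 4096UL       /* assumed ext3 block size */
#define INODE_SIZE 128UL        /* assumed ext3 inode size of the era */

int main(void)
{
        unsigned long blob = 1600;      /* ~median compressed blob */
        unsigned long blocks = (blob + BLOCK_SIZE - 1) / BLOCK_SIZE;
        unsigned long footprint = blocks * BLOCK_SIZE + INODE_SIZE;

        /* prints 4224, i.e. the 4.125kB figure */
        printf("%lu bytes on disk for a %lu-byte blob\n", footprint, blob);
        return 0;
}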
On Wed, 20 Apr 2005, Petr Baudis wrote:
I think one thing git's object database is not very well suited for is
network transports. You want to have something smart doing the
transports, comparing trees so that it can do some delta compression;
that could probably reduce the amount of data needed
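A sketch of the tree-comparison idea, assuming each side can produce a sorted
list of the hex object names it holds (the array representation and the
objects_to_send name are illustrative, not git's actual transport code):

#include <stdio.h>
#include <string.h>

/*
 * Walk two sorted lists of object names and print only the ones
 * the receiver lacks; everything shared is skipped entirely.
 */
static void objects_to_send(const char **ours, int n_ours,
                            const char **theirs, int n_theirs)
{
        int i = 0, j = 0;

        while (i < n_ours) {
                int cmp = (j < n_theirs) ? strcmp(ours[i], theirs[j]) : -1;

                if (cmp < 0)
                        printf("send %s\n", ours[i++]); /* they lack it */
                else if (cmp > 0)
                        j++;    /* they have something we don't; ignore */
                else {
                        i++;    /* shared object: nothing to transfer */
                        j++;
                }
        }
}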
* chunk), and these chunks are put in the object store. This way
* similar files will be expected to share chunks, saving space.
* Files less than one disk block long are expected to fit in a single
* chunk, so there is no extra indirection overhead for this case.
*
* Copyright
On Wed, 20 Apr 2005, David Greaves wrote:
In doing this I noticed a couple of points:
* update-cache won't accept ./file or fred/./file
The comment in update-cache.c reads:
/*
* We fundamentally don't like some paths: we don't want
* dot or dot-dot anywhere, and in fact, we don't even want
* any
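For context, a sketch in the spirit of that check, simplified from the
verify_path() logic the comment introduces (the real code also refuses other
dot-files; the path_ok name is a stand-in). It rejects any path containing a
"." or ".." component, which is exactly why ./file and fred/./file are refused:

#include <string.h>

static int path_ok(const char *path)
{
        const char *p = path;

        if (!*p)
                return 0;
        while (*p) {
                const char *slash = strchr(p, '/');
                size_t len = slash ? (size_t)(slash - p) : strlen(p);

                if (!len)
                        return 0;       /* "//", leading or trailing '/' */
                if (p[0] == '.' && (len == 1 || (len == 2 && p[1] == '.')))
                        return 0;       /* "." or ".." component */
                if (!slash)
                        break;
                p = slash + 1;
        }
        return 1;
}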
On Wed, 20 Apr 2005, Linus Torvalds wrote:
I was considering using a chunked representation for *all* files (not just
blobs), which would avoid the original 'trees must reference other trees
or they become too large' issue -- and maybe the performance issue you're
referring to, as well?
No. The mos
On Wed, 20 Apr 2005, Chris Mason wrote:
With the basic changes I described before, the 100-patch time only goes down
to 40s. Certainly not fast enough to justify the changes. In this case, the
bulk of the extra time comes from write-tree writing the index file, so I
split write-tree.c up into li
On Wed, 20 Apr 2005, Martin Uecker wrote:
You can (and my code demonstrates/will demonstrate) still use a whole-file
hash to use chunking. With content prefixes, this takes O(N ln M) time
(where N is the file size and M is the number of chunks) to compute all
hashes; if subtrees can share the same
On Wed, 20 Apr 2005, Linus Torvalds wrote:
- _keep_ the same compression format, but notice that we already have an
object by looking at the uncompressed one.
With a chunked file, you can also skip writing certain *subtrees* of the
file as soon as you notice they're already present on disk. I can
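The check being proposed might look roughly like this: derive the object's
path from its hex name and stat() it before bothering to compress and write.
The xx/xxxx... layout follows git's object store scheme, but the directory
name varied in early versions, so treat the path as an assumption:

#include <limits.h>
#include <stdio.h>
#include <sys/stat.h>

/*
 * Given a 40-character hex object name, build the object's path and
 * stat() it; if it exists we can skip the compress-and-write entirely.
 */
static int have_object(const char *hex)
{
        char path[PATH_MAX];
        struct stat st;

        snprintf(path, sizeof(path), ".git/objects/%.2s/%s", hex, hex + 2);
        return !stat(path, &st);
}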
On Wed, 20 Apr 2005, Martin Uecker wrote:
The other thing I don't like is the use of a sha1
for a complete file. Switching to some kind of hash
tree would allow chunks to be introduced later. This has
two advantages:
You can (and my code demonstrates/will demonstrate) still use a whole-file
hash to us
So I wrote up my ideas regarding blob chunking as code; see attached.
This is against git-0.4 (I know, ancient, but I had to start somewhere.)
The idea here is that blobs are chunked using a rolling checksum (so the
chunk boundaries are content-dependent and stay fixed even if you mutate
pieces o
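The heart of content-defined chunking is the boundary test: a checksum over a
sliding window, with a chunk cut wherever its low bits hit a fixed pattern.
Because the test depends only on the bytes inside the window, an edit early in
the file does not move the boundaries found later. A toy sketch, with the
window size, mask, and (deliberately weak) additive checksum standing in for
the real rolling hash in the patch:

#include <stdio.h>

#define WINDOW  48              /* illustrative window size */
#define MASK    0x1fff          /* low 13 bits: ~8kB average chunks */

static void chunk_boundaries(const unsigned char *buf, size_t len)
{
        unsigned sum = 0;
        size_t i, start = 0;

        for (i = 0; i < len; i++) {
                sum += buf[i];
                if (i >= WINDOW)
                        sum -= buf[i - WINDOW]; /* roll the window forward */
                if (i >= WINDOW && (sum & MASK) == MASK) {
                        printf("chunk: %zu..%zu\n", start, i);
                        start = i + 1;
                }
        }
        if (start < len)
                printf("chunk: %zu..%zu\n", start, len - 1);
}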
On Tue, 19 Apr 2005, David Meybohm wrote:
But doesn't this require assuming the distribution of MD5 is uniform,
and don't the papers finding collisions in fewer operations show it's not? So, your
birthday-argument for calculating the probability wouldn't apply, because
it rests on the assumption MD5 is uniform
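The birthday argument in question, made concrete (k = 10^6 matches the
document-management anecdote quoted below; uniformity is assumed, which is
exactly the point in dispute):

#include <math.h>
#include <stdio.h>

/* build with: cc birthday.c -lm */
int main(void)
{
        double k = 1e6;         /* number of hashed documents */
        double b = 128.0;       /* hash width in bits (MD5) */
        /* birthday bound: P ~= 1 - exp(-k(k-1)/2 / 2^b) */
        double p = -expm1(-(k * (k - 1) / 2.0) / pow(2.0, b));

        printf("P(collision) ~= %g\n", p);      /* ~1.5e-27 */
        return 0;
}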
On Tue, 19 Apr 2005, Linus Torvalds wrote:
(*) Actually, I think it's the compression that ends up being the most
expensive part.
You're also using the equivalent of '-9' -- and *that's slow*.
Changing to Z_DEFAULT_COMPRESSION would probably help a lot
(but would break all existing repositories
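If the suggestion were taken, the change amounts to one constant in the
deflate setup; a sketch against zlib's real API (the deflate_buf wrapper and
its buffer handling are illustrative). The parenthetical about breaking
repositories follows from git at this point naming an object by the SHA1 of
its *compressed* stream, so recompressing at a new level renames every object:

#include <string.h>
#include <zlib.h>       /* link with -lz */

static int deflate_buf(const unsigned char *in, unsigned long in_len,
                       unsigned char *out, unsigned long *out_len)
{
        z_stream s;

        memset(&s, 0, sizeof(s));
        /* default level instead of Z_BEST_COMPRESSION, the '-9' equivalent */
        if (deflateInit(&s, Z_DEFAULT_COMPRESSION) != Z_OK)
                return -1;
        s.next_in = (unsigned char *)in;
        s.avail_in = in_len;
        s.next_out = out;
        s.avail_out = *out_len;
        if (deflate(&s, Z_FINISH) != Z_STREAM_END) {
                deflateEnd(&s);
                return -1;
        }
        *out_len = s.total_out;
        return deflateEnd(&s) == Z_OK ? 0 : -1;
}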
On Mon, 18 Apr 2005, Andy Isaacson wrote:
If you had actual evidence of a collision, I'd love to see it - even if
it's just the equivalent of
% md5 foo
d3b07384d113edec49eaa6238ad5ff00 foo
% md5 bar
d3b07384d113edec49eaa6238ad5ff00 bar
% cmp foo bar
foo bar differ: byte 25, line 1
%
But in the abse
On Sun, 17 Apr 2005, Horst von Brand wrote:
crypto-babble about collision whitepapers is uninteresting without a
repo that has real collisions. git is far too cool as is - prove I
Just copy over a file (might be the first step in splitting it, or a
header file that is duplicated for convenience, .
On Sat, 16 Apr 2005, Petr Baudis wrote:
I know the current state of the art here. It's going to take more than
just hearsay to convince me that full 128-bit MD5 collisions are likely.
http://cryptography.hyperlink.cz/MD5_collisions.html
OK, OK, I spoke too sloppily. Let me rephrase:
It's going
On Sat, 16 Apr 2005, Martin Uecker wrote:
The right thing (TM) is to switch from SHA1 of compressed
content for the complete monolithic file to a Merkle hash tree
of the uncompressed content. This would make the hash
independent of the actual storage method (chunked or not).
It would certainly be n
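A minimal sketch of such a Merkle tree: hash each uncompressed chunk, then
repeatedly hash concatenated pairs until a single root remains, so the root
depends only on content and never on how the chunks happen to be stored. The
pairing scheme and odd-leaf handling here are illustrative choices, not a
fixed proposal (assumes n >= 1):

#include <string.h>
#include <openssl/sha.h>        /* link with -lcrypto */

/*
 * Reduce an array of chunk hashes to a single root: hash adjacent
 * pairs, repeat until one digest remains. An odd hash at the end of
 * a level is simply re-hashed alone.
 */
static void merkle_root(unsigned char hashes[][SHA_DIGEST_LENGTH], int n,
                        unsigned char root[SHA_DIGEST_LENGTH])
{
        while (n > 1) {
                int i, out = 0;

                for (i = 0; i < n; i += 2) {
                        unsigned char buf[2 * SHA_DIGEST_LENGTH];
                        int len = SHA_DIGEST_LENGTH;

                        memcpy(buf, hashes[i], SHA_DIGEST_LENGTH);
                        if (i + 1 < n) {
                                memcpy(buf + len, hashes[i + 1], len);
                                len *= 2;
                        }
                        SHA1(buf, len, hashes[out++]);
                }
                n = out;
        }
        memcpy(root, hashes[0], SHA_DIGEST_LENGTH);
}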
On Sat, 16 Apr 2005, Brian O'Mahoney wrote:
(1) I _have_ seen real-life collisions with MD5, in the context of
document management systems containing ~10^6 MS-Word documents.
Dude! You could have been *famous*! Why the
aitch-ee-double-hockey-sticks didn't you publish this when you found it?
S
On Fri, 15 Apr 2005, Junio C Hamano wrote:
to yours is no problem for me. Currently I see your HEAD is at
461aef08823a18a6c69d472499ef5257f8c7f6c8, so I will generate a
set of patches against it.
Have you considered using an S/Key-like system to make these hashes more
human-readable? Using the S
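The S/Key scheme (RFC 2289) encodes bits as words from a fixed 2048-entry
dictionary, 11 bits per word. A sketch of applying that to an object name;
the four-entry word list is a placeholder so the sketch stays short, and the
sample bytes are just the leading bytes of the HEAD hash quoted above:

#include <stdio.h>

/* placeholder dictionary; RFC 2289 specifies 2048 words (11 bits each) */
static const char *words[] = { "ACRE", "BOLT", "CLAW", "DUSK" };

/*
 * Peel 11-bit groups off the front of the hash and map each to a
 * dictionary word. With the full 2048-word list the "% 4" goes away.
 */
static void hash_to_words(const unsigned char *sha1, int nwords)
{
        unsigned bits = 0, nbits = 0, byte = 0, idx;

        while (nwords--) {
                while (nbits < 11) {            /* refill from next byte */
                        bits = (bits << 8) | sha1[byte++];
                        nbits += 8;
                }
                nbits -= 11;
                idx = (bits >> nbits) & 0x7ff;  /* 11-bit word index */
                printf("%s ", words[idx % 4]);  /* % 4 fits the toy list */
        }
        printf("\n");
}

int main(void)
{
        unsigned char sha1[20] = { 0x46, 0x1a, 0xef, 0x08, 0x82, 0x3a };

        hash_to_words(sha1, 4); /* prints four words derived from the hash */
        return 0;
}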
On Fri, 15 Apr 2005, Linus Torvalds wrote:
The problem with chunking is:
- it complicates a lot of the routines. Things like "is this file
unchanged" suddenly become "is this file still the same set of chunks",
which is just a _lot_ more code and a lot more likely to have bugs.
The blob still h
I've been reading the archives (a bad idea, I know). Here's a concrete
suggestion for GIT space-compression which is (I believe) consistent with
the philosophy of GIT.
Why are blobs per-file? [After all, Linus insists that files are an
illusion.] Why not just have 'chunks', and assemble *the
On Fri, 15 Apr 2005, David Woodhouse wrote:
given piece of content. Also because we actually have the developer's
attention at commit time, and we can get _real_ answers from the user
about what she was doing, instead of having to guess.
Yes, but it's still hard to get *accurate* information. And
On Fri, 15 Apr 2005, Paul Jackson wrote:
Um ah ... could you explain what you mean by inter and intra file diffs?
intra-file diffs: here are two versions of the same file. What changed?
inter-file diffs: here is a new file, and here are *all the files in the
current committed version*. Where di
On Thu, 14 Apr 2005, Paul Jackson wrote:
To me, rename is a special case of the more general case of a
big chunk of code (a portion of a file) that was in one place
either being moved or copied to another place.
I wonder if there might be some way to use the tools that biologists use
to analyze DNA
Perhaps our thinking is being clouded by 'how other SCMs do things' ---
do we *really* need extra rename metadata? As Linus pointed out, as long
as a commit is done immediately after a rename (i.e. before the renamed file
is changed) the tree object contains all the information one needs: you
can