> Representation cache is based on the sha of the rep. So it does not
> matter what the filename is or where it is stored. If it has the same
> sha as an existing rep, then it will be shared.
>
> The small improvement in 1.8 was simply to do this for files being added
> within the same revision, but the other scenario was already supported.
>
> I think it is worth pointing out that a rep is not necessarily a "file".
> It is the specific delta that SVN would be storing in the repository DB.
>
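To make the quoted point concrete, here is a toy model of a cache keyed on the SHA-1 of the rep. It is only a sketch of the idea, not Subversion's actual implementation: `rep_cache`, `add_rep` and the location strings are all hypothetical names.

```python
import hashlib

# Hypothetical in-memory stand-in for the rep cache:
# SHA-1 hex digest of the rep -> location of the rep already stored.
rep_cache = {}

def add_rep(content: bytes, location: str) -> str:
    """Return the location that holds this content.

    If a rep with the same SHA-1 is already known, its location is
    reused and nothing new is stored; the path or filename plays no part.
    """
    key = hashlib.sha1(content).hexdigest()
    return rep_cache.setdefault(key, location)
```

Calling `add_rep` twice with the same bytes but different locations returns the first location both times, which is the sharing described above.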
One improvement that I'd like to suggest is that files over 1 MiB (4? 8?)
be "chunked" prior to calculating rep-sharing.

http://blog.clearpathsg.com/blog/bid/254076/Understanding-Variable-Length-Deduplication

My thinking is that there might be storage gains to be made if
rep-sharing is done at a lower level than the file level for files over
a particular size. For instance, if you commit a few hundred mid-size
files (5-15 MB or larger), there is probably a lot of identical data
between them (if the files are not already compressed). Those identical
chunks could possibly be found via a variable-length deduplication
algorithm and deduped across the repository.

IIRC when I moved our repos from 1.6 to 1.8 format, space usage went
down by 10-15% from rep-sharing. I wouldn't mind having another 5-10%
space savings.
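A rough sketch of what I mean by variable-length chunking, in Python. This is not anything Subversion does today; the gear-style rolling hash, the size limits and the names (`chunks`, `store`, `pool`) are all assumptions picked for illustration.

```python
import hashlib
import random

# Hypothetical tuning values, chosen only for illustration.
MIN_CHUNK = 2 * 1024      # never cut a chunk shorter than this
MAX_CHUNK = 64 * 1024     # always cut once a chunk reaches this size
MASK = (1 << 13) - 1      # a boundary roughly every 8 KiB on average

# Fixed table of pseudo-random 32-bit values, one per byte value.
_rng = random.Random(0)
GEAR = [_rng.getrandbits(32) for _ in range(256)]

def chunks(data: bytes):
    """Yield variable-length chunks whose boundaries depend on content."""
    start = 0
    h = 0
    for i, byte in enumerate(data):
        # Rolling hash: the low 13 bits depend only on the last 13 bytes,
        # so a boundary is decided by local content, not by file offset.
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            yield data[start:i + 1]
            start = i + 1
            h = 0
    if start < len(data):
        yield data[start:]

def store(data: bytes, pool: dict) -> list:
    """Store each chunk once, keyed by SHA-1; return the chunk keys in order."""
    keys = []
    for chunk in chunks(data):
        key = hashlib.sha1(chunk).hexdigest()
        pool.setdefault(key, chunk)   # identical chunks are kept only once
        keys.append(key)
    return keys
```

Because the cut points follow the content, two large files that differ only by a few inserted bytes should end up sharing most of their chunks in `pool`, whereas fixed-size blocks would all shift and share nothing.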