Hello Julian Andres,

On 2012-03-04 at 01:07:42 you wrote:
> On Sun, Mar 04, 2012 at 12:31:16AM +0100, Timo Weingärtner wrote:
> > The initial comparison was with hardlink, which got OOM killed with a
> > hundred backups of my home directory. Last night I compared it to duff
> > and rdfind, which would have happily linked files with different
> > st_mtime and st_mode.
>
> You might want to try hardlink 0.2~rc1. In any case, I don't think we
> need yet another such tool in the archive. If you want that algorithm,
> we can implement it in hardlink 0.2 using probably about 10 lines. I had
> that locally and it works, so if you want it, we can add it and avoid
> the need for one more hack in that space.

And why is lighttpd in the archive? Apache can do the same ...

> hardlink 0.2 is written in C, and uses a binary tree to map
> (dev_t, off_t) to a struct file which contains the stat information
> plus name for linking. It requires two allocations per file, one for
> the struct file with the filename, and one for the node in the tree
> (well, actually we only need the node for the first file with a
> specific (dev_t, off_t) tuple). A node has 3 pointers.

The "hardlink" I used at that time was written in Python and definitely
didn't do it the way I want.

hadori is written in C++11, which IMHO makes it look a little more
readable. It started with the tree-based map and multimap; now it uses
the unordered_ (hash-based) versions, which made it twice as fast in a
typical workload.

The main logic is in hadori.C, handle_file, and uses:

    std::unordered_map<ino_t, inode const> kept;
    std::unordered_map<ino_t, ino_t> to_link;
    std::unordered_multimap<off_t, ino_t> sizes;

class inode contains a struct stat, a file name and an Adler checksum,
but I plan to drop the last one because I think the hashing option is no
great gain.
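In case it is useful, here is a condensed sketch of what handle_file does
with those three maps. This is not the actual hadori source: content_equal()
and do_link() are stand-ins for the real comparison and linking code, the
inode class is reduced to the two members the sketch needs, and it assumes
a single filesystem, since the maps are keyed by ino_t alone.

    #include <algorithm>
    #include <fstream>
    #include <string>
    #include <unordered_map>
    #include <sys/stat.h>
    #include <unistd.h>

    class inode {
    public:
        struct stat st;
        std::string name;
    };

    std::unordered_map<ino_t, inode const> kept;  // first file seen per inode
    std::unordered_map<ino_t, ino_t> to_link;     // duplicate inode -> kept inode
    std::unordered_multimap<off_t, ino_t> sizes;  // size -> kept inodes of that size

    // Stand-in for the real byte-wise comparison.
    bool content_equal(inode const & a, inode const & b)
    {
        std::ifstream fa(a.name, std::ios::binary), fb(b.name, std::ios::binary);
        char ba[65536], bb[65536];
        do {
            fa.read(ba, sizeof ba);
            fb.read(bb, sizeof bb);
            if (fa.gcount() != fb.gcount() ||
                !std::equal(ba, ba + fa.gcount(), bb))
                return false;
        } while (fa && fb);
        return fa.eof() && fb.eof();
    }

    // Stand-in for the real linking code; a robust version would link
    // to a temporary name first and rename over the target.
    void do_link(inode const & keep, std::string const & path)
    {
        ::unlink(path.c_str());
        ::link(keep.name.c_str(), path.c_str());
    }

    void handle_file(std::string const & path, struct stat const & st)
    {
        if (kept.count(st.st_ino))
            return;                          // another name of a kept inode

        auto l = to_link.find(st.st_ino);
        if (l != to_link.end()) {            // inode already judged a duplicate
            do_link(kept.at(l->second), path);
            return;
        }

        inode cur = { st, path };

        // Candidates: kept inodes of the same size. As said above, equal
        // st_mode and st_mtime are required before linking (the st_uid and
        // st_gid checks are an additional assumption of this sketch).
        auto r = sizes.equal_range(st.st_size);
        for (auto it = r.first; it != r.second; ++it) {
            inode const & cand = kept.at(it->second);
            if (cand.st.st_mode == st.st_mode &&
                cand.st.st_uid == st.st_uid &&
                cand.st.st_gid == st.st_gid &&
                cand.st.st_mtime == st.st_mtime &&
                content_equal(cand, cur)) {
                to_link.insert(std::make_pair(st.st_ino, it->second));
                do_link(cand, path);
                return;
            }
        }

        // No duplicate found: this inode becomes the kept copy for its size.
        kept.insert(std::make_pair(st.st_ino, cur));
        sizes.insert(std::make_pair(st.st_size, st.st_ino));
    }

The nice property of this layout is that file contents are only read when
another kept inode with the same size and metadata exists; everything else
is decided from the stat information alone.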
Regards
Timo