Re: a program to delete duplicate files

2005-03-29 Thread alanwo
Why not try NoClone? It finds and deletes duplicate files by true byte-by-byte comparison, and a smart marker filters the duplicate files to delete. With GUI. http://noclone.net Xah Lee wrote: > here's a large exercise that uses what we built before. > > suppose you have tens of thousands of files ...
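
A minimal sketch of the byte-by-byte comparison such tools rely on (hypothetical code, not NoClone's actual implementation): read both files in fixed-size blocks and stop at the first difference.

    import os

    def same_contents(path_a, path_b, blocksize=65536):
        # Files of different sizes can never be duplicates; this check
        # costs one stat call per file and avoids reading any data.
        if os.path.getsize(path_a) != os.path.getsize(path_b):
            return False
        with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
            while True:
                block_a = fa.read(blocksize)
                block_b = fb.read(blocksize)
                if block_a != block_b:
                    return False   # first difference found: stop early
                if not block_a:    # both files exhausted: identical
                    return True

The standard library's filecmp.cmp(a, b, shallow=False) does essentially this loop.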

Re: a program to delete duplicate files

2005-03-14 Thread David Eppstein
In article <[EMAIL PROTECTED]>, [EMAIL PROTECTED] (John J. Lee) wrote: > > If you read them in parallel, it's _at most_ m (m is the worst case > > here), not 2(m-1). In my tests, it has always been significantly less than > > m. > > Hmm, Patrick's right, David, isn't he? Yes, I was only considering ...

Re: a program to delete duplicate files

2005-03-14 Thread Jeff Shannon
Patrick Useldinger wrote: David Eppstein wrote: When I've been talking about hashes, I've been assuming very strong cryptographic hashes, good enough that you can trust equal results to really be equal without having to verify by a comparison. I am not an expert in this field. All I know is that ...

Re: a program to delete duplicate files

2005-03-14 Thread John J. Lee
Patrick Useldinger <[EMAIL PROTECTED]> writes: > David Eppstein wrote: > > > The hard part is verifying that the files that look like duplicates > > really are duplicates. To do so, for a group of m files that appear > > to be the same, requires 2(m-1) reads through the whole files if you > > use ...

Re: a program to delete duplicate files

2005-03-14 Thread Bengt Richter
On Mon, 14 Mar 2005 10:43:23 -0800, David Eppstein <[EMAIL PROTECTED]> wrote: >In article <[EMAIL PROTECTED]>, > "John Machin" <[EMAIL PROTECTED]> wrote: > >> Just look at the efficiency of processing N files of the same size S, >> where they differ after d bytes: [If they don't differ, d = S] ...

Re: a program to delete duplicate files

2005-03-14 Thread Patrick Useldinger
David Eppstein wrote: The hard part is verifying that the files that look like duplicates really are duplicates. To do so, for a group of m files that appear to be the same, requires 2(m-1) reads through the whole files if you use a comparison-based method, or m reads if you use a strong hashing ...
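
What "reading in parallel" can look like, as a sketch (hypothetical code, not Patrick's actual program): read all candidate files block by block at the same time and split the group whenever blocks disagree. Each file is read at most once, i.e. m reads rather than 2(m-1).

    import collections

    def split_identical(paths, blocksize=65536):
        # Precondition: all paths have the same file size. Returns the
        # sublists of paths whose contents are byte-for-byte identical.
        pending = [[(p, open(p, 'rb')) for p in paths]]
        identical = []
        while pending:
            group = pending.pop()
            buckets = collections.defaultdict(list)
            for path, f in group:
                buckets[f.read(blocksize)].append((path, f))
            for block, members in buckets.items():
                if len(members) < 2:
                    members[0][1].close()     # unique so far: not a dup
                elif block == b'':
                    for _, f in members:      # EOF reached together:
                        f.close()             # genuine duplicates
                    identical.append([p for p, _ in members])
                else:
                    pending.append(members)   # still equal: keep reading
        return identical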

Re: a program to delete duplicate files

2005-03-14 Thread Patrick Useldinger
David Eppstein wrote: When I've been talking about hashes, I've been assuming very strong cryptographic hashes, good enough that you can trust equal results to really be equal without having to verify by a comparison. I am not an expert in this field. All I know is that MD5 and SHA1 can create collisions ...

Re: a program to delete duplicate files

2005-03-14 Thread David Eppstein
In article <[EMAIL PROTECTED]>, Patrick Useldinger <[EMAIL PROTECTED]> wrote: > Shouldn't you add the additional comparison time that has to be done > after hash calculation? Hashes do not give a 100% guarantee. When I've been talking about hashes, I've been assuming very strong cryptographic hashes ...
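
For contrast, a sketch of the hashing side of the argument (hypothetical code using the standard hashlib module): read every file once in full and trust equal digests, with no verification pass.

    import collections
    import hashlib

    def groups_by_digest(paths, blocksize=65536):
        # Every file is read completely, no matter how early it differs
        # from the others -- exactly the cost being weighed here.
        digests = collections.defaultdict(list)
        for path in paths:
            h = hashlib.sha1()
            with open(path, 'rb') as f:
                for block in iter(lambda: f.read(blocksize), b''):
                    h.update(block)
            digests[h.digest()].append(path)
        return [g for g in digests.values() if len(g) > 1]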

Re: a program to delete duplicate files

2005-03-14 Thread David Eppstein
In article <[EMAIL PROTECTED]>, "John Machin" <[EMAIL PROTECTED]> wrote: > Just look at the efficiency of processing N files of the same size S, > where they differ after d bytes: [If they don't differ, d = S] I think this misses the point. It's easy to find the files that are different. Just

Re: a program to delete duplicate files

2005-03-13 Thread Patrick Useldinger
John Machin wrote: Test: for k in range(1000): open('foo' + str(k), 'w') I ran that and watched it open 2 million files and keep going strong ... until I figured out that the files are closed by Python immediately because there's no reference to them ;-) Here's my code: #!/usr/bin/env python import os ...
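
The behaviour is easy to demonstrate the other way round (hypothetical sketch, using 'foo' files as in the test above): keep a reference to every file object so nothing is garbage-collected, and the per-process descriptor limit is hit almost immediately.

    import os

    handles = []
    try:
        for k in range(1000000):
            handles.append(open('foo' + str(k), 'w'))
    except (IOError, OSError) as e:
        print('out of file descriptors after', len(handles), 'files:', e)
    finally:
        for f in handles:
            f.close()
        for k in range(len(handles)):   # clean up the test files
            os.remove('foo' + str(k))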

Re: a program to delete duplicate files

2005-03-12 Thread Patrick Useldinger
John Machin wrote: Oh yeah, "the computer said so, it must be correct". Even with your algorithm, I would be investigating cases where files were duplicates but there was nothing in the names or paths that suggested how that might have come about. Of course, but it's good to know that the computer ...

Re: a program to delete duplicate files

2005-03-12 Thread John Machin
Patrick Useldinger wrote: > John Machin wrote: > > > Maybe I was wrong: lawyers are noted for irritating precision. You > > meant to say in your own defence: "If there are *any* number (n >= 2) > > of identical hashes, you'd still need to *RE*-read and *compare* ...". > > Right, that is what I meant ...

Re: a program to delete duplicate files

2005-03-12 Thread Patrick Useldinger
John Machin wrote: Maybe I was wrong: lawyers are noted for irritating precision. You meant to say in your own defence: "If there are *any* number (n >= 2) of identical hashes, you'd still need to *RE*-read and *compare* ...". Right, that is what I meant. 2. As others have explained, with a decent ...

Re: a program to delete duplicate files

2005-03-12 Thread John Machin
Patrick Useldinger wrote: > John Machin wrote: > > > Just look at the efficiency of processing N files of the same size S, > > where they differ after d bytes: [If they don't differ, d = S] > > > > PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d > > which is important for small N and large d] ...

Re: a program to delete duplicate files

2005-03-12 Thread Patrick Useldinger
François Pinard wrote: Identical hashes for different files? The probability of this happening should be extremely small, or else your hash function is not a good one. We're talking about MD5, SHA1 or similar. They are all known not to be 100% perfect. I agree it's a rare case, but still, why ...

Re: a program to delete duplicate files

2005-03-12 Thread François Pinard
[Patrick Useldinger] > Shouldn't you add the additional comparison time that has to be done > after hash calculation? Hashes do not give a 100% guarantee. If there's > a large number of identical hashes, you'd still need to read all of > these files to make sure. Identical hashes for different files? ...
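
"Extremely small" can be made concrete with a back-of-the-envelope birthday bound (assuming an ideal 128-bit hash such as MD5 and purely accidental collisions, not deliberately constructed ones):

    # Rough estimate of one accidental digest collision among n files.
    n = 10**6                        # a million distinct files
    p = n * (n - 1) / 2.0 / 2**128   # number of pairs / possible digests
    print(p)                         # ~1.5e-27: vanishingly unlikely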

Re: a program to delete duplicate files

2005-03-12 Thread Patrick Useldinger
Scott David Daniels wrote: ... comparisons. Using hashes: three file reads and three comparisons of hash values. Without hashes: six file reads; you must read both files to do a file comparison, so three comparisons means six file reads. That's provided you always compare 2 files at a time. I compare ...
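
Counting the reads for m = 3 identical files A, B, C reconciles the numbers in this subthread:

    all pairs, two at a time:  (A,B), (A,C), (B,C)  ->  6 whole-file reads
    chained pairs:             (A,B), (B,C)         ->  2(m-1) = 4 reads
    hashing:                   hash A, B, C         ->  m = 3 reads
    parallel comparison:       read A, B, C at once ->  m = 3 reads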

Re: a program to delete duplicate files

2005-03-12 Thread Scott David Daniels
Patrick Useldinger wrote: Just to explain why I appear to be a lawyer: everybody I spoke to about this program told me to use hashes, but nobody has been able to explain why. I came up with 2 possible reasons myself: 1) it's easier to program: you don't compare several files in parallel, but process one ...

Re: a program to delete duplicate files

2005-03-12 Thread Patrick Useldinger
John Machin wrote: Just look at the efficiency of processing N files of the same size S, where they differ after d bytes: [If they don't differ, d = S] PU: O(Nd) reading time, O(Nd) data comparison time [Actually (N-1)d which is important for small N and large d]. Hashing method: O(NS) reading time ...
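
A worked example of what those bounds mean (illustrative numbers): take N = 10 same-size files of S = 1 GB. If they differ after d = 4 KB, the comparison method reads about N*d = 40 KB in total before every file is separated, while the hashing method still reads N*S = 10 GB, since a digest needs every byte. If the files really are duplicates (d = S), both methods read the full 10 GB, and hashing additionally pays for computing the digests.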

Re: a program to delete duplicate files

2005-03-12 Thread John Machin
David Eppstein wrote: > In article <[EMAIL PROTECTED]>, > Patrick Useldinger <[EMAIL PROTECTED]> wrote: > > > > Well, but the spec didn't say efficiency was the primary criterion, it > > said minimizing the number of comparisons was. > > > > That's exactly what my program does. > > If you're doing ...