On May 27, 2009, at 7:52 PM, Korny Sietsma wrote:
> How would I do this in a functional way? My first effort would be
> something like
> (defn hash [filename] (memoize (... hash function ...)))
> but I have a couple of problems with this:
> - it doesn't seem to store the hash value with the rest of the file
> information, which feels a bit ugly
> - I assume it means storing the full filename three times, once in
> the original file info structure, once in the memoized hash function,
> and once in the memoized quickhash function. My program struggles to
> get enough RAM to track as many files as I'd like already - storing
> the filename multiple times would blow out memory quite badly.
>
> I guess I could define a unique key for each filename, and define hash
> as a function on that key, but then hash would need to be able to
> access the list of filenames somehow. It's starting to get beyond me
> - I'm hoping there's a simpler option!
OK, I'm probably talking out of turn here, but I thought I would share
some of my thoughts.
First of all, though you probably already thought of this, you can
improve time complexity from O(N^2) to O(N log N) if you sort the
files looking for duplicates rather than comparing every file to every
other file. In other words, make a list of files and their hashes, and
sort by the hashes, then walk the list comparing each hash to the
previous hash.
I can think of two ways to do this. The first one I thought of might
be more functional, which would be to successively narrow down a list
of potential duplicates comparison by comparison. It would have a
structure something like this: generate a list of files by size; group
by files of the same size; remove all sublists of length 1 (unique
sizes). Pass that list to a function which generates a list of files
by the quick hash; group by them by it; remove all sublists of length
1. Pass that to a function which does the same thing only with the
more expensive hash. I suspect you could make each of the three steps
a function and then make a fourth function which combines those three
and takes a function as a parameter to produce the hash. Then you
could use the comp function to combine those three functions and
presto, you have your list of duplicates.
Another possible way that more closely resembles what you're doing
right now would be to have a function which looks at a file and lazily
produces the three comparison values, then do the sort/group procedure
with a comparison function that is smart enough to only compare as
much as it has to. I think you could build this with what Mikio was
proposing, because map is lazy.
As another example of this:
(defn get-comparators [str]
(cons (nth str 0)
(lazy-seq (do
(println "Looking at second element")
(Thread/sleep 1000)
(list (nth str 1))))))
This function returns a list of two values which are just the first
two characters of a string, but it prints a message and waits a second
if the second value in the list is actually forced. If you never look
at the second value, you never pay the penalty for computing it. So
supposing a compare a list of strings which differ on the first
character, it should return immediately without printing anything, and
it does:
user> (sort-by get-comparators #(some (complement zero?) (map compare
%1 %2)) ["ab" "cd" "ef" "gh"])
("ab" "cd" "ef" "gh")
But if some of them have the same first character, it should have a
delay and then return the values:
user> (sort-by get-comparators #(some (complement zero?) (map compare
%1 %2)) ["ab" "cd" "ef" "eg"])
Looking at second element
Looking at second element
("ab" "cd" "ef" "eg")
So you can see that even though it had to use the second values to
compare 'ef' and 'eg', it only had to "compute" their second
characters, not the second characters of any of the other elements.
There is a group-by function in clojure.contrib.seq-utils which would
be of use. Unfortunately I don't see how to do filter with a hash-map
to produce another hash-map. The best I came up with was this:
(def groupings
(clojure.contrib.seq-utils/group-by
#(first (get-comparators %))
["abc" "abd" "hello" "gravy" "stuff" "stripes"]))
(apply hash-map (apply concat (filter (fn [[key value]] (> (count
value) 1)) groupings)))
I don't trust apply, and double apply seems really fishy. I'm probably
missing something obvious there. Also, it would probably be better to
either be able to furnish group-by with a comparison function a la
sort-by or else somehow make lists be instances of comparable to make
this work using get-comparators.
One more note, because you mentioned some concern over having
filenames in memory. I'm not sure but I don't think Java and therefore
Clojure can do this, but in general (at least on Unix-based systems)
you ought to be able to store the inode and look up everything else at
considerable savings. Your storage costs then ought to be something
like a 4-8 bytes per file. Since you don't care about the filename for
identifying duplicates, just displaying them, that could save you some
overhead during the comparison stage, but I'd only bother with it if I
knew I had to compare a *lot* of files and I only cared about the
result during one run, as inodes are frequently recycled.
None of this really answers your question about internally mutable
state. Of course internally mutable state is possible, either using
closed-over variables or the OOP facilities of Clojure. I have a
recent message which gives examples of both, which you can see here:
<http://groups.google.com/group/clojure/msg/de5246273c08571f
>.
Hope this helps,
—
Daniel Lyons
http://www.storytotell.org -- Tell It!
--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups
"Clojure" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
-~----------~----~----~----~------~----~------~--~---