The Best Way to Accomplish This...

Wardrop Tue, 02 Feb 2010 19:54:15 -0800

I feel like I'm over-staying my welcome by posting yet another topic,
so please only answer if you get some form of enjoyment out of solving
such problems as this one.


I've given this problem a fair bit of my time, and it's been good so
far as it's forced me to learn new things and challenge my rather
immature knowledge of Clojure. I want to turn to the answers section
now though, which is what I'm hoping to get on this forum. So here's
what I'm trying to do...

I need to parse a 32 MB text file of duplicate file entries. I need to
be able to get the total number of duplicates as well as the total
size the duplicates are taking up. As a bonus, it would be good to
provide an additional categorisation by file extension (.jpg =
450.02 MB, .bak = 5.65 GB, etc.). Here's how the file is formatted...

71 byte(null)each:
./atgiss1/profiles/rebeccat/DataWorks/DataWorks LIVE/dwlui.ini
./atgiss1/profiles/alistairh/DataWorks/DataWorks LIVE/dwlui.ini

14171 byte(null)each:
./atgiss1/profiles/rebeccat/My Documents/Corel User Files/WT9_1US.UWL
./atgiss1/profiles/guyc/My Documents/Corel User Files/WT9_1US.UWL
./atgiss1/profiles/carls/My Documents/Corel User Files/WT9_1US.UWL

102 byte(null)each:
./atgiss1/profiles/rebeccat/Application Data/AdobeUM/AcRdB7_0_7.sta
./atgiss1/profiles/rebeccat/Application Data/AdobeUM/AcRdB7_0_8.sta


Note that when computing sizes, the first file in each group should
not be counted, as we only want to know the total space being taken up
by the duplicates, not the original file. For the sample above, that
would be 4 duplicates taking up 71 + 2 x 14171 + 102 = 28515 bytes.

I've already implemented this in Scala, where I use global variables to
keep track of persistent data. I'm finding it hard to morph that
concept into Clojure, which makes me believe I'm going about it the
wrong way. Hopefully someone here can demonstrate one of the right
ways. To get you started, here's a bit of a proof of concept...

(use '[clojure.contrib.duck-streams])

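;; Classify each line as a size header, a file entry, a blank line or
;; anything else; `some` returns the first map whose value is non-nil.
(for [line (line-seq (reader "C:\\atgisfiledupes.txt"))]
  (some #(if ((first %) 1) %)
    [{:size (get (re-matches #"([0-9]+) byte\(null\)each:" line) 1)}
     {:file (get (re-matches #".*(\.[0-9a-zA-Z]+)" line) 1)}
     (if (= line "") {:blank true} {:other true})]))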

(println "Finished!")

If nothing else, it should give you the regexes you need to extract
data from the various lines.
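
One possible direction, offered as a rough sketch rather than a
definitive answer: instead of global variables, the running totals can
be threaded through a reduce as an ordinary map. The names step,
summarize and summarize-file are made up for illustration, the sketch
uses clojure.java.io and update from a later Clojure than the
clojure.contrib version above, and lower-casing the extension (so .UWL
and .uwl are counted together) is an assumption, not something the
input format requires.

```clojure
(require '[clojure.string :as str]
         '[clojure.java.io :as io])

;; The same two regexes as in the proof of concept above.
(def size-re #"([0-9]+) byte\(null\)each:")
(def ext-re  #".*(\.[0-9a-zA-Z]+)")

(defn step
  "Folds one line into the accumulator map (hypothetical helper)."
  [{:keys [size first?] :as acc} line]
  (if-let [[_ n] (re-matches size-re line)]
    ;; Header line: start a new group and remember its per-file size.
    (assoc acc :size (Long/parseLong n) :first? true)
    (cond
      (str/blank? line) acc                ; separator between groups
      (nil? size)       acc                ; file line before any header
      first?            (assoc acc :first? false) ; the original, not counted
      :else
      ;; Every further file in the group is a duplicate.
      (let [ext (str/lower-case
                  (or (second (re-matches ext-re line)) "(none)"))]
        (-> acc
            (update :count inc)
            (update :bytes + size)
            (update-in [:by-ext ext] (fnil + 0) size))))))

(defn summarize
  "Returns {:count n, :bytes n, :by-ext {ext bytes}} for a seq of lines."
  [lines]
  (-> (reduce step
              {:size nil :first? false :count 0 :bytes 0 :by-ext {}}
              lines)
      (select-keys [:count :bytes :by-ext])))

(defn summarize-file
  "Reads the file line by line; reduce is eager, so the reader can be
  closed as soon as summarize returns."
  [path]
  (with-open [r (io/reader path)]
    (summarize (line-seq r))))
```

Calling (summarize-file "C:\\atgisfiledupes.txt") would then return
the totals as a single map, with no mutable state involved.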

All replies are much appreciated.

Cheers
