Hi,
I've just implemented a simple map-reduce framework that mimics the
steps in the workflow of a Hadoop job. It basically amounted to
implementing a helper function to emit results and the shuffle/combine
step that happens between the map and reduce tasks.
Please bear in mind that I am still new to Hadoop, so the code below
is my interpretation of how a Hadoop job is structured:
(defn emit
  "Helper function to produce intermediate and final results."
  [k v]
  {:k k :v v})

(defn shuffle
  "Shuffle step where all the v's from the map phase get grouped by key
  for the reduce steps. Note that this shadows clojure.core/shuffle."
  [coll]
  (when (seq coll)
    ;; sort-by key sorts the map entries by their key; sort-by :k would
    ;; look :k up in each entry and always get nil, leaving the result
    ;; unsorted.
    (sort-by key
             (reduce (fn [acc i]
                       ;; get works for any key type, whereas calling the
                       ;; key as a function, ((:k i) acc), only works for
                       ;; keyword keys.
                       (assoc acc (:k i)
                              (cons i (get acc (:k i)))))
                     {}
                     (flatten coll)))))
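
To make the intermediate shape concrete, here is what shuffle returns
at the REPL; each entry pairs a key with the list of emitted maps,
which is why the reducer below destructures its argument as [[k vs]]:

user> (shuffle [[(emit :a 1) (emit :b 1)] [(emit :a 1)]])
([:a ({:k :a, :v 1} {:k :a, :v 1})] [:b ({:k :b, :v 1})])
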
(defn job
  "Equivalent to a Hadoop job."
  [map-fn reduce-fn coll]
  (when (seq coll)
    (map reduce-fn
         (shuffle (map map-fn coll)))))
And the obligatory wordcount example:
(defn tf-mapper
  [s]
  (when (seq s)
    (map (fn [i] (emit i 1)) s)))

(defn tf-reducer
  [[k vs]]
  (emit k (reduce (fn [acc {v :v}] (+ acc v)) 0 vs)))

(defn tf
  "Calculates the term frequency of each token in the collection."
  [coll]
  (when (seq coll)
    (job tf-mapper tf-reducer coll)))
user> (tf [[:a :b] [:a :b :c] [:c]])
({:k :a, :v 2} {:k :b, :v 2} {:k :c, :v 2})
The wordcount example does not tokenise the documents/strings; it
assumes that each document is already a seq of tokens, so the
collection is just a seq of seqs.
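If you wanted to feed it raw strings instead, a small adapter along
these lines should do; split-docs is just a hypothetical helper that
splits on whitespace, not part of the framework itself:

(require '[clojure.string :as str])

(defn split-docs
  "Adapts a seq of strings to the seq-of-seqs shape that tf expects."
  [docs]
  (map #(str/split % #"\s+") docs))

user> (tf (split-docs ["a b" "a b c" "c"]))
({:k "a", :v 2} {:k "b", :v 2} {:k "c", :v 2})
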
Am I right in thinking that replacing the calls to map with pmap would
make the framework run in parallel within a single box?
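For illustration, a hypothetical parallel variant (pjob is just my
name for it) would look like this:

(defn pjob
  "Like job, but runs the map and reduce phases with pmap;
  the shuffle step itself stays sequential."
  [map-fn reduce-fn coll]
  (when (seq coll)
    (pmap reduce-fn
          (shuffle (pmap map-fn coll)))))

My understanding is that pmap only pays off when map-fn and reduce-fn
are expensive enough to dominate the thread-coordination overhead, so
I'd like to confirm this is the right place to introduce it.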
Any feedback is always welcome.
Cheers,
U