Hi,
I'm learning Clojure and I wrote a word-frequencies function that relies
heavily on clojure.core/frequencies (plus a little filtering):

(ns topwords.core
  (:require [clojure.java.io :as io]
            [clojure.string :as str]))

(def stop-words
  #{"other" "still" "again" "where" "could" "there" "their" "these"
    "those" "after" "while" "almost" "before" "through" "every" "being"
    "never" "should" "might" "thing" "among" "which" "would" "though"
    "about"})

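;; Split a string into runs of alphabetic characters.
(defn get-words [line]
  (re-seq #"\p{Alpha}+" line))

;; True for words longer than four characters.
(defn min-length [word]
  (< 4 (count word)))

;; Returns the word unless it is a stop word (nil otherwise).
(defn ignore-words [word]
  (if-not (contains? stop-words word) word))

;; Count how often each qualifying word appears in the file.
(defn word-frequencies [filename]
  (with-open [rdr (io/reader filename)]
    (let [lines (line-seq rdr)
          words (comp get-words str/lower-case)
          preds (every-pred min-length ignore-words)]
      (frequencies (filter preds (words lines))))))
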
It works (you can see some output from it on my blog if you want -
http://robbuhler.blogspot.com/2014/02/word-frequencies-from-file.html)
Anyway, my questions are:
1) Why do I not need a doall on the line-seq? What is forcing the evaluation
here?
2) I'm assuming this is still reading the entire file into memory at once? If
so, how would I count the frequencies of a really large file without
consuming so much memory?
I've thought about using doseq and, for each line, updating an atom that
holds a map, but I'm not sure if I'm on the right track here.
I'm just thinking of something like this (in Python):
d = {}  # maps each key to the number of times it was seen
for i in range(100):
    key = i % 10
    if key in d:
        d[key] += 1
    else:
        d[key] = 1
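
To make the doseq idea concrete, here is a rough sketch of what I have in mind
in Clojure. The namespace topwords.sketch and the function
count-words-with-atom are made-up names for illustration only, and the sketch
leaves out the length and stop-word filtering to keep it short. I'm not
claiming this is the idiomatic way to do it.

```clojure
(ns topwords.sketch
  (:require [clojure.java.io :as io]
            [clojure.string :as str]))

;; Hypothetical helper (not part of the code above): counts lower-cased
;; alphabetic words, one line at a time, by mutating a map held in an atom.
;; `source` is anything io/reader accepts (a filename, a File, a Reader, ...).
(defn count-words-with-atom [source]
  (let [counts (atom {})]
    (with-open [rdr (io/reader source)]
      ;; doseq walks the lines eagerly, so all the reading happens
      ;; while the reader is still open.
      (doseq [line (line-seq rdr)
              word (re-seq #"\p{Alpha}+" (str/lower-case line))]
        ;; fnil supplies 0 the first time a word is seen.
        (swap! counts update-in [word] (fnil inc 0))))
    @counts))

;; Example with an in-memory reader instead of a file:
(count-words-with-atom (java.io.StringReader. "The cat\nthe hat"))
;; => {"the" 2, "cat" 1, "hat" 1}
```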
Can I somehow count all of the frequencies line by line and not use an atom
(or another ref type)?
I'm not looking for the highest-performance code, just something that would
be considered idiomatic Clojure.
Thanks,
Rob
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to [email protected]
Note that posts from new members are moderated - please be patient with your
first post.
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en