Hi,

I'm learning Clojure and I wrote a word-frequencies function that relies
heavily on clojure.core/frequencies (plus a little filtering):

(ns topwords.core
  (:require [clojure.java.io :as io]
            [clojure.string :as str]))

(def stop-words
  #{"other" "still" "again" "where" "could" "there"
    "their" "these" "those" "after" "while" "almost" "before" "through"
    "every" "being" "never" "should" "might" "thing" "among"
    "which" "would" "though" "about"})

(defn get-words [line]
  (re-seq #"\p{Alpha}+" line))

(defn min-length [word]
  (< 4 (count word)))

(defn ignore-words [word]
  (if-not (contains? stop-words word) word))

(defn word-frequencies [filename]
  (with-open [rdr (io/reader filename)]
    (let [lines (line-seq rdr)
          words (comp get-words str/lower-case)
          preds (every-pred min-length ignore-words)]
      (frequencies (filter preds (words lines))))))


It works (you can see some output from it on my blog if you want - 
http://robbuhler.blogspot.com/2014/02/word-frequencies-from-file.html)

Anyway, my questions are:


1) Why do I not need a doall on the line-seq? What is forcing the evaluation here?


2) I'm assuming this still reads the entire file into memory at once? If so,
   how would I count the frequencies of a really large file without consuming
   so much memory?

   I've thought about using doseq and, for each line, updating an atom that
   holds a map, but I'm not sure if I'm on the right track here.

   I'm just thinking of something like this (in Python):

      d = {}
      for i in range(100):
          key = i % 10
          if key in d:
              d[key] += 1
          else:
              d[key] = 1
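
   To make the doseq-plus-atom idea concrete, here is a rough sketch of what I
   have in mind in Clojure. The name word-frequencies-atom is made up for
   illustration, and I'm passing the stop-word set in as an argument so the
   snippet stands on its own. As far as I understand it, doseq only walks the
   lines one at a time, but I may be wrong about that:

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

;; Hypothetical helper: counts words line by line, mutating an atom.
;; stop-words is assumed to be a set of lower-case strings.
(defn word-frequencies-atom [filename stop-words]
  (let [counts (atom {})]
    (with-open [rdr (io/reader filename)]
      ;; doseq walks the lazy line-seq one line at a time
      (doseq [line (line-seq rdr)
              word (re-seq #"\p{Alpha}+" (str/lower-case line))
              :when (and (< 4 (count word))
                         (not (contains? stop-words word)))]
        ;; bump the count for this word, starting from 0 if unseen
        (swap! counts update-in [word] (fnil inc 0))))
    @counts))
```

   I'd call it as (word-frequencies-atom "some-file.txt" stop-words), where
   the file name is just a placeholder.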

   Can I somehow count all of the frequencies line by line and not use an atom
   (or another ref type)?

   I'm not looking for the ultimate performance code, just something that would
   be considered idiomatic Clojure.


 Thanks,

 Rob

-- 
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to [email protected]
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en