Hello all
I have a question about processing big XML files with lazy-xml. I'm trying to
analyze StackOverflow dumps with Clojure, and when analyzing a 1.6 GB XML file
with posts, I get a java.lang.StackOverflowError, although I give Java enough
memory (1 GB of heap).
My code looks like this:
(ns stackoverflow
  (:import java.io.File)
  (:use clojure.contrib.lazy-xml))

(def so-base "..../data-sets/stack-overflow/2009-12/122009 SO")
(def posts-file (File. (str so-base "/posts.xml")))

(defn count-post-entries [xml]
  (loop [counter 0
         lst xml]
    (if (empty? lst)           ; rest never returns nil, so test emptiness
      counter
      (let [elem (first lst)
            rst (rest lst)]
        (if (and (= (:type elem) :start-element) (= (:name elem) :row))
          (recur (inc counter) rst)
          (recur counter rst))))))
and run it with:
(stackoverflow/count-post-entries
  (clojure.contrib.lazy-xml/parse-seq stackoverflow/posts-file))
I don't collect any real data here, so I expect that Clojure will discard
already-processed elements as the loop walks the sequence.
The same stack overflow happens when I use reduce:
(reduce (fn [counter elem]
          (if (and (= (:type elem) :start-element) (= (:name elem) :row))
            (inc counter)
            counter))
        0
        (clojure.contrib.lazy-xml/parse-seq stackoverflow/posts-file))
So the question is open: how do I process big XML files in constant space,
assuming I don't accumulate much data during processing?
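For reference, the raw StAX pull loop I could fall back to does run in constant space, since only the current parser event is ever held in memory. A minimal Java sketch (the RowCounter class name and the inline sample XML are just for illustration; the real dump would be read with a FileReader):

```java
import java.io.Reader;
import java.io.StringReader;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class RowCounter {
    // Count <row> start elements with a StAX pull parser: the parser
    // advances one event at a time and keeps no event history, so
    // memory use is independent of file size.
    static long countRows(Reader in) {
        try {
            XMLStreamReader r =
                XMLInputFactory.newInstance().createXMLStreamReader(in);
            long count = 0;
            while (r.hasNext()) {
                if (r.next() == XMLStreamConstants.START_ELEMENT
                        && "row".equals(r.getLocalName())) {
                    count++;
                }
            }
            r.close();
            return count;
        } catch (XMLStreamException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Illustrative sample in the shape of the posts.xml dump.
        String sample = "<posts><row Id=\"1\"/><row Id=\"2\"/></posts>";
        System.out.println(countRows(new StringReader(sample))); // prints 2
    }
}
```

I'd still prefer to stay with an idiomatic lazy-seq approach in Clojure rather than drop to interop, hence the question.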
--
With best wishes, Alex Ott, MBA
http://alexott.blogspot.com/ http://xtalk.msk.su/~ott/
http://alexott-ru.blogspot.com/
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to [email protected]
Note that posts from new members are moderated - please be patient with your
first post.
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en