Thank you everyone for your advice. I found it useful and think that I am
part-way to a solution using clojure.data.xml/source-seq, as suggested by
danneu.
I'll post what I have done so far in the hope it might help someone else.
Comments on style welcome.
*Solution*:
Given the following XML,
<head>
  <title>This is some text</title>
  <body>
    <h1>This is a header</h1>
  </body>
</head>
data.xml/source-seq will return a lazy seq of data.xml.Event items:
#clojure.data.xml.Event{:type :start-element, :name :head, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :characters, :name nil, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :start-element, :name :title, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :characters, :name nil, :attrs nil, :str This is some text}
#clojure.data.xml.Event{:type :end-element, :name :title, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :start-element, :name :body, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :start-element, :name :h1, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :characters, :name nil, :attrs nil, :str This is a header}
#clojure.data.xml.Event{:type :end-element, :name :h1, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :end-element, :name :body, :attrs nil, :str nil}
#clojure.data.xml.Event{:type :end-element, :name :head, :attrs nil, :str nil}
This is perfect for finding elements with a particular name, but no help if
I want to find an element based on its location in the document. So I
maintain a stack: each :start-element pushes the element name, and each
:end-element pops it. Joining the stack with "/" gives the path of the
current element, which I compare against a search pattern.
(require 'clojure.string)

;; xml is assumed to be bound to the lazy seq returned by data.xml/source-seq.
;; doseq is used only for its side effect (printing) and returns nil.
(let [stack (atom [])
      search-pattern "vmware/collectionHost/Object/Property/Property"]
  (doseq [x (take 100 xml)] ; just test with the first 100 events in the seq
    (case (:type x)
      :start-element (swap! stack conj (name (:name x))) ; push on open tag
      :end-element   (swap! stack pop)                   ; pop on close tag
      nil)                                               ; ignore other events
    (when (= search-pattern (clojure.string/join "/" @stack))
      (println (clojure.string/join "/" @stack)))))
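The same stack idea can also be written without the atom, which may be closer
to the loop/recur approach danneu suggested. The following is only a minimal
sketch: step-path, text-at and sample-events are names made up for this
illustration, and the events are hand-written maps shaped like the Event
records above rather than real output from data.xml/source-seq.

```clojure
(require '[clojure.string :as string])

;; Hypothetical helper: given the path so far and one event,
;; return the path after that event has been seen.
(defn step-path
  [path event]
  (case (:type event)
    :start-element (conj path (name (:name event)))
    :end-element   (pop path)
    path))

;; Hypothetical helper: lazily returns the text of every :characters
;; event whose enclosing element path equals search-pattern.
(defn text-at
  [search-pattern events]
  (let [paths (rest (reductions step-path [] events))] ; path after each event
    (for [[event path] (map vector events paths)
          :when (and (= :characters (:type event))
                     (= search-pattern (string/join "/" path)))]
      (:str event))))

;; Hand-written stand-in for the events of the example XML (assumption:
;; plain maps with the same keys behave like the Event records here).
(def sample-events
  [{:type :start-element :name :head}
   {:type :start-element :name :title}
   {:type :characters :str "This is some text"}
   {:type :end-element :name :title}
   {:type :start-element :name :body}
   {:type :start-element :name :h1}
   {:type :characters :str "This is a header"}
   {:type :end-element :name :h1}
   {:type :end-element :name :body}
   {:type :end-element :name :head}])

(println (text-at "head/body/h1" sample-events))
```

Because reductions, map and for are all lazy, this should walk the events one
at a time, provided the caller does not hold on to the head of the sequence.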
This is a work in progress and does not take account of attributes on the
elements, but I would appreciate any comments.
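
One possible way to bring attributes in, again only as a sketch: match on the
path first, then check the :attrs of the :start-element event. This assumes
:attrs is a map from keyword to string when attributes are present, which I
have not verified against every data.xml version. The names step-path,
attrs-match?, elements-at and the sample data are invented for illustration.

```clojure
(require '[clojure.string :as string])

;; Hypothetical helper: path after seeing one more event.
(defn step-path
  [path event]
  (case (:type event)
    :start-element (conj path (name (:name event)))
    :end-element   (pop path)
    path))

;; Hypothetical helper: true when attrs contains every key/value in wanted.
(defn attrs-match?
  [wanted attrs]
  (every? (fn [[k v]] (= v (get attrs k))) wanted))

;; Hypothetical helper: lazily returns the :start-element events located at
;; search-pattern whose attributes include everything in wanted-attrs.
(defn elements-at
  [search-pattern wanted-attrs events]
  (let [paths (rest (reductions step-path [] events))] ; path after each event
    (for [[event path] (map vector events paths)
          :when (and (= :start-element (:type event))
                     (= search-pattern (string/join "/" path))
                     (attrs-match? wanted-attrs (:attrs event)))]
      event)))

;; Made-up events, not taken from the real file.
(def attr-events
  [{:type :start-element :name :Object :attrs {:class "host"}}
   {:type :start-element :name :Property :attrs {:name "cpu"}}
   {:type :characters :str "4"}
   {:type :end-element :name :Property}
   {:type :start-element :name :Property :attrs {:name "memory"}}
   {:type :characters :str "8192"}
   {:type :end-element :name :Property}
   {:type :end-element :name :Object}])

(println (elements-at "Object/Property" {:name "memory"} attr-events))
```

Passing an empty map for wanted-attrs falls back to matching on the path alone.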
Thanks
Pete
On Wednesday, December 18, 2013 7:23:21 AM UTC, danneu wrote:
>
> Good question. Every lib that came to mind when I saw
> clojure.data.xml/parse's tree of Elements {:tag _, :attrs _, :content _}
> only works on zippers, which apparently sit in memory.
>
> One option is to use `clojure.data.xml/source-seq` to get back a lazy
> sequence of Events {:type _, :name _, :attrs _, :str _} where the event
> :type is one of :start-element, :end-element, or :characters.
>
> For example, "<strong>Hello</strong>" would parse into the events
> [:start-element "strong"], [:characters "Hello"], [:end-element "strong"].
> You could use loop/recur to manage state as you consume the sequence.
>
> That's actually how I'm used to working with SAX parsers anyways. Here are
> some naive Ruby examples if it's new to you:
> https://gist.github.com/danneu/3977120.
>
> Of course, I imagine the ideal solution would involve some way to express
> selectors on the Element tree like I'm used to doing with raynes/laser on
> zippers:
> https://github.com/Raynes/laser/blob/master/docs/guide.md#screen-scraping.
>
>
> On Tuesday, December 17, 2013 4:57:32 AM UTC-6, Peter Ullah wrote:
>>
>>
>> Hi all,
>>
>> I'm attempting to parse a large (500MB) XML file; specifically, I am
>> trying to extract various parts using XPath. I've been using the examples
>> presented here:
>> http://clojure-doc.org/articles/tutorials/parsing_xml_with_zippers.html
>> and all was going well when tested against small files. However, now that
>> I am using the larger file, Fireplace/Vim just hangs, my laptop gets hot,
>> and then I get a memory exception.
>>
>> I've been playing around with various other libraries such as
>> clojure.data.xml and found that the following works perfectly well for
>> parsing... but when I come to search inside root, things start to snarl up
>> again.
>>
>> (ns example.core
>>   (:require [clojure.java.io :as java.io]
>>             [clojure.data.xml :as data.xml]))
>>
>> (def large-file "/path-to-large-file")
>>
>> ;; using clojure.data.xml returns quickly with no problems, whereas
>> ;; clojure.xml/parse from the link above causes problems.
>> (def root
>>   (-> large-file
>>       java.io/input-stream
>>       data.xml/parse))
>>
>> (class root) ; clojure.data.xml.Element
>>
>> Does anyone know a way of searching within root that won't consume the
>> heap?
>>
>> Forgive me, I'm new to Clojure and these forums, I've searched through
>> previous posts but not managed to answer my own question.
>>
>> Thanks in advance.
>>
>
--
--
You received this message because you are subscribed to the Google
Groups "Clojure" group.
To post to this group, send email to [email protected]
Note that posts from new members are moderated - please be patient with your
first post.
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en