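Create a Reader over the decompressed file and do something like
(take-while identity (repeatedly #(read-one-wellformed-xml-tag the-reader))),
where read-one-wellformed-xml-tag is a function you write yourself: it reads
one complete <DOC> element and returns nil at end of input. Both take-while
and repeatedly are lazy, so as long as you don't hold on to the head of the
sequence you only need one document in memory at a time. It needs some
fleshing out for boundary conditions, but I hope you get the general idea.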
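A minimal sketch of that idea, under some assumptions of mine: every document
ends with a </DOC> on a line of its own (as in your example [1]), each chunk
is well-formed XML by itself, and the files are UTF-8. The names read-one-doc,
doc-strings, parse-doc and process-gz-docs! are made up for illustration, and
I use clojure.xml/parse only because it ships with Clojure; any parser that
accepts an InputStream should do.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str]
         '[clojure.xml :as xml])
(import '(java.io BufferedReader ByteArrayInputStream StringReader)
        '(java.util.zip GZIPInputStream))

(defn read-one-doc
  "Reads lines from rdr up to and including the next </DOC> line and
  returns them as one string. Returns nil at end of input; a trailing
  partial document is silently dropped."
  [^BufferedReader rdr]
  (loop [lines []]
    (let [line (.readLine rdr)]
      (cond
        (nil? line) nil
        (= "</DOC>" (str/trim line)) (str/join "\n" (conj lines line))
        :else (recur (conj lines line))))))

(defn doc-strings
  "Lazy seq of document strings; consume it while rdr is still open."
  [rdr]
  (take-while some? (repeatedly #(read-one-doc rdr))))

(defn parse-doc
  "Parses a single <DOC>...</DOC> string into a :tag/:attrs/:content map."
  [^String s]
  (xml/parse (ByteArrayInputStream. (.getBytes s "UTF-8"))))

(defn process-gz-docs!
  "Calls f on each parsed document of the gzipped file at path, one at a time."
  [path f]
  (with-open [rdr (-> path io/input-stream GZIPInputStream.
                      (io/reader :encoding "UTF-8"))]
    (doseq [s (doc-strings rdr)]
      (f (parse-doc s)))))
```

You would call it as (process-gz-docs! "/home/babilen/foo.gz" index-doc!),
where index-doc! stands for whatever hands a document to Lucene. If the real
corpus contains constructs that are legal SGML but not XML, the per-document
parse would need a more forgiving parser.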
On Tuesday, July 10, 2012 6:04:23 AM UTC-7, Wolodja Wentland wrote:
>
> Dear all,
>
> I am running into problems when I try to parse SGML documents [0] that are
> valid XML apart from the fact that they lack a root tag. The whole issue is
> complicated by the fact that the input files are pretty large (i.e. 9.1 GB
> gzipped files) and that I therefore cannot read them completely into memory.
> My goal is to extract each "document" and index it using Lucene, so I need
> access to the data at one point, but can throw it away immediately after
> processing.
>
> The input data looks something like [1] and my main problem is that none of
> the parsers I tried copes with the missing root tag. If the SGML is parsed
> with either clj-tagsoup's [3] parse-xml or data.xml's [4] parse function I
> get a broken representation and can't extract all data using either zippers
> (e.g. like in [5]) or by working on the parsed data directly (as in [6]).
>
> The cause is that the *first* <DOC> is (wrongly) assumed to be the root
> tag for the entire document and that the result of the parse looks something
> like [7] (for tagsoup) or [8] (for data.xml). As you can imagine I want
> output such as the following from my (hypothetical) processing function:
>
> (documents-from-gigaword-file (-> in-file
> (io/input-stream)
> (GZIPInputStream.)))
> ({:id "AFP_ENG_20101220.0219"
> :type "story"
> :headline "Headline 1"
> :paragraphs ("Paragraph 1" "Paragraph 2")}
>
> {:id "AFP_ENG_20101206.0235"
> :type "story"
> :headline "Headline 2"
> :paragraphs ("Paragraph 3")})
>
> But right now I get the following (no wonder!):
>
> user=> (clojure.pprint/pprint (gw-file->documents (io/file
> "/home/babilen/foo.gz")))
> ({:id "AFP_ENG_20101206.0235",
> :type "story",
> :headline " Headline 2 ",
> :paragraphs (" Paragraph 3 ")})
>
>
> I am, however, unsure how to proceed. I tried wrapping the input stream in
> "<XML> ... </XML>" [10] but that requires me to read the entire file into
> memory and I get OutOfMemory errors when working on the complete corpus. So
> in short my questions are:
>
> * Do you know a parser that I can use to parse this data?
> * Lacking that: How can I wrap the GZIPInputStream in opening and closing
>   tags?
> * Do you think that I should just write a parser myself? (seems a lot of
>   work just because the enclosing tags are missing)
> * Are there other feasible approaches?
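On the second question: one way to add the opening and closing tags without
slurping the file might be java.io.SequenceInputStream, which reads its
member streams one after another on demand. A rough sketch (wrap-in-root is a
name I made up, and "XML" is just the root tag from your attempt in [10]):

```clojure
(import '(java.io ByteArrayInputStream InputStream SequenceInputStream)
        '(java.util Collections))

(defn wrap-in-root
  "Returns an InputStream that yields <XML>, then the bytes of in, then
  </XML>. Nothing is read from in until the result itself is read."
  ^InputStream [^InputStream in]
  (let [bytes-of (fn [^String s]
                   (ByteArrayInputStream. (.getBytes s "UTF-8")))]
    (SequenceInputStream.
     (Collections/enumeration
      [(bytes-of "<XML>") in (bytes-of "</XML>")]))))
```

You would pass it the GZIPInputStream and hand the result to the parser. This
only fixes the missing root tag: a parser that builds the whole tree eagerly
will still run out of memory on the full corpus, so it would have to be
combined with a parser that produces its result lazily.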
>
> Any input would be most appreciated!
>
> References
> ----------
>
> [0] The input data is the English gigaword corpus from
> http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2011T07
>
> [1] Example data:
>
> <DOC id="AFP_ENG_20101220.0219" type="story" >
> <HEADLINE>
> Headline 1
> </HEADLINE>
> <DATELINE>
> Location, Dec 20, 2010 (AFP)
> </DATELINE>
> <TEXT>
> <P>
> Paragraph 1
> </P>
> <P>
> Paragraph 2
> </P>
> </TEXT>
> </DOC>
> <DOC id="AFP_ENG_20101206.0235" type="story" >
> <HEADLINE>
> Headline 2
> </HEADLINE>
> <DATELINE>
> Location, Dec 6, 2010 (AFP)
> </DATELINE>
> <TEXT>
> <P>
> Paragraph 3
> </P>
> </TEXT>
> </DOC>
>
> [3]
> https://github.com/nathell/clj-tagsoup/blob/master/src/pl/danieljanus/tagsoup.clj
>
> [4] https://github.com/clojure/data.xml/
> [5] Extraction using a zipper:
> (defn gw-file->documents
> [in-file]
> (let [xml-zipper (zip/xml-zip (parse-gw-file in-file))]
> (map (fn [doc]
> {:id (dzx/xml1-> doc (dzx/attr :id))
> :type (dzx/xml1-> doc (dzx/attr :type))
> :headline (dzx/xml1-> doc :HEADLINE dzx/text)
> :paragraphs (dzx/xml-> doc :TEXT :p dzx/text)})
> (dzx/xml-> xml-zipper :DOC))))
> [6] Example extraction of data on the output of parse(-xml) directly:
> I use filter-tag to search for all :DOC's and call process-document for
> each.
>
> (defn- filter-tag
> [tag xmls]
> (filter identity
> (for [x xmls
> :when (= tag (:tag x))]
> x)))
>
> (defn process-document
> [doc]
> {:id (:id (:attrs doc))
> :type (:type (:attrs doc))
> :headline (filter-tag :HEADLINE (xml-seq doc))})
> [7] Parsing with tagsoup
>
> user=> (clojure.pprint/pprint
> (tagsoup/parse-xml
> (-> "/home/babilen/foo.gz" (io/file) (io/input-stream)
> (GZIPInputStream.))))
> {:tag :DOC,
> :attrs {:id "AFP_ENG_20101220.0219", :type "story"},
> :content
> [{:tag :HEADLINE, :attrs nil, :content ["\nHeadline 1\n"]}
> {:tag :DATELINE,
> :attrs nil,
> :content ["\nLocation, Dec 20, 2010 (AFP)\n"]}
> {:tag :TEXT,
> :attrs nil,
> :content
> [{:tag :p, :attrs nil, :content ["\nParagraph 1\n"]}
> {:tag :p, :attrs nil, :content ["\nParagraph 2\n"]}]}
> {:tag :DOC,
> :attrs {:id "AFP_ENG_20101206.0235", :type "story"},
> :content
> [{:tag :HEADLINE, :attrs nil, :content ["\nHeadline 2\n"]}
> {:tag :DATELINE,
> :attrs nil,
> :content ["\nLocation, Dec 6, 2010 (AFP)\n"]}
> {:tag :TEXT,
> :attrs nil,
> :content [{:tag :p, :attrs nil, :content ["\nParagraph 3\n"]}]}]}]}
>
> [8] Parsing with clojure.data.xml/parse
> user=> (clojure.pprint/pprint
> (clojure.data.xml/parse
> (-> "/home/babilen/foo.gz" (io/file) (io/input-stream)
> (GZIPInputStream.))))
> {:tag :DOC,
> :attrs {:id "AFP_ENG_20101220.0219", :type "story"},
> :content
> ({:tag :HEADLINE, :attrs {}, :content ("\nHeadline 1\n")}
> {:tag :DATELINE,
> :attrs {},
> :content ("\nLocation, Dec 20, 2010 (AFP)\n")}
> {:tag :TEXT,
> :attrs {},
> :content
> ({:tag :P, :attrs {}, :content ("\nParagraph 1\n")}
> {:tag :P, :attrs {}, :content ("\nParagraph 2\n")})})}
>
> [9] My actual code:
>
> (defn- parse-gw-file
> [in-file]
> (->> in-file
> (io/input-stream)
> (GZIPInputStream.)
> (ts/parse-xml)))
>
> (defn gw-file->documents
> [in-file]
> (let [xml-zipper (zip/xml-zip (parse-gw-file in-file))]
> (map (fn [doc]
> {:id (dzx/xml1-> doc (dzx/attr :id))
> :type (dzx/xml1-> doc (dzx/attr :type))
> :headline (dzx/xml1-> doc :HEADLINE dzx/text)
> :paragraphs (dzx/xml-> doc :TEXT :p dzx/text)})
> (dzx/xml-> xml-zipper :DOC))))
>
> [10] Wrapping the stream:
> (defn- parse-gw-file
> [in-file]
> (let [unzipped-file (->> in-file
> (io/input-stream)
> (GZIPInputStream.))
> ;; slurp reads the whole decompressed file into memory
> wrapped-file (str "<XML>" (slurp unzipped-file) "</XML>")]
> (ts/parse-xml (ByteArrayInputStream.
> (.getBytes wrapped-file "UTF-8")))))
> --
> Wolodja <[email protected]>
>
> 4096R/CAF14EFC
> 081C B7CD FF04 2BA9 94EA 36B2 8B7F 7D30 CAF1 4EFC
>