Dear all, I am running into problems parsing SGML documents [0] that are valid XML except that they lack a root tag. The issue is complicated by the fact that the input files are pretty large (9.1 GB of gzipped files), so I cannot read them completely into memory. My goal is to extract each "document" and index it with Lucene, so I need access to the data only once and can throw it away immediately after processing.
The input data looks something like [1], and my main problem is that none of
the parsers I tried cope with the missing root tag. If the SGML is parsed
with either clj-tagsoup's [3] parse-xml or data.xml's [4] parse function, the
*first* <DOC> is (wrongly) assumed to be the root tag of the entire document,
so I get a broken representation and cannot extract all data, either with
zippers (as in [5]) or by working on the parsed data directly (as in [6]).
The result of the parse looks something like [7] (for tagsoup) or [8] (for
data.xml). What I actually want from my (hypothetical) processing function is
output such as:
(documents-from-gigaword-file (-> in-file
                                  (io/input-stream)
                                  (GZIPInputStream.)))
({:id "AFP_ENG_20101220.0219"
  :type "story"
  :headline "Headline 1"
  :paragraphs ("Paragraph 1" "Paragraph 2")}
 {:id "AFP_ENG_20101206.0235"
  :type "story"
  :headline "Headline 2"
  :paragraphs ("Paragraph 3")})
But right now I get the following (no wonder!):
user=> (clojure.pprint/pprint
         (gw-file->documents (io/file "/home/babilen/foo.gz")))
({:id "AFP_ENG_20101206.0235",
  :type "story",
  :headline " Headline 2 ",
  :paragraphs (" Paragraph 3 ")})
I am, however, unsure how to proceed. I tried wrapping the input stream in
"<XML> ... </XML>" [10], but that requires reading the entire file into
memory, and I get an OutOfMemoryError when working on the complete corpus.
So, in short, my questions are:
* Do you know a parser that I can use to parse this data?
* Lacking that: How can I wrap the GZIPInputStream in opening and closing
tags?
* Do you think that I should just write a parser myself? (that seems like a
lot of work just because the enclosing root tag is missing)
* Are there other feasible approaches?
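Regarding the second question, one idea I have not yet tested on the full corpus: concatenate the opening tag, the unzipped stream, and the closing tag with java.io.SequenceInputStream, which should avoid reading anything into memory up front (the function name here is hypothetical):

```clojure
(import '(java.io ByteArrayInputStream SequenceInputStream)
        '(java.util.zip GZIPInputStream))
(require '[clojure.java.io :as io])

(defn- wrapped-gw-stream
  "Returns an InputStream that yields <XML>, the unzipped file
  contents and </XML> in sequence, without slurping the file."
  [in-file]
  (let [opening (ByteArrayInputStream. (.getBytes "<XML>" "UTF-8"))
        body    (GZIPInputStream. (io/input-stream in-file))
        closing (ByteArrayInputStream. (.getBytes "</XML>" "UTF-8"))]
    ;; SequenceInputStream reads each stream to exhaustion in turn,
    ;; so the gzip stream is only consumed as the parser pulls data.
    (SequenceInputStream.
      (java.util.Collections/enumeration [opening body closing]))))
```

Does that sound reasonable, or am I missing an obvious drawback?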
Any input would be most appreciated!
References
----------
[0] The input data is the English gigaword corpus from
http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2011T07
[1] Example data:
<DOC id="AFP_ENG_20101220.0219" type="story" >
<HEADLINE>
Headline 1
</HEADLINE>
<DATELINE>
Location, Dec 20, 2010 (AFP)
</DATELINE>
<TEXT>
<P>
Paragraph 1
</P>
<P>
Paragraph 2
</P>
</TEXT>
</DOC>
<DOC id="AFP_ENG_20101206.0235" type="story" >
<HEADLINE>
Headline 2
</HEADLINE>
<DATELINE>
Location, Dec 6, 2010 (AFP)
</DATELINE>
<TEXT>
<P>
Paragraph 3
</P>
</TEXT>
</DOC>
[3]
https://github.com/nathell/clj-tagsoup/blob/master/src/pl/danieljanus/tagsoup.clj
[4] https://github.com/clojure/data.xml/
[5] Extraction using a zipper:
(defn gw-file->documents
  [in-file]
  (let [xml-zipper (zip/xml-zip (parse-gw-file in-file))]
    (map (fn [doc]
           {:id         (dzx/xml1-> doc (dzx/attr :id))
            :type       (dzx/xml1-> doc (dzx/attr :type))
            :headline   (dzx/xml1-> doc :HEADLINE dzx/text)
            :paragraphs (dzx/xml-> doc :TEXT :p dzx/text)})
         (dzx/xml-> xml-zipper :DOC))))
[6] Example extraction of data on the output of parse(-xml) directly:
I use filter-tag to search for all :DOC's and call process-document for
each.
(defn- filter-tag
  [tag xmls]
  (filter #(= tag (:tag %)) xmls))
(defn process-document
  [doc]
  {:id       (:id (:attrs doc))
   :type     (:type (:attrs doc))
   :headline (filter-tag :HEADLINE (xml-seq doc))})
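For reference, the two helpers are combined roughly like this (the driver name is hypothetical):

```clojure
;; xml-seq flattens the parsed tree into a seq of all nodes,
;; filter-tag keeps the :DOC elements, and process-document
;; turns each one into a map.
(defn gw-documents
  [parsed]
  (map process-document
       (filter-tag :DOC (xml-seq parsed))))
```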
[7] Parsing with tagsoup
user=> (clojure.pprint/pprint
(tagsoup/parse-xml
(-> "/home/babilen/foo.gz" (io/file) (io/input-stream) (GZIPInputStream.))))
{:tag :DOC,
:attrs {:id "AFP_ENG_20101220.0219", :type "story"},
:content
[{:tag :HEADLINE, :attrs nil, :content ["\nHeadline 1\n"]}
{:tag :DATELINE,
:attrs nil,
:content ["\nLocation, Dec 20, 2010 (AFP)\n"]}
{:tag :TEXT,
:attrs nil,
:content
[{:tag :p, :attrs nil, :content ["\nParagraph 1\n"]}
{:tag :p, :attrs nil, :content ["\nParagraph 2\n"]}]}
{:tag :DOC,
:attrs {:id "AFP_ENG_20101206.0235", :type "story"},
:content
[{:tag :HEADLINE, :attrs nil, :content ["\nHeadline 2\n"]}
{:tag :DATELINE,
:attrs nil,
:content ["\nLocation, Dec 6, 2010 (AFP)\n"]}
{:tag :TEXT,
:attrs nil,
:content [{:tag :p, :attrs nil, :content ["\nParagraph 3\n"]}]}]}]}
[8] Parsing with clojure.data.xml/parse
user=> (clojure.pprint/pprint
(clojure.data.xml/parse
(-> "/home/babilen/foo.gz" (io/file) (io/input-stream) (GZIPInputStream.))))
{:tag :DOC,
:attrs {:id "AFP_ENG_20101220.0219", :type "story"},
:content
({:tag :HEADLINE, :attrs {}, :content ("\nHeadline 1\n")}
{:tag :DATELINE,
:attrs {},
:content ("\nLocation, Dec 20, 2010 (AFP)\n")}
{:tag :TEXT,
:attrs {},
:content
({:tag :P, :attrs {}, :content ("\nParagraph 1\n")}
{:tag :P, :attrs {}, :content ("\nParagraph 2\n")})})}
[9] My actual code:
(defn- parse-gw-file
  [in-file]
  (->> in-file
       (io/input-stream)
       (GZIPInputStream.)
       (ts/parse-xml)))

(defn gw-file->documents
  [in-file]
  (let [xml-zipper (zip/xml-zip (parse-gw-file in-file))]
    (map (fn [doc]
           {:id         (dzx/xml1-> doc (dzx/attr :id))
            :type       (dzx/xml1-> doc (dzx/attr :type))
            :headline   (dzx/xml1-> doc :HEADLINE dzx/text)
            :paragraphs (dzx/xml-> doc :TEXT :p dzx/text)})
         (dzx/xml-> xml-zipper :DOC))))
[10] Wrapping the stream:
(defn- parse-gw-file
  [in-file]
  (let [unzipped-stream (->> in-file
                             (io/input-stream)
                             (GZIPInputStream.))
        wrapped-data    (str "<XML>" (slurp unzipped-stream) "</XML>")]
    (ts/parse-xml
      (ByteArrayInputStream. (.getBytes wrapped-data "UTF-8")))))
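If the wrapping in [10] could be done on the stream itself rather than via slurp, data.xml should then let me walk the <DOC> elements lazily, since clojure.data.xml/parse builds a lazy tree. A rough, untested sketch:

```clojure
(require '[clojure.data.xml :as xml])

;; `wrapped-stream` is assumed to be an InputStream that already
;; carries the synthetic <XML> ... </XML> root element.
(defn doc-elements
  "Lazy seq of <DOC> elements from an already-wrapped stream."
  [wrapped-stream]
  (filter #(= :DOC (:tag %))   ; drop whitespace strings between docs
          (:content (xml/parse wrapped-stream))))
```

Each element could then be processed and discarded, which should keep memory bounded as long as I don't hold on to the head of the seq.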
--
Wolodja <[email protected]>
4096R/CAF14EFC
081C B7CD FF04 2BA9 94EA 36B2 8B7F 7D30 CAF1 4EFC