t the
boundaries between "documents" and break them into smaller XML files.
-- Jack Krupansky
-----Original Message-----
From: Mike Sokolov
Sent: Wednesday, June 06, 2012 8:02 AM
To: solr-user@lucene.apache.org
Cc: Erick Erickson
Subject: Re: Efficiently mining or parsing data out of X
I agree, that seems odd. We routinely index XML using either
HTMLStripCharFilter, or XmlCharFilter (see patch:
https://issues.apache.org/jira/browse/SOLR-2597), both of which parse
the XML, and we don't see such a huge speed difference from indexing
other field types. XmlCharFilter also allo
This seems really odd. How big are these XML files? Where are you parsing them?
You could consider using a SolrJ program with a SAX-style parser.
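As a rough illustration of the SAX-style approach, here is a minimal sketch in Python's standard `xml.sax` module rather than SolrJ itself. The element names (`records`, `record`, `title`, `junk`) and the `id` attribute are assumptions, since the original XML markup is not shown in this thread. The handler streams through the file, keeps only the wanted fields, and skips everything else, so the whole document is never held in memory.

```python
import xml.sax


class FieldExtractor(xml.sax.ContentHandler):
    """Collect only the wanted elements from each <record>; ignore the rest.

    Element and attribute names are hypothetical, chosen for illustration.
    """

    WANTED = {"title"}

    def __init__(self):
        super().__init__()
        self.docs = []      # one dict per <record>, ready to send to Solr
        self._doc = None    # record currently being built
        self._field = None  # wanted element currently open, if any
        self._buf = []      # text chunks for the open element

    def startElement(self, name, attrs):
        if name == "record":
            self._doc = {"id": attrs.get("id")}
        elif self._doc is not None and name in self.WANTED:
            self._field = name
            self._buf = []

    def characters(self, content):
        # SAX may deliver text in several chunks, so accumulate them.
        if self._field is not None:
            self._buf.append(content)

    def endElement(self, name):
        if name == self._field:
            self._doc[name] = "".join(self._buf).strip()
            self._field = None
        elif name == "record":
            self.docs.append(self._doc)
            self._doc = None


def extract_docs(xml_text):
    """Parse an XML string and return a list of field dicts."""
    handler = FieldExtractor()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.docs


SAMPLE = (
    '<records><record id="1234">'
    "<title>foo</title><junk>garbage data</junk>"
    "</record></records>"
)
docs = extract_docs(SAMPLE)
```

Each dict in `docs` would then be mapped onto a Solr document by whatever client is in use. Timing the parse step separately from the indexing step is one way to find out which of the two is actually slow.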
But the first question I'd answer is "what is slow?". The implication
of your post is that parsing the XML is the slow part; it really
shouldn't be tak
I'm just wondering what the general consensus is on indexing XML data to Solr,
in terms of parsing the file, mining the relevant data out of it, and putting
that data into Solr fields. Assume that this is the XML file and these are the
resulting Solr fields:
XML data:
foo
garbage data
Solr Fields:
Id=1234
Title