Re: Efficiently mining or parsing data out of XML source files

2012-06-06 Thread Jack Krupansky
t the boundaries between "documents" and break them into smaller XML files. -- Jack Krupansky
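
One way to read the suggestion above is to stream through a large XML file and emit each "document" as its own small XML fragment. The sketch below is only an illustration under assumptions: the element name `record`, the helper name `split_documents`, and the premise that records are direct children of the root element are all hypothetical and do not come from the thread.

```python
import io
import xml.etree.ElementTree as ET


def split_documents(source, record_tag="record"):
    """Yield each <record_tag> element as a standalone XML string.

    Assumes (hypothetically) that records are direct children of the root.
    """
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == record_tag:
            elem.tail = None  # do not carry inter-record whitespace along
            yield ET.tostring(elem, encoding="unicode")
            root.clear()  # drop finished records so memory use stays flat


# Hypothetical input: two small "documents" inside one larger file.
big_file = io.BytesIO(
    b"<records>"
    b"<record id='1'><title>a</title></record>\n"
    b"<record id='2'><title>b</title></record>"
    b"</records>"
)
parts = list(split_documents(big_file))
```

Each string in `parts` could then be written to its own file or posted to Solr separately; whether that matches the splitting Jack had in mind cannot be confirmed from the truncated message.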

Re: Efficiently mining or parsing data out of XML source files

2012-06-06 Thread Mike Sokolov
I agree, that seems odd. We routinely index XML using either HTMLStripCharFilter, or XmlCharFilter (see patch: https://issues.apache.org/jira/browse/SOLR-2597), both of which parse the XML, and we don't see such a huge speed difference from indexing other field types. XmlCharFilter also allo

Re: Efficiently mining or parsing data out of XML source files

2012-06-03 Thread Erick Erickson
This seems really odd. How big are these XML files? Where are you parsing them? You could consider using a SolrJ program with a SAX-style parser. But the first question I'd answer is "what is slow?". The implication of your post is that parsing the XML is the slow part; it really shouldn't be tak
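
For illustration, here is a minimal sketch of SAX-style extraction, written in Python with the standard `xml.sax` module rather than in Java with SolrJ. The element names (`records`, `record`, `title`, `junk`), the class `FieldExtractor`, and the helper `extract_fields` are assumptions made up for the example; the original post's XML markup did not survive in this archive, so only the values `1234`, `foo`, and `garbage data` are taken from it.

```python
import xml.sax


class FieldExtractor(xml.sax.ContentHandler):
    """Collect the text of selected elements, one dict per record."""

    def __init__(self, record_tag, wanted_tags):
        super().__init__()
        self.record_tag = record_tag
        self.wanted_tags = set(wanted_tags)
        self.records = []
        self._current = None  # dict for the record being read, if any
        self._field = None    # wanted element we are currently inside
        self._buffer = []     # text chunks for that element

    def startElement(self, name, attrs):
        if name == self.record_tag:
            self._current = {}
            record_id = attrs.get("id")
            if record_id is not None:
                self._current["id"] = record_id
        elif self._current is not None and name in self.wanted_tags:
            self._field = name
            self._buffer = []

    def characters(self, content):
        # SAX may deliver one text node in several chunks, so buffer them.
        if self._field is not None:
            self._buffer.append(content)

    def endElement(self, name):
        if name == self._field:
            self._current[name] = "".join(self._buffer).strip()
            self._field = None
        elif name == self.record_tag and self._current is not None:
            self.records.append(self._current)
            self._current = None


def extract_fields(xml_bytes, record_tag="record", wanted_tags=("title",)):
    """Parse XML bytes and return a list of field dicts, skipping unwanted elements."""
    handler = FieldExtractor(record_tag, wanted_tags)
    xml.sax.parseString(xml_bytes, handler)
    return handler.records


# Hypothetical input shaped after the values quoted in the original post.
SAMPLE = b"""<records>
  <record id="1234">
    <title>foo</title>
    <junk>garbage data</junk>
  </record>
</records>"""

fields = extract_fields(SAMPLE)
```

In a SolrJ program the same idea would apply: a SAX handler fills one Solr document per record and ignores everything else, so the file is never held in memory as a whole tree. Timing this step separately from the indexing call is one way to answer the "what is slow?" question.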

Efficiently mining or parsing data out of XML source files

2012-05-31 Thread Van Tassell, Kristian
I'm just wondering what the general consensus is on indexing XML data into Solr in terms of parsing and mining the relevant data out of the file and putting it into Solr fields. Assume that this is the XML file and resulting Solr fields: XML data: foo garbage data Solr Fields: Id=1234 Title