: Seems like Lucene/SOLR is geared to take record and non-xml-oriented
: content and put it into XML format for ingest -- but really can't digest
: XML content itself at all without significant setup and constraints. I
: am surprised -- but I could really use it for my project big time.

Lucene is geared towards indexing records ("Documents") containing key=>value pairs. The values are then passed to "Analyzers" to break them up into individual terms. Solr is geared towards providing a non-Java interface to accept those Documents and hand them off to Lucene, and towards providing a simple way to define Analyzers using configuration without compiling custom Java code. A specific XML format is one way to communicate to Solr what those "records" are, CSV is another, ... other generic formats can be added as plugins.

(Mind you -- Lucene and Solr are "geared" for a lot of things in addition to those, but for the purposes of this conversation, and the focus on indexing, those are the distinctions.)

The aspect of your situation that neither Solr nor Lucene really focuses on is extracting the key=>value pairs from a larger stream of text (ie: XML in a user defined schema). This is where something like the XSLT approach I described could be helpful: you (as more of an expert on the XML schema of your documents than Solr) could write an XSLT for extracting the field=>value pairs for each doc, to give to Solr. You could do the same thing client side before sending the data to Solr -- the Jira issue I referred to (SOLR-285 BTW) would just allow this transform to happen server side.

-Hoss
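
For illustration only, here is a rough sketch of the client-side variant in Python rather than XSLT: it pulls field=>value pairs out of XML in a user-defined schema and wraps them in Solr's <add><doc><field name="..."> update format. The <library>/<book> schema, the sample data, and the field names (id, title, author) are all made up for the example; the field names would have to match whatever your Solr schema actually defines.

```python
import xml.etree.ElementTree as ET

# Hypothetical source document in a user-defined schema (not from this thread).
SOURCE_XML = """
<library>
  <book isbn="123">
    <title>Lucene Basics</title>
    <author>A. Writer</author>
  </book>
  <book isbn="456">
    <title>Solr Notes</title>
    <author>B. Editor</author>
  </book>
</library>
"""


def extract_pairs(book):
    """Pull flat field=>value pairs out of one <book> element.

    The field names are assumptions; they must exist in your Solr schema.
    """
    return [
        ("id", book.get("isbn")),
        ("title", book.findtext("title")),
        ("author", book.findtext("author")),
    ]


def to_solr_add(source_xml):
    """Build a Solr <add> update message from the source XML."""
    root = ET.fromstring(source_xml)
    add = ET.Element("add")
    for book in root.iter("book"):
        doc = ET.SubElement(add, "doc")
        for name, value in extract_pairs(book):
            if value is None:
                continue  # skip fields missing from this record
            field = ET.SubElement(doc, "field", name=name)
            field.text = value
    return ET.tostring(add, encoding="unicode")


solr_message = to_solr_add(SOURCE_XML)
print(solr_message)
```

The resulting string is what you would POST to Solr's update handler. The only part that needs knowledge of your documents is extract_pairs, which plays the same role the XSLT would play in the server-side approach.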