: the text out of these types of documents.  You could borrow the
: document parsing pieces from Lucene's contrib and Nutch and glue them
: together into your client that speaks to Solr, or perhaps Solr isn't
: the right approach for your needs?   It certainly is possible to add
: these capabilities into Solr, but it would be awkward to have to
: stream binary data into XML documents such that Solr could parse them
: on the server side.

Agreed.  Solr's focus is on indexing "Structured Data".  The support for
dynamic fields certainly allows you to deal with complex structured data,
and somewhat heterogeneous structured data -- but it's still structured
data.  If your goal is to do a lot of crawling of disparate physical
documents, extract the text, and build a "path,title,content" index,
then Nutch is probably your best bet.
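
If you do go the "glue it together in your client" route, the Solr side of
it is just building an <add> message out of text you have already extracted.
Here is a rough sketch of that step, not a tested client: the function name
build_add_xml and the sample values are made up for illustration, the field
names are the "path,title,content" ones from above and would have to exist
in your schema, and the extraction itself plus the HTTP POST to Solr's
update handler are left out.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build a Solr <add> update message from dicts of already-extracted text.

    Hypothetical helper: each dict maps a field name to its string value.
    """
    add = ET.Element("add")
    for fields in docs:
        doc = ET.SubElement(add, "doc")
        for name, value in fields.items():
            field = ET.SubElement(doc, "field", name=name)
            field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


# Sample values are invented; real ones would come from your document parser.
xml_message = build_add_xml([
    {
        "path": "/docs/report.pdf",
        "title": "Q3 Report",
        "content": "Revenue grew & costs fell.",
    },
])
print(xml_message)
```

Note that only plain text goes into the XML here; the binary document never
leaves the client, which avoids the awkwardness mentioned above.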


-Hoss
