: the text out of these types of documents. You could borrow the
: document parsing pieces from Lucene's contrib and Nutch and glue them
: together into your client that speaks to Solr, or perhaps Solr isn't
: the right approach for your needs? It certainly is possible to add
: these capabilities into Solr, but it would be awkward to have to
: stream binary data into XML documents such that Solr could parse them
: on the server side.

Agreed. Solr's focus is on indexing "Structured Data". The support for dynamic fields certainly allows you to deal with complex structured data, and somewhat heterogeneous structured data -- but it's still structured data. If your goal is to do a lot of crawling of disparate physical documents, extract the text, and build a "path,title,content" index, then Nutch is probably your best bet.

-Hoss
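
For anyone who does want to try the client-side route described in the quoted text, here is a rough sketch of the "glue" step: the client extracts the text itself and sends Solr only plain structured fields as an XML add message. The field names (path, title, content) are borrowed from the index layout mentioned above and are assumed to exist in your schema; the update URL is a placeholder for wherever your Solr instance lives, and the text extraction itself (the parsers from Lucene's contrib or Nutch) is not shown.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_message(path, title, content):
    """Build a Solr <add><doc>...</doc></add> message from already-extracted text.

    The field names are assumptions; they must match fields in your schema.
    """
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in (("path", path), ("title", title), ("content", content)):
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, <, > on serialization
    return ET.tostring(add, encoding="unicode")


def post_to_solr(message, update_url):
    """POST the XML message to Solr's update handler.

    update_url is hypothetical, e.g. "http://localhost:8983/solr/update";
    adjust it for your deployment. A separate <commit/> message is still
    needed before the documents become searchable.
    """
    request = urllib.request.Request(
        update_url,
        data=message.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    # The text would come from whatever parser you glued in on the client side.
    print(build_add_message("/docs/report.pdf", "Q3 Report", "extracted text here"))
```

The point of doing it this way is that only text ever crosses the wire, so nothing binary has to be streamed inside the XML for Solr to parse on the server side.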