I recently started working with Solr and find it easy to set up and tinker with.
I now want to scale up my setup and was wondering whether there is an application or component that can do the following (I was not able to find documentation on this on the Solr site):

- Can I send Solr an XML document containing a URL (of an HTML, PDF, Word, PPT, etc. file) and have Solr index it after analyzing it? Can it analyze PDF and other document formats at all? Solr would use some generic basic fields, such as header and content, when analyzing the files.
- Can I send Solr a site URL and have it index the whole site?

If the answer to the above is yes, are there some examples?

If the answer is no, is there a simple (basic) extractor for HTML, PDF, Word, etc. files that would translate them into a basic XML document (e.g. with the fields url, header and content) that Solr can ingest, or preferably an application that does this for a whole site?

The idea is to configure Solr for generic indexing and search of a website.

Frank.
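
P.S. To make the target format concrete, here is a rough sketch of the kind of basic extractor I have in mind, for the HTML case only. It assumes the Solr schema defines fields named url, header and content (my own choice of names), and it wraps them in the standard Solr XML update message (add/doc/field). `BasicExtractor` and `html_to_solr_xml` are made-up names, only the Python standard library is used, and PDF, Word, etc. are not handled.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class BasicExtractor(HTMLParser):
    """Collects <title> text as the header and other visible text as content."""

    def __init__(self):
        super().__init__()
        self.header_parts = []
        self.content_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip_depth:
            return  # ignore whitespace and script/style bodies
        if self._in_title:
            self.header_parts.append(text)
        else:
            self.content_parts.append(text)


def html_to_solr_xml(url, html):
    """Return a Solr <add><doc>...</doc></add> message for one HTML page.

    The field names (url, header, content) are assumptions and must
    match whatever the Solr schema actually defines.
    """
    parser = BasicExtractor()
    parser.feed(html)
    parser.close()

    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = (
        ("url", url),
        ("header", " ".join(parser.header_parts)),
        ("content", " ".join(parser.content_parts)),
    )
    for name, value in fields:
        field = ET.SubElement(doc, "field", name=name)
        field.text = value
    # ElementTree escapes &, < and > in the field text for us.
    return ET.tostring(add, encoding="unicode")
```

The resulting string could then be posted to Solr's update handler. Fetching the pages and following links across a whole site is the part I would prefer an existing application to do.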