I think you might be looking for Apache Tika.
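
Tika does the text extraction for PDF, Word, PowerPoint and similar formats. Once you have the text, the message you post to Solr's XML update handler is just an <add><doc> with one <field> per value. Below is a rough sketch of that second step, not a definitive implementation. It uses only the Python standard library and handles HTML only, as a stand-in for what Tika would give you. The field names url, header and content are taken from your mail and are assumed to be defined in your schema.xml; TitleAndText and solr_add_xml are made-up names for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TitleAndText(HTMLParser):
    """Collect the <title> and the visible text of an HTML page."""

    def __init__(self):
        super().__init__()
        self.title = []
        self.text = []
        self._in_title = False
        self._skip = 0  # nesting depth inside <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if self._skip:
            return  # ignore script and stylesheet bodies
        if self._in_title:
            self.title.append(data)
        elif data.strip():
            self.text.append(data.strip())


def solr_add_xml(url, html):
    """Build a Solr <add><doc> message with url, header and content fields."""
    parser = TitleAndText()
    parser.feed(html)
    parser.close()

    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    fields = (
        ("url", url),
        ("header", "".join(parser.title).strip()),
        ("content", " ".join(parser.text)),
    )
    for name, value in fields:
        # Field names are assumptions; they must match your schema.xml.
        ET.SubElement(doc, "field", name=name).text = value
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    page = "<html><head><title>Hi</title></head><body><p>Hello</p></body></html>"
    print(solr_add_xml("http://example.com/", page))
```

You would then POST the resulting string to your Solr update URL and send a commit. For anything other than HTML, swap the parser for the text and metadata Tika extracts.
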

On Mon, Jan 25, 2010 at 3:55 PM, Frank van Lingen <fr...@vanlingen.name> wrote:

> I recently started working with solr and find it easy to setup and tinker
> with.
>
> I now want to scale up my setup and was wondering if there is an
> application/component that can do the following (I was not able to
> find documentation on this on the solr site):
>
> -Can I send solr an xml document with a url (html, pdf, word, ppt,
> etc..) and solr indexes it after analyzing (can it analyze pdf and
> other documents?). Solr would use some generic basic fields like
> header and content when analyzing the files.
>
> -Can I send solr a site url and it indexes the whole site?
>
> If the answer to the above is yes; are there some examples? If the
> answer is no; Is there a simple (basic) extractor for html, pdf, word,
> etc.. files that would translates this in a basic xml document (e.g.
> with field names, url, header and content) that solr can ingest, or
> preferably an application that does this for a whole site?
>
> The idea is to configure solr for generic indexing and search of a website.
>
> Frank.