Re: solr application for website crawling and indexing html, pdf, word, ... files

mike anderson Mon, 25 Jan 2010 13:16:38 -0800

I think you might be looking for Apache Tika.


On Mon, Jan 25, 2010 at 3:55 PM, Frank van Lingen <fr...@vanlingen.name> wrote:

> I recently started working with Solr and find it easy to set up and tinker
> with.
>
> I now want to scale up my setup and was wondering if there is an
> application/component that can do the following (I was not able to
> find documentation on this on the solr site):
>
> - Can I send Solr an XML document containing the URL of a file (HTML,
> PDF, Word, PPT, etc.) and have Solr index that file after analyzing it
> (can it analyze PDF and other documents)? Solr would use some generic
> basic fields like header and content when analyzing the files.
>
> - Can I send Solr a site URL and have it index the whole site?
>
> If the answer to the above is yes, are there some examples? If the
> answer is no, is there a simple (basic) extractor for HTML, PDF, Word,
> etc. files that would translate them into a basic XML document (e.g.
> with the field names url, header and content) that Solr can ingest, or
> preferably an application that does this for a whole site?
>
> The idea is to configure Solr for generic indexing and search of a website.
>
> Frank.
>
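
To make the "basic xml document" part concrete, here is a rough sketch of what such an extractor could look like for plain HTML only, using nothing but the Python standard library. The field names url, header and content are taken from the question and would have to exist in your schema. The class and function names are made up for this example, and treating the page <title> as the "header" is my assumption. PDF, Word and similar formats need a real extractor such as Tika, and posting the result to Solr's update handler is left out.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class BasicExtractor(HTMLParser):
    """Hypothetical helper: collects the <title> text and the visible text of a page."""

    def __init__(self):
        super().__init__()
        self.title_parts = []
        self.text_parts = []
        self._in_title = False
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title_parts.append(data)
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def html_to_solr_xml(url, html):
    """Return a Solr <add> update message (as a string) for one HTML page."""
    parser = BasicExtractor()
    parser.feed(html)
    parser.close()

    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    # Field names come from the question; they are assumed to be in the schema.
    fields = {
        "url": url,
        "header": "".join(parser.title_parts).strip(),
        "content": " ".join(parser.text_parts),
    }
    for name, value in fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = value  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")


if __name__ == "__main__":
    page = "<html><head><title>Hello</title></head><body><p>Some text</p></body></html>"
    print(html_to_solr_xml("http://example.com/", page))
```

The output is a single <add><doc>...</doc></add> message with one <field> per name, which is the shape Solr's XML update handler accepts. Crawling a whole site would mean running something like this over every page a crawler fetches.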
