Uh, lots of confusion in this thread:

Peter: Nutch does have plugins for parsing PDFs, Word documents, and so on.
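(If memory serves, you enable them through the plugin.includes property in conf/nutch-site.xml by adding the pdf and msword parsers to the parse-* group. The plugin ids below are from memory, so check your nutch-default.xml for the exact names:

<property>
  <!-- roughly the stock list with parse-pdf and parse-msword added; ids from memory -->
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(text|html|pdf|msword)|index-basic|query-(basic|site|url)</value>
</property>
)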
Kuro: Solr does not have a crawling component. Use Nutch to crawl. Nutch also has a built-in webapp for searching. Of course, you can get Solr to search the content that Nutch fetched, too, but there is nothing out of the box (see http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html ).

Otis
--
Lucene Consulting -- http://lucene-consulting.com/

----- Original Message ----
From: Peter Manis <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Friday, July 6, 2007 2:19:21 AM
Subject: Re: Indexing HTML and other doc types

I guess I misread your original question. I believe Nutch would be the choice for crawling; however, I do not know about its abilities for indexing other document types. If you need to index multiple document types such as PDF, DOC, etc., and Nutch does not provide the functionality to do so, you would probably need to write a script or program that can feed the crawl results to Solr. I believe there is a simple Python script somewhere that will crawl sites; it would of course need modification, but it would provide a starting point. I have not worked with Nutch, so I may be speaking incorrectly, but by having a separate script/application handle the crawl you may have more control over what is sent to Solr to be indexed. Nutch may already include a lot of functionality to process incoming content.

Pete

On 7/5/07, Teruhiko Kurosaka <[EMAIL PROTECTED]> wrote:
>
> Thank you, Otis and Peter, for your replies.
>
> > From: Otis Gospodnetic [mailto:[EMAIL PROTECTED]
> >
> > doc of some type -> parse content into various fields -> post to Solr
>
> I understand this part, but the question is who should do this.
> I was under the assumption that it's the Solr client's job to crawl the
> net, read documents, parse them, put the contents into different fields
> (the "contents", title, author, date, URL, etc.), then post the result
> to Solr via HTTP in XML or CSV.
>
> And I was asking if there are open-source projects to build such clients.
>
> Peter's approach is different; he adds the intelligence of parsing
> documents to Solr itself. (I guess the crawling has to be done by
> clients.) I wonder whether this fits the model that Solr has, or is it
> just my illusion?
>
> -kuro
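For the "feed the crawl results to Solr" part above, the Solr side of such a feeder is small. A rough, untested sketch in Python follows; the /solr/update URL and the id/title/text field names are only placeholders for whatever your schema.xml actually defines, and the crawling/parsing step is left out entirely:

# Rough sketch: push one parsed document into Solr via its XML update
# interface.  Assumptions: Solr is listening at localhost:8983 and the
# schema has fields named id, title, and text -- adjust both to taste.
import urllib.request
from xml.sax.saxutils import escape

SOLR_UPDATE_URL = "http://localhost:8983/solr/update"   # placeholder URL

def post_xml(xml):
    # POST a raw XML update message to Solr and return its response.
    req = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=xml.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    return urllib.request.urlopen(req).read()

def add_document(doc_id, title, text):
    # Wrap the parsed fields in an <add><doc>...</doc></add> message.
    fields = "".join(
        '<field name="%s">%s</field>' % (name, escape(value))
        for name, value in (("id", doc_id), ("title", title), ("text", text))
    )
    post_xml("<add><doc>%s</doc></add>" % fields)

# One call per fetched page/PDF/DOC, then a commit to make them searchable.
add_document("http://example.com/a.pdf", "Example PDF", "extracted text here")
post_xml("<commit/>")

Whatever walks the crawl output (Nutch segments, the Python crawler Pete mentioned, anything else) just needs to produce those <add> messages and finish with a <commit/>; everything beyond that is up to the crawler.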