If that is the case, you should look at the DataImportHandler examples,
as they can already index RSS; I'm doing it now for roughly a dozen feeds
on an hourly basis. (This also works for any XML-based feed: XHTML, XML,
etc.) I find Nutch more useful for plain vanilla HTML (pages built
non-dynamically), since for dynamic pages you can instead bring in the DB
content that you would have used to build the page to begin with. Nutch
is also useful for other document types, I think (PDF, etc.), and anything
that Tika (http://incubator.apache.org/tika/) can extract.
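For reference, a minimal data-config.xml along those lines might look
like this (the feed URL and field names here are placeholders, not my
actual setup):

    <dataConfig>
      <dataSource type="HttpDataSource" />
      <document>
        <entity name="feed"
                url="http://example.com/feed.rss"
                processor="XPathEntityProcessor"
                forEach="/rss/channel/item">
          <!-- map RSS item elements onto Solr fields -->
          <field column="title" xpath="/rss/channel/item/title" />
          <field column="link" xpath="/rss/channel/item/link" />
          <field column="description" xpath="/rss/channel/item/description" />
        </entity>
      </document>
    </dataConfig>

Kicking off /dataimport?command=full-import from cron is enough to get
the hourly refresh.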
- Jon
On Oct 22, 2008, at 11:08 AM, John Martyniak wrote:
Grant, thanks for the response.
A couple of other people have recommended trying the Nutch + Solr
approach, but I am not sure what the real benefit of doing that is,
since Nutch provides most of the same features as Solr, and Solr has
some nice additional features (like spell checking and incremental
indexing).
So I currently have a Nutch index of 500,000+ URLs, but expect it to
get much bigger, and I am generally pretty happy with it; I just want
to make sure that I am going down the correct path for the best
feature set. As far as the front end is concerned, I have been using
the Nutch search app basically as a web service feeding the main app
(so using RSS). The main app takes that and manipulates the results
for display.
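Concretely, the consuming side is pretty thin; a rough sketch of that
client in Java (the host, port, and query here are placeholders for
wherever the Nutch OpenSearch servlet is deployed):

    import java.net.URL;
    import javax.xml.parsers.DocumentBuilderFactory;
    import org.w3c.dom.Document;
    import org.w3c.dom.Element;
    import org.w3c.dom.NodeList;

    public class NutchRssClient {
        public static void main(String[] args) throws Exception {
            // Nutch's OpenSearch servlet returns search results as RSS
            URL url = new URL("http://localhost:8080/opensearch?query=lucene");
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(url.openStream());

            // Each <item> is one hit; pull out title and link for display
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                String title = item.getElementsByTagName("title").item(0).getTextContent();
                String link = item.getElementsByTagName("link").item(0).getTextContent();
                System.out.println(title + " -> " + link);
            }
        }
    }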
As far as the Hadoop + Lucene integration goes, I haven't used that
directly, just the Hadoop integration within Nutch, and of course
Hadoop on its own.
-John
On Oct 22, 2008, at 10:08 AM, Grant Ingersoll wrote:
On Oct 22, 2008, at 7:57 AM, John Martyniak wrote:
I am very new to Solr, but I have played with Nutch and Lucene.
Has anybody used Solr for a whole web indexing application?
Which Spider did you use?
How does it compare to Nutch?
There is a patch that combines Nutch + Solr: Nutch is used for
crawling, Solr for searching. I can't say I've used it for whole-web
searching, but I believe some are trying it.
At the end of the day, I'm sure Solr could do it, but it will take
some work to set up the architecture (distributed, replicated) and
deal properly with fault tolerance and failover. There are also
some examples in the Hadoop project of Hadoop + Lucene integration.
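For example, once the index is split across machines, a distributed
query in Solr 1.3 is just the normal request plus a shards parameter
listing the cores to fan out to (hostnames made up):

    http://solr1:8983/solr/select?q=ipod&shards=solr1:8983/solr,solr2:8983/solr

Replication can then be layered under each shard for fault tolerance.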
How big are you talking?
Thanks in advance for all of the info.
-John
--------------------------
Grant Ingersoll
Lucene Boot Camp Training Nov. 3-4, 2008, ApacheCon US New Orleans.
http://www.lucenebootcamp.com
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ