Hi,
To crawl and index your web sites, you may want to have a look at
www.crawl-anywhere.com. It includes a web crawler, a document processing
pipeline and a Solr indexer.
Dominique
On 23/12/10 16:27, Dietrich wrote:
I want to use Solr to index two types of documents:
- local documents in Drupal (ca. 10M)
- a large number of web sites to be crawled through Nutch (ca. 100M)
Our data center does not have the necessary bandwidth to crawl all the
external sites, so we want to use a hosting provider to do the
crawling for us, but we want the actual serving of results to happen
locally.
It seems it would probably be easiest to delegate all the indexing to
a remote server and replicate those indexes to a slave in our data
center using built-in Solr replication, but then the indexing of our
internal sites would have to happen remotely too, which I would like
to avoid.
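As an aside: besides the polling interval configured in solrconfig.xml,
a slave can be told to pull from the master on demand through the
ReplicationHandler. A minimal SolrJ sketch, assuming Solr 1.4+ with the
handler registered at /replication; the host and core names below are
placeholders:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class FetchFromMaster {
        public static void main(String[] args) throws Exception {
            // Local slave core (placeholder URL and core name).
            SolrServer slave =
                new CommonsHttpSolrServer("http://localhost:8983/solr/web");

            // Ask the slave's ReplicationHandler to pull the newest
            // index version from the master named in its solrconfig.xml.
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("command", "fetchindex");
            QueryRequest req = new QueryRequest(params);
            req.setPath("/replication");
            req.process(slave);
        }
    }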
I think Hadoop/MapReduce would be overkill for this scenario, so what
other options are there?
I was considering
- using Solr's merge support to combine the Drupal & Nutch indexes
(sketched below)
- having Nutch post the crawled results directly to the local Solr
index
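For the first option, if the crawled index is shipped over as plain
index files, the CoreAdmin mergeindexes action can fold it into an
existing core. A rough SolrJ sketch, again assuming Solr 1.4+; the
target core name and index path are placeholders:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class MergeCrawledIndex {
        public static void main(String[] args) throws Exception {
            // CoreAdmin lives at the container root, not under a core.
            SolrServer admin =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

            // Merge an on-disk Nutch-built index into the "drupal"
            // core (core name and index path are placeholders).
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("action", "mergeindexes");
            params.set("core", "drupal");
            params.set("indexDir", "/data/nutch/solr-index");
            QueryRequest req = new QueryRequest(params);
            req.setPath("/admin/cores");
            req.process(admin);
        }
    }

Note that indexDir must be a path the local Solr process can read, and
the two indexes need compatible schemas. For the second option, Nutch
(1.0+) can post straight to any reachable Solr URL via its solrindex
command, so the remote crawler could index into the local instance if
the firewall allows it.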
Any suggestions would be highly appreciated.
Dietrich Schmidt
http://www.linkedin.com/in/dietrichschmidt