Hi,
To crawl and index your web sites, you may want to have a look at
www.crawl-anywhere.com. It includes a web crawler, a document processing
pipeline and a Solr indexer.
Dominique
On 23/12/10 16:27, Dietrich wrote:
I want to use Solr to index two types of documents:
- local documents in Drupal (ca. 10M)
- a large number of web sites to be crawled through Nutch (ca. 100M)
Our data center does not have the necessary bandwidth to crawl all the
external sites, so we want to use a hosting provider to do the
crawling for us, but we want the actual serving of results to happen
locally.
It seems it would probably be easiest to delegate all the indexing to
a remote server and replicate those indexes to a slave in our data
center using built-in Solr replication, but then the indexing of our
internal sites would have to happen remotely too, which I would like
to avoid.
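As an aside: besides the polling interval configured in solrconfig.xml,
a slave can be told to pull from the master on demand through the
ReplicationHandler. A minimal SolrJ sketch, assuming Solr 1.4+ with the
handler registered at /replication; the host and core names below are
placeholders:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class FetchFromMaster {
        public static void main(String[] args) throws Exception {
            // Local slave core (placeholder URL and core name).
            SolrServer slave =
                new CommonsHttpSolrServer("http://localhost:8983/solr/web");

            // Ask the slave's ReplicationHandler to pull the newest
            // index version from the master named in its solrconfig.xml.
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("command", "fetchindex");
            QueryRequest req = new QueryRequest(params);
            req.setPath("/replication");
            req.process(slave);
        }
    }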
I think Hadoop/MapReduce would be overkill for this scenario, so what
other options are there?
I was considering
- using Solr's merge support to combine the Drupal & Nutch indexes
(sketched below)
- having Nutch post the crawled results directly to the local Solr
index
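For the first option, if the crawled index is shipped over as plain
index files, the CoreAdmin mergeindexes action can fold it into an
existing core. A rough SolrJ sketch, again assuming Solr 1.4+; the
target core name and index path are placeholders:

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
    import org.apache.solr.client.solrj.request.QueryRequest;
    import org.apache.solr.common.params.ModifiableSolrParams;

    public class MergeCrawledIndex {
        public static void main(String[] args) throws Exception {
            // CoreAdmin lives at the container root, not under a core.
            SolrServer admin =
                new CommonsHttpSolrServer("http://localhost:8983/solr");

            // Merge an on-disk Nutch-built index into the "drupal"
            // core (core name and index path are placeholders).
            ModifiableSolrParams params = new ModifiableSolrParams();
            params.set("action", "mergeindexes");
            params.set("core", "drupal");
            params.set("indexDir", "/data/nutch/solr-index");
            QueryRequest req = new QueryRequest(params);
            req.setPath("/admin/cores");
            req.process(admin);
        }
    }

Note that indexDir must be a path the local Solr process can read, and
the two indexes need compatible schemas. For the second option, Nutch
(1.0+) can post straight to any reachable Solr URL via its solrindex
command, so the remote crawler could index into the local instance if
the firewall allows it.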
Any suggestions would be highly appreciated.
Dietrich Schmidt
http://www.linkedin.com/in/dietrichschmidt