I want to use Solr to index two types of documents:

- local documents in Drupal (ca. 10M)
- a large number of web sites to be crawled through Nutch (ca. 100M)
Our data center does not have the necessary bandwidth to crawl all the external sites, so we want to use a hosting provider to do the crawling for us, but we want the actual serving of results to happen locally.

It would probably be easiest to delegate all the indexing to a remote server and replicate those indexes to a slave in our data center using built-in Solr replication, but then the indexing of our internal sites would have to happen remotely too, which I would like to avoid. I think Hadoop/MapReduce would be overkill for this scenario, so what other options are there? I was considering:

- using Solr index merging to merge the Drupal & Nutch indexes
- having Nutch post the crawled results to the local Solr index

Any suggestions would be highly appreciated.

Dietrich Schmidt
http://www.linkedin.com/in/dietrichschmidt
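P.S. For the replication variant, the slave-side setup I have in mind would be roughly the standard Solr ReplicationHandler config in solrconfig.xml on our local server (hostname and poll interval are just placeholders):

```xml
<!-- Local slave pulling the index built on the remote/hosted master.
     "remote-master" and the 5-minute poll interval are example values. -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://remote-master:8983/solr/replication</str>
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>
```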
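P.P.S. For the second variant, I believe Nutch can post its crawl results directly into a Solr instance with the solrindex job, something along these lines (the Solr URL and crawl paths are examples, assuming the local Solr is reachable from the crawling host):

```
# Index the crawled segments into our local Solr instance.
# "crawl/crawldb", "crawl/linkdb", and "crawl/segments/*" are example
# paths from a standard Nutch crawl directory layout.
bin/nutch solrindex http://our-local-solr:8983/solr/ \
    crawl/crawldb crawl/linkdb crawl/segments/*
```

The open question for me is whether pushing 100M documents over the wire this way is more or less expensive than shipping the finished index via replication.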