Merging the indexes seems problematic. It's easy enough to do
mechanically, but I'm not sure it would produce the results you want. And it
presupposes that your schemas are identical (or at least compatible)
between the crawled data and your local data, which I wonder about...

Instead, I'd think about cores. Cores can be thought of as separate
virtual Solr indexes served by a single Solr instance. I'd guess that
your requirements for handling the crawled data are different enough
from those for the local documents that this might be what you want to do anyway.

Federating these would probably involve issuing two queries and
manually integrating the results, though.
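As a rough sketch of what that integration could look like (the core
URLs and the query are assumptions, and the naive sort glosses over the
fact that scores from separate cores are not directly comparable):

  import json
  import urllib.parse
  import urllib.request

  def query_core(core_url, q, rows=10):
      # Hit one core's select handler and return the parsed docs,
      # asking for the score alongside the stored fields.
      params = urllib.parse.urlencode(
          {'q': q, 'rows': rows, 'fl': '*,score', 'wt': 'json'})
      with urllib.request.urlopen('%s/select?%s' % (core_url, params)) as resp:
          return json.load(resp)['response']['docs']

  # Hypothetical core URLs; adjust host, port, and core names to your setup.
  local_docs = query_core('http://localhost:8983/solr/drupal', 'solr')
  crawled_docs = query_core('http://localhost:8983/solr/nutch', 'solr')

  # Naive integration: concatenate and re-sort by score. A real
  # federation layer would need its own relevance strategy.
  merged = sorted(local_docs + crawled_docs,
                  key=lambda d: d.get('score', 0.0),
                  reverse=True)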

Best
Erick

On Thu, Dec 23, 2010 at 10:27 AM, Dietrich <diet...@gmail.com> wrote:

> I want to use Solr to index two types of documents:
> - local documents in Drupal (ca. 10M)
> - a large number of web sites to be crawled through Nutch (ca. 100M)
>
> Our data center does not have the necessary bandwidth to crawl all the
> external sites, and we want to use a hosting provider to do the
> crawling for us, but we want the actual serving of results to happen
> locally.
> It seems as if it would probably be easiest to delegate all the
> indexing to a remote server and replicate those indexes to a slave in
> our data center using built-in Solr replication, but then the indexing
> of our internal sites would have to happen remotely, too, which I
> would like to avoid.
>
> I think Hadoop/MapReduce would be overkill for this scenario, so what
> other options are there?
> I was considering
> - using Solr merge to merge the Drupal & Nutch indexes
> - having Nutch post the crawled results to the local Solr index
>
> Any suggestions would be highly appreciated.
>
> Dietrich Schmidt
> http://www.linkedin.com/in/dietrichschmidt
>
