On 2010-09-04 19:53, MitchK wrote:

Hi,

this topic started a few months ago; however, there are some questions on
my side that I couldn't answer by looking at the SOLR-1301 issue or the
wiki pages.

Let me try to explain my thoughts:
Given: a Hadoop cluster, a Solr search cluster, and Nutch as a crawling
engine that also performs LinkRank and webgraph-related tasks.

Once a list of documents is created by Nutch, you put the list plus the
LinkRank values etc. into a Solr+Hadoop job, as described in SOLR-1301,
to index or reindex the given documents.

There is no out-of-the-box integration between Nutch and SOLR-1301, so there is a step that you omitted from this chain... e.g. "export from Nutch segments to CSV".
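
If it helps, here's a rough, hypothetical sketch of that missing export step. The class name, CSV layout and escaping are my own invention; the real parts are that a segment's parse_text directory holds <url, ParseText> pairs, and that a map-only Hadoop job can dump them as CSV:

  import java.io.IOException;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
  import org.apache.nutch.parse.ParseText;

  public class SegmentToCsv {

    public static class ExportMapper
        extends Mapper<Text, ParseText, NullWritable, Text> {
      private final Text line = new Text();

      @Override
      protected void map(Text url, ParseText parse, Context context)
          throws IOException, InterruptedException {
        // Flatten newlines and escape quotes so each document stays on one CSV line.
        String id = url.toString().replace("\"", "\"\"");
        String text = parse.getText()
            .replace("\r", " ").replace("\n", " ").replace("\"", "\"\"");
        line.set("\"" + id + "\",\"" + text + "\"");
        context.write(NullWritable.get(), line);
      }
    }

    public static void main(String[] args) throws Exception {
      Job job = new Job(new Configuration(), "nutch-segment-to-csv");
      job.setJarByClass(SegmentToCsv.class);
      job.setMapperClass(ExportMapper.class);
      job.setNumReduceTasks(0); // map-only: one CSV part file per map task
      job.setInputFormatClass(SequenceFileInputFormat.class);
      job.setOutputFormatClass(TextOutputFormat.class);
      job.setOutputKeyClass(NullWritable.class);
      job.setOutputValueClass(Text.class);
      // args[0] = segment dir, args[1] = CSV output dir
      SequenceFileInputFormat.addInputPath(job, new Path(args[0], "parse_text"));
      TextOutputFormat.setOutputPath(job, new Path(args[1]));
      System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
  }

The resulting part files would then become the input of the SOLR-1301 indexing job.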


When the shards are built, they will be sent over the network to the
Solr search cluster.
Is this description correct?

Not really. SOLR-1301 doesn't deal with how you deploy the results of indexing - it simply creates the index data as shards on HDFS, and serving that data is up to you...
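
The deploy step is therefore yours to script. A minimal sketch of pulling a finished shard down from HDFS into a local core's data directory - all paths and the core layout here are assumptions:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class FetchShard {
    public static void main(String[] args) throws Exception {
      FileSystem hdfs = FileSystem.get(new Configuration());
      // args[0]: shard index on HDFS,  e.g. /user/indexer/out/part-00000/data/index
      // args[1]: local core data dir,  e.g. /opt/solr/cores/shard0/data/index
      hdfs.copyToLocalFile(new Path(args[0]), new Path(args[1]));
      // Reload the core (or restart Solr) afterwards so the new index is opened.
    }
  }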


What makes me wonder is this:
Assume I have a document X on machine Y in shard Y...
When I reindex that document X together with lots of other documents that
are or are not present in shard Y, and I put the resulting shard on a
machine Z, how does machine Y notice that it has an older version of
document X than machine Z?

Furthermore, assume that shard Y was replicated to three other machines;
how do they all notice that their version of document X is not the newest
available one?
In such an environment we do not have a master (right?), so: how do we
keep the index as consistent as possible?

It's not possible to do it like this, at least for now...

Looking into the future: eventually, when SolrCloud arrives, we will be able to index straight to a SolrCloud cluster, assigning documents to shards through a hashing schema (e.g. 'md5(docId) % numShards'). Since shards would be created in a consistent way, newer versions of documents would end up in the same shards and replace the older versions of the same documents - thus the problem would be solved. An additional benefit of this model is that it's not a disruptive, copy-intensive operation like SOLR-1301 (where you have to create new indexes, deploy them and switch) but a regular online update, which Solr already supports.
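
The hashing schema itself is nearly a one-liner - just to illustrate the 'md5(docId) % numShards' idea (this is not actual SolrCloud code):

  import java.math.BigInteger;
  import java.security.MessageDigest;

  public class ShardHash {
    // md5(docId) % numShards, as mentioned above.
    static int shardFor(String docId, int numShards) throws Exception {
      byte[] digest = MessageDigest.getInstance("MD5").digest(docId.getBytes("UTF-8"));
      // Treat the 128-bit digest as a non-negative integer before taking the modulo.
      return new BigInteger(1, digest).mod(BigInteger.valueOf(numShards)).intValue();
    }

    public static void main(String[] args) throws Exception {
      // The same docId always lands in the same shard, so a reindexed document
      // overwrites its older version instead of ending up on another machine.
      System.out.println(shardFor("http://example.com/page", 4));
    }
  }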

Once this is in place, we can modify Nutch to send documents directly to a SolrCloud cluster. Until then, you need to build and deploy indexes more or less manually (or using Katta, but again Katta is not integrated with Nutch).
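
On the Nutch side that would boil down to a plain SolrJ update, something like this sketch (field names invented for the example; a SolrCloud cluster would route each document to the right shard by itself):

  import org.apache.solr.client.solrj.SolrServer;
  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class PushToSolr {
    public static void main(String[] args) throws Exception {
      SolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");

      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "http://example.com/page"); // example field names only
      doc.addField("content", "...page text...");

      server.add(doc);    // a regular online update - no shard rebuild, no copying
      server.commit();
    }
  }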

SolrCloud is not far away from hitting the trunk (right, Mark? ;) ), so medium-term I think this is your best bet.

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
