On 2010-09-04 19:53, MitchK wrote:
> Hi,
> this topic started a few months ago; however, there are some questions on
> my side that I couldn't answer by looking at the SOLR-1301 issue or the
> wiki pages.
> Let me try to explain my thoughts:
> Given: a Hadoop cluster, a Solr search cluster, and Nutch as a crawling
> engine that also performs LinkRank and webgraph-related tasks.
> Once a list of documents is created by Nutch, you put the list plus the
> LinkRank values etc. into a Solr+Hadoop job, as described in SOLR-1301,
> to index or reindex the given documents.
There is no out-of-the-box integration between Nutch and SOLR-1301, so
there is some step that you omitted from this chain, e.g. "export from
Nutch segments to CSV".
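For illustration only (this is not actual SOLR-1301 or Nutch code, and
the field names id/url/title/linkrank are made up), the glue step can be
as small as a converter from each exported CSV line to a
SolrInputDocument that the indexing job then consumes:

  import org.apache.solr.common.SolrInputDocument;

  public class CsvToSolrDoc {
    // Turn one exported line "docId,url,title,linkrank" into a document
    // that a SOLR-1301-style indexing job could consume.
    public static SolrInputDocument parse(String csvLine) {
      String[] cols = csvLine.split(",", 4); // naive; real CSV needs quoting rules
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", cols[0]);
      doc.addField("url", cols[1]);
      doc.addField("title", cols[2]);
      doc.addField("linkrank", Float.parseFloat(cols[3]));
      return doc;
    }
  }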
> When the shards are built, they will be sent over the network to the
> Solr search cluster.
> Is this description correct?
Not really. SOLR-1301 doesn't deal with how you deploy the results of
indexing - it simply creates the index shards on HDFS and leaves serving
the data to you...
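To make that gap concrete: the deploy step is something you script
yourself, e.g. pulling a finished shard off HDFS onto a search node with
the plain Hadoop FileSystem API (the paths below are placeholders, and
you still have to point a Solr core at the copied index and reload it):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class DeployShard {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      // Copy one finished shard from HDFS into a core's data directory.
      fs.copyToLocalFile(new Path("/user/nutch/indexes/shard-00000"),
                         new Path("/opt/solr/cores/shard0/data/index"));
    }
  }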
> What makes me wonder is this:
> Assume I have a document X on machine Y in shard Y...
> When I reindex that document X together with lots of other documents that
> are present or not present in shard Y, and I put the resulting shard on a
> machine Z, how does machine Y notice that it has an older version of
> document X than machine Z?
> Furthermore: go on and assume that shard Y was replicated to three other
> machines; how do they all notice that their version of document X is not
> the newest available one?
> In such an environment we do not have a master (right?), so: how do we
> keep the index as consistent as possible?
It's not possible to do it like this, at least for now...
Looking into the future: eventually, when SolrCloud arrives, we will be
able to index straight to a SolrCloud cluster, assigning documents to
shards through a hashing scheme (e.g. 'md5(docId) % numShards'). Since
shards would be created in a consistent way, newer versions of documents
would end up in the same shards and replace the older versions of the
same documents - thus the problem would be solved.
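Just to illustrate such deterministic routing (a sketch only - the
actual hashing in SolrCloud may end up looking different):

  import java.math.BigInteger;
  import java.security.MessageDigest;

  public class ShardRouter {
    // The same docId always maps to the same shard, so a reindexed
    // document replaces its older version instead of duplicating it
    // in some other shard.
    public static int shardFor(String docId, int numShards) throws Exception {
      byte[] hash = MessageDigest.getInstance("MD5")
          .digest(docId.getBytes("UTF-8"));
      // Treat the digest as a non-negative integer, then take the modulus.
      return new BigInteger(1, hash).mod(BigInteger.valueOf(numShards)).intValue();
    }
  }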
An additional benefit of this model is that it's not a disruptive and
copy-intensive operation like SOLR-1301 (where you have to create new
indexes, deploy them, and switch), but rather a regular online update
that is already supported in Solr.
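That "regular online update" is just the standard SolrJ call, e.g. (the
URL is a placeholder):

  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class OnlineUpdate {
    public static void main(String[] args) throws Exception {
      CommonsHttpSolrServer solr =
          new CommonsHttpSolrServer("http://localhost:8983/solr");
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-X");
      doc.addField("title", "reindexed version");
      // Adding a document with an existing uniqueKey replaces the old one.
      solr.add(doc);
      solr.commit();
    }
  }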
Once this is in place, we can modify Nutch to send documents directly to
a SolrCloud cluster. Until then, you need to build and deploy indexes
more or less manually (or using Katta, but again Katta is not integrated
with Nutch).
SolrCloud is not far away from hitting the trunk (right, Mark? ;) ), so
medium-term I think this is your best bet.
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com