On 2010-09-04 19:53, MitchK wrote:
> Hi,
> this topic started a few months ago; however, there are some questions on
> my side that I couldn't answer by looking at the SOLR-1301 issue or the
> wiki pages.
> Let me try to explain my thoughts:
> Given: a Hadoop cluster, a Solr search cluster, and Nutch as a crawling
> engine that also performs LinkRank and webgraph-related tasks.
> Once a list of documents is created by Nutch, you put the list plus the
> LinkRank values etc. into a Solr+Hadoop job, as described in SOLR-1301,
> to index or reindex the given documents.
There is no out-of-the-box integration between Nutch and SOLR-1301, so
there is some step that you omitted from this chain, e.g. "export from
Nutch segments to CSV".
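For illustration only (this is not actual SOLR-1301 or Nutch code, and
the field names id/url/title/linkrank are made up), the glue step can be
as small as a converter from each exported CSV line to a
SolrInputDocument that the indexing job then consumes:

  import org.apache.solr.common.SolrInputDocument;

  public class CsvToSolrDoc {
    // Turn one exported line "docId,url,title,linkrank" into a document
    // that a SOLR-1301-style indexing job could consume.
    public static SolrInputDocument parse(String csvLine) {
      String[] cols = csvLine.split(",", 4); // naive; real CSV needs quoting rules
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", cols[0]);
      doc.addField("url", cols[1]);
      doc.addField("title", cols[2]);
      doc.addField("linkrank", Float.parseFloat(cols[3]));
      return doc;
    }
  }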
> When the shards are built, they will be sent over the network to the
> Solr search cluster.
> Is this description correct?
Not really. SOLR-1301 doesn't deal with how you deploy the results of
indexing - it simply creates the index shards on HDFS and leaves serving
the data to you...
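To make that gap concrete: the deploy step is something you script
yourself, e.g. pulling a finished shard off HDFS onto a search node with
the plain Hadoop FileSystem API (the paths below are placeholders, and
you still have to point a Solr core at the copied index and reload it):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class DeployShard {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(new Configuration());
      // Copy one finished shard from HDFS into a core's data directory.
      fs.copyToLocalFile(new Path("/user/nutch/indexes/shard-00000"),
                         new Path("/opt/solr/cores/shard0/data/index"));
    }
  }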
> What makes me wonder is this:
> Assume I have a document X on machine Y in shard Y...
> When I reindex that document X together with lots of other documents that
> are present or not present in shard Y, and I put the resulting shard on a
> machine Z, how does machine Y notice that it has an older version of
> document X than machine Z?
> Furthermore: go on and assume that shard Y was replicated to three other
> machines; how do they all notice that their version of document X is not
> the newest available one?
> In such an environment we do not have a master (right?), so: how do we
> keep the index as consistent as possible?
It's not possible to do it like this, at least for now...
Looking into the future: eventually, when SolrCloud arrives, we will be
able to index straight to a SolrCloud cluster, assigning documents to
shards through a hashing scheme (e.g. 'md5(docId) % numShards'). Since
shards would be created in a consistent way, newer versions of documents
would end up in the same shards and replace the older versions of the
same documents - thus the problem would be solved.
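Just to illustrate such deterministic routing (a sketch only - the
actual hashing in SolrCloud may end up looking different):

  import java.math.BigInteger;
  import java.security.MessageDigest;

  public class ShardRouter {
    // The same docId always maps to the same shard, so a reindexed
    // document replaces its older version instead of duplicating it
    // in some other shard.
    public static int shardFor(String docId, int numShards) throws Exception {
      byte[] hash = MessageDigest.getInstance("MD5")
          .digest(docId.getBytes("UTF-8"));
      // Treat the digest as a non-negative integer, then take the modulus.
      return new BigInteger(1, hash).mod(BigInteger.valueOf(numShards)).intValue();
    }
  }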
An additional benefit of this model is that it's not a disruptive and
copy-intensive operation like SOLR-1301 (where you have to create new
indexes, deploy them, and switch), but rather a regular online update
that is already supported in Solr.
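That "regular online update" is just the standard SolrJ call, e.g. (the
URL is a placeholder):

  import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
  import org.apache.solr.common.SolrInputDocument;

  public class OnlineUpdate {
    public static void main(String[] args) throws Exception {
      CommonsHttpSolrServer solr =
          new CommonsHttpSolrServer("http://localhost:8983/solr");
      SolrInputDocument doc = new SolrInputDocument();
      doc.addField("id", "doc-X");
      doc.addField("title", "reindexed version");
      // Adding a document with an existing uniqueKey replaces the old one.
      solr.add(doc);
      solr.commit();
    }
  }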
Once this is in place, we can modify Nutch to send documents directly to
a SolrCloud cluster. Until then, you need to build and deploy indexes
more or less manually (or using Katta, but again Katta is not integrated
with Nutch).
SolrCloud is not far away from hitting the trunk (right, Mark? ;) ), so
medium-term I think this is your best bet.
--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com