Venkatesh Seetharam wrote:
Hi Tim,
Howdy. I saw your post on the Solr newsgroup and it caught my attention.
I'm working on a similar problem: searching a vault of over 100 million
XML documents. I already have the encoding part done using Hadoop and
Lucene, and it works like a charm. I create N index partitions and have
been trying to wrap Solr to search each partition, with a search broker
that merges the results and returns them.
I'm curious how you have solved the distribution of additions,
deletions and updates to each of the indexing servers. I use a
partitioner based on a hash of the document id. Do you broadcast to the
slaves to tell them which server owns a document?
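For what it's worth, a minimal sketch of that kind of hash-based partitioner (the hash function, partition count, and ids here are illustrative, not from either of our setups):

```python
import hashlib

NUM_PARTITIONS = 4  # illustrative; stands in for the N index partitions

def partition_for(doc_id: str) -> int:
    """Map a document id to a fixed index partition via a stable hash."""
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS

# Any node can compute the owner locally from the id alone,
# so in principle no broadcast is needed:
print(partition_for("doc-12345"))
```

The appeal is that ownership is a pure function of the id, so updates and deletes can be routed without any lookup or broadcast.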
Also, I'm looking at Hadoop RPC and ICE (www.zeroc.com) for
distributing the search across these Solr servers; I'm not using HTTP.
Any ideas are greatly appreciated.
PS: I subscribed to the Solr newsgroup just now but did not receive a
confirmation, hence I'm sending this to you directly.
--
Thanks,
Venkatesh
"Perfection (in design) is achieved not when there is nothing more to
add, but rather when there is nothing more to take away."
- Antoine de Saint-Exupéry
I used a SQL database to keep track of which server had which document.
Originally I used JMS with a selector for the server number each
document should go to. I switched over to a home-grown, lightweight
message server, since JMS behaves really badly when it backs up and I
couldn't find a server that would simply pause the producers if there
was a problem with the consumers. Additions are pretty much assigned
randomly to whichever server gets them first. At this point I am up to
around 20 million documents.
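A rough sketch of that routing logic (the schema, server ids, and direct calls are hypothetical stand-ins; the real system went through JMS and later the home-grown message server):

```python
import random
import sqlite3

# Hypothetical schema standing in for the SQL ownership table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE doc_owner (doc_id TEXT PRIMARY KEY, server INTEGER)")

SERVERS = [0, 1, 2]  # illustrative server ids

def route(doc_id: str) -> int:
    """Return the server that owns doc_id, assigning one at random for new docs."""
    row = conn.execute(
        "SELECT server FROM doc_owner WHERE doc_id = ?", (doc_id,)
    ).fetchone()
    if row is not None:
        return row[0]  # updates and deletes go to the existing owner
    server = random.choice(SERVERS)  # additions land wherever first
    conn.execute("INSERT INTO doc_owner VALUES (?, ?)", (doc_id, server))
    return server

first = route("doc-1")
assert route("doc-1") == first  # later updates hit the same server
```

The trade-off against the hash approach is clear: random assignment needs the lookup table, but it doesn't tie you to a fixed partition count.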
The hash idea sounds really interesting, and if I had a fixed number of
indexes it would be perfect. But I don't know how big the index will
grow, and I wanted to be able to add servers at any point. I would also
like to eliminate the outside dependencies (SQL, JMS), which is why a
distributed Solr would let me focus on other areas.
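One standard answer to that growth problem (not something either of us mentioned, so purely a suggestion) is consistent hashing: put the servers on a hash ring so that adding a server only remaps a small fraction of documents, rather than nearly all of them as plain modulo hashing does. A minimal sketch:

```python
import bisect
import hashlib

def _h(key: str) -> int:
    """Stable 64-bit hash of a string key."""
    return int.from_bytes(hashlib.md5(key.encode("utf-8")).digest()[:8], "big")

class ConsistentHashRing:
    """Hash ring: adding a server only remaps a fraction of the documents."""

    def __init__(self, servers, replicas=100):
        # Each server gets several virtual points on the ring for balance.
        self._ring = sorted(
            (_h(f"{s}#{i}"), s) for s in servers for i in range(replicas)
        )
        self._keys = [k for k, _ in self._ring]

    def owner(self, doc_id: str) -> str:
        """First server point clockwise from the document's hash."""
        idx = bisect.bisect(self._keys, _h(doc_id)) % len(self._keys)
        return self._ring[idx][1]

ring = ConsistentHashRing(["solr-1", "solr-2", "solr-3"])
print(ring.owner("doc-12345"))  # stable until the ring membership changes
```

With the ring, growing from N to N+1 servers moves roughly 1/(N+1) of the documents, so you could drop the SQL lookup without freezing the server count.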
How did you work around not being able to update a Lucene index that is
stored in Hadoop? I know there were changes in Lucene 2.1 to support
this, but I haven't looked that far into it yet; I've just been testing
the new IndexWriter. As an aside, I hope those features can be used by
Solr soon (if they aren't already in the nightlies).
Tim