I have several indexes now (4 at the moment, 20GB each), and I want to be
able to drop in a new machine easily. I'm using SQL Server as a DB and
it scales well. The DB doesn't get hit too hard, mostly doing location
lookups, and the app does some checking to make sure a document has
really changed before updating it in the DB or the index. When a
new server is added it randomly picks up additions from the message
server (it's approximately round-robin), and the rest of the system
really doesn't even need to know about it.
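For what it's worth, that change check could be sketched roughly like this (names are hypothetical; it assumes the last-indexed content digest is kept alongside the document's location row, here stood in for by a map):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

/** Skips DB/index updates for documents whose content has not really changed. */
public class ChangeChecker {
    // Stands in for the SQL table mapping docId -> last-indexed content digest.
    private final Map<String, String> storedDigests = new HashMap<>();

    /** Returns true if the document differs from what was last indexed. */
    public boolean hasReallyChanged(String docId, String content) {
        String digest = sha1Hex(content);
        String previous = storedDigests.put(docId, digest);
        return !digest.equals(previous);
    }

    static String sha1Hex(String content) {
        try {
            byte[] hash = MessageDigest.getInstance("SHA-1")
                    .digest(content.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder();
            for (byte b : hash) sb.append(String.format("%02x", b));
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-1 is always available
        }
    }
}
```

Comparing digests rather than full documents keeps the pre-update check cheap, and the index only churns when content actually changed.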
I've realized partitioned indexing is a difficult but solvable problem.
It could be a big project though. I mean we have all solved it in our
own way but no one has a general solution. Distributed searching might
be a better area to add to Solr since that should basically be the same
for everyone. I'm going to mess around with Jini on my own indexes,
there's finally a new book out to go with the newer versions.
How were you planning on using Solr with Hadoop? Maybe I don't fully
understand how Hadoop works.
Tim
Venkatesh Seetharam wrote:
Hi Tim,
Thanks for your response. Interesting idea. Does the DB scale? Do you have
one single index which you plan to use Solr for, or do you have multiple
indexes?
> But I don't know how big the index will grow and I wanted to be able to
> add servers at any point.
I'm thinking of having N partitions with a max of 10 million documents per
partition. Adding a server should not be a problem, but the newly added
server would take time to grow so that the distribution of documents is
equal across the cluster. I've tested with 50 million documents of size 10
each and it looks very promising.
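That grow-to-catch-up behavior could be sketched like this (hypothetical names, and a cap scaled down from the 10 million mentioned above). Note this balances raw counts instead of using the hash partitioner, so treat it as an illustration of the growth behavior, not the actual scheme:

```java
import java.util.ArrayList;
import java.util.List;

/** Assigns new documents to the least-full partition, capped per partition. */
public class CappedPartitioner {
    private final long maxDocsPerPartition;
    private final List<Long> docCounts = new ArrayList<>();

    public CappedPartitioner(int initialPartitions, long maxDocsPerPartition) {
        this.maxDocsPerPartition = maxDocsPerPartition;
        for (int i = 0; i < initialPartitions; i++) docCounts.add(0L);
    }

    /** Picks the least-full partition; grows the cluster when all are at the cap. */
    public int assign() {
        int target = 0;
        for (int i = 1; i < docCounts.size(); i++) {
            if (docCounts.get(i) < docCounts.get(target)) target = i;
        }
        if (docCounts.get(target) >= maxDocsPerPartition) {
            docCounts.add(0L);            // "add a server at any point"
            target = docCounts.size() - 1;
        }
        docCounts.set(target, docCounts.get(target) + 1);
        return target;
    }

    public int partitionCount() { return docCounts.size(); }
}
```

A freshly added partition starts empty, so it absorbs most new documents until it catches up with the rest of the cluster, which matches the "takes time to grow" behavior.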
> The hash idea sounds really interesting and if I had a fixed number of
> indexes it would be perfect.
I'm in fact looking around for a reverse-hash algorithm where, given a
docId, I should be able to find which partition contains the document, so I
can save the cycles spent broadcasting to the slaves.
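If the partitioner is a deterministic function of the docId (as with the hash partitioner mentioned below), no reverse lookup is needed: any node can just recompute the hash to find the owning partition. A minimal sketch (hypothetical class; plain mod-N):

```java
/**
 * With a deterministic hash partitioner, "which partition owns this doc?"
 * is answered by recomputing the hash -- no broadcast to the slaves needed.
 */
public final class HashPartitioner {
    private final int numPartitions;

    public HashPartitioner(int numPartitions) {
        this.numPartitions = numPartitions;
    }

    /** Used both when routing an add/update/delete and when locating a doc. */
    public int partitionFor(String docId) {
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(docId.hashCode(), numPartitions);
    }
}
```

The catch is the one already raised in this thread: plain mod-N remaps almost every document when N changes. Consistent hashing is the usual fix, since it only remaps roughly 1/N of the documents when a server is added, which is the standard answer to "add servers at any point."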
I mean, even if you use a DB, how have you solved the problem of
distribution when a new server is added into the mix?
We have the same problem since we get daily updates to documents and
document metadata.
> How did you work around not being able to update a Lucene index that is
> stored in Hadoop?
I do not use HDFS. I use a NetApp mounted on all the nodes in the cluster
and hence did not need any change to Lucene.
I plan to index using Lucene/Hadoop and use Solr as the partition searcher
and a broker which would merge the results and return 'em.
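The broker's merge step could be sketched as a k-way merge over per-partition hit lists (hypothetical names; it assumes each partition returns its hits already sorted by descending score):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Merges per-partition result lists into a single top-k list by score. */
public class ResultBroker {
    public record Hit(String docId, float score) {}

    /** Each partition's list must already be sorted by descending score. */
    public static List<Hit> mergeTopK(List<List<Hit>> perPartition, int k) {
        // Max-heap over the head of each partition's list: classic k-way merge.
        record Cursor(List<Hit> hits, int pos) {}
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
                Comparator.comparingDouble(
                        (Cursor c) -> c.hits().get(c.pos()).score()).reversed());
        for (List<Hit> hits : perPartition) {
            if (!hits.isEmpty()) heap.add(new Cursor(hits, 0));
        }
        List<Hit> merged = new ArrayList<>();
        while (merged.size() < k && !heap.isEmpty()) {
            Cursor c = heap.poll();
            merged.add(c.hits().get(c.pos()));
            if (c.pos() + 1 < c.hits().size()) heap.add(new Cursor(c.hits(), c.pos() + 1));
        }
        return merged;
    }
}
```

Since each partition only needs to return its own top k, the broker never materializes more than numPartitions * k hits per query.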
Thanks,
Venkatesh
On 3/5/07, Tim Patton <[EMAIL PROTECTED]> wrote:
Venkatesh Seetharam wrote:
> Hi Tim,
>
> Howdy. I saw your post on the Solr newsgroup and it caught my attention. I'm
> working on a similar problem for searching a vault of over 100 million
> XML documents. I already have the encoding part done using Hadoop and
> Lucene. It works like a charm. I create N index partitions and have
> been trying to wrap Solr to search each partition, have a Search broker
> that merges the results and returns.
>
> I'm curious about how you have solved the distribution of additions,
> deletions and updates to each of the indexing servers. I use a
> partitioner based on a hash of the document id. Do you broadcast to the
> slaves as to who owns a document?
>
> Also, I'm looking at Hadoop RPC and ICE ( www.zeroc.com
> <http://www.zeroc.com>) for distributing the search across these Solr
> servers. I'm not using HTTP.
>
> Any ideas are greatly appreciated.
>
> PS: I did subscribe to solr newsgroup now but did not receive a
> confirmation and hence sending it to you directly.
>
> --
> Thanks,
> Venkatesh
>
> "Perfection (in design) is achieved not when there is nothing more to
> add, but rather when there is nothing more to take away."
> - Antoine de Saint-Exupéry
I used a SQL database to keep track of which server had which document.
Then I originally used JMS and would use a selector for which server
number the document should go to. I switched over to a home-grown,
lightweight message server, since JMS behaves really badly when it backs
up and I couldn't find a server that would simply pause the producers if
there was a problem with the consumers. Additions are pretty much
assigned randomly to whichever server gets them first. At this point I
am up to around 20 million documents.
The hash idea sounds really interesting and if I had a fixed number of
indexes it would be perfect. But I don't know how big the index will
grow and I wanted to be able to add servers at any point. I would like
to eliminate any outside dependencies (SQL, JMS), which is why a
distributed Solr would let me focus on other areas.
How did you work around not being able to update a Lucene index that is
stored in Hadoop? I know there were changes in Lucene 2.1 to support
this but I haven't looked that far into it yet, I've just been testing
the new IndexWriter. As an aside, I hope those features can be used by
Solr soon (if they aren't already in the nightlies).
Tim