Hi Jed,

Thanks for sharing your thoughts and the link.
Venkatesh

On 3/11/07, Jed Reynolds <[EMAIL PROTECTED]> wrote:

Venkatesh Seetharam wrote:
> The hash idea sounds really interesting and if I had a fixed number of
> indexes it would be perfect.
>
> I'm in fact looking around for a reverse-hash algorithm where, given a
> docId, I should be able to find which partition contains the document,
> so I can save cycles on broadcasting to slaves.

Many large databases partition their data either by load or in another
logical manner, like by alphabet. I hear that Hotmail, for instance,
partitions its users alphabetically. Having a broker will certainly
abstract this mechanism, and of course your application(s) will want to
be able to bypass a broker when necessary.

> I mean, even if you use a DB, how have you solved the problem of
> distribution when a new server is added into the mix?

http://www8.org/w8-papers/2a-webserver/caching/paper2.html

I saw this link on the memcached list, and the thread surrounding it
certainly covered some similar ground. Some of the ideas discussed:
- high availability of memcached, via redundant entries
- scaling out clusters and facing the need to rebuild the entire cache
  on all nodes, depending on your bucketing.

I see some similarities between maintaining multiple indices/Lucene
partitions and having a memcached deployment: mostly, if you are
hashing your keys to partitions (or buckets, or machines), then you
might be faced with a) availability issues if there's a
machine/partition outage, and b) rebuilding partitions if adding a
partition/bucket changes the hash mapping.

The two ways I can think of to scale out new indexes: first, have your
application maintain two sets of bucket mappings from ids to indexes;
second, key your documents and partition them by date. The former
method would allow you to rebuild a second set of repartitioned indexes
and buckets, then update your application to use the new bucket mapping
once all the indexes have been rebuilt. The latter method would only
apply if you could organize your document ids by date and only added
new documents at the 'now' end, or evenly across most dates. You'd have
to add a new partition onto the end as time progressed, and would
rarely rebuild old indexes unless your documents grew unevenly.

Interesting topic! I don't yet need to run multiple Lucene partitions,
but I have a few memcached servers, and I expect that increasing their
number will force my site to take a performance hit as I am forced to
rebuild the caches. Similarly, if I had multiple Lucene partitions and
had to fission some of them, rebuilding the resulting partitions would
be time-intensive, and I'd want to have procedures in place for
availability, scaling out, and changing application code as necessary.
Just having one fail-over Solr index is so easy in comparison.

Jed
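
For concreteness, here is a minimal Java sketch of the
consistent-hashing scheme the paper above describes; the class and
method names (ConsistentHashRing, partitionFor) are hypothetical, not
from Lucene, Solr, or memcached. Each partition is hashed onto a ring
at several points, a docId belongs to the first partition clockwise
from its own hash, and adding or removing a partition only remaps the
keys adjacent to its points instead of reshuffling every bucket:

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.SortedMap;
    import java.util.TreeMap;

    // Minimal consistent-hash ring. Each partition occupies several
    // "virtual node" points on a ring of long hash values; a docId is
    // owned by the first partition at or clockwise from its hash.
    public class ConsistentHashRing {
        private final SortedMap<Long, String> ring =
            new TreeMap<Long, String>();
        private final int replicas; // virtual nodes per partition

        public ConsistentHashRing(int replicas) {
            this.replicas = replicas;
        }

        public void addPartition(String partition) {
            for (int i = 0; i < replicas; i++) {
                ring.put(hash(partition + "#" + i), partition);
            }
        }

        public void removePartition(String partition) {
            for (int i = 0; i < replicas; i++) {
                ring.remove(hash(partition + "#" + i));
            }
        }

        // The "reverse hash": given a docId, name the partition that
        // contains it, so a broker can query one slave instead of
        // broadcasting to all of them.
        public String partitionFor(String docId) {
            if (ring.isEmpty()) {
                throw new IllegalStateException("no partitions");
            }
            SortedMap<Long, String> tail = ring.tailMap(hash(docId));
            Long point = tail.isEmpty() ? ring.firstKey() : tail.firstKey();
            return ring.get(point);
        }

        private static long hash(String key) {
            try {
                MessageDigest md5 = MessageDigest.getInstance("MD5");
                byte[] d = md5.digest(key.getBytes(StandardCharsets.UTF_8));
                long h = 0; // fold the first 8 digest bytes into a long
                for (int i = 0; i < 8; i++) {
                    h = (h << 8) | (d[i] & 0xff);
                }
                return h;
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        }
    }

This is point (b) above addressed at the root: because only neighboring
keys move when the ring changes, a partition outage or a newly added
machine disturbs a small slice of the mapping rather than forcing a
rebuild of every partition.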
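
The first scale-out route, the two sets of bucket mappings, might look
like the following sketch (hypothetical names again; the modulo
bucketing is purely illustrative). Queries keep resolving through the
old mapping while the repartitioned indexes are rebuilt offline, and
the application cuts over in a single step once they are all ready:

    import java.util.concurrent.atomic.AtomicReference;

    // Hypothetical router holding the current docId -> index-name
    // mapping, swappable in one atomic step.
    public class BucketRouter {
        public interface Mapping {
            String indexFor(String docId);
        }

        // Simple modulo bucketing, purely for illustration.
        public static Mapping modulo(final int numIndexes) {
            return new Mapping() {
                public String indexFor(String docId) {
                    int h = docId.hashCode() % numIndexes;
                    if (h < 0) h += numIndexes; // hashCode may be negative
                    return "index-" + h;
                }
            };
        }

        private final AtomicReference<Mapping> current;

        public BucketRouter(Mapping initial) {
            current = new AtomicReference<Mapping>(initial);
        }

        public String indexFor(String docId) {
            return current.get().indexFor(docId);
        }

        // Call only after every repartitioned index has been rebuilt.
        public void cutOver(Mapping rebuilt) {
            current.set(rebuilt);
        }
    }

A deployment could serve reads through new
BucketRouter(BucketRouter.modulo(4)) while rebuilding a second set of
indexes against modulo(5) in the background; nothing changes for
readers until cutOver(...) is called.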
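
And the date-keyed route can be as small as deriving an index name from
each document's date, so only the partition at the 'now' end ever takes
new writes (again a hypothetical sketch):

    import java.text.SimpleDateFormat;
    import java.util.Date;

    // Hypothetical sketch: one index per month, named by date, so new
    // documents always land in the newest partition and old partitions
    // are rarely rebuilt.
    public class DatePartitioner {
        public static String partitionFor(Date docDate) {
            // e.g. a document dated 2007-03-11 lands in "index-2007-03"
            return "index-" + new SimpleDateFormat("yyyy-MM").format(docDate);
        }
    }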
Venkatesh Seetharam wrote: > >> The hash idea sounds really interesting and if I had a fixed number of > indexes it would be perfect. > I'm infact looking around for a reverse-hash algorithm where in given a > docId, I should be able to find which partition contains the document > so I > can save cycles on broadcasting slaves. Many large databases partition their data either by load or by another logical manner, like by alphabet. I hear that Hotmail, for instance, partitions its users alphabetically. Having a broker will certainly abstract this mechninism, and of course your application(s) want to be able to bypass a broker when necessary. > I mean, even if you use a DB, how have you solved the problem of > distribution when a new server is added into the mix. http://www8.org/w8-papers/2a-webserver/caching/paper2.html I saw this link on the memcached list and the thread surrounding it certainly covered some similar ground. Some ideas have been discussed like: - high availability of memcached, redundant entries - scaling out clusters and facing the need to rebuild the entire cache on all nodes depending on your bucketing. I see some similarties with maintaining multiple indicies/lucene partitions and having a memcache deployment: mostly if you are hashing your keys to partitions (or buckets or machines) then you might be faced with a) availability issues if there's a machine/partition outtage b) rebuilding partitions if adding a partition/bucket changes the hash mapping. The ways I can think of to scale-out new indexes would be to have your application maintain two sets of bucket mappings for ids to indexes, and the second would be to key your documents and partition them by date. The former method would allow you to rebuild a second set of repartitioned indexes and buckets and allow you to update your application to use the new bucket mapping (when all the indexes has been rebuilt). The latter method would only apply if you could organize your document ids by date and only added new documents to the 'now' end or evenly across most dates. You'd have to add a new partition onto the end as time progressed, and rarely rebuild old indexes unless your documents grow unevenly. Interesting topic! I don't yet need to run multiple Lucene partitions, but I have a few memcached servers and increasing the number of them I expect will force my site to take a performance accordingly as I am forced to rebuild the caches. I can see similarly if I had multiple lucene partitions, that if I had to fission some of them, rebuilding the resulting partitions would be time intensive and I'd want to have procedures in place for availibility, scaling out and changing application code as necessary. Just having one fail-over Solr index is just so easy in comparison. Jed