On Thu, 15 May 2008 09:23:03 -0700 "William Pierce" <[EMAIL PROTECTED]> wrote:
[...]
>
> Our app in brief: We get merchant sku files (in either xml/csv) which we
> process and index and make available to our site visitors to search. Our
> current plan calls for us to support approx 10,000 merchants each with an
> average of 50,000 sku's. This will make a total of approx 500 Million SKUs.
> In addition, we assume that on a daily basis approx 5-10% of the SKUs need
> to be updated (either added/deleted/modified). (Assume each sku will be
> approx 4K)
[...]
>
> b) Or, should we partition the 10,000 merchants into N buckets and have a
> master index for each of the N buckets? We could partition the merchants
> depending on their type or some other simple algorithm. Then, we could
> have slaves setup for each of the N masters. The trick here will be to
> partition the merchants carefully. Ideally we would like a search for any
> product type to hit only one index but this may not be possible always. For
> example, a search for "Harry Potter" may result in hits in "books", "dvds",
> "memorabilia", etc etc.
>
> With N masters we will have to plan for having a distributed search across
> the N indices (and then some mechanism for weighting the results across the
> results that come back). Any recommendations for a distributed search
> solution?

SOLR 1.3 supports it.

> I saw some references to Katta. Is this viable?

I was going to suggest that a MapReduce approach might help here, e.g. Hadoop
(or possibly some other distributed computing implementation). I didn't know
of Katta, thanks for the reference. It seems that Katta is a full-fledged
integration between a Lucene index and Hadoop. I am not sure where SOLR would
sit in this solution, and I have no idea how well developed Katta is.

> In the extreme case, we could have one master for each of the merchants (if
> there are 10000 merchants there will be 10,000 master indices). The
> advantage here is that indices will have to be updated only for every
> merchant who submits a new data file. The others remain unchanged.

Not sure about this... my gut feeling is that you'll be wasting lots of
resources on containers rather than on data...

Let us know what design you come up with :)

Cheers,
B
_________________________
{Beto|Norberto|Numard} Meijome

"When the Paris Exhibition closes electric light will close with it and no more be heard of."
   Erasmus Wilson (1878) Professor at Oxford University

I speak for myself, not my employer. Contents may be hot. Slippery when wet. Reading disclaimers makes you go blind. Writing them is worse. You have been Warned.
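
For what it's worth, here is a rough sketch of option (b) from the thread: assigning merchants to N buckets and merging hits that come back from the N indices, plus the back-of-envelope sizing from the quoted figures. Everything named here (`NUM_BUCKETS`, `bucket_for`, `merge_hits`, the hash-based assignment) is a hypothetical illustration, not something from the thread or from the Solr or Katta APIs. The merge also assumes that scores from different indices are directly comparable, which is generally not the case when each index computes its own term statistics.

```python
import hashlib
import heapq

NUM_BUCKETS = 16  # hypothetical N; the thread does not fix a value


def bucket_for(merchant_id: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a merchant to a bucket using a stable hash.

    hashlib is used instead of the built-in hash(), which is salted
    per process and so would not give a repeatable assignment.
    """
    digest = hashlib.md5(merchant_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets


def merge_hits(per_bucket_hits, limit=10):
    """Merge (score, doc_id) lists from each bucket into one top-`limit` list.

    Assumption: scores are comparable across buckets.
    """
    all_hits = (hit for hits in per_bucket_hits for hit in hits)
    return heapq.nlargest(limit, all_hits, key=lambda hit: hit[0])


# Back-of-envelope sizing from the figures quoted in the thread.
total_skus = 10_000 * 50_000                                      # 500 million SKUs
daily_updates = (total_skus * 5 // 100, total_skus * 10 // 100)   # 25M to 50M per day
raw_bytes = total_skus * 4 * 1024                                 # about 2 TB at ~4K per SKU
```

Hashing on merchant ID spreads merchants evenly but means a product search touches every bucket; partitioning by merchant type, as suggested in the thread, trades that evenness for the chance that some queries hit only one index.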