Bill,

Quick feedback:
1) Use 1.3-dev, or 1.3 when it comes out, not 1.2.

2) You did not mention Solr's distributed search functionality explicitly, so I get the feeling you are not aware of it. See the DistributedSearch page on the Solr wiki. (A rough SolrJ sketch of a sharded query is at the bottom of this message.)

3) You definitely don't want a single 500M-doc index that's 2 TB in size -- think about the ratio of index size to available RAM.

4) You can try logically sharding your index, but I suspect that will result in uneven term distribution that will not yield optimal relevancy-based ordering. Instead, you may have to assign records/documents to shards in some more random fashion; see the ML archives for some recent discussion of this (search for MD5 and SHA-1 -- Lance, want to put that on the Wiki?). A sketch of hash-based shard assignment is also at the bottom of this message.

5) Hardware recommendations are hard to make. While people may offer suggestions, the only way to know how *your* hardware behaves with *your* data, *your* shards, and *your* type of queries is by benchmarking.

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: William Pierce <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, May 15, 2008 12:23:03 PM
> Subject: Some advice on scalability
>
> Folks:
>
> We are building a search capability into our web site and plan to use Solr.
> While we have the initial prototype version up and running on Solr 1.2, we
> are now turning our attention to sizing/scalability.
>
> Our app in brief: we get merchant SKU files (in either XML or CSV) which we
> process, index, and make available for our site visitors to search. Our
> current plan calls for us to support approx 10,000 merchants, each with an
> average of 50,000 SKUs, for a total of approx 500 million SKUs.
>
> In addition, we assume that on a daily basis approx 5-10% of the SKUs will
> need to be updated (added/deleted/modified). (Assume each SKU is approx 4 KB.)
>
> Here are a few questions that we are thinking about and would value any
> insights you all may have:
>
> a) Should we have just one giant master index (containing all the SKUs) and
> then have multiple slaves to handle the search queries? In this case, the
> master index will be approx 2 TB in size. Not being an expert in
> Solr/Lucene, I am thinking it may be a bad idea to let one index become so
> large. What size limit should we assume for each index?
>
> b) Or should we partition the 10,000 merchants into N buckets and have a
> master index for each of the N buckets? We could partition the merchants by
> their type or some other simple algorithm, and then set up slaves for each
> of the N masters. The trick here will be to partition the merchants
> carefully. Ideally, a search for any product type would hit only one index,
> but this may not always be possible. For example, a search for "Harry
> Potter" may result in hits in "books", "dvds", "memorabilia", etc.
>
> With N masters we will have to plan for a distributed search across the N
> indices (and some mechanism for weighting and merging the results that come
> back). Any recommendations for a distributed search solution? I saw some
> references to Katta. Is this viable?
>
> In the extreme case, we could have one master per merchant (10,000
> merchants would mean 10,000 master indices). The advantage here is that an
> index would need to be updated only when its merchant submits a new data
> file; the others would remain unchanged.
> c) By the way, for those of you who have deployed Solr in a production
> environment: can you share your hardware configuration and the rough number
> of search queries per second that a single Solr instance can handle,
> assuming a dedicated box?
>
> d) Our plan is to release a beta version in Spring 2009. Should we plan on
> using Solr 1.2, or move to Solr 1.3 now?
>
> Any insights/thoughts/whitepapers will be greatly appreciated!
>
> Cheers,
>
> Bill
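
P.S. Re: 2) -- to make the distributed search point concrete, here is a minimal sketch of a sharded query issued through SolrJ against Solr 1.3's "shards" parameter. The hostnames host1/host2 and the query string are made up for illustration; verify the SolrJ class names against the 1.3 client you actually deploy.

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServer;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class DistributedSearchExample {
    public static void main(String[] args) throws Exception {
        // Send the query to any one node; it fans the request out to every
        // shard listed in the "shards" parameter and merges the results.
        SolrServer server = new CommonsHttpSolrServer("http://host1:8983/solr");

        SolrQuery query = new SolrQuery("harry potter");
        // Comma-separated shard list: host:port/path, without the http:// prefix.
        query.set("shards", "host1:8983/solr,host2:8983/solr");
        query.setRows(10);

        QueryResponse response = server.query(query);
        System.out.println("Total hits across shards: "
                + response.getResults().getNumFound());
    }
}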
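
P.S. Re: 4) -- the MD5/SHA-1 idea from the list discussion boils down to hashing a stable unique key and taking the digest modulo the shard count, so documents spread evenly across shards regardless of merchant or category. A minimal plain-JDK sketch; the ShardAssigner/shardFor names and the "merchantId-skuId" key format are illustrative, not Solr API.

import java.math.BigInteger;
import java.security.MessageDigest;

public class ShardAssigner {
    private final int numShards;

    public ShardAssigner(int numShards) {
        this.numShards = numShards;
    }

    // Map a stable unique key (e.g. "merchantId-skuId") to a shard index
    // in [0, numShards) by hashing it and reducing modulo the shard count.
    public int shardFor(String uniqueKey) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(uniqueKey.getBytes("UTF-8"));
            // Treat the 128-bit digest as a non-negative integer.
            return new BigInteger(1, digest)
                    .mod(BigInteger.valueOf(numShards)).intValue();
        } catch (Exception e) {
            // MD5 and UTF-8 are guaranteed by the JDK, so this is unreachable.
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        ShardAssigner assigner = new ShardAssigner(8);
        // Prints some shard index in [0, 8); the same key always maps
        // to the same shard.
        System.out.println(assigner.shardFor("merchant42-sku12345"));
    }
}

One caveat: changing the shard count remaps almost every key, so growing from N to N+1 shards effectively means reindexing everything. Pick N with growth in mind.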