Bill,

Quick feedback:

1) Use 1.3-dev, or 1.3 when it comes out, not 1.2.

2) You did not mention Solr's distributed search functionality explicitly, so I 
get the feeling you are not aware of it.  See the DistributedSearch page on the 
Solr wiki.
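
For a concrete picture: a distributed query in 1.3 is just a normal /select 
request with a "shards" parameter listing the cores whose results get merged. 
A minimal sketch, with made-up host names, using Python purely for illustration:

  from urllib.parse import urlencode
  from urllib.request import urlopen

  # Hypothetical shard hosts/cores -- substitute your own.
  shards = "shard1:8983/solr,shard2:8983/solr,shard3:8983/solr"

  params = urlencode({
      "q": "harry potter",
      "shards": shards,  # Solr fans the query out to these cores and merges the results
      "rows": 10,
  })

  # Any shard (or a dedicated aggregator instance) can receive the request.
  response = urlopen("http://shard1:8983/solr/select?" + params)
  print(response.read().decode("utf-8"))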

3) You definitely don't want a single 500M-doc index that's 2 TB in size - 
think about the ratio of index size to available RAM.
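
Rough arithmetic with your own numbers (the per-box RAM figure below is purely 
an assumption to show the ratio, not a recommendation):

  total_docs = 10000 * 50000            # 500 million SKUs
  doc_size_kb = 4                       # per your estimate
  index_size_gb = total_docs * doc_size_kb / (1024 * 1024)   # ~1900 GB, i.e. ~2 TB

  ram_per_box_gb = 32                   # assumption, purely for illustration
  # To keep most of each shard's index in the OS cache you'd need roughly:
  print(index_size_gb / ram_per_box_gb)  # ~60 shards/boxes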

4) You can try logically sharding your index, but I suspect that will result in 
an uneven term distribution that will not yield optimal relevancy-based 
ordering.  Instead, you may have to assign records/documents to shards in some 
more random fashion -- something like the sketch below.  See the ML archives 
for some recent discussion on this (search for MD5 and SHA-1 -- Lance, want to 
put that on the Wiki?).
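
A minimal sketch of that kind of assignment (the shard count and the ID format 
are made up; MD5 is used here only to spread documents evenly, not for security):

  import hashlib

  NUM_SHARDS = 20  # placeholder; pick based on your index-size-to-RAM math

  def shard_for(doc_id):
      # Hash the unique key (e.g. "merchantId:sku") and take it modulo the
      # shard count.  Hashing gives a roughly uniform spread of documents,
      # and therefore of terms, so per-shard scores stay comparable.
      digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
      return int(digest, 16) % NUM_SHARDS

  print(shard_for("merchant42:SKU-0001"))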


5) Hardware recommendations are hard to do.  While people may make suggestions, 
the only way to know how *your* hardware works with *your* data and *your* 
shards and *your* type of queries is by benchmarking.
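
Even a crude single-box load test tells you a lot.  A minimal sketch, assuming 
a local Solr instance and a made-up query list (swap in queries sampled from 
your real logs):

  import time
  from urllib.parse import quote
  from urllib.request import urlopen

  SOLR = "http://localhost:8983/solr/select"        # placeholder host/core
  queries = ["harry potter", "ipod", "red shoes"]   # sample these from real logs

  latencies = []
  for q in queries * 100:                           # repeat for a rough average
      start = time.time()
      urlopen(SOLR + "?q=" + quote(q) + "&rows=10").read()
      latencies.append(time.time() - start)

  latencies.sort()
  # Single-threaded, so this is a lower bound on throughput; add concurrency
  # (threads or separate processes) to see where the box actually saturates.
  print("median %.3fs  p95 %.3fs  ~%.1f qps" % (
      latencies[len(latencies) // 2],
      latencies[int(len(latencies) * 0.95)],
      len(latencies) / sum(latencies)))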

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: William Pierce <[EMAIL PROTECTED]>
> To: solr-user@lucene.apache.org
> Sent: Thursday, May 15, 2008 12:23:03 PM
> Subject: Some advice on scalability
> 
> Folks:
> 
> We are building a search capability into our web site and plan to use Solr.
> While we have the initial prototype version up and running on Solr 1.2, we
> are now turning our attention to sizing/scalability.
> 
> Our app in brief:  We get merchant SKU files (in either XML or CSV) which we
> process and index and make available to our site visitors to search.  Our
> current plan calls for us to support approx 10,000 merchants, each with an
> average of 50,000 SKUs.  That makes a total of approx 500 million SKUs.  In
> addition, we assume that on a daily basis approx 5-10% of the SKUs will need
> to be updated (added/deleted/modified).  (Assume each SKU will be approx 4 KB.)
> 
> Here are a few questions that we are thinking about and would value any 
> insights 
> you all may have:
> 
> a) Should we have just one giant master index (containing all the SKUs) and
> then have multiple slaves to handle the search queries?  In this case, the
> master index will be approx 2 TB in size.  Not being an expert in Solr/Lucene,
> I suspect it may be a bad idea to let one index become so large.  What size
> limit should we assume for each index?
> 
> b) Or, should we partition the 10,000 merchants into N buckets and have a
> master index for each of the N buckets?  We could partition the merchants
> depending on their type or some other simple algorithm.  Then, we could have
> slaves set up for each of the N masters.  The trick here will be to partition
> the merchants carefully.  Ideally we would like a search for any product type
> to hit only one index, but this may not always be possible.  For example, a
> search for "Harry Potter" may result in hits in "books", "dvds",
> "memorabilia", etc.
> 
> With N masters we will have to plan for distributed search across the N
> indices (and then some mechanism for merging and weighting the results that
> come back).  Any recommendations for a distributed search solution?  I saw
> some references to Katta.  Is this viable?
> 
> In the extreme case, we could have one master for each merchant (with 10,000
> merchants there would be 10,000 master indices).  The advantage here is that
> an index has to be updated only when its merchant submits a new data file;
> the others remain unchanged.
> 
> c) By the way, for those of you who have deployed Solr in a production
> environment, can you give me your hardware configuration and the rough number
> of search queries per second that can be handled by a single Solr instance --
> assuming a dedicated box?
> 
> d) Our plan is to release a beta version in Spring 2009.  Should we plan on
> using Solr 1.2, or move to Solr 1.3 now?
> 
> Any insights/thoughts/whitepapers will be greatly appreciated!
> 
> Cheers,
> 
> Bill
