Otis:
I will take a look at the DistributedSearch page on the Solr wiki.
Thanks,
Bill
--------------------------------------------------
From: "Otis Gospodnetic" <[EMAIL PROTECTED]>
Sent: Thursday, May 15, 2008 12:54 PM
To: <solr-user@lucene.apache.org>
Subject: Re: Some advice on scalability
Bill,
Quick feedback:
1) use 1.3-dev now, or 1.3 when it comes out -- not 1.2.
2) you did not mention Solr's distributed search functionality explicitly, so I get the feeling you are not aware of it. See the DistributedSearch page on the Solr wiki.
3) you definitely don't want a single 500M-doc index that's 2 TB in size -- think about the index size : RAM ratio.
4) you can try logically sharding your index, but I suspect that will result in uneven term distribution that will not yield optimal relevancy-based ordering. Instead, you may have to assign records/documents to shards in a more random fashion, e.g. by hashing each document's unique key (see the ML archives for some recent discussion of this; search for MD5 and SHA-1 -- Lance, want to put that on the Wiki?). A rough sketch of the hashing and the distributed query follows at the end of this list.
5) Hardware recommendations are hard to do. While people may make
suggestions, the only way to know how *your* hardware works with *your*
data and *your* shards and *your* type of queries is by benchmarking.
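To make 2) and 4) a bit more concrete, here is a minimal sketch in plain Java (no Solr client library) of hash-based shard assignment plus the comma-separated "shards" parameter that Solr's distributed search uses to fan a query out across shards and merge the ranked results. The host names, the key format, and the class/method names are made up for illustration; only the shards parameter itself is Solr's.

  import java.math.BigInteger;
  import java.security.MessageDigest;

  public class ShardingSketch {

      // Hypothetical shard hosts; Solr's distributed search takes these as
      // host:port/path entries in the "shards" request parameter.
      static final String[] SHARDS = {
          "shard1.example.com:8983/solr",
          "shard2.example.com:8983/solr",
          "shard3.example.com:8983/solr"
      };

      // Map a document's unique key (e.g. "merchantId-skuId") to a shard by
      // hashing it with MD5. Hashing spreads documents, and therefore terms,
      // roughly evenly across shards instead of clustering one merchant or
      // one product type on a single index.
      static int shardFor(String uniqueKey) {
          try {
              byte[] digest = MessageDigest.getInstance("MD5")
                      .digest(uniqueKey.getBytes("UTF-8"));
              return new BigInteger(1, digest)
                      .mod(BigInteger.valueOf(SHARDS.length)).intValue();
          } catch (Exception e) {
              throw new RuntimeException(e);
          }
      }

      public static void main(String[] args) {
          // Index time: every client that hashes the same way sends adds,
          // updates, and deletes for a given SKU to the same shard master.
          String key = "merchant42-sku1234567";
          System.out.println(key + " -> " + SHARDS[shardFor(key)]);

          // Query time: a single request fans out to all shards, e.g.
          // /select?q=harry+potter&shards=<the comma-separated list below>
          StringBuilder shardsParam = new StringBuilder();
          for (int i = 0; i < SHARDS.length; i++) {
              if (i > 0) shardsParam.append(',');
              shardsParam.append(SHARDS[i]);
          }
          System.out.println("shards=" + shardsParam);
      }
  }

The point of hashing rather than partitioning by merchant or product type is that a query then scores against similar term statistics on every shard, so the merged ordering stays sensible.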
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
----- Original Message ----
From: William Pierce <[EMAIL PROTECTED]>
To: solr-user@lucene.apache.org
Sent: Thursday, May 15, 2008 12:23:03 PM
Subject: Some advice on scalability
Folks:
We are building a search capability into our web site and plan to use Solr. While we have the initial prototype version up and running on Solr 1.2, we are now turning our attention to sizing/scalability.
Our app in brief: we get merchant SKU files (in either XML or CSV format), which we process, index, and make available to our site visitors to search. Our current plan calls for us to support approx 10,000 merchants, each with an average of 50,000 SKUs, for a total of approx 500 million SKUs. In addition, we assume that on a daily basis approx 5-10% of the SKUs need to be updated (added/deleted/modified). (Assume each SKU will be approx 4K.)
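(Back-of-the-envelope, assuming those averages hold:

   10,000 merchants x 50,000 SKUs  =  500,000,000 SKUs
   500,000,000 SKUs x ~4 KB each   ~= 2,000,000,000 KB, i.e. roughly 2 TB of raw data
   5-10% daily churn               =  25-50 million add/delete/update operations per day

which is where the 2 TB figure in question (a) below comes from.)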
Here are a few questions that we are thinking about and would value any insights you all may have:
a) Should we have just one giant master index (containing all the SKUs) and then have multiple slaves to handle the search queries? In this case, the master index will be approx 2 TB in size. Not being an expert in Solr/Lucene, I am thinking it may be a bad idea to let one index become so large. What size limit should we assume for each index?
b) Or, should we partition the 10,000 merchants into N buckets and have a master index for each of the N buckets? We could partition the merchants depending on their type or some other simple algorithm. Then, we could have slaves set up for each of the N masters. The trick here will be to partition the merchants carefully. Ideally we would like a search for any product type to hit only one index, but this may not always be possible. For example, a search for "Harry Potter" may result in hits in "books", "dvds", "memorabilia", etc. With N masters we will have to plan for a distributed search across the N indices (and then some mechanism for weighting the results that come back). Any recommendations for a distributed search solution? I saw some references to Katta. Is this viable?
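To be concrete about the kind of weighting/merging mechanism I mean, here is a minimal sketch of what I imagine our front end would have to do if we rolled our own merge across the N indices (the Hit class and the doc ids/scores are just placeholders, and it assumes the scores coming back from the N indices are actually comparable):

  import java.util.ArrayList;
  import java.util.Arrays;
  import java.util.Collections;
  import java.util.Comparator;
  import java.util.List;

  public class ResultMerger {

      // A single hit returned by one of the N indices.
      static class Hit {
          final String docId;
          final float score;
          Hit(String docId, float score) {
              this.docId = docId;
              this.score = score;
          }
      }

      // Merge the per-index result lists into one list ordered by score and
      // keep the top N. The ordering is only meaningful if scores from the
      // different indices are comparable; I am not sure how true that will
      // be if each index holds a different product type.
      static List<Hit> merge(List<List<Hit>> perIndexResults, int topN) {
          List<Hit> all = new ArrayList<Hit>();
          for (List<Hit> hits : perIndexResults) {
              all.addAll(hits);
          }
          Collections.sort(all, new Comparator<Hit>() {
              public int compare(Hit a, Hit b) {
                  return Float.compare(b.score, a.score); // descending by score
              }
          });
          return new ArrayList<Hit>(all.subList(0, Math.min(topN, all.size())));
      }

      public static void main(String[] args) {
          List<List<Hit>> perIndex = Arrays.asList(
                  Arrays.asList(new Hit("books-123", 7.2f), new Hit("books-789", 5.1f)),
                  Arrays.asList(new Hit("dvds-456", 8.9f)));
          for (Hit h : merge(perIndex, 10)) {
              System.out.println(h.docId + " " + h.score);
          }
      }
  }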
In the extreme case, we could have one master for each of the merchants (if there are 10,000 merchants, there will be 10,000 master indices). The advantage here is that an index has to be updated only when its merchant submits a new data file; the others remain unchanged.
c) By the way, for those of you who have deployed Solr in a production environment: can you give me your hardware configuration and the rough number of search queries per second that a single Solr instance can handle -- assuming a dedicated box?
d) Our plan is to release a beta version in Spring 2009. Should we plan on using Solr 1.2, or move to Solr 1.3 now?
Any insights/thoughts/whitepapers will be greatly appreciated!
Cheers,
Bill