With five servers, assign 1/5 of the user_ids to each server. Choose
the number of servers based on the number of logged-in users.
Each user's searches go to the single server that holds their data.

Partitioning by user_id is common with relational databases.
We do this to hold our two billion movie ratings from ten
million customers.
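As a rough sketch of the user_id % N routing described above (the class and server names here are hypothetical, assuming N = 5 partitions), something like:

```java
import java.util.List;

// Hypothetical sketch: route each user to one of N Solr partitions
// by user_id modulo the number of servers.
public class ShardRouter {
    private final List<String> solrUrls; // one URL per partition

    public ShardRouter(List<String> solrUrls) {
        this.solrUrls = solrUrls;
    }

    // All searches and updates for a given user hit the same partition,
    // since user_id % N is stable for a fixed N.
    public String serverFor(long userId) {
        int shard = (int) (userId % solrUrls.size());
        return solrUrls.get(shard);
    }
}
```

Note the trade-off: changing N (adding a server) moves most users to a different partition, which is why this scheme needs a reindex to grow.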

wunder

On 2/26/09 8:21 AM, "Vikram Kumar" <vikrambku...@gmail.com> wrote:

> Hi Wunder,
> Can you please elaborate?
> 
> Vikram
> 
> On Thu, Feb 26, 2009 at 10:13 AM, Walter Underwood
> <wunderw...@netflix.com>wrote:
> 
>> 1a. Multiple Solr instances partitioned by user_id%N, with index
>> files segmented by user_id field.
>> 
>> That can scale rather gracefully, though it does need reindexing
>> to add a server.
>> 
>> wunder
>> 
>> On 2/26/09 3:44 AM, "Vikram B. Kumar" <vikrambku...@gmail.com> wrote:
>> 
>>> Hi All,
>>> 
>>> Our web-based document management system has a few thousand users and is
>>> growing rapidly. Like any SaaS, while we support a lot of customers,
>>> only a few of them (those logged in) will be reading their index, and only
>>> a subset of those logged in (those adding documents) will be writing
>>> to their index.
>>> 
>>> i.e., TU > L > U
>>> 
>>> and TU ~ 100 x L
>>> 
>>> where TU is the total number of users, L is the number of logged-in
>>> users who are searching, and U is the number of uploaders who are
>>> updating their index.
>>> 
>>> We have been using Lucene over a simple RESTful server for searching.
>>> Indexing is currently done using regular JavaSE based setup, instead of
>>> a server. We are thinking about moving to Solr to scale better and to
>>> get rid of the latency associated with our non-live JavaSE based
>>> indexer. We have a custom Analyzer/Filter that adds some payload to each
>>> term to support our web based service.
>>> 
>>> My question is about how best to partition the index to support
>>> multiple users.
>>> 
>>> Hardware: The servers I have are 64-bit, 1.7 GHz, 2 x dual-core (i.e.,
>>> 4 cores total) with 1/2 TB disks. By my estimate, 1/2 TB can support
>>> 8000-10000 users before I need to start sharding them across multiple
>>> hosts.
>>> 
>>> I have thought of the following options:
>>> 
>>> 1. One monolithic index, but index files segmented by user_id field.
>>> 
>>> 2. MultiCore - One core per user.
>>> 
>>> 3. Multiple Solr instances - Non scalable.
>>> 
>>> 4. Don't use Solr, but enhance our Lucene + RESTful server model to
>>> support indexing as well. - Least favored approach, as we would be
>>> reimplementing a lot of things that Solr already does (replication,
>>> live add/update/delete). Most of what we are doing can be done with
>>> Solr's pluggable query handlers. (I guess this is not a true option at
>>> all.)
>>> 
>>> I am currently favouring option 2, though I want to try out whether
>>> option 1 works as well.
>>> 
>>> It looks like one of the most obvious problems with MultiCore is the
>>> "too many open files" problem, which can be handled with hardware and
>>> software boundaries (properly closing each index after updates and
>>> after users log out).
>>> 
>>> My questions:
>>> 
>>> 1. Can our analyzers/filters be plugged into Solr during the time of
>>> indexing?
>>> 2. Does option 2 fit the above needs? Has anybody run option 2 with
>>> thousands of cores in a single Solr instance?
>>> 3. Does option 2 support horizontal scaling (sharding)?
>>> 
>>> Thanks,
>>> Vikram
>>> 
>>> 
>> 
>> 