With five servers, assign 1/5 of the user_ids to each server, and choose the number of servers to match the logged-in user load. Each user's searches go to the single server that holds their data.
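In code, the routing is just a modulo over the server list. A minimal sketch in Java, assuming a hypothetical SOLR_URLS array (none of these names come from this thread):

    // Route each user's indexing and search traffic to the one shard
    // that owns their data. SOLR_URLS is a placeholder; N is its length.
    public class UserShardRouter {
        private static final String[] SOLR_URLS = {
            "http://solr1:8983/solr",
            "http://solr2:8983/solr",
            "http://solr3:8983/solr",
            "http://solr4:8983/solr",
            "http://solr5:8983/solr",
        };

        // user_id % N picks the server; ids are assumed non-negative.
        public static String serverFor(long userId) {
            return SOLR_URLS[(int) (userId % SOLR_URLS.length)];
        }
    }

Note that growing SOLR_URLS changes the mapping for most users, which is why adding a server needs a reindex (or a migration of the users whose assignment moved).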
Partitioning by user_id is common with relational databases. It is how we hold our two billion movie ratings from ten million customers.

wunder

On 2/26/09 8:21 AM, "Vikram Kumar" <vikrambku...@gmail.com> wrote:

> Hi Wunder,
> Can you please elaborate?
>
> Vikram
>
> On Thu, Feb 26, 2009 at 10:13 AM, Walter Underwood
> <wunderw...@netflix.com> wrote:
>
>> 1a. Multiple Solr instances partitioned by user_id % N, with index
>> files segmented by a user_id field.
>>
>> That can scale rather gracefully, though it does need reindexing
>> to add a server.
>>
>> wunder
>>
>> On 2/26/09 3:44 AM, "Vikram B. Kumar" <vikrambku...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> Our web-based document management system has a few thousand users
>>> and is growing rapidly. Like any SaaS, while we support a lot of
>>> customers, only a few of them (those logged in) will be reading
>>> their index, and only a subset of those (the ones adding documents)
>>> will be writing to their index.
>>>
>>> i.e. TU > L > U, and TU ~ 100 x L,
>>>
>>> where TU is the total number of users, L is the number of logged-in
>>> users who are searching, and U is the number of uploaders who are
>>> updating their index.
>>>
>>> We have been using Lucene behind a simple RESTful server for
>>> searching. Indexing is currently done with a plain Java SE setup
>>> rather than a server. We are thinking about moving to Solr to scale
>>> better and to get rid of the latency of our non-live Java SE
>>> indexer. We have a custom Analyzer/Filter that adds a payload to
>>> each term to support our web-based service.
>>>
>>> My message is about how best to partition the index to support
>>> multiple users.
>>>
>>> Hardware: the servers I have are 64-bit, 1.7 GHz, 2 x dual-core
>>> (i.e. 4 cores total) with 1/2 TB disks. By my estimate, 1/2 TB can
>>> support 8,000-10,000 users before I need to start sharding them
>>> across multiple hosts.
>>>
>>> I have thought of the following options:
>>>
>>> 1. One monolithic index, with index files segmented by a user_id
>>> field.
>>>
>>> 2. MultiCore - one core per user.
>>>
>>> 3. Multiple Solr instances - not scalable.
>>>
>>> 4. Don't use Solr, but enhance our Lucene + RESTful server model to
>>> support indexing as well. This is the least favored approach, since
>>> we would be redoing a lot of what Solr already does (replication,
>>> live add/update/delete), and most of what we need can be done with
>>> Solr's pluggable query handlers. (I guess this is not a true option
>>> at all.)
>>>
>>> I am currently favouring option 2, though I want to try whether
>>> option 1 works as well.
>>>
>>> The most obvious problem with MultiCore looks to be "too many open
>>> files", which can be handled with hardware and software limits
>>> (properly closing an index after updates and after a user logs
>>> out).
>>>
>>> My questions:
>>>
>>> 1. Can our analyzers/filters be plugged into Solr at indexing time?
>>> 2. Does option 2 fit the above needs? Has anybody run option 2 with
>>> thousands of cores in one Solr instance?
>>> 3. Does option 2 support horizontal scaling (sharding)?
>>>
>>> Thanks,
>>> Vikram
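On question 1 in the quoted message: yes, a custom Lucene analyzer or filter plugs into Solr at index time through a factory class declared in schema.xml. A sketch against the Solr 1.3 / Lucene 2.4-era APIs; the package, the factory name, and the one-byte payload are illustrative stand-ins for the existing custom filter, not anything from this thread:

    package com.example.solr;

    import java.io.IOException;

    import org.apache.lucene.analysis.Token;
    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.index.Payload;
    import org.apache.solr.analysis.BaseTokenFilterFactory;

    // Factory Solr uses to build the filter for a fieldType's analyzer chain.
    public class PayloadTokenFilterFactory extends BaseTokenFilterFactory {
        public TokenStream create(TokenStream input) {
            return new PayloadTokenFilter(input);
        }

        // Stand-in for the real custom filter: attaches a fixed one-byte
        // payload to every term.
        static final class PayloadTokenFilter extends TokenFilter {
            PayloadTokenFilter(TokenStream input) {
                super(input);
            }

            public Token next(Token reusableToken) throws IOException {
                Token token = input.next(reusableToken);
                if (token != null) {
                    token.setPayload(new Payload(new byte[] { 1 }));
                }
                return token;
            }
        }
    }

With the jar on Solr's classpath (e.g. the core's lib directory), the filter is referenced from a fieldType's index-time analyzer in schema.xml as <filter class="com.example.solr.PayloadTokenFilterFactory"/>.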