1a. Multiple Solr instances partitioned by user_id%N, with index files segmented by user_id field.
That can scale rather gracefully, though it does need reindexing to add
a server.

wunder

On 2/26/09 3:44 AM, "Vikram B. Kumar" <vikrambku...@gmail.com> wrote:

> Hi All,
>
> Our web based document management system has a few thousand users and is
> growing rapidly. Like any SaaS, while we support a lot of customers,
> only a few of them (those logged in) will be reading their index and only
> a subset of those logged in (who are adding documents) will be writing
> to their index.
>
> i.e. TU > L > U
>
> and TU ~ 100 x L
>
> where TU is the total number of users, L is the logged-in users who are
> searching and U is the uploaders who are updating their index.
>
> We have been using Lucene over a simple RESTful server for searching.
> Indexing is currently done using a regular JavaSE based setup, instead of
> a server. We are thinking about moving to Solr to scale better and to
> get rid of the latency associated with our non-live JavaSE based
> indexer. We have a custom Analyzer/Filter that adds some payload to each
> term to support our web based service.
>
> My message is about how best to partition the index to support
> multiple users.
>
> Hardware: The servers I have are 64 bit 1.7GHz x 2xDual Core (i.e. 4
> cores in total) with 1/2 TB disks. By my estimate, 1/2 TB can support
> 8000-10000 users before I need to start sharding them across multiple hosts.
>
> I have thought of the following options:
>
> 1. One Monolithic index, but index files segmented by user_id field.
>
> 2. MultiCore - One core per user.
>
> 3. Multiple Solr instances - Non scalable.
>
> 4. Don't use Solr, but enhance our Lucene + RESTful server model to
> support indexing as well. - Least favored approach as we will be doing a
> lot of things that Solr already does (replication, live
> add/update/delete). Most of the things we are doing can be done with
> Solr's pluggable query handlers. (I guess this is not a true option at all).
>
> I am currently favouring Option 2, though I want to try out whether 1 works
> as well.
>
> Looks like some of the most obvious problems with MultiCores are "too
> many open files" problems, which can be handled with hardware and
> software boundaries (properly close index after updating and after users
> logout).
>
> My questions:
>
> 1. Can our analyzers/filters be plugged into Solr during the time of
> indexing?
> 2. Does option 2 fit the above needs? Has anybody done option 2 with
> thousands of cores in a Solr instance?
> 3. Does option 2 support horizontal scaling (sharding)?
>
> Thanks,
> Vikram
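
To make option 1a concrete, here is a minimal sketch of user_id%N routing, and of why adding a server forces reindexing. It is an illustration only, assuming integer user ids; the host names and the helper names (shard_for, moved_users, solr_url_for) are hypothetical and not part of Solr.

```python
# Hypothetical Solr instance URLs, for illustration only.
SHARD_URLS = [
    "http://solr0:8983/solr",
    "http://solr1:8983/solr",
    "http://solr2:8983/solr",
]


def shard_for(user_id: int, num_shards: int) -> int:
    """Return the number of the instance that holds this user's documents."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    return user_id % num_shards


def solr_url_for(user_id: int) -> str:
    """Pick the instance to index into and search for a given user."""
    return SHARD_URLS[shard_for(user_id, len(SHARD_URLS))]


def moved_users(user_ids, old_n: int, new_n: int):
    """List users whose instance changes when N goes from old_n to new_n."""
    return [u for u in user_ids
            if shard_for(u, old_n) != shard_for(u, new_n)]


if __name__ == "__main__":
    print(solr_url_for(10))
    # Going from 3 to 4 instances remaps most users, so their
    # documents have to be reindexed on the new target instance.
    print(moved_users(range(12), 3, 4))
```

Within each instance, searches would still be restricted to the one user, for example with a filter query on the user_id field.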