Hi All,

Our web-based document management system has a few thousand users and is growing rapidly. As with any SaaS, although we support a lot of customers, only a few of them (those logged in) will be reading their index, and only a subset of those logged in (those adding documents) will be writing to their index.

i.e., TU > L > U

and TU ~ 100 x L

where TU is the total number of users, L is the number of logged-in users who are searching, and U is the number of uploaders who are updating their index.

We have been using Lucene behind a simple RESTful server for searching. Indexing is currently done with a regular Java SE based setup rather than a server. We are thinking about moving to Solr to scale better and to get rid of the latency associated with our non-live Java SE based indexer. We have a custom Analyzer/Filter that adds a payload to each term to support our web-based service.

My question is about how best to partition the index to support multiple users.

Hardware: The servers I have are 64-bit, 1.7GHz, 2 x dual core (i.e., 4 cores in total) with 1/2 TB disks. By my estimate, 1/2 TB can support 8000-10000 users before I need to start sharding them across multiple hosts.
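To make the estimate concrete, here is a rough back-of-the-envelope sketch of what those figures imply per host. It assumes decimal units (1/2 TB taken as 500,000 MB) and that the TU ~ 100 x L ratio also holds per host; the function names are only for illustration.

```python
# Back-of-the-envelope numbers from the figures above (decimal units assumed).
DISK_MB = 500_000                  # 1/2 TB per host, taken as 500,000 MB
USERS_PER_HOST = (8_000, 10_000)   # estimated range of users per host
TU_TO_L = 100                      # TU ~ 100 x L


def per_user_budget_mb(users: int, disk_mb: int = DISK_MB) -> float:
    """Average disk space available for each user's index on one host."""
    return disk_mb / users


def logged_in_estimate(total_users: int, ratio: int = TU_TO_L) -> int:
    """Rough number of logged-in (searching) users for a given total."""
    return total_users // ratio


def hosts_needed(total_users: int, users_per_host: int) -> int:
    """Hosts required once users are spread across machines (ceiling division)."""
    return -(-total_users // users_per_host)


if __name__ == "__main__":
    for users in USERS_PER_HOST:
        print(users, per_user_budget_mb(users), logged_in_estimate(users))
```

So the estimate works out to roughly 50-62.5 MB of index per user, with on the order of 80-100 users per host searching at any one time.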

I have thought of the following options:

1. One monolithic index, but with index files segmented by a user_id field.

2. MultiCore - One core per user.

3. Multiple Solr instances - Not scalable.

4. Don't use Solr, but enhance our Lucene + RESTful server model to support indexing as well. - Least favored approach, as we would be redoing a lot of things that Solr already does (replication, live add/update/delete). Most of what we are doing can be done with Solr's pluggable query handlers. (I guess this is not a true option at all.)

I am currently favouring Option 2, though I want to try out whether Option 1 works as well.
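To illustrate how I picture the difference between Options 1 and 2 at query time, here is a minimal sketch that only builds the request URLs. It assumes Option 1 would restrict results with Solr's fq (filter query) parameter on user_id, and that Option 2 would address each user's core by name in the URL path. The host, port, and the user_<id> core naming scheme are made up for the example.

```python
from urllib.parse import urlencode

BASE = "http://localhost:8983/solr"  # hypothetical host and port


def monolithic_query_url(user_id: int, text: str) -> str:
    """Option 1: one shared index, restricted to one user with a filter query."""
    params = {"q": text, "fq": f"user_id:{user_id}"}
    return f"{BASE}/select?{urlencode(params)}"


def per_core_query_url(user_id: int, text: str) -> str:
    """Option 2: one core per user; the core name in the path selects the index."""
    core = f"user_{user_id}"  # hypothetical core naming scheme
    return f"{BASE}/{core}/select?{urlencode({'q': text})}"


if __name__ == "__main__":
    print(monolithic_query_url(42, "invoice"))
    print(per_core_query_url(42, "invoice"))
```

With Option 1 every query has to carry the user filter, whereas with Option 2 the isolation comes from routing to the right core.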

It looks like one of the most obvious problems with MultiCore is "too many open files" errors, which can be handled with hardware and software limits (properly closing each index after updating and after the user logs out).
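The kind of software limit I have in mind is sketched below: keep at most a fixed number of per-user indexes open, close the least recently used one when the limit is exceeded, and close a user's index explicitly on logout. This is only an illustration of the idea, not anything Solr provides; OpenCoreCache and the open_core / close_core callbacks are hypothetical names standing in for whatever actually opens and closes an index.

```python
from collections import OrderedDict
from typing import Any, Callable


class OpenCoreCache:
    """Keeps at most max_open per-user indexes open, evicting the least recently used."""

    def __init__(self, max_open: int,
                 open_core: Callable[[int], Any],
                 close_core: Callable[[Any], None]) -> None:
        self._max_open = max_open
        self._open_core = open_core      # hypothetical: opens a user's index
        self._close_core = close_core    # hypothetical: closes an index handle
        self._open: "OrderedDict[int, Any]" = OrderedDict()

    def get(self, user_id: int) -> Any:
        """Return the open handle for user_id, opening it if necessary."""
        if user_id in self._open:
            self._open.move_to_end(user_id)  # mark as most recently used
            return self._open[user_id]
        handle = self._open_core(user_id)
        self._open[user_id] = handle
        # Close the oldest handles until we are back under the limit.
        while len(self._open) > self._max_open:
            _, old_handle = self._open.popitem(last=False)
            self._close_core(old_handle)
        return handle

    def release(self, user_id: int) -> None:
        """Close a user's index explicitly, e.g. when the user logs out."""
        if user_id in self._open:
            self._close_core(self._open.pop(user_id))

    def __len__(self) -> int:
        return len(self._open)
```

Since only about 1 in 100 users is logged in at a time, max_open could be sized to the expected number of logged-in users plus some headroom, rather than to the total number of users on the host.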

My questions:

1. Can our analyzers/filters be plugged into Solr at indexing time?
2. Does option 2 fit the above needs? Has anybody done option 2 with thousands of cores in a single Solr instance?
3. Does option 2 support horizontal scaling (sharding)?

Thanks,
Vikram

