Hi All,
Our web based document management system has few thousand users and is
growing rapidly. Like any SaaS, while we support a lot of customers,
only few of them (those logged in) will be reading their index and only
a subset of those logged in (who are adding documents) will be writing
to their index.
i.,e TU > L > U
and TU ~ 100 x L
where TU is total no of users, L is logged in users who are searching
and U is the uploaders who are updating their index.
We have been using Lucene over a simple RESTful server for searching.
Indexing is currently done using regular JavaSE based setup, instead of
a server. We are thinking about moving to Solr to scale better and to
get rid of the latency associated with our non-live JavaSE based
indexer. We have a custom Analyzer/Filter that adds some payload to each
term to support our web based service.
My message is about on how best to partition the index to support
multiple users.
Hardware: The servers I have are 64 bit 1.7GHz x 2xDual Core (i.,e 4
cores totally) with 1/2 TB disks. By my estimate, 1/2 TB can support
8000-10000 users before I need to start sharding them across multiple hosts.
I have thought of the following options:
1. One Monilithic index, but index files segmented by user_id field.
2. MultiCore - One core per user.
3. Multiple Solr instances - Non scalable.
4. Don't use Solr, but enhance our Lucene +RESTful server model to
support indexing as well. - Least favored approach as we will be doing a
lot of things that Solr already does (replication, live
add/update/delete). Most of the things we are doing, can be done with
Solr's pluggable query handlers. (I guess this is not a true option at all).
I am currently favouring Option 2 though want to try out whether 1 works
as well.
Looks like some of the most obvious problems with MultiCores are "too
many open file" problems, which can be handled with hardware and
software boundaries (properly close index after updating and after users
logout).
My questions:
1. Can our analyzers/filters be plugged into Solr during the time of
indexing?
2. Does option 2 fit the above needs? Has anybody done option 2 with
thousands of cores in a Solr instance?
3. Does option 2 to support horizontal scaling (sharding?)
Thanks,
Vikram