Planning of future Solr setup

Christian von Wendt-Jensen Wed, 23 May 2012 06:16:29 -0700

Hi,

I'm in the middle of planning a new Solr setup. The situation is this:
- We currently have one document type with around 20 fields, indexed, not 
stored, except for a few date fields
- We currently have indexed 400M documents across 20+ shards.
- The number of documents to be indexed is around 1M/day, and this number is 
increasing.
- The index files total around 750GB.
- Users will mostly search newly indexed documents (news), and therefore the 
shards represent date ranges.
- Each month or so, we add a new shard.
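
To put the figures above in perspective, here is a rough sizing sketch. It is only back-of-envelope arithmetic, and it assumes 1 GB = 10^9 bytes, an index size that scales linearly with document count, a 30-day shard, and a constant ingest rate of 1M/day (the real rate is increasing, so treat the results as a lower bound).

```python
# Back-of-envelope sizing from the figures quoted in the list above.
# Assumptions (not measured): 1 GB = 10**9 bytes, index size grows linearly
# with document count, and ingest stays at 1M documents/day.
TOTAL_DOCS = 400_000_000
TOTAL_INDEX_BYTES = 750 * 10**9
DOCS_PER_DAY = 1_000_000
DAYS_PER_SHARD = 30  # "each month or so"

# Average on-disk index footprint of one document.
bytes_per_doc = TOTAL_INDEX_BYTES / TOTAL_DOCS

# Index growth per day and the expected size of one new monthly shard.
daily_growth_gb = bytes_per_doc * DOCS_PER_DAY / 10**9
docs_per_new_shard = DOCS_PER_DAY * DAYS_PER_SHARD
new_shard_gb = daily_growth_gb * DAYS_PER_SHARD

print(f"average index bytes per document: {bytes_per_doc:.0f}")
print(f"index growth per day (GB):        {daily_growth_gb:.3f}")
print(f"documents per new shard:          {docs_per_new_shard}")
print(f"size of a new monthly shard (GB): {new_shard_gb:.2f}")
```

Under these assumptions a new monthly shard holds about 30M documents and a bit under 60GB of index, which is roughly the amount of data that has to be brought online each month.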



In my planning, my goals are:
- it should be very easy to add a new shard and bring it online. Maybe it could 
even be fully automated.
- It should be very easy to retire an old shard in order to reclaim the 
hardware resources for newer documents.
- It should be very easy to scale out or up by adding more machines or more 
CPU/RAM. The shards should be rebalanced automatically across the available 
resources for optimal resource usage.
- Rebalancing should be very fast.
- The setup should support one writer and many readers of the same physical 
index. This avoids replication and moving large files around, which in turn 
supports fast rebalancing of hardware resources.
- Clients should be notified about shards coming online or going offline.

These goals require some kind of distributed configuration and notification 
system. Here I imagine Zookeeper comes into play.
In order to make rebalancing very fast, the index files should stay where they 
are and not be moved around. Instead, Solr instances on available resources 
should be configured to point to the relevant shards. This requires SAN 
storage, I imagine.
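
To illustrate the kind of shared state I have in mind, here is a minimal in-memory sketch. It is not Zookeeper code and uses no Zookeeper API; the `ShardRegistry` class, the shard names, and the host strings are all hypothetical. It only shows the idea: each shard covers a date range, rebalancing changes which host points at a shard without moving any files, and subscribed clients are told about every change. The only Solr feature it relies on is the `shards` request parameter of distributed search, which takes a comma-separated list of `host:port/path` entries.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Dict, List


@dataclass
class Shard:
    name: str    # hypothetical shard name, e.g. "news-2012-05"
    start: date  # first day covered (inclusive)
    end: date    # last day covered (inclusive)
    host: str    # hypothetical "host:port/solr/core" currently serving it


class ShardRegistry:
    """In-memory stand-in for state a coordination service would hold."""

    def __init__(self) -> None:
        self._shards: Dict[str, Shard] = {}
        self._listeners: List[Callable[[str, Shard], None]] = []

    def subscribe(self, listener: Callable[[str, Shard], None]) -> None:
        # Clients register here to hear about shards coming and going.
        self._listeners.append(listener)

    def _notify(self, event: str, shard: Shard) -> None:
        for listener in self._listeners:
            listener(event, shard)

    def bring_online(self, shard: Shard) -> None:
        self._shards[shard.name] = shard
        self._notify("online", shard)

    def retire(self, name: str) -> None:
        shard = self._shards.pop(name)
        self._notify("offline", shard)

    def reassign(self, name: str, new_host: str) -> None:
        # Rebalancing: only the pointer changes, the index files stay put.
        shard = self._shards[name]
        shard.host = new_host
        self._notify("moved", shard)

    def shards_param(self, start: date, end: date) -> str:
        # Select the shards whose date range overlaps the queried range.
        hits = [s for s in self._shards.values()
                if s.start <= end and s.end >= start]
        hits.sort(key=lambda s: s.start)
        return ",".join(s.host for s in hits)


events = []
registry = ShardRegistry()
registry.subscribe(lambda event, shard: events.append((event, shard.name)))
registry.bring_online(Shard("news-2012-04", date(2012, 4, 1),
                            date(2012, 4, 30), "solr1:8080/solr/news-2012-04"))
registry.bring_online(Shard("news-2012-05", date(2012, 5, 1),
                            date(2012, 5, 31), "solr2:8080/solr/news-2012-05"))
registry.reassign("news-2012-04", "solr3:8080/solr/news-2012-04")
print(registry.shards_param(date(2012, 4, 25), date(2012, 5, 5)))
```

In a real setup the dictionary would presumably live in Zookeeper and the listeners would be replaced by its watch mechanism, but the operations (bring online, retire, reassign, look up by date range) would be the same.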


Questions:
1. What is best practice with regard to using a machine's resources: one Tomcat 
instance per shard until memory and CPU are used up? Or rather one Tomcat 
with multiple cores, where Tomcat gets all the memory available on the machine?
2. Would it be a good idea to mix master and slave cores in the same tomcat 
instance or should a machine be dedicated to either master cores or slave cores?
3. What would be the best way to notify the slave cores about recent commits by 
the masters, remembering that replication is disabled?
4. In the one writer, many readers scenario, what happens when the writer 
merges/updates segments? Will the index files be physically deleted/altered? 
And how will the slaves react to that?
5. Would it be advisable to use a SAN for sharing index files between readers 
and writers (one writer)? Any best practices in this area? I imagine one large 
share on the SAN that all "resources" can mount.






Med venlig hilsen / Best Regards

Christian von Wendt-Jensen
