Christian,

You don't mention SolrCloud explicitly, and based on what you wrote I'm assuming you are planning to use a Solr 3.* setup for this. I think that's the first thing to change: this is going to be a pain to manage if you use Solr 3.*. You should immediately start looking at SolrCloud for this. Once you have a look, you will see that a number of your questions quickly become non-questions. :)

Otis
----
Performance Monitoring for Solr / ElasticSearch / HBase - http://sematext.com/spm

>________________________________
> From: Christian von Wendt-Jensen <christian.sonne.jen...@infopaq.com>
>To: "solr-user@lucene.apache.org" <solr-user@lucene.apache.org>
>Sent: Wednesday, May 23, 2012 6:59 AM
>Subject: Planning of future Solr setup
>
>Hi,
>
>I'm in the middle of planning a new Solr setup. The situation is this:
>- We currently have one document type with around 20 fields, indexed, not
>stored, except for a few date fields.
>- We currently have indexed 400M documents across 20+ shards.
>- The number of documents to be indexed is around 1M/day, and this number is
>increasing.
>- The index files total around 750GB.
>- Users will mostly search newly indexed documents (news), and therefore the
>shards represent date ranges.
>- Each month or so, we add a new shard.
>
>
>In my planning, my goals are:
>- It should be very easy to add a new shard and bring it online. Maybe it
>could even be fully automated.
>- It should be very easy to retire an (old) shard in order to reclaim the
>hardware resources for newer documents.
>- It should be very easy to scale wide or high by adding more machines or more
>CPU/RAM. The resources should be able to autobalance the shards for optimum
>resource usage.
>- Rebalancing should be very fast.
>- The setup should support one writer and many readers of the same physical
>index. This avoids replication and moving large files around. This again
>supports fast rebalancing of hardware resources.
>- Clients should be notified about shards coming online or going offline.
>
>The goals require a kind of distributed configuration and notification system.
>Here I imagine Zookeeper comes into play.
>In order to make rebalancing very fast, the indexes should stay where they are,
>and not be moved around. Instead, Solr instances on available resources should
>be configured to point to relevant shards. This requires SAN storage, I
>imagine.
>
>
>Questions:
>1. What is best practice in regard to using a machine's resources: one tomcat
>instance per shard until memory and CPU are used up? Or rather one
>tomcat/multiple cores, and the tomcat gets all memory available on the machine?
>2. Would it be a good idea to mix master and slave cores in the same tomcat
>instance, or should a machine be dedicated to either master cores or slave
>cores?
>3. What would be the best way to notify the slave cores about recent commits
>by the masters, remembering that replication is disabled?
>4. In the one writer, many readers scenario, what happens when the writer
>merges/updates segments? Will the index files be physically deleted/altered?
>And how will the slaves react to that?
>5. Would it be advisable to use a SAN for sharing index files between readers
>and writers (one writer)? Any best practices in this area? I imagine one large
>share on the SAN that all "resources" can mount.
>
>
>Med venlig hilsen / Best Regards
>
>Christian von Wendt-Jensen
>