Hi,

I'm in the middle of planning a new Solr setup. The situation is this:

- We currently have one document type with around 20 fields, indexed but not stored, except for a few date fields.
- We currently have 400M documents indexed across 20+ shards.
- Around 1M documents are indexed per day, and this number is increasing.
- The index files total around 750GB.
- Users will mostly search newly indexed documents (news), and therefore the shards represent date ranges.
- Each month or so, we add a new shard.
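To make the date-range sharding concrete, here is a minimal sketch in Python of how a client could pick the shards that cover a query's date range and build the value of Solr's `shards` request parameter for a distributed search. The one-shard-per-month registry, the host names and the core names are made up for illustration and are not our actual layout.

```python
from datetime import date

# Hypothetical registry: one shard per month, keyed by (year, month).
# Host names and core names are illustrative assumptions only.
SHARDS = {
    (2011, 1): "solr1.example.com:8080/solr/news-2011-01",
    (2011, 2): "solr2.example.com:8080/solr/news-2011-02",
    (2011, 3): "solr2.example.com:8080/solr/news-2011-03",
}


def months_between(start: date, end: date):
    """Yield (year, month) for every month touched by [start, end]."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        yield (year, month)
        month += 1
        if month > 12:
            year, month = year + 1, 1


def shards_param(start: date, end: date) -> str:
    """Build the comma-separated `shards` value for a date range.

    Months without a registered shard are silently skipped.
    """
    hits = [SHARDS[key] for key in months_between(start, end) if key in SHARDS]
    return ",".join(hits)
```

A query for mid-February to early March would then only hit two shards, e.g. `shards_param(date(2011, 2, 10), date(2011, 3, 5))`, and the result would be passed as `shards=...` on the search request. In the setup I am planning, the registry itself would have to come from the shared configuration rather than being hard-coded.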

In my planning, my goals are:

- It should be very easy to add a new shard and bring it online. Maybe it could even be fully automated.
- It should be very easy to retire an (old) shard in order to reclaim the hardware resources for newer documents.
- It should be very easy to scale out or up by adding more machines or more CPU/RAM. The shards should be automatically rebalanced across the available resources for optimum resource usage.
- Rebalancing should be very fast.
- The setup should support one writer and many readers of the same physical index. This avoids replication and moving large files around, which in turn supports fast rebalancing of hardware resources.
- Clients should be notified about shards coming online or going offline.

These goals require some kind of distributed configuration and notification system. Here I imagine ZooKeeper comes into play.

In order to make rebalancing very fast, the indexes should stay where they are and not be moved around. Instead, Solr instances on available resources should be configured to point to the relevant shards. This requires SAN storage, I imagine.

Questions:

1. What is best practice with regard to using a machine's resources: one Tomcat instance per shard until memory and CPU are used up? Or rather one Tomcat with multiple cores, where the Tomcat gets all the memory available on the machine?
2. Would it be a good idea to mix master and slave cores in the same Tomcat instance, or should a machine be dedicated to either master cores or slave cores?
3. What would be the best way to notify the slave cores about recent commits by the masters, remembering that replication is disabled?
4. In the one-writer, many-readers scenario, what happens when the writer merges/updates segments? Will the index files be physically deleted or altered? And how will the slaves react to that?
5. Would it be advisable to use a SAN for sharing index files between readers and writers (one writer)? Any best practices in this area?
I imagine one large share on the SAN that all "resources" can mount.

Best regards,
Christian von Wendt-Jensen