Hi,

Did you say you have 150 servers in this cluster? And 10 shards for just
90M docs? If so, 150 hosts sounds like a lot for the other numbers I see
here. I'd love to see some metrics. E.g., what happens with disk IO around
those commits? How about GC time/size info? Are the JVM memory pools
close to full, and is the CPU jumping around? Can you share more info to
give us a more complete picture of your system? SPM for Solr
<http://sematext.com/spm/> will help if you don't already capture these
types of things.
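If you don't capture these yet, even basic GC logging plus iostat would
tell us a lot. A minimal sketch (the log path and sampling interval are
just placeholders; the GC flags are the usual Java 7-era ones):

    # add to the Solr JVM startup flags to get GC timing/size info
    -Xloggc:/var/log/solr/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps

    # watch disk IO around the commit times, sampled every 5 seconds
    iostat -x 5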
Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


On Thu, Feb 12, 2015 at 11:07 AM, Vijay Sekhri <sekhrivi...@gmail.com> wrote:

> Hi Erick,
> We have the following configuration for our Solr cloud:
>
>    1. 10 shards
>    2. 15 replicas per shard
>    3. 9 GB of index size per shard
>    4. A total of around 90 million documents
>    5. 2 collections, viz. search1 serving live traffic and search2 for
>    indexing. We swap the collections when indexing finishes.
>    6. On 150 hosts we have 2 JVMs running, one for the search1
>    collection and the other for the search2 collection.
>    7. Each JVM has 12 GB of heap assigned to it, while each host has
>    50 GB in total.
>    8. Each host has 16 processors.
>    9. Linux XXXXXXX 2.6.32-431.5.1.el6.x86_64 #1 SMP Wed Feb 12
>    00:41:43 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
>    10. We have two ways to index data:
>       1. Bulk indexing: all 90 million docs pumped in from 14 parallel
>       processes (on 14 different client hosts). This is done on the
>       collection that is not serving live traffic.
>       2. Incremental indexing: only delta changes (ranging from 100K
>       to 5 million docs) every two hours. This is done on the
>       collection that is also serving live traffic.
>    11. The request rate on the live collection is around 300 TPS.
>    12. Hard commit is every 30 seconds with openSearcher=false, and
>    soft commit is every 15 minutes (see the sketch just below this
>    list). We have tried a lot of different settings here, BTW.
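>
> To make that concrete, the relevant section of our solrconfig.xml
> looks roughly like this (a simplified sketch; the values are the ones
> listed above):
>
>     <autoCommit>
>       <!-- hard commit every 30 seconds, without opening a searcher -->
>       <maxTime>30000</maxTime>
>       <openSearcher>false</openSearcher>
>     </autoCommit>
>     <autoSoftCommit>
>       <!-- soft commit every 15 minutes -->
>       <maxTime>900000</maxTime>
>     </autoSoftCommit>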
>
> Now we have two issues with indexing.
>
> 1) Solr just could not keep up with the bulk indexing when replicas
> are also active. We concluded this by changing the number of replicas
> to 2, then 4, then 15: as the number of replicas increases, the bulk
> indexing time increases almost exponentially. We seem to have hit the
> same issue reported here:
> https://issues.apache.org/jira/browse/SOLR-6816
> It gets to the point that indexing even 100 docs takes the cluster 300
> seconds. It starts off indexing 100 docs in 55 milliseconds, slowly
> degrades over time, and within an hour and a half it just cannot keep
> up. We have a workaround: we stop all the replicas, do the bulk
> indexing, and bring the replicas back up one by one. This sort of
> defeats the purpose of SolrCloud, but we can live with it, because
> bulk indexing happens on the collection that is not serving live
> traffic. However, we would love a solution from SolrCloud itself,
> e.g. an API to stop replication before indexing and restart it at the
> end.
>
> 2) This issue is related to soft commits with incremental indexing.
> Incremental indexing is done on the same collection that is serving
> live traffic at 300 requests per second. Everything is fine except
> when the soft commit happens. Each time the soft commit
> (autoSoftCommit in solrconfig.xml) fires, which BTW happens at almost
> the same time throughout the cluster, there is a spike in response
> times and throughput drops to almost 150 TPS. The spike lasts about 2
> minutes and then repeats at exactly the soft commit interval. We have
> monitored the logs and found a direct correlation between the soft
> commits and the response times tanking.
>
> The latter issue is quite disturbing, because that collection is
> serving live traffic and we cannot sustain these periodic
> degradations. We have played around with different soft commit
> settings: intervals ranging from 2 minutes to 30 minutes; autowarming
> half the cache, the full cache, or only 10% of it; running warmup
> queries on every new searcher, or none at all. All the different
> settings yield the same result: as soon as the soft commit happens,
> response time tanks and throughput decreases. The difference is
> almost 50% in response times and 50% in throughput.
>
> Our workaround for this is to also do the incremental delta indexing
> on the collection not serving live traffic and swap when it is done.
> As you can see, this also defeats the purpose of SolrCloud: we cannot
> do bulk indexing because the replicas cannot keep up, and we cannot do
> incremental indexing because of the soft commit performance.
>
> Is there a way to make the cluster not do the soft commit all at the
> same time, or a way to make the soft commit not cause this
> degradation? We are open to any ideas at this time.
>
> --
> *********************************************
> Vijay Sekhri
> *********************************************
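P.S. On the "not all at the same time" question: one thing you could
experiment with (a sketch only, untested against your setup) is dropping
autoSoftCommit from solrconfig.xml and firing soft commits yourself from
a scheduler, so you control when they happen:

    # explicit soft commit via the update handler; host/port are placeholders
    curl 'http://somehost:8983/solr/search1/update?commit=true&softCommit=true&waitSearcher=false'

Keep in mind a commit is still distributed across the whole collection,
so this changes when the cost is paid, not the fact that every replica
pays it.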