On 5/1/2013 3:39 AM, Annette Newton wrote:
> We have a 4 shard - 2 replica solr cloud setup, each with about 26GB of
> index. A total of 24,000,000 documents. We issued a rather large delete
> yesterday morning to reduce that size by about half. This resulted in the
> loss of all shards while the delete was taking place, but when it had
> apparently finished, as soon as we started writing again we continued to
> lose shards.
>
> We have also issued much smaller deletes and lost shards, but before they
> have always come back ok. This time we couldn't keep them online. We
> ended up rebuilding our cloud setup and switching over to it.
>
> Is there a better process for deleting documents? Is this expected
> behaviour?
How was the delete composed? Was it a single request with a simple query, or was it a huge list of IDs or a huge query? Was it millions of individual delete queries? All of those should be fine, but the last option is the hardest on Solr, especially if you are doing a lot of commits at the same time. (There is a delete-by-query example at the end of this message.)

You might need to increase the zkClientTimeout value on your startup command line or in solr.xml; an example of that follows as well.

How many machines do your eight SolrCloud replicas live on? How much RAM do they have? How much of that memory is allocated to the Java heap?

Assuming that your SolrCloud lives on eight separate machines that each have a 26GB index, I hope that you have 16 to 32 GB of RAM on each of those machines, and that a large chunk of that RAM is NOT allocated to Java or any other program. If you don't, it will be very difficult to get good performance out of Solr, especially for index commits. If you have multiple 26GB shards per machine, you'll need even more free memory. The operating system uses that free memory to cache your index files.

Another possible problem here is Java garbage collection pauses. With a large max heap and an untuned GC configuration, a full collection can pause Solr for longer than the ZooKeeper timeout, at which point shards get marked as down. The fix is to reduce your heap and/or to tune Java's garbage collection.
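For reference, a single delete-by-query request can be sent from the command line like this. The collection name and the date query are made up; substitute whatever identifies the documents you want gone:

  curl "http://localhost:8983/solr/collection1/update?commit=true" \
    -H "Content-Type: text/xml" \
    --data-binary "<delete><query>timestamp:[* TO 2013-04-01T00:00:00Z]</query></delete>"

One request like that is far gentler on the cluster than millions of individual deletes, but a query matching half of a 26GB index will still trigger heavy merge and I/O activity, so expect the machines to be busy while it runs.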
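For the timeout, this is roughly what the setting looks like in a 4.x solr.xml; the attribute goes on the <cores> element, and the 30000 (30 seconds, up from the 15-second default) is only an illustration:

  <cores adminPath="/admin/cores"
         host="${host:}" hostPort="${jetty.port:}"
         zkClientTimeout="${zkClientTimeout:30000}">

Because of the property syntax shown there, the same value can also be overridden at startup with -DzkClientTimeout=30000 on the java command line.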
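To make the memory arithmetic concrete: on a hypothetical machine with 32GB of RAM and one 26GB shard, starting Jetty like this leaves roughly 24GB (minus whatever the OS and other processes take) for caching index files:

  java -Xms8g -Xmx8g -jar start.jar

The 8GB heap is only an illustration; the right number depends on your Solr cache sizes and query load. The general idea is to keep -Xmx as small as you can get away with, so the rest of the RAM stays available to the OS disk cache.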
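If you do end up tuning garbage collection, the CMS collector is a common starting point on the Java 6/7 JVMs that Solr 4.x typically runs on. These flags are a sketch of that approach, not a tested recipe for your hardware:

  java -Xms8g -Xmx8g \
       -XX:+UseConcMarkSweepGC -XX:+UseParNewGC \
       -XX:CMSInitiatingOccupancyFraction=75 \
       -XX:+UseCMSInitiatingOccupancyOnly \
       -jar start.jar

Whatever settings you land on, test them under real indexing and query load before trusting them in production.

Thanks,
Shawn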