On 12/18/2017 7:36 AM, Susheel Kumar wrote:
Yes, Emir. If I repeated the query, it would spread to other nodes, but
that's not the case here. This is my test env and I am deliberately
executing the query with a very high offset and a wildcard to cause OOM,
but executing it only one time.
So it shouldn't spread to the other replica sets, and at the end of my
test the first 6 shard/replica sets that get hit should go down while the
other 6 should survive, but that's not what I see at the end.
Setup: 400+ million docs, JVM is 12GB. Yes, only one collection. Total
12 machines with 6 shards and 6 replicas (replicationFactor = 2)
Do you know exactly which OOME you are encountering? Is it "java
heap space" or something else?
While ordinarily I would expect multiple replicas in SolrCloud to ensure
high availability, OutOfMemoryError is a special class of problem. When
you encounter OOME on one server, it is likely that other similarly
equipped/configured servers in the cloud will *also* encounter the
error, and if that happens, you're not going to have high availability.
To eliminate the problem, you're going to have to figure out which
resource is being depleted and either increase that resource or change
things so that Solr doesn't need as much of that resource.
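
As a rough illustration of why a very high offset depletes heap: in a
distributed query, each shard has to collect its top (start + rows)
entries, and the node coordinating the request merges the entries
returned by every shard. The sketch below only does that arithmetic. The
start and rows values are hypothetical, since the actual offset was not
given, and the function name is made up for this example.

```python
def deep_paging_entries(start: int, rows: int, num_shards: int) -> dict:
    """Estimate how many sorted entries a distributed start/rows query touches.

    Each shard must rank its top (start + rows) documents, and the
    coordinating node may have to merge that many entries from every shard.
    """
    per_shard = start + rows
    return {
        "per_shard": per_shard,
        "coordinator": per_shard * num_shards,
    }


# Hypothetical request: the real offset used in the test is not known.
estimate = deep_paging_entries(start=10_000_000, rows=10, num_shards=6)
print(estimate)
```

With 6 shards, a single request like this is sent to one replica of every
shard, so one query can put memory pressure on several nodes at once. How
many bytes each entry costs depends on the sort and the Solr version, so
treat the numbers as entry counts, not a heap estimate.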
Thanks,
Shawn