Check whether the OOM killer script was called. If it was, there will be log files in your Solr logs directory whose names make that obvious. I've seen nodes mysteriously disappear as a result of this, with no message at all in the regular Solr logs. If that's the case, you need to increase your heap.
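If you'd rather script the check than eyeball the directory, here is a minimal sketch. It assumes the OOM killer logs follow the solr_oom_killer-<port>-<timestamp>.log naming that the stock bin/oom_solr.sh uses, and /var/solr/logs is only a guess at where your logs live. The function name is mine, not anything shipped with Solr.

```python
import sys
from pathlib import Path


def find_oom_killer_logs(logs_dir):
    """Return the OOM killer log files found in logs_dir, sorted by name."""
    # Assumption: the script names its output solr_oom_killer-<port>-<timestamp>.log.
    # Adjust the pattern if your install uses a customized script.
    return sorted(Path(logs_dir).glob("solr_oom_killer-*.log"))


if __name__ == "__main__":
    # Assumption: /var/solr/logs is a placeholder default; pass your own path.
    logs_dir = sys.argv[1] if len(sys.argv) > 1 else "/var/solr/logs"
    hits = find_oom_killer_logs(logs_dir)
    if hits:
        print("OOM killer ran; found:")
        for path in hits:
            print("  ", path)
    else:
        print("No OOM killer logs found in", logs_dir)
```

Run it on each node with the logs directory as the only argument. An empty result only means no matching file is there now, so rotated or cleaned-up logs could still hide an earlier kill.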
Erick

On Wed, Sep 18, 2019 at 8:21 AM Shawn Heisey <apa...@elyograg.org> wrote:
>
> On 9/18/2019 6:11 AM, Shawn Heisey wrote:
> > On 9/17/2019 9:35 PM, Hongxu Ma wrote:
> >> My questions:
> >>
> >>   * Is this error possible caused by "long gc pause"? my solr
> >>     zkClientTimeout=60000
> >
> > It's possible. I can't say for sure that this is the issue, but it
> > might be.
>
> A followup. I was thinking about the interactions here. It looks like
> Solr only waits four seconds for the leader election, and both of the
> pauses you mentioned are longer than that.
>
> Four seconds is probably too short a time to wait, and I do not think
> that timeout is configurable anywhere.
>
> > What version of Solr do you have, and what is your max heap? The CMS
> > garbage collection that Solr 5.0 and later incorporate by default is
> > pretty good. My G1 settings might do slightly better, but the
> > improvement won't be dramatic unless your existing commandline has
> > absolutely no gc tuning at all.
>
> That question will be important. If you already have our CMS GC tuning,
> switching to G1 probably is not going to solve this. Lowering the max
> heap might be the only viable solution in that case, and depending on
> what you're dealing with, it will either be impossible or it will
> require more servers.
>
> Thanks,
> Shawn