Hi Shawn,

Thank you for the reply and for your advice; I will try all of it today. Some of the suggestions are already applied, namely "Stop other software" and "zkClientTimeout". The timeout is set to 60 seconds. I also reduced the autowarm count and increased the autoCommit interval to 5 minutes.
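For reference, the settings mentioned above would look roughly like this. This is only a sketch, not a copy of my actual files: it assumes a Solr 4.x layout with the legacy solr.xml format, and the cache class, cache sizes, the autowarmCount value and the openSearcher setting are illustrative assumptions. Only the 60-second timeout and the 5-minute autoCommit interval are the values I actually use.

```xml
<!-- solr.xml (legacy 4.x format): ZooKeeper client timeout of 60 seconds.
     Other attributes of <cores> are omitted here. -->
<cores adminPath="/admin/cores" zkClientTimeout="${zkClientTimeout:60000}">
  <core name="crm-prod" instanceDir="crm-prod"/>
</cores>

<!-- solrconfig.xml: hard autoCommit every 5 minutes (300000 ms).
     openSearcher=false is an assumption, not stated above. -->
<updateHandler class="solr.DirectUpdateHandler2">
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
  </updateLog>
  <autoCommit>
    <maxTime>300000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>
</updateHandler>

<!-- solrconfig.xml, inside <query>: reduced autowarming.
     The sizes and the autowarmCount value are placeholders. -->
<filterCache class="solr.FastLRUCache"
             size="512"
             initialSize="512"
             autowarmCount="16"/>
```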
The situation has improved and the number of errors has decreased. There have been only a few errors since yesterday, on both shards:

  forwarding update to http://208.85.150.171:8090/solr/crm-prod/ failed - retrying ...
  IOException occured when talking to server at: http://208.85.150.171:8081/solr/crm-prod

On the replicas there are only warnings:

  too many updates received since start - startingUpdates no longer overlaps with our currentUpdates
  Starting/stopping log replay

But the load has decreased as well, so this may not be only the effect of the new settings.

When I spoke about being production-ready I meant document loss. Even though SolrCloud uses a journal (the transaction log), this does not appear to guarantee that the data will be indexed. How did it happen that the journal was replayed and truncated, but the docs are not in the index?

I think that if Solr accepted a request, the docs should appear in the index. If there are fewer resources than required, it could work slowly, could crash, could stop working, but data should not be lost: the log should be in place and replayed after a restart (as it is supposed to be). That is my point of view, because there is no way my index queue workers could check for Solr failures after getting a successful response from it.

I agree, there is too much software for this hardware :) The load average is now under 16. Previously we only had a single Solr instance on this server and decided to switch to SolrCloud to improve the search speed. It really did become much faster, but it also became unreliable under high load. Perhaps that really is because of some server configuration; I am checking now.

Thank you,
Alex

--
View this message in context: http://lucene.472066.n3.nabble.com/ColrCloud-IOException-occured-when-talking-to-server-at-tp4061831p4062234.html
Sent from the Solr - User mailing list archive at Nabble.com.