On 4/3/2017 7:52 AM, Salih Sen wrote:
> We have a three server set up with each server having 756G ram, 48
> cores, 4 SSDs (each having three solr instances on them) and a dedicated
> mechanical disk for zookeeper (3 zk instances total). Each Solr
> instance has 31G of heap space allocated to it. In total we have
> 36 Solr instances and 3 Zookeeper instances (with 1G heapspace). Also
> the servers have a 10Gig network between them.
You haven't described your index(es). How many collections are in the cloud? How many shards for each? How many replicas for each shard? How many docs in each collection? How much *total* index data is on each of those systems? To determine this, add up the size of the solr home in every Solr instance that exists on that server. With this information, we can make an educated guess about whether the setup you have engineered is reasonably correct for the scale of your data.

It sounds like you have twelve Solr instances per server, with each one using a 31GB heap. That's 372GB of memory JUST for Solr heaps. Unless you're dealing with terabytes of index data and hundreds of millions (or billions) of documents, I cannot imagine needing that many Solr instances per server or that much heap memory.

Have you increased the maximum number of processes that the user running Solr can have? Twelve instances of Solr is going to be a LOT of threads, and on most operating systems, each thread counts against the user process limit. Some operating systems might have a separate configuration for thread limits, but I do know that Linux does not, and counts them as processes.

> We set Auto hardcommit time to 15sec and 10000 docs, and soft commit
> to 60000 sec and 5000 seconds in order to avoid soft committing too
> much and to avoid indexing bottlenecks. We also
> set -DzkClientTimeout=90000.

Side issue: It's generally preferable to use only maxDocs or maxTime, and maxTime will usually result in more predictable behavior, so I recommend removing the maxDocs settings on autoCommit and autoSoftCommit. I doubt this will have any effect on the problem you're experiencing, just something I noticed.

I recommend a maxTime of 60000 (one minute) for autoCommit, with openSearcher set to false, and a maxTime of at least 120000 (two minutes) for autoSoftCommit. If these seem excessively high to you, go with 30000 and 60000. There's a sketch of what that looks like in solrconfig.xml at the end of this message.

On zkClientTimeout: unless you have increased the ZK server tickTime, you'll find that you can't actually define a zkClientTimeout that high. The maximum is 20*tickTime. A typical tickTime value is 2000 milliseconds, which means that the usual maximum value for zkClientTimeout is 40 seconds. The error you've reported doesn't look related to zkClientTimeout, so increasing it beyond 30 seconds is probably unnecessary. The default values for Zookeeper server tuning have been worked on by the ZK developers for years. I wouldn't mess with tickTime without a REALLY good reason.

Another side issue: Putting Zookeeper data on a mechanical disk when there are SSDs available seems like a mistake to me. Zookeeper is even more sensitive to disk performance than Solr is.

> But it seems replicas still randomly go down while indexing. Do you
> have any suggestions to prevent this situation?

<snip>

> Caused by: java.net.SocketTimeoutException: Read timed out

This error says that a TCP connection (http on port 9132) from one Solr server to another hit the socket timeout -- there was no activity on the connection for whatever the timeout is set to. A problem like this usually has one of two causes:

1) A *serious* performance issue with Solr resulting in an incredibly long processing time. Most performance issues are memory-related.

2) The socket timeout has been set to a very low value. In a later message on the thread, you indicated that the configured socket timeout is ten minutes.
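For reference, I'm assuming that socket timeout is one of the inter-node timeouts defined in solr.xml; if yours is set somewhere else, adjust accordingly. In a reasonably stock solr.xml those settings look something like this (each value can be overridden with a system property, and the number after the colon is the default):

    <solr>
      <solrcloud>
        <!-- ZK session timeout -->
        <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
        <!-- socket/connect timeouts for distributed updates between replicas -->
        <int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
        <int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
      </solrcloud>
      <!-- socket/connect timeouts for distributed queries -->
      <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
        <int name="socketTimeout">${socketTimeout:600000}</int>
        <int name="connTimeout">${connTimeout:60000}</int>
      </shardHandlerFactory>
    </solr>

In the stock file, both socket timeouts default to 600000 milliseconds, which lines up with the ten minutes you mentioned.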
Ten minutes should be plenty, which makes me think option number one above is what we are dealing with. The information I asked for in the first paragraph of this reply is needed for any deeper insight.

Are there other errors in the Solr logfile that you haven't included? It seems likely that this is not the only problem Solr has encountered.

Thanks,
Shawn
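P.S. To make the commit suggestion above concrete, here is a sketch of what it would look like in solrconfig.xml. The numbers are simply the ones I recommended, nothing magic; tune them as you see fit:

    <!-- hard commit once a minute, never opening a new searcher -->
    <autoCommit>
      <maxTime>60000</maxTime>
      <openSearcher>false</openSearcher>
    </autoCommit>

    <!-- soft commit (document visibility) every two minutes -->
    <autoSoftCommit>
      <maxTime>120000</maxTime>
    </autoSoftCommit>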