On 9/3/2015 12:06 AM, Arnon Yogev wrote:
> I wanted to ask about the implications of different timeout values one can
> use.
>
> For example:
> From what I see in the code, the default socket timeout value for Solr is
> 0.
> Does that mean Solr nodes will wait to update \ receive update from each
> other without any timeout?

The socket timeout is a property of the TCP connection, which is ultimately handled by the operating system. Solr uses HTTP, which is a TCP-based protocol. This is not specific to Solr. A value of zero means the operating system won't time out and disconnect the TCP session.

Generally you want your servers to have no socket timeout, and depending on exactly what you are doing, *maybe* you will configure a socket timeout on the client side. For zookeeper, there is no need to have a socket timeout, as you will see when I continue below.

> In other words, can the following scenario happen:
> 1. One solr node becomes very slow for some reason, but is still
> considered alive in ZK.
> 2. Other servers in the cluster try to update \ receive updates from this
> node, but do not get responds.
> 3. Since there's no timeout defined, all nodes in the cluster will
> eventually become unresponsive (when the thread pool is exhausted).

Even though the socket timeout is generally zero so the OS won't terminate idle TCP connections, the application can take care of timeouts and terminations.

Solr configures a zkClientTimeout. If I remember my last dive into SolrCloud code correctly, this is transferred pretty much straight across to the zookeeper client as its session timeout. If this timeout is exceeded on pretty much any inter-server communication, SolrCloud will generally mark the node down.

Historically there have been a lot of problems with SolrCloud nodes being marked down due to garbage collection pauses that exceed the timeout. Since 5.0 this should be less of a problem, because the included start scripts have aggressive GC tuning.

The zkClientTimeout defaults to 15 seconds internally inside Solr if you do not have any configuration that sets the value, but most recent Solr example configurations set it to 30 seconds. In most situations, a 15 second timeout is VERY long ... if that's being exceeded, there is usually a serious problem that needs fixing.

Thanks,
Shawn
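
P.S. To make the "timeout of zero" point concrete, here is a minimal sketch using plain java.net sockets. It is not Solr code and does not reflect how Solr configures its HTTP client; the class and method names (SocketTimeoutDemo, defaultSoTimeout, readTimesOut) are made up for illustration. In Java the socket timeout (SO_TIMEOUT) applies to blocking reads: a fresh socket reports 0, meaning a read waits indefinitely, while a client that sets a non-zero value gets a SocketTimeoutException if the peer stays silent.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class, not part of Solr.
final class SocketTimeoutDemo {

    /** Returns SO_TIMEOUT of a fresh socket; 0 means reads block with no time limit. */
    static int defaultSoTimeout() {
        try (Socket s = new Socket()) {
            return s.getSoTimeout();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Connects to a local server that accepts the connection but never sends
     * anything, then reports whether a client read gives up after timeoutMillis.
     */
    static boolean readTimesOut(int timeoutMillis) {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket accepted = server.accept()) {
            // Client-side read timeout; the silent server stands in for a "slow node".
            client.setSoTimeout(timeoutMillis);
            try {
                client.getInputStream().read();
                return false; // data or EOF arrived before the timeout
            } catch (SocketTimeoutException e) {
                return true;  // the read was abandoned after timeoutMillis
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("default SO_TIMEOUT = " + defaultSoTimeout());
        System.out.println("read timed out = " + readTimesOut(200));
    }
}
```

With the timeout left at zero, the same read would simply keep waiting, which is why any protection against a stuck peer has to come from an application-level timeout such as the zkClientTimeout described above.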