Hello,
 
We are using Solr 8.5.2.
 
We are having trouble dealing with network errors between a Solr node and 
a client.
In our situation, our Solr nodes and ZK hosts are healthy and can communicate 
with each other, and all our collections are up and healthy.
 
When we simulate a network problem between a client and a Solr node (whilst 
maintaining the connections and healthy status of everything else), our admin 
health check (HealthCheckRequest) fails with a
"org.apache.solr.client.solrj.SolrServerException: IOException occurred when 
talking to server at: https://solr2:8984/solr"
with the root cause being a 
"java.net.SocketTimeoutException: connect timed out"
(seen in LBSolrClient).
 
For admin commands, it appears that the client's Zombie list is only updated, 
and the operation only continues, when the root cause is a ConnectException. 
We can confirm that with a ConnectException (substituted manually in the 
debugger) everything works as we would like: the operation succeeds, and 
subsequent calls to the client treat our blocked node as a Zombie.

A SocketTimeoutException neither updates the client's Zombie list nor lets the 
operation continue; an overall exception is thrown instead. 
Because the Zombie list is not updated, the next time we try with the same 
client we hit the same problem: the blocked node is still the first one 
returned in the live nodes list, so it is the first one the request is sent to.
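
To make the distinction concrete, here is a minimal sketch of the kind of 
root-cause check we have in mind. It is not SolrJ code: the class 
ConnectFailureClassifier and its methods are hypothetical names of our own. It 
assumes that a connect timeout can be recognised as a SocketTimeoutException 
whose message contains "connect timed out", which is what our stack trace 
shows, and that such a failure means the request never reached the server.

```java
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.Locale;

/** Hypothetical helper (not part of SolrJ) that classifies a failure by its root cause. */
final class ConnectFailureClassifier {

    private ConnectFailureClassifier() {}

    /** Walks the cause chain to the deepest cause, with a depth cap to guard against cycles. */
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        for (int depth = 0; depth < 32; depth++) {
            Throwable cause = current.getCause();
            if (cause == null || cause == current) {
                break;
            }
            current = cause;
        }
        return current;
    }

    /**
     * Returns true when the connection was never established, so the request
     * cannot have reached the server: either the root cause is a
     * ConnectException or the connect attempt itself timed out.
     * Assumption: a connect timeout is identified by its message, compared
     * case-insensitively because the wording may vary between JDKs.
     */
    static boolean neverReachedServer(Throwable t) {
        Throwable root = rootCause(t);
        if (root instanceof ConnectException) {
            return true;
        }
        if (root instanceof SocketTimeoutException) {
            String message = root.getMessage();
            return message != null
                    && message.toLowerCase(Locale.ROOT).contains("connect timed out");
        }
        return false;
    }
}
```

With a check like this, a read timeout (where the request may already have 
been processed) would still be treated as non-retryable, while a connect 
timeout would be handled like a refused connection.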
 
How can we work around this?
 
We have drilled down into the LBSolrClient to have a look.
 
Our main concern is that we believe that this will also be a problem for us 
with Updates.
 
 
An example scenario:
Solr1 on server Solr1
Solr2 on server Solr2
A collection with replication factor 2 with replicas for each shard being 
hosted on both Solr nodes.
An application server is on ApplicationServer1.
Another application server is on ApplicationServer2.
 
The Solr Nodes are up and the collection is healthy.
 
(Depending on the order of the live nodes)
If access is blocked to Solr2 from ApplicationServer1, update from 
ApplicationServer1 should succeed and a health check/ping from 
ApplicationServer1 should return "healthy".
Update from ApplicationServer2 should succeed and health check/ping from 
ApplicationServer2 should return "healthy".
 
If access is then unblocked to Solr2 from ApplicationServer1 but blocked to 
Solr1, then update from ApplicationServer1 fails and a health check/ping from 
ApplicationServer1 throws an exception.
Update from ApplicationServer2 should succeed and health check/ping from 
ApplicationServer2 should return "healthy".
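
The behaviour we would like in this scenario can be modelled without Solr at 
all. The sketch below is only an illustration under our own assumptions: 
ZombieAwareFailover, NodeCall and neverConnected are hypothetical names, not 
SolrJ APIs, and the model is single-threaded. It treats a connect timeout like 
a ConnectException, marks the node as a zombie, and moves on to the next node.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.ConnectException;
import java.net.SocketTimeoutException;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical, Solr-free model of a client that keeps its own zombie list. Not thread-safe. */
final class ZombieAwareFailover {

    /** One attempt against one node; stands in for the real HTTP request. */
    interface NodeCall<T> {
        T call(String baseUrl) throws IOException;
    }

    private final List<String> liveNodes;
    private final Set<String> zombies = new LinkedHashSet<>();

    ZombieAwareFailover(List<String> liveNodes) {
        this.liveNodes = new ArrayList<>(liveNodes);
    }

    /** Read-only view of the nodes currently considered unreachable. */
    Set<String> zombies() {
        return Collections.unmodifiableSet(zombies);
    }

    /** Tries nodes not marked as zombies first, then known zombies as a last resort. */
    <T> T request(NodeCall<T> call) {
        List<String> order = new ArrayList<>();
        for (String node : liveNodes) {
            if (!zombies.contains(node)) order.add(node);
        }
        for (String node : liveNodes) {
            if (zombies.contains(node)) order.add(node);
        }
        IOException last = new IOException("No nodes configured");
        for (String node : order) {
            try {
                T result = call.call(node);
                zombies.remove(node); // the node is reachable (again)
                return result;
            } catch (IOException e) {
                if (!neverConnected(e)) {
                    // The request may have reached the server, so do not resend it.
                    throw new UncheckedIOException(e);
                }
                zombies.add(node);
                last = e;
            }
        }
        throw new UncheckedIOException(last);
    }

    /** True if the deepest cause is a ConnectException or a connect timeout (matched by message). */
    private static boolean neverConnected(Throwable t) {
        Throwable root = t;
        for (int depth = 0; depth < 32 && root.getCause() != null && root.getCause() != root; depth++) {
            root = root.getCause();
        }
        if (root instanceof ConnectException) {
            return true;
        }
        String message = root.getMessage();
        return root instanceof SocketTimeoutException && message != null
                && message.toLowerCase(Locale.ROOT).contains("connect timed out");
    }
}
```

In this model, if Solr2 is first in the node list and is blocked, the first 
request tries Solr2, marks it as a zombie and succeeds against Solr1; the next 
request goes to Solr1 directly.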
 
Redacted stacktrace:

[err] org.apache.solr.client.solrj.SolrServerException: IOException occurred when talking to server at: https://solr2:8984/solr
[err]     at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:695)
[err]     at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:266)
[err]     at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:248)
[err]     at org.apache.solr.client.solrj.impl.LBSolrClient.doRequest(LBSolrClient.java:370)
[err]     at org.apache.solr.client.solrj.impl.LBSolrClient.request(LBSolrClient.java:298)
[err]     at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.sendRequest(BaseCloudSolrClient.java:1157)
[err]     at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.requestWithRetryOnStaleState(BaseCloudSolrClient.java:918)
[err]     at org.apache.solr.client.solrj.impl.BaseCloudSolrClient.request(BaseCloudSolrClient.java:850)
[err]     at <Redacted internal package that calls through to the SolrClient> (SolrClientProxy.java:136)
[err]     at <Redacted internal calls>
[err]     at <Redacted internal calls>
[err]     at <Redacted internal calls>
[err]     at <Redacted internal calls>
[err]     at <Redacted internal calls>
[err]     at <Redacted internal calls>
[err]     at <Redacted internal calls>
[err] Caused by: 
[err] org.apache.http.conn.ConnectTimeoutException: Connect to solr2:8984 [solr2/172.18.0.6] failed: connect timed out
[err]     at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:151)
[err]     at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.connect(PoolingHttpClientConnectionManager.java:374)
[err]     at org.apache.http.impl.execchain.MainClientExec.establishRoute(MainClientExec.java:393)
[err]     at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:236)
[err]     at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
[err]     at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
[err]     at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
[err]     at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
[err]     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
[err]     at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56)
[err]     at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:571)
[err]     ... 22 more
[err] Caused by: 
[err] java.net.SocketTimeoutException: connect timed out
[err]     at java.net.PlainSocketImpl.socketConnect(Native Method)
[err]     at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
[err]     at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
[err]     at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
[err]     at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
[err]     at java.net.Socket.connect(Socket.java:607)
[err]     at org.apache.http.conn.ssl.SSLConnectionSocketFactory.connectSocket(SSLConnectionSocketFactory.java:368)
[err]     at org.apache.http.impl.conn.DefaultHttpClientConnectionOperator.connect(DefaultHttpClientConnectionOperator.java:142)
[err]     ... 32 more
 
 
We were also wondering why admin requests that do not modify anything, e.g. a 
Ping or a HealthCheck, are treated as nonRetryable. They should be idempotent 
too, shouldn't they?
 
Thanks!
Lisa