Hi all,

we recently upgraded our SolrCloud cluster from version 7.7.1 to version 8.5.0 and ran into multiple problems. In the end we had to revert the upgrade and went back to Solr 7.7.1.
We have been using Solr in our company since version 4, and until now every upgrade to a newer version went through without any problems. We are curious whether others are experiencing the same kind of problems and whether these are known issues. Or maybe we did something wrong and missed something when upgrading?

1. Network issues when indexing documents
=======================================

Our collection contains roughly 150 million documents. When we re-created the collection and re-indexed all documents, we regularly experienced network problems that caused our loader application to fail. The Solr log always contains an IOException:

ERROR (updateExecutor-5-thread-1338-processing-x:PSMG_CI_2020_04_15_10_07_04_shard6_replica_n22 r:core_node25 null n:solr2:8983_solr c:PSMG_CI_2020_04_15_10_07_04 s:shard6) [c:PSMG_CI_2020_04_15_10_07_04 s:shard6 r:core_node25 x:PSMG_CI_2020_04_15_10_07_04_shard6_replica_n22] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling SolrCmdDistributor$Req: cmd=add{,id=(null)}; node=StdNode: http://solr1:8983/solr/PSMG_CI_2020_04_15_10_07_04_shard6_replica_n20/ to http://solr1:8983/solr/PSMG_CI_2020_04_15_10_07_04_shard6_replica_n20/ => java.io.IOException: java.io.IOException: cancel_stream_error
    at org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:197)
java.io.IOException: java.io.IOException: cancel_stream_error
    at org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:197) ~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]
    at org.eclipse.jetty.client.util.OutputStreamContentProvider$DeferredOutputStream.flush(OutputStreamContentProvider.java:151) ~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]
    at org.eclipse.jetty.client.util.OutputStreamContentProvider$DeferredOutputStream.write(OutputStreamContentProvider.java:145) ~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]
    at org.apache.solr.common.util.FastOutputStream.flush(FastOutputStream.java:216) ~[solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
    at org.apache.solr.common.util.FastOutputStream.flushBuffer(FastOutputStream.java:209) ~[solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
    at org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:172) ~[solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]

After the exception, the collection was usually in a degraded state for some time while the shards tried to recover and sync with the leader.

In the Solr changelog we saw that one major change from 7.x to 8.x was that Solr now uses HTTP/2 instead of HTTP/1.1. So we tried to disable HTTP/2 by setting the system property solr.http1=true. That made the indexing process a LOT more stable, but we still saw IOExceptions from time to time. Disabling HTTP/2 did not completely fix the problem:

ERROR (updateExecutor-5-thread-9310-processing-x:PSMG_BOM_2020_04_28_05_00_11_shard7_replica_n24 r:core_node27 null n:solr3:8983_solr c:PSMG_BOM_2020_04_28_05_00_11 s:shard7) [c:PSMG_BOM_2020_04_28_05_00_11 s:shard7 r:core_node27 x:PSMG_BOM_2020_04_28_05_00_11_shard7_replica_n24] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling SolrCmdDistributor$Req: cmd=add{,id=5141653a-e33a-4b60-856d-7aa2ce73dee7}; node=ForwardNode: http://solr2:8983/solr/PSMG_BOM_2020_04_28_05_00_11_shard6_replica_n22/ to http://solr2:8983/solr/PSMG_BOM_2020_04_28_05_00_11_shard6_replica_n22/ => java.io.IOException: java.io.EOFException: HttpConnectionOverHTTP@9dc7ad1::SocketChannelEndPoint@2d20213b{solr2/10.0.0.216:8983<->/10.0.0.193:38728,ISHUT,fill=-,flush=-,to=5/600000}{io=0/0,kio=0,kro=1}->HttpConnectionOverHTTP@9dc7ad1(l:/10.0.0.193:38728 <-> r:solr2/10.0.0.216:8983,closed=false)=>HttpChannelOverHTTP@47a242c3(exchange=HttpExchange@6ffd260f req=PENDING/null@null res=PENDING/null@null)[send=HttpSenderOverHTTP@17e056f9(req=CONTENT,snd=IDLE,failure=null)[HttpGenerator@3b6594c7{s=COMMITTED}],recv=HttpReceiverOverHTTP@6e847d32(rsp=IDLE,failure=null)[HttpParser{s=CLOSED,0 of -1}]]
    at org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:197)
java.io.IOException: java.io.EOFException: HttpConnectionOverHTTP@9dc7ad1::SocketChannelEndPoint@2d20213b{solr2/10.0.0.216:8983<->/10.0.0.193:38728,ISHUT,fill=-,flush=-,to=5/600000}{io=0/0,kio=0,kro=1}->HttpConnectionOverHTTP@9dc7ad1(l:/10.0.0.193:38728 <-> r:solr2/10.0.0.216:8983,closed=false)=>HttpChannelOverHTTP@47a242c3(exchange=HttpExchange@6ffd260f req=PENDING/null@null res=PENDING/null@null)[send=HttpSenderOverHTTP@17e056f9(req=CONTENT,snd=IDLE,failure=null)[HttpGenerator@3b6594c7{s=COMMITTED}],recv=HttpReceiverOverHTTP@6e847d32(rsp=IDLE,failure=null)[HttpParser{s=CLOSED,0 of -1}]]
    at org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:197) ~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]
    at org.eclipse.jetty.client.util.OutputStreamContentProvider$DeferredOutputStream.flush(OutputStreamContentProvider.java:151) ~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]

Our Solr nodes run inside Docker containers in a Docker Swarm cluster, and we use a software-defined overlay network (https://docs.docker.com/network/network-tutorial-overlay/#use-a-user-defined-overlay-network). Maybe the network problems are caused by the combination of the new HTTP/2 implementation and the overlay network? We never had any network issues in Solr 7 with an otherwise identical setup.

2. Incorrect Load Balancing
=======================================

Our SolrCloud cluster contains three nodes and we use a cluster of three ZooKeeper nodes. We initialize our CloudSolrClient with the addresses of our ZooKeeper nodes, and the CloudSolrClient should then load balance queries between the three Solr nodes. This works as expected in Solr 7. In Solr 8, however, we often see that the first Solr node receives twice as many queries as the second node, and the third node receives no queries at all.

3. Problems with indexing Child Documents
=======================================

When we index documents that contain Child Documents, our application regularly runs into a SocketTimeoutException:

{"@timestamp":"2020-04-29T06:56:31.587Z","level":"SEVERE","logger_name":"org.apache.solr.client.solrj.impl.BaseCloudSolrClient","thread_name":"concurrent/batchJobExecutorService-managedThreadFactory-Thread-17","log_message": "Request to collection [PSMG_BOM_2020_04_29_06_52_36] failed due to (0) java.net.SocketTimeoutException: Read timed out, retry=0 commError=false errorCode=0 "}
{"@timestamp":"2020-04-29T06:56:31.588Z","level":"INFO","logger_name":"org.apache.solr.client.solrj.impl.BaseCloudSolrClient","thread_name":"concurrent/batchJobExecutorService-managedThreadFactory-Thread-17","log_message": "request was not communication error it seems"}

Indexing Child Documents seems to be significantly slower in Solr 8 than in Solr 7. We set a timeout of 2 minutes with CloudSolrClient.setSoTimeout(). In Solr 7, documents could be added within a few seconds, and a timeout of 2 minutes was more than enough.

Cheers,
Ludger