Hello Ludger,

I don't have answers to all of your questions, but #2 (Incorrect Load Balancing) is a known bug that will be fixed in 8.6. You can find more info at SOLR-14471 <https://issues.apache.org/jira/browse/SOLR-14471>.
- Houston

On Mon, May 11, 2020 at 8:16 AM Ludger Steens <ludger.ste...@qaware.de> wrote:

> Hi all,
>
> we recently upgraded our SolrCloud cluster from version 7.7.1 to version 8.5.0 and ran into multiple problems. In the end we had to revert the upgrade and went back to Solr 7.7.1.
>
> In our company we have been using Solr since version 4 and, so far, upgrading Solr to a newer version was always possible without any problems. We are curious whether others are experiencing the same kind of problems and whether these are known issues. Or maybe we did something wrong and missed something when upgrading?
>
> 1. Network issues when indexing documents
> =======================================
>
> Our collection contains roughly 150 million documents. When we re-created the collection and re-indexed all documents, we regularly experienced network problems that caused our loader application to fail. The Solr log always contains an IOException:
>
> ERROR (updateExecutor-5-thread-1338-processing-x:PSMG_CI_2020_04_15_10_07_04_shard6_replica_n22 r:core_node25 null n:solr2:8983_solr c:PSMG_CI_2020_04_15_10_07_04 s:shard6) [c:PSMG_CI_2020_04_15_10_07_04 s:shard6 r:core_node25 x:PSMG_CI_2020_04_15_10_07_04_shard6_replica_n22] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling SolrCmdDistributor$Req: cmd=add{,id=(null)}; node=StdNode: http://solr1:8983/solr/PSMG_CI_2020_04_15_10_07_04_shard6_replica_n20/ to http://solr1:8983/solr/PSMG_CI_2020_04_15_10_07_04_shard6_replica_n20/ => java.io.IOException: java.io.IOException: cancel_stream_error
>         at org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:197)
> java.io.IOException: java.io.IOException: cancel_stream_error
>         at org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:197) ~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]
>         at org.eclipse.jetty.client.util.OutputStreamContentProvider$DeferredOutputStream.flush(OutputStreamContentProvider.java:151) ~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]
>         at org.eclipse.jetty.client.util.OutputStreamContentProvider$DeferredOutputStream.write(OutputStreamContentProvider.java:145) ~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]
>         at org.apache.solr.common.util.FastOutputStream.flush(FastOutputStream.java:216) ~[solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
>         at org.apache.solr.common.util.FastOutputStream.flushBuffer(FastOutputStream.java:209) ~[solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
>         at org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:172) ~[solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
>
> After the exception, the collection usually was in a degraded state for some time while shards tried to recover and sync with the leader.
>
> In the Solr changelog we saw that one major change from 7.x to 8.x is that Solr now uses HTTP/2 instead of HTTP/1.1, so we tried to disable HTTP/2 by setting the system property solr.http1=true. That did make the indexing process a LOT more stable, but we still saw IOExceptions from time to time. Disabling HTTP/2 did not completely fix the problem.
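For reference, solr.http1 is a JVM system property, so it typically has to be set on every Solr node. A minimal sketch of one way to pass it, assuming the nodes pick it up via SOLR_OPTS (either in solr.in.sh or as an environment variable on the official Solr Docker image); the exact mechanism depends on the deployment and is not stated in the original mail:

    # solr.in.sh, or the SOLR_OPTS environment variable of the container
    SOLR_OPTS="$SOLR_OPTS -Dsolr.http1=true"

The error quoted below is an example of the IOExceptions that remained even with HTTP/2 disabled.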
> ERROR (updateExecutor-5-thread-9310-processing-x:PSMG_BOM_2020_04_28_05_00_11_shard7_replica_n24 r:core_node27 null n:solr3:8983_solr c:PSMG_BOM_2020_04_28_05_00_11 s:shard7) [c:PSMG_BOM_2020_04_28_05_00_11 s:shard7 r:core_node27 x:PSMG_BOM_2020_04_28_05_00_11_shard7_replica_n24] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling SolrCmdDistributor$Req: cmd=add{,id=5141653a-e33a-4b60-856d-7aa2ce73dee7}; node=ForwardNode: http://solr2:8983/solr/PSMG_BOM_2020_04_28_05_00_11_shard6_replica_n22/ to http://solr2:8983/solr/PSMG_BOM_2020_04_28_05_00_11_shard6_replica_n22/ => java.io.IOException: java.io.EOFException: HttpConnectionOverHTTP@9dc7ad1::SocketChannelEndPoint@2d20213b{solr2/10.0.0.216:8983<->/10.0.0.193:38728,ISHUT,fill=-,flush=-,to=5/600000}{io=0/0,kio=0,kro=1}->HttpConnectionOverHTTP@9dc7ad1(l:/10.0.0.193:38728 <-> r:solr2/10.0.0.216:8983,closed=false)=>HttpChannelOverHTTP@47a242c3(exchange=HttpExchange@6ffd260f req=PENDING/null@null res=PENDING/null@null)[send=HttpSenderOverHTTP@17e056f9(req=CONTENT,snd=IDLE,failure=null)[HttpGenerator@3b6594c7{s=COMMITTED}],recv=HttpReceiverOverHTTP@6e847d32(rsp=IDLE,failure=null)[HttpParser{s=CLOSED,0 of -1}]]
>         at org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:197)
> java.io.IOException: java.io.EOFException: HttpConnectionOverHTTP@9dc7ad1::SocketChannelEndPoint@2d20213b{solr2/10.0.0.216:8983<->/10.0.0.193:38728,ISHUT,fill=-,flush=-,to=5/600000}{io=0/0,kio=0,kro=1}->HttpConnectionOverHTTP@9dc7ad1(l:/10.0.0.193:38728 <-> r:solr2/10.0.0.216:8983,closed=false)=>HttpChannelOverHTTP@47a242c3(exchange=HttpExchange@6ffd260f req=PENDING/null@null res=PENDING/null@null)[send=HttpSenderOverHTTP@17e056f9(req=CONTENT,snd=IDLE,failure=null)[HttpGenerator@3b6594c7{s=COMMITTED}],recv=HttpReceiverOverHTTP@6e847d32(rsp=IDLE,failure=null)[HttpParser{s=CLOSED,0 of -1}]]
>         at org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:197) ~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]
>         at org.eclipse.jetty.client.util.OutputStreamContentProvider$DeferredOutputStream.flush(OutputStreamContentProvider.java:151) ~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]
>
> Our Solr nodes run inside Docker containers in a Docker Swarm cluster and we use a software-defined overlay network (https://docs.docker.com/network/network-tutorial-overlay/#use-a-user-defined-overlay-network). Maybe the reason for the network problems is the combination of the new HTTP/2 implementation and the overlay network? We never had any network issues with Solr 7 in an otherwise identical setup.
>
> 2. Incorrect Load Balancing
> =======================================
>
> Our SolrCloud cluster contains three nodes and we use a cluster of three ZooKeeper nodes. We initialize our CloudSolrClient with the addresses of our ZooKeeper nodes, and the CloudSolrClient should then load balance queries between the three Solr nodes. This works as expected in Solr 7. However, in Solr 8 we often see that the first Solr node receives twice as many queries as the second node, while the third node receives no queries at all.
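As a point of reference, here is a minimal SolrJ 8.x sketch of this kind of ZooKeeper-based client initialization; CloudSolrClient reads the cluster state from ZooKeeper and is expected to spread queries across the live Solr nodes. The ZooKeeper hostnames and collection name are illustrative, not taken from the original mail:

    import java.util.Arrays;
    import java.util.Optional;

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;

    public class QueryClient {
        public static void main(String[] args) throws Exception {
            // Build the client from the ZooKeeper ensemble; no Solr node URLs
            // are hard-coded, so request routing is decided by the client.
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                    Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"),
                    Optional.empty() /* no ZooKeeper chroot */).build()) {
                QueryResponse rsp = client.query("my_collection", new SolrQuery("*:*"));
                System.out.println("hits: " + rsp.getResults().getNumFound());
            }
        }
    }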
> 3. Problems with indexing Child Documents
> =======================================
>
> When we index documents that contain child documents, our application regularly runs into a SocketTimeoutException:
>
> {"@timestamp":"2020-04-29T06:56:31.587Z","level":"SEVERE","logger_name":"org.apache.solr.client.solrj.impl.BaseCloudSolrClient","thread_name":"concurrent/batchJobExecutorService-managedThreadFactory-Thread-17","log_message":"Request to collection [PSMG_BOM_2020_04_29_06_52_36] failed due to (0) java.net.SocketTimeoutException: Read timed out, retry=0 commError=false errorCode=0 "}
>
> {"@timestamp":"2020-04-29T06:56:31.588Z","level":"INFO","logger_name":"org.apache.solr.client.solrj.impl.BaseCloudSolrClient","thread_name":"concurrent/batchJobExecutorService-managedThreadFactory-Thread-17","log_message":"request was not communication error it seems"}
>
> Indexing child documents seems to be significantly slower in Solr 8 compared to Solr 7. We set a timeout value of 2 minutes with CloudSolrClient.setSoTimeout(). In Solr 7 documents could be added within a few seconds, and a timeout of 2 minutes was more than enough.
>
> Cheers,
> Ludger
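For completeness, below is a minimal SolrJ 8.x sketch of the kind of child-document indexing call described in #3, with a 2-minute read timeout configured on the client (the mail mentions CloudSolrClient.setSoTimeout(); the builder's withSocketTimeout is another way to set the same kind of read timeout). Collection name, document IDs and ZooKeeper hosts are illustrative; this is not Ludger's actual loader code:

    import java.util.Arrays;
    import java.util.Optional;

    import org.apache.solr.client.solrj.impl.CloudSolrClient;
    import org.apache.solr.common.SolrInputDocument;

    public class ChildDocIndexer {
        public static void main(String[] args) throws Exception {
            try (CloudSolrClient client = new CloudSolrClient.Builder(
                    Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181"),
                    Optional.empty())
                    .withSocketTimeout(120_000)   // read timeout: 2 minutes, in ms
                    .build()) {
                client.setDefaultCollection("my_collection");

                // Parent document with one nested child document.
                SolrInputDocument parent = new SolrInputDocument();
                parent.addField("id", "assembly-1");

                SolrInputDocument child = new SolrInputDocument();
                child.addField("id", "assembly-1-part-1");
                parent.addChildDocument(child);

                client.add(parent);
                client.commit();
            }
        }
    }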