Hello Ludger,

I don't have answers to all of your questions, but for #2 (Incorrect Load
Balancing) it is a bug that will be fixed in 8.6. You can find more info at
SOLR-14471 <https://issues.apache.org/jira/browse/SOLR-14471>.

- Houston

On Mon, May 11, 2020 at 8:16 AM Ludger Steens <ludger.ste...@qaware.de>
wrote:

> Hi all,
>
> we recently upgraded our SolrCloud cluster from version 7.7.1 to version
> 8.5.0 and ran into multiple problems.
> In the end we had to revert the upgrade and go back to Solr 7.7.1.
>
> In our company we have been using Solr since version 4, and so far
> upgrading Solr to a newer version has always been possible without any
> problems.
> We are curious whether others are experiencing the same kind of problems
> and whether these are known issues. Or did we perhaps do something wrong
> or miss something when upgrading?
>
>
> 1. Network issues when indexing documents
> =======================================
>
> Our collection contains roughly 150 million documents. When we re-created
> the collection and re-indexed all documents, we regularly experienced
> network problems that caused our loader application to fail.
> The Solr log always contained an IOException:
>
> ERROR (updateExecutor-5-thread-1338-processing-x:PSMG_CI_2020_04_15_10_07_04_shard6_replica_n22 r:core_node25 null n:solr2:8983_solr c:PSMG_CI_2020_04_15_10_07_04 s:shard6) [c:PSMG_CI_2020_04_15_10_07_04 s:shard6 r:core_node25 x:PSMG_CI_2020_04_15_10_07_04_shard6_replica_n22] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling SolrCmdDistributor$Req: cmd=add{,id=(null)}; node=StdNode: http://solr1:8983/solr/PSMG_CI_2020_04_15_10_07_04_shard6_replica_n20/ to http://solr1:8983/solr/PSMG_CI_2020_04_15_10_07_04_shard6_replica_n20/ => java.io.IOException: java.io.IOException: cancel_stream_error
>         at org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:197)
> java.io.IOException: java.io.IOException: cancel_stream_error
>         at org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:197) ~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]
>         at org.eclipse.jetty.client.util.OutputStreamContentProvider$DeferredOutputStream.flush(OutputStreamContentProvider.java:151) ~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]
>         at org.eclipse.jetty.client.util.OutputStreamContentProvider$DeferredOutputStream.write(OutputStreamContentProvider.java:145) ~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]
>         at org.apache.solr.common.util.FastOutputStream.flush(FastOutputStream.java:216) ~[solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
>         at org.apache.solr.common.util.FastOutputStream.flushBuffer(FastOutputStream.java:209) ~[solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
>         at org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:172) ~[solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13 09:38:26]
>
> After the exception the collection was usually in a degraded state for
> some time while shards tried to recover and sync with the leader.
>
> In the Solr changelog we saw that one major change from 7.x to 8.x was
> that Solr now uses HTTP/2 instead of HTTP/1.1. So we tried to disable
> HTTP/2 by setting the system property solr.http1=true.
> That did make the indexing process a LOT more stable, but we still saw
> IOExceptions from time to time. Disabling HTTP/2 did not completely fix
> the problem:
>
> ERROR (updateExecutor-5-thread-9310-processing-x:PSMG_BOM_2020_04_28_05_00_11_shard7_replica_n24 r:core_node27 null n:solr3:8983_solr c:PSMG_BOM_2020_04_28_05_00_11 s:shard7) [c:PSMG_BOM_2020_04_28_05_00_11 s:shard7 r:core_node27 x:PSMG_BOM_2020_04_28_05_00_11_shard7_replica_n24] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling SolrCmdDistributor$Req: cmd=add{,id=5141653a-e33a-4b60-856d-7aa2ce73dee7}; node=ForwardNode: http://solr2:8983/solr/PSMG_BOM_2020_04_28_05_00_11_shard6_replica_n22/ to http://solr2:8983/solr/PSMG_BOM_2020_04_28_05_00_11_shard6_replica_n22/ => java.io.IOException: java.io.EOFException: HttpConnectionOverHTTP@9dc7ad1::SocketChannelEndPoint@2d20213b{solr2/10.0.0.216:8983<->/10.0.0.193:38728,ISHUT,fill=-,flush=-,to=5/600000}{io=0/0,kio=0,kro=1}->HttpConnectionOverHTTP@9dc7ad1(l:/10.0.0.193:38728 <-> r:solr2/10.0.0.216:8983,closed=false)=>HttpChannelOverHTTP@47a242c3(exchange=HttpExchange@6ffd260f req=PENDING/null@null res=PENDING/null@null)[send=HttpSenderOverHTTP@17e056f9(req=CONTENT,snd=IDLE,failure=null)[HttpGenerator@3b6594c7{s=COMMITTED}],recv=HttpReceiverOverHTTP@6e847d32(rsp=IDLE,failure=null)[HttpParser{s=CLOSED,0 of -1}]]
>         at org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:197)
> java.io.IOException: java.io.EOFException: HttpConnectionOverHTTP@9dc7ad1::SocketChannelEndPoint@2d20213b{solr2/10.0.0.216:8983<->/10.0.0.193:38728,ISHUT,fill=-,flush=-,to=5/600000}{io=0/0,kio=0,kro=1}->HttpConnectionOverHTTP@9dc7ad1(l:/10.0.0.193:38728 <-> r:solr2/10.0.0.216:8983,closed=false)=>HttpChannelOverHTTP@47a242c3(exchange=HttpExchange@6ffd260f req=PENDING/null@null res=PENDING/null@null)[send=HttpSenderOverHTTP@17e056f9(req=CONTENT,snd=IDLE,failure=null)[HttpGenerator@3b6594c7{s=COMMITTED}],recv=HttpReceiverOverHTTP@6e847d32(rsp=IDLE,failure=null)[HttpParser{s=CLOSED,0 of -1}]]
>         at org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredContentProvider.java:197) ~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]
>         at org.eclipse.jetty.client.util.OutputStreamContentProvider$DeferredOutputStream.flush(OutputStreamContentProvider.java:151) ~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]
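>
> For reference, this is roughly what setting that property looks like (a
> simplified sketch; passing -Dsolr.http1=true on the JVM command line when
> starting Solr is the equivalent form):
>
>     // Simplified sketch: request HTTP/1.1 instead of HTTP/2 by setting the
>     // solr.http1 system property before any Solr client is created.
>     // The same property can be passed as a JVM flag: -Dsolr.http1=true
>     System.setProperty("solr.http1", "true");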
>
> Our Solr nodes run inside Docker containers in a Docker Swarm cluster, and
> we use a software-defined overlay network
> (https://docs.docker.com/network/network-tutorial-overlay/#use-a-user-defined-overlay-network).
> Could the network problems be caused by the combination of the new HTTP/2
> implementation and the overlay network? We never had any network issues
> with Solr 7 in an otherwise identical setup.
>
> 2. Incorrect Load Balancing
> =======================================
>
> Our SolrCloud cluster contains three Solr nodes, and we use an ensemble of
> three ZooKeeper nodes.
> We initialize our CloudSolrClient with the addresses of our ZooKeeper
> nodes, and the CloudSolrClient should then load-balance queries across the
> three Solr nodes.
> This works as expected in Solr 7. However, in Solr 8 we often see that the
> first Solr node receives twice as many queries as the second node, while
> the third node receives no queries at all.
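>
> For reference, we create the client roughly like this (a simplified
> sketch; host and collection names below are placeholders):
>
>     import java.util.Arrays;
>     import java.util.List;
>     import java.util.Optional;
>     import org.apache.solr.client.solrj.impl.CloudSolrClient;
>
>     // Sketch of our client setup: CloudSolrClient only gets the ZooKeeper
>     // ensemble and is expected to spread queries across the three Solr nodes.
>     List<String> zkHosts = Arrays.asList("zk1:2181", "zk2:2181", "zk3:2181");
>     CloudSolrClient client =
>             new CloudSolrClient.Builder(zkHosts, Optional.empty()).build();
>     client.setDefaultCollection("our_collection");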
>
> 3. Problems with indexing Child Documents
> =======================================
>
> When we index documents that contain Child Documents, our application
> regularly runs into a SocketTimeoutException:
> {"@timestamp":"2020-04-29T06:56:31.587Z","level":"SEVERE","logger_name":"o
> rg.apache.solr.client.solrj.impl.BaseCloudSolrClient","thread_name":"concu
> rrent/batchJobExecutorService-managedThreadFactory-Thread-17","log_message
> ":
>  "Request to collection [PSMG_BOM_2020_04_29_06_52_36] failed due to (0)
> java.net.SocketTimeoutException: Read timed out, retry=0 commError=false
> errorCode=0 "}
>
> {"@timestamp":"2020-04-29T06:56:31.588Z","level":"INFO","logger_name":"org
> .apache.solr.client.solrj.impl.BaseCloudSolrClient","thread_name":"concurr
> ent/batchJobExecutorService-managedThreadFactory-Thread-17","log_message":
>  "request was not communication error it seems"}
>
> Indexing Child Documents seems to be significantly slower in Solr 8
> compared to Solr 7. We set a timeout value of 2 minutes with
> CloudSolrClient.setSoTimeout().
> In Solr 7, documents could be added within a few seconds, and a timeout of
> 2 minutes was more than enough.
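>
> For context, this is roughly how we build and send these documents with
> the CloudSolrClient shown in the sketch above (a simplified example;
> collection name and field values are placeholders):
>
>     import org.apache.solr.common.SolrInputDocument;
>
>     // Simplified sketch: one parent document with a single nested child.
>     SolrInputDocument parent = new SolrInputDocument();
>     parent.addField("id", "parent-1");
>
>     SolrInputDocument child = new SolrInputDocument();
>     child.addField("id", "parent-1-child-1");
>     parent.addChildDocument(child);
>
>     // "client" is the CloudSolrClient from the earlier sketch; the read
>     // timeout is the 2 minute value mentioned above.
>     client.setSoTimeout(120000);
>     client.add("our_collection", parent);
>     client.commit("our_collection");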
>
> Cheers,
> Ludger
>
