Hi all,

we recently upgraded our SolrCloud cluster from version 7.7.1 to version
8.5.0 and ran into multiple problems.
In the end we had to revert the upgrade and went back to Solr 7.7.1.

We have been using Solr at our company since version 4, and so far every
upgrade to a newer version went through without problems.
We are curious whether others have run into the same kind of problems and
whether these are known issues. Or did we do something wrong or miss a
step during the upgrade?


1. Network issues when indexing documents
=======================================

Our collection contains roughly 150 million documents. When we re-created
the collection and re-indexed all documents, we regularly experienced
network problems that caused our loader application to fail.
The Solr log always contains an IOException:

ERROR
(updateExecutor-5-thread-1338-processing-x:PSMG_CI_2020_04_15_10_07_04_sha
rd6_replica_n22 r:core_node25 null n:solr2:8983_solr
c:PSMG_CI_2020_04_15_10_07_04 s:shard6) [c:PSMG_CI_2020_04_15_10_07_04
s:shard6 r:core_node25 x:PSMG_CI_2020_04_15_10_07_04_shard6_replica_n22]
o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling
SolrCmdDistributor$Req: cmd=add{,id=(null)}; node=StdNode:
http://solr1:8983/solr/PSMG_CI_2020_04_15_10_07_04_shard6_replica_n20/ to
http://solr1:8983/solr/PSMG_CI_2020_04_15_10_07_04_shard6_replica_n20/ =>
java.io.IOException: java.io.IOException: cancel_stream_error
         at
org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredConten
tProvider.java:197)
 java.io.IOException: java.io.IOException: cancel_stream_error
         at
org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredConten
tProvider.java:197) ~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]
         at
org.eclipse.jetty.client.util.OutputStreamContentProvider$DeferredOutputSt
ream.flush(OutputStreamContentProvider.java:151)
~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]
         at
org.eclipse.jetty.client.util.OutputStreamContentProvider$DeferredOutputSt
ream.write(OutputStreamContentProvider.java:145)
~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]
         at
org.apache.solr.common.util.FastOutputStream.flush(FastOutputStream.java:2
16) ~[solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42
- romseygeek - 2020-03-13 09:38:26]
         at
org.apache.solr.common.util.FastOutputStream.flushBuffer(FastOutputStream.
java:209) ~[solr-solrj-8.5.0.jar:8.5.0
7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 - romseygeek - 2020-03-13
09:38:26]
         at
org.apache.solr.common.util.JavaBinCodec.marshal(JavaBinCodec.java:172)
~[solr-solrj-8.5.0.jar:8.5.0 7ac489bf7b97b61749b19fa2ee0dc46e74b8dc42 -
romseygeek - 2020-03-13 09:38:26]

After the exception, the collection was usually in a degraded state for
some time while the shards tried to recover and sync with the leader.

In the Solr changelog we saw that one major change from 7.x to 8.x was
that Solr now uses HTTP/2 instead of HTTP/1.1. So we tried disabling
HTTP/2 by setting the system property solr.http1=true.
That made the indexing process a LOT more stable, but we still saw
IOExceptions from time to time, so disabling HTTP/2 did not completely fix
the problem.

ERROR
(updateExecutor-5-thread-9310-processing-x:PSMG_BOM_2020_04_28_05_00_11_sh
ard7_replica_n24 r:core_node27 null n:solr3:8983_solr
c:PSMG_BOM_2020_04_28_05_00_11 s:shard7) [c:PSMG_BOM_2020_04_28_05_00_11
s:shard7 r:core_node27 x:PSMG_BOM_2020_04_28_05_00_11_shard7_replica_n24]
o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling
SolrCmdDistributor$Req: cmd=add{,id=5141653a-e33a-4b60-856d-7aa2ce73dee7};
node=ForwardNode:
http://solr2:8983/solr/PSMG_BOM_2020_04_28_05_00_11_shard6_replica_n22/ to
http://solr2:8983/solr/PSMG_BOM_2020_04_28_05_00_11_shard6_replica_n22/ =>
java.io.IOException: java.io.EOFException:
HttpConnectionOverHTTP@9dc7ad1::SocketChannelEndPoint@2d20213b{solr2/10.0.
0.216:8983<->/10.0.0.193:38728,ISHUT,fill=-,flush=-,to=5/600000}{io=0/0,ki
o=0,kro=1}->HttpConnectionOverHTTP@9dc7ad1(l:/10.0.0.193:38728 <->
r:solr2/10.0.0.216:8983,closed=false)=>HttpChannelOverHTTP@47a242c3(exchan
ge=HttpExchange@6ffd260f req=PENDING/null@null
res=PENDING/null@null)[send=HttpSenderOverHTTP@17e056f9(req=CONTENT,snd=ID
LE,failure=null)[HttpGenerator@3b6594c7{s=COMMITTED}],recv=HttpReceiverOve
rHTTP@6e847d32(rsp=IDLE,failure=null)[HttpParser{s=CLOSED,0 of -1}]]
        at
org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredConten
tProvider.java:197)
java.io.IOException: java.io.EOFException:
HttpConnectionOverHTTP@9dc7ad1::SocketChannelEndPoint@2d20213b{solr2/10.0.
0.216:8983<->/10.0.0.193:38728,ISHUT,fill=-,flush=-,to=5/600000}{io=0/0,ki
o=0,kro=1}->HttpConnectionOverHTTP@9dc7ad1(l:/10.0.0.193:38728 <->
r:solr2/10.0.0.216:8983,closed=false)=>HttpChannelOverHTTP@47a242c3(exchan
ge=HttpExchange@6ffd260f req=PENDING/null@null
res=PENDING/null@null)[send=HttpSenderOverHTTP@17e056f9(req=CONTENT,snd=ID
LE,failure=null)[HttpGenerator@3b6594c7{s=COMMITTED}],recv=HttpReceiverOve
rHTTP@6e847d32(rsp=IDLE,failure=null)[HttpParser{s=CLOSED,0 of -1}]]
        at
org.eclipse.jetty.client.util.DeferredContentProvider.flush(DeferredConten
tProvider.java:197) ~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]
        at
org.eclipse.jetty.client.util.OutputStreamContentProvider$DeferredOutputSt
ream.flush(OutputStreamContentProvider.java:151)
~[jetty-client-9.4.24.v20191120.jar:9.4.24.v20191120]

Our Solr nodes run inside Docker containers in a Docker Swarm cluster, and
we use a user-defined overlay network
(https://docs.docker.com/network/network-tutorial-overlay/#use-a-user-defined-overlay-network).
Maybe the network problems are caused by the combination of the new
HTTP/2 implementation and the overlay network? We never had any network
issues in Solr 7 with an otherwise identical setup.
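
Until the root cause is found, a loader could tolerate the occasional
IOException by retrying a failed request instead of aborting. The sketch
below is only an illustration under assumptions: the class name
RetryingLoader, the method withRetry and the attempt and delay values are
hypothetical and not taken from our loader. Depending on the client, the
IOException may also arrive wrapped in another exception, which this
sketch does not unwrap.

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical helper, not part of Solr or SolrJ.
final class RetryingLoader {

    /**
     * Runs the action and retries it when it throws an IOException,
     * doubling the wait between attempts (exponential backoff).
     * Any other exception is passed through without a retry.
     */
    static <T> T withRetry(Callable<T> action, int maxAttempts, long initialDelayMs)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        long delayMs = initialDelayMs;
        for (int attempt = 1; ; attempt++) {
            try {
                return action.call();
            } catch (IOException e) {
                if (attempt >= maxAttempts) {
                    throw e; // out of attempts, report the last failure
                }
                Thread.sleep(delayMs);
                delayMs *= 2; // wait longer before the next attempt
            }
        }
    }
}
```

A call such as withRetry(() -> sendBatch(docs), 5, 1000L), where sendBatch
is a placeholder for whatever sends one batch to Solr, would retry up to
four times before giving up. Retrying only hides the symptom; it does not
explain why the connections are closed.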

2. Incorrect Load Balancing
=======================================

Our SolrCloud cluster contains three nodes and we use a cluster of three
ZooKeeper nodes.
We initialize our CloudSolrClient with the addresses of our ZooKeeper
nodes and the CloudSolrClient should then load balance queries between the
three Solr nodes.
This works as expected in Solr 7. However, in Solr 8 we often see that the
first Solr node receives twice as many queries as the second node and the
third node receives no queries at all.
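
To put numbers on the skew: with even balancing each of the three nodes
would serve about one third of the queries, while the pattern described
above corresponds to shares of roughly 2/3, 1/3 and 0. A minimal sketch
for turning per-node query counts into shares follows; the class
QueryShare and the method shares are hypothetical names, and the counts
have to come from whatever source is available (for example request
logs).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for comparing per-node query counts.
final class QueryShare {

    /** Returns each node's fraction of the total query count, in input order. */
    static Map<String, Double> shares(Map<String, Long> queriesPerNode) {
        long total = 0;
        for (long count : queriesPerNode.values()) {
            total += count;
        }
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : queriesPerNode.entrySet()) {
            // With no queries at all, report a share of 0 for every node.
            double share = total == 0 ? 0.0 : (double) entry.getValue() / total;
            result.put(entry.getKey(), share);
        }
        return result;
    }
}
```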

3. Problems with indexing Child Documents
=======================================

When we index documents that contain Child Documents, our application
regularly runs into a SocketTimeoutException:
{"@timestamp":"2020-04-29T06:56:31.587Z","level":"SEVERE","logger_name":"o
rg.apache.solr.client.solrj.impl.BaseCloudSolrClient","thread_name":"concu
rrent/batchJobExecutorService-managedThreadFactory-Thread-17","log_message
":
 "Request to collection [PSMG_BOM_2020_04_29_06_52_36] failed due to (0)
java.net.SocketTimeoutException: Read timed out, retry=0 commError=false
errorCode=0 "}

{"@timestamp":"2020-04-29T06:56:31.588Z","level":"INFO","logger_name":"org
.apache.solr.client.solrj.impl.BaseCloudSolrClient","thread_name":"concurr
ent/batchJobExecutorService-managedThreadFactory-Thread-17","log_message":
 "request was not communication error it seems"}

Indexing Child Documents seems to be significantly slower in Solr 8
compared to Solr 7. We set a timeout value of 2 minutes with
CloudSolrClient.setSoTimeout().
In Solr 7, documents could be added within a few seconds and a timeout of
2 minutes was more than enough.
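
One possible stopgap is to send fewer documents per request so that each
request has a better chance of finishing within the socket timeout. The
sketch below only shows the splitting step; the class Batches, the method
partition and the batch size are hypothetical, and whether smaller
requests actually help depends on why indexing is slower in Solr 8, which
we have not verified.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of SolrJ.
final class Batches {

    /** Splits items into consecutive batches of at most batchSize elements. */
    static <T> List<List<T>> partition(List<T> items, int batchSize) {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<List<T>> batches = new ArrayList<>();
        for (int start = 0; start < items.size(); start += batchSize) {
            int end = Math.min(start + batchSize, items.size());
            // Copy the slice so each batch is independent of the source list.
            batches.add(new ArrayList<>(items.subList(start, end)));
        }
        return batches;
    }
}
```

Each resulting batch would then be sent in its own add request. A parent
document and its Child Documents need to stay together in one batch
element.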

Cheers,
Ludger
