Hello,

I have a few very strange problems and hope someone can help me with them. I'm 
trying to index documents with Solr 8.4.1, but after a few documents I get the 
following exceptions:

2020-04-23 13:00:43.484 INFO  (qtp1635378213-21) [c:cc5363_dm_documentversion 
s:shard1 r:core_node3 x:cc5363_dm_documentversion_shard1_replica_n1] 
o.a.s.u.SolrCmdDistributor SolrCmdDistributor found 1 errors
2020-04-23 13:00:45.484 ERROR (qtp1635378213-21) [c:cc5363_dm_documentversion 
s:shard1 r:core_node3 x:cc5363_dm_documentversion_shard1_replica_n1] 
o.a.s.u.SolrCmdDistributor forwarding update to 
http://solr-0.solr.cc-demo:8983/solr/cc5363_dm_documentversion_shard2_replica_n4/
  failed - retrying ... retries: 25/25. add{,id=100004691!100004706} 
params:update.distrib=TOLEADER&distrib.from=http://solr-1.solr.cc-demo:8983/solr/cc5363_dm_documentversion_shard1_replica_n1/
  rsp:404:org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: 
Error from server at 
http://solr-0.solr.cc-demo:8983/solr/cc5363_dm_documentversion_shard2_replica_n4/:
  null



request: 
http://solr-0.solr.cc-demo:8983/solr/cc5363_dm_documentversion_shard2_replica_n4/
             at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.sendUpdateStream(ConcurrentUpdateHttp2SolrClient.java:274)
             at 
org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.run(ConcurrentUpdateHttp2SolrClient.java:181)
             at 
com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
             at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:210)
             at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
             at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
             at java.base/java.lang.Thread.run(Thread.java:834)

2020-04-23 13:00:45.489 WARN  
(updateExecutor-5-thread-1-processing-x:cc5363_dm_documentversion_shard1_replica_n1
 r:core_node3 null n:solr-1.solr.cc-demo:8983_solr c:cc5363_dm_documentversion 
s:shard1) [c:cc5363_dm_documentversion s:shard1 r:core_node3 
x:cc5363_dm_documentversion_shard1_replica_n1] 
o.a.s.c.s.i.ConcurrentUpdateHttp2SolrClient Failed to parse error response from 
http://solr-0.solr.cc-demo:8983/solr/cc5363_dm_documentversion_shard2_replica_n4/
  due to: java.lang.RuntimeException: Invalid version (expected 2, but 60) or 
the data in not in 'javabin' format
2020-04-23 13:00:45.489 ERROR 
(updateExecutor-5-thread-1-processing-x:cc5363_dm_documentversion_shard1_replica_n1
 r:core_node3 null n:solr-1.solr.cc-demo:8983_solr c:cc5363_dm_documentversion 
s:shard1) [c:cc5363_dm_documentversion s:shard1 r:core_node3 
x:cc5363_dm_documentversion_shard1_replica_n1] 
o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling 
SolrCmdDistributor$Req: cmd=add{,id=100004691!100004706}; node=ForwardNode: 
http://solr-0.solr.cc-demo:8983/solr/cc5363_dm_documentversion_shard2_replica_n4/
  to 
http://solr-0.solr.cc-demo:8983/solr/cc5363_dm_documentversion_shard2_replica_n4/
  => org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: 
Error from server at 
http://solr-0.solr.cc-demo:8983/solr/cc5363_dm_documentversion_shard2_replica_n4/:
  null

I have two nodes (solr-0 and solr-1) running in a StatefulSet in OpenShift with 
a single ZooKeeper instance. The collection cc5363_dm_documentversion is 
configured with shardCount 2, replicationFactor 2, maxShardsPerNode 2, router 
compositeId and autoAddReplicas false. I create collections on demand while 
indexing, whenever I find that a collection does not exist yet, and I index 
around 10 documents per "transaction", i.e. I commit after every 10 documents.
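
For reference, the collection settings listed above map roughly onto a 
Collections API CREATE request like the one built below. This is a minimal 
sketch, not the actual indexing code: the helper name build_create_url is 
hypothetical, and "shardCount" is assumed to correspond to the numShards 
parameter.

```python
from urllib.parse import urlencode


def build_create_url(base_url, name):
    """Build a Collections API CREATE URL for the settings described above.

    Hypothetical helper; it only assembles the URL and sends nothing.
    """
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": 2,            # assumed equivalent of "shardCount 2"
        "replicationFactor": 2,
        "maxShardsPerNode": 2,
        "router.name": "compositeId",
        "autoAddReplicas": "false",
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


create_url = build_create_url(
    "http://solr-0.solr.cc-demo:8983/solr", "cc5363_dm_documentversion"
)
print(create_url)
```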

The first strange thing is that for some of the automatically created shards, 
the replica ends up on the same node as the leader. In this case, shard1 has 
two replicas, core_node3 (replica_n1, the leader) and core_node5 (replica_n2, 
the replica), which are both on solr-1. shard2 has core_node7 (replica_n4, the 
leader) on solr-0 and core_node8 (replica_n6, the replica) on solr-1. This is 
what the web interface tells me:

Replica: core_node3
core:cc5363_dm_documentversion_shard1_replica_n1
base URL:http://solr-1.solr.cc-demo:8983/solr
node name:solr-1.solr.cc-demo:8983_solr
state:active
leader: yes
Replica: core_node5
core:cc5363_dm_documentversion_shard1_replica_n2
base URL:http://solr-1.solr.cc-demo:8983/solr
node name:solr-1.solr.cc-demo:8983_solr
state:active
leader: no

Replica: core_node7
core:cc5363_dm_documentversion_shard2_replica_n4
base URL:http://solr-0.solr.cc-demo:8983/solr
node name:solr-0.solr.cc-demo:8983_solr
state:active
leader: yes
Replica: core_node8
core:cc5363_dm_documentversion_shard2_replica_n6
base URL:http://solr-1.solr.cc-demo:8983/solr
node name:solr-1.solr.cc-demo:8983_solr
state:active
leader: no

I thought that, according to the configuration, the replica for shard1 should 
be on a different node than the leader. When I try to index with this 
placement, I run into the error described above.
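
The placement described above can also be checked programmatically. The sketch 
below assumes the per-collection state has the shape returned by the 
Collections API CLUSTERSTATUS action (shards, then replicas, each with a 
node_name); the function name and the hand-written sample data, copied from 
the listing above, are only illustrative.

```python
def shards_on_single_node(collection_state):
    """Return the shards whose replicas (two or more) all sit on one node."""
    suspicious = []
    for shard_name, shard in collection_state["shards"].items():
        replicas = shard["replicas"]
        nodes = {replica["node_name"] for replica in replicas.values()}
        if len(replicas) > 1 and len(nodes) == 1:
            suspicious.append(shard_name)
    return sorted(suspicious)


# Sample data transcribed by hand from the web interface listing above.
state = {
    "shards": {
        "shard1": {"replicas": {
            "core_node3": {"node_name": "solr-1.solr.cc-demo:8983_solr"},
            "core_node5": {"node_name": "solr-1.solr.cc-demo:8983_solr"},
        }},
        "shard2": {"replicas": {
            "core_node7": {"node_name": "solr-0.solr.cc-demo:8983_solr"},
            "core_node8": {"node_name": "solr-1.solr.cc-demo:8983_solr"},
        }},
    }
}

colocated = shards_on_single_node(state)
print(colocated)  # ['shard1']
```
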
The next strange thing is that when I try to create a replica on solr-0 
through the web interface, I get a timeout, and when I refresh I see that 
multiple replicas have been created. This is what shard1 looks like afterwards:

Replica: core_node3
core:cc5363_dm_documentversion_shard1_replica_n1
base URL:http://solr-1.solr.cc-demo:8983/solr
node name:solr-1.solr.cc-demo:8983_solr
state:active
leader: yes
Replica: core_node5
core:cc5363_dm_documentversion_shard1_replica_n2
base URL:http://solr-1.solr.cc-demo:8983/solr
node name:solr-1.solr.cc-demo:8983_solr
state:active
leader: no
Replica: core_node10
core:cc5363_dm_documentversion_shard1_replica_n9
base URL:http://solr-1.solr.cc-demo:8983/solr
node name:solr-1.solr.cc-demo:8983_solr
state:active
leader: no
Replica: core_node12
core:cc5363_dm_documentversion_shard1_replica_n11
base URL:http://solr-0.solr.cc-demo:8983/solr
node name:solr-0.solr.cc-demo:8983_solr
state:recovering
leader: no
Replica: core_node14
core:cc5363_dm_documentversion_shard1_replica_n13
base URL:http://solr-0.solr.cc-demo:8983/solr
node name:solr-0.solr.cc-demo:8983_solr
state:recovering
leader: no
Replica: core_node16
core:cc5363_dm_documentversion_shard1_replica_n15
base URL:http://solr-0.solr.cc-demo:8983/solr
node name:solr-0.solr.cc-demo:8983_solr
state:recovering
leader: no
Replica: core_node18
core:cc5363_dm_documentversion_shard1_replica_n17
base URL:http://solr-0.solr.cc-demo:8983/solr
node name:solr-0.solr.cc-demo:8983_solr
state:recovering
leader: no
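
For comparison, the same replica could be requested directly through the 
Collections API instead of the web interface. The following is only a sketch: 
the helper names and the request id are made up, and whether submitting the 
call with the async parameter and polling REQUESTSTATUS would avoid the 
timeout in this setup is an assumption, not something I have verified.

```python
from urllib.parse import urlencode


def build_add_replica_url(base_url, collection, shard, node, request_id):
    """Build an ADDREPLICA URL that pins the new replica to a given node."""
    params = {
        "action": "ADDREPLICA",
        "collection": collection,
        "shard": shard,
        "node": node,          # node name as shown in the web interface
        "async": request_id,   # hypothetical id; makes the call asynchronous
    }
    return f"{base_url}/admin/collections?{urlencode(params)}"


def build_request_status_url(base_url, request_id):
    """Build the REQUESTSTATUS URL used to poll an asynchronous request."""
    params = {"action": "REQUESTSTATUS", "requestid": request_id}
    return f"{base_url}/admin/collections?{urlencode(params)}"


base = "http://solr-1.solr.cc-demo:8983/solr"
add_url = build_add_replica_url(
    base, "cc5363_dm_documentversion", "shard1",
    "solr-0.solr.cc-demo:8983_solr", "add-replica-1",
)
status_url = build_request_status_url(base, "add-replica-1")
print(add_url)
print(status_url)
```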

Deleting the unnecessary replicas on solr-1 is no problem and works 
instantaneously, but whenever I try to delete a replica on solr-0, the web 
interface runs into a timeout. When I reload the view, the replica seems to 
have been deleted, though.
In the logs I see messages like this one:

Error from server at http://solr-0.solr.cc-demo:8983/solr:  Cannot unload 
non-existent core [cc5363_dm_documentversion_shard1_replica_n15]

After deleting the unnecessary replicas, I am left with the replica 
core_node12 (replica_n11), which seems stuck in the recovering state for a 
long time. That makes no sense, because there is almost no data to replicate: 
only 6 small documents, 16 KB in total.
After a while the replica becomes active. When I restart the indexing, it 
seems to work for a while, until another collection is created with the 
replica again on the same node as the leader. That's when the errors start 
again.

It seems the root cause of the errors I see is having a replica on the same 
node as the leader. The thing is, I am not creating the replicas manually; 
they are created automatically according to the collection configuration. 
Regardless of whether this configuration makes sense, why should it be a 
problem to have the replica on the same node?

Can anyone help me figure out how to fix this? I'm really desperate.

Kind regards

-----------------------------------------------
Christian Beikov
Software-Architect, R&D

curecomp Software Services GmbH
Neue Werft
Industriezeile 35
4020 Linz

web: www.curecomp.com<http://www.curecomp.com/>
E-Mail: c.bei...@curecomp.com<mailto:c.bei...@curecomp.com>
mobile: +43 660 5566055
