Hello,

I have a few very strange problems and hope someone can help me with them. I'm trying to index documents with Solr 8.4.1, but after a few documents I get the following exceptions:

2020-04-23 13:00:43.484 INFO (qtp1635378213-21) [c:cc5363_dm_documentversion s:shard1 r:core_node3 x:cc5363_dm_documentversion_shard1_replica_n1] o.a.s.u.SolrCmdDistributor SolrCmdDistributor found 1 errors

2020-04-23 13:00:45.484 ERROR (qtp1635378213-21) [c:cc5363_dm_documentversion s:shard1 r:core_node3 x:cc5363_dm_documentversion_shard1_replica_n1] o.a.s.u.SolrCmdDistributor forwarding update to http://solr-0.solr.cc-demo:8983/solr/cc5363_dm_documentversion_shard2_replica_n4/ failed - retrying ... retries: 25/25. add{,id=100004691!100004706} params:update.distrib=TOLEADER&distrib.from=http://solr-1.solr.cc-demo:8983/solr/cc5363_dm_documentversion_shard1_replica_n1/ rsp:404:org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://solr-0.solr.cc-demo:8983/solr/cc5363_dm_documentversion_shard2_replica_n4/: null

request: http://solr-0.solr.cc-demo:8983/solr/cc5363_dm_documentversion_shard2_replica_n4/
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.sendUpdateStream(ConcurrentUpdateHttp2SolrClient.java:274)
    at org.apache.solr.client.solrj.impl.ConcurrentUpdateHttp2SolrClient$Runner.run(ConcurrentUpdateHttp2SolrClient.java:181)
    at com.codahale.metrics.InstrumentedExecutorService$InstrumentedRunnable.run(InstrumentedExecutorService.java:181)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:210)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
    at java.base/java.lang.Thread.run(Thread.java:834)

2020-04-23 13:00:45.489 WARN (updateExecutor-5-thread-1-processing-x:cc5363_dm_documentversion_shard1_replica_n1 r:core_node3 null n:solr-1.solr.cc-demo:8983_solr c:cc5363_dm_documentversion s:shard1) [c:cc5363_dm_documentversion s:shard1 r:core_node3 x:cc5363_dm_documentversion_shard1_replica_n1] o.a.s.c.s.i.ConcurrentUpdateHttp2SolrClient Failed to parse error response from http://solr-0.solr.cc-demo:8983/solr/cc5363_dm_documentversion_shard2_replica_n4/ due to: java.lang.RuntimeException: Invalid version (expected 2, but 60) or the data in not in 'javabin' format

2020-04-23 13:00:45.489 ERROR (updateExecutor-5-thread-1-processing-x:cc5363_dm_documentversion_shard1_replica_n1 r:core_node3 null n:solr-1.solr.cc-demo:8983_solr c:cc5363_dm_documentversion s:shard1) [c:cc5363_dm_documentversion s:shard1 r:core_node3 x:cc5363_dm_documentversion_shard1_replica_n1] o.a.s.u.ErrorReportingConcurrentUpdateSolrClient Error when calling SolrCmdDistributor$Req: cmd=add{,id=100004691!100004706}; node=ForwardNode: http://solr-0.solr.cc-demo:8983/solr/cc5363_dm_documentversion_shard2_replica_n4/ to http://solr-0.solr.cc-demo:8983/solr/cc5363_dm_documentversion_shard2_replica_n4/ => org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://solr-0.solr.cc-demo:8983/solr/cc5363_dm_documentversion_shard2_replica_n4/: null

I have two nodes (solr-0 and solr-1) running in a StatefulSet in OpenShift with a single ZooKeeper instance. The collection cc5363_dm_documentversion is configured with shardCount 2, replicationFactor 2, maxShardsPerNode 2, router compositeId and autoAddReplicas false. I create the collection on demand while indexing, when I find that a collection does not exist yet, and I index around 10 documents per "transaction", i.e. I commit after 10 documents.

The first strange thing is that for some of the automatically created shards, the replica is created on the same node as the leader. In this case shard1 has two replicas, core_node3 (replica_n1, the leader) and core_node5 (replica_n2, the replica), which are both on solr-1. shard2 has core_node7 (replica_n4, the leader) on solr-0 and core_node8 (replica_n6, the replica) on solr-1.
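
To make the placement issue concrete, here is a minimal Python sketch that flags shards whose replicas share a node. The `layout` dictionary and the `colocated_replicas` helper are hypothetical names used only for illustration; the data is typed in by hand from the description above, not fetched from Solr, and its shape is not the actual cluster status response format.

```python
from collections import defaultdict

# Replica layout as described in this post: shard -> [(replica, node name)].
# Assumption: this hand-written structure is for illustration only.
layout = {
    "shard1": [
        ("core_node3", "solr-1.solr.cc-demo:8983_solr"),
        ("core_node5", "solr-1.solr.cc-demo:8983_solr"),
    ],
    "shard2": [
        ("core_node7", "solr-0.solr.cc-demo:8983_solr"),
        ("core_node8", "solr-1.solr.cc-demo:8983_solr"),
    ],
}


def colocated_replicas(layout):
    """Return {shard: {node: [replicas]}} for nodes that host more than
    one replica of the same shard."""
    result = {}
    for shard, replicas in layout.items():
        by_node = defaultdict(list)
        for replica, node in replicas:
            by_node[node].append(replica)
        # Keep only nodes that hold two or more replicas of this shard.
        crowded = {node: names for node, names in by_node.items() if len(names) > 1}
        if crowded:
            result[shard] = crowded
    return result


print(colocated_replicas(layout))
```

With the layout above, only shard1 is reported, because both of its replicas are on solr-1, while shard2 is spread across both nodes.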

The web interface shows the following replicas:

Replica: core_node3
    core: cc5363_dm_documentversion_shard1_replica_n1
    base URL: http://solr-1.solr.cc-demo:8983/solr
    node name: solr-1.solr.cc-demo:8983_solr
    state: active
    leader: yes

Replica: core_node5
    core: cc5363_dm_documentversion_shard1_replica_n2
    base URL: http://solr-1.solr.cc-demo:8983/solr
    node name: solr-1.solr.cc-demo:8983_solr
    state: active
    leader: no

Replica: core_node7
    core: cc5363_dm_documentversion_shard2_replica_n4
    base URL: http://solr-0.solr.cc-demo:8983/solr
    node name: solr-0.solr.cc-demo:8983_solr
    state: active
    leader: yes

Replica: core_node8
    core: cc5363_dm_documentversion_shard2_replica_n6
    base URL: http://solr-1.solr.cc-demo:8983/solr
    node name: solr-1.solr.cc-demo:8983_solr
    state: active
    leader: no

I thought that, according to the configuration, the replica for shard1 should be on a different node. When I try to index with this layout, I run into the error described above.

The next strange thing is that when I try to create a replica on solr-0 through the web interface, I get a timeout, and when I refresh the page I see multiple replicas being created.
This is what shard1 looks like afterwards:

Replica: core_node3
    core: cc5363_dm_documentversion_shard1_replica_n1
    base URL: http://solr-1.solr.cc-demo:8983/solr
    node name: solr-1.solr.cc-demo:8983_solr
    state: active
    leader: yes

Replica: core_node5
    core: cc5363_dm_documentversion_shard1_replica_n2
    base URL: http://solr-1.solr.cc-demo:8983/solr
    node name: solr-1.solr.cc-demo:8983_solr
    state: active
    leader: no

Replica: core_node10
    core: cc5363_dm_documentversion_shard1_replica_n9
    base URL: http://solr-1.solr.cc-demo:8983/solr
    node name: solr-1.solr.cc-demo:8983_solr
    state: active
    leader: no

Replica: core_node12
    core: cc5363_dm_documentversion_shard1_replica_n11
    base URL: http://solr-0.solr.cc-demo:8983/solr
    node name: solr-0.solr.cc-demo:8983_solr
    state: recovering
    leader: no

Replica: core_node14
    core: cc5363_dm_documentversion_shard1_replica_n13
    base URL: http://solr-0.solr.cc-demo:8983/solr
    node name: solr-0.solr.cc-demo:8983_solr
    state: recovering
    leader: no

Replica: core_node16
    core: cc5363_dm_documentversion_shard1_replica_n15
    base URL: http://solr-0.solr.cc-demo:8983/solr
    node name: solr-0.solr.cc-demo:8983_solr
    state: recovering
    leader: no

Replica: core_node18
    core: cc5363_dm_documentversion_shard1_replica_n17
    base URL: http://solr-0.solr.cc-demo:8983/solr
    node name: solr-0.solr.cc-demo:8983_solr
    state: recovering
    leader: no

Deleting the unnecessary replicas on solr-1 is no problem and works instantaneously, but whenever I try to delete a replica on solr-0, the web interface runs into a timeout. When I reload the view, the replica appears to have been deleted though. In the logs I see messages like this:

Error from server at http://solr-0.solr.cc-demo:8983/solr: Cannot unload non-existent core [cc5363_dm_documentversion_shard1_replica_n15]

After deleting the unnecessary replicas, I am left with the replica core_node12 (replica_n11), which seems stuck in the recovering state for a long time. That makes no sense, because there is almost no data to replicate: only 6 small documents, 16 KB in total.
After a while the replica becomes active. When I restart the indexing, it seems to work for a while, until another collection is created with the replica on the same node as the leader again. That's when the errors start again.

It seems the root cause of the errors I see is having a replica on the same node as the leader. The thing is, I am not creating the replicas manually; they are created automatically according to the collection configuration. Regardless of whether this configuration makes sense, why should it be a problem to have the replica on the same node as the leader?

Can anyone help me figure out how to fix this? I'm really desperate.

Kind regards,
-----------------------------------------------
Christian Beikov
Software-Architect, R&D
curecomp Software Services GmbH
Neue Werft
Industriezeile 35
4020 Linz
web: www.curecomp.com
E-Mail: c.bei...@curecomp.com
mobile: +43 660 5566055