We are running Solr 7.2.0. Our configuration has several collections loaded into a SolrCloud cluster that replicates via CDCR to 3 different SolrClouds. All of our target collections have 2 shards with two replicas per shard. Our source collection has 2 shards and 1 replica per shard.
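For context, CDCR exposes its state over HTTP on the source leaders; a minimal check, assuming the handler is at the default /cdcr path (the host and collection below are ours, filled in for illustration):

$ curl 'http://dfw-pauth-msc01:8983/solr/sial-catalog-product-20190824/cdcr?action=QUEUES'

The QUEUES action reports the queue size for each target plus tlogTotalCount and tlogTotalSize on the source, which grow without bound when the replicator is stuck.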
Frequently we start to see errors where the target collections are out of date, and the cdcr action=ERRORS endpoint shows large numbers of errors. For example:

{"responseHeader": {
    "status": 0,
    "QTime": 0},
 "errors": [
    "uc1f-ecom-mzk01:2181,uc1f-ecom-mzk02:2181,uc1f-ecom-mzk03:2181/solr",
    ["sial-catalog-product-20190824", [
        "consecutiveErrors", 700357,
        "bad_request", 0,
        "internal", 700357,
        "last", [
            "2019-09-09T19:17:57.453Z", "internal",
            "2019-09-09T19:17:56.949Z", "internal",
            "2019-09-09T19:17:56.448Z", "internal", ...

We have found that one or more tlogs have become corrupt. It appears that CDCR keeps trying to send data but cannot read it from the tlog, and then it retries, forever. How does this happen? It is very frequent, roughly weekly, and difficult to troubleshoot.

Today it happened with one of our collections. Here is the listing of the tlog files:

$ ls -alht
total 604M
drwxr-xr-x 2 apache apache 44K Sep 9 14:27 .
-rw-r--r-- 1 apache apache 6.7M Sep 6 19:44 tlog.0000000000000000766.1643975309914013696
-rw-r--r-- 1 apache apache 35M Sep 6 19:43 tlog.0000000000000000765.1643975245907886080
-rw-r--r-- 1 apache apache 30M Sep 6 19:42 tlog.0000000000000000764.1643975182924120064
-rw-r--r-- 1 apache apache 37M Sep 6 19:41 tlog.0000000000000000763.1643975118316109824
-rw-r--r-- 1 apache apache 19M Sep 6 19:40 tlog.0000000000000000762.1643975053918863360
-rw-r--r-- 1 apache apache 21M Sep 6 19:39 tlog.0000000000000000761.1643974989726089216
-rw-r--r-- 1 apache apache 21M Sep 6 19:38 tlog.0000000000000000760.1643974926010417152
-rw-r--r-- 1 apache apache 29M Sep 6 19:37 tlog.0000000000000000759.1643974862567374848
-rw-r--r-- 1 apache apache 6.2M Sep 6 19:10 tlog.0000000000000000758.1643973174027616256
-rw-r--r-- 1 apache apache 228K Sep 5 19:48 tlog.0000000000000000757.1643885009483857920
-rw-r--r-- 1 apache apache 27M Sep 5 19:48 tlog.0000000000000000756.1643884946565103616
-rw-r--r-- 1 apache apache 35M Sep 5 19:47 tlog.0000000000000000755.1643884877912735744
-rw-r--r-- 1 apache apache 30M Sep 5 19:46 tlog.0000000000000000754.1643884812724862976
-rw-r--r-- 1 apache apache 25M Sep 5 19:45 tlog.0000000000000000753.1643884748976685056
-rw-r--r-- 1 apache apache 18M Sep 5 19:44 tlog.0000000000000000752.1643884685794738176
-rw-r--r-- 1 apache apache 21M Sep 5 19:43 tlog.0000000000000000751.1643884621330382848
-rw-r--r-- 1 apache apache 16M Sep 5 19:42 tlog.0000000000000000750.1643884558054064128
-rw-r--r-- 1 apache apache 26M Sep 5 19:41 tlog.0000000000000000749.1643884494725316608
-rw-r--r-- 1 apache apache 5.8M Sep 5 19:12 tlog.0000000000000000748.1643882681969147904
-rw-r--r-- 1 apache apache 31M Sep 4 19:56 tlog.0000000000000000747.1643794877229563904
-rw-r--r-- 1 apache apache 31M Sep 4 19:55 tlog.0000000000000000746.1643794813706829824
-rw-r--r-- 1 apache apache 30M Sep 4 19:54 tlog.0000000000000000745.1643794749615767552
-rw-r--r-- 1 apache apache 22M Sep 4 19:53 tlog.0000000000000000744.1643794686253465600
-rw-r--r-- 1 apache apache 18M Sep 4 19:52 tlog.0000000000000000743.1643794622319689728
-rw-r--r-- 1 apache apache 21M Sep 4 19:51 tlog.0000000000000000742.1643794558055612416
-rw-r--r-- 1 apache apache 15M Sep 4 19:50 tlog.0000000000000000741.1643794493330161664
-rw-r--r-- 1 apache apache 26M Sep 4 19:49 tlog.0000000000000000740.1643794428790308864
-rw-r--r-- 1 apache apache 11M Sep 4 14:58 tlog.0000000000000000737.1643701398824550400
drwxr-xr-x 5 apache apache 53 Aug 21 06:30 ..
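One early-warning sign is visible in the listing itself: tlog files that are days old even though the targets should long since have consumed them. A simple check we could run from cron (the glob and the two-day threshold are illustrative assumptions):

$ find /var/solr/data/*/data/tlog -name 'tlog.*' -mtime +2 -print

Any output means a replica is still holding tlogs older than two days, which in this incident lines up with CDCR being stuck on the Sep 4 files.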
[apache@dfw-pauth-msc01 tlog]$ ls -alht tlog.0000000000000000757.1643885009483857920
-rw-r--r-- 1 apache apache 228K Sep 5 19:48 tlog.0000000000000000757.1643885009483857920
$ date
Mon Sep 9 14:27:31 CDT 2019
$ pwd
/var/solr/data/sial-catalog-product-20190824_shard1_replica_n1/data/tlog

CDCR started replicating again after we deleted the oldest tlog file, tlog.0000000000000000737.1643701398824550400, and restarted CDCR.

At about the same time I found a number of errors like this in the Solr logs:

2019-09-04 19:58:01.393 ERROR (recoveryExecutor-162-thread-1-processing-n:dfw-pauth-msc01:8983_solr x:sial-catalog-product-20190824_shard1_replica_n1 s:shard1 c:sial-catalog-product-20190824 r:core_node3) [c:sial-catalog-product-20190824 s:shard1 r:core_node3 x:sial-catalog-product-20190824_shard1_replica_n1] o.a.s.u.UpdateLog java.lang.ClassCastException

This was the most common error at the time; I saw it for all of our collections. We also saw errors like the following:

2019-09-04 19:57:46.572 ERROR (qtp1355531311-20) [c:sial-catalog-product-20190824 s:shard1 r:core_node3 x:sial-catalog-product-20190824_shard1_replica_n1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Action LASTPROCESSEDVERSION sent to non-leader replica

I would like to understand how this happens and find better ways to diagnose it early (a rough monitoring sketch is below). This occurs frequently enough that we are considering developing an alternative to CDCR, as it does not seem to be very robust. Any help would be appreciated.
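For what it's worth, one way to catch this earlier would be to poll the same cdcr action=ERRORS endpoint and alert once consecutiveErrors starts climbing. A rough bash sketch; the host, collection, and threshold are placeholders, and it assumes the wt=json response keeps the flat "consecutiveErrors",N layout shown above:

#!/bin/bash
# Sketch: alert when CDCR consecutiveErrors exceeds a threshold.
# HOST, COLLECTION, and THRESHOLD are placeholders, not our real settings.
HOST=dfw-pauth-msc01:8983
COLLECTION=sial-catalog-product-20190824
THRESHOLD=100

# Strip whitespace so the flat NamedList pairs can be matched reliably,
# then pull the first consecutiveErrors counter out of the response.
count=$(curl -s "http://$HOST/solr/$COLLECTION/cdcr?action=ERRORS&wt=json" \
  | tr -d ' \n' \
  | grep -o '"consecutiveErrors",[0-9]*' \
  | head -n1 \
  | cut -d, -f2)

if [ "${count:-0}" -gt "$THRESHOLD" ]; then
  echo "CDCR stuck? consecutiveErrors=$count for $COLLECTION"  # wire this into alerting
fi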