We are running Solr 7.2.0.

Our configuration has several collections loaded into a source SolrCloud cluster, which is set up to replicate via CDCR to 3 different target SolrCloud clusters. All of our target collections have 2 shards with 2 replicas per shard. Our source collection has 2 shards and 1 replica per shard.

Frequently we start to see errors where the target collections are out of date, and the cdcr action=ERRORS endpoint shows large numbers of errors. For example:
{"responseHeader": {
"status": 0,
"QTime": 0},
"errors": [
"uc1f-ecom-mzk01:2181,uc1f-ecom-mzk02:2181,uc1f-ecom-mzk03:2181/solr",
["sial-catalog-product-20190824",
[
"consecutiveErrors",
700357,
"bad_request",
0,
"internal",
700357,
"last",
[
"2019-09-09T19:17:57.453Z",
"internal",
"2019-09-09T19:17:56.949Z",
"internal",
"2019-09-09T19:17:56.448Z"
,"internal",...

We have found that one or more tlogs have become corrupt. It appears that CDCR keeps trying to send data, but it cannot read the data from the tlog, so it retries, forever.
How does this happen? It seems to be very frequent (roughly weekly) and difficult to troubleshoot.
Today we had it happen with one of our collections. Here is the listing for the 
tlog files:

$ ls -alht
total 604M
drwxr-xr-x 2 apache apache  44K Sep  9 14:27 .
-rw-r--r-- 1 apache apache 6.7M Sep  6 19:44 tlog.0000000000000000766.1643975309914013696
-rw-r--r-- 1 apache apache  35M Sep  6 19:43 tlog.0000000000000000765.1643975245907886080
-rw-r--r-- 1 apache apache  30M Sep  6 19:42 tlog.0000000000000000764.1643975182924120064
-rw-r--r-- 1 apache apache  37M Sep  6 19:41 tlog.0000000000000000763.1643975118316109824
-rw-r--r-- 1 apache apache  19M Sep  6 19:40 tlog.0000000000000000762.1643975053918863360
-rw-r--r-- 1 apache apache  21M Sep  6 19:39 tlog.0000000000000000761.1643974989726089216
-rw-r--r-- 1 apache apache  21M Sep  6 19:38 tlog.0000000000000000760.1643974926010417152
-rw-r--r-- 1 apache apache  29M Sep  6 19:37 tlog.0000000000000000759.1643974862567374848
-rw-r--r-- 1 apache apache 6.2M Sep  6 19:10 tlog.0000000000000000758.1643973174027616256
-rw-r--r-- 1 apache apache 228K Sep  5 19:48 tlog.0000000000000000757.1643885009483857920
-rw-r--r-- 1 apache apache  27M Sep  5 19:48 tlog.0000000000000000756.1643884946565103616
-rw-r--r-- 1 apache apache  35M Sep  5 19:47 tlog.0000000000000000755.1643884877912735744
-rw-r--r-- 1 apache apache  30M Sep  5 19:46 tlog.0000000000000000754.1643884812724862976
-rw-r--r-- 1 apache apache  25M Sep  5 19:45 tlog.0000000000000000753.1643884748976685056
-rw-r--r-- 1 apache apache  18M Sep  5 19:44 tlog.0000000000000000752.1643884685794738176
-rw-r--r-- 1 apache apache  21M Sep  5 19:43 tlog.0000000000000000751.1643884621330382848
-rw-r--r-- 1 apache apache  16M Sep  5 19:42 tlog.0000000000000000750.1643884558054064128
-rw-r--r-- 1 apache apache  26M Sep  5 19:41 tlog.0000000000000000749.1643884494725316608
-rw-r--r-- 1 apache apache 5.8M Sep  5 19:12 tlog.0000000000000000748.1643882681969147904
-rw-r--r-- 1 apache apache  31M Sep  4 19:56 tlog.0000000000000000747.1643794877229563904
-rw-r--r-- 1 apache apache  31M Sep  4 19:55 tlog.0000000000000000746.1643794813706829824
-rw-r--r-- 1 apache apache  30M Sep  4 19:54 tlog.0000000000000000745.1643794749615767552
-rw-r--r-- 1 apache apache  22M Sep  4 19:53 tlog.0000000000000000744.1643794686253465600
-rw-r--r-- 1 apache apache  18M Sep  4 19:52 tlog.0000000000000000743.1643794622319689728
-rw-r--r-- 1 apache apache  21M Sep  4 19:51 tlog.0000000000000000742.1643794558055612416
-rw-r--r-- 1 apache apache  15M Sep  4 19:50 tlog.0000000000000000741.1643794493330161664
-rw-r--r-- 1 apache apache  26M Sep  4 19:49 tlog.0000000000000000740.1643794428790308864
-rw-r--r-- 1 apache apache  11M Sep  4 14:58 tlog.0000000000000000737.1643701398824550400
drwxr-xr-x 5 apache apache   53 Aug 21 06:30 ..
[apache@dfw-pauth-msc01 tlog]$ ls -alht tlog.0000000000000000757.1643885009483857920
-rw-r--r-- 1 apache apache 228K Sep  5 19:48 tlog.0000000000000000757.1643885009483857920
$ date
Mon Sep  9 14:27:31 CDT 2019
$ pwd
/var/solr/data/sial-catalog-product-20190824_shard1_replica_n1/data/tlog

CDCR started replicating again after we deleted the oldest tlog file, tlog.0000000000000000737.1643701398824550400, and restarted CDCR.
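For the record, the manual recovery we do in these cases looks roughly like this (host, port, collection, path and tlog name are just the ones from this incident; STOP/START are the standard CDCR API actions, run against the source collection):

# stop CDCR on the source collection
$ curl "http://dfw-pauth-msc01:8983/solr/sial-catalog-product-20190824/cdcr?action=STOP"
# remove the tlog file CDCR is stuck on, from the source replica's tlog directory
$ rm /var/solr/data/sial-catalog-product-20190824_shard1_replica_n1/data/tlog/tlog.0000000000000000737.1643701398824550400
# start CDCR again so it moves on to the remaining tlogs
$ curl "http://dfw-pauth-msc01:8983/solr/sial-catalog-product-20190824/cdcr?action=START"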

Around the same time I found a number of errors in the Solr logs like this:
2019-09-04 19:58:01.393 ERROR (recoveryExecutor-162-thread-1-processing-n:dfw-pauth-msc01:8983_solr x:sial-catalog-product-20190824_shard1_replica_n1 s:shard1 c:sial-catalog-product-20190824 r:core_node3) [c:sial-catalog-product-20190824 s:shard1 r:core_node3 x:sial-catalog-product-20190824_shard1_replica_n1] o.a.s.u.UpdateLog java.lang.ClassCastException

This was the most common error at the time; I saw it for all of our collections:
2019-09-04 19:57:46.572 ERROR (qtp1355531311-20) [c:sial-catalog-product-20190824 s:shard1 r:core_node3 x:sial-catalog-product-20190824_shard1_replica_n1] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Action LASTPROCESSEDVERSION sent to non-leader replica

I would like to understand how this happens, and find better ways to diagnose 
it early.
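The only early-warning approach I can think of so far is to poll the CDCR API on the source leaders and alert when consecutiveErrors climbs or the queues stop draining, e.g. something like this (host, port and collection are ours; the grep is just a quick check):

$ curl -s "http://dfw-pauth-msc01:8983/solr/sial-catalog-product-20190824/cdcr?action=QUEUES&wt=json&indent=on"
$ curl -s "http://dfw-pauth-msc01:8983/solr/sial-catalog-product-20190824/cdcr?action=ERRORS&wt=json&indent=on" | grep -A1 consecutiveErrors

If I read the documentation right, QUEUES also reports tlogTotalSize and tlogTotalCount, which should grow abnormally when the source gets stuck like this, but I have not confirmed that in practice.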

This occurs frequently enough that we are considering developing an alternative to CDCR, as it does not seem to be very robust.

Any help would be appreciated.
