Thanks for the response, Erick. To answer your question about index size on
disk: it is 3 TB on every node. As mentioned, it's a 32 GB machine and I have
allocated 24 GB to the Java heap.

Monitoring the recovery further, I see that while the follower node is
recovering, the leader node (which is NOT recovering) almost freezes, with
100% CPU usage and 80%+ memory usage. The follower node's memory usage is also
80%+, but its CPU is very healthy. Also, the follower node's log fills up with
updates forwarded from the leader ("...PRE_UPDATE FINISH
{update.distrib=FROMLEADER&distrib.from=...") and index replication only
starts much later.
There have been instances where a complete recovery took 10+ hours. We have
upgraded to a 4 Gbps NIC between the nodes to see if it helps.

Also, a few follow-up questions:

1) Is there a configuration which would start throttling update requests if a
replica falls behind by a certain number of updates, so as not to trigger a
full index replication later? If not, would it be a worthwhile enhancement?
2) What would be a recommended hard commit interval for this kind of setup?
3) What are some of the improvements in 7.5 with respect to recovery, compared
to 7.2.1?
4) What do the peersync failure log lines below mean? This would help me
better understand the reasons for peersync failure and maybe devise an alert
mechanism to start throttling update requests from the application program, if
feasible (a rough sketch of what I have in mind follows right after these
questions).
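
To make the throttling/alerting idea in questions 1 and 4 a bit more concrete,
here is a rough, untested sketch of the kind of client-side check I have in
mind: use SolrJ's cluster state to detect a recovering replica and pause our
indexer. The class and method names are made up for illustration, and the
back-off interval is an arbitrary placeholder, not a recommendation.

// Rough sketch only (untested): pause our bulk indexer while any replica of
// the collection is in RECOVERING state, so the leader isn't flooded with
// updates that the replica will later have to replay or full-sync.
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.cloud.DocCollection;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.Slice;

public class IndexingThrottle {

    // Our collection name
    private static final String COLLECTION =
            "DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66";

    // Returns true if any replica of the collection is currently recovering.
    static boolean anyReplicaRecovering(CloudSolrClient client) {
        DocCollection coll = client.getZkStateReader()
                .getClusterState()
                .getCollection(COLLECTION);
        for (Slice slice : coll.getSlices()) {
            for (Replica replica : slice.getReplicas()) {
                if (replica.getState() == Replica.State.RECOVERING) {
                    return true;
                }
            }
        }
        return false;
    }

    // Called by the indexer before each batch; backs off while recovery is in progress.
    static void waitUntilHealthy(CloudSolrClient client) throws InterruptedException {
        while (anyReplicaRecovering(client)) {
            Thread.sleep(30_000); // arbitrary back-off; would be tuned and alerted on
        }
    }
}

This would live purely on the application side, hence question 1 about whether
something similar could exist in Solr itself. The peersync failure logs follow
below.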

*PeerSync Failure type 1*:
----------------------------------
2019-02-04 20:43:50.018 INFO
(recoveryExecutor-4-thread-2-processing-n:indexnode1:20000_solr
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
[c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
org.apache.solr.update.PeerSync Fingerprint comparison: 1

2019-02-04 20:43:50.018 INFO
(recoveryExecutor-4-thread-2-processing-n:indexnode1:20000_solr
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
[c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
org.apache.solr.update.PeerSync Other fingerprint:
{maxVersionSpecified=1624579878580912128,
maxVersionEncountered=1624579893816721408, maxInHash=1624579878580912128,
versionsHash=-8308981502886241345, numVersions=32966082, numDocs=32966165,
maxDoc=1828452}, Our fingerprint: {maxVersionSpecified=1624579878580912128,
maxVersionEncountered=1624579975760838656, maxInHash=1624579878580912128,
versionsHash=4017509388564167234, numVersions=32966066, numDocs=32966165,
maxDoc=1828452}

2019-02-04 20:43:50.018 INFO
(recoveryExecutor-4-thread-2-processing-n:indexnode1:20000_solr
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
[c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
org.apache.solr.update.PeerSync PeerSync:
core=DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42 url=
http://indexnode1:8983/solr DONE. sync failed

2019-02-04 20:43:50.018 INFO
(recoveryExecutor-4-thread-2-processing-n:indexnode1:8983_solr
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
[c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
org.apache.solr.cloud.RecoveryStrategy PeerSync Recovery was not successful
- trying replication.


*PeerSync Failure type 2*:
---------------------------------
2019-02-02 20:26:56.256 WARN
(recoveryExecutor-4-thread-11-processing-n:indexnode1:20000_solr
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46
s:shard12 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node49)
[c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard12 r:core_node49
x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46]
org.apache.solr.update.PeerSync PeerSync:
core=DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46 url=
http://indexnode1:20000/solr too many updates received since start -
startingUpdates no longer overlaps with our currentUpdates


Regards,
Rahul

On Thu, Feb 7, 2019 at 12:59 PM Erick Erickson <erickerick...@gmail.com>
wrote:

> bq. We have a heavy indexing load of about 10,000 documents every 150
> seconds.
> Not so heavy query load.
>
> It's unlikely that changing numRecordsToKeep will help all that much if
> your
> maintenance window is very large. Rather, that number would have to be
> _very_
> high.
>
> 7 hours is huge. How big are your indexes on disk? You're essentially
> going to get a
> full copy from the leader for each replica, so network bandwidth may
> be the bottleneck.
> Plus, every doc that gets indexed to the leader during sync will be stored
> away in the replica's tlog (not limited by numRecordsToKeep) and replayed
> after
> the full index replication is accomplished.
>
> Much of the retry logic for replication has been improved starting
> with Solr 7.3 and,
> in particular, Solr 7.5. That might address your replicas that just
> fail to replicate ever,
> but won't help with replicas that need a full sync anyway.
>
> That said, by far the simplest thing would be to stop indexing during
> your maintenance
> window if at all possible.
>
> Best,
> Erick
>
> On Tue, Feb 5, 2019 at 9:11 PM Rahul Goswami <rahul196...@gmail.com>
> wrote:
> >
> > Hello Solr gurus,
> >
> > So I have a scenario where on Solr cluster restart the replica node goes
> > into full index replication for about 7 hours. Both replica nodes are
> > restarted around the same time for maintenance. Also, during usual times,
> > if one node goes down for whatever reason, upon restart it again does
> index
> > replication. In certain instances, some replicas just fail to recover.
> >
> > *SolrCloud 7.2.1 *cluster configuration*:*
> > ============================
> > 16 shards - replication factor=2
> >
> > Per server configuration:
> > ======================
> > 32GB machine - 16GB heap space for Solr
> > Index size : 3TB per server
> >
> > autoCommit (openSearcher=false) of 3 minutes
> >
> > We have a heavy indexing load of about 10,000 documents every 150
> seconds.
> > Not so heavy query load.
> >
> > Reading through some of the threads on similar topic, I suspect it would
> be
> > the disparity between the number of updates(>100) between the replicas
> that
> > is causing this (courtesy our indexing load). One of the suggestions I
> saw
> > was using numRecordsToKeep.
> > However as Erick mentioned in one of the threads, that's a bandaid
> measure
> > and I am trying to eliminate some of the fundamental issues that might
> > exist.
> >
> > 1) Is the heap too less for that index size? If yes, what would be a
> > recommended max heap size?
> > 2) Is there a general guideline to estimate the required max heap based
> on
> > index size on disk?
> > 3) What would be a recommended autoCommit and autoSoftCommit interval ?
> > 4) Any configurations that would help improve the restart time and avoid
> > full replication?
> > 5) Does Solr retain "numRecordsToKeep" number of  documents in tlog *per
> > replica*?
> > 6) The reasons for peersync from below logs are not completely clear to
> me.
> > Can someone please elaborate?
> >
> > *PeerSync fails with* :
> >
> > Failure type 1:
> > -----------------
> > 2019-02-04 20:43:50.018 INFO
> > (recoveryExecutor-4-thread-2-processing-n:indexnode1:20000_solr
> > x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
> > s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
> > [c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
> > x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
> > org.apache.solr.update.PeerSync Fingerprint comparison: 1
> >
> > 2019-02-04 20:43:50.018 INFO
> > (recoveryExecutor-4-thread-2-processing-n:indexnode1:20000_solr
> > x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
> > s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
> > [c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
> > x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
> > org.apache.solr.update.PeerSync Other fingerprint:
> > {maxVersionSpecified=1624579878580912128,
> > maxVersionEncountered=1624579893816721408, maxInHash=1624579878580912128,
> > versionsHash=-8308981502886241345, numVersions=32966082,
> numDocs=32966165,
> > maxDoc=1828452}, Our fingerprint:
> {maxVersionSpecified=1624579878580912128,
> > maxVersionEncountered=1624579975760838656, maxInHash=1624579878580912128,
> > versionsHash=4017509388564167234, numVersions=32966066, numDocs=32966165,
> > maxDoc=1828452}
> >
> > 2019-02-04 20:43:50.018 INFO
> > (recoveryExecutor-4-thread-2-processing-n:indexnode1:20000_solr
> > x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
> > s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
> > [c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
> > x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
> > org.apache.solr.update.PeerSync PeerSync:
> > core=DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
> url=
> > http://indexnode1:8983/solr DONE. sync failed
> >
> > 2019-02-04 20:43:50.018 INFO
> > (recoveryExecutor-4-thread-2-processing-n:indexnode1:8983_solr
> > x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
> > s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
> > [c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
> > x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
> > org.apache.solr.cloud.RecoveryStrategy PeerSync Recovery was not
> successful
> > - trying replication.
> >
> >
> > Failure type 2:
> > ------------------
> > 2019-02-02 20:26:56.256 WARN
> > (recoveryExecutor-4-thread-11-processing-n:indexnode1:20000_solr
> > x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46
> > s:shard12 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node49)
> > [c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard12 r:core_node49
> > x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46]
> > org.apache.solr.update.PeerSync PeerSync:
> > core=DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46
> url=
> > http://indexnode1:20000/solr too many updates received since start -
> > startingUpdates no longer overlaps with our currentUpdates
> >
> >
> > Thanks,
> > Rahul
>
