bq. We have a heavy indexing load of about 10,000 documents every 150 seconds.
Not so heavy query load.

It's unlikely that changing numRecordsToKeep will help all that much if your
maintenance window is that long; the number would have to be _very_ high to
cover it.
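
For reference, numRecordsToKeep lives in the <updateLog> section of
solrconfig.xml. If you do experiment with it, the change looks roughly like
this (the 10000 below is purely illustrative, not a recommendation):

    <updateLog>
      <str name="dir">${solr.ulog.dir:}</str>
      <!-- update records kept in the tlog for peer sync; default is 100 -->
      <int name="numRecordsToKeep">10000</int>
      <!-- cap on how many old tlog files are retained; default is 10 -->
      <int name="maxNumLogsToKeep">10</int>
    </updateLog>

Raising it mostly trades tlog disk space and replay time for a somewhat
better chance that PeerSync succeeds.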

7 hours is huge. How big are your indexes on disk? You're essentially going
to get a full copy from the leader for each replica, so network bandwidth may
well be the bottleneck. Plus, every doc indexed to the leader during the sync
is stored away in the replica's tlog (not limited by numRecordsToKeep) and
replayed after the full index replication completes.
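
As a rough sanity check (assuming the ~3TB per server you mention below has
to move over something like a 1 Gbit/s link, which is my assumption, not
something you've stated):

    3 TB / ~125 MB/s  =  ~24,000 s  =  ~6.7 hours

which lines up pretty well with the 7 hours you're seeing.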

Much of the retry logic for replication has been improved starting with Solr
7.3 and, in particular, Solr 7.5. That might address the replicas that just
fail to recover at all, but it won't help with the fact that replicas need a
full sync in the first place.

That said, by far the simplest thing would be to stop indexing during your
maintenance window, if at all possible.

Best,
Erick

On Tue, Feb 5, 2019 at 9:11 PM Rahul Goswami <rahul196...@gmail.com> wrote:
>
> Hello Solr gurus,
>
> So I have a scenario where, on a Solr cluster restart, the replica nodes go
> into full index replication for about 7 hours. Both replica nodes are
> restarted around the same time for maintenance. Also, during usual times,
> if one node goes down for whatever reason, upon restart it again does index
> replication. In certain instances, some replicas just fail to recover.
>
> *SolrCloud 7.2.1 cluster configuration:*
> ============================
> 16 shards - replication factor=2
>
> Per server configuration:
> ======================
> 32GB machine - 16GB heap space for Solr
> Index size : 3TB per server
>
> autoCommit (openSearcher=false) of 3 minutes
>
> We have a heavy indexing load of about 10,000 documents every 150 seconds.
> Not so heavy query load.
>
> Reading through some of the threads on a similar topic, I suspect it is the
> disparity in the number of updates (>100) between the replicas that is
> causing this (courtesy of our indexing load). One of the suggestions I saw
> was to use numRecordsToKeep.
> However, as Erick mentioned in one of the threads, that's a band-aid measure
> and I am trying to eliminate some of the fundamental issues that might
> exist.
>
> 1) Is the heap too small for that index size? If yes, what would be a
> recommended max heap size?
> 2) Is there a general guideline to estimate the required max heap based on
> index size on disk?
> 3) What would be a recommended autoCommit and autoSoftCommit interval ?
> 4) Any configurations that would help improve the restart time and avoid
> full replication?
> 5) Does Solr retain "numRecordsToKeep" documents in the tlog *per
> replica*?
> 6) The reasons for the PeerSync failures in the logs below are not
> completely clear to me. Can someone please elaborate?
>
> *PeerSync fails with:*
>
> Failure type 1:
> -----------------
> 2019-02-04 20:43:50.018 INFO
> (recoveryExecutor-4-thread-2-processing-n:indexnode1:20000_solr
> x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
> s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
> [c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
> x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
> org.apache.solr.update.PeerSync Fingerprint comparison: 1
>
> 2019-02-04 20:43:50.018 INFO
> (recoveryExecutor-4-thread-2-processing-n:indexnode1:20000_solr
> x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
> s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
> [c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
> x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
> org.apache.solr.update.PeerSync Other fingerprint:
> {maxVersionSpecified=1624579878580912128,
> maxVersionEncountered=1624579893816721408, maxInHash=1624579878580912128,
> versionsHash=-8308981502886241345, numVersions=32966082, numDocs=32966165,
> maxDoc=1828452}, Our fingerprint: {maxVersionSpecified=1624579878580912128,
> maxVersionEncountered=1624579975760838656, maxInHash=1624579878580912128,
> versionsHash=4017509388564167234, numVersions=32966066, numDocs=32966165,
> maxDoc=1828452}
>
> 2019-02-04 20:43:50.018 INFO
> (recoveryExecutor-4-thread-2-processing-n:indexnode1:20000_solr
> x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
> s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
> [c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
> x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
> org.apache.solr.update.PeerSync PeerSync:
> core=DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42 url=
> http://indexnode1:8983/solr DONE. sync failed
>
> 2019-02-04 20:43:50.018 INFO
> (recoveryExecutor-4-thread-2-processing-n:indexnode1:8983_solr
> x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42
> s:shard11 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node45)
> [c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard11 r:core_node45
> x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard11_replica_n42]
> org.apache.solr.cloud.RecoveryStrategy PeerSync Recovery was not successful
> - trying replication.
>
>
> Failure type 2:
> ------------------
> 2019-02-02 20:26:56.256 WARN
> (recoveryExecutor-4-thread-11-processing-n:indexnode1:20000_solr
> x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46
> s:shard12 c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 r:core_node49)
> [c:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66 s:shard12 r:core_node49
> x:DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46]
> org.apache.solr.update.PeerSync PeerSync:
> core=DataIndex_1C6F947C-6673-4778-847D-2DE0FDE56C66_shard12_replica_n46 url=
> http://indexnode1:20000/solr too many updates received since start -
> startingUpdates no longer overlaps with our currentUpdates
>
>
> Thanks,
> Rahul
