Yeah, this is weird. First of all, recovery shouldn't take much memory
even if it's a full recovery; it's just file copies. And there
shouldn't be any problem with the leadership changing; that should
_not_ require a replication.

Your autocommit max time is a bit worrisome. You have docs hanging
around in the tlog for 10 minutes, and they may have to be replayed
at startup, which could get weird. You say that you stop indexing, but
do you wait for at least 10 minutes afterwards? A hard commit
(autocommit) with openSearcher=false is actually a pretty inexpensive
operation. See:
https://lucidworks.com/2013/08/23/understanding-transaction-logs-softcommit-and-commit-in-sorlcloud/.
I'd recommend making these much shorter, say 15 seconds, along the
lines of the snippet below. That's a shot in the dark as a factor in
your situation, though.
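
For reference, a hard-commit setting in that range would look
something like this in the <updateHandler> section of solrconfig.xml
(the 15000 ms value is just the ballpark I mentioned, not a magic
number):

  <autoCommit>
    <maxTime>15000</maxTime>            <!-- hard commit every 15 seconds -->
    <openSearcher>false</openSearcher>  <!-- don't open a new searcher on hard commit -->
  </autoCommit>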

Next up would be a heap dump to see where the memory spike comes from,
'cause I'm out of ideas...
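
If it helps, the stock JDK tooling is usually enough to grab one
(assuming jmap is available on the box and you know the Solr pid; the
file path here is just an example):

  jmap -dump:live,format=b,file=/tmp/solr-heap.hprof <solr-pid>

You could also start Solr with -XX:+HeapDumpOnOutOfMemoryError so the
JVM writes a dump automatically if the heap actually fills up.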

Best,
Erick
On Wed, Dec 12, 2018 at 8:20 PM shamik <sham...@gmail.com> wrote:
>
> Erick,
>
>    Thanks for your input. All our fields (for facet, group & sort) have had
> docValues enabled since 6.5. That includes the id field. Here's the field
> cache entry:
>
> CACHE.core.fieldCache.entries_count:    0
> CACHE.core.fieldCache.total_size:          0 bytes
>
> Based on what I've seen so far, I think zookeeper is not the culprit
> here. All the nodes, including zookeeper, were set up recently. They are
> all inside the same VPC within the same AZ. The instances talk to each
> other through a dedicated network. Both zookeeper and Solr instances have
> SSDs.
>
> Here's what's happening based on my observation. Whenever an instance is
> restarted, it sends a PREPRECOVERY command to its leader or a different
> node in the other shard. The node that receives the recovery request is
> the one that is due to go down next. Within a few minutes, the heap size
> (old gen) reaches the max allocated heap, thus stalling the process. I
> suspect that because of this it fails to send the credentials for a
> zookeeper session within the stipulated timeframe, which is why zookeeper
> terminates the session. Here's the relevant part of the startup log:
>
> 2018-12-13 04:02:34.910 INFO
> (recoveryExecutor-4-thread-1-processing-n:x.x.193.244:8983_solr
> x:knowledge_shard2_replica_n4 c:knowledge s:shard2 r:core_node9)
> [c:knowledge s:shard2 r:core_node9 x:knowledge_shard2_replica_n4]
> o.a.s.c.RecoveryStrategy Sending prep recovery command to
> [http://x.x.240.225:8983/solr]; [WaitForState:
> action=PREPRECOVERY&core=knowledge_shard2_replica_n6&nodeName=x.x.x.244:8983_solr&coreNodeName=core_node9&state=recovering&checkLive=true&onlyIfLeader=true&onlyIfLeaderActive=true]
>
> The node sends a recovery command to its replica, which immediately causes
> the G1 Old Gen JVM pool to hit the max heap size. Please see the screenshot
> below, which shows the sudden jump in heap size. We've made sure that the
> indexing process is completely switched off at this point, so there are no
> commits happening.
>
> JVM Pool --> https://www.dropbox.com/s/5s0igznhrol6c05/jvm_pool_1.png?dl=0
>
> I'm totally puzzled by this weird behavior; I've never seen anything like
> this before. Could the G1GC settings be contributing to this issue?
>
> From the zookeeper log:
>
> 2018-12-13 03:47:27,905 [myid:1] - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] -
> Accepted socket connection from /10.0.0.160:58376
> 2018-12-13 03:47:27,905 [myid:1] - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@215] -
> Accepted socket connection from /10.0.0.160:58378
> 2018-12-13 03:47:27,905 [myid:1] - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@938] - Client
> attempting to establish new session at /10.0.0.160:58376
> 2018-12-13 03:47:27,905 [myid:1] - WARN
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@376] - Unable to
> read additional data from client sessionid 0x0, likely client has closed
> socket
> 2018-12-13 03:47:27,905 [myid:1] - INFO
> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1040] - Closed
> socket connection for client /10.0.0.160:58378 (no session established for
> client)
> 2018-12-13 03:47:27,907 [myid:1] - INFO
> [CommitProcessor:1:ZooKeeperServer@683] - Established session
> 0x100c46d01440072 with negotiated timeout 10000 for client /10.0.0.160:58376
> 2018-12-13 03:47:39,386 [myid:1] - INFO
> [CommitProcessor:1:NIOServerCnxn@1040] - Closed socket connection for client
> /10.0.0.160:58376 which had sessionid 0x100c46d01440072
>
>
>
> --
> Sent from: http://lucene.472066.n3.nabble.com/Solr-User-f472068.html
