[ 
https://issues.apache.org/jira/browse/SOLR-14940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229764#comment-17229764
 ] 

Tomas Eduardo Fernandez Lobbe commented on SOLR-14940:
------------------------------------------------------

I believe the test added to {{TestPullReplicaErrorHandling}} is causing the 
failures described in https://issues.apache.org/jira/browse/SOLR-14992. The 
failures happen in the tests running immediately after 
{{testCloseHooksDeletedOnReconnect}}, for example:

{noformat}
    [junit4]   2> 37296 INFO  
(TEST-TestPullReplicaErrorHandling.testCloseHooksDeletedOnReconnect-seed#[64BCF377B675F037])
 [     ] o.a.s.c.MiniSolrCloudCluster Expired zookeeper session 
72103195153465373 from node http://127.0.0.1:64010/solr
   [junit4]   2> 37296 INFO  
(TEST-TestPullReplicaErrorHandling.testCloseHooksDeletedOnReconnect-seed#[64BCF377B675F037])
 [     ] o.a.s.c.SolrCloudTestCase waitForState 
(pull_replica_error_handling_test_close_hooks_deleted_on_reconnect): Expecting 
node to reconnect
   [junit4]   2> 37296 WARN  (NIOWorkerThread-1) [     ] o.a.z.s.NIOServerCnxn 
Unexpected exception
   [junit4]   2>           => EndOfStreamException: Unable to read additional 
data from client, it probably closed the socket: address = /127.0.0.1:64510, 
session = 0x1002979564b001d
   [junit4]   2>        at 
org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
   [junit4]   2> org.apache.zookeeper.server.ServerCnxn$EndOfStreamException: 
Unable to read additional data from client, it probably closed the socket: 
address = /127.0.0.1:64510, session = 0x1002979564b001d
   [junit4]   2>        at 
org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163)
 ~[zookeeper-3.6.2.jar:3.6.2]
   [junit4]   2>        at 
org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326) 
[zookeeper-3.6.2.jar:3.6.2]
   [junit4]   2>        at 
org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522)
 [zookeeper-3.6.2.jar:3.6.2]
   [junit4]   2>        at 
org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154)
 [zookeeper-3.6.2.jar:3.6.2]
   [junit4]   2>        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) 
[?:?]
   [junit4]   2>        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) 
[?:?]
   [junit4]   2>        at java.lang.Thread.run(Thread.java:832) [?:?]
   [junit4]   2> 37299 INFO  (zkCallback-214-thread-1) [     ] 
o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (4) -> (3)
   [junit4]   2> 37299 INFO  (zkCallback-184-thread-3) [     ] 
o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (4) -> (3)
   [junit4]   2> 37299 INFO  (zkCallback-154-thread-1) [     ] 
o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (4) -> (3)
   [junit4]   2> 37299 INFO  (zkCallback-100-thread-1) [     ] 
o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (4) -> (3)
   [junit4]   2> 37304 INFO  
(TEST-TestPullReplicaErrorHandling.testCloseHooksDeletedOnReconnect-seed#[64BCF377B675F037])
 [     ] o.a.s.c.TestPullReplicaErrorHandling tearDown deleting collection
   [junit4]   2> 37306 INFO  (SocketProxy-Acceptor-64086) [     ] 
o.a.s.c.s.c.SocketProxy accepted 
Socket[addr=/127.0.0.1,port=64511,localport=64086], receiveBufferSize: 65536
   [junit4]   2> 37306 INFO  (SocketProxy-Acceptor-64086) [     ] 
o.a.s.c.s.c.SocketProxy proxy connection 
Socket[addr=/127.0.0.1,port=64102,localport=64512], receiveBufferSize=65536
   [junit4]   2> 37308 INFO  (qtp1774678962-277) [n:127.0.0.1:64086_solr     ] 
o.a.s.h.a.CollectionsHandler Invoked Collection Action :delete with params 
name=pull_replica_error_handling_test_close_hooks_deleted_on_reconnect&action=DELETE&wt=javabin&version=2
 and sendToOCPQueue=true
   [junit4]   2> 37319 INFO  
(OverseerThreadFactory-238-thread-1-processing-n:127.0.0.1:64086_solr) 
[n:127.0.0.1:64086_solr     ] o.a.s.c.a.c.OverseerCollectionMessageHandler 
Executing Collection 
Cmd=action=UNLOAD&deleteInstanceDir=true&deleteDataDir=true&deleteMetricsHistory=true,
 asyncId=null
   [junit4]   2> 37321 INFO  (SocketProxy-Acceptor-64180) [     ] 
o.a.s.c.s.c.SocketProxy accepted 
Socket[addr=/127.0.0.1,port=64513,localport=64180], receiveBufferSize: 65536
   [junit4]   2> 37322 INFO  (SocketProxy-Acceptor-64180) [     ] 
o.a.s.c.s.c.SocketProxy proxy connection 
Socket[addr=/127.0.0.1,port=64194,localport=64514], receiveBufferSize=65536
   [junit4]   2> 37324 INFO  (qtp420303283-332) [n:127.0.0.1:64180_solr    
x:pull_replica_error_handling_test_close_hooks_deleted_on_reconnect_shard1_replica_n1
 ] o.a.s.m.SolrMetricManager Closing metric reporters for 
registry=solr.core.pull_replica_error_handling_test_close_hooks_deleted_on_reconnect.shard1.replica_n1
 tag=null
   [junit4]   2> 37324 INFO  (qtp420303283-332) [n:127.0.0.1:64180_solr     ] 
o.a.s.c.SolrCore 
[pull_replica_error_handling_test_close_hooks_deleted_on_reconnect_shard1_replica_n1]
  CLOSING SolrCore org.apache.solr.core.SolrCore@25632cb5
   [junit4]   2> 37324 INFO  (qtp420303283-332) [n:127.0.0.1:64180_solr     ] 
o.a.s.m.SolrMetricManager Closing metric reporters for 
registry=solr.core.pull_replica_error_handling_test_close_hooks_deleted_on_reconnect.shard1.replica_n1
 tag=SolrCore@25632cb5
   [junit4]   2> 37324 INFO  (qtp420303283-332) [n:127.0.0.1:64180_solr     ] 
o.a.s.m.SolrMetricManager Closing metric reporters for 
registry=solr.collection.pull_replica_error_handling_test_close_hooks_deleted_on_reconnect.shard1.leader
 tag=SolrCore@25632cb5
   [junit4]   2> 37324 INFO  (qtp420303283-332) [n:127.0.0.1:64180_solr     ] 
o.a.s.u.DirectUpdateHandler2 Committing on IndexWriter.close()  ... SKIPPED 
(unnecessary).
   [junit4]   2> 37342 INFO  (qtp420303283-332) [n:127.0.0.1:64180_solr     ] 
o.a.s.c.ZkShardTerms Successful update of terms at 
/collections/pull_replica_error_handling_test_close_hooks_deleted_on_reconnect/terms/shard1
 to Terms{values={}, version=2}
   [junit4]   2> 37349 INFO  (qtp420303283-332) [n:127.0.0.1:64180_solr     ] 
o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores 
params={deleteInstanceDir=true&deleteMetricsHistory=true&core=pull_replica_error_handling_test_close_hooks_deleted_on_reconnect_shard1_replica_n1&qt=/admin/cores&deleteDataDir=true&action=UNLOAD&wt=javabin&version=2}
 status=0 QTime=24
   [junit4]   2> 37469 INFO  (zkCallback-184-thread-2) [     ] 
o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent 
state:SyncConnected type:NodeDeleted 
path:/collections/pull_replica_error_handling_test_close_hooks_deleted_on_reconnect/state.json]
 for collection 
[pull_replica_error_handling_test_close_hooks_deleted_on_reconnect] has 
occurred - updating... (live nodes size: [3])
   [junit4]   2> 37469 INFO  (zkCallback-184-thread-1) [     ] 
o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent 
state:SyncConnected type:NodeDeleted 
path:/collections/pull_replica_error_handling_test_close_hooks_deleted_on_reconnect/state.json]
 for collection 
[pull_replica_error_handling_test_close_hooks_deleted_on_reconnect] has 
occurred - updating... (live nodes size: [3])
   [junit4]   2> 37469 INFO  (zkCallback-154-thread-1) [     ] 
o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent 
state:SyncConnected type:NodeDeleted 
path:/collections/pull_replica_error_handling_test_close_hooks_deleted_on_reconnect/state.json]
 for collection 
[pull_replica_error_handling_test_close_hooks_deleted_on_reconnect] has 
occurred - updating... (live nodes size: [3])
   [junit4]   2> 37469 INFO  (zkCallback-184-thread-3) [     ] 
o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent 
state:SyncConnected type:NodeDeleted 
path:/collections/pull_replica_error_handling_test_close_hooks_deleted_on_reconnect/state.json]
 for collection 
[pull_replica_error_handling_test_close_hooks_deleted_on_reconnect] has 
occurred - updating... (live nodes size: [3])
   [junit4]   2> 37506 INFO  (qtp1774678962-277) [n:127.0.0.1:64086_solr     ] 
o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections 
params={name=pull_replica_error_handling_test_close_hooks_deleted_on_reconnect&action=DELETE&wt=javabin&version=2}
 status=0 QTime=182
   [junit4]   2> 37507 INFO  
(TEST-TestPullReplicaErrorHandling.testCloseHooksDeletedOnReconnect-seed#[64BCF377B675F037])
 [     ] o.a.s.c.TestPullReplicaErrorHandling Collection deleted
   [junit4]   2> 37507 INFO  
(TEST-TestPullReplicaErrorHandling.testCloseHooksDeletedOnReconnect-seed#[64BCF377B675F037])
 [     ] o.a.s.SolrTestCaseJ4 ###Ending testCloseHooksDeletedOnReconnect
   [junit4] OK      16.1s J4 | 
TestPullReplicaErrorHandling.testCloseHooksDeletedOnReconnect
   [junit4]   2> 37551 INFO  
(TEST-TestPullReplicaErrorHandling.testCantConnectToPullReplica-seed#[64BCF377B675F037])
 [     ] o.a.s.SolrTestCaseJ4 ###Starting testCantConnectToPullReplica
   [junit4]   2> 37554 INFO  (qtp420303283-333) [n:127.0.0.1:64180_solr     ] 
o.a.s.h.a.CollectionsHandler Invoked Collection Action :create with params 
pullReplicas=1&collection.configName=conf&maxShardsPerNode=1&name=pull_replica_error_handling_test_cant_connect_to_pull_replica&nrtReplicas=1&action=CREATE&numShards=2&tlogReplicas=0&wt=javabin&version=2
 and sendToOCPQueue=true
   [junit4]   2> 37563 INFO  
(OverseerThreadFactory-238-thread-2-processing-n:127.0.0.1:64086_solr) 
[n:127.0.0.1:64086_solr     ] o.a.s.c.a.c.CreateCollectionCmd Create collection 
pull_replica_error_handling_test_cant_connect_to_pull_replica
   [junit4]   2> 37564 INFO  
(OverseerCollectionConfigSetProcessor-72103195153465365-127.0.0.1:64086_solr-n_0000000005)
 [n:127.0.0.1:64086_solr     ] o.a.s.c.OverseerTaskQueue Response ZK path: 
/overseer/collection-queue-work/qnr-0000000002 doesn't exist. Requestor may 
have disconnected from ZooKeeper
   [junit4]   2> 37795 INFO  
(OverseerThreadFactory-238-thread-2-processing-n:127.0.0.1:64086_solr) 
[n:127.0.0.1:64086_solr     ] o.a.s.c.a.c.OverseerCollectionMessageHandler 
Executing Collection 
Cmd=action=UNLOAD&deleteInstanceDir=true&deleteDataDir=true&deleteMetricsHistory=true,
 asyncId=null
   [junit4]   2> 37914 INFO  (zkCallback-154-thread-1) [     ] 
o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent 
state:SyncConnected type:NodeDeleted 
path:/collections/pull_replica_error_handling_test_cant_connect_to_pull_replica/state.json]
 for collection [pull_replica_error_handling_test_cant_connect_to_pull_replica] 
has occurred - updating... (live nodes size: [3])
   [junit4]   2> 37920 ERROR 
(OverseerThreadFactory-238-thread-2-processing-n:127.0.0.1:64086_solr) 
[n:127.0.0.1:64086_solr     ] o.a.s.c.a.c.OverseerCollectionMessageHandler 
Collection: pull_replica_error_handling_test_cant_connect_to_pull_replica 
operation: create failed:org.apache.solr.common.SolrException: Cannot create 
collection pull_replica_error_handling_test_cant_connect_to_pull_replica. Value 
of maxShardsPerNode is 1, and the number of nodes currently live or live and 
part of your createNodeSet is 3. This allows a maximum of 3 to be created. 
Value of numShards is 2, value of nrtReplicas is 1, value of tlogReplicas is 0 
and value of pullReplicas is 1. This requires 4 shards to be created (higher 
than the allowed number)
   [junit4]   2>        at 
org.apache.solr.cloud.api.collections.CreateCollectionCmd.call(CreateCollectionCmd.java:198)
   [junit4]   2>        at 
org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:264)
   [junit4]   2>        at 
org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:517)
   [junit4]   2>        at 
org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218)
   [junit4]   2>        at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
   [junit4]   2>        at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
   [junit4]   2>        at java.base/java.lang.Thread.run(Thread.java:832)
{noformat}

I didn't validate this myself, but from the code:
{code:java}
for (int i = 0; i < 5; i++) {
      cluster.expireZkSession(jetty);
      waitForState("Expecting node to reconnect", collectionName, 
activeReplicaCount(1, 0, 1));
      // We have two active ReplicationHandler with two close hooks each, one 
for triggering recovery and one for doing interval polling
      assertEquals(5, core.getCloseHooks().size());
    }
{code}
I believe the problem may be that we are checking for the state immediately 
after expiration, and maybe the collection state wasn't updated to reflect the 
dead node. Maybe we could first {{waitForState}} for the node to be down and 
then up? or have a watch in the live_nodes znode?



> ReplicationHandler memory leak through SolrCore.closeHooks
> ----------------------------------------------------------
>
>                 Key: SOLR-14940
>                 URL: https://issues.apache.org/jira/browse/SOLR-14940
>             Project: Solr
>          Issue Type: Bug
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: replication (java)
>         Environment: Solr Cloud Cluster on v.8.6.2 configured as 3 TLOG nodes 
> with 2 cores in each JVM.
>  
>            Reporter: Anver Sotnikov
>            Assignee: Mike Drob
>            Priority: Major
>             Fix For: master (9.0), 8.8
>
>         Attachments: Actual references to hooks that in turn hold references 
> to ReplicationHandlers.png, Memory Analyzer SolrCore.closeHooks .png
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We are experiencing a memory leak in Solr Cloud cluster configured as 3 TLOG 
> nodes.
> Leader does not seem to be affected while Followers are.
>  
> Looking at memory dump we noticed that SolrCore holds lots of references to 
> ReplicationHandler through anonymous inner classes in SolrCore.closeHooks, 
> which in turn holds ReplicationHandlers.
> ReplicationHandler registers hooks as anonymous inner classes in 
> SolrCore.closeHooks through ReplicationHandler.inform() -> 
> ReplicationHandler.registerCloseHook().
>  
> Whenever ZkController.stopReplicationFromLeader is called - it would shutdown 
> ReplicationHandler (ReplicationHandler.shutdown()), BUT reference to 
> ReplicationHandler will stay in SolrCore.closeHooks. Once replication is 
> started again on same SolrCore - new ReplicationHandler will be created and 
> registered in closeHooks.
>  
> It looks like there are few scenarios when replication is stopped and 
> restarted on same core and in our TLOG setup it shows up quite often.
>  
> Potential solutions:
>  # Allow unregistering SolrCore.closeHooks so it can be used from 
> ReplicationHandler.shutdown
>  # Hack but easier - break the link between ReplicationHandler close hooks 
> and full ReplicationHandler object so ReplicationHandler can be GCed even 
> when hooks are still registered in SolrCore.closeHooks



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org

Reply via email to