[ https://issues.apache.org/jira/browse/SOLR-14940?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17229764#comment-17229764 ]
Tomas Eduardo Fernandez Lobbe commented on SOLR-14940: ------------------------------------------------------ I believe the test added to {{TestPullReplicaErrorHandling}} is causing the failures described in https://issues.apache.org/jira/browse/SOLR-14992. The failures happen in the tests running immediately after {{testCloseHooksDeletedOnReconnect}}, for example: {noformat} [junit4] 2> 37296 INFO (TEST-TestPullReplicaErrorHandling.testCloseHooksDeletedOnReconnect-seed#[64BCF377B675F037]) [ ] o.a.s.c.MiniSolrCloudCluster Expired zookeeper session 72103195153465373 from node http://127.0.0.1:64010/solr [junit4] 2> 37296 INFO (TEST-TestPullReplicaErrorHandling.testCloseHooksDeletedOnReconnect-seed#[64BCF377B675F037]) [ ] o.a.s.c.SolrCloudTestCase waitForState (pull_replica_error_handling_test_close_hooks_deleted_on_reconnect): Expecting node to reconnect [junit4] 2> 37296 WARN (NIOWorkerThread-1) [ ] o.a.z.s.NIOServerCnxn Unexpected exception [junit4] 2> => EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /127.0.0.1:64510, session = 0x1002979564b001d [junit4] 2> at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163) [junit4] 2> org.apache.zookeeper.server.ServerCnxn$EndOfStreamException: Unable to read additional data from client, it probably closed the socket: address = /127.0.0.1:64510, session = 0x1002979564b001d [junit4] 2> at org.apache.zookeeper.server.NIOServerCnxn.handleFailedRead(NIOServerCnxn.java:163) ~[zookeeper-3.6.2.jar:3.6.2] [junit4] 2> at org.apache.zookeeper.server.NIOServerCnxn.doIO(NIOServerCnxn.java:326) [zookeeper-3.6.2.jar:3.6.2] [junit4] 2> at org.apache.zookeeper.server.NIOServerCnxnFactory$IOWorkRequest.doWork(NIOServerCnxnFactory.java:522) [zookeeper-3.6.2.jar:3.6.2] [junit4] 2> at org.apache.zookeeper.server.WorkerService$ScheduledWorkRequest.run(WorkerService.java:154) [zookeeper-3.6.2.jar:3.6.2] [junit4] 2> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?] [junit4] 2> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?] [junit4] 2> at java.lang.Thread.run(Thread.java:832) [?:?] [junit4] 2> 37299 INFO (zkCallback-214-thread-1) [ ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (4) -> (3) [junit4] 2> 37299 INFO (zkCallback-184-thread-3) [ ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (4) -> (3) [junit4] 2> 37299 INFO (zkCallback-154-thread-1) [ ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (4) -> (3) [junit4] 2> 37299 INFO (zkCallback-100-thread-1) [ ] o.a.s.c.c.ZkStateReader Updated live nodes from ZooKeeper... (4) -> (3) [junit4] 2> 37304 INFO (TEST-TestPullReplicaErrorHandling.testCloseHooksDeletedOnReconnect-seed#[64BCF377B675F037]) [ ] o.a.s.c.TestPullReplicaErrorHandling tearDown deleting collection [junit4] 2> 37306 INFO (SocketProxy-Acceptor-64086) [ ] o.a.s.c.s.c.SocketProxy accepted Socket[addr=/127.0.0.1,port=64511,localport=64086], receiveBufferSize: 65536 [junit4] 2> 37306 INFO (SocketProxy-Acceptor-64086) [ ] o.a.s.c.s.c.SocketProxy proxy connection Socket[addr=/127.0.0.1,port=64102,localport=64512], receiveBufferSize=65536 [junit4] 2> 37308 INFO (qtp1774678962-277) [n:127.0.0.1:64086_solr ] o.a.s.h.a.CollectionsHandler Invoked Collection Action :delete with params name=pull_replica_error_handling_test_close_hooks_deleted_on_reconnect&action=DELETE&wt=javabin&version=2 and sendToOCPQueue=true [junit4] 2> 37319 INFO (OverseerThreadFactory-238-thread-1-processing-n:127.0.0.1:64086_solr) [n:127.0.0.1:64086_solr ] o.a.s.c.a.c.OverseerCollectionMessageHandler Executing Collection Cmd=action=UNLOAD&deleteInstanceDir=true&deleteDataDir=true&deleteMetricsHistory=true, asyncId=null [junit4] 2> 37321 INFO (SocketProxy-Acceptor-64180) [ ] o.a.s.c.s.c.SocketProxy accepted Socket[addr=/127.0.0.1,port=64513,localport=64180], receiveBufferSize: 65536 [junit4] 2> 37322 INFO (SocketProxy-Acceptor-64180) [ ] o.a.s.c.s.c.SocketProxy proxy connection Socket[addr=/127.0.0.1,port=64194,localport=64514], receiveBufferSize=65536 [junit4] 2> 37324 INFO (qtp420303283-332) [n:127.0.0.1:64180_solr x:pull_replica_error_handling_test_close_hooks_deleted_on_reconnect_shard1_replica_n1 ] o.a.s.m.SolrMetricManager Closing metric reporters for registry=solr.core.pull_replica_error_handling_test_close_hooks_deleted_on_reconnect.shard1.replica_n1 tag=null [junit4] 2> 37324 INFO (qtp420303283-332) [n:127.0.0.1:64180_solr ] o.a.s.c.SolrCore [pull_replica_error_handling_test_close_hooks_deleted_on_reconnect_shard1_replica_n1] CLOSING SolrCore org.apache.solr.core.SolrCore@25632cb5 [junit4] 2> 37324 INFO (qtp420303283-332) [n:127.0.0.1:64180_solr ] o.a.s.m.SolrMetricManager Closing metric reporters for registry=solr.core.pull_replica_error_handling_test_close_hooks_deleted_on_reconnect.shard1.replica_n1 tag=SolrCore@25632cb5 [junit4] 2> 37324 INFO (qtp420303283-332) [n:127.0.0.1:64180_solr ] o.a.s.m.SolrMetricManager Closing metric reporters for registry=solr.collection.pull_replica_error_handling_test_close_hooks_deleted_on_reconnect.shard1.leader tag=SolrCore@25632cb5 [junit4] 2> 37324 INFO (qtp420303283-332) [n:127.0.0.1:64180_solr ] o.a.s.u.DirectUpdateHandler2 Committing on IndexWriter.close() ... SKIPPED (unnecessary). [junit4] 2> 37342 INFO (qtp420303283-332) [n:127.0.0.1:64180_solr ] o.a.s.c.ZkShardTerms Successful update of terms at /collections/pull_replica_error_handling_test_close_hooks_deleted_on_reconnect/terms/shard1 to Terms{values={}, version=2} [junit4] 2> 37349 INFO (qtp420303283-332) [n:127.0.0.1:64180_solr ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/cores params={deleteInstanceDir=true&deleteMetricsHistory=true&core=pull_replica_error_handling_test_close_hooks_deleted_on_reconnect_shard1_replica_n1&qt=/admin/cores&deleteDataDir=true&action=UNLOAD&wt=javabin&version=2} status=0 QTime=24 [junit4] 2> 37469 INFO (zkCallback-184-thread-2) [ ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDeleted path:/collections/pull_replica_error_handling_test_close_hooks_deleted_on_reconnect/state.json] for collection [pull_replica_error_handling_test_close_hooks_deleted_on_reconnect] has occurred - updating... (live nodes size: [3]) [junit4] 2> 37469 INFO (zkCallback-184-thread-1) [ ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDeleted path:/collections/pull_replica_error_handling_test_close_hooks_deleted_on_reconnect/state.json] for collection [pull_replica_error_handling_test_close_hooks_deleted_on_reconnect] has occurred - updating... (live nodes size: [3]) [junit4] 2> 37469 INFO (zkCallback-154-thread-1) [ ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDeleted path:/collections/pull_replica_error_handling_test_close_hooks_deleted_on_reconnect/state.json] for collection [pull_replica_error_handling_test_close_hooks_deleted_on_reconnect] has occurred - updating... (live nodes size: [3]) [junit4] 2> 37469 INFO (zkCallback-184-thread-3) [ ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDeleted path:/collections/pull_replica_error_handling_test_close_hooks_deleted_on_reconnect/state.json] for collection [pull_replica_error_handling_test_close_hooks_deleted_on_reconnect] has occurred - updating... (live nodes size: [3]) [junit4] 2> 37506 INFO (qtp1774678962-277) [n:127.0.0.1:64086_solr ] o.a.s.s.HttpSolrCall [admin] webapp=null path=/admin/collections params={name=pull_replica_error_handling_test_close_hooks_deleted_on_reconnect&action=DELETE&wt=javabin&version=2} status=0 QTime=182 [junit4] 2> 37507 INFO (TEST-TestPullReplicaErrorHandling.testCloseHooksDeletedOnReconnect-seed#[64BCF377B675F037]) [ ] o.a.s.c.TestPullReplicaErrorHandling Collection deleted [junit4] 2> 37507 INFO (TEST-TestPullReplicaErrorHandling.testCloseHooksDeletedOnReconnect-seed#[64BCF377B675F037]) [ ] o.a.s.SolrTestCaseJ4 ###Ending testCloseHooksDeletedOnReconnect [junit4] OK 16.1s J4 | TestPullReplicaErrorHandling.testCloseHooksDeletedOnReconnect [junit4] 2> 37551 INFO (TEST-TestPullReplicaErrorHandling.testCantConnectToPullReplica-seed#[64BCF377B675F037]) [ ] o.a.s.SolrTestCaseJ4 ###Starting testCantConnectToPullReplica [junit4] 2> 37554 INFO (qtp420303283-333) [n:127.0.0.1:64180_solr ] o.a.s.h.a.CollectionsHandler Invoked Collection Action :create with params pullReplicas=1&collection.configName=conf&maxShardsPerNode=1&name=pull_replica_error_handling_test_cant_connect_to_pull_replica&nrtReplicas=1&action=CREATE&numShards=2&tlogReplicas=0&wt=javabin&version=2 and sendToOCPQueue=true [junit4] 2> 37563 INFO (OverseerThreadFactory-238-thread-2-processing-n:127.0.0.1:64086_solr) [n:127.0.0.1:64086_solr ] o.a.s.c.a.c.CreateCollectionCmd Create collection pull_replica_error_handling_test_cant_connect_to_pull_replica [junit4] 2> 37564 INFO (OverseerCollectionConfigSetProcessor-72103195153465365-127.0.0.1:64086_solr-n_0000000005) [n:127.0.0.1:64086_solr ] o.a.s.c.OverseerTaskQueue Response ZK path: /overseer/collection-queue-work/qnr-0000000002 doesn't exist. Requestor may have disconnected from ZooKeeper [junit4] 2> 37795 INFO (OverseerThreadFactory-238-thread-2-processing-n:127.0.0.1:64086_solr) [n:127.0.0.1:64086_solr ] o.a.s.c.a.c.OverseerCollectionMessageHandler Executing Collection Cmd=action=UNLOAD&deleteInstanceDir=true&deleteDataDir=true&deleteMetricsHistory=true, asyncId=null [junit4] 2> 37914 INFO (zkCallback-154-thread-1) [ ] o.a.s.c.c.ZkStateReader A cluster state change: [WatchedEvent state:SyncConnected type:NodeDeleted path:/collections/pull_replica_error_handling_test_cant_connect_to_pull_replica/state.json] for collection [pull_replica_error_handling_test_cant_connect_to_pull_replica] has occurred - updating... (live nodes size: [3]) [junit4] 2> 37920 ERROR (OverseerThreadFactory-238-thread-2-processing-n:127.0.0.1:64086_solr) [n:127.0.0.1:64086_solr ] o.a.s.c.a.c.OverseerCollectionMessageHandler Collection: pull_replica_error_handling_test_cant_connect_to_pull_replica operation: create failed:org.apache.solr.common.SolrException: Cannot create collection pull_replica_error_handling_test_cant_connect_to_pull_replica. Value of maxShardsPerNode is 1, and the number of nodes currently live or live and part of your createNodeSet is 3. This allows a maximum of 3 to be created. Value of numShards is 2, value of nrtReplicas is 1, value of tlogReplicas is 0 and value of pullReplicas is 1. This requires 4 shards to be created (higher than the allowed number) [junit4] 2> at org.apache.solr.cloud.api.collections.CreateCollectionCmd.call(CreateCollectionCmd.java:198) [junit4] 2> at org.apache.solr.cloud.api.collections.OverseerCollectionMessageHandler.processMessage(OverseerCollectionMessageHandler.java:264) [junit4] 2> at org.apache.solr.cloud.OverseerTaskProcessor$Runner.run(OverseerTaskProcessor.java:517) [junit4] 2> at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor.lambda$execute$0(ExecutorUtil.java:218) [junit4] 2> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [junit4] 2> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [junit4] 2> at java.base/java.lang.Thread.run(Thread.java:832) {noformat} I didn't validate this myself, but from the code: {code:java} for (int i = 0; i < 5; i++) { cluster.expireZkSession(jetty); waitForState("Expecting node to reconnect", collectionName, activeReplicaCount(1, 0, 1)); // We have two active ReplicationHandler with two close hooks each, one for triggering recovery and one for doing interval polling assertEquals(5, core.getCloseHooks().size()); } {code} I believe the problem may be that we are checking for the state immediately after expiration, and maybe the collection state wasn't updated to reflect the dead node. Maybe we could first {{waitForState}} for the node to be down and then up? or have a watch in the live_nodes znode? > ReplicationHandler memory leak through SolrCore.closeHooks > ---------------------------------------------------------- > > Key: SOLR-14940 > URL: https://issues.apache.org/jira/browse/SOLR-14940 > Project: Solr > Issue Type: Bug > Security Level: Public(Default Security Level. Issues are Public) > Components: replication (java) > Environment: Solr Cloud Cluster on v.8.6.2 configured as 3 TLOG nodes > with 2 cores in each JVM. > > Reporter: Anver Sotnikov > Assignee: Mike Drob > Priority: Major > Fix For: master (9.0), 8.8 > > Attachments: Actual references to hooks that in turn hold references > to ReplicationHandlers.png, Memory Analyzer SolrCore.closeHooks .png > > Time Spent: 2h 10m > Remaining Estimate: 0h > > We are experiencing a memory leak in Solr Cloud cluster configured as 3 TLOG > nodes. > Leader does not seem to be affected while Followers are. > > Looking at memory dump we noticed that SolrCore holds lots of references to > ReplicationHandler through anonymous inner classes in SolrCore.closeHooks, > which in turn holds ReplicationHandlers. > ReplicationHandler registers hooks as anonymous inner classes in > SolrCore.closeHooks through ReplicationHandler.inform() -> > ReplicationHandler.registerCloseHook(). > > Whenever ZkController.stopReplicationFromLeader is called - it would shutdown > ReplicationHandler (ReplicationHandler.shutdown()), BUT reference to > ReplicationHandler will stay in SolrCore.closeHooks. Once replication is > started again on same SolrCore - new ReplicationHandler will be created and > registered in closeHooks. > > It looks like there are few scenarios when replication is stopped and > restarted on same core and in our TLOG setup it shows up quite often. > > Potential solutions: > # Allow unregistering SolrCore.closeHooks so it can be used from > ReplicationHandler.shutdown > # Hack but easier - break the link between ReplicationHandler close hooks > and full ReplicationHandler object so ReplicationHandler can be GCed even > when hooks are still registered in SolrCore.closeHooks -- This message was sent by Atlassian Jira (v8.3.4#803005) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org