[ 
https://issues.apache.org/jira/browse/GEODE-8467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188045#comment-17188045
 ] 

ASF GitHub Bot commented on GEODE-8467:
---------------------------------------

Bill commented on a change in pull request #5490:
URL: https://github.com/apache/geode/pull/5490#discussion_r480467213



##########
File path: 
geode-core/src/main/java/org/apache/geode/internal/cache/GemFireCacheImpl.java
##########
@@ -1177,34 +1178,53 @@ public MeterRegistry getMeterRegistry() {
 
   @Override
   public void saveCacheXmlForReconnect() {
-    // there are two versions of this method so it can be unit-tested
-    boolean sharedConfigEnabled = 
getDistributionManager().getConfig().getUseSharedConfiguration();
+    prepareForReconnect((pw) -> CacheXmlGenerator.generate((Cache) this, pw, 
false));
+  }
 
-    if (!Boolean.getBoolean(GEMFIRE_PREFIX + "autoReconnect-useCacheXMLFile")
-        && !sharedConfigEnabled) {
-      try {
-        logger.info("generating XML to rebuild the cache after reconnect 
completes");
-        StringPrintWriter pw = new StringPrintWriter();
-        CacheXmlGenerator.generate((Cache) this, pw, false);
-        String cacheXML = pw.toString();
-        getCacheConfig().setCacheXMLDescription(cacheXML);
-        logger.info("XML generation completed: {}", cacheXML);
-      } catch (CancelException e) {
-        logger.info("Unable to generate XML description for reconnect of cache 
due to exception",
-            e);
-      }
-    } else if (sharedConfigEnabled && !getCacheServers().isEmpty()) {
-      // we need to retain a cache-server description if this JVM was started 
by gfsh
-      List<CacheServerCreation> list = new 
ArrayList<>(getCacheServers().size());
-      for (Object o : getCacheServers()) {
-        CacheServerImpl cs = (CacheServerImpl) o;
-        if (cs.isDefaultServer()) {
-          CacheServerCreation bsc = new CacheServerCreation(this, cs);
-          list.add(bsc);
+
+  /**
+   * Testable version of saveCacheXmlForReconnect() that allows us to inject 
an XML generator
+   *
+   * @param xmlGenerator a consumer of a PrintWriter that generates a 
description of the Cache
+   */
+  protected void prepareForReconnect(Consumer<PrintWriter> xmlGenerator) {
+    boolean sharedConfigEnabled =
+        getInternalDistributedSystem().getConfig().getUseSharedConfiguration();
+
+    try {
+      if (!Boolean.getBoolean(GEMFIRE_PREFIX + "autoReconnect-useCacheXMLFile")
+          && !sharedConfigEnabled) {
+        try {
+          logger.info("generating XML to rebuild the cache after reconnect 
completes");
+          StringPrintWriter pw = new StringPrintWriter();
+          xmlGenerator.accept(pw);
+          String cacheXML = pw.toString();
+          getCacheConfig().setCacheXMLDescription(cacheXML);
+          logger.info("XML generation completed: {}", cacheXML);
+        } catch (CancelException e) {
+          logger.info("Unable to generate XML description for reconnect of 
cache due to exception",
+              e);
         }
+      } else if (sharedConfigEnabled && !getCacheServers().isEmpty()) {
+        // we need to retain a cache-server description if this JVM was 
started by gfsh
+        logger.info("saving cache server configuration for use with the 
cluster-configuration "
+            + "service on reconnect");
+        List<CacheServerCreation> list = new 
ArrayList<>(getCacheServers().size());
+        for (Object o : getCacheServers()) {
+          CacheServerImpl cs = (CacheServerImpl) o;
+          if (cs.isDefaultServer()) {
+            CacheServerCreation bsc = new CacheServerCreation(this, cs);
+            list.add(bsc);
+          }
+        }
+        getCacheConfig().setCacheServerCreation(list);
+        logger.info("cache server configuration saved");
       }
-      getCacheConfig().setCacheServerCreation(list);
-      logger.info("CacheServer configuration saved");
+    } catch (Throwable throwable) {
+      logger.info("Saving of cache configuration for auto-reconnect has 
failed.  "
+          + "Auto-reconnect will be disabled since the cache cannot be 
rebuilt.", throwable);
+      getInternalDistributedSystem().getConfig().setDisableAutoReconnect(true);
+

Review comment:
       I guess it's ok to not rethrow the `Throwable` here. `saveConfg()` is 
called from two places in `GMSMembership`:
   
   1. `ManagerImpl.forceDisconnect()`
   2. `requestMemberRemoval()`
   
   In both those cases it looks like `setDisableAutoReconnect(true)` will set 
us on a permanent shutdown path. Since we've logged the `Throwable` here we've 
fulfilled our obligation to inform the operator.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


> server fails to notify of a ForcedDisconnect and fails to tear down the cache
> -----------------------------------------------------------------------------
>
>                 Key: GEODE-8467
>                 URL: https://issues.apache.org/jira/browse/GEODE-8467
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>    Affects Versions: 1.10.0, 1.11.0, 1.12.0, 1.13.0, 1.14.0
>            Reporter: Bruce J Schuchardt
>            Assignee: Bruce J Schuchardt
>            Priority: Major
>              Labels: pull-request-available
>
> A test having auto-reconnect enabled failed while restarting a server and 
> hung.  The restarting server was building its cache when it was kicked out of 
> the cluster due to very high load on the test machine.  Membership initiated 
> a forced-disconnect
> {noformat}
> [fatal 2020/08/22 00:51:04.508 PDT <unicast 
> receiver,rs-GEM-3035-PG2231-2a2i3large-hydra-client-25-42721> tid=0x23] 
> Membership service failure: Member isn't responding to heartbeat requests
> org.apache.geode.distributed.internal.membership.api.MemberDisconnectedException:
>  Member isn't responding to heartbeat requests
>         at 
> org.apache.geode.distributed.internal.membership.gms.GMSMembership$ManagerImpl.forceDisconnect(GMSMembership.java:2012)
>         at 
> org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.forceDisconnect(GMSJoinLeave.java:1085)
>         at 
> org.apache.geode.distributed.internal.membership.gms.membership.GMSJoinLeave.processMessage(GMSJoinLeave.java:688)
>         at 
> org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1331)
>         at 
> org.apache.geode.distributed.internal.membership.gms.messenger.JGroupsMessenger$JGroupsReceiver.receive(JGroupsMessenger.java:1267)
>  {noformat}
>  
> and then logged that it was generating a description of the cache
> {noformat}
> [info 2020/08/22 00:51:05.933 PDT <unicast 
> receiver,rs-GEM-3035-PG2231-2a2i3large-hydra-client-25-42721> tid=0x23] 
> generating XML to rebuild the cache after reconnect completes {noformat}
>  
> but it never logged completion of this step and never forked a thread to tear 
> down the cache.  Any exception thrown by XML generation would have been 
> caught by JGroups code, which logs the problem at a WARNING level.  We have 
> JGroups logging set to FATAL level so you wouldn't see the issue.
> We need to add exception handling around XML generation and, if detected, 
> disable reconnect attempts and have the server shut down.
> The bug isn't easy to hit.  I've run the test that failed over 5000 times 
> without encountering it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to