When I previously submitted a PR to change concurrent calls to Cache.close() so that they do not return until everything is completely stopped (including the DistributedSystem), the change was down-voted by Anil. His reasoning was that the current behavior is what users expect and is de facto correct, and that changing it would surprise long-time users.
Below are some stack traces from my notes about what causes the thread calling Cache.close() to return prematurely, before the DistributedSystem is disconnected. Thread #3 (the junit main thread) invokes Cache.close() as seen in stack #3 below. DiskStoreImpl reacts to the situation by once again invoking Cache.close() once or twice (stack #1 and stack #2), and one of those invocations wins the race against the thread in stack #3. The thread in stack #3 then returns prematurely, before the DistributedSystem is disconnected. If thread #3 then attempts to do anything like create a new cache (which quite a few tests do), cache creation can fail with DistributedSystemDisconnectedException.

There are two early-outs in GemFireCacheImpl.close() which allow a calling thread to return before any actual work has been completed, having closed nothing but the SecurityService:

  if (isClosed()) {
    return;
  }

And isClosed() returns true as soon as the isClosing flag is true, which is set as soon as closing starts:

  public boolean isClosed() {
    return isClosing;
  }

Failed creation of a persistent region is one way DiskStoreImpl can cause multiple threads trying to close the cache to trip all over each other. DiskStoreImpl is problematic at best in the way it's implemented, it's not currently unit tested (or unit testable without lots of refactoring), and I don't plan to revisit this change. I would, however, be happy to review proposals and PRs related to this.

My change was focused on Cache.close() and adding a CountDownLatch to close() -- perhaps the next attempt to "fix" this should focus on DiskStoreImpl -- one could easily argue that closing the cache is NOT a valid responsibility for DiskStoreImpl. But changing the behavior of the persistent "layer" of Geode might require more research (and a LOT more refactoring) than I had time for, since my reason for working on this was to fix some flaky dunit tests caused by this race condition.

This bug appears to be caused when creation of a persistent region fails and DiskStoreImpl.lambda$handleDiskAccessException forks a new Thread to close the Cache, which succeeds in closing the Cache before the main test thread closes it. The main test thread then early-outs because the DiskStore thread is already closing the Cache. The main test thread then tries to create a Cache, which collides with the DiskStore thread that is still closing the Cache and DistributedSystem.
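For illustration only, here is a minimal, self-contained sketch of the kind of latch-based coordination I had in mind. The class and member names are mine (this is not the actual GemFireCacheImpl code or my original patch): the thread that loses the close() race waits on a CountDownLatch until the winner has finished tearing everything down, instead of returning while things are still being closed.

  import java.util.concurrent.CountDownLatch;
  import java.util.concurrent.atomic.AtomicBoolean;

  // Sketch only: demonstrates close() blocking until teardown is finished,
  // even for threads that lose the race to start closing.
  public class LatchedCloseSketch {

    private final AtomicBoolean closing = new AtomicBoolean();
    private final CountDownLatch closed = new CountDownLatch(1);

    public void close() throws InterruptedException {
      if (!closing.compareAndSet(false, true)) {
        // Another thread is already closing; wait for it to finish instead of
        // returning while resources are still being released.
        closed.await();
        return;
      }
      try {
        releaseResources(); // stand-in for closing regions, disk stores, the DistributedSystem, ...
      } finally {
        closed.countDown(); // wake up every thread that lost the race
      }
    }

    public boolean isClosed() {
      return closed.getCount() == 0; // true only once teardown has finished, not merely started
    }

    private void releaseResources() throws InterruptedException {
      Thread.sleep(100); // simulate slow, asynchronous teardown
    }

    public static void main(String[] args) throws Exception {
      LatchedCloseSketch cache = new LatchedCloseSketch();
      Thread diskStoreThread = new Thread(() -> {
        try {
          cache.close(); // e.g. a thread forked by a disk-access-exception handler
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      });
      diskStoreThread.start();
      cache.close();                        // the "test thread" blocks until teardown is done
      System.out.println(cache.isClosed()); // always true here, regardless of who won the race
      diskStoreThread.join();
    }
  }

In the real cache the teardown step would be the existing close logic, and isClosed() would only report true once that work has actually completed. The stack traces from my notes follow.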
java.lang.Throwable: KIRK GemFireCacheImpl closed 632230948
    at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2365)
    at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1917)
    at org.apache.geode.internal.cache.DiskStoreImpl.lambda$handleDiskAccessException$2(DiskStoreImpl.java:3380)
    at java.lang.Thread.run(Thread.java:748)

java.lang.Throwable: KIRK InternalDistributedSystem closed 306674056
    at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1637)
    at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1225)
    at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2351)
    at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1917)
    at org.apache.geode.internal.cache.DiskStoreImpl.lambda$handleDiskAccessException$2(DiskStoreImpl.java:3380)
    at java.lang.Thread.run(Thread.java:748)

java.lang.Throwable: KIRK DiskStoreImpl closing cache 1555793073
    at org.apache.geode.internal.cache.DiskStoreImpl.handleDiskAccessException(DiskStoreImpl.java:3376)
    at org.apache.geode.internal.cache.PartitionedRegion.createAndValidatePersistentConfig(PartitionedRegion.java:956)
    at org.apache.geode.internal.cache.PartitionedRegion.initPRInternals(PartitionedRegion.java:999)
    at org.apache.geode.internal.cache.PartitionedRegion.initialize(PartitionedRegion.java:1179)
    at org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3043)
    at org.apache.geode.internal.cache.GemFireCacheImpl.basicCreateRegion(GemFireCacheImpl.java:2931)
    at org.apache.geode.internal.cache.GemFireCacheImpl.createRegion(GemFireCacheImpl.java:2918)
    at org.apache.geode.cache.RegionFactory.create(RegionFactory.java:755)
    at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.createCacheAndColocatedPRs(CacheFactoryRecreateRegressionTest.java:109)
    at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.lambda$recreateDoesNotThrowDistributedSystemDisconnectedException$0(CacheFactoryRecreateRegressionTest.java:62)
    at org.assertj.core.api.ThrowableAssert.catchThrowable(ThrowableAssert.java:62)
    at org.assertj.core.api.AssertionsForClassTypes.catchThrowable(AssertionsForClassTypes.java:786)
    at org.assertj.core.api.Assertions.catchThrowable(Assertions.java:1200)
    at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.recreateDoesNotThrowDistributedSystemDisconnectedException(CacheFactoryRecreateRegressionTest.java:62)

On Tue, Apr 14, 2020 at 4:02 PM John Blum <jb...@pivotal.io> wrote:

> Among other problems I encountered, one problem that seemed to affect *Integration Tests* as I described was that the *Singleton* cache reference was not cleaned up in a timely manner. Therefore, starting a fresh cache instance (without coordination) in another *Integration Test* would occasionally throw a CacheExistsException (IIRC).
>
> It would be roughly equivalent to ...
>
> Cache cache = new CacheFactory().create();
> // create more cache objects (Regions, Indexes, etc)
> cache.close();
> cache = new CacheFactory().create(); // EXCEPTION!!!
>
> There is a lot of stuff (even asynchronous things) going on inside cache close that can take time. Even deallocation of system resources does not always happen in a timely manner.
>
> Kirk will undoubtedly remember other things he encountered as well.
>
> -j
>
>
> On Tue, Apr 14, 2020 at 3:45 PM Mark Hanson <mhan...@pivotal.io> wrote:
>
> > I believe it is because of disk persistence among other things. Kirk would know for sure. He fixed the issue and his PR got shut down.
> > I totally support just fixing the bug.
> >
> > Let’s give Kirk a chance to chime in.
> >
> > Thanks,
> > Mark
> >
> > > On Apr 14, 2020, at 3:39 PM, Dan Smith <dsm...@pivotal.io> wrote:
> > >
> > > IMO if it's not currently synchronous, that's just a bug that needs to be fixed. If folks can provide details on what actually was asynchronous or the particular test failures they saw, that would be helpful.
> > >
> > > Previously, when this came up it looks like Kirk found that close would not wait for a different call to close() issued by a different thread [1]. Is that still the bug we are running into? On that thread, it seems like we also had a consensus we should just fix bugs with Cache.close:
> > >
> > > -Dan
> > >
> > > 1.
> > > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.apache.org%2Fx%2Fthread.html%2Ff385a6dd51209e2706c68c9821412a6f22ebef3e809636060ac0bf55%40%253Cdev.geode.apache.org%253E&data=02%7C01%7Chansonm%40vmware.com%7C7a43463ab53c416234d908d7e0c4cc6b%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637225008165230328&sdata=GD77kAubDDWfP93zjYsNP61VMN4%2FKBAHcq95GwjeMBc%3D&reserved=0
> > >
> > > On Tue, Apr 14, 2020 at 3:23 PM John Blum <jb...@pivotal.io> wrote:
> > >
> > >> My first thought is cache close (i.e. RegionService.close()) should be synchronous (by default), perhaps with non-blocking options or options to wait for a set timeout, as Jake suggested.
> > >>
> > >> This is a problem for *Integration Tests* (that start a peer cache instance, in-process or standalone) in general and not simply just "distributed" tests! This is the reason I built support for this in *Spring Test for Apache Geode* (STDG), since subsequent tests requiring a peer cache instance (or CacheServer) may conflict with each other, especially given 1) the cache instance is a *Singleton* and 2) it is ideal to not have to restart the JVM between tests, even for *Integration Tests*; however, that turns out to be a really common practice. *#ugh* However, without proper coordination this can be a real problem, hence STDG's support. Even when forking JVMs, you still have to be careful to wait in certain cases, as you could run into other conflicts (e.g. BindExceptions if not varying port numbers and such). For all these reasons and more, it is important that the cache has fully shut down and released all its resources.
> > >>
> > >> Also, while we are on this topic, I think it would be useful to have a dedicated interface for the cache instance lifecycle. It's unfortunate the CacheListener interface is already taken for Region events. *#sigh*
> > >>
> > >> The interface might be something like...
> > >>
> > >> interface CacheLifecycleListener {
> > >>
> > >>   default void isStarting(CacheEvent event) {}
> > >>
> > >>   default void onStart(CacheEvent event) {}
> > >>
> > >>   default void isClosing(CacheEvent event) {}
> > >>
> > >>   default void onClose(CacheEvent event) {}
> > >>
> > >>   ...
> > >>
> > >> }
> > >>
> > >> -j
> > >>
> > >> On Tue, Apr 14, 2020 at 3:21 PM Jason Huynh <jhu...@pivotal.io> wrote:
> > >>
> > >>> The isClosed flag and method might be used currently more as an isClosing. The GemFireCacheImpl.isClosed() method is actually returning isClosing. Whatever change to isClosed that will be made will have to properly handle cases where it's meant to be treated as isClosing().
> > >>>
> > >>> On Tue, Apr 14, 2020 at 3:09 PM Mark Hanson <hans...@vmware.com> wrote:
> > >>>
> > >>>> Hi Jake,
> > >>>>
> > >>>> For Option 6: We could fix isClosed as well. That is a great suggestion. Currently, it returns almost immediately.
> > >>>> I like your options though....
> > >>>>
> > >>>> Any other thoughts?
> > >>>>
> > >>>> Any preferences? I think any of the current options seem better than the current situation as long as we fix isClosed.
> > >>>>
> > >>>> Thanks,
> > >>>> Mark
> > >>>> ________________________________
> > >>>> From: Jacob Barrett <jbarr...@pivotal.io>
> > >>>> Sent: Tuesday, April 14, 2020 2:30 PM
> > >>>> To: dev@geode.apache.org <dev@geode.apache.org>
> > >>>> Subject: Re: [Discuss] Cache.close synchronous is not synchronous, but code still expects it to be....
> > >>>>
> > >>>> Option 4: Cache.closeAndWait(long timeout, TimeUnit unit) - Closes and waits until it is really closed.
> > >>>> Option 5: Cache.close(Runnable closedCallback) - Runs callback after cache is really closed.
> > >>>> Option 6: cache.close(); while (!cache.isClosed());
> > >>>>
> > >>>>
> > >>>>> On Apr 14, 2020, at 2:11 PM, Mark Hanson <mhan...@pivotal.io> wrote:
> > >>>>>
> > >>>>> Hi All,
> > >>>>>
> > >>>>> I know that we have discussed this once before, but I think it bears repeating. We have test code that assumes cache.close is synchronous. It is not. Not even close. I would like to discuss some possible changes.
> > >>>>>
> > >>>>> Option 1. Call it what it is. Deprecate Cache.close and create a new method called asyncClose to replace it. Simple and descriptive.
> > >>>>> Option 2. Fix cache close so it is synchronous. Some might say that we are going to break behavior, but I doubt any user relies on the fact that it is asynchronous. That would be dangerous in and of itself. Obviously, we don’t want to change behavior, but there have been a number of distributed tests that have failed for this. If internal to the code devs don’t get it right, where does that leave users?
> > >>>>> Option 3. Status quo.
> > >>>>>
> > >>>>> What do you think? Are there other options I am missing?
> > >>>>>
> > >>>>> Thanks,
> > >>>>> Mark
> > >>>>>
> > >>>>
> > >>>>
> > >>>
> > >>
> > >>
> > >> --
> > >> -John
> > >> Spring Data Team
> > >>
> >
>
>
> --
> -John
> Spring Data Team
>