Thanks Kirk, that’s a really helpful description. As far as expectations go, I agree with John. If a user:
- creates a cache
- closes the cache
- checks the isClosed flag
- and is unable to create a new cache in some cases

That sounds like a bug. Preventing the early exit sounds like a promising approach to resolve this.

Anthony

> On Apr 14, 2020, at 5:06 PM, Kirk Lund <kl...@apache.org> wrote:
>
> When I previously submitted a PR to change concurrent calls to Cache.close() to not return until everything is completely stopped (including the DistributedSystem), the change was down-voted by Anil. His reasoning was that the current behavior is expected by users and is de facto correct. Changing it would change the behavior and result in long-time users being surprised by the new behavior.
>
> Below are some stack traces from my notes about what causes the thread calling Cache.close() to return prematurely, before the DistributedSystem is disconnected. Thread #3 (the junit main thread) invokes Cache.close() as seen in stack #3 below. DiskStoreImpl reacts to the situation by once again invoking Cache.close() once or twice (stack #1 and stack #2), and one of those invocations wins the race against the thread in stack #3. The thread in stack #3 then returns prematurely, before the DistributedSystem is disconnected. If thread #3 then attempts to do anything like create a new cache (which quite a few tests do), it can fail and throw DistributedSystemDisconnectedException from cache create.
>
> There are two early-outs in GemFireCacheImpl.close() which allow a calling thread to return before any actual work has been completed, after closing nothing but the SecurityService:
>
> if (isClosed()) {
>   return;
> }
>
> And isClosed() returns true when the isClosing flag is true (which is set to true as soon as closing starts):
>
> public boolean isClosed() {
>   return isClosing;
> }
>
> Failed creation of a persistent region is one way DiskStoreImpl can cause multiple threads trying to close the cache to trip all over each other.
>
> DiskStoreImpl is problematic at best in the way it's implemented, and it's not currently unit tested (or unit testable without lots of refactoring), and I don't plan to revisit this change. I would, however, be happy to review proposals and PRs related to this. My change was focused on Cache.close(), adding a CountDownLatch to close(). Perhaps the next attempt to "fix" this should focus on DiskStoreImpl; one could easily argue that closing the cache is NOT a valid responsibility for DiskStoreImpl. But changing the behavior of the persistent "layer" of Geode might require more research (and a LOT more refactoring) than I had time for, since my reason for working on this was to fix some flaky dunit tests caused by this race condition.
>
> This bug appears to be caused when creation of a persistent region fails and DiskStoreImpl.lambda$handleDiskAccessException forks a new Thread to close the Cache, which succeeds in closing the Cache before the main test thread closes it. The main test thread then returns early because the DiskStore thread is already closing the Cache. The main test thread then tries to create a Cache, which collides with the DiskStore thread that is still closing the Cache and DistributedSystem.
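>
> To illustrate the kind of change I mean (a latch so that concurrent callers of close() block until the close has fully completed), here is a rough sketch. It is not the actual PR; the field names and the doClose() method are hypothetical:
>
> // uses java.util.concurrent.CountDownLatch and AtomicBoolean
> private final CountDownLatch closeCompleted = new CountDownLatch(1);
> private final AtomicBoolean closeStarted = new AtomicBoolean();
>
> public void close() {
>   if (!closeStarted.compareAndSet(false, true)) {
>     // another thread is already closing: wait for it to finish instead of
>     // returning before the DistributedSystem has disconnected
>     try {
>       closeCompleted.await();
>     } catch (InterruptedException e) {
>       Thread.currentThread().interrupt();
>     }
>     return;
>   }
>   try {
>     doClose(); // close regions, disk stores, then disconnect the DistributedSystem
>   } finally {
>     closeCompleted.countDown();
>   }
> }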
>
> java.lang.Throwable: KIRK GemFireCacheImpl closed 632230948
>   at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2365)
>   at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1917)
>   at org.apache.geode.internal.cache.DiskStoreImpl.lambda$handleDiskAccessException$2(DiskStoreImpl.java:3380)
>   at java.lang.Thread.run(Thread.java:748)
>
> java.lang.Throwable: KIRK InternalDistributedSystem closed 306674056
>   at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1637)
>   at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1225)
>   at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2351)
>   at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1917)
>   at org.apache.geode.internal.cache.DiskStoreImpl.lambda$handleDiskAccessException$2(DiskStoreImpl.java:3380)
>   at java.lang.Thread.run(Thread.java:748)
>
> java.lang.Throwable: KIRK DiskStoreImpl closing cache 1555793073
>   at org.apache.geode.internal.cache.DiskStoreImpl.handleDiskAccessException(DiskStoreImpl.java:3376)
>   at org.apache.geode.internal.cache.PartitionedRegion.createAndValidatePersistentConfig(PartitionedRegion.java:956)
>   at org.apache.geode.internal.cache.PartitionedRegion.initPRInternals(PartitionedRegion.java:999)
>   at org.apache.geode.internal.cache.PartitionedRegion.initialize(PartitionedRegion.java:1179)
>   at org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3043)
>   at org.apache.geode.internal.cache.GemFireCacheImpl.basicCreateRegion(GemFireCacheImpl.java:2931)
>   at org.apache.geode.internal.cache.GemFireCacheImpl.createRegion(GemFireCacheImpl.java:2918)
>   at org.apache.geode.cache.RegionFactory.create(RegionFactory.java:755)
>   at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.createCacheAndColocatedPRs(CacheFactoryRecreateRegressionTest.java:109)
>   at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.lambda$recreateDoesNotThrowDistributedSystemDisconnectedException$0(CacheFactoryRecreateRegressionTest.java:62)
>   at org.assertj.core.api.ThrowableAssert.catchThrowable(ThrowableAssert.java:62)
>   at org.assertj.core.api.AssertionsForClassTypes.catchThrowable(AssertionsForClassTypes.java:786)
>   at org.assertj.core.api.Assertions.catchThrowable(Assertions.java:1200)
>   at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.recreateDoesNotThrowDistributedSystemDisconnectedException(CacheFactoryRecreateRegressionTest.java:62)
>
> On Tue, Apr 14, 2020 at 4:02 PM John Blum <jb...@pivotal.io> wrote:
>
>> Among other problems I encountered, one problem that seemed to affect *Integration Tests* as I described was that the *Singleton* cache reference was not cleaned up in a timely manner. Therefore, starting a fresh cache instance (without coordination) in another *Integration Test* would occasionally throw a CacheExistsException (IIRC).
>>
>> It would be roughly equivalent to...
>>
>> Cache cache = new CacheFactory().create();
>> // create more cache objects (Regions, Indexes, etc)
>> cache.close();
>> cache = new CacheFactory().create(); // EXCEPTION!!!
>>
>> There is a lot of stuff (even asynchronous things) going on inside cache close that can take time. Even deallocation of system resources does not always happen in a timely manner.
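>>
>> In practice, the workaround in a test ends up looking something like the sketch below: hold onto the DistributedSystem, close the cache, and poll until the system has actually disconnected before creating a new cache. This is only an illustration (in a test method that declares throws Exception), and it assumes isConnected() only reports false once the disconnect has fully completed; the timeout is arbitrary.
>>
>> DistributedSystem system = cache.getDistributedSystem();
>> cache.close();
>> long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(60);
>> while (system.isConnected()) {
>>   if (System.nanoTime() > deadline) {
>>     throw new IllegalStateException("DistributedSystem still connected after close()");
>>   }
>>   Thread.sleep(100); // test code, so a simple poll-and-sleep loop is tolerable
>> }
>> cache = new CacheFactory().create(); // should no longer collide with the old instance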
>>
>> Kirk will undoubtedly remember other things he encountered as well.
>>
>> -j
>>
>> On Tue, Apr 14, 2020 at 3:45 PM Mark Hanson <mhan...@pivotal.io> wrote:
>>
>>> I believe it is because of disk persistence, among other things. Kirk would know for sure. He fixed the issue and his PR got shut down. I totally support just fixing the bug.
>>>
>>> Let's give Kirk a chance to chime in.
>>>
>>> Thanks,
>>> Mark
>>>
>>>> On Apr 14, 2020, at 3:39 PM, Dan Smith <dsm...@pivotal.io> wrote:
>>>>
>>>> IMO if it's not currently synchronous, that's just a bug that needs to be fixed. If folks can provide details on what actually was asynchronous, or the particular test failures they saw, that would be helpful.
>>>>
>>>> Previously, when this came up, it looks like Kirk found that close would not wait for a different call to close() issued by a different thread [1]. Is that still the bug we are running into? On that thread, it seems like we also had a consensus that we should just fix bugs with Cache.close.
>>>>
>>>> -Dan
>>>>
>>>> 1. https://lists.apache.org/x/thread.html/f385a6dd51209e2706c68c9821412a6f22ebef3e809636060ac0bf55@%3Cdev.geode.apache.org%3E
>>>>
>>>> On Tue, Apr 14, 2020 at 3:23 PM John Blum <jb...@pivotal.io> wrote:
>>>>
>>>>> My first thought is that cache close (i.e. RegionService.close()) should be synchronous (by default), perhaps with non-blocking options or options to wait for a set timeout, as Jake suggested.
>>>>>
>>>>> This is a problem for *Integration Tests* (that start a peer cache instance, in-process or standalone) in general, and not simply just "distributed" tests! This is the reason I built support for this in *Spring Test for Apache Geode* (STDG), since subsequent tests requiring a peer cache instance (or CacheServer) may conflict with each other, especially given 1) the cache instance is a *Singleton* and 2) it is ideal not to have to restart the JVM between tests, even for *Integration Tests*, which, however, turns out to be a really common practice. *#ugh* However, without proper coordination this can be a real problem, hence STDG's support. Even when forking JVMs, you still have to be careful to wait in certain cases, as you could run into other conflicts (e.g. BindExceptions if not varying port numbers and such). For all these reasons and more, it is important that the cache has fully shut down and released all its resources.
>>>>>
>>>>> Also, while we are on this topic, I think it would be useful to have a dedicated interface for the cache instance lifecycle. It's unfortunate the CacheListener interface is already taken for Region events. *#sigh*
>>>>>
>>>>> The interface might be something like...
>>>>>
>>>>> interface CacheLifecycleListener {
>>>>>
>>>>>   default void isStarting(CacheEvent event) {}
>>>>>
>>>>>   default void onStart(CacheEvent event) {}
>>>>>
>>>>>   default void isClosing(CacheEvent event) {}
>>>>>
>>>>>   default void onClose(CacheEvent event) {}
>>>>>
>>>>>   ...
>>>>>
>>>>> }
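>>>>>
>>>>> A test could then block until close has really finished, along these lines. This is only a sketch: the addCacheLifecycleListener(..) registration method is hypothetical (no such API exists today), and it leans on java.util.concurrent's CountDownLatch and TimeUnit, in a test method that can throw InterruptedException.
>>>>>
>>>>> CountDownLatch closed = new CountDownLatch(1);
>>>>>
>>>>> cache.addCacheLifecycleListener(new CacheLifecycleListener() {
>>>>>   @Override
>>>>>   public void onClose(CacheEvent event) {
>>>>>     closed.countDown(); // fires only once the cache has fully closed
>>>>>   }
>>>>> });
>>>>>
>>>>> cache.close();
>>>>> closed.await(30, TimeUnit.SECONDS); // safe to create a new cache after this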
>>>>>
>>>>> -j
>>>>>
>>>>> On Tue, Apr 14, 2020 at 3:21 PM Jason Huynh <jhu...@pivotal.io> wrote:
>>>>>
>>>>>> The isClosed flag and method might currently be used more as an isClosing. The GemFireCacheImpl.isClosed() method is actually returning isClosing. Whatever change is made to isClosed will have to properly handle the cases where it's meant to be treated as isClosing().
>>>>>>
>>>>>> On Tue, Apr 14, 2020 at 3:09 PM Mark Hanson <hans...@vmware.com> wrote:
>>>>>>
>>>>>>> Hi Jake,
>>>>>>>
>>>>>>> For Option 6: We could fix isClosed as well. That is a great suggestion. Currently, it returns almost immediately. I like your options though....
>>>>>>>
>>>>>>> Any other thoughts?
>>>>>>>
>>>>>>> Any preferences? I think any of the current options seems better than the current situation, as long as we fix isClosed.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Mark
>>>>>>> ________________________________
>>>>>>> From: Jacob Barrett <jbarr...@pivotal.io>
>>>>>>> Sent: Tuesday, April 14, 2020 2:30 PM
>>>>>>> To: dev@geode.apache.org <dev@geode.apache.org>
>>>>>>> Subject: Re: [Discuss] Cache.close synchronous is not synchronous, but code still expects it to be....
>>>>>>>
>>>>>>> Option 4: Cache.closeAndWait(long timeout, TimeUnit unit) - Closes and waits until it is really closed.
>>>>>>> Option 5: Cache.close(Runnable closedCallback) - Runs the callback after the cache is really closed.
>>>>>>> Option 6: cache.close(); while (!cache.isClosed());
>>>>>>>
>>>>>>>> On Apr 14, 2020, at 2:11 PM, Mark Hanson <mhan...@pivotal.io> wrote:
>>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> I know that we have discussed this once before, but I think it bears repeating. We have test code that assumes cache.close is synchronous. It is not. Not even close. I would like to discuss some possible changes.
>>>>>>>>
>>>>>>>> Option 1. Call it what it is. Deprecate Cache.close and create a new method called asyncClose to replace it. Simple and descriptive.
>>>>>>>> Option 2. Fix cache close so it is synchronous. Some might say that we are going to break behavior, but I doubt any user relies on the fact that it is asynchronous. That would be dangerous in and of itself. Obviously, we don't want to change behavior, but a number of distributed tests have failed because of this. If devs working on the internal code don't get it right, where does that leave users?
>>>>>>>> Option 3. Status quo.
>>>>>>>>
>>>>>>>> What do you think? Are there other options I am missing?
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Mark
>>>>>>>>
>>>>>
>>>>> --
>>>>> -John
>>>>> Spring Data Team
>>>>>
>>
>> --
>> -John
>> Spring Data Team
>>