I revived my branch by rebasing it on develop and filed a new draft PR: https://github.com/apache/geode/pull/4963

Unfortunately, IntegrationTest exceeds its timeout every time I trigger it. The cause does not appear to be a specific test or a hang. I think IntegrationTest has already been running very close to the timeout and is exceeding it fairly often even without my changes in #4963. Should we increase the timeout for IntegrationTest? (Does anyone know how to increase it?)

On Tue, Apr 14, 2020 at 5:06 PM Kirk Lund <kl...@apache.org> wrote:

> When I previously submitted a PR to change concurrent calls to Cache.close()
> to not return until everything is completely stopped (including the
> DistributedSystem), the change was down-voted by Anil. His reasoning was that
> the current behavior is expected by users and is de facto correct. Changing it
> would alter the behavior and result in long-time users being surprised.
>
> Below are some stack traces from my notes about what causes the thread calling
> Cache.close() to return prematurely, before the DistributedSystem is
> disconnected. Thread #3 (the JUnit main thread) invokes Cache.close() as seen
> in stack #3 below. DiskStoreImpl reacts to the situation by once again
> invoking Cache.close() once or twice (stack #1 and stack #2), and one of those
> invocations wins the race against the thread in stack #3. The thread in stack
> #3 then returns prematurely, before the DistributedSystem is disconnected. If
> thread #3 then attempts to do anything like create a new cache (which quite a
> few tests do), it can fail and throw DistributedSystemDisconnectedException
> from cache create.
>
> There are two early-outs in GemFireCacheImpl.close() which allow a calling
> thread to return after closing nothing but the SecurityService, before any of
> the actual work has been completed:
>
> if (isClosed()) {
>   return;
> }
>
> And isClosed() returns true when the isClosing flag is true (which is set as
> soon as closing starts):
>
> public boolean isClosed() {
>   return isClosing;
> }
>
> Failed creation of a persistent region is one way DiskStoreImpl can cause
> multiple threads trying to close the cache to trip all over each other.
>
> DiskStoreImpl is problematic at best in the way it's implemented, it's not
> currently unit tested (or unit testable without lots of refactoring), and I
> don't plan to revisit this change. I would, however, be happy to review
> proposals and PRs related to this. My change was focused on Cache.close() and
> adding a CountDownLatch to close() -- perhaps the next attempt to "fix" this
> should focus on DiskStoreImpl -- one could easily argue that closing the cache
> is NOT a valid responsibility for DiskStoreImpl. But changing the behavior of
> the persistent "layer" of Geode might require more research (and a LOT more
> refactoring) than I had time for, since my reason for working on this was to
> fix some flaky dunit tests caused by this race condition.
>
> This bug appears when creation of a persistent region fails and
> DiskStoreImpl.lambda$handleDiskAccessException forks a new Thread to close the
> Cache, and that thread succeeds in closing the Cache before the main test
> thread does. The main test thread then early-outs because the DiskStore thread
> is already closing the Cache. The main test thread then tries to create a
> Cache, which collides with the DiskStore thread that is still closing the
> Cache and DistributedSystem.
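>
> To make the shape of that change concrete, here is a stripped-down sketch of
> the idea (illustration only -- this is not the actual diff from my PR, and the
> class and method names are made up): the first caller does the real work and
> every other caller blocks on a latch until that work has completely finished.
>
> import java.util.concurrent.CountDownLatch;
>
> // Sketch of a cache-like class whose close() blocks concurrent callers
> // until the first close has fully completed.
> class BlockingCloseSketch {
>
>   private final CountDownLatch closeCompleted = new CountDownLatch(1);
>   private boolean closing; // guarded by "this"; plays the role of isClosing
>
>   public void close() {
>     boolean firstCloser;
>     synchronized (this) {
>       firstCloser = !closing;
>       closing = true;
>     }
>     if (!firstCloser) {
>       try {
>         // another thread is already closing: wait for it to finish completely
>         // instead of returning before the DistributedSystem has disconnected
>         closeCompleted.await();
>       } catch (InterruptedException e) {
>         Thread.currentThread().interrupt();
>       }
>       return;
>     }
>     try {
>       reallyClose(); // close regions and disk stores, disconnect the DistributedSystem
>     } finally {
>       closeCompleted.countDown(); // release any waiters blocked in close()
>     }
>   }
>
>   private void reallyClose() { /* elided */ }
> }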
>
> Stack #1:
>
> java.lang.Throwable: KIRK GemFireCacheImpl closed 632230948
>   at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2365)
>   at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1917)
>   at org.apache.geode.internal.cache.DiskStoreImpl.lambda$handleDiskAccessException$2(DiskStoreImpl.java:3380)
>   at java.lang.Thread.run(Thread.java:748)
>
> Stack #2:
>
> java.lang.Throwable: KIRK InternalDistributedSystem closed 306674056
>   at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1637)
>   at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1225)
>   at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2351)
>   at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1917)
>   at org.apache.geode.internal.cache.DiskStoreImpl.lambda$handleDiskAccessException$2(DiskStoreImpl.java:3380)
>   at java.lang.Thread.run(Thread.java:748)
>
> Stack #3:
>
> java.lang.Throwable: KIRK DiskStoreImpl closing cache 1555793073
>   at org.apache.geode.internal.cache.DiskStoreImpl.handleDiskAccessException(DiskStoreImpl.java:3376)
>   at org.apache.geode.internal.cache.PartitionedRegion.createAndValidatePersistentConfig(PartitionedRegion.java:956)
>   at org.apache.geode.internal.cache.PartitionedRegion.initPRInternals(PartitionedRegion.java:999)
>   at org.apache.geode.internal.cache.PartitionedRegion.initialize(PartitionedRegion.java:1179)
>   at org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3043)
>   at org.apache.geode.internal.cache.GemFireCacheImpl.basicCreateRegion(GemFireCacheImpl.java:2931)
>   at org.apache.geode.internal.cache.GemFireCacheImpl.createRegion(GemFireCacheImpl.java:2918)
>   at org.apache.geode.cache.RegionFactory.create(RegionFactory.java:755)
>   at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.createCacheAndColocatedPRs(CacheFactoryRecreateRegressionTest.java:109)
>   at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.lambda$recreateDoesNotThrowDistributedSystemDisconnectedException$0(CacheFactoryRecreateRegressionTest.java:62)
>   at org.assertj.core.api.ThrowableAssert.catchThrowable(ThrowableAssert.java:62)
>   at org.assertj.core.api.AssertionsForClassTypes.catchThrowable(AssertionsForClassTypes.java:786)
>   at org.assertj.core.api.Assertions.catchThrowable(Assertions.java:1200)
>   at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.recreateDoesNotThrowDistributedSystemDisconnectedException(CacheFactoryRecreateRegressionTest.java:62)
>
> On Tue, Apr 14, 2020 at 4:02 PM John Blum <jb...@pivotal.io> wrote:
>
>> Among other problems I encountered, one problem that seemed to affect
>> *Integration Tests* as I described was that the *Singleton* cache reference
>> was not cleaned up in a timely manner. Therefore, starting a fresh cache
>> instance (without coordination) in another *Integration Test* would
>> occasionally throw a CacheExistsException (IIRC).
>>
>> It would be roughly equivalent to ...
>>
>> Cache cache = new CacheFactory().create();
>> // create more cache objects (Regions, Indexes, etc)
>> cache.close();
>> cache = new CacheFactory().create(); // EXCEPTION!!!
>>
>> There is a lot of stuff (even asynchronous things) going on inside cache
>> close that can take time. Even deallocation of system resources does not
>> always happen in a timely manner.
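>>
>> Without framework support, the workaround ends up looking roughly like this
>> hand-rolled sketch (the helper name, the 60 second timeout, and the 100 ms
>> poll interval are all arbitrary choices; it leans on the old
>> DistributedSystem eventually reporting disconnected):
>>
>> import java.util.concurrent.TimeUnit;
>>
>> import org.apache.geode.cache.Cache;
>> import org.apache.geode.cache.CacheFactory;
>> import org.apache.geode.distributed.DistributedSystem;
>>
>> class WaitForRealClose {
>>
>>   // close the cache, then poll until its DistributedSystem disconnects
>>   // (or the timeout elapses) before the caller creates a new cache
>>   static void closeAndWait(Cache cache, long timeoutMillis) throws InterruptedException {
>>     DistributedSystem system = cache.getDistributedSystem();
>>     cache.close();
>>     long deadline = System.currentTimeMillis() + timeoutMillis;
>>     while (system.isConnected() && System.currentTimeMillis() < deadline) {
>>       TimeUnit.MILLISECONDS.sleep(100);
>>     }
>>   }
>>
>>   static void example() throws InterruptedException {
>>     Cache cache = new CacheFactory().create();
>>     // create more cache objects (Regions, Indexes, etc)
>>     closeAndWait(cache, TimeUnit.SECONDS.toMillis(60));
>>     cache = new CacheFactory().create(); // the old instance should be gone by now
>>   }
>> }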
>>
>> Kirk will undoubtedly remember other things he encountered as well.
>>
>> -j
>>
>> On Tue, Apr 14, 2020 at 3:45 PM Mark Hanson <mhan...@pivotal.io> wrote:
>>
>> > I believe it is because of disk persistence, among other things. Kirk would
>> > know for sure. He fixed the issue and his PR got shut down.
>> > I totally support just fixing the bug.
>> >
>> > Let’s give Kirk a chance to chime in.
>> >
>> > Thanks,
>> > Mark
>> >
>> > > On Apr 14, 2020, at 3:39 PM, Dan Smith <dsm...@pivotal.io> wrote:
>> > >
>> > > IMO if it's not currently synchronous, that's just a bug that needs to be
>> > > fixed. If folks can provide details on what actually was asynchronous or
>> > > the particular test failures they saw, that would be helpful.
>> > >
>> > > Previously, when this came up, it looks like Kirk found that close would
>> > > not wait for a different call to close() issued by a different thread [1].
>> > > Is that still the bug we are running into? On that thread, it seems like
>> > > we also had a consensus that we should just fix bugs with Cache.close:
>> > >
>> > > -Dan
>> > >
>> > > 1.
>> > > https://lists.apache.org/x/thread.html/f385a6dd51209e2706c68c9821412a6f22ebef3e809636060ac0bf55@%3Cdev.geode.apache.org%3E
>> > >
>> > > On Tue, Apr 14, 2020 at 3:23 PM John Blum <jb...@pivotal.io> wrote:
>> > >
>> > >> My first thought is cache close (i.e. RegionService.close()) should be
>> > >> synchronous (by default), perhaps with non-blocking options or options to
>> > >> wait for a set timeout, as Jake suggested.
>> > >>
>> > >> This is a problem for *Integration Tests* (that start a peer cache
>> > >> instance, in-process or standalone) in general and not simply just
>> > >> "distributed" tests! This is the reason I built support for this in
>> > >> *Spring Test for Apache Geode* (STDG), since subsequent tests requiring a
>> > >> peer cache instance (or CacheServer) may conflict with each other,
>> > >> especially given 1) the cache instance is a *Singleton* and 2) it is
>> > >> ideal to not have to restart the JVM between tests, even for *Integration
>> > >> Tests*; restarting, however, turns out to be a really common practice.
>> > >> *#ugh* Without proper coordination this can be a real problem, hence
>> > >> STDG's support. Even when forking JVMs, you still have to be careful to
>> > >> wait in certain cases, as you could run into other conflicts (e.g.
>> > >> BindExceptions if not varying port numbers and such). For all these
>> > >> reasons and more, it is important that the cache has fully shut down and
>> > >> released all its resources.
>> > >>
>> > >> Also, while we are on this topic, I think it would be useful to have a
>> > >> dedicated interface for the cache instance lifecycle. It's unfortunate
>> > >> the CacheListener interface is already taken for Region events. *#sigh*
>> > >>
>> > >> The interface might be something like...
>> > >>
>> > >> interface CacheLifecycleListener {
>> > >>
>> > >>   default void isStarting(CacheEvent event) {}
>> > >>
>> > >>   default void onStart(CacheEvent event) {}
>> > >>
>> > >>   default void isClosing(CacheEvent event) {}
>> > >>
>> > >>   default void onClose(CacheEvent event) {}
>> > >>
>> > >>   ...
>> > >> }
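>> > >>
>> > >> A test (or a framework like STDG) could then block on the real close with
>> > >> it, e.g. something like the following -- purely hypothetical, of course:
>> > >> the registration method and the CacheEvent type are as made up as the
>> > >> interface above.
>> > >>
>> > >> CountDownLatch fullyClosed = new CountDownLatch(1);
>> > >> cache.addCacheLifecycleListener(new CacheLifecycleListener() {
>> > >>   @Override
>> > >>   public void onClose(CacheEvent event) {
>> > >>     fullyClosed.countDown(); // fired only after close has completely finished
>> > >>   }
>> > >> });
>> > >> cache.close();
>> > >> fullyClosed.await(60, TimeUnit.SECONDS); // timeout value is arbitrary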
>> > >>
>> > >> -j
>> > >>
>> > >> On Tue, Apr 14, 2020 at 3:21 PM Jason Huynh <jhu...@pivotal.io> wrote:
>> > >>
>> > >>> The isClosed flag and method might currently be used more as an
>> > >>> isClosing. The GemFireCacheImpl.isClosed() method is actually returning
>> > >>> isClosing. Whatever change is made to isClosed will have to properly
>> > >>> handle cases where it's meant to be treated as isClosing().
>> > >>>
>> > >>> On Tue, Apr 14, 2020 at 3:09 PM Mark Hanson <hans...@vmware.com> wrote:
>> > >>>
>> > >>>> Hi Jake,
>> > >>>>
>> > >>>> For Option 6: We could fix isClosed as well. That is a great suggestion.
>> > >>>> Currently, it returns almost immediately.
>> > >>>> I like your options though....
>> > >>>>
>> > >>>> Any other thoughts?
>> > >>>>
>> > >>>> Any preferences? I think any of the proposed options would be better
>> > >>>> than the current situation, as long as we fix isClosed.
>> > >>>>
>> > >>>> Thanks,
>> > >>>> Mark
>> > >>>> ________________________________
>> > >>>> From: Jacob Barrett <jbarr...@pivotal.io>
>> > >>>> Sent: Tuesday, April 14, 2020 2:30 PM
>> > >>>> To: dev@geode.apache.org <dev@geode.apache.org>
>> > >>>> Subject: Re: [Discuss] Cache.close synchronous is not synchronous, but
>> > >>>> code still expects it to be....
>> > >>>>
>> > >>>> Option 4: Cache.closeAndWait(long timeout, TimeUnit unit) - Closes and
>> > >>>> waits until it is really closed.
>> > >>>> Option 5: Cache.close(Runnable closedCallback) - Runs the callback after
>> > >>>> the cache is really closed.
>> > >>>> Option 6: cache.close(); while (!cache.isClosed());
>> > >>>>
>> > >>>>> On Apr 14, 2020, at 2:11 PM, Mark Hanson <mhan...@pivotal.io> wrote:
>> > >>>>>
>> > >>>>> Hi All,
>> > >>>>>
>> > >>>>> I know that we have discussed this once before, but I think it bears
>> > >>>>> repeating. We have test code that assumes cache.close is synchronous.
>> > >>>>> It is not. Not even close. I would like to discuss some possible
>> > >>>>> changes.
>> > >>>>>
>> > >>>>> Option 1. Call it what it is. Deprecate Cache.close and create a new
>> > >>>>> method called asyncClose to replace it. Simple and descriptive.
>> > >>>>> Option 2. Fix cache close so it is synchronous. Some might say that we
>> > >>>>> are going to break behavior, but I doubt any user relies on the fact
>> > >>>>> that it is asynchronous. That would be dangerous in and of itself.
>> > >>>>> Obviously, we don’t want to change behavior, but there have been a
>> > >>>>> number of distributed tests that have failed because of this. If devs
>> > >>>>> internal to the code don’t get it right, where does that leave users?
>> > >>>>> Option 3. Status quo.
>> > >>>>>
>> > >>>>> What do you think? Are there other options I am missing?
>> > >>>>>
>> > >>>>> Thanks,
>> > >>>>> Mark
>> > >>>>>
>> > >>>>
>> > >>>
>> > >>
>> > >> --
>> > >> -John
>> > >> Spring Data Team
>> > >>
>> >
>>
>> --
>> -John
>> Spring Data Team
>>
>
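
P.S. Jake's Option 6 in the quoted thread below only becomes meaningful once
isClosed() stops returning the isClosing flag. Conceptually the split is just
something like this (a sketch only, with made-up field names -- not the current
GemFireCacheImpl code):

private volatile boolean closing; // set at the *start* of close()
private volatile boolean closed;  // set only after the DistributedSystem has disconnected

public boolean isClosing() {
  return closing;
}

// today isClosed() effectively returns the closing flag; with a split like this,
// "cache.close(); while (!cache.isClosed());" would actually wait for the full
// shutdown instead of returning almost immediately
public boolean isClosed() {
  return closed;
}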