The timeout is increased in the shared Jinja variables file.

On Thu, Apr 16, 2020, 10:48 Kirk Lund <kl...@apache.org> wrote:
> I revived my branch by rebasing it on develop and filed a new draft PR:
>
> https://github.com/apache/geode/pull/4963
>
> Unfortunately, IntegrationTest exceeds the timeout every time I trigger it. The cause does not appear to be a specific test or hang. I think IntegrationTest has already been running very close to the timeout and is exceeding it fairly often even without my changes in #4963.
>
> Should we increase the timeout for IntegrationTest? (Anyone know how to increase it?)
>
> On Tue, Apr 14, 2020 at 5:06 PM Kirk Lund <kl...@apache.org> wrote:
>
> > When I previously submitted a PR to change concurrent calls to Cache.close() to not return until everything is completely stopped (including the DistributedSystem), the change was down-voted by Anil. His reasoning was that the current behavior is expected by users and is de facto correct. Changing it would change the behavior and result in long-time users being surprised by the new behavior.
> >
> > Below are some stack traces from my notes about what causes the thread calling Cache.close() to return prematurely before the DistributedSystem is disconnected. Thread #3 (the junit main thread) invokes Cache.close() as seen in stack #3 below. DiskStoreImpl reacts to the situation by once again invoking Cache.close() once or twice (stack #1 and stack #2), and one of those invocations wins the race against the thread in stack #3. The thread in stack #3 then returns prematurely, before the DistributedSystem is disconnected. If thread #3 then attempts to do anything like create a new cache (which quite a few tests do), it can fail and throw DistributedSystemDisconnectedException from cache create.
> >
> > There are two early-outs in GemFireCacheImpl.close() which allow a calling thread to return, having closed nothing but the SecurityService, before any of the actual work has been completed:
> >
> >   if (isClosed()) {
> >     return;
> >   }
> >
> > And isClosed() returns true when the isClosing flag is true (which is set as soon as closing starts):
> >
> >   public boolean isClosed() {
> >     return isClosing;
> >   }
> >
> > Failed creation of a persistent region is one way DiskStoreImpl can cause multiple threads trying to close the cache to trip all over each other.
> >
> > DiskStoreImpl is problematic at best in the way it's implemented and it's not currently unit tested (or unit testable without lots of refactoring), and I don't plan to revisit this change. I would however be happy to review proposals and PRs related to this. My change was focused on Cache.close() and adding a CountDownLatch to close() -- perhaps the next attempt to "fix" this should focus on DiskStoreImpl -- one could easily argue that closing the cache is NOT a valid responsibility for DiskStoreImpl. But changing the behavior of the persistent "layer" of Geode might require more research (and a LOT more refactoring) than I had time for, since my reason for working on this was to fix some flaky dunit tests caused by this race condition.
> >
> > This bug appears to be caused when creation of a persistent region fails and DiskStoreImpl.lambda$handleDiskAccessException forks a new Thread to close the Cache, which succeeds in closing the Cache before the main test thread closes it. The main test thread then early-outs because the DiskStore thread is already closing the Cache. The main test thread then tries to create a Cache, which collides with the DiskStore thread that is still closing the Cache and DistributedSystem.
> >
> > java.lang.Throwable: KIRK GemFireCacheImpl closed 632230948
> >   at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2365)
> >   at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1917)
> >   at org.apache.geode.internal.cache.DiskStoreImpl.lambda$handleDiskAccessException$2(DiskStoreImpl.java:3380)
> >   at java.lang.Thread.run(Thread.java:748)
> >
> > java.lang.Throwable: KIRK InternalDistributedSystem closed 306674056
> >   at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1637)
> >   at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1225)
> >   at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2351)
> >   at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1917)
> >   at org.apache.geode.internal.cache.DiskStoreImpl.lambda$handleDiskAccessException$2(DiskStoreImpl.java:3380)
> >   at java.lang.Thread.run(Thread.java:748)
> >
> > java.lang.Throwable: KIRK DiskStoreImpl closing cache 1555793073
> >   at org.apache.geode.internal.cache.DiskStoreImpl.handleDiskAccessException(DiskStoreImpl.java:3376)
> >   at org.apache.geode.internal.cache.PartitionedRegion.createAndValidatePersistentConfig(PartitionedRegion.java:956)
> >   at org.apache.geode.internal.cache.PartitionedRegion.initPRInternals(PartitionedRegion.java:999)
> >   at org.apache.geode.internal.cache.PartitionedRegion.initialize(PartitionedRegion.java:1179)
> >   at org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3043)
> >   at org.apache.geode.internal.cache.GemFireCacheImpl.basicCreateRegion(GemFireCacheImpl.java:2931)
> >   at org.apache.geode.internal.cache.GemFireCacheImpl.createRegion(GemFireCacheImpl.java:2918)
> >   at org.apache.geode.cache.RegionFactory.create(RegionFactory.java:755)
> >   at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.createCacheAndColocatedPRs(CacheFactoryRecreateRegressionTest.java:109)
> >   at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.lambda$recreateDoesNotThrowDistributedSystemDisconnectedException$0(CacheFactoryRecreateRegressionTest.java:62)
> >   at org.assertj.core.api.ThrowableAssert.catchThrowable(ThrowableAssert.java:62)
> >   at org.assertj.core.api.AssertionsForClassTypes.catchThrowable(AssertionsForClassTypes.java:786)
> >   at org.assertj.core.api.Assertions.catchThrowable(Assertions.java:1200)
> >   at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.recreateDoesNotThrowDistributedSystemDisconnectedException(CacheFactoryRecreateRegressionTest.java:62)
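For context, the CountDownLatch approach Kirk mentions above might look roughly like the following sketch. The class and method names are made up for illustration; this is not the actual GemFireCacheImpl code, only the general shape of letting a losing close() caller wait for the winner to finish.

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.atomic.AtomicBoolean;

    // Illustrative only: a latch lets the thread that loses the close() race block
    // until the winning thread has completed the whole teardown, instead of
    // returning as soon as it observes that closing has already started.
    class LatchedClose {
      private final AtomicBoolean isClosing = new AtomicBoolean();
      private final CountDownLatch closeCompleted = new CountDownLatch(1);

      void close() throws InterruptedException {
        if (!isClosing.compareAndSet(false, true)) {
          // Another thread (e.g. the DiskStore thread) is already closing;
          // wait for it to finish rather than early-outing.
          closeCompleted.await();
          return;
        }
        try {
          closeRegionsAndServices();
          disconnectDistributedSystem();
        } finally {
          closeCompleted.countDown();
        }
      }

      private void closeRegionsAndServices() { /* real teardown would happen here */ }

      private void disconnectDistributedSystem() { /* real disconnect would happen here */ }
    }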
> > On Tue, Apr 14, 2020 at 4:02 PM John Blum <jb...@pivotal.io> wrote:
> >
> >> Among other problems I encountered, one problem that seemed to affect *Integration Tests* as I described was that the *Singleton* cache reference was not cleaned up in a timely manner. Therefore, starting a fresh cache instance (without coordination) in another *Integration Test* would occasionally throw a CacheExistsException (IIRC).
> >>
> >> It would be roughly equivalent to ...
> >>
> >> Cache cache = new CacheFactory().create();
> >> // create more cache objects (Regions, Indexes, etc)
> >> cache.close();
> >> cache = new CacheFactory().create(); // EXCEPTION!!!
> >>
> >> There is a lot of stuff (even asynchronous things) going on inside cache close that can take time. Even deallocation of system resources does not always happen in a timely manner.
> >>
> >> Kirk will undoubtedly remember other things he encountered as well.
> >>
> >> -j
> >>
> >> On Tue, Apr 14, 2020 at 3:45 PM Mark Hanson <mhan...@pivotal.io> wrote:
> >>
> >> > I believe it is because of disk persistence, among other things. Kirk would know for sure. He fixed the issue and his PR got shut down. I totally support just fixing the bug.
> >> >
> >> > Let's give Kirk a chance to chime in.
> >> >
> >> > Thanks,
> >> > Mark
> >> >
> >> > On Apr 14, 2020, at 3:39 PM, Dan Smith <dsm...@pivotal.io> wrote:
> >> >
> >> > > IMO if it's not currently synchronous, that's just a bug that needs to be fixed. If folks can provide details on what actually was asynchronous or the particular test failures they saw, that would be helpful.
> >> > >
> >> > > Previously, when this came up it looks like Kirk found that close would not wait for a different call to close() issued by a different thread [1]. Is that still the bug we are running into? On that thread, it seems like we also had a consensus that we should just fix bugs with Cache.close.
> >> > >
> >> > > -Dan
> >> > >
> >> > > 1. https://lists.apache.org/x/thread.html/f385a6dd51209e2706c68c9821412a6f22ebef3e809636060ac0bf55@%3Cdev.geode.apache.org%3E
> >> > >
> >> > > On Tue, Apr 14, 2020 at 3:23 PM John Blum <jb...@pivotal.io> wrote:
> >> > >
> >> > >> My first thought is that cache close (i.e. RegionService.close()) should be synchronous (by default), perhaps with non-blocking options or options to wait for a set timeout, as Jake suggested.
> >> > >>
> >> > >> This is a problem for *Integration Tests* (that start a peer cache instance, in-process or standalone) in general, and not just "distributed" tests! This is the reason I built support for this in *Spring Test for Apache Geode* (STDG), since subsequent tests requiring a peer cache instance (or CacheServer) may conflict with each other, especially given 1) the cache instance is a *Singleton* and 2) it is ideal not to have to restart the JVM between tests, even for *Integration Tests* (although that turns out to be a really common practice *#ugh*). However, without proper coordination this can be a real problem, hence STDG's support. Even when forking JVMs, you still have to be careful to wait in certain cases, as you could run into other conflicts (e.g. BindExceptions if not varying port numbers and such). For all these reasons and more, it is important that the cache has fully shut down and released all its resources.
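One way a test could coordinate this on its own is sketched below. This is only an illustration of the general idea (it is not how STDG actually does it): hold the DistributedSystem reference, close the cache, and poll isConnected() with a bound before creating the next cache.

    import java.util.concurrent.TimeUnit;

    import org.apache.geode.cache.Cache;
    import org.apache.geode.cache.CacheFactory;
    import org.apache.geode.distributed.DistributedSystem;

    class RecreateCacheExample {

      // Close the cache, then wait (bounded) for its DistributedSystem to disconnect
      // before creating a new cache, so the create is less likely to collide with a
      // teardown that is still running on another thread.
      static Cache closeAndRecreate(Cache cache) throws InterruptedException {
        DistributedSystem system = cache.getDistributedSystem();
        cache.close();

        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(60);
        // Poll the DistributedSystem rather than Cache.isClosed(), because isClosed()
        // flips to true as soon as closing *starts* (it returns the isClosing flag).
        while (system.isConnected() && System.nanoTime() < deadline) {
          Thread.sleep(100);
        }
        return new CacheFactory().create();
      }
    }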
> >> > >> Also, while we are on this topic, I think it would be useful to have a dedicated interface for the cache instance lifecycle. It's unfortunate the CacheListener interface is already taken for Region events. *#sigh*
> >> > >>
> >> > >> The interface might be something like...
> >> > >>
> >> > >> interface CacheLifecycleListener {
> >> > >>
> >> > >>   default void isStarting(CacheEvent event) {}
> >> > >>
> >> > >>   default void onStart(CacheEvent event) {}
> >> > >>
> >> > >>   default void isClosing(CacheEvent event) {}
> >> > >>
> >> > >>   default void onClose(CacheEvent event) {}
> >> > >>
> >> > >>   ...
> >> > >>
> >> > >> }
> >> > >>
> >> > >> -j
> >> > >>
> >> > >> On Tue, Apr 14, 2020 at 3:21 PM Jason Huynh <jhu...@pivotal.io> wrote:
> >> > >>
> >> > >>> The isClosed flag and method might be used currently more as an isClosing. The GemFireCacheImpl.isClosed() method is actually returning isClosing. Whatever change is made to isClosed will have to properly handle the cases where it's meant to be treated as isClosing().
> >> > >>>
> >> > >>> On Tue, Apr 14, 2020 at 3:09 PM Mark Hanson <hans...@vmware.com> wrote:
> >> > >>>
> >> > >>>> Hi Jake,
> >> > >>>>
> >> > >>>> For Option 6: We could fix isClosed as well. That is a great suggestion. Currently, it returns almost immediately. I like your options though....
> >> > >>>>
> >> > >>>> Any other thoughts?
> >> > >>>>
> >> > >>>> Any preferences? I think any of the current options seem better than the current situation, as long as we fix isClosed.
> >> > >>>>
> >> > >>>> Thanks,
> >> > >>>> Mark
> >> > >>>> ________________________________
> >> > >>>> From: Jacob Barrett <jbarr...@pivotal.io>
> >> > >>>> Sent: Tuesday, April 14, 2020 2:30 PM
> >> > >>>> To: dev@geode.apache.org <dev@geode.apache.org>
> >> > >>>> Subject: Re: [Discuss] Cache.close synchronous is not synchronous, but code still expects it to be....
> >> > >>>>
> >> > >>>> Option 4: Cache.closeAndWait(long timeout, TimeUnit unit) - Closes and waits until it is really closed.
> >> > >>>> Option 5: Cache.close(Runnable closedCallback) - Runs the callback after the cache is really closed.
> >> > >>>> Option 6: cache.close(); while (!cache.isClosed());
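As a point of reference, Option 6 with a bounded wait might look like the following in a test. This is only a sketch, and it only becomes meaningful once isClosed() is fixed to mean "fully closed" rather than "closing", per the discussion above.

    import static org.awaitility.Awaitility.await;

    import java.util.concurrent.TimeUnit;

    import org.apache.geode.cache.Cache;
    import org.apache.geode.cache.CacheFactory;

    class OptionSixExample {

      static void closeAndWaitForClosed() {
        Cache cache = new CacheFactory().create();
        // ... exercise the cache ...
        cache.close();
        // Option 6 with a bound: poll until the cache reports closed. Today this
        // returns almost immediately because isClosed() really reports isClosing.
        await().atMost(1, TimeUnit.MINUTES).until(cache::isClosed);
      }
    }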
> >> > >>>> On Apr 14, 2020, at 2:11 PM, Mark Hanson <mhan...@pivotal.io> wrote:
> >> > >>>>
> >> > >>>>> Hi All,
> >> > >>>>>
> >> > >>>>> I know that we have discussed this once before, but I think it bears repeating. We have test code that assumes cache.close is synchronous. It is not. Not even close. I would like to discuss some possible changes.
> >> > >>>>>
> >> > >>>>> Option 1. Call it what it is. Deprecate Cache.close and create a new method called asyncClose to replace it. Simple and descriptive.
> >> > >>>>> Option 2. Fix cache close so it is synchronous. Some might say that we are going to break behavior, but I doubt any user relies on the fact that it is asynchronous. That would be dangerous in and of itself. Obviously, we don't want to change behavior, but a number of distributed tests have failed because of this. If devs internal to the codebase don't get it right, where does that leave users?
> >> > >>>>> Option 3. Status quo.
> >> > >>>>>
> >> > >>>>> What do you think? Are there other options I am missing?
> >> > >>>>>
> >> > >>>>> Thanks,
> >> > >>>>> Mark
> >> > >>>>>
> >> > >> --
> >> > >> -John
> >> > >> Spring Data Team
> >>
> >> --
> >> -John
> >> Spring Data Team