I revived my branch by rebasing it on develop and filed a new draft PR: https://github.com/apache/geode/pull/4963

Unfortunately, IntegrationTest exceeds its timeout every time I trigger it. The cause does not appear to be a specific test or a hang. I think IntegrationTest has already been running very close to the timeout and is exceeding it fairly often even without my changes in #4963. Should we increase the timeout for IntegrationTest? (Does anyone know how to increase it?)

On Tue, Apr 14, 2020 at 5:06 PM Kirk Lund <kl...@apache.org> wrote:

> When I previously submitted a PR to change concurrent calls to Cache.close()
> to not return until everything is completely stopped (including the
> DistributedSystem), the change was down-voted by Anil. His reasoning was that
> the current behavior is expected by users and is de facto correct. Changing it
> would alter the behavior and result in long-time users being surprised.
>
> Below are some stack traces from my notes about what causes the thread calling
> Cache.close() to return prematurely, before the DistributedSystem is
> disconnected. Thread #3 (the JUnit main thread) invokes Cache.close() as seen
> in stack #3 below. DiskStoreImpl reacts to the situation by once again
> invoking Cache.close() once or twice (stack #1 and stack #2), and one of those
> invocations wins the race against the thread in stack #3. The thread in stack
> #3 then returns prematurely, before the DistributedSystem is disconnected. If
> thread #3 then attempts to do anything like create a new cache (which quite a
> few tests do), it can fail and throw DistributedSystemDisconnectedException
> from cache create.
>
> There are two early-outs in GemFireCacheImpl.close() which allow a calling
> thread to return after closing nothing but the SecurityService, before any of
> the actual work has been completed:
>
> if (isClosed()) {
>   return;
> }
>
> And isClosed() returns true when the isClosing flag is true (which is set as
> soon as closing starts):
>
> public boolean isClosed() {
>   return isClosing;
> }
>
> Failed creation of a persistent region is one way DiskStoreImpl can cause
> multiple threads trying to close the cache to trip all over each other.
>
> DiskStoreImpl is problematic at best in the way it's implemented, it's not
> currently unit tested (or unit testable without lots of refactoring), and I
> don't plan to revisit this change. I would, however, be happy to review
> proposals and PRs related to this. My change was focused on Cache.close() and
> adding a CountDownLatch to close() -- perhaps the next attempt to "fix" this
> should focus on DiskStoreImpl -- one could easily argue that closing the cache
> is NOT a valid responsibility for DiskStoreImpl. But changing the behavior of
> the persistent "layer" of Geode might require more research (and a LOT more
> refactoring) than I had time for, since my reason for working on this was to
> fix some flaky dunit tests caused by this race condition.
>
> This bug appears when creation of a persistent region fails and
> DiskStoreImpl.lambda$handleDiskAccessException forks a new Thread to close the
> Cache, and that thread succeeds in closing the Cache before the main test
> thread does. The main test thread then early-outs because the DiskStore thread
> is already closing the Cache. The main test thread then tries to create a
> Cache, which collides with the DiskStore thread that is still closing the
> Cache and DistributedSystem.
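>
> To make the shape of that change concrete, here is a stripped-down sketch of
> the idea (illustration only -- this is not the actual diff from my PR, and the
> class and method names are made up): the first caller does the real work and
> every other caller blocks on a latch until that work has completely finished.
>
> import java.util.concurrent.CountDownLatch;
>
> // Sketch of a cache-like class whose close() blocks concurrent callers
> // until the first close has fully completed.
> class BlockingCloseSketch {
>
>   private final CountDownLatch closeCompleted = new CountDownLatch(1);
>   private boolean closing; // guarded by "this"; plays the role of isClosing
>
>   public void close() {
>     boolean firstCloser;
>     synchronized (this) {
>       firstCloser = !closing;
>       closing = true;
>     }
>     if (!firstCloser) {
>       try {
>         // another thread is already closing: wait for it to finish completely
>         // instead of returning before the DistributedSystem has disconnected
>         closeCompleted.await();
>       } catch (InterruptedException e) {
>         Thread.currentThread().interrupt();
>       }
>       return;
>     }
>     try {
>       reallyClose(); // close regions and disk stores, disconnect the DistributedSystem
>     } finally {
>       closeCompleted.countDown(); // release any waiters blocked in close()
>     }
>   }
>
>   private void reallyClose() { /* elided */ }
> }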
>
> Stack #1:
>
> java.lang.Throwable: KIRK GemFireCacheImpl closed 632230948
>   at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2365)
>   at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1917)
>   at org.apache.geode.internal.cache.DiskStoreImpl.lambda$handleDiskAccessException$2(DiskStoreImpl.java:3380)
>   at java.lang.Thread.run(Thread.java:748)
>
> Stack #2:
>
> java.lang.Throwable: KIRK InternalDistributedSystem closed 306674056
>   at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1637)
>   at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1225)
>   at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2351)
>   at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1917)
>   at org.apache.geode.internal.cache.DiskStoreImpl.lambda$handleDiskAccessException$2(DiskStoreImpl.java:3380)
>   at java.lang.Thread.run(Thread.java:748)
>
> Stack #3:
>
> java.lang.Throwable: KIRK DiskStoreImpl closing cache 1555793073
>   at org.apache.geode.internal.cache.DiskStoreImpl.handleDiskAccessException(DiskStoreImpl.java:3376)
>   at org.apache.geode.internal.cache.PartitionedRegion.createAndValidatePersistentConfig(PartitionedRegion.java:956)
>   at org.apache.geode.internal.cache.PartitionedRegion.initPRInternals(PartitionedRegion.java:999)
>   at org.apache.geode.internal.cache.PartitionedRegion.initialize(PartitionedRegion.java:1179)
>   at org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3043)
>   at org.apache.geode.internal.cache.GemFireCacheImpl.basicCreateRegion(GemFireCacheImpl.java:2931)
>   at org.apache.geode.internal.cache.GemFireCacheImpl.createRegion(GemFireCacheImpl.java:2918)
>   at org.apache.geode.cache.RegionFactory.create(RegionFactory.java:755)
>   at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.createCacheAndColocatedPRs(CacheFactoryRecreateRegressionTest.java:109)
>   at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.lambda$recreateDoesNotThrowDistributedSystemDisconnectedException$0(CacheFactoryRecreateRegressionTest.java:62)
>   at org.assertj.core.api.ThrowableAssert.catchThrowable(ThrowableAssert.java:62)
>   at org.assertj.core.api.AssertionsForClassTypes.catchThrowable(AssertionsForClassTypes.java:786)
>   at org.assertj.core.api.Assertions.catchThrowable(Assertions.java:1200)
>   at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.recreateDoesNotThrowDistributedSystemDisconnectedException(CacheFactoryRecreateRegressionTest.java:62)
>
> On Tue, Apr 14, 2020 at 4:02 PM John Blum <jb...@pivotal.io> wrote:
>
>> Among other problems I encountered, one problem that seemed to affect
>> *Integration Tests* as I described was that the *Singleton* cache reference
>> was not cleaned up in a timely manner. Therefore, starting a fresh cache
>> instance (without coordination) in another *Integration Test* would
>> occasionally throw a CacheExistsException (IIRC).
>>
>> It would be roughly equivalent to ...
>>
>> Cache cache = new CacheFactory().create();
>> // create more cache objects (Regions, Indexes, etc)
>> cache.close();
>> cache = new CacheFactory().create(); // EXCEPTION!!!
>>
>> There is a lot of stuff (even asynchronous things) going on inside cache
>> close that can take time. Even deallocation of system resources does not
>> always happen in a timely manner.
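>>
>> Without framework support, the workaround ends up looking roughly like this
>> hand-rolled sketch (the helper name, the 60 second timeout, and the 100 ms
>> poll interval are all arbitrary choices; it leans on the old
>> DistributedSystem eventually reporting disconnected):
>>
>> import java.util.concurrent.TimeUnit;
>>
>> import org.apache.geode.cache.Cache;
>> import org.apache.geode.cache.CacheFactory;
>> import org.apache.geode.distributed.DistributedSystem;
>>
>> class WaitForRealClose {
>>
>>   // close the cache, then poll until its DistributedSystem disconnects
>>   // (or the timeout elapses) before the caller creates a new cache
>>   static void closeAndWait(Cache cache, long timeoutMillis) throws InterruptedException {
>>     DistributedSystem system = cache.getDistributedSystem();
>>     cache.close();
>>     long deadline = System.currentTimeMillis() + timeoutMillis;
>>     while (system.isConnected() && System.currentTimeMillis() < deadline) {
>>       TimeUnit.MILLISECONDS.sleep(100);
>>     }
>>   }
>>
>>   static void example() throws InterruptedException {
>>     Cache cache = new CacheFactory().create();
>>     // create more cache objects (Regions, Indexes, etc)
>>     closeAndWait(cache, TimeUnit.SECONDS.toMillis(60));
>>     cache = new CacheFactory().create(); // the old instance should be gone by now
>>   }
>> }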
>>
>> Kirk will undoubtedly remember other things he encountered as well.
>>
>> -j
>>
>> On Tue, Apr 14, 2020 at 3:45 PM Mark Hanson <mhan...@pivotal.io> wrote:
>>
>> > I believe it is because of disk persistence, among other things. Kirk would
>> > know for sure. He fixed the issue and his PR got shut down.
>> > I totally support just fixing the bug.
>> >
>> > Let’s give Kirk a chance to chime in.
>> >
>> > Thanks,
>> > Mark
>> >
>> > > On Apr 14, 2020, at 3:39 PM, Dan Smith <dsm...@pivotal.io> wrote:
>> > >
>> > > IMO if it's not currently synchronous, that's just a bug that needs to be
>> > > fixed. If folks can provide details on what actually was asynchronous or
>> > > the particular test failures they saw, that would be helpful.
>> > >
>> > > Previously, when this came up, it looks like Kirk found that close would
>> > > not wait for a different call to close() issued by a different thread [1].
>> > > Is that still the bug we are running into? On that thread, it seems like
>> > > we also had a consensus that we should just fix bugs with Cache.close:
>> > >
>> > > -Dan
>> > >
>> > > 1.
>> > > https://lists.apache.org/x/thread.html/f385a6dd51209e2706c68c9821412a6f22ebef3e809636060ac0bf55@%3Cdev.geode.apache.org%3E
>> > >
>> > > On Tue, Apr 14, 2020 at 3:23 PM John Blum <jb...@pivotal.io> wrote:
>> > >
>> > >> My first thought is cache close (i.e. RegionService.close()) should be
>> > >> synchronous (by default), perhaps with non-blocking options or options to
>> > >> wait for a set timeout, as Jake suggested.
>> > >>
>> > >> This is a problem for *Integration Tests* (that start a peer cache
>> > >> instance, in-process or standalone) in general and not simply just
>> > >> "distributed" tests! This is the reason I built support for this in
>> > >> *Spring Test for Apache Geode* (STDG), since subsequent tests requiring a
>> > >> peer cache instance (or CacheServer) may conflict with each other,
>> > >> especially given 1) the cache instance is a *Singleton* and 2) it is
>> > >> ideal to not have to restart the JVM between tests, even for *Integration
>> > >> Tests*; restarting, however, turns out to be a really common practice.
>> > >> *#ugh* Without proper coordination this can be a real problem, hence
>> > >> STDG's support. Even when forking JVMs, you still have to be careful to
>> > >> wait in certain cases, as you could run into other conflicts (e.g.
>> > >> BindExceptions if not varying port numbers and such). For all these
>> > >> reasons and more, it is important that the cache has fully shut down and
>> > >> released all its resources.
>> > >>
>> > >> Also, while we are on this topic, I think it would be useful to have a
>> > >> dedicated interface for the cache instance lifecycle. It's unfortunate
>> > >> the CacheListener interface is already taken for Region events. *#sigh*
>> > >>
>> > >> The interface might be something like...
>> > >>
>> > >> interface CacheLifecycleListener {
>> > >>
>> > >>   default void isStarting(CacheEvent event) {}
>> > >>
>> > >>   default void onStart(CacheEvent event) {}
>> > >>
>> > >>   default void isClosing(CacheEvent event) {}
>> > >>
>> > >>   default void onClose(CacheEvent event) {}
>> > >>
>> > >>   ...
>> > >> }
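>> > >>
>> > >> A test (or a framework like STDG) could then block on the real close with
>> > >> it, e.g. something like the following -- purely hypothetical, of course:
>> > >> the registration method and the CacheEvent type are as made up as the
>> > >> interface above.
>> > >>
>> > >> CountDownLatch fullyClosed = new CountDownLatch(1);
>> > >> cache.addCacheLifecycleListener(new CacheLifecycleListener() {
>> > >>   @Override
>> > >>   public void onClose(CacheEvent event) {
>> > >>     fullyClosed.countDown(); // fired only after close has completely finished
>> > >>   }
>> > >> });
>> > >> cache.close();
>> > >> fullyClosed.await(60, TimeUnit.SECONDS); // timeout value is arbitrary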
>> > >>
>> > >> -j
>> > >>
>> > >> On Tue, Apr 14, 2020 at 3:21 PM Jason Huynh <jhu...@pivotal.io> wrote:
>> > >>
>> > >>> The isClosed flag and method might currently be used more as an
>> > >>> isClosing. The GemFireCacheImpl.isClosed() method is actually returning
>> > >>> isClosing. Whatever change is made to isClosed will have to properly
>> > >>> handle cases where it's meant to be treated as isClosing().
>> > >>>
>> > >>> On Tue, Apr 14, 2020 at 3:09 PM Mark Hanson <hans...@vmware.com> wrote:
>> > >>>
>> > >>>> Hi Jake,
>> > >>>>
>> > >>>> For Option 6: We could fix isClosed as well. That is a great suggestion.
>> > >>>> Currently, it returns almost immediately.
>> > >>>> I like your options though....
>> > >>>>
>> > >>>> Any other thoughts?
>> > >>>>
>> > >>>> Any preferences? I think any of the proposed options would be better
>> > >>>> than the current situation, as long as we fix isClosed.
>> > >>>>
>> > >>>> Thanks,
>> > >>>> Mark
>> > >>>> ________________________________
>> > >>>> From: Jacob Barrett <jbarr...@pivotal.io>
>> > >>>> Sent: Tuesday, April 14, 2020 2:30 PM
>> > >>>> To: dev@geode.apache.org <dev@geode.apache.org>
>> > >>>> Subject: Re: [Discuss] Cache.close synchronous is not synchronous, but
>> > >>>> code still expects it to be....
>> > >>>>
>> > >>>> Option 4: Cache.closeAndWait(long timeout, TimeUnit unit) - Closes and
>> > >>>> waits until it is really closed.
>> > >>>> Option 5: Cache.close(Runnable closedCallback) - Runs the callback after
>> > >>>> the cache is really closed.
>> > >>>> Option 6: cache.close(); while (!cache.isClosed());
>> > >>>>
>> > >>>>> On Apr 14, 2020, at 2:11 PM, Mark Hanson <mhan...@pivotal.io> wrote:
>> > >>>>>
>> > >>>>> Hi All,
>> > >>>>>
>> > >>>>> I know that we have discussed this once before, but I think it bears
>> > >>>>> repeating. We have test code that assumes cache.close is synchronous.
>> > >>>>> It is not. Not even close. I would like to discuss some possible
>> > >>>>> changes.
>> > >>>>>
>> > >>>>> Option 1. Call it what it is. Deprecate Cache.close and create a new
>> > >>>>> method called asyncClose to replace it. Simple and descriptive.
>> > >>>>> Option 2. Fix cache close so it is synchronous. Some might say that we
>> > >>>>> are going to break behavior, but I doubt any user relies on the fact
>> > >>>>> that it is asynchronous. That would be dangerous in and of itself.
>> > >>>>> Obviously, we don’t want to change behavior, but there have been a
>> > >>>>> number of distributed tests that have failed because of this. If devs
>> > >>>>> internal to the code don’t get it right, where does that leave users?
>> > >>>>> Option 3. Status quo.
>> > >>>>>
>> > >>>>> What do you think? Are there other options I am missing?
>> > >>>>>
>> > >>>>> Thanks,
>> > >>>>> Mark
>> > >>>>>
>> > >>>>
>> > >>>
>> > >>
>> > >> --
>> > >> -John
>> > >> Spring Data Team
>> > >>
>> >
>>
>> --
>> -John
>> Spring Data Team
>>
>
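
P.S. Jake's Option 6 in the quoted thread below only becomes meaningful once
isClosed() stops returning the isClosing flag. Conceptually the split is just
something like this (a sketch only, with made-up field names -- not the current
GemFireCacheImpl code):

private volatile boolean closing; // set at the *start* of close()
private volatile boolean closed;  // set only after the DistributedSystem has disconnected

public boolean isClosing() {
  return closing;
}

// today isClosed() effectively returns the closing flag; with a split like this,
// "cache.close(); while (!cache.isClosed());" would actually wait for the full
// shutdown instead of returning almost immediately
public boolean isClosed() {
  return closed;
}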