Thanks Kirk, that’s a really helpful description.  As far as expectations go, I 
agree with John.  If a user:

- creates a cache
- closes the cache
- checks the isClosed flag
- and is unable to create a new cache in some cases

That sounds like a bug.  Preventing the early-out sounds like a promising 
approach to resolving it.
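
For reference, roughly that sequence in code (a minimal sketch; the class name
is just a placeholder):

  import org.apache.geode.cache.Cache;
  import org.apache.geode.cache.CacheFactory;

  public class CloseThenRecreate {
    public static void main(String[] args) {
      Cache cache = new CacheFactory().create();
      cache.close();
      // Reports true because the isClosing flag is set...
      System.out.println("isClosed = " + cache.isClosed());
      // ...but the DistributedSystem may still be shutting down, so in some
      // cases this throws DistributedSystemDisconnectedException.
      Cache recreated = new CacheFactory().create();
      recreated.close();
    }
  }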


Anthony


> On Apr 14, 2020, at 5:06 PM, Kirk Lund <kl...@apache.org> wrote:
> 
> When I previously submitted a PR to change concurrent calls to
> Cache.close() to not return until everything is completely stopped
> (including the DistributedSystem), the change was down-voted by Anil. His
> reasoning was that the current behavior is expected by users and is de
> facto correct. Changing it would change the behavior and result in
> long-time users being surprised by the new behavior.
> 
> Below are some stack traces from my notes about what causes the thread
> calling Cache.close() to return prematurely before the DistributedSystem is
> disconnected. Thread #3 (junit main thread) invokes Cache.close() as seen
> in stack #3 below. DiskStoreImpl reacts to the situation by once again
> invoking Cache.close() once or twice (stack #1 and stack #2) and one of
> those invocations wins the race against the thread in stack #3. The thread
> in stack #3 then returns prematurely before the DistributedSystem is
> disconnected. If thread #3 then attempts to do anything like create a new
> cache (which quite a few tests do), it can fail and throw
> DistributedSystemDisconnectedException from cache create.
> 
> There are two early-outs in GemFireCacheImpl.close() which allow a calling
> thread to return before any actual work has been completed, after closing
> nothing but the SecurityService.
> 
>    if (isClosed()) {
>      return;
>    }
> 
> And "isClosed()" returns true when isClosing flag is true (which is set
> true when closing starts):
> 
>  public boolean isClosed() {
>    return isClosing;
>  }
> 
> Failed creation of a persistent region is one way DiskStoreImpl can cause
> multiple threads trying to close the cache to trip all over each other.
> 
> DiskStoreImpl is problematic at best in the way it's implemented, and it's
> not currently unit tested (or unit testable without lots of refactoring). I
> don't plan to revisit this change, but I would be happy to review proposals
> and PRs related to it. My change was focused on Cache.close() and adding a
> CountDownLatch to close() -- perhaps the next attempt to "fix" this should
> focus on DiskStoreImpl -- one could easily argue that closing the cache is
> NOT a valid responsibility for DiskStoreImpl. But changing the behavior of
> the persistent "layer" of Geode might require more research (and a LOT more
> refactoring) than I had time for, since my reason for working on this was to
> fix some flaky dunit tests caused by this race condition.
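
For anyone who picks this up, here is the spirit of the latch-based approach
(a minimal sketch, not Kirk's actual patch; the class and method names are
made-up stand-ins for the GemFireCacheImpl internals):

  import java.util.concurrent.CountDownLatch;

  // Sketch: close() never returns before shutdown has fully completed,
  // even when another thread started the close first.
  class CacheCloseSketch {
    private final CountDownLatch closeCompleted = new CountDownLatch(1);
    private boolean closing;

    public void close() {
      synchronized (this) {
        if (closing) {
          // A concurrent close is already in progress: wait for it to finish
          // instead of returning early with nothing actually closed.
          awaitClose();
          return;
        }
        closing = true;
      }
      try {
        closeRegionsAndDiskStores();     // placeholder for the real close work
        disconnectDistributedSystem();   // do not return before this finishes
      } finally {
        closeCompleted.countDown();
      }
    }

    private void awaitClose() {
      try {
        closeCompleted.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    }

    private void closeRegionsAndDiskStores() {}

    private void disconnectDistributedSystem() {}
  }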
> 
> This bug appears to be triggered when creation of a persistent region fails
> and DiskStoreImpl.lambda$handleDiskAccessException forks a new Thread to
> close the Cache, which succeeds in closing the Cache before the main test
> thread closes it. The main test thread then takes the early-out because the
> DiskStore thread is already closing the Cache. The main test thread then
> tries to create a Cache, which collides with the DiskStore thread that is
> still closing the Cache and DistributedSystem.
> 
> java.lang.Throwable: KIRK GemFireCacheImpl closed 632230948
>       at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2365)
>       at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1917)
>       at org.apache.geode.internal.cache.DiskStoreImpl.lambda$handleDiskAccessException$2(DiskStoreImpl.java:3380)
>       at java.lang.Thread.run(Thread.java:748)
> 
> java.lang.Throwable: KIRK InternalDistributedSystem closed 306674056
>       at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1637)
>       at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1225)
>       at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2351)
>       at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1917)
>       at org.apache.geode.internal.cache.DiskStoreImpl.lambda$handleDiskAccessException$2(DiskStoreImpl.java:3380)
>       at java.lang.Thread.run(Thread.java:748)
> 
> java.lang.Throwable: KIRK DiskStoreImpl closing cache 1555793073
>       at org.apache.geode.internal.cache.DiskStoreImpl.handleDiskAccessException(DiskStoreImpl.java:3376)
>       at org.apache.geode.internal.cache.PartitionedRegion.createAndValidatePersistentConfig(PartitionedRegion.java:956)
>       at org.apache.geode.internal.cache.PartitionedRegion.initPRInternals(PartitionedRegion.java:999)
>       at org.apache.geode.internal.cache.PartitionedRegion.initialize(PartitionedRegion.java:1179)
>       at org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3043)
>       at org.apache.geode.internal.cache.GemFireCacheImpl.basicCreateRegion(GemFireCacheImpl.java:2931)
>       at org.apache.geode.internal.cache.GemFireCacheImpl.createRegion(GemFireCacheImpl.java:2918)
>       at org.apache.geode.cache.RegionFactory.create(RegionFactory.java:755)
>       at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.createCacheAndColocatedPRs(CacheFactoryRecreateRegressionTest.java:109)
>       at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.lambda$recreateDoesNotThrowDistributedSystemDisconnectedException$0(CacheFactoryRecreateRegressionTest.java:62)
>       at org.assertj.core.api.ThrowableAssert.catchThrowable(ThrowableAssert.java:62)
>       at org.assertj.core.api.AssertionsForClassTypes.catchThrowable(AssertionsForClassTypes.java:786)
>       at org.assertj.core.api.Assertions.catchThrowable(Assertions.java:1200)
>       at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.recreateDoesNotThrowDistributedSystemDisconnectedException(CacheFactoryRecreateRegressionTest.java:62)
> 
> 
> On Tue, Apr 14, 2020 at 4:02 PM John Blum <jb...@pivotal.io> wrote:
> 
>> Among other problems I encountered, one problem that seemed to affect
>> *Integration Tests* as I described was that the *Singleton* cache reference
>> was not cleaned up in a timely manner. Therefore, starting a fresh cache
>> instance (without coordination) in another *Integration Test* would
>> occasionally throw a CacheExistsException (IIRC).
>> 
>> It would be roughly equivalent to ...
>> 
>> Cache cache = new CacheFactory().create();
>> // create more cache objects (Regions, Indexes, etc)
>> cache.close();
>> cache = new CacheFactory().create();  // EXCEPTION!!!
>> 
>> There is a lot of stuff (even asynchronous things) going on inside cache
>> close that can take time.  Even deallocation of system resources does not
>> always happen in a timely manner.
>> 
>> Kirk will undoubtedly remember other things he encountered as well.
>> 
>> -j
>> 
>> 
>> On Tue, Apr 14, 2020 at 3:45 PM Mark Hanson <mhan...@pivotal.io> wrote:
>> 
>>> I believe it is because of disk persistence, among other things. Kirk would
>>> know for sure. He fixed the issue and his PR got shut down.
>>> I totally support just fixing the bug.
>>> 
>>> Let’s give Kirk a chance to chime in.
>>> 
>>> Thanks,
>>> Mark
>>> 
>>>> On Apr 14, 2020, at 3:39 PM, Dan Smith <dsm...@pivotal.io> wrote:
>>>> 
>>>> IMO if it's not currently synchronous, that's just a bug that needs to be
>>>> fixed. If folks can provide details on what actually was asynchronous or
>>>> the particular test failures they saw, that would be helpful.
>>>> 
>>>> Previously, when this came up, it looks like Kirk found that close would
>>>> not wait for a different call to close() issued by a different thread [1].
>>>> Is that still the bug we are running into? On that thread, it seems like
>>>> we also had a consensus that we should just fix the bugs with Cache.close:
>>>> 
>>>> -Dan
>>>> 
>>>> 1.
>>>> https://lists.apache.org/x/thread.html/f385a6dd51209e2706c68c9821412a6f22ebef3e809636060ac0bf55@%3Cdev.geode.apache.org%3E
>>>> 
>>>> On Tue, Apr 14, 2020 at 3:23 PM John Blum <jb...@pivotal.io> wrote:
>>>> 
>>>>> My first thought is cache close (i.e., RegionService.close()) should be
>>>>> synchronous (by default), perhaps with non-blocking options or options to
>>>>> wait for a set timeout, as Jake suggested.
>>>>> 
>>>>> This is a problem for *Integration Tests* (that start a peer cache
>>>>> instance, in-process or standalone) in general and not simply just
>>>>> "distributed" tests!  This is the reason I built support for this in
>>>>> *Spring Test for Apache Geode* (STDG), since subsequent tests requiring a
>>>>> peer cache instance (or CacheServer) may conflict with each other,
>>>>> especially given 1) the cache instance is a *Singleton* and 2) it is
>>>>> ideal to not have to restart the JVM between tests, even for *Integration
>>>>> Tests*; restarting, however, turns out to be a really common practice.
>>>>> *#ugh*  However, without proper coordination this can be a real problem,
>>>>> hence STDG's support.  Even when forking JVMs, you still have to be
>>>>> careful to wait in certain cases, as you could run into other conflicts
>>>>> (e.g. BindExceptions if not varying port numbers and such).  For all
>>>>> these reasons and more, it is important that the cache has fully shut
>>>>> down and released all its resources.
>>>>> 
>>>>> Also, while we are on this topic, I think it would be useful to have a
>>>>> dedicated interface for the cache instance lifecycle.  It's unfortunate
>>>>> the CacheListener interface is already taken for Region events. *#sigh*
>>>>> 
>>>>> The interface might be something like...
>>>>> 
>>>>> interface CacheLifecycleListener {
>>>>> 
>>>>>   default void isStarting(CacheEvent event) {}
>>>>> 
>>>>>   default void onStart(CacheEvent event) {}
>>>>> 
>>>>>   default void isClosing(CacheEvent event) {}
>>>>> 
>>>>>   default void onClose(CacheEvent event) {}
>>>>> 
>>>>>   ...
>>>>> 
>>>>> }
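
Something like that would let tests block until the cache is really gone
instead of polling; a rough usage sketch (the addCacheLifecycleListener
registration method is hypothetical, and CacheEvent here is whatever event
type the new interface would define):

  import java.util.concurrent.CountDownLatch;
  import java.util.concurrent.TimeUnit;

  CountDownLatch closed = new CountDownLatch(1);

  cache.addCacheLifecycleListener(new CacheLifecycleListener() {  // hypothetical registration API
    @Override
    public void onClose(CacheEvent event) {
      closed.countDown();  // fires only once close has fully completed
    }
  });

  cache.close();
  closed.await(30, TimeUnit.SECONDS);  // test proceeds only after a real close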
>>>>> 
>>>>> -j
>>>>> 
>>>>> On Tue, Apr 14, 2020 at 3:21 PM Jason Huynh <jhu...@pivotal.io> wrote:
>>>>> 
>>>>>> The isClosed flag and method might currently be used more as an
>>>>>> isClosing. The GemFireCacheImpl.isClosed() method is actually returning
>>>>>> isClosing. Whatever change is made to isClosed will have to properly
>>>>>> handle cases where it's meant to be treated as isClosing().
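
A minimal sketch of what separating the two states could look like (field
names are placeholders, not the current GemFireCacheImpl fields):

  // Track "close has started" and "close has finished" separately.
  private volatile boolean closing;
  private volatile boolean closed;

  public boolean isClosing() {
    return closing;
  }

  public boolean isClosed() {
    // True only after close() has fully completed, including the
    // DistributedSystem disconnect.
    return closed;
  }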
>>>>>> 
>>>>>> On Tue, Apr 14, 2020 at 3:09 PM Mark Hanson <hans...@vmware.com> wrote:
>>>>>> 
>>>>>>> Hi Jake,
>>>>>>> 
>>>>>>> For Option 6: We could fix isClosed as well. That is a great suggestion.
>>>>>>> Currently, it returns almost immediately.
>>>>>>> I like your options though....
>>>>>>> 
>>>>>>> Any other thoughts?
>>>>>>> 
>>>>>>> Any preferences? I think any of the current options seem better than
>>>>>>> the current situation, as long as we fix isClosed.
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Mark
>>>>>>> ________________________________
>>>>>>> From: Jacob Barrett <jbarr...@pivotal.io>
>>>>>>> Sent: Tuesday, April 14, 2020 2:30 PM
>>>>>>> To: dev@geode.apache.org <dev@geode.apache.org>
>>>>>>> Subject: Re: [Discuss] Cache.close synchronous is not synchronous, but
>>>>>>> code still expects it to be....
>>>>>>> 
>>>>>>> Option 4: Cache.closeAndWait(long timeout, TimeUnit unit) - Closes and
>>>>>>> waits until it is really closed.
>>>>>>> Option 5: Cache.close(Runnable closedCallback) - Runs callback after
>>>>>>> the cache is really closed.
>>>>>>> Option 6: cache.close(); while (!cache.isClosed());
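
For option 6, a bounded wait is probably what tests would actually want; a
sketch using only the existing Cache.close()/isClosed() API (note this only
helps once isClosed() means "fully closed" rather than "closing"):

  import java.util.concurrent.TimeUnit;
  import java.util.concurrent.TimeoutException;

  import org.apache.geode.cache.Cache;

  // Close, then poll isClosed() with a deadline instead of spinning forever.
  static void closeAndWait(Cache cache, long timeout, TimeUnit unit)
      throws InterruptedException, TimeoutException {
    cache.close();
    long deadline = System.nanoTime() + unit.toNanos(timeout);
    while (!cache.isClosed()) {
      if (System.nanoTime() - deadline > 0) {
        throw new TimeoutException("cache did not close within " + timeout + " " + unit);
      }
      Thread.sleep(50);
    }
  }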
>>>>>>> 
>>>>>>> 
>>>>>>>> On Apr 14, 2020, at 2:11 PM, Mark Hanson <mhan...@pivotal.io> wrote:
>>>>>>>> 
>>>>>>>> Hi All,
>>>>>>>> 
>>>>>>>> I know that we have discussed this once before, but I think it bears
>>>>>>>> repeating. We have test code that assumes cache.close is synchronous.
>>>>>>>> It is not. Not even close. I would like to discuss some possible
>>>>>>>> changes.
>>>>>>>> 
>>>>>>>> Option 1. Call it what it is.  Deprecate Cache.close and create a new
>>>>>>>> method called asyncClose to replace it. Simple and descriptive.
>>>>>>>> Option 2. Fix cache close so it is synchronous. Some might say that we
>>>>>>>> are going to break behavior, but I doubt any user relies on the fact
>>>>>>>> that it is asynchronous. That would be dangerous in and of itself.
>>>>>>>> Obviously, we don’t want to change behavior, but there have been a
>>>>>>>> number of distributed tests that have failed because of this. If devs
>>>>>>>> internal to the code don’t get it right, where does that leave users?
>>>>>>>> Option 3. Status quo.
>>>>>>>> 
>>>>>>>> What do you think? Are there other options I am missing?
>>>>>>>> 
>>>>>>>> Thanks,
>>>>>>>> Mark
>>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> -John
>>>>> Spring Data Team
>>>>> 
>>> 
>>> 
>> 
>> --
>> -John
>> Spring Data Team
>> 
