The timeout is increased in the shared Jinja variables file.

On Thu, Apr 16, 2020, 10:48 Kirk Lund <kl...@apache.org> wrote:
> I revived my branch by rebasing it on develop and filed a new draft PR:
>
> https://github.com/apache/geode/pull/4963
>
> Unfortunately, IntegrationTest exceeds the timeout every time I trigger it. The cause does not appear to be a specific test or hang. I think IntegrationTest has already been running very close to the timeout and is exceeding it fairly often even without my changes in #4963.
>
> Should we increase the timeout for IntegrationTest? (Anyone know how to increase it?)
>
> On Tue, Apr 14, 2020 at 5:06 PM Kirk Lund <kl...@apache.org> wrote:
>
> > When I previously submitted a PR to change concurrent calls to Cache.close() to not return until everything is completely stopped (including the DistributedSystem), the change was down-voted by Anil. His reasoning was that the current behavior is expected by users and is de facto correct. Changing it would change the behavior and result in long-time users being surprised by the new behavior.
> >
> > Below are some stack traces from my notes about what causes the thread calling Cache.close() to return prematurely before the DistributedSystem is disconnected. Thread #3 (the junit main thread) invokes Cache.close() as seen in stack #3 below. DiskStoreImpl reacts to the situation by once again invoking Cache.close() once or twice (stack #1 and stack #2), and one of those invocations wins the race against the thread in stack #3. The thread in stack #3 then returns prematurely, before the DistributedSystem is disconnected. If thread #3 then attempts to do anything like create a new cache (which quite a few tests do), it can fail and throw DistributedSystemDisconnectedException from cache create.
> >
> > There are two early-outs in GemFireCacheImpl.close() which allow a calling thread to return, having closed nothing but the SecurityService, before any of the actual work has been completed:
> >
> >   if (isClosed()) {
> >     return;
> >   }
> >
> > And isClosed() returns true when the isClosing flag is true (which is set as soon as closing starts):
> >
> >   public boolean isClosed() {
> >     return isClosing;
> >   }
> >
> > Failed creation of a persistent region is one way DiskStoreImpl can cause multiple threads trying to close the cache to trip all over each other.
> >
> > DiskStoreImpl is problematic at best in the way it's implemented and it's not currently unit tested (or unit testable without lots of refactoring), and I don't plan to revisit this change. I would however be happy to review proposals and PRs related to this. My change was focused on Cache.close() and adding a CountDownLatch to close() -- perhaps the next attempt to "fix" this should focus on DiskStoreImpl -- one could easily argue that closing the cache is NOT a valid responsibility for DiskStoreImpl. But changing the behavior of the persistent "layer" of Geode might require more research (and a LOT more refactoring) than I had time for, since my reason for working on this was to fix some flaky dunit tests caused by this race condition.
> >
> > This bug appears to be caused when creation of a persistent region fails and DiskStoreImpl.lambda$handleDiskAccessException forks a new Thread to close the Cache, which succeeds in closing the Cache before the main test thread closes it. The main test thread then early-outs because the DiskStore thread is already closing the Cache. The main test thread then tries to create a Cache, which collides with the DiskStore thread that is still closing the Cache and DistributedSystem.
> >
> > java.lang.Throwable: KIRK GemFireCacheImpl closed 632230948
> >   at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2365)
> >   at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1917)
> >   at org.apache.geode.internal.cache.DiskStoreImpl.lambda$handleDiskAccessException$2(DiskStoreImpl.java:3380)
> >   at java.lang.Thread.run(Thread.java:748)
> >
> > java.lang.Throwable: KIRK InternalDistributedSystem closed 306674056
> >   at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1637)
> >   at org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1225)
> >   at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2351)
> >   at org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1917)
> >   at org.apache.geode.internal.cache.DiskStoreImpl.lambda$handleDiskAccessException$2(DiskStoreImpl.java:3380)
> >   at java.lang.Thread.run(Thread.java:748)
> >
> > java.lang.Throwable: KIRK DiskStoreImpl closing cache 1555793073
> >   at org.apache.geode.internal.cache.DiskStoreImpl.handleDiskAccessException(DiskStoreImpl.java:3376)
> >   at org.apache.geode.internal.cache.PartitionedRegion.createAndValidatePersistentConfig(PartitionedRegion.java:956)
> >   at org.apache.geode.internal.cache.PartitionedRegion.initPRInternals(PartitionedRegion.java:999)
> >   at org.apache.geode.internal.cache.PartitionedRegion.initialize(PartitionedRegion.java:1179)
> >   at org.apache.geode.internal.cache.GemFireCacheImpl.createVMRegion(GemFireCacheImpl.java:3043)
> >   at org.apache.geode.internal.cache.GemFireCacheImpl.basicCreateRegion(GemFireCacheImpl.java:2931)
> >   at org.apache.geode.internal.cache.GemFireCacheImpl.createRegion(GemFireCacheImpl.java:2918)
> >   at org.apache.geode.cache.RegionFactory.create(RegionFactory.java:755)
> >   at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.createCacheAndColocatedPRs(CacheFactoryRecreateRegressionTest.java:109)
> >   at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.lambda$recreateDoesNotThrowDistributedSystemDisconnectedException$0(CacheFactoryRecreateRegressionTest.java:62)
> >   at org.assertj.core.api.ThrowableAssert.catchThrowable(ThrowableAssert.java:62)
> >   at org.assertj.core.api.AssertionsForClassTypes.catchThrowable(AssertionsForClassTypes.java:786)
> >   at org.assertj.core.api.Assertions.catchThrowable(Assertions.java:1200)
> >   at org.apache.geode.cache.CacheFactoryRecreateRegressionTest.recreateDoesNotThrowDistributedSystemDisconnectedException(CacheFactoryRecreateRegressionTest.java:62)
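For context, the CountDownLatch approach Kirk mentions above might look roughly like the following sketch. The class and method names are made up for illustration; this is not the actual GemFireCacheImpl code, only the general shape of letting a losing close() caller wait for the winner to finish.

    import java.util.concurrent.CountDownLatch;
    import java.util.concurrent.atomic.AtomicBoolean;

    // Illustrative only: a latch lets the thread that loses the close() race block
    // until the winning thread has completed the whole teardown, instead of
    // returning as soon as it observes that closing has already started.
    class LatchedClose {
      private final AtomicBoolean isClosing = new AtomicBoolean();
      private final CountDownLatch closeCompleted = new CountDownLatch(1);

      void close() throws InterruptedException {
        if (!isClosing.compareAndSet(false, true)) {
          // Another thread (e.g. the DiskStore thread) is already closing;
          // wait for it to finish rather than early-outing.
          closeCompleted.await();
          return;
        }
        try {
          closeRegionsAndServices();
          disconnectDistributedSystem();
        } finally {
          closeCompleted.countDown();
        }
      }

      private void closeRegionsAndServices() { /* real teardown would happen here */ }

      private void disconnectDistributedSystem() { /* real disconnect would happen here */ }
    }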
> > On Tue, Apr 14, 2020 at 4:02 PM John Blum <jb...@pivotal.io> wrote:
> >
> >> Among other problems I encountered, one problem that seemed to affect *Integration Tests* as I described was that the *Singleton* cache reference was not cleaned up in a timely manner. Therefore, starting a fresh cache instance (without coordination) in another *Integration Test* would occasionally throw a CacheExistsException (IIRC).
> >>
> >> It would be roughly equivalent to ...
> >>
> >> Cache cache = new CacheFactory().create();
> >> // create more cache objects (Regions, Indexes, etc)
> >> cache.close();
> >> cache = new CacheFactory().create(); // EXCEPTION!!!
> >>
> >> There is a lot of stuff (even asynchronous things) going on inside cache close that can take time. Even deallocation of system resources does not always happen in a timely manner.
> >>
> >> Kirk will undoubtedly remember other things he encountered as well.
> >>
> >> -j
> >>
> >> On Tue, Apr 14, 2020 at 3:45 PM Mark Hanson <mhan...@pivotal.io> wrote:
> >>
> >> > I believe it is because of disk persistence, among other things. Kirk would know for sure. He fixed the issue and his PR got shut down. I totally support just fixing the bug.
> >> >
> >> > Let's give Kirk a chance to chime in.
> >> >
> >> > Thanks,
> >> > Mark
> >> >
> >> > On Apr 14, 2020, at 3:39 PM, Dan Smith <dsm...@pivotal.io> wrote:
> >> >
> >> > > IMO if it's not currently synchronous, that's just a bug that needs to be fixed. If folks can provide details on what actually was asynchronous or the particular test failures they saw, that would be helpful.
> >> > >
> >> > > Previously, when this came up it looks like Kirk found that close would not wait for a different call to close() issued by a different thread [1]. Is that still the bug we are running into? On that thread, it seems like we also had a consensus that we should just fix bugs with Cache.close.
> >> > >
> >> > > -Dan
> >> > >
> >> > > 1. https://lists.apache.org/x/thread.html/f385a6dd51209e2706c68c9821412a6f22ebef3e809636060ac0bf55@%3Cdev.geode.apache.org%3E
> >> > >
> >> > > On Tue, Apr 14, 2020 at 3:23 PM John Blum <jb...@pivotal.io> wrote:
> >> > >
> >> > >> My first thought is that cache close (i.e. RegionService.close()) should be synchronous (by default), perhaps with non-blocking options or options to wait for a set timeout, as Jake suggested.
> >> > >>
> >> > >> This is a problem for *Integration Tests* (that start a peer cache instance, in-process or standalone) in general, and not just "distributed" tests! This is the reason I built support for this in *Spring Test for Apache Geode* (STDG), since subsequent tests requiring a peer cache instance (or CacheServer) may conflict with each other, especially given 1) the cache instance is a *Singleton* and 2) it is ideal not to have to restart the JVM between tests, even for *Integration Tests* (although that turns out to be a really common practice *#ugh*). However, without proper coordination this can be a real problem, hence STDG's support. Even when forking JVMs, you still have to be careful to wait in certain cases, as you could run into other conflicts (e.g. BindExceptions if not varying port numbers and such). For all these reasons and more, it is important that the cache has fully shut down and released all its resources.
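One way a test could coordinate this on its own is sketched below. This is only an illustration of the general idea (it is not how STDG actually does it): hold the DistributedSystem reference, close the cache, and poll isConnected() with a bound before creating the next cache.

    import java.util.concurrent.TimeUnit;

    import org.apache.geode.cache.Cache;
    import org.apache.geode.cache.CacheFactory;
    import org.apache.geode.distributed.DistributedSystem;

    class RecreateCacheExample {

      // Close the cache, then wait (bounded) for its DistributedSystem to disconnect
      // before creating a new cache, so the create is less likely to collide with a
      // teardown that is still running on another thread.
      static Cache closeAndRecreate(Cache cache) throws InterruptedException {
        DistributedSystem system = cache.getDistributedSystem();
        cache.close();

        long deadline = System.nanoTime() + TimeUnit.SECONDS.toNanos(60);
        // Poll the DistributedSystem rather than Cache.isClosed(), because isClosed()
        // flips to true as soon as closing *starts* (it returns the isClosing flag).
        while (system.isConnected() && System.nanoTime() < deadline) {
          Thread.sleep(100);
        }
        return new CacheFactory().create();
      }
    }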
> >> > >> Also, while we are on this topic, I think it would be useful to have a dedicated interface for the cache instance lifecycle. It's unfortunate the CacheListener interface is already taken for Region events. *#sigh*
> >> > >>
> >> > >> The interface might be something like...
> >> > >>
> >> > >> interface CacheLifecycleListener {
> >> > >>
> >> > >>   default void isStarting(CacheEvent event) {}
> >> > >>
> >> > >>   default void onStart(CacheEvent event) {}
> >> > >>
> >> > >>   default void isClosing(CacheEvent event) {}
> >> > >>
> >> > >>   default void onClose(CacheEvent event) {}
> >> > >>
> >> > >>   ...
> >> > >>
> >> > >> }
> >> > >>
> >> > >> -j
> >> > >>
> >> > >> On Tue, Apr 14, 2020 at 3:21 PM Jason Huynh <jhu...@pivotal.io> wrote:
> >> > >>
> >> > >>> The isClosed flag and method might be used currently more as an isClosing. The GemFireCacheImpl.isClosed() method is actually returning isClosing. Whatever change is made to isClosed will have to properly handle the cases where it's meant to be treated as isClosing().
> >> > >>>
> >> > >>> On Tue, Apr 14, 2020 at 3:09 PM Mark Hanson <hans...@vmware.com> wrote:
> >> > >>>
> >> > >>>> Hi Jake,
> >> > >>>>
> >> > >>>> For Option 6: We could fix isClosed as well. That is a great suggestion. Currently, it returns almost immediately. I like your options though....
> >> > >>>>
> >> > >>>> Any other thoughts?
> >> > >>>>
> >> > >>>> Any preferences? I think any of the current options seem better than the current situation, as long as we fix isClosed.
> >> > >>>>
> >> > >>>> Thanks,
> >> > >>>> Mark
> >> > >>>> ________________________________
> >> > >>>> From: Jacob Barrett <jbarr...@pivotal.io>
> >> > >>>> Sent: Tuesday, April 14, 2020 2:30 PM
> >> > >>>> To: dev@geode.apache.org <dev@geode.apache.org>
> >> > >>>> Subject: Re: [Discuss] Cache.close synchronous is not synchronous, but code still expects it to be....
> >> > >>>>
> >> > >>>> Option 4: Cache.closeAndWait(long timeout, TimeUnit unit) - Closes and waits until it is really closed.
> >> > >>>> Option 5: Cache.close(Runnable closedCallback) - Runs the callback after the cache is really closed.
> >> > >>>> Option 6: cache.close(); while (!cache.isClosed());
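As a point of reference, Option 6 with a bounded wait might look like the following in a test. This is only a sketch, and it only becomes meaningful once isClosed() is fixed to mean "fully closed" rather than "closing", per the discussion above.

    import static org.awaitility.Awaitility.await;

    import java.util.concurrent.TimeUnit;

    import org.apache.geode.cache.Cache;
    import org.apache.geode.cache.CacheFactory;

    class OptionSixExample {

      static void closeAndWaitForClosed() {
        Cache cache = new CacheFactory().create();
        // ... exercise the cache ...
        cache.close();
        // Option 6 with a bound: poll until the cache reports closed. Today this
        // returns almost immediately because isClosed() really reports isClosing.
        await().atMost(1, TimeUnit.MINUTES).until(cache::isClosed);
      }
    }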
> >> > >>>> On Apr 14, 2020, at 2:11 PM, Mark Hanson <mhan...@pivotal.io> wrote:
> >> > >>>>
> >> > >>>>> Hi All,
> >> > >>>>>
> >> > >>>>> I know that we have discussed this once before, but I think it bears repeating. We have test code that assumes cache.close is synchronous. It is not. Not even close. I would like to discuss some possible changes.
> >> > >>>>>
> >> > >>>>> Option 1. Call it what it is. Deprecate Cache.close and create a new method called asyncClose to replace it. Simple and descriptive.
> >> > >>>>> Option 2. Fix cache close so it is synchronous. Some might say that we are going to break behavior, but I doubt any user relies on the fact that it is asynchronous. That would be dangerous in and of itself. Obviously, we don't want to change behavior, but a number of distributed tests have failed because of this. If devs internal to the codebase don't get it right, where does that leave users?
> >> > >>>>> Option 3. Status quo.
> >> > >>>>>
> >> > >>>>> What do you think? Are there other options I am missing?
> >> > >>>>>
> >> > >>>>> Thanks,
> >> > >>>>> Mark
> >> > >>>>>
> >> > >> --
> >> > >> -John
> >> > >> Spring Data Team
> >>
> >> --
> >> -John
> >> Spring Data Team