Re: Reviewers for GEODE-7971 Gateway sender to deliver transaction events atomically to gateway receivers

2020-04-16 Thread Alberto Gomez
Hi,

Friendly reminder - No reviewers yet for this PR.

-Alberto G.

From: Alberto Gomez 
Sent: Tuesday, April 14, 2020 3:34 PM
To: dev@geode.apache.org 
Subject: Reviewers for GEODE-7971 Gateway sender to deliver transaction events 
atomically to gateway receivers

Hi,

Could I ask for a review on https://github.com/apache/geode/pull/4928?

This PR is about "Gateway sender to deliver transaction events atomically to 
gateway receivers" (https://issues.apache.org/jira/browse/GEODE-7971).


Thanks,

/Alberto G.



About Geode rolling downgrade

2020-04-16 Thread Alberto Gomez
Hi,

Some months ago I posted a question on this list (see [1]) about the 
possibility of supporting "rolling downgrade" in Geode in order to downgrade a 
Geode system to an older version, similar to the "rolling upgrade" currently 
supported.
With your answers and my investigations, my conclusion was that the main 
stumbling block to supporting "rolling downgrades" was the compatibility of 
persistent files, which was very hard to achieve because old members would 
have to be prepared to support newer versions of persistent files.

We have come up with a new approach to support rolling downgrades in Geode 
which consists of the following procedure:

- For each locator:
  - Stop locator
  - Remove locator files
  - Start locator in older version

- For each server:
  - Stop server
  - Remove server files
  - Revoke missing-disk-stores for server
  - Start server in older version

Some extra details about this procedure:
- The starting and stopping of processes may not be possible using gfsh, as 
gfsh does not allow managing members running a different version than its own.
- Redundancy in servers is required
- More than one locator is required
- The allow_old_members_to_join_for_testing parameter needs to be passed to the members.

I would like to ask two questions regarding this procedure:
- Do you see any issue not considered by this procedure or any alternative to 
it?
- Would it be reasonable to make the 
"allow_old_members_to_join_for_testing" parameter public (with a new name) so that it 
might be a valid option for production systems, for example to support the 
procedure proposed?

Thanks in advance for your answers.

Best regards,

-Alberto G.


[1]
 
http://mail-archives.apache.org/mod_mbox/geode-dev/201910.mbox/%3cb080e98c-5df4-e494-dcbd-383f6d979...@est.tech%3E


Re: [Discuss] Cache.close synchronous is not synchronous, but code still expects it to be....

2020-04-16 Thread Bruce Schuchardt
I've run into the CacheExistsException problem that John mentioned many times, 
as well as the associated "A connection to a distributed system already exists in this 
VM" exception thrown by InternalDistributedSystem.  

On 4/14/20, 4:02 PM, "John Blum"  wrote:

Among other problems I encountered, one problem that seemed to affect
*Integration Tests* as I described was that the *Singleton* cache reference was
not cleaned up in a timely manner. Therefore, starting a fresh cache instance
(without coordination) in another *Integration Test* would occasionally
throw a CacheExistsException (IIRC).

It would be roughly equivalent to ...

Cache cache = new CacheFactory().create();
// create more cache objects (Regions, Indexes, etc)
cache.close();
cache = new CacheFactory().create();  // EXCEPTION!!!

There is a lot of stuff (even asynchronous things) going on inside cache
close that can take time.  Even deallocation of system resources does not
always happen in a timely manner.
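
A crude way a test could guard against this today (just a sketch of the idea, not
STDG's actual coordination mechanism; the class name and retry/timeout values are
made up) is to retry the create until the previous close has really finished:

import org.apache.geode.cache.Cache;
import org.apache.geode.cache.CacheExistsException;
import org.apache.geode.cache.CacheFactory;
import org.apache.geode.distributed.DistributedSystemDisconnectedException;

public class CacheCreateRetry {

  // Retry cache creation until the previous cache has fully closed (or we give up).
  public static Cache createWithRetry(long timeoutMillis) throws InterruptedException {
    long deadline = System.currentTimeMillis() + timeoutMillis;
    while (true) {
      try {
        return new CacheFactory().create();
      } catch (CacheExistsException | DistributedSystemDisconnectedException e) {
        if (System.currentTimeMillis() > deadline) {
          throw e;  // the old cache never finished closing in time
        }
        Thread.sleep(100);  // previous cache still closing; back off and retry
      }
    }
  }
}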

Kirk will undoubtedly remember other things he encountered as well.

-j


On Tue, Apr 14, 2020 at 3:45 PM Mark Hanson  wrote:

> I believe it is because of disk persistence among other things. Kirk would
> know for sure. He fixed the issue and his PR got shut down.
> I totally support just fixing the bug.
>
> Let’s give Kirk a chance to chime in.
>
> Thanks,
> Mark
>
> > On Apr 14, 2020, at 3:39 PM, Dan Smith  wrote:
> >
> > IMO if it's not currently synchronous, that's just a bug that needs to be
> > fixed. If folks can provide details on what actually was asynchronous or
> > the particular test failures they saw, that would be helpful.
> >
> > Previously, when this came up it looks like Kirk found that close would not
> > wait for a different call to close() issued by a different thread [1]. Is
> > that still the bug we are running into? On that thread, it seems like we
> > also had a consensus we should just fix bugs with Cache.close:
> >
> > -Dan
> >
> > 1.
> >
> 
https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.apache.org%2Fx%2Fthread.html%2Ff385a6dd51209e2706c68c9821412a6f22ebef3e809636060ac0bf55%40%253Cdev.geode.apache.org%253E&data=02%7C01%7Chansonm%40vmware.com%7C7a43463ab53c416234d908d7e0c4cc6b%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637225008165230328&sdata=GD77kAubDDWfP93zjYsNP61VMN4%2FKBAHcq95GwjeMBc%3D&reserved=0
> >
> > On Tue, Apr 14, 2020 at 3:23 PM John Blum  wrote:
> >
> >> My first thought is that cache close (i.e. RegionService.close()) should be
> >> synchronous (by default), perhaps with non-blocking options or options
> to
> >> wait for a set timeout as Jake suggested.
> >>
> >> This is a problem for *Integration Tests* (that start a peer cache
> >> instance, in-process or standalone) in general and not simply just
> >> "distributed" tests!  This is the reason I built support for this in
> >> *Spring
> >> Test for Apache Geode* (STDG) since subsequent tests requiring a peer
> cache
> >> instance (or CacheServer) may conflict with each other, especially
> given 1)
> >> the cache instance is a *Singleton* and 2) it is ideal to not have to
> >> restart the JVM between tests, even for *Integration Tests*; however, that
> >> turns out to be a really common practice. *#ugh*  However, without proper
> >> coordination this can be a real problem, hence STDG's support.  Even
> when
> >> forking JVMs, you still have to be careful to wait in certain cases, as
> you
> >> could run into other conflicts (e.g. BindExceptions if not varying port
> >> numbers and such).  For all these reasons and more, it is important 
that
> >> the cache has fully shut down and released all its resources.
> >>
> >> Also, while we are on this topic, I think it would be useful to have a
> >> dedicated interface for the cache instance lifecycle.  It's unfortunate
> the
> >> CacheListener interface is already taken for Region events. *#sigh*
> >>
> >> The interface might be something like...
> >>
> >> interface CacheLifecycleListener {
> >>
> >>  default void isStarting(CacheEvent event) {}
> >>
> >>  default void onStart(CacheEvent event) {}
> >>
> >>  default void isClosing(CacheEvent event) {}
> >>
> >>  default void onClose(CacheEvent event) {}
> >>
> >>  ...
> >>
> >> }
> >>
> >> -j
> >>
> >> On Tue, Apr 14, 2020 at 3:21 PM Jason Huynh  wrote:
> >>
> >>> The isClosed flag and method might be used currently more as an
> >> isClosing.
> >>> The GemFireCacheImpl.isClosed() method is actually returning 
isClosing.
> >>> Whatever change to isClosed that will be made, will have to properly
> >> handle
> >>> c

Re: About Geode rolling downgrade

2020-04-16 Thread Bruce Schuchardt
-1

Another reason that we should not support rolling downgrade is that it makes it 
impossible to upgrade distributed algorithms.

When we added rolling upgrade support we pretty much immediately ran into a 
distributed hang when a test started a Locator using an older version.  In that 
release we also introduced the cluster configuration service and along with 
that we needed to upgrade the distributed lock service's notion of the "elder" 
member of the cluster.  Prior to that change a Locator could not fill this 
role, but the CCS needed to be able to use locking and needed a Locator to be 
able to fill this role.  During upgrade we used the old "elder" algorithm but 
once the upgrade was finished we switched to the new algorithm.  If you 
introduced an older Locator into this upgraded cluster it wouldn't think that 
it should be the "elder" but the rest of the cluster would expect it to be the 
elder.

You could support rolling downgrade in this scenario with extra logic and extra 
testing, but I don't think that will always be possible.  Rolling downgrade 
support would place an immense burden on developers in extra development and 
testing in order to ensure that older algorithms could always be brought back 
on-line.

On 4/16/20, 4:24 AM, "Alberto Gomez"  wrote:

Hi,

Some months ago I posted a question on this list (see [1]) about the 
possibility of supporting "rolling downgrade" in Geode in order to downgrade a 
Geode system to an older version, similar to the "rolling upgrade" currently 
supported.
With your answers and my investigations, my conclusion was that the main 
stumbling block to supporting "rolling downgrades" was the compatibility of 
persistent files, which was very hard to achieve because old members would 
have to be prepared to support newer versions of persistent files.

We have come up with a new approach to support rolling downgrades in Geode 
which consists of the following procedure:

- For each locator:
  - Stop locator
  - Remove locator files
  - Start locator in older version

- For each server:
  - Stop server
  - Remove server files
  - Revoke missing-disk-stores for server
  - Start server in older version

Some extra details about this procedure:
- The starting and stopping of processes may not be possible using gfsh, as 
gfsh does not allow managing members running a different version than its own.
- Redundancy in servers is required
- More than one locator is required
- The allow_old_members_to_join_for_testing parameter needs to be passed to the 
members.

I would like to ask two questions regarding this procedure:
- Do you see any issue not considered by this procedure or any alternative 
to it?
- Would it be reasonable to make the 
"allow_old_members_to_join_for_testing" parameter public (with a new name) so that it 
might be a valid option for production systems, for example to support the 
procedure proposed?

Thanks in advance for your answers.

Best regards,

-Alberto G.


[1]
 
http://mail-archives.apache.org/mod_mbox/geode-dev/201910.mbox/%3cb080e98c-5df4-e494-dcbd-383f6d979...@est.tech%3E





Re: [Discuss] Cache.close synchronous is not synchronous, but code still expects it to be....

2020-04-16 Thread Kirk Lund
I revived my branch by rebasing it on develop and filed a new draft PR:

https://github.com/apache/geode/pull/4963

Unfortunately, IntegrationTest exceeds timeout every time I trigger it. The
cause does not appear to be a specific test or hang. I
think IntegrationTest has already been running very close to the timeout
and is exceeding it fairly often even without my changes in #4963.

Should we increase the timeout for IntegrationTest? (Anyone know how to
increase it?)

On Tue, Apr 14, 2020 at 5:06 PM Kirk Lund  wrote:

> When I previously submitted a PR to change concurrent calls to
> Cache.close() to not return until everything is completely stopped
> (including the DistributedSystem), the change was down-voted by Anil. His
> reasoning was that the current behavior is expected by users and is de
> facto correct. Changing it would change the behavior and result in
> long-time users being surprised by the new behavior.
>
> Below are some stack traces from my notes about what causes the thread
> calling Cache.close() to return prematurely before the DistributedSystem is
> disconnected. Thread #3 (junit main thread) invokes Cache.close() as seen
> in stack #3 below. DiskStoreImpl reacts to the situation by once again
> invoking Cache.close() once or twice (stack #1 and stack #2) and one of
> those invocations wins the race against the thread in stack #3. The thread
> in stack #3 then returns prematurely before the DistributedSystem is
> disconnected. If thread #3 then attempts to do anything like create a new
> cache (which quite a few tests do), it can fail and throw
> DistributedSystemDisconnectedException from cache create.
>
> There are two early-outs in GemFireCacheImpl.close() which allow a
> calling thread to return before any actual work has been completed after
> closing nothing but the SecurityService.
>
> if (isClosed()) {
>   return;
> }
>
> And "isClosed()" returns true when isClosing flag is true (which is set
> true when closing starts):
>
>   public boolean isClosed() {
> return isClosing;
>   }
>
> Failed creation of a persistent region is one way DiskStoreImpl can cause
> multiple threads trying to close the cache to trip all over each other.
>
> DiskStoreImpl is problematic at best in the way it's implemented and it's
> not currently unit tested (or unit testable without lots of refactoring),
> and I don't plan to revisit this change. I would however be happy to review
> proposals and PRs related to this. My change was focused on Cache.close()
> and adding a CountDownLatch to close() -- perhaps the next attempt to "fix"
> this should focus on DiskStoreImpl -- one could easily argue that closing
> the cache is NOT a valid responsibility for DiskStoreImpl. But changing the
> behavior of the persistent "layer" of Geode might require more research
> (and a LOT more refactoring) than I had time for since my reason for
> working on this was to fix some flaky dunit tests caused by this race
> condition.
>
> This bug appears to be caused when creation of a persistent region fails
> and DiskStoreImpl.lambda$handleDiskAccessException forks a new Thread to
> close the Cache which succeeds in closing the Cache before the main test
> thread closes it. The main test thread then early outs because the
> DiskStore thread is already closing the Cache. The main test thread then
> tries to create a Cache which collides with the DiskStore thread which is
> still closing the Cache and DistributedSystem.
>
> java.lang.Throwable: KIRK GemFireCacheImpl closed 632230948
>   at 
> org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2365)
>   at 
> org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1917)
>   at 
> org.apache.geode.internal.cache.DiskStoreImpl.lambda$handleDiskAccessException$2(DiskStoreImpl.java:3380)
>   at java.lang.Thread.run(Thread.java:748)
>
> java.lang.Throwable: KIRK InternalDistributedSystem closed 306674056
>   at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1637)
>   at 
> org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1225)
>   at 
> org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2351)
>   at 
> org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1917)
>   at 
> org.apache.geode.internal.cache.DiskStoreImpl.lambda$handleDiskAccessException$2(DiskStoreImpl.java:3380)
>   at java.lang.Thread.run(Thread.java:748)
>
> java.lang.Throwable: KIRK DiskStoreImpl closing cache 1555793073
>   at 
> org.apache.geode.internal.cache.DiskStoreImpl.handleDiskAccessException(DiskStoreImpl.java:3376)
>   at 
> org.apache.geode.internal.cache.PartitionedRegion.createAndValidatePersistentConfig(PartitionedRegion.java:956)
>   at 
> org.apache.geode.internal.cache.PartitionedRegion.initPRInternals(PartitionedRegion.

Re: [Discuss] Cache.close synchronous is not synchronous, but code still expects it to be....

2020-04-16 Thread Robert Houghton
The timeout is increased in the shared Jinja variables file.

On Thu, Apr 16, 2020, 10:48 Kirk Lund  wrote:

> I revived my branch by rebasing it on develop and filed a new draft PR:
>
> https://github.com/apache/geode/pull/4963
>
> Unfortunately, IntegrationTest exceeds timeout every time I trigger it. The
> cause does not appear to be a specific test or hang. I
> think IntegrationTest has already been running very close to the timeout
> and is exceeding it fairly often even without my changes in #4963.
>
> Should we increase the timeout for IntegrationTest? (Anyone know how to
> increase it?)
>
> On Tue, Apr 14, 2020 at 5:06 PM Kirk Lund  wrote:
>
> > When I previously submitted a PR to change concurrent calls to
> > Cache.close() to not return until everything is completely stopped
> > (including the DistributedSystem), the change was down-voted by Anil. His
> > reasoning was that the current behavior is expected by users and is de
> > facto correct. Changing it would change the behavior and result in
> > long-time users being surprised by the new behavior.
> >
> > Below are some stack traces from my notes about what causes the thread
> > calling Cache.close() to return prematurely before the DistributedSystem
> is
> > disconnected. Thread #3 (junit main thread) invokes Cache.close() as seen
> > in stack #3 below. DiskStoreImpl reacts to the situation by once again
> > invoking Cache.close() once or twice (stack #1 and stack #2) and one of
> > those invocations wins the race against the thread in stack #3. The
> thread
> > in stack #3 then returns prematurely before the DistributedSystem is
> > disconnected. If thread #3 then attempts to do anything like create a new
> > cache (which quite a few tests do), it can fail and throw
> > DistributedSystemDisconnectedException from cache create.
> >
> > There are two early-outs in GemFireCacheImpl.close() which allow a
> > calling thread to return before any actual work has been completed after
> > closing nothing but the SecurityService.
> >
> > if (isClosed()) {
> >   return;
> > }
> >
> > And "isClosed()" returns true when isClosing flag is true (which is set
> > true when closing starts):
> >
> >   public boolean isClosed() {
> > return isClosing;
> >   }
> >
> > Failed creation of a persistent region is one way DiskStoreImpl can cause
> > multiple threads trying to close the cache to trip all over each other.
> >
> > DiskStoreImpl is problematic at best in the way it's implemented and it's
> > not currently unit tested (or unit testable without lots of refactoring),
> > and I don't plan to revisit this change. I would however be happy to
> review
> > proposals and PRs related to this. My change was focused on Cache.close()
> > and adding a CountDownLatch to close() -- perhaps the next attempt to
> "fix"
> > this should focus on DiskStoreImpl -- one could easily argue that closing
> > the cache is NOT a valid responsibility for DiskStoreImpl. But changing
> the
> > behavior of the persistent "layer" of Geode might require more research
> > (and a LOT more refactoring) than I had time for since my reason for
> > working on this was to fix some flaky dunit tests caused by this race
> > condition.
> >
> > This bug appears to be caused when creation of a persistent region fails
> > and DiskStoreImpl.lambda$handleDiskAccessException forks a new Thread to
> > close the Cache which succeeds in closing the Cache before the main test
> > thread closes it. The main test thread then early outs because the
> > DiskStore thread is already closing the Cache. The main test thread then
> > tries to create a Cache which collides with the DiskStore thread which is
> > still closing the Cache and DistributedSystem.
> >
> > java.lang.Throwable: KIRK GemFireCacheImpl closed 632230948
> >   at
> org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2365)
> >   at
> org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1917)
> >   at
> org.apache.geode.internal.cache.DiskStoreImpl.lambda$handleDiskAccessException$2(DiskStoreImpl.java:3380)
> >   at java.lang.Thread.run(Thread.java:748)
> >
> > java.lang.Throwable: KIRK InternalDistributedSystem closed 306674056
> >   at
> org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1637)
> >   at
> org.apache.geode.distributed.internal.InternalDistributedSystem.disconnect(InternalDistributedSystem.java:1225)
> >   at
> org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:2351)
> >   at
> org.apache.geode.internal.cache.GemFireCacheImpl.close(GemFireCacheImpl.java:1917)
> >   at
> org.apache.geode.internal.cache.DiskStoreImpl.lambda$handleDiskAccessException$2(DiskStoreImpl.java:3380)
> >   at java.lang.Thread.run(Thread.java:748)
> >
> > java.lang.Throwable: KIRK DiskStoreImpl closing cache 1555793073
> >   at
> org.apache.geode.internal.cache.DiskStoreImp

Re: Data ingestion with predefined buckets

2020-04-16 Thread Anilkumar Gingade
>> PutAllPRMessage.*

These are internal APIs/message protocols used to handle PartitionedRegion
messages. The messages are sent from the originator node to peer nodes to
operate on a given partitioned region; they are not intended as application
APIs.

We could consider looking at the code that determines the bucket id for each
of the putAll keys; if there is routing info that identifies a common data
store (bucket), the code could be optimized there...

My recommendation is still to use the existing APIs and to tune the putAll map
size. By reducing the map size, you push small chunks of data to the server
while the remaining data is still being prepared on the client, which keeps
both client and server busy at the same time. You can also look at tuning the
socket buffer size to fit your data size so that the data is
written/read in a single chunk.
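
As a rough illustration of that tuning (a sketch only -- the helper and its chunk
size are made up; the chunk size is the knob to experiment with):

import java.util.HashMap;
import java.util.Map;
import org.apache.geode.cache.Region;

public class ChunkedPutAll {

  // Split one large putAll into smaller chunks; chunkSize is the value to tune.
  public static <K, V> void putAllInChunks(Region<K, V> region, Map<K, V> data, int chunkSize) {
    Map<K, V> chunk = new HashMap<>(chunkSize);
    for (Map.Entry<K, V> entry : data.entrySet()) {
      chunk.put(entry.getKey(), entry.getValue());
      if (chunk.size() >= chunkSize) {
        region.putAll(chunk);  // server works on this chunk while the client builds the next one
        chunk.clear();
      }
    }
    if (!chunk.isEmpty()) {
      region.putAll(chunk);
    }
  }
}

Smaller chunks should also shorten how long any single putAll holds on to a bucket
(the waitUntilLocked contention mentioned earlier in the thread).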

-Anil


On Wed, Apr 15, 2020 at 7:01 PM steve mathew 
wrote:

> Anil, yes it's a kind of custom hash (which involves calculating a hash on all
> fields of the row). I have to stick to the predefined mechanism based on which
> the source files are generated.
>
> It would be a great help if someone could guide me to any available
> *server-side internal API that provides bucket-level data ingestion, if any*.
> While exploring I came across "*PartitionRegion.sendMsgByBucket(bucketId,
> PutAllPRMessage)*". Can this API internally take care of redundancy
> (ingestion into secondary buckets on peer nodes)?
>
> Can someone explain
> *PutAllPRMessage.operateOnPartitionedRegion(ClusterDistributionManager
> dm, PartitionedRegion pr,..)*? It seems this handles the putAll message from a
> peer. When is this required?
>
> Thanks
>
> Steve M.
>
> On Wed, Apr 15, 2020 at 11:06 PM Anilkumar Gingade 
> wrote:
>
> > About the API: I would not recommend using bucketId in the API, as it is
> > internal and there are other internal/external APIs that rely on bucket id
> > calculations, which could be compromised here.
> >
> > Instead of adding new APIs, probably looking at minimizing/reducing the
> > time spent may be a good start.
> >
> > BucketRegion.waitUntilLocked - A putAll thread could spend time here when
> > there are multiple threads acting upon the same bucket; one way to reduce
> > this is by tuning the putAll size. Can you try changing your putAll size
> > (say, start with 100)?
> >
> > I am wondering about the time spent in hashCode(); is it custom code?
> >
> > If you want to create the buckets upfront, you can try calling the
> method:
> > PartitionRegionHelper.assignBucketsToPartitions().
> >
> > -Anil
> >
> >
> > On Wed, Apr 15, 2020 at 8:37 AM steve mathew 
> > wrote:
> >
> > > Thanks Dan, Anil and Udo for your inputs. Extremely sorry for the late reply
> > > as I took a bit of time to explore and understand Geode internals.
> > >
> > > It seems the BucketRegion/Bucket terminology is not exposed to the user, but
> > > still I am trying to achieve something that is uncommon and for which no
> > > client API is exposed.
> > >
> > > *Details about Use-case/Client*
> > > - MultiThreadClient - Each task performs data ingestion on a specific bucket.
> > > Each task knows the bucket number to ingest data into. In short, the client
> > > knows the task-->bucket mapping.
> > > - Each task iteratively ingests data in (configurable) batches of 1000
> > > records into the bucket assigned to it.
> > > - Parallelism is achieved by running multiple tasks concurrently.
> > >
> > >
> > > *When I tried the existing R.putAll() API, I observed slow performance, and
> > > the related observations are*
> > >  - A few tasks take quite a long time (a thread dump shows the thread
> > > WAITING on BucketRegion.waitUntilLocked), hence the overall client takes
> > > longer.
> > >  - Code profiling shows a good amount of time spent during hash-code
> > > calculation. It seems key.hashCode() gets calculated on both client and
> > > server, which is not required for my use-case as the task-->bucket mapping
> > > is known beforehand.
> > >  - The putAll() client implementation takes care of parallelism (using a
> > > PRMetadata-enabled thread pool and reshuffling the keys internally), but in
> > > my case that's taken care of by multiple tasks, one per bucket, within my
> > > client.
> > >
> > > *I have forked the Geode codebase and am trying to extend it by providing a
> > > client API like:*
> > > //Region.java
> > > /**
> > >  * putAll records in specified bucket
> > >  */
> > > *public void putAll(int bucketId, Map<? extends K, ? extends V> map)*
> > >
> > > I have already added the client-side message and related code (similar to
> > > putAllOp and its impl), and I am adding the server-side code/BaseCommand,
> > > similar to the putAll code path (cmdExecute()/virtualPut() etc). *Is there
> > > any API (internal) that provides bucket-specific putAll and also takes care
> > > of redundancy - secondary bucket ingestion - which I can use/hook
> > > directly?*
> > >
> > > It seems that if I isolate bucket creation from the actual put flow (create
> > > the bucket prior to the putAll call) it may work better in my scenario, hence
> > > *Is

Re: [Discuss] Cache.close synchronous is not synchronous, but code still expects it to be....

2020-04-16 Thread Owen Nichols
Kirk, most IntegrationTest jobs run in 25-30 minutes, but I did see one 
<https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-pr/jobs/IntegrationTestOpenJDK11/builds/7202> 
that came in just under 45 minutes but did succeed.  It would be nice to know 
what test is occasionally taking longer and why…

Here’s an example of a previous timeout increase (Note that both the job 
timeout and the callstack timeout should be increased by the same amount): 
https://github.com/apache/geode/pull/4231

> On Apr 16, 2020, at 10:47 AM, Kirk Lund  wrote:
> 
> Unfortunately, IntegrationTest exceeds timeout every time I trigger it. The
> cause does not appear to be a specific test or hang. I
> think IntegrationTest has already been running very close to the timeout
> and is exceeding it fairly often even without my changes in #4963.
> 
> Should we increase the timeout for IntegrationTest? (Anyone know how to
> increase it?)



Re: [Discuss] Cache.close synchronous is not synchronous, but code still expects it to be....

2020-04-16 Thread Kirk Lund
It timed out while running OldFreeListOffHeapRegionJUnitTest, but I think
the tests before it were responsible for the timeout being exceeded. I
looked through all of the previously run tests and how long each took, but
without some sort of database of how long each test takes, it's
impossible to know which test or tests take longer in any given PR.

The IntegrationTest job that exceeded the timeout:
https://concourse.apachegeode-ci.info/builds/147866

The Test Summary for the above IntegrationTest job with Duration for each
test:
http://files.apachegeode-ci.info/builds/apache-develop-pr/geode-pr-4963/test-results/integrationTest/1587061092/

Unless we want to start tracking each test class/method and its Duration in
a database, I don't see how we could look for trends or changes to identify
test(s) that suddenly start taking longer. All of the tests take less than
3 minutes each, so unless one suddenly spikes to 10 minutes or more,
there's really no way to find the test(s).

On Thu, Apr 16, 2020 at 12:52 PM Owen Nichols  wrote:

> Kirk, most IntegrationTest jobs run in 25-30 minutes, but I did see one <
> https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-pr/jobs/IntegrationTestOpenJDK11/builds/7202>
> that came in just under 45 minutes but did succeed.  It would be nice to
> know what test is occasionally taking longer and why…
>
> Here’s an example of a previous timeout increase (Note that both the job
> timeout and the callstack timeout should be increased by the same amount):
> https://github.com/apache/geode/pull/4231
>
> > On Apr 16, 2020, at 10:47 AM, Kirk Lund  wrote:
> >
> > Unfortunately, IntegrationTest exceeds timeout every time I trigger it.
> The
> > cause does not appear to be a specific test or hang. I
> > think IntegrationTest has already been running very close to the timeout
> > and is exceeding it fairly often even without my changes in #4963.
> >
> > Should we increase the timeout for IntegrationTest? (Anyone know how to
> > increase it?)
>
>


Concurrent tests hitting OOME, hangs, etc

2020-04-16 Thread Kirk Lund
The concurrent tests (tests using the ConcurrentTestRunner from
geode-concurrency-test) are currently blocking both of my PRs.

1st PR is upgrading Mockito -- several Concurrent Tests use Mockito but the
runner executes too many tests in parallel and Mockito can't keep up with
it -- they hit OOME and hang the PR job

2nd PR is changing Cache.close() to block -- same as above except that
Mockito isn't upgraded in this PR

Other devs have recommended that I move the concurrent tests to a new src
set for various reasons -- such as: these tests try very hard to use all
CPUs even when running in parallel and most of them take several minutes to
run, and now they're blocking changes like upgrading Mockito. But creating
a new src set and new CI job is also a huge headache and I really don't
want to do it.

Anyone else up for moving them to new src sets? Unfortunately, this alone
won't make the tests work with the newer versions of Mockito -- and guess
what, they only OOME in the cloud (not locally).

Should I disable these tests?

Should I remove the use of ConcurrentTestRunner from these tests
(converting them to simple junit tests)?

I spent some time studying ConcurrentTestRunner and I'm sorry, but I don't
really see the value in this runner, which is why I want to remove the use
of it and convert these tests to plain junit tests.

I also already tried changing the tests to use Mockito features stubOnly,
resetMocks, and clearInlineMocks -- these changes prevented OOME in
HARegionQueueIntegrationTest which was the only
non-ConcurrentTestRunner-test that had problems with upgrading Mockito.
Unfortunately these changes have done nothing to improve the running of
the ConcurrentTestRunner tests.
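
For reference, this is roughly the shape of those changes (a sketch with a
hypothetical test class, not the actual diffs; clearInlineMocks() only matters
when the inline mock maker is in use):

import static org.junit.Assert.assertTrue;
import static org.mockito.Mockito.mock;
import static org.mockito.Mockito.when;
import static org.mockito.Mockito.withSettings;

import org.apache.geode.cache.Cache;
import org.junit.After;
import org.junit.Test;
import org.mockito.Mockito;

public class StubOnlyMockExampleTest {

  // stub-only mocks keep no invocation history, which reduces memory pressure
  private final Cache cache = mock(Cache.class, withSettings().stubOnly());

  @After
  public void clearMocks() {
    // drop state held by the inline mock maker between tests
    Mockito.framework().clearInlineMocks();
  }

  @Test
  public void stubbingStillWorksWithStubOnlyMocks() {
    when(cache.isClosed()).thenReturn(true);
    assertTrue(cache.isClosed());
  }
}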

Related Mockito links:
* https://code.google.com/archive/p/mockito/issues/84
* https://github.com/mockito/mockito/issues/1825

-Kirk


Re: Concurrent tests hitting OOME, hangs, etc

2020-04-16 Thread Jacob Barrett



> On Apr 16, 2020, at 2:16 PM, Kirk Lund  wrote:

> Anyone else up for moving them to new src sets? Unfortunately, this alone
> won't make the tests work with the newer versions of Mockito -- and guess
> what, they only OOME in the cloud (not locally).

I could take a stab at it since I split up all the test types originally. I 
should be able to brush off those cobwebs.

> Should I disable these tests?

In an effort to get the Mockito stuff through, yes disable them.

> Should I remove the use of ConcurrentTestRunner from these tests
> (converting them to simple junit tests)?
> 
> I spent some time studying ConcurrentTestRunner and I'm sorry, but I don't
> really see the value in this runner, which is why I want to remove the use
> of it and convert these tests to plain junit tests.

They have less value as unit tests since there are already unit tests that 
cover them. It does raise good questions about the value and implementation 
of micro concurrent testing like this. I think isolating them away is a good 
step before starting this conversation.




Re: Concurrent tests hitting OOME, hangs, etc

2020-04-16 Thread Dan Smith
I like the idea of a separate concurrency test suite. We have other tests
of concurrency that aren't using this runner that could also go there -
maybe we could establish some good conventions and figure out how to give
more CPU time to these tests. I'd actually like to see *more* tests of
concurrency at a lower level, so we can catch race conditions sooner.

> Should I disable these tests?

I don't think we should be marking any tests as ignored. Let's figure out
what is going on and make the appropriate fixes.

Can you clarify why these tests are blocking your Cache.close PR? Also,
which tests are getting an OOME with the mockito PR?

For a little clarity on the ConcurrentTestRunner, it just runs a test with
multiple threads some number of times. For example
FilterProfileConcurrencyTest.serializationOfFilterProfileWithConcurrentUpdateShouldSucceed
is testing a bug some of our customers hit where serializing an object that
was being concurrently modified failed. I think there is definitely value
in having those sorts of tests given the type of code we have - a lot of
classes that are mutable and need to be thread safe.
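
As a rough illustration (hypothetical names, plain JUnit threads rather than the
ConcurrentTestRunner API), that kind of test is essentially a writer thread
mutating shared state while the test thread repeatedly serializes it:

import java.io.ByteArrayOutputStream;
import java.io.ObjectOutputStream;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import org.junit.Test;

public class ConcurrentSerializationSketchTest {

  @Test
  public void serializationSucceedsWhileMapIsConcurrentlyUpdated() throws Exception {
    ConcurrentHashMap<String, Integer> state = new ConcurrentHashMap<>();
    AtomicBoolean done = new AtomicBoolean();
    ExecutorService pool = Executors.newSingleThreadExecutor();
    // writer thread keeps mutating the shared state
    Future<?> writer = pool.submit(() -> {
      int i = 0;
      while (!done.get()) {
        state.put("key-" + (i % 100), i++);
      }
    });
    try {
      // serialize repeatedly while the writer is running; this must not throw
      for (int i = 0; i < 1000; i++) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
          out.writeObject(state);
        }
      }
    } finally {
      done.set(true);
      writer.get(10, TimeUnit.SECONDS);
      pool.shutdownNow();
    }
  }
}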

-Dan


On Thu, Apr 16, 2020 at 2:47 PM Jacob Barrett  wrote:

>
>
> > On Apr 16, 2020, at 2:16 PM, Kirk Lund  wrote:
>
> > Anyone else up for moving them to new src sets? Unfortunately, this alone
> > won't make the tests work with the newer versions of Mockito -- and guess
> > what, they only OOME in the cloud (not locally).
>
> I could take a stab at it since I split up all the test types originally.
> I should be able to brush off those cobwebs.
>
> > Should I disable these tests?
>
> In an effort to get the Mockito stuff through, yes disable them.
>
> > Should I remove the use of ConcurrentTestRunner from these tests
> > (converting them to simple junit tests)?
> >
> > I spent some time studying ConcurrentTestRunner and I'm sorry, but I
> don't
> > really see the value in this runner, which is why I want to remove the
> use
> > of it and convert these tests to plain junit tests.
>
> They have less value as unit tests since there are already unit tests that
> cover them. It does raise good questions about the value and
> implementation of micro concurrent testing like this. I think isolating
> them away is a good step before starting this conversation.
>
>
>