Re: Failsafe: Killing self fork JVM. PING timeout elapsed.

Jason Young Wed, 20 Mar 2019 16:05:34 -0700

Mikael, sorry I do not appear to have permission to view the link.

I did some digging in the last couple of days. I see that the parent
process reads from stdin. I could not find anywhere that we are using
stdin. FWIW the failures nearly always happen at least 15m into a ~20m test
run, so perf is a likely culprit.


I see also that ForkedBooter reads commands from stdin in one thread, and
uses an executor service to check for a past ping in
ForkedBooter.listenToShutdownCommands(..). When it checks, it also sets
pingDone to false. The executor is configured to run up to 2 threads
concurrently to handle the workload, and is set to run at a fixed rate (not
a fixed delay). If the test suite is busy with testing and GC and has lots
of threads running, it's entirely possible that a thread won't have a
chance to run for a long time (e.g. 5s). Maybe instead of a 30s delay, the
VM gets around to checking for a ping every 35s over a long span of time.
Because we're running at a "fixed rate" and not a "fixed delay", then after
a couple of minutes we might be a full 30s behind schedule. It's possible
the executor will create another thread to run the scheduled task because
it's running behind schedule. This new thread checks for a ping, finds it,
and sets pingDone to false. But then the original thread also runs, say, 2
seconds afterwards, checks pingDone, and finds it is false.
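
To illustrate the fixed-rate concern, here is a minimal sketch. It is not surefire code: the class name, the 50 ms period and the 300 ms stall are made up for the demo and stand in for the 30s check and a long pause. As far as I can tell from the ScheduledExecutorService Javadoc, a fixed-rate task never runs concurrently with itself, but executions that fall behind start late and then run back-to-back, which would give the same "second check right after the first" effect.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Demo only: names and timings are made up, this is not surefire code.
class FixedRateVsFixedDelay {
    static final long PERIOD_MS = 50;  // stands in for the 30s ping check
    static final long STALL_MS = 300;  // stands in for a GC pause or CPU starvation

    static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    /** Smallest gap in ms between two consecutive task starts when the first run stalls. */
    static long smallestGapMillis(boolean fixedRate) {
        ScheduledExecutorService executor = Executors.newScheduledThreadPool(2);
        List<Long> starts = new CopyOnWriteArrayList<>();
        AtomicInteger runs = new AtomicInteger();
        Runnable task = () -> {
            starts.add(System.nanoTime());
            if (runs.getAndIncrement() == 0) {
                sleepQuietly(STALL_MS); // only the first execution is slow
            }
        };
        if (fixedRate) {
            executor.scheduleAtFixedRate(task, 0, PERIOD_MS, TimeUnit.MILLISECONDS);
        } else {
            executor.scheduleWithFixedDelay(task, 0, PERIOD_MS, TimeUnit.MILLISECONDS);
        }
        sleepQuietly(STALL_MS + 4 * PERIOD_MS); // let a few executions happen
        executor.shutdownNow();
        long smallest = Long.MAX_VALUE;
        for (int i = 1; i < starts.size(); i++) {
            smallest = Math.min(smallest, starts.get(i) - starts.get(i - 1));
        }
        return TimeUnit.NANOSECONDS.toMillis(smallest);
    }

    public static void main(String[] args) {
        System.out.println("fixed rate,  smallest gap (ms): " + smallestGapMillis(true));
        System.out.println("fixed delay, smallest gap (ms): " + smallestGapMillis(false));
    }
}
```

With a fixed rate the executor catches up on the missed executions, so the smallest gap between two checks should come out well below the period; with a fixed delay it should never be shorter than the configured delay.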

So to mitigate the problem, can we a) make the executor run only 1 thread
and b) schedule the task at a fixed delay instead of a fixed rate? For that
matter, is there another scheduled executor we can reuse? I understand why
checking for ping requires a separate executor. Should I ask in GitHub?
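
A minimal sketch of what I mean, assuming a single-threaded scheduler, a fixed delay, and an atomic read-and-clear of the flag. `PingWatchdog` and its methods are hypothetical names for illustration; this is not how ForkedBooter is actually written.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical watchdog for illustration; not surefire's ForkedBooter.
class PingWatchdog {
    // Starts as true so the first check acts as a grace period.
    private final AtomicBoolean pingDone = new AtomicBoolean(true);
    // One thread only, so two checks can never overlap.
    private final ScheduledExecutorService executor =
            Executors.newSingleThreadScheduledExecutor();

    /** Called by the command-reading thread whenever a ping arrives. */
    void ping() {
        pingDone.set(true);
    }

    /** Atomically reads and clears the flag; true if a ping arrived since the last check. */
    boolean checkAndReset() {
        return pingDone.getAndSet(false);
    }

    /** Checks for a ping with a full delay between the end of one check and the next. */
    void start(long delay, TimeUnit unit, Runnable onTimeout) {
        executor.scheduleWithFixedDelay(() -> {
            if (!checkAndReset()) {
                onTimeout.run();
            }
        }, delay, delay, unit);
    }

    void stop() {
        executor.shutdownNow();
    }
}
```

Because the delay is measured from the end of the previous check, a check that runs late pushes the next one back instead of being followed by an immediate catch-up check.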

Regarding a previous question, I found out that Alpine's Maven package
comes with an /etc/mavenrc that sets `MAVEN_OPTS="$MAVEN_OPTS -Xmx512m"`
which cannot be undone by setting `MAVEN_OPTS` at the command line; you end
up with e.g. `-Xmx1g -Xmx512m`. (Note this applies to the Maven (parent)
process, not the surefire/failsafe (child) process.)
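
In case it helps anyone check which heap limit a JVM actually ended up with, here is a small sketch. `EffectiveHeapCheck` and `lastXmx` are names I made up; the sketch assumes, as the resulting command line above suggests, that the later `-Xmx` flag is the one that takes effect, and it has to be run with the same options that the JVM in question receives.

```java
import java.lang.management.ManagementFactory;
import java.util.List;

// Illustrative helper; class and method names are made up.
class EffectiveHeapCheck {
    /** Returns the last -Xmx flag in the list, or null if there is none. */
    static String lastXmx(List<String> jvmArgs) {
        String last = null;
        for (String arg : jvmArgs) {
            if (arg.startsWith("-Xmx")) {
                last = arg;
            }
        }
        return last;
    }

    public static void main(String[] args) {
        // Arguments the running JVM was started with, in command-line order.
        List<String> jvmArgs = ManagementFactory.getRuntimeMXBean().getInputArguments();
        System.out.println("JVM arguments:  " + jvmArgs);
        System.out.println("Last -Xmx flag: " + lastXmx(jvmArgs));
        // What the JVM itself reports as its maximum heap.
        System.out.println("Max heap (MiB): " + Runtime.getRuntime().maxMemory() / (1024 * 1024));
    }
}
```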

On Wed, Mar 20, 2019 at 3:46 AM Bernd Eckenfels <[email protected]>
wrote:

> I guess a timeout caused by FullGC can happen with TCP as well. Increasing
> the timeout might not be nice but does look like it would help in both
> cases. (Problems with stdout are more related to unexpected JVM messages I
> guess)
>
> Regards
> Bernd
> --
> http://bernd.eckenfels.net
>
> ________________________________
> From: Mikael Åsberg <[email protected]>
> Sent: Wednesday, March 20, 2019 9:40 AM
> To: Maven Users List
> Subject: Re: Failsafe: Killing self fork JVM. PING timeout elapsed.
>
> These issues regarding communication with forked JVMs, won't they be
> resolved once surefire moves to interprocess communication using
> tcp/ip sockets? This happens to be the target feature to be included
> in the next surefire 3.0.0 milestone:
> https://issues.apache.org/jira/projects/SUREFIRE/versions/12344668
>
> There are soooo many issues relating to surefire reading stdout of
> forked processes (which, as I understand it, is what it currently
> does). Many of us are really looking forward to the next milestone.
>
> On Tue, Mar 19, 2019 at 8:59 PM Jason Young <[email protected]>
> wrote:
> >
> > Getting back to my original questions, I know that "ping" means to see
> > if a process is there, and "NOOP" implies it's not a command to do
> > anything. But what do the terms "ping" and "NOOP" mean in this context,
> > i.e. how do the processes communicate? I assume they don't sonar. Do
> > other processes also ping NOOPs? Can I PING Chrome with a NOOP from
> > bash? Is it with TCP?
> >
> > I'm confused about what I should do regarding GC pauses. Previously I
> > had code that would write the amount of remaining heap space (or
> > something like that) to stdout after every test to troubleshoot OOMEs.
> > Can writing to stdout cause the communication failure somehow?
> >
> > On Wed, Mar 13, 2019 at 5:57 PM Jason Young <[email protected]>
> > wrote:
> >
> > > I upgraded failsafe and surefire to 3.0.0-M3 as advised; we encountered
> > > the same exception. (Still using -Xmx5g, will switch to OpenJ9 soon in
> > > case that helps.)
> > >
> > > BTW I also asked on StackOverflow previously, for anyone interested:
> > > https://stackoverflow.com/questions/54755846/killing-self-fork-jvm-ping-timeout-elapsed
> > >
> > > On Tue, Feb 26, 2019 at 6:40 PM Jason Young <[email protected]>
> > > wrote:
> > >
> > >> Thanks again for the information.
> > >>
> > >> We had increased the RAM to 3g some time ago to prevent OOMEs. More
> > >> recently, I increased the RAM again to 5g for extra headroom, since we
> > >> had memory to spare; the problem hasn't happened since, but it hasn't
> > >> been very long.
> > >>
> > >> We use a more customized image based on Alpine 3.8.2. The JDK and
> > >> Maven are obtained via apk.
> > >>
> > >> I will try upgrading failsafe (and surefire while I'm at it) sooner,
> > >> and probably do some experimentation with JVMs another time (not
> > >> pressing for me ATM).
> > >>
> > >> On Tue, Feb 26, 2019 at 12:20 PM Tibor Digana <[email protected]>
> > >> wrote:
> > >>
> > >>> >> I'll try to enable some logging about GC pauses to see what's up
> > >>>
> > >>> Please do not keep such a setting after tuning the GC, because it may
> > >>> sometimes break the interprocess communication between the Maven
> > >>> process and the surefire process.
> > >>> It's worth writing the GC information to a file rather than to the
> > >>> console logs. This can be configured, I guess.
> > >>>
> > >>> >> Do you think the value is simply too low?
> > >>>
> > >>> GCing many objects may take some time, and I remember we had a user
> > >>> who had this problem a year or two ago.
> > >>> As a fix, we check every third NOOP (which is 3 x 10 sec) instead of
> > >>> every NOOP. So 30 seconds looked satisfactory.
> > >>> I think you use an old version, 2.20 or something like that. The
> > >>> fixes for docker have been made since then, so please use the latest
> > >>> version, 3.0.0-M3.
> > >>> See this page:
> > >>> https://maven.apache.org/surefire/maven-surefire-plugin/docker.html
> > >>> We used maven:3.5.3-jdk-8-alpine in this test. Which base image did
> > >>> you use?
> > >>> Cheers
> > >>> Tibor
> > >>>
> > >>> On Tue, Feb 26, 2019 at 5:24 PM Jason Young <[email protected]>
> > >>> wrote:
> > >>>
> > >>> > Thanks for the information. It's good to see someone understands a
> > >>> > little about this.
> > >>> >
> > >>> > Incidentally, we have been looking at other GCs and VMs for the
> > >>> > application in production environments, so I'll look into how these
> > >>> > affect tests as well. I'll try to enable some logging about GC pauses
> > >>> > to see what's up.
> > >>> >
> > >>> > How would `-Xmx3g` cause long GC cycles? Do you think the value is
> > >>> > simply too low?
> > >>> >
> > >>> > FWIW we're running the Maven build in an Alpine-based Docker
> > >>> > container.
> > >>> >
> > >>> > On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <[email protected]>
> > >>> > wrote:
> > >>> >
> > >>> > > Hi Jason,
> > >>> > >
> > >>> > > We spoke about this issue on our chat in ASF Slack:
> > >>> > > "I think his tests have been paused for long GC periods and timed
> > >>> > > out (3x PING period = 30 seconds). After this period the forked JVM
> > >>> > > assumed the Maven process had been killed by JenkinsCI, and
> > >>> > > therefore all surefire processes are killed as well and all the
> > >>> > > file handles and memory are freed."
> > >>> > >
> > >>> > > "But I have to say that `-Xmx3g` may cause long GC cycles, see
> > >>> > > https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html"
> > >>> > >
> > >>> > > You are using java-1.8-openjdk. I guess you should use Shenandoah
> > >>> > > GC, which is an experimental algorithm in JVM 1.8. This would
> > >>> > > significantly shorten the GC cycles.
> > >>> > >
> > >>> > > We should of course provide a new configuration parameter to give
> > >>> > > you a chance to prolong the PING.
> > >>> > >
> > >>> > > Cheers
> > >>> > > Tibor
> > >>> > >
> > >>> >
> > >>> >
> > >>> > --
> > >>> >
> > >>> > Jason Young
> > >>> >
> > >>>
> > >>
> > >>
> >
> > --
> > Jason Young
> > Software Engineer | PROCENTIVE
> > [image: Phone] 715 245 8000 x7609
> > [image: Mobile] 706 870 3540
> > [image: Web] procentive.com
> > Confidentiality Notice: This message is intended for the sole use of the
> > individual and entity to which it is addressed, and may contain
> > information that is privileged, confidential and exempt from disclosure
> > under applicable law. Any unauthorized review, use, disclosure or
> > distribution of this email message, including any attachment, is
> > prohibited. If you are not the intended recipient, please advise the
> > sender by reply email and destroy all copies of the original message.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>
