Mikael, sorry, I do not appear to have permission to view the link. I did some digging over the last couple of days. I see that the parent process reads from stdin. I could not find anywhere that we are using stdin. FWIW the failures nearly always happen at least 15 minutes into a ~20 minute test run, so perf is a likely culprit.
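
Some background for the fixed-rate vs. fixed-delay question in the rest of this message. This is only a minimal sketch using plain java.util.concurrent; the names SchedulingDemo, PERIOD_MS, STALL_MS and minGapMillis are made up for illustration and are not surefire code. As far as I can tell from the JDK docs, a single periodic task never overlaps with itself, even with 2 threads in the pool, but after a stall a fixed-rate task fires its overdue runs back to back to catch up. That would still let one check follow another almost immediately. A fixed-delay task always waits the full delay after the previous run ends.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical demo; the class, constants and method are made up and are not surefire code.
final class SchedulingDemo {
    static final long PERIOD_MS = 100; // stands in for the 30s ping check
    static final long STALL_MS = 350;  // stands in for a long GC pause or a starved thread
    static final int RUNS = 5;         // number of executions to observe

    /** Smallest start-to-start gap, in ms, over the first RUNS executions. */
    static long minGapMillis(boolean fixedRate) {
        ScheduledExecutorService executor = Executors.newScheduledThreadPool(2);
        List<Long> starts = new CopyOnWriteArrayList<>();
        CountDownLatch done = new CountDownLatch(RUNS);
        Runnable check = () -> {
            if (starts.size() >= RUNS) {
                return; // ignore executions beyond the ones we measure
            }
            starts.add(System.nanoTime());
            if (starts.size() == 1) {
                pause(STALL_MS); // only the first execution is slow
            }
            done.countDown();
        };
        try {
            if (fixedRate) {
                executor.scheduleAtFixedRate(check, 0, PERIOD_MS, TimeUnit.MILLISECONDS);
            } else {
                executor.scheduleWithFixedDelay(check, 0, PERIOD_MS, TimeUnit.MILLISECONDS);
            }
            if (!done.await(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("timed out waiting for executions");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } finally {
            executor.shutdownNow();
        }
        long minGapNanos = Long.MAX_VALUE;
        for (int i = 1; i < RUNS; i++) {
            minGapNanos = Math.min(minGapNanos, starts.get(i) - starts.get(i - 1));
        }
        return TimeUnit.NANOSECONDS.toMillis(minGapNanos);
    }

    private static void pause(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        System.out.println("fixed rate,  min gap ms: " + minGapMillis(true));
        System.out.println("fixed delay, min gap ms: " + minGapMillis(false));
    }
}
```

On an otherwise idle machine I would expect the fixed-rate gap to come out near 0 ms (the catch-up burst) and the fixed-delay gap to be at least PERIOD_MS, but the exact numbers depend on the machine.
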
I also see that ForkedBooter reads commands from stdin in one thread, and uses an executor service to check for a past ping in ForkedBooter.listenToShutdownCommands(..). When it checks, it also sets pingDone to false. The executor is configured to run up to 2 threads concurrently, and the check is scheduled at a fixed rate (not a fixed delay).

If the test suite is busy with testing and GC and has lots of threads running, it's entirely possible that a thread won't get a chance to run for a long time (e.g. 5s). Maybe instead of every 30s, the VM gets around to checking for a ping every 35s over a long span of time. Because we're running at a "fixed rate" and not a "fixed delay", after a couple of minutes we might be a full 30s behind schedule. It's possible the executor will create another thread to run the scheduled task because it's running behind schedule. This new thread checks for a ping, finds it, and sets pingDone to false. But then the original thread also runs, say, 2 seconds afterwards, checks pingDone, and finds it is false.

So to mitigate the problem, can we a) make the executor run only 1 thread and b) schedule the task at a fixed delay instead of a fixed rate? For that matter, is there another scheduled executor we can reuse? I understand why checking for ping requires a separate executor. Should I ask on GitHub?

Regarding a previous question, I found out that Alpine's Maven package comes with an /etc/mavenrc that sets `MAVEN_OPTS="$MAVEN_OPTS -Xmx512m"`, which cannot be undone by setting `MAVEN_OPTS` at the command line; you end up with e.g. `-Xmx1g -Xmx512m`. (Note this applies to the Maven (parent) process, not the surefire/failsafe (child) process.)

On Wed, Mar 20, 2019 at 3:46 AM Bernd Eckenfels <[email protected]> wrote:

> I guess a timeout caused by FullGC can happen with TCP as well. Increasing
> the timeout might not be nice but does look like it would help in both
> cases.
> (Problems with stdout are more related to unexpected JVM messages I guess)
>
> Regards
> Bernd
> --
> http://bernd.eckenfels.net
>
> ________________________________
> From: Mikael Åsberg <[email protected]>
> Sent: Wednesday, March 20, 2019 9:40 AM
> To: Maven Users List
> Subject: Re: Failsafe: Killing self fork JVM. PING timeout elapsed.
>
> These issues regarding communication with forked JVMs, won't they be
> resolved once surefire moves to interprocess communication using
> tcp/ip sockets? This happens to be the target feature to be included
> in the next surefire 3.0.0 milestone:
> https://issues.apache.org/jira/projects/SUREFIRE/versions/12344668
>
> There are soooo many issues relating to surefire reading stdout of
> forked processes (which is my understanding that it is currently
> doing). Many of us are really looking forward to the next milestone.
>
> On Tue, Mar 19, 2019 at 8:59 PM Jason Young <[email protected]>
> wrote:
> >
> > Getting back to my original questions, I know that "ping" means to see if a
> > process is there, and "NOOP" implies it's not a command to do anything. But
> > what do the terms "ping" and "NOOP" mean in this context, i.e. how do the
> > processes communicate? I assume they don't sonar. Do other processes also
> > ping NOOPs? Can I PING Chrome with a NOOP from bash? Is it with TCP?
> >
> > I'm confused about what I should do regarding GC pauses. Previously I had
> > code that would write the amount of remaining heap space (or something like
> > that) to stdout after every test to troubleshoot OOMEs. Can writing to
> > stdout cause the communication failure somehow?
> >
> > On Wed, Mar 13, 2019 at 5:57 PM Jason Young <[email protected]>
> > wrote:
> >
> > > I upgraded failsafe and surefire to 3.0.0-M3 as advised; we encountered
> > > the same exception. (Still using -Xmx5g, will switch to OpenJ9 soon in case
> > > that helps.)
> > >
> > > BTW I also asked on StackOverflow previously, for anyone interested:
> > > https://stackoverflow.com/questions/54755846/killing-self-fork-jvm-ping-timeout-elapsed
> > >
> > > On Tue, Feb 26, 2019 at 6:40 PM Jason Young <[email protected]>
> > > wrote:
> > >
> > >> Thanks again for the information.
> > >>
> > >> We had increased the RAM to 3g some time ago to prevent OOMEs. More
> > >> recently, I increased the RAM again to 5g for extra headroom since we had
> > >> more headroom available; the problem hasn't happened since, but it hasn't
> > >> been very long.
> > >>
> > >> We use a more customized image based on Alpine 3.8.2. The JDK and Maven
> > >> are obtained via apk.
> > >>
> > >> I will try upgrading failsafe (and surefire while I'm at it) sooner, and
> > >> probably do some experimentation with JVMs another time (not pressing for
> > >> me ATM).
> > >>
> > >> On Tue, Feb 26, 2019 at 12:20 PM Tibor Digana <[email protected]>
> > >> wrote:
> > >>
> > >>> >> I'll try to enable some logging about GC pauses to see what's up
> > >>>
> > >>> Pls do not keep such setting after tuning the GC because this may sometime
> > >>> break the interprocess communication between Maven process and surefire
> > >>> process.
> > >>> It's worth to list GC information in a file and not in the console logs.
> > >>> This can be configured, I guess.
> > >>>
> > >>> >> Do you think the value is simply too low?
> > >>>
> > >>> GCing many objects may take some time and I remember we had a user who had
> > >>> this problem a year or two ago.
> > >>> We check every third NOOP (which is 3 x 10 sec) as a fix instead of every
> > >>> NOOP. So 30 seconds looked satisfactory.
> > >>> I think you use old version 2.20 or something like that. The fixes for
> > >>> docker have been done so far, so please use the latest version 3.0.0-M3.
> > >>> See this page
> > >>> https://maven.apache.org/surefire/maven-surefire-plugin/docker.html, we
> > >>> used maven:3.5.3-jdk-8-alpine in this test. Which base image did you use?
> > >>>
> > >>> Cheers
> > >>> Tibor
> > >>>
> > >>> On Tue, Feb 26, 2019 at 5:24 PM Jason Young <[email protected]>
> > >>> wrote:
> > >>>
> > >>> > Thanks for the information. It's good to see someone understands a little
> > >>> > about this.
> > >>> >
> > >>> > Incidentally, we have been looking at other GCs and VMs for the application
> > >>> > in production environments, so I'll look into how these affect tests as
> > >>> > well. I'll try to enable some logging about GC pauses to see what's up.
> > >>> >
> > >>> > How would `-Xmx3g` cause long GC cycles? Do you think the value is simply
> > >>> > too low?
> > >>> >
> > >>> > FWIW we're running the Maven build in an Alpine-based Docker container.
> > >>> >
> > >>> > On Sat, Feb 23, 2019 at 6:36 AM Tibor Digana <[email protected]>
> > >>> > wrote:
> > >>> >
> > >>> > > Hi Jason,
> > >>> > >
> > >>> > > We spoke about this issue on our chat in ASF Slack:
> > >>> > > "I think his tests have been paused for a long GC periods and timed out 3x
> > >>> > > PING period = 30 seconds. After this period forked JVM supposed the Maven
> > >>> > > process was killed by JenkinsCI and therefore all surefire processes are
> > >>> > > killed as well and all the file handlers and memory consumptions are
> > >>> > > freed."
> > >>> > >
> > >>> > > "But I have to say that `-Xmx3g` may cause long GC cycles, see
> > >>> > > https://maven.apache.org/surefire/maven-surefire-plugin/examples/shutdown.html"
> > >>> > >
> > >>> > > You are using java-1.8-openjdk. I guess you should use Shenandoah GC which
> > >>> > > is an experimental algorithm in JVM 1.8. This would significantly short
> > >>> > > the GC cycles.
> > >>> > >
> > >>> > > We should of course provide a new configuration parameter to give you a
> > >>> > > chance to prolong the PING.
> > >>> > >
> > >>> > > Cheers
> > >>> > > Tibor
> > >>> >
> > >>> > --
> > >>> >
> > >>> > Jason Young
> >
> > --
> > Jason Young
> > Software Engineer | PROCENTIVE
