[ https://issues.apache.org/jira/browse/SUREFIRE-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tibor Digana updated SUREFIRE-1719:
-----------------------------------
    Fix Version/s: 3.0.0-M5

> Race condition results in "VM crash or System.exit called?" failure
> -------------------------------------------------------------------
>
>                 Key: SUREFIRE-1719
>                 URL: https://issues.apache.org/jira/browse/SUREFIRE-1719
>             Project: Maven Surefire
>          Issue Type: Bug
>          Components: Maven Surefire Plugin
>    Affects Versions: 2.20, 2.20.1, 2.21.0, 2.22.0, 2.22.1, 2.22.2, 3.0.0-M2,
>                      3.0.0-M1, 3.0.0-M3
>            Reporter: Paul Millar
>            Priority: Major
>             Fix For: 3.0.0-M5
>
>         Attachments: build-error-debug.out, build.out, pom.xml
>
>
> After upgrading surefire in our project (dCache) from 2.19.1 to 3.0.0-M3,
> unit tests started to fail with the message "ExecutionException The forked VM
> terminated without properly saying goodbye. VM crash or System.exit called?"
> For reference, the command I am using to verify this problem is "mvn -am -pl
> modules/common clean package" and the surefire configuration is:
> {{<plugin>}}
> {{  <groupId>org.apache.maven.plugins</groupId>}}
> {{  <artifactId>maven-surefire-plugin</artifactId>}}
> {{  <configuration>}}
> {{    <includes>}}
> {{      <include>**/*Test.class</include>}}
> {{      <include>**/*Tests.class</include>}}
> {{    </includes>}}
> {{    <!-- dCache uses the singleton anti-pattern in way}}
> {{         too many places. That unfortunately means we have}}
> {{         to accept the overhead of forking each test run. -->}}
> {{    <forkCount>1C</forkCount>}}
> {{    <reuseForks>false</reuseForks>}}
> {{  </configuration>}}
> {{</plugin>}}
> [The complete pom.xml is attached.]
> This problem is not always present. On our build machine, I've seen the
> problem appear 6 out of 10 times when running the above mvn command. There is
> (apparently) little that seems to influence whether the build will succeed or
> fail.
> [I've attached the complete output from running the above mvn command, both
> the normal output and including the -e -X options.]
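As a rough illustration of what the forkCount value in the configuration above asks for: the Surefire documentation describes a trailing "C" as multiplying the number by the available CPU cores. The sketch below works through that arithmetic. The class ForkCountSketch and its resolve method are hypothetical names, not part of Surefire, and truncating the product to an integer is an assumption; the plugin's own parsing and rounding may differ.

```java
// Hypothetical helper, NOT Surefire's own code: resolves a forkCount string
// such as "6" or "1C" to a number of forked JVMs.
final class ForkCountSketch {

    /** Resolves forkCount for the given core count; a trailing "C" multiplies by cores. */
    static int resolve(String forkCount, int availableCores) {
        String trimmed = forkCount.trim();
        if (trimmed.endsWith("C")) {
            // "1C" means one fork per core. Truncation to int is an assumption here.
            String number = trimmed.substring(0, trimmed.length() - 1);
            double multiplier = Double.parseDouble(number);
            return (int) (multiplier * availableCores);
        }
        // A plain number is taken as an absolute fork count.
        return Integer.parseInt(trimmed);
    }

    public static void main(String[] args) {
        int cores = Runtime.getRuntime().availableProcessors();
        System.out.println("1C on this machine: " + resolve("1C", cores) + " forks");
        System.out.println("1C on 24 cores:     " + resolve("1C", 24) + " forks");
        System.out.println("6 on 24 cores:      " + resolve("6", 24) + " forks");
    }
}
```

Under these assumptions, the same pom.xml requests three times as many concurrent forks on a 24-core build machine as on an 8-core desktop.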
> The problem seems to appear only on machines with a "large" number of cores.
> Our build machine has 24 cores, and I've seen a report of a similar problem
> when building dCache on a 48-core machine. On the other hand, I have been
> unable to reproduce the problem on my desktop machine (8 cores) or on my
> laptop (4 cores).
> What seems to matter is the number of JVM instances actually running.
> I have not been able to reproduce the problem by increasing the forkCount on
> a machine with a small number of cores. However, I've noticed that, on an
> 8-core machine, increasing the forkCount does not actually result in that many
> more JVM instances running.
> Similarly, experience shows that reducing the number of concurrent JVM
> instances "fixes" the problem. A forkCount of 6 seems to bring the likelihood
> of a problem below 10% (0 failures in 10 builds) on our build machine. On
> this machine, the default configuration would try to run 24 JVM instances
> concurrently (forkCount of "1C" on a 24-core machine).
> The problem appears to have been introduced in surefire v2.20. When building
> with surefire v2.19.1, the above mvn command is always successful on our
> build machine. Building with surefire v2.20 results in intermittent failures
> (~60% failure rate).
> Using git bisection (and with the criterion for "good" as zero failures in 10
> build attempts), I was able to determine that commit da7ff6aa2 "SUREFIRE-1342
> Acknowledge normal exit of JVM and drain shared memory between processes" is
> the first commit where surefire has this intermittent failure behaviour.
> From a casual scan through the patch, my guess is that the BYE_ACK support it
> introduces is somehow racy (for example, reading or updating a field-member
> outside of a monitor) and problems are triggered if there are a large number
> of JVMs exiting concurrently.
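To make the guessed class of bug concrete, here is a minimal sketch of the pattern "reading a field outside of a monitor", next to a guarded alternative. It is purely hypothetical: ExitHandshakeSketch and all of its members are invented for illustration, none of it is taken from the Surefire patch, and it does not show that Surefire behaves this way.

```java
// Hypothetical illustration only: these names do not come from Surefire.
final class ExitHandshakeSketch {
    private boolean goodbyeSeen; // intended to be guarded by "this"

    /** Called by the thread that reads the forked JVM's output when it sees the goodbye. */
    synchronized void markGoodbyeSeen() {
        goodbyeSeen = true;
        notifyAll();
    }

    /** Racy pattern: one-shot read outside the monitor; a late writer looks like a crash. */
    boolean checkOnceUnguarded() {
        return goodbyeSeen;
    }

    /** Guarded alternative: wait under the monitor until the flag is set or time runs out. */
    synchronized boolean awaitGoodbye(long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        try {
            while (!goodbyeSeen) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    return false;
                }
                // wait(0) would block forever, so wait at least 1 ms.
                wait(Math.max(1L, remainingNanos / 1_000_000L));
            }
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) throws InterruptedException {
        ExitHandshakeSketch handshake = new ExitHandshakeSketch();
        Thread reader = new Thread(() -> {
            try {
                Thread.sleep(50); // simulate a reader thread that is slow to be scheduled
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            handshake.markGoodbyeSeen();
        });
        reader.start();
        System.out.println("unguarded one-shot check: " + handshake.checkOnceUnguarded());
        System.out.println("guarded wait:             " + handshake.awaitGoodbye(5_000));
        reader.join();
    }
}
```

In this sketch the one-shot check depends entirely on thread scheduling, which is the kind of behaviour that would become more likely to fail as more JVMs and reader threads compete for the same cores.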
> So, with an increased number of concurrent JVMs there is an increased risk
> of a thread losing the race, and so triggering this error.
> Such a problem would be consistent with observed behaviour. However, I don't
> have any strong evidence that this is what is happening.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)