[ https://issues.apache.org/jira/browse/GEODE-5238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Dan Smith resolved GEODE-5238.
------------------------------
    Resolution: Fixed

We reduced the parallelism of DistributedTest in 1d8c322574f52ca35; the higher parallelism was causing some of these loss-of-quorum issues.

> CI failure: SerialWANPropagationsFeatureDUnitTest.testSerialPropagationWithFilter fails with Possible loss of quorum suspect string
> -----------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-5238
>                 URL: https://issues.apache.org/jira/browse/GEODE-5238
>             Project: Geode
>          Issue Type: Bug
>          Components: membership
>            Reporter: Barry Oglesby
>            Priority: Major
>              Labels: swat
>         Attachments: TEST-org.apache.geode.internal.cache.wan.serial.SerialWANPropagationsFeatureDUnitTest.xml
>
>
> The members start up:
> {noformat}
> [vm1] [info 2018/05/19 02:18:54.055 UTC <RMI TCP Connection(3)-172.17.0.5> tid=27] Starting DistributionManager 172.17.0.5(154:locator)<ec><v0>:32771. (took 49 ms)
> [vm2] [info 2018/05/19 02:18:54.479 UTC <RMI TCP Connection(2)-172.17.0.5> tid=27] Starting DistributionManager 172.17.0.5(159)<v1>:32772. (took 336 ms)
> [vm3] [info 2018/05/19 02:18:55.477 UTC <RMI TCP Connection(3)-172.17.0.5> tid=27] Starting DistributionManager 172.17.0.5(168)<v2>:32773. (took 401 ms)
> [vm3] [info 2018/05/19 02:18:55.478 UTC <RMI TCP Connection(3)-172.17.0.5> tid=27] Initial (distribution manager) view = View[172.17.0.5(154:locator)<ec><v0>:32771|2] members: [172.17.0.5(154:locator)<ec><v0>:32771, 172.17.0.5(159)<v1>:32772{lead}, 172.17.0.5(168)<v2>:32773]
> {noformat}
> So the members and their membership weights are:
> {noformat}
> vm1 - locator, coordinator (weight=3)
> vm2 - lead member (weight=15)
> vm3 - member (weight=10)
> {noformat}
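> For reference, the quorum arithmetic behind the "15 of 25" warning further down works out like this: the initial view carries 3 + 15 + 10 = 28 weight; vm1's graceful departure removes its 3 without counting as lost weight, leaving a view weight of 25; vm2's abrupt departure then loses 15 of that 25 (60%), more than half, so the view creator declares quorum lost. A minimal sketch of that arithmetic (the class and variable names are illustrative, not Geode's membership implementation, and the exact threshold constant lives in Geode's membership code):
> {code:java}
> // Illustrative sketch of the weighted-quorum arithmetic from this ticket.
> // Weights follow the scheme above: locator = 3, lead member = 15, member = 10.
> public class QuorumArithmeticSketch {
>   public static void main(String[] args) {
>     int locatorWeight = 3;      // vm1
>     int leadMemberWeight = 15;  // vm2
>     int memberWeight = 10;      // vm3
>
>     // vm1 leaves gracefully, so the surviving view's total weight is 25
>     // and vm1's 3 does not count as lost.
>     int previousViewWeight = leadMemberWeight + memberWeight; // 25
>
>     // vm2's connection drops mid-shutdown, so vm3 treats its 15 as lost.
>     int lostWeight = leadMemberWeight; // 15
>
>     // Losing more than half of the previous view's weight => quorum lost.
>     boolean quorumLost = lostWeight * 2 > previousViewWeight; // 30 > 25
>     System.out.printf("total weight lost is %d of %d -> quorum lost: %b%n",
>         lostWeight, previousViewWeight, quorumLost);
>   }
> }
> {code}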
> The WANTestBase cleanupVM method shuts down the members simultaneously. Log messages show vm1 (the locator) shutting down first, and the other two members see it leave gracefully:
> {noformat}
> [vm1] [info 2018/05/19 02:19:10.400 UTC <RMI TCP Connection(4)-172.17.0.5> tid=27] Shutting down DistributionManager 172.17.0.5(154:locator)<ec><v0>:32771.
> [vm1] [info 2018/05/19 02:19:10.513 UTC <RMI TCP Connection(4)-172.17.0.5> tid=27] Now closing distribution for 172.17.0.5(154:locator)<ec><v0>:32771
> [vm1] [info 2018/05/19 02:19:10.513 UTC <RMI TCP Connection(4)-172.17.0.5> tid=27] Stopping membership services
> [vm1] [info 2018/05/19 02:19:10.513 UTC <RMI TCP Connection(4)-172.17.0.5> tid=27] Now closing distribution for 172.17.0.5(154:locator)<ec><v0>:32771
> [vm2] [info 2018/05/19 02:19:10.420 UTC <Pooled High Priority Message Processor 1> tid=310] Member at 172.17.0.5(154:locator)<ec><v0>:32771 gracefully left the distributed cache: shutdown message received
> [vm3] [info 2018/05/19 02:19:10.622 UTC <Pooled High Priority Message Processor 1> tid=310] Member at 172.17.0.5(154:locator)<ec><v0>:32771 gracefully left the distributed cache: shutdown message received
> {noformat}
> Then vm2 becomes the coordinator:
> {noformat}
> [vm2] [info 2018/05/19 02:19:10.402 UTC <Pooled High Priority Message Processor 1> tid=310] This member is becoming the membership coordinator with address 172.17.0.5(159)<v1>:32772
> {noformat}
> Then vm2 shuts down its DistributionManager:
> {noformat}
> [vm2] [info 2018/05/19 02:19:11.239 UTC <RMI TCP Connection(2)-172.17.0.5> tid=27] Shutting down DistributionManager 172.17.0.5(159)<v1>:32772.
> [vm2] [info 2018/05/19 02:19:11.352 UTC <RMI TCP Connection(2)-172.17.0.5> tid=27] Now closing distribution for 172.17.0.5(159)<v1>:32772
> [vm2] [info 2018/05/19 02:19:11.352 UTC <RMI TCP Connection(2)-172.17.0.5> tid=27] Stopping membership services
> [vm2] [info 2018/05/19 02:19:11.377 UTC <RMI TCP Connection(2)-172.17.0.5> tid=27] DistributionManager stopped in 138ms.
> [vm2] [info 2018/05/19 02:19:11.377 UTC <RMI TCP Connection(2)-172.17.0.5> tid=27] Marking DistributionManager 172.17.0.5(159)<v1>:32772 as closed.
> {noformat}
> Then vm3 performs a final suspect check on vm2 (whose shared connection closed unexpectedly) and logs the quorum-loss warning:
> {noformat}
> [vm3] [info 2018/05/19 02:19:11.414 UTC <P2P message reader for 172.17.0.5(159)<v1>:32772 shared unordered uid=14 port=35502> tid=296] Performing final check for suspect member 172.17.0.5(159)<v1>:32772 reason=member unexpectedly shut down shared, unordered connection
> [vm3] [info 2018/05/19 02:19:11.434 UTC <P2P message reader for 172.17.0.5(159)<v1>:32772 shared unordered uid=14 port=35502> tid=296] This member is becoming the membership coordinator with address 172.17.0.5(168)<v2>:32773
> [vm3] [info 2018/05/19 02:19:11.436 UTC <Geode Membership View Creator> tid=360] View Creator thread is starting
> [vm3] [info 2018/05/19 02:19:11.436 UTC <Geode Membership View Creator> tid=360] 172.17.0.5(159)<v1>:32772 had a weight of 15
> [vm3] [warn 2018/05/19 02:19:11.436 UTC <Geode Membership View Creator> tid=360] total weight lost in this view change is 15 of 25. Quorum has been lost!
> [vm3] [fatal 2018/05/19 02:19:11.438 UTC <Geode Membership View Creator> tid=360] Possible loss of quorum due to the loss of 1 cache processes: [172.17.0.5(159)<v1>:32772]
> {noformat}
> Then vm3 shuts down its DistributionManager:
> {noformat}
> [vm3] [info 2018/05/19 02:19:11.448 UTC <RMI TCP Connection(3)-172.17.0.5> tid=27] Shutting down DistributionManager 172.17.0.5(168)<v2>:32773.
> [vm3] [info 2018/05/19 02:19:11.588 UTC <RMI TCP Connection(3)-172.17.0.5> tid=27] Now closing distribution for 172.17.0.5(168)<v2>:32773
> [vm3] [info 2018/05/19 02:19:11.588 UTC <RMI TCP Connection(3)-172.17.0.5> tid=27] Stopping membership services
> [vm3] [info 2018/05/19 02:19:11.617 UTC <RMI TCP Connection(3)-172.17.0.5> tid=27] DistributionManager stopped in 168ms.
> [vm3] [info 2018/05/19 02:19:11.617 UTC <RMI TCP Connection(3)-172.17.0.5> tid=27] Marking DistributionManager 172.17.0.5(168)<v2>:32773 as closed.
> {noformat}
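> A note on why this output fails the build: DUnit scans member logs after each test and fails the run when it finds a suspect string such as the fatal "Possible loss of quorum" line above, unless the test registers that string as expected. Below is a minimal sketch of that registration, assuming the org.apache.geode.test.dunit.IgnoredException utility; it is illustrative only, since the actual resolution for this ticket was reducing DistributedTest parallelism.
> {code:java}
> // Illustrative sketch only; not the fix applied for GEODE-5238.
> import org.apache.geode.test.dunit.IgnoredException;
>
> public class QuorumSuspectStringSketch {
>   public void runTestExpectingQuorumWarnings() {
>     // Register the expected suspect string before tearing members down;
>     // remove() unregisters it so later tests still flag real failures.
>     IgnoredException ignored =
>         IgnoredException.addIgnoredException("Possible loss of quorum");
>     try {
>       // ... start the members, run the WAN propagation assertions,
>       // and shut the members down ...
>     } finally {
>       ignored.remove();
>     }
>   }
> }
> {code}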