Thanks Bruce! I'll change the test to use crashDistributedSystem.
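Here's roughly what I have in mind, just so we're on the same page. Treat it as a sketch: the helper name (crashMember), the memberVM parameter, and the way I look up the DistributionManager are assumptions on my part; only the crashDistributedSystem(dm.getSystem()) call comes straight from your suggestion, and the imports are from memory.

    // Sketch only (inside the DUnit test class) -- helper name and DM lookup
    // are assumptions, not what's on develop.
    import org.apache.geode.distributed.internal.DistributionManager;
    import org.apache.geode.test.dunit.DistributedTestUtils;
    import org.apache.geode.test.dunit.VM;

    private void crashMember(VM memberVM) {
      memberVM.invoke(() -> {
        // Crash the member instead of closing its cache from a Pooled High
        // Priority Message Processor thread; closing there makes that thread
        // wait for its own pool to shut down while membership stays active.
        DistributionManager dm = getCache().getDistributionManager();
        DistributedTestUtils.crashDistributedSystem(dm.getSystem());
      });
    }

With that, the shutdown no longer runs on a message-processing thread, so the "waiting for itself to exit" behavior you described shouldn't come up.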
On Mon, Mar 26, 2018 at 8:49 AM, Bruce Schuchardt <bschucha...@pivotal.io> wrote:

> The problem is that the test is shutting down the cache in a high-priority thread pool. When the ClusterDistributionManager tries to close its thread pools, it waits a while for their threads to exit, so the thread ends up waiting for itself to exit & that's not going to happen. During that time its membership services are still active. I don't see the duplicate "gracefully left" messages you noted, but I do see the other behavior. This is normal and ensures that a member that gets into a strange state during shutdown like this is ejected from the distributed system.
>
> Changing the tests to use DistributedTestUtils.crashDistributedSystem(dm.getSystem()) will clear up this problem and will more correctly match the intent of the tests.
>
> On 3/22/18 7:58 PM, Kirk Lund wrote:
>
>> If anyone really wants to look at and run the test directly, it's *Bug41733DUnitTest* on develop.
>>
>> Alternatively, you could pull the branch GEODE-1279-rename-regressionTests from my fork (see *https://github.com/apache/geode/pull/1672 <https://github.com/apache/geode/pull/1672>*) where I've renamed the test to *BucketCreationRequesterCrashHARegressionTest*, cleaned it up, and temporarily added a locator to play with membership config (to see what effect that has on the membership logging and execution time).
>>
>> On Thu, Mar 22, 2018 at 7:40 PM, Kirk Lund <kl...@apache.org> wrote:
>>
>>> I'm looking through logs from a DUnit test that takes longer to run than I think it should. It looks to me like membership is chewing up more time than it should, and I have questions about some membership log messages.
>>>
>>> Specifically, I'm worried about:
>>>
>>> 1) double-logging by vm1 when the shutdown message is received from vm0
>>>
>>> 2) double-logging by GMSHealthMonitor in vm0 AFTER it disconnected
>>>
>>> 3) vm1 KEEPS suspecting vm0 well after a) vm1 received the shutdown message from vm0, b) processed a view with vm0 departed, and c) even logged that vm0 gracefully left (vm1 seems a bit schizophrenic about vm0)
>>>
>>> Below are the log messages with my comments interspersed. Please help me understand what's actually going on here and whether we're seeing undesired behavior (i.e. minor bugs)...
>>>
>>> KIRK: vm0 is shutting down gracefully
>>>
>>> [vm0] [info 2018/03/22 19:16:27.010 PDT <Pooled High Priority Message Processor 2> tid=87] Shutting down DistributionManager 192.168.1.18(34057)<ec><v1>:32771.
>>>
>>> [vm1] [info 2018/03/22 19:16:27.014 PDT <Pooled High Priority Message Processor 2> tid=66] received leave request from 192.168.1.18(34057)<ec><v1>:32771 for 192.168.1.18(34057)<ec><v1>:32771
>>>
>>> KIRK: vm1 states that vm0 gracefully left and logs it twice, which seems like a minor bug
>>>
>>> [vm1] [info 2018/03/22 19:16:27.016 PDT <Pooled High Priority Message Processor 2> tid=66] Member at 192.168.1.18(34057)<ec><v1>:32771 gracefully left the distributed cache: shutdown message received
>>>
>>> [vm1] [info 2018/03/22 19:16:27.016 PDT <Pooled High Priority Message Processor 2> tid=66] Member at 192.168.1.18(34057)<ec><v1>:32771 gracefully left the distributed cache: shutdown message received
>>>
>>> [vm1] [info 2018/03/22 19:16:27.332 PDT <unicast receiver,Kirks-MacBook-Pro-22774> tid=41] received new view: View[192.168.1.18(34059:locator)<ec><v0>:32770|3] members: [192.168.1.18(34059:locator)<ec><v0>:32770, 192.168.1.18(34058)<ec><v2>:32772{lead}] shutdown: [192.168.1.18(34057)<ec><v1>:32771]
>>> [vm1] old view is: View[192.168.1.18(34059:locator)<ec><v0>:32770|2] members: [192.168.1.18(34059:locator)<ec><v0>:32770, 192.168.1.18(34057)<ec><v1>:32771{lead}, 192.168.1.18(34058)<ec><v2>:32772]
>>>
>>> KIRK: even though vm1 just received a new view, it initiates suspect processing for vm0, which also seems like a minor bug
>>>
>>> [vm1] [info 2018/03/22 19:16:28.186 PDT <unicast receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect request for SuspectMembersMessage [suspectRequests=[SuspectRequest [member=192.168.1.18(34058)<ec><v2>:32772, reason=Member isn't responding to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771
>>>
>>> [vm1] [info 2018/03/22 19:16:28.683 PDT <unicast receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect request for SuspectMembersMessage [suspectRequests=[SuspectRequest [member=192.168.1.18(34058)<ec><v2>:32772, reason=Member isn't responding to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771
>>>
>>> [vm1] [info 2018/03/22 19:16:29.180 PDT <unicast receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect request for SuspectMembersMessage [suspectRequests=[SuspectRequest [member=192.168.1.18(34058)<ec><v2>:32772, reason=Member isn't responding to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771
>>>
>>> [vm1] [info 2018/03/22 19:16:29.681 PDT <unicast receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect request for SuspectMembersMessage [suspectRequests=[SuspectRequest [member=192.168.1.18(34058)<ec><v2>:32772, reason=Member isn't responding to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771
>>>
>>> [vm1] [info 2018/03/22 19:16:30.183 PDT <unicast receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect request for SuspectMembersMessage [suspectRequests=[SuspectRequest [member=192.168.1.18(34059:locator)<ec><v0>:32770, reason=Member isn't responding to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771
>>>
>>> KIRK: vm0 surprisingly continues logging -- why is it suspecting other members? why does it still have GMSHealthMonitor running? why does it log the message twice?
>>> Again, logging it twice looks like a minor bug to me, unless all of this double logging that I'm seeing means membership is doing twice the work, which would be horrible (please put my mind to rest about this).
>>>
>>> [vm0] [info 2018/03/22 19:06:40.319 PDT <Geode Failure Detection Scheduler> tid=41] All other members are suspect at this point
>>>
>>> [vm0] [info 2018/03/22 19:06:40.320 PDT <Geode Failure Detection Scheduler> tid=41] All other members are suspect at this point
>>>
>>> KIRK: after the above 2 messages vm1 continues logging about suspects (why? vm0 gracefully left!)
>>>
>>> [vm1] [info 2018/03/22 19:16:30.682 PDT <unicast receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect request for SuspectMembersMessage [suspectRequests=[SuspectRequest [member=192.168.1.18(34059:locator)<ec><v0>:32770, reason=Member isn't responding to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771
>>>
>>> [vm1] [info 2018/03/22 19:16:31.181 PDT <unicast receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect request for SuspectMembersMessage [suspectRequests=[SuspectRequest [member=192.168.1.18(34059:locator)<ec><v0>:32770, reason=Member isn't responding to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771
>>>
>>> [vm1] [info 2018/03/22 19:16:31.680 PDT <unicast receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect request for SuspectMembersMessage [suspectRequests=[SuspectRequest [member=192.168.1.18(34059:locator)<ec><v0>:32770, reason=Member isn't responding to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771