The problem is that the test is shutting down the cache in a high-priority thread pool.  When the ClusterDistributionManager tries to close its thread pools it waits a while for their threads to exit, so the thread ends up waiting for itself to exit & that's not going to happen.  During that time its membership services are still active.  I don't see the duplicate "gracefully left" messages you noted but I do see the other behavior.  This is normal and ensures that a member that gets into a strange state during shutdown like this is ejected from the distributed system.

Changing the tests to use DistributedTestUtils.crashDistributedSystem(dm.getSystem()) will clear up this problem and will more correctly match the intent of the tests.

On 3/22/18 7:58 PM, Kirk Lund wrote:
If anyone really wants to look at and run the test directly, it's
*Bug41733DUnitTest* on develop.

Alternatively, you could pull the branch GEODE-1279-rename-regressionTests
from my fork (see *https://github.com/apache/geode/pull/1672
<https://github.com/apache/geode/pull/1672>*) where I've renamed the test
as *BucketCreationRequesterCrashHARegressionTest*, cleaned it up and
temporarily added a locator to play with membership config (to see what
effect that has on the membership logging and execution time).

On Thu, Mar 22, 2018 at 7:40 PM, Kirk Lund <kl...@apache.org> wrote:

I'm looking through logs from a DUnit test that takes longer to run than I
think it should. It looks to me like membership is chewing up more time
than it should, and I have questions about some membership log messages.

Specifically, I'm worried about:
1) double-logging by vm1 when shutdown message is received from vm0

2) double-logging by GMSHealthMonitor in vm0 AFTER it disconnected

3) vm1 KEEPS suspecting vm0 well after a) vm1 received shutdown message
from vm0, b) processed a view with vm0 departed and c) even logged that vm0
gracefully left. (vm1 seems a bit schizophrenic about vm0)

Below are the log messages with my comments interspersed. Please help me
understand what's actually going on here and if we're seeing undesired
behavior (ie minor bugs)...

KIRK: vm0 is shutting down gracefully

[vm0] [info 2018/03/22 19:16:27.010 PDT <Pooled High Priority Message
Processor 2> tid=87] Shutting down DistributionManager
192.168.1.18(34057)<ec><v1>:32771.

[vm1] [info 2018/03/22 19:16:27.014 PDT <Pooled High Priority Message
Processor 2> tid=66] received leave request from
192.168.1.18(34057)<ec><v1>:32771 for 192.168.1.18(34057)<ec><v1>:32771

KIRK: vm1 states that vm0 gracefully left and logs it twice which seems
like a minor bug

[vm1] [info 2018/03/22 19:16:27.016 PDT <Pooled High Priority Message
Processor 2> tid=66] Member at 192.168.1.18(34057)<ec><v1>:32771
gracefully left the distributed cache: shutdown message received

[vm1] [info 2018/03/22 19:16:27.016 PDT <Pooled High Priority Message
Processor 2> tid=66] Member at 192.168.1.18(34057)<ec><v1>:32771
gracefully left the distributed cache: shutdown message received

[vm1] [info 2018/03/22 19:16:27.332 PDT <unicast
receiver,Kirks-MacBook-Pro-22774> tid=41] received new view:
View[192.168.1.18(34059:locator)<ec><v0>:32770|3] members:
[192.168.1.18(34059:locator)<ec><v0>:32770, 
192.168.1.18(34058)<ec><v2>:32772{lead}]
shutdown: [192.168.1.18(34057)<ec><v1>:32771]
[vm1] old view is: View[192.168.1.18(34059:locator)<ec><v0>:32770|2]
members: [192.168.1.18(34059:locator)<ec><v0>:32770,
192.168.1.18(34057)<ec><v1>:32771{lead}, 192.168.1.18(34058)<ec><v2>:
32772]

KIRK: even tho vm1 just received a new view it initiates suspect
processing for vm0 which also seems like a minor bug

[vm1] [info 2018/03/22 19:16:28.186 PDT <unicast
receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect
request for SuspectMembersMessage [suspectRequests=[SuspectRequest
[member=192.168.1.18(34058)<ec><v2>:32772, reason=Member isn't responding
to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771

[vm1] [info 2018/03/22 19:16:28.683 PDT <unicast
receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect
request for SuspectMembersMessage [suspectRequests=[SuspectRequest
[member=192.168.1.18(34058)<ec><v2>:32772, reason=Member isn't responding
to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771

[vm1] [info 2018/03/22 19:16:29.180 PDT <unicast
receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect
request for SuspectMembersMessage [suspectRequests=[SuspectRequest
[member=192.168.1.18(34058)<ec><v2>:32772, reason=Member isn't responding
to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771

[vm1] [info 2018/03/22 19:16:29.681 PDT <unicast
receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect
request for SuspectMembersMessage [suspectRequests=[SuspectRequest
[member=192.168.1.18(34058)<ec><v2>:32772, reason=Member isn't responding
to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771

[vm1] [info 2018/03/22 19:16:30.183 PDT <unicast
receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect
request for SuspectMembersMessage [suspectRequests=[SuspectRequest
[member=192.168.1.18(34059:locator)<ec><v0>:32770, reason=Member isn't
responding to heartbeat requests]]] from non-member
192.168.1.18<ec><v1>:32771

KIRK: vm0 surprisingly continues logging -- why is it suspecting other
members? why does it still have GMSHealthMonitor running? why does it log
the message twice? and again logging it twice looks like a minor bug to me
unless all of this double logging that I'm seeing means membership is doing
twice the work which would be horrible (please put my mind to rest about
this)

[vm0] [info 2018/03/22 19:06:40.319 PDT <Geode Failure Detection
Scheduler> tid=41] All other members are suspect at this point

[vm0] [info 2018/03/22 19:06:40.320 PDT <Geode Failure Detection
Scheduler> tid=41] All other members are suspect at this point

KIRK: after the above 2 messages vm1 continues logging about suspect (why?
vm0 gracefully left!)

[vm1] [info 2018/03/22 19:16:30.682 PDT <unicast
receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect
request for SuspectMembersMessage [suspectRequests=[SuspectRequest
[member=192.168.1.18(34059:locator)<ec><v0>:32770, reason=Member isn't
responding to heartbeat requests]]] from non-member
192.168.1.18<ec><v1>:32771

[vm1] [info 2018/03/22 19:16:31.181 PDT <unicast
receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect
request for SuspectMembersMessage [suspectRequests=[SuspectRequest
[member=192.168.1.18(34059:locator)<ec><v0>:32770, reason=Member isn't
responding to heartbeat requests]]] from non-member
192.168.1.18<ec><v1>:32771

[vm1] [info 2018/03/22 19:16:31.680 PDT <unicast
receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect
request for SuspectMembersMessage [suspectRequests=[SuspectRequest
[member=192.168.1.18(34059:locator)<ec><v0>:32770, reason=Member isn't
responding to heartbeat requests]]] from non-member
192.168.1.18<ec><v1>:32771



Reply via email to