Thanks Bruce! I'll change the test to use crashDistributedSystem.
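Here's roughly what I have in mind, just so we're on the same page. Treat it as a sketch: the helper name (crashMember), the memberVM parameter, and the way I look up the DistributionManager are assumptions on my part; only the crashDistributedSystem(dm.getSystem()) call comes straight from your suggestion, and the imports are from memory.

    // Sketch only (inside the DUnit test class) -- helper name and DM lookup
    // are assumptions, not what's on develop.
    import org.apache.geode.distributed.internal.DistributionManager;
    import org.apache.geode.test.dunit.DistributedTestUtils;
    import org.apache.geode.test.dunit.VM;

    private void crashMember(VM memberVM) {
      memberVM.invoke(() -> {
        // Crash the member instead of closing its cache from a Pooled High
        // Priority Message Processor thread; closing there makes that thread
        // wait for its own pool to shut down while membership stays active.
        DistributionManager dm = getCache().getDistributionManager();
        DistributedTestUtils.crashDistributedSystem(dm.getSystem());
      });
    }

With that, the shutdown no longer runs on a message-processing thread, so the "waiting for itself to exit" behavior you described shouldn't come up.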
On Mon, Mar 26, 2018 at 8:49 AM, Bruce Schuchardt <bschucha...@pivotal.io> wrote:

> The problem is that the test is shutting down the cache in a high-priority thread pool. When the ClusterDistributionManager tries to close its thread pools, it waits a while for their threads to exit, so the thread ends up waiting for itself to exit & that's not going to happen. During that time its membership services are still active. I don't see the duplicate "gracefully left" messages you noted, but I do see the other behavior. This is normal and ensures that a member that gets into a strange state during shutdown like this is ejected from the distributed system.
>
> Changing the tests to use DistributedTestUtils.crashDistributedSystem(dm.getSystem()) will clear up this problem and will more correctly match the intent of the tests.
>
> On 3/22/18 7:58 PM, Kirk Lund wrote:
>
>> If anyone really wants to look at and run the test directly, it's *Bug41733DUnitTest* on develop.
>>
>> Alternatively, you could pull the branch GEODE-1279-rename-regressionTests from my fork (see *https://github.com/apache/geode/pull/1672 <https://github.com/apache/geode/pull/1672>*) where I've renamed the test to *BucketCreationRequesterCrashHARegressionTest*, cleaned it up, and temporarily added a locator to play with membership config (to see what effect that has on the membership logging and execution time).
>>
>> On Thu, Mar 22, 2018 at 7:40 PM, Kirk Lund <kl...@apache.org> wrote:
>>
>>> I'm looking through logs from a DUnit test that takes longer to run than I think it should. It looks to me like membership is chewing up more time than it should, and I have questions about some membership log messages.
>>>
>>> Specifically, I'm worried about:
>>>
>>> 1) double-logging by vm1 when the shutdown message is received from vm0
>>>
>>> 2) double-logging by GMSHealthMonitor in vm0 AFTER it disconnected
>>>
>>> 3) vm1 KEEPS suspecting vm0 well after a) vm1 received the shutdown message from vm0, b) processed a view with vm0 departed, and c) even logged that vm0 gracefully left (vm1 seems a bit schizophrenic about vm0)
>>>
>>> Below are the log messages with my comments interspersed. Please help me understand what's actually going on here and whether we're seeing undesired behavior (i.e. minor bugs)...
>>>
>>> KIRK: vm0 is shutting down gracefully
>>>
>>> [vm0] [info 2018/03/22 19:16:27.010 PDT <Pooled High Priority Message Processor 2> tid=87] Shutting down DistributionManager 192.168.1.18(34057)<ec><v1>:32771.
>>>
>>> [vm1] [info 2018/03/22 19:16:27.014 PDT <Pooled High Priority Message Processor 2> tid=66] received leave request from 192.168.1.18(34057)<ec><v1>:32771 for 192.168.1.18(34057)<ec><v1>:32771
>>>
>>> KIRK: vm1 states that vm0 gracefully left and logs it twice, which seems like a minor bug
>>>
>>> [vm1] [info 2018/03/22 19:16:27.016 PDT <Pooled High Priority Message Processor 2> tid=66] Member at 192.168.1.18(34057)<ec><v1>:32771 gracefully left the distributed cache: shutdown message received
>>>
>>> [vm1] [info 2018/03/22 19:16:27.016 PDT <Pooled High Priority Message Processor 2> tid=66] Member at 192.168.1.18(34057)<ec><v1>:32771 gracefully left the distributed cache: shutdown message received
>>>
>>> [vm1] [info 2018/03/22 19:16:27.332 PDT <unicast receiver,Kirks-MacBook-Pro-22774> tid=41] received new view: View[192.168.1.18(34059:locator)<ec><v0>:32770|3] members: [192.168.1.18(34059:locator)<ec><v0>:32770, 192.168.1.18(34058)<ec><v2>:32772{lead}] shutdown: [192.168.1.18(34057)<ec><v1>:32771]
>>> [vm1] old view is: View[192.168.1.18(34059:locator)<ec><v0>:32770|2] members: [192.168.1.18(34059:locator)<ec><v0>:32770, 192.168.1.18(34057)<ec><v1>:32771{lead}, 192.168.1.18(34058)<ec><v2>:32772]
>>>
>>> KIRK: even though vm1 just received a new view, it initiates suspect processing for vm0, which also seems like a minor bug
>>>
>>> [vm1] [info 2018/03/22 19:16:28.186 PDT <unicast receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect request for SuspectMembersMessage [suspectRequests=[SuspectRequest [member=192.168.1.18(34058)<ec><v2>:32772, reason=Member isn't responding to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771
>>>
>>> [vm1] [info 2018/03/22 19:16:28.683 PDT <unicast receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect request for SuspectMembersMessage [suspectRequests=[SuspectRequest [member=192.168.1.18(34058)<ec><v2>:32772, reason=Member isn't responding to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771
>>>
>>> [vm1] [info 2018/03/22 19:16:29.180 PDT <unicast receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect request for SuspectMembersMessage [suspectRequests=[SuspectRequest [member=192.168.1.18(34058)<ec><v2>:32772, reason=Member isn't responding to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771
>>>
>>> [vm1] [info 2018/03/22 19:16:29.681 PDT <unicast receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect request for SuspectMembersMessage [suspectRequests=[SuspectRequest [member=192.168.1.18(34058)<ec><v2>:32772, reason=Member isn't responding to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771
>>>
>>> [vm1] [info 2018/03/22 19:16:30.183 PDT <unicast receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect request for SuspectMembersMessage [suspectRequests=[SuspectRequest [member=192.168.1.18(34059:locator)<ec><v0>:32770, reason=Member isn't responding to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771
>>>
>>> KIRK: vm0 surprisingly continues logging -- why is it suspecting other members? why does it still have GMSHealthMonitor running? why does it log the message twice?
>>> Again, logging it twice looks like a minor bug to me, unless all of this double logging that I'm seeing means membership is doing twice the work, which would be horrible (please put my mind to rest about this).
>>>
>>> [vm0] [info 2018/03/22 19:06:40.319 PDT <Geode Failure Detection Scheduler> tid=41] All other members are suspect at this point
>>>
>>> [vm0] [info 2018/03/22 19:06:40.320 PDT <Geode Failure Detection Scheduler> tid=41] All other members are suspect at this point
>>>
>>> KIRK: after the above 2 messages vm1 continues logging about suspects (why? vm0 gracefully left!)
>>>
>>> [vm1] [info 2018/03/22 19:16:30.682 PDT <unicast receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect request for SuspectMembersMessage [suspectRequests=[SuspectRequest [member=192.168.1.18(34059:locator)<ec><v0>:32770, reason=Member isn't responding to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771
>>>
>>> [vm1] [info 2018/03/22 19:16:31.181 PDT <unicast receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect request for SuspectMembersMessage [suspectRequests=[SuspectRequest [member=192.168.1.18(34059:locator)<ec><v0>:32770, reason=Member isn't responding to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771
>>>
>>> [vm1] [info 2018/03/22 19:16:31.680 PDT <unicast receiver,Kirks-MacBook-Pro-22774> tid=41] Membership ignoring suspect request for SuspectMembersMessage [suspectRequests=[SuspectRequest [member=192.168.1.18(34059:locator)<ec><v0>:32770, reason=Member isn't responding to heartbeat requests]]] from non-member 192.168.1.18<ec><v1>:32771