[jira] [Commented] (GEODE-4802) Geode cluster hanged after network problems

Anthony Baker (JIRA) Thu, 08 Mar 2018 08:35:30 -0800

    [ 
https://issues.apache.org/jira/browse/GEODE-4802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16391489#comment-16391489
 ]


Anthony Baker commented on GEODE-4802:
--------------------------------------

TCP offers reliable but not guaranteed packet delivery.  If packets get lost 
along the way they will be retransmitted.

TCP will attempt to deliver packets for a long time (about 15min by default 
on Linux), based on the settings for {{tcp_retries1}} and {{tcp_retries2}}.  
You would need to wait that long for TCP to *really* declare a packet as lost.  
Also, suspect processing for a member in this state won't start for at least 
15sec, depending on the settings of {{member-timeout}} and 
{{ack-severe-alert-threshold}}.
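
To illustrate where the roughly 15 minute figure comes from, here is a minimal 
sketch of the retransmission backoff. It assumes the settings above correspond 
to the Linux sysctl net.ipv4.tcp_retries2 (default 15), a minimum retransmission 
timeout of 200ms, and a cap of 120s. The function name and parameters are 
hypothetical, and real stacks (including Windows, which the reporter appears to 
use) may behave differently.

```python
def tcp_give_up_time_ms(retries, rto_min_ms=200, rto_max_ms=120_000):
    """Estimate how long TCP keeps retrying an unacknowledged segment.

    Assumed model: the retransmission timeout (RTO) starts at rto_min_ms,
    doubles after every unanswered transmission, and is capped at
    rto_max_ms. The connection is given up once the timer for the last
    allowed retransmission expires.
    """
    total_ms = 0
    rto_ms = rto_min_ms
    # One timer for the original transmission plus one per retransmission.
    for _ in range(retries + 1):
        total_ms += rto_ms
        rto_ms = min(rto_ms * 2, rto_max_ms)
    return total_ms


if __name__ == "__main__":
    total = tcp_give_up_time_ms(15)
    print(f"{total / 1000:.1f} s (~{total / 60000:.1f} min)")  # 924.6 s (~15.4 min)
```

Under these assumptions, a 15sec packet drop is far too short for TCP itself 
to declare the connection dead, so any earlier reaction has to come from the 
Geode settings mentioned above.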

Suspect processing in Geode is used to fence off unresponsive members from the 
cluster.  That allows us to maintain consistency and predictable availability.  
Cluster settings can be tuned to meet availability SLAs.


> Geode cluster hanged after network problems
> -------------------------------------------
>
>                 Key: GEODE-4802
>                 URL: https://issues.apache.org/jira/browse/GEODE-4802
>             Project: Geode
>          Issue Type: Bug
>            Reporter: Eugene Nedzvetsky
>            Priority: Major
>         Attachments: clumsy2.jpg, threaddump.log
>
>
> Test preparation:
>  # create file bin/server1/gemfire.properties with property 
> membership-port-range=2025-2030
>  # create file bin/server2/gemfire.properties with property 
> membership-port-range=2035-2040
>  # Download network problems emulator [https://jagt.github.io/clumsy]
>  # Fill field 'filtering' in Clumsy: tcp and (tcp.DstPort == 2025 or 
> tcp.DstPort == 2026 or tcp.DstPort == 2027 or tcp.DstPort == 2028 or 
> tcp.DstPort == 2029 or tcp.DstPort == 2030). Select function 'Drop' and set 
> Chance=100%. See clumsy2.jpg
> Steps to reproduce
>  # Start gfsh
>  # start locator --name=locator1
>  # start server --name=server1 --server-port=40411
>  # start server --name=server2 --server-port=40412
>  # create region --name=regionA --type=REPLICATE
>  # put --region=regionA --key="1" --value="one"
>  # Click on 'start' button in Clumsy
>  # put --region=regionA --key="1" --value="onev2"
>  # Wait *15s* and click on 'stop' in Clumsy
> Gfsh console has hung.
> bin\server1\server1.log:
> [warning 2018/03/07 18:02:50.360 PST server1 <Function Execution Processor1> 
> tid=0x4b] 15 seconds have elapsed while waiting for replies: 
> <DistributedCacheOperation$CacheOperationReplyProcessor 22 waiting for 1 
> replies from [192.168.100.109(server2:12804)<v2>:2035]> on 
> 192.168.100.109(server1:14416)<v1>:2045 whose current membership list is: 
> [[192.168.100.109(server2:12804)<v2>:2035, 
> 192.168.100.109(locator1:15628:locator)<ec><v0>:1024, 
> 192.168.100.109(server1:14416)<v1>:2045]]
> Pulse has shown 'normal' status for both servers.
> Gfsh works again after the server1 process is killed.
> Also, I've reproduced another issue with the same scenario on my test 
> environment (see [^threaddump.log])
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
