[ https://issues.apache.org/jira/browse/GEODE-7031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Bruce J Schuchardt updated GEODE-7031:
--------------------------------------
    Fix Version/s: 1.11.0

> Attempts to send messages to alert listeners delays network partition detection
> -------------------------------------------------------------------------------
>
>                 Key: GEODE-7031
>                 URL: https://issues.apache.org/jira/browse/GEODE-7031
>             Project: Geode
>          Issue Type: Improvement
>          Components: membership
>            Reporter: Bruce J Schuchardt
>            Assignee: Bruce J Schuchardt
>            Priority: Major
>             Fix For: 1.11.0
>
>          Time Spent: 2h 50m
>  Remaining Estimate: 0h
>
> In a number of recent regression test runs in AWS we have seen network
> partition detection tests fail to detect the partition in a reasonable amount
> of time. Logs show membership services attempting to send alerts to other
> processes that are no longer reachable. Each attempt takes 6 * the
> member-timeout setting - that's 30 seconds for each attempt. It would be
> nice to have a different connection-formation timeout for something like this
> since alert notification is built into the logging system that membership
> services have to use. Since the alert system is also dependent on membership
> services functioning properly, this creates a circular dependency that has
> historically caused hangs and delays such as the one described here.
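
The timing in the description can be made concrete with a small calculation. The following is only a sketch of the arithmetic, not Geode code: the class name, method names, and the shorter alert timeout (one member-timeout per attempt) are all hypothetical and chosen for illustration. The 5-second member-timeout is the value implied by the issue's own figures (6 * 5 s = 30 s).

```java
import java.time.Duration;

// Hypothetical sketch: none of these names come from Geode itself.
final class AlertTimeoutSketch {
    // The issue states that one connection attempt takes 6 * member-timeout.
    static final int CONNECT_ATTEMPT_MULTIPLIER = 6;

    // Time spent on one connection attempt under the behaviour described in the issue.
    static Duration standardConnectTimeout(Duration memberTimeout) {
        return memberTimeout.multipliedBy(CONNECT_ATTEMPT_MULTIPLIER);
    }

    // Assumed alternative policy for alerts only: give up after a single member-timeout.
    static Duration alertConnectTimeout(Duration memberTimeout) {
        return memberTimeout;
    }

    // Total delay if a thread tries each unreachable alert listener one after another.
    static Duration totalDelay(Duration perAttempt, int unreachableListeners) {
        return perAttempt.multipliedBy(unreachableListeners);
    }

    public static void main(String[] args) {
        Duration memberTimeout = Duration.ofSeconds(5); // implied by the issue: 30 s / 6
        Duration standard = standardConnectTimeout(memberTimeout);
        Duration alert = alertConnectTimeout(memberTimeout);
        System.out.println("standard attempt: " + standard.getSeconds() + " s");
        System.out.println("alert attempt (assumed policy): " + alert.getSeconds() + " s");
        System.out.println("two unreachable listeners, standard: "
                + totalDelay(standard, 2).getSeconds() + " s");
    }
}
```

Under these assumptions, a failure detection thread that blocks on alert delivery loses 30 seconds per unreachable listener, which is why a separate, shorter connection-formation timeout for alerts would reduce the delay in detecting the partition.
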
> {noformat}
> [debug 2019/07/29 14:35:03.824 PDT <Geode Failure Detection thread 5> tid=0xc3] Sending (Alert "Unable to send message to 10.32.108.136(gemfire3_host2_12249:12249)<v3>:41003" level WARNING) to 1 peers ([10.32.108.136(gemfire4_host2_12220:12220:locator)<ec><v1>:41001]) via tcp/ip
> [debug 2019/07/29 14:35:03.825 PDT <Geode Failure Detection thread 5> tid=0xc3] created PendingConnection org.apache.geode.internal.tcp.ConnectionTable$PendingConnection@4f4c8630 created by Geode Failure Detection thread 5
> [info 2019/07/29 14:35:33.847 PDT <Geode Failure Detection thread 5> tid=0xc3] Connection: shared=true ordered=true failed to connect to peer 10.32.108.136(gemfire4_host2_12220:12220:locator)<ec><v1>:41001 because: java.net.SocketTimeoutException
> [debug 2019/07/29 14:35:33.852 PDT <Geode Failure Detection thread 5> tid=0xc3] Giving up connecting to alert listener 10.32.108.136(gemfire4_host2_12220:12220:locator)<ec><v1>:41001
> {noformat}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)