[ https://issues.apache.org/jira/browse/GEODE-9299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17352893#comment-17352893 ]
Barrett Oglesby commented on GEODE-9299: ---------------------------------------- The failing assertion is verifying the number of entries in the local secondary queues is 100 (which matches the number of puts). Instead, it is 101. {noformat} int localServer1QueueSize = localServer1.invoke(() -> getQueueRegionSize(senderId, false)); int localServer2QueueSize = localServer2.invoke(() -> getQueueRegionSize(senderId, false)); assertEquals(numPuts, localServer1QueueSize + localServer2QueueSize); {noformat} Here is some logging that shows the behavior in this test. Client Starts: {noformat} [vm3_v1.12.2] [info 2021/05/21 21:12:16.982 GMT <RMI TCP Connection(2)-172.17.0.2> tid=0x22] Received method: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$146/0x00000001008afc40.run with 0 args on object: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$146/0x00000001008afc40@59079e6c [vm3_v1.12.2] [info 2021/05/21 21:12:17.599 GMT <RMI TCP Connection(2)-172.17.0.2> tid=0x22] Using org.apache.geode.logging.log4j.internal.impl.Log4jLoggingProvider from ServiceLoader for service org.apache.geode.logging.internal.spi.LoggingProvider [vm3_v1.12.2] [info 2021/05/21 21:12:24.490 GMT <poolTimer-DEFAULT-3> tid=0x32] Updating membership port. Port changed from 0 to 46166. ID is now 7e72072330df(13685:loner):0:6094c590 [vm3_v1.12.2] [info 2021/05/21 21:12:24.526 GMT <RMI TCP Connection(2)-172.17.0.2> tid=0x22] Got result: null [vm3_v1.12.2] from org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$146/0x00000001008afc40.run with 0 args on object: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$146/0x00000001008afc40@59079e6c (took 7538 ms) {noformat} Client does 100 puts in 22069ms with a SocketTimeoutException: {noformat} [vm3_v1.12.2] [info 2021/05/21 21:12:24.567 GMT <RMI TCP Connection(2)-172.17.0.2> tid=0x22] Received method: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$339/0x0000000100959840.run with 0 args on object: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$339/0x0000000100959840@2e8f97c1 [vm3_v1.12.2] [warn 2021/05/21 21:12:42.233 GMT <RMI TCP Connection(2)-172.17.0.2> tid=0x22] Pool unexpected socket timed out on client connection=Pooled Connection to 7e72072330df:21250: Connection[7e72072330df:21250]@93891194) [vm3_v1.12.2] [info 2021/05/21 21:12:46.638 GMT <RMI TCP Connection(2)-172.17.0.2> tid=0x22] Got result: null [vm3_v1.12.2] from org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$339/0x0000000100959840.run with 0 args on object: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$339/0x0000000100959840@2e8f97c1 (took 22069 ms) {noformat} The SocketTimeoutException means the client retried the put. That ends up being 2 puts for the same event. Server 1 returns secondary queue size: {noformat} [vm1_v1.12.2] [info 2021/05/21 21:12:46.668 GMT <RMI TCP Connection(3)-172.17.0.2> tid=0x22] Received method: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$520/0x0000000100ad1040.run with 0 args on object: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$520/0x0000000100ad1040@79d1c376 [vm1_v1.12.2] [info 2021/05/21 21:12:47.598 GMT <RMI TCP Connection(3)-172.17.0.2> tid=0x22] Got result: null [vm1_v1.12.2] from org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$520/0x0000000100ad1040.run with 0 args on object: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$520/0x0000000100ad1040@79d1c376 (took 929 ms) {noformat} Server 2 returns secondary queue size: {noformat} [vm2_v1.12.2] [info 2021/05/21 21:12:47.617 GMT <RMI TCP Connection(3)-172.17.0.2> tid=0x22] Received method: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$517/0x0000000100ae2c40.run with 0 args on object: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$517/0x0000000100ae2c40@751350b6 [vm2_v1.12.2] [info 2021/05/21 21:12:47.782 GMT <RMI TCP Connection(3)-172.17.0.2> tid=0x22] Got result: null [vm2_v1.12.2] from org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$517/0x0000000100ae2c40.run with 0 args on object: org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest$$Lambda$517/0x0000000100ae2c40@751350b6 (took 161 ms) {noformat} The assertEquals check fails right after this, and the test shuts down. Here is some more detail. Server 1 buckets are created: {noformat} [vm1_v1.12.2] [info 2021/05/21 21:12:24.771 GMT <Pooled Waiting Message Processor 1> tid=0x39] Initializing region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_0 [vm1_v1.12.2] [info 2021/05/21 21:12:24.847 GMT <Pooled Waiting Message Processor 1> tid=0x39] Initialization of region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_0 completed [vm1_v1.12.2] [info 2021/05/21 21:12:25.418 GMT <Pooled Waiting Message Processor 1> tid=0x39] Initializing region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_1 [vm1_v1.12.2] [info 2021/05/21 21:12:25.439 GMT <Pooled Waiting Message Processor 1> tid=0x39] Initialization of region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_1 completed [vm1_v1.12.2] [info 2021/05/21 21:12:26.012 GMT <Pooled Waiting Message Processor 1> tid=0x39] Initializing region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_2 [vm1_v1.12.2] [info 2021/05/21 21:12:26.055 GMT <Pooled Waiting Message Processor 1> tid=0x39] Initialization of region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_2 completed [vm1_v1.12.2] [info 2021/05/21 21:12:26.823 GMT <Pooled Waiting Message Processor 1> tid=0x39] Initializing region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_3 [vm1_v1.12.2] [info 2021/05/21 21:12:26.842 GMT <Pooled Waiting Message Processor 1> tid=0x39] Initialization of region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_3 completed [vm1_v1.12.2] [info 2021/05/21 21:12:28.299 GMT <Pooled Waiting Message Processor 1> tid=0x39] Initializing region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_4 [vm1_v1.12.2] [info 2021/05/21 21:12:28.317 GMT <Pooled Waiting Message Processor 1> tid=0x39] Initialization of region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_4 completed [vm1_v1.12.2] [info 2021/05/21 21:12:42.903 GMT <ServerConnection on port 27499 Thread 1> tid=0x81] Initializing region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_5 [vm1_v1.12.2] [info 2021/05/21 21:12:42.929 GMT <ServerConnection on port 27499 Thread 1> tid=0x81] Initialization of region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_5 completed [vm1_v1.12.2] [info 2021/05/21 21:12:43.102 GMT <ServerConnection on port 27499 Thread 1> tid=0x81] Initializing region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_6 [vm1_v1.12.2] [info 2021/05/21 21:12:43.107 GMT <ServerConnection on port 27499 Thread 1> tid=0x81] Initialization of region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_6 completed [vm1_v1.12.2] [info 2021/05/21 21:12:43.661 GMT <ServerConnection on port 27499 Thread 1> tid=0x81] Initializing region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_7 [vm1_v1.12.2] [info 2021/05/21 21:12:43.692 GMT <ServerConnection on port 27499 Thread 1> tid=0x81] Initialization of region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_7 completed [vm1_v1.12.2] [info 2021/05/21 21:12:44.095 GMT <ServerConnection on port 27499 Thread 1> tid=0x81] Initializing region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_8 [vm1_v1.12.2] [info 2021/05/21 21:12:44.098 GMT <ServerConnection on port 27499 Thread 1> tid=0x81] Initialization of region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_8 completed [vm1_v1.12.2] [info 2021/05/21 21:12:44.366 GMT <ServerConnection on port 27499 Thread 1> tid=0x81] Initializing region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_9 [vm1_v1.12.2] [info 2021/05/21 21:12:44.386 GMT <ServerConnection on port 27499 Thread 1> tid=0x81] Initialization of region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_9 completed {noformat} Server 2 buckets are created: {noformat} [vm2_v1.12.2] [info 2021/05/21 21:12:24.643 GMT <ServerConnection on port 21250 Thread 1> tid=0x6f] Initializing region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_0 [vm2_v1.12.2] [info 2021/05/21 21:12:24.657 GMT <ServerConnection on port 21250 Thread 1> tid=0x6f] Initialization of region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_0 completed [vm2_v1.12.2] [info 2021/05/21 21:12:25.506 GMT <ServerConnection on port 21250 Thread 1> tid=0x6f] Initializing region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_1 [vm2_v1.12.2] [info 2021/05/21 21:12:25.539 GMT <ServerConnection on port 21250 Thread 1> tid=0x6f] Initialization of region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_1 completed [vm2_v1.12.2] [info 2021/05/21 21:12:25.935 GMT <ServerConnection on port 21250 Thread 1> tid=0x6f] Initializing region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_2 [vm2_v1.12.2] [info 2021/05/21 21:12:25.939 GMT <ServerConnection on port 21250 Thread 1> tid=0x6f] Initialization of region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_2 completed [vm2_v1.12.2] [info 2021/05/21 21:12:27.963 GMT <ServerConnection on port 21250 Thread 1> tid=0x6f] Initializing region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_3 [vm2_v1.12.2] [info 2021/05/21 21:12:28.107 GMT <ServerConnection on port 21250 Thread 1> tid=0x6f] Initialization of region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_3 completed [vm2_v1.12.2] [info 2021/05/21 21:12:28.364 GMT <ServerConnection on port 21250 Thread 1> tid=0x6f] Initializing region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_4 [vm2_v1.12.2] [info 2021/05/21 21:12:28.410 GMT <ServerConnection on port 21250 Thread 1> tid=0x6f] Initialization of region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_4 completed [vm2_v1.12.2] [info 2021/05/21 21:12:42.804 GMT <Pooled Waiting Message Processor 1> tid=0x39] Initializing region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_5 [vm2_v1.12.2] [info 2021/05/21 21:12:42.809 GMT <Pooled Waiting Message Processor 1> tid=0x39] Initialization of region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_5 completed [vm2_v1.12.2] [info 2021/05/21 21:12:43.179 GMT <Pooled Waiting Message Processor 2> tid=0x7a] Initializing region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_6 [vm2_v1.12.2] [info 2021/05/21 21:12:43.240 GMT <Pooled Waiting Message Processor 2> tid=0x7a] Initialization of region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_6 completed [vm2_v1.12.2] [info 2021/05/21 21:12:43.543 GMT <Pooled Waiting Message Processor 3> tid=0x7c] Initializing region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_7 [vm2_v1.12.2] [info 2021/05/21 21:12:43.550 GMT <Pooled Waiting Message Processor 3> tid=0x7c] Initialization of region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_7 completed [vm2_v1.12.2] [info 2021/05/21 21:12:44.132 GMT <Pooled Waiting Message Processor 1> tid=0x39] Initializing region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_8 [vm2_v1.12.2] [info 2021/05/21 21:12:44.176 GMT <Pooled Waiting Message Processor 1> tid=0x39] Initialization of region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_8 completed [vm2_v1.12.2] [info 2021/05/21 21:12:44.297 GMT <Pooled Waiting Message Processor 1> tid=0x39] Initializing region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_9 [vm2_v1.12.2] [info 2021/05/21 21:12:44.300 GMT <Pooled Waiting Message Processor 1> tid=0x39] Initialization of region _B__testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from__v1.12.2]__region_9 completed {noformat} There is a big gap between buckets 4 and 5 being created. Server 2 has several interesting messages in the gap between 21:12:28 and 21:12:42: {noformat} [vm2_v1.12.2] [warn 2021/05/21 21:12:35.588 GMT <StatSampler> tid=0x3f] Statistics sampling thread detected a wakeup delay of 5627 ms, indicating a possible resource issue. Check the GC, memory, and CPU statistics. [vm2_v1.12.2] [warn 2021/05/21 21:12:42.715 GMT <ServerConnection on port 21250 Thread 1> tid=0x6f] Server connection from [identity(172.17.0.2(13685:loner):46166:6094c590,connection=1; port=46166]: connection disconnect detected by EOF. {noformat} And, the client timed out at that time: {noformat} [vm3_v1.12.2] [warn 2021/05/21 21:12:42.233 GMT <RMI TCP Connection(2)-172.17.0.2> tid=0x22] Pool unexpected socket timed out on client connection=Pooled Connection to 7e72072330df:21250: Connection[7e72072330df:21250]@93891194) {noformat} So, server 2 had a resource issue (CPU or GC most likely) that caused the client to timeout. > CI Failure: > WANRollingUpgradeSecondaryEventsNotReprocessedAfterOldSiteMemberFailover > > testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover > -------------------------------------------------------------------------------------------------------------------------------------------------- > > Key: GEODE-9299 > URL: https://issues.apache.org/jira/browse/GEODE-9299 > Project: Geode > Issue Type: Bug > Components: wan > Affects Versions: 1.15.0 > Reporter: Hale Bales > Assignee: Barrett Oglesby > Priority: Major > > {code:java} > org.apache.geode.cache.wan.WANRollingUpgradeSecondaryEventsNotReprocessedAfterOldSiteMemberFailover > > testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover[from_v1.12.2] > FAILED > java.lang.AssertionError: expected:<100> but was:<101> > at org.junit.Assert.fail(Assert.java:89) > at org.junit.Assert.failNotEquals(Assert.java:835) > at org.junit.Assert.assertEquals(Assert.java:647) > at org.junit.Assert.assertEquals(Assert.java:633) > at > org.apache.geode.cache.wan.WANRollingUpgradeDUnitTest.stopSenderAndVerifyEvents(WANRollingUpgradeDUnitTest.java:227) > at > org.apache.geode.cache.wan.WANRollingUpgradeSecondaryEventsNotReprocessedAfterOldSiteMemberFailover.testSecondaryEventsNotReprocessedAfterOldSiteMemberFailover(WANRollingUpgradeSecondaryEventsNotReprocessedAfterOldSiteMemberFailover.java:98) > {code} > CI Failure: > https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-main/jobs/UpgradeTestOpenJDK11/builds/229#B > Artifacts Available here: > http://files.apachegeode-ci.info/builds/apache-develop-main/1.15.0-build.0253/test-results/upgradeTest/1621635640/ -- This message was sent by Atlassian Jira (v8.3.4#803005)