[ https://issues.apache.org/jira/browse/GEODE-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Owen Nichols closed GEODE-9512. ------------------------------- > java.lang.IllegalStateException: Detected old version (pre 5.0.1) of GemFire > or non-GemFire during handshake due to initial byte being 1 > ---------------------------------------------------------------------------------------------------------------------------------------- > > Key: GEODE-9512 > URL: https://issues.apache.org/jira/browse/GEODE-9512 > Project: Geode > Issue Type: Bug > Components: membership, messaging > Affects Versions: 1.15.0 > Reporter: Juan Ramos > Assignee: Kamilla Aslami > Priority: Major > Labels: pull-request-available > Fix For: 1.15.0 > > > I've hit this issue while executing some chaos testing over a GemFire cluster > using 2 locators and 3 servers; {{SSL}} is enabled, and a dummy > {{SecurityManager}} is configured which authenticates any user and always > returns {{true}} within the {{authorize}} method. There are 3 > {{PARTITION_REDUNDANT}} regions configured, one per client, each with 1 > redundant copy. Once the cluster is up and running, 3 clients continuously > execute {{Region.put}} operations on a well known set of keys for its own > {{Region}} (created with {{PROXY}} type), and another process executes the > following logic in parallel (pseudocode): > {noformat} > for server in ${servers} > do > # Pause the JVM for 30 seconds to simulate a stop the world GC > kill -STOP server > sleep 30 > # Unpause the JVM, wait for member to reconnect and regions to recover > redundancy configured > kill -CONT "${SERVER_PID}" > waitForReconnectcompletedInServerLog > waitForNumBucketsWithoutRedundancyToBeZeroInGfshShowRegionMetrics > done > {noformat} > The test works fine most of the time, but randomly fails due to an unexpected > exception logged within the logs for at least one locator. The exception is > always reported from a {{P2P message reader}} thread on the locator for a > server member that has just returned to life, as an example: > {noformat} > #### LOCATOR-0 > [warn 2021/08/17 05:20:45.166 GMT system-test-gemfire-locator-0 <P2P message > reader for > system-test-gemfire-server-2(system-test-gemfire-server-2:1)<v4>:41000 > unshared ordered sender uid=61 dom #1 local port=48141 remote port=46174> > tid=0x6f] P2P message reader@354fac47 timed out during a membership check. > [fatal 2021/08/17 05:20:45.166 GMT system-test-gemfire-locator-0 <P2P message > reader for > system-test-gemfire-server-2(system-test-gemfire-server-2:1)<v4>:41000 > unshared ordered sender uid=61 dom #1 local port=48141 remote port=46174> > tid=0x6f] Error deserializing P2P handshake message > java.lang.IllegalStateException: Detected old version (pre 5.0.1) of GemFire > or non-GemFire during handshake due to initial byte being 1 > at > org.apache.geode.internal.tcp.Connection.readHandshakeForReceiver(Connection.java:2875) > at > org.apache.geode.internal.tcp.Connection.processInputBuffer(Connection.java:2825) > at > org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1649) > at org.apache.geode.internal.tcp.Connection.run(Connection.java:1489) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > #### SERVER-2 > [warn 2021/08/17 05:20:44.012 GMT system-test-gemfire-server-2 <StatSampler> > tid=0x35] Statistics sampling thread detected a wakeup delay of 29070 ms, > indicating a possible resource issue. Check the GC, memory, and CPU > statistics. > [warn 2021/08/17 05:20:44.015 GMT system-test-gemfire-server-2 <Geode > Heartbeat Sender> tid=0x23] Failure detection heartbeat-generation thread > overslept by more than a full period. Asleep time: 31,175,291,931 > nanoseconds. Period: 2,500,000,000 nanoseconds. > [info 2021/08/17 05:20:44.143 GMT system-test-gemfire-server-2 <unicast > receiver,system-test-gemfire-server-2-25596> tid=0x1e] saving cache server > configuration for use with the cluster-configuration service on reconnect > [info 2021/08/17 05:20:44.143 GMT system-test-gemfire-server-2 <unicast > receiver,system-test-gemfire-server-2-25596> tid=0x1e] cache server > configuration saved > [info 2021/08/17 05:20:44.233 GMT system-test-gemfire-server-2 > <DisconnectThread> tid=0xe5] Stopping membership services > [info 2021/08/17 05:20:44.455 GMT system-test-gemfire-server-2 > <ReconnectThread> tid=0xe5] Disconnecting old DistributedSystem to prepare > for a reconnect attempt > [info 2021/08/17 05:20:44.463 GMT system-test-gemfire-server-2 > <ReconnectThread> tid=0xe5] GemFireCache[id = 252990056; isClosing = true; > isShutDownAll = false; created = Tue Aug 17 05:11:50 GMT 2021; server = true; > copyOnRead = false; lockLease = 120; lockTimeout = 60]: Now closing. > [info 2021/08/17 05:20:44.544 GMT system-test-gemfire-server-2 > <ReconnectThread> tid=0xe5] Cache server on port 40404 is shutting down. > [info 2021/08/17 05:20:44.565 GMT system-test-gemfire-server-2 <Queue Removal > Thread> tid=0x5e] The QueueRemovalThread is done. > {noformat} > > The full set of logs and statistics from all member of the cluster can be > found > [here|https://drive.google.com/file/d/1jU_LIut9DVlZNniAdb53IglM3iWbX49C/view]. > Contrary to what the exception message states, it's worth noticing that *_all > members within the cluster (including clients) are using the same Geode > version._* > — > Below are some extra logs from when I was able to reproduce the issue with > {{log-level=fine}} configured on all members: > {noformat} > #### LOCATOR-0 > [debug 2021/08/16 20:40:22.858 GMT system-test-gemfire-locator-0 <P2P message > reader for > system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 > unshared ordered sender uid=47 dom #1 local port=58373 remote port=57818> > tid=0x87] P2P handshake remoteAddr is > system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 (GEODE > 1.15.0) > [warn 2021/08/16 20:40:23.858 GMT system-test-gemfire-locator-0 <P2P message > reader for > system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 > unshared ordered sender uid=47 dom #1 local port=58373 remote port=57818> > tid=0x87] P2P message reader@6673b475 timed out during a membership check. > [fatal 2021/08/16 20:40:23.858 GMT system-test-gemfire-locator-0 <P2P message > reader for > system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 > unshared ordered sender uid=47 dom #1 local port=58373 remote port=57818> > tid=0x87] Error deserializing P2P handshake message > java.lang.IllegalStateException: Detected old version (pre 5.0.1) of GemFire > or non-GemFire during handshake due to initial byte being 1 > at > org.apache.geode.internal.tcp.Connection.readHandshakeForReceiver(Connection.java:2875) > at > org.apache.geode.internal.tcp.Connection.processInputBuffer(Connection.java:2825) > at > org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1649) > at org.apache.geode.internal.tcp.Connection.run(Connection.java:1489) > at > java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) > at > java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) > at java.base/java.lang.Thread.run(Thread.java:829) > [debug 2021/08/16 20:40:23.863 GMT system-test-gemfire-locator-0 <P2P message > reader for > system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 > unshared ordered sender uid=47 dom #1 local port=58373 remote port=57818> > tid=0x87] readMessages terminated id=/10.12.0.9:58373 from > system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 > isHandshakeReader=false > [debug 2021/08/16 20:40:23.863 GMT system-test-gemfire-locator-0 <P2P message > reader for > system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 > unshared ordered sender uid=47 dom #1 local port=58373 remote port=57818> > tid=0x87] Stopping P2P message reader@6673b475 for > system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 > #### SERVER-1 > [warn 2021/08/16 20:40:22.698 GMT system-test-gemfire-server-1 <StatSampler> > tid=0x35] Statistics sampling thread detected a wakeup delay of 29219 ms, > indicating a possible resource issue. Check the GC, memory, and CPU > statistics. > ... > [debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 > <ClientMembership Event Invoker1> tid=0x58] Starting thread for P2P handshake > reader@6161c139-47 > [debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 <P2P > handshake reader@6161c139-47> tid=0x33] Starting P2P handshake > reader@6161c139-47 on Socket[addr=/10.12.0.9,port=58373,localport=57818] > [debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 <P2P > handshake reader@6161c139-47> tid=0x33] readMessages terminated > id=/10.12.1.6:44219 from > system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000 > isHandshakeReader=false > [debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 <P2P > handshake reader@6161c139-47> tid=0x33] Stopping P2P handshake > reader@6161c139-47 for > system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000 > [debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 > <ClientMembership Event Invoker1> tid=0x58] ConnectionTable: created an > ordered connection: > system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000(uid=47) > [debug 2021/08/16 20:40:23.059 GMT system-test-gemfire-server-1 > <ClientMembership Event Invoker1> tid=0x58] Sending on these 1 connections: > [system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000(uid=47)] > [debug 2021/08/16 20:40:23.065 GMT system-test-gemfire-server-1 > <ClientMembership Event Invoker1> tid=0x58] distribution of message aborted > by shutdown: > UpdateOperation(EntryEventImpl[op=CREATE;region=/_notificationRegion_10.12.1.6<v3>41000;key=org.apache.geode.management.internal.NotificationKey@9b5d432b;callbackArg=null;originRemote=false;originMember=system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000;callbacksInvoked;id=EventID[system-test-gemfire-server-1(system-test-gemfire-server-1)<v3>:41000;threadID=2;sequenceID=5]]) > {noformat} > The full logs and statistics for all member of the cluster can be found > [here|https://drive.google.com/file/d/1gAxLzxAYs70hBtrnXq6pI68K1IkAWpRt/view]. > I'm not familiar with this part of the code but, from the above logs, it > looks like the same reader ({{reader@6161c139-47}}) is being used by two > different threads at the same time?, is this expected?, might that be the > cause of the exception we're seeing?. -- This message was sent by Atlassian Jira (v8.20.7#820007)