[ 
https://issues.apache.org/jira/browse/GEODE-9512?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Owen Nichols closed GEODE-9512.
-------------------------------

> java.lang.IllegalStateException: Detected old version (pre 5.0.1) of GemFire 
> or non-GemFire during handshake due to initial byte being 1
> ----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: GEODE-9512
>                 URL: https://issues.apache.org/jira/browse/GEODE-9512
>             Project: Geode
>          Issue Type: Bug
>          Components: membership, messaging
>    Affects Versions: 1.15.0
>            Reporter: Juan Ramos
>            Assignee: Kamilla Aslami
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.15.0
>
>
> I've hit this issue while executing some chaos testing over a GemFire cluster 
> using 2 locators and 3 servers; {{SSL}} is enabled, and a dummy 
> {{SecurityManager}} is configured which authenticates any user and always 
> returns {{true}} within the {{authorize}} method. There are 3 
> {{PARTITION_REDUNDANT}} regions configured, one per client, each with 1 
> redundant copy. Once the cluster is up and running, 3 clients continuously 
> execute {{Region.put}} operations on a well known set of keys for its own 
> {{Region}} (created with {{PROXY}} type), and another process executes the 
> following logic in parallel (pseudocode):
> {noformat}
> for server in ${servers}
> do
>       # Pause the JVM for 30 seconds to simulate a stop the world GC
>       kill -STOP server 
>       sleep 30
>       # Unpause the JVM, wait for member to reconnect and regions to recover 
> redundancy configured
>       kill -CONT "${SERVER_PID}"
>       waitForReconnectcompletedInServerLog
>       waitForNumBucketsWithoutRedundancyToBeZeroInGfshShowRegionMetrics
> done
> {noformat}
> The test works fine most of the time, but randomly fails due to an unexpected 
> exception logged within the logs for at least one locator. The exception is 
> always reported from a {{P2P message reader}} thread on the locator for a 
> server member that has just returned to life, as an example:
> {noformat}
> #### LOCATOR-0
> [warn 2021/08/17 05:20:45.166 GMT system-test-gemfire-locator-0 <P2P message 
> reader for 
> system-test-gemfire-server-2(system-test-gemfire-server-2:1)<v4>:41000 
> unshared ordered sender uid=61 dom #1 local port=48141 remote port=46174> 
> tid=0x6f] P2P message reader@354fac47 timed out during a membership check.
> [fatal 2021/08/17 05:20:45.166 GMT system-test-gemfire-locator-0 <P2P message 
> reader for 
> system-test-gemfire-server-2(system-test-gemfire-server-2:1)<v4>:41000 
> unshared ordered sender uid=61 dom #1 local port=48141 remote port=46174> 
> tid=0x6f] Error deserializing P2P handshake message
> java.lang.IllegalStateException: Detected old version (pre 5.0.1) of GemFire 
> or non-GemFire during handshake due to initial byte being 1
>       at 
> org.apache.geode.internal.tcp.Connection.readHandshakeForReceiver(Connection.java:2875)
>       at 
> org.apache.geode.internal.tcp.Connection.processInputBuffer(Connection.java:2825)
>       at 
> org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1649)
>       at org.apache.geode.internal.tcp.Connection.run(Connection.java:1489)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>       at java.base/java.lang.Thread.run(Thread.java:829)
> #### SERVER-2
> [warn 2021/08/17 05:20:44.012 GMT system-test-gemfire-server-2 <StatSampler> 
> tid=0x35] Statistics sampling thread detected a wakeup delay of 29070 ms, 
> indicating a possible resource issue. Check the GC, memory, and CPU 
> statistics.
> [warn 2021/08/17 05:20:44.015 GMT system-test-gemfire-server-2 <Geode 
> Heartbeat Sender> tid=0x23] Failure detection heartbeat-generation thread 
> overslept by more than a full period. Asleep time: 31,175,291,931 
> nanoseconds. Period: 2,500,000,000 nanoseconds.
> [info 2021/08/17 05:20:44.143 GMT system-test-gemfire-server-2 <unicast 
> receiver,system-test-gemfire-server-2-25596> tid=0x1e] saving cache server 
> configuration for use with the cluster-configuration service on reconnect
> [info 2021/08/17 05:20:44.143 GMT system-test-gemfire-server-2 <unicast 
> receiver,system-test-gemfire-server-2-25596> tid=0x1e] cache server 
> configuration saved
> [info 2021/08/17 05:20:44.233 GMT system-test-gemfire-server-2 
> <DisconnectThread> tid=0xe5] Stopping membership services
> [info 2021/08/17 05:20:44.455 GMT system-test-gemfire-server-2 
> <ReconnectThread> tid=0xe5] Disconnecting old DistributedSystem to prepare 
> for a reconnect attempt
> [info 2021/08/17 05:20:44.463 GMT system-test-gemfire-server-2 
> <ReconnectThread> tid=0xe5] GemFireCache[id = 252990056; isClosing = true; 
> isShutDownAll = false; created = Tue Aug 17 05:11:50 GMT 2021; server = true; 
> copyOnRead = false; lockLease = 120; lockTimeout = 60]: Now closing.
> [info 2021/08/17 05:20:44.544 GMT system-test-gemfire-server-2 
> <ReconnectThread> tid=0xe5] Cache server on port 40404 is shutting down.
> [info 2021/08/17 05:20:44.565 GMT system-test-gemfire-server-2 <Queue Removal 
> Thread> tid=0x5e] The QueueRemovalThread is done.
> {noformat}
>  
> The full set of logs and statistics from all member of the cluster can be 
> found 
> [here|https://drive.google.com/file/d/1jU_LIut9DVlZNniAdb53IglM3iWbX49C/view].
> Contrary to what the exception message states, it's worth noticing that *_all 
> members within the cluster (including clients) are using the same Geode 
> version._*
> —
> Below are some extra logs from when I was able to reproduce the issue with 
> {{log-level=fine}} configured on all members:
> {noformat}
> #### LOCATOR-0
> [debug 2021/08/16 20:40:22.858 GMT system-test-gemfire-locator-0 <P2P message 
> reader for 
> system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 
> unshared ordered sender uid=47 dom #1 local port=58373 remote port=57818> 
> tid=0x87] P2P handshake remoteAddr is 
> system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 (GEODE 
> 1.15.0)
> [warn 2021/08/16 20:40:23.858 GMT system-test-gemfire-locator-0 <P2P message 
> reader for 
> system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 
> unshared ordered sender uid=47 dom #1 local port=58373 remote port=57818> 
> tid=0x87] P2P message reader@6673b475 timed out during a membership check.
> [fatal 2021/08/16 20:40:23.858 GMT system-test-gemfire-locator-0 <P2P message 
> reader for 
> system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 
> unshared ordered sender uid=47 dom #1 local port=58373 remote port=57818> 
> tid=0x87] Error deserializing P2P handshake message
> java.lang.IllegalStateException: Detected old version (pre 5.0.1) of GemFire 
> or non-GemFire during handshake due to initial byte being 1
>       at 
> org.apache.geode.internal.tcp.Connection.readHandshakeForReceiver(Connection.java:2875)
>       at 
> org.apache.geode.internal.tcp.Connection.processInputBuffer(Connection.java:2825)
>       at 
> org.apache.geode.internal.tcp.Connection.readMessages(Connection.java:1649)
>       at org.apache.geode.internal.tcp.Connection.run(Connection.java:1489)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>       at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>       at java.base/java.lang.Thread.run(Thread.java:829)
> [debug 2021/08/16 20:40:23.863 GMT system-test-gemfire-locator-0 <P2P message 
> reader for 
> system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 
> unshared ordered sender uid=47 dom #1 local port=58373 remote port=57818> 
> tid=0x87] readMessages terminated id=/10.12.0.9:58373 from 
> system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 
> isHandshakeReader=false
> [debug 2021/08/16 20:40:23.863 GMT system-test-gemfire-locator-0 <P2P message 
> reader for 
> system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000 
> unshared ordered sender uid=47 dom #1 local port=58373 remote port=57818> 
> tid=0x87] Stopping P2P message reader@6673b475 for 
> system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000
> #### SERVER-1
> [warn 2021/08/16 20:40:22.698 GMT system-test-gemfire-server-1 <StatSampler> 
> tid=0x35] Statistics sampling thread detected a wakeup delay of 29219 ms, 
> indicating a possible resource issue. Check the GC, memory, and CPU 
> statistics.
> ...
> [debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 
> <ClientMembership Event Invoker1> tid=0x58] Starting thread for P2P handshake 
> reader@6161c139-47
> [debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 <P2P 
> handshake reader@6161c139-47> tid=0x33] Starting P2P handshake 
> reader@6161c139-47 on Socket[addr=/10.12.0.9,port=58373,localport=57818]
> [debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 <P2P 
> handshake reader@6161c139-47> tid=0x33] readMessages terminated 
> id=/10.12.1.6:44219 from 
> system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000
>  isHandshakeReader=false
> [debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 <P2P 
> handshake reader@6161c139-47> tid=0x33] Stopping P2P handshake 
> reader@6161c139-47 for 
> system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000
> [debug 2021/08/16 20:40:22.856 GMT system-test-gemfire-server-1 
> <ClientMembership Event Invoker1> tid=0x58] ConnectionTable: created an 
> ordered connection: 
> system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000(uid=47)
> [debug 2021/08/16 20:40:23.059 GMT system-test-gemfire-server-1 
> <ClientMembership Event Invoker1> tid=0x58] Sending on these 1 connections: 
> [system-test-gemfire-locator-0(system-test-gemfire-locator-0:1:locator)<ec><v0>:41000(uid=47)]
> [debug 2021/08/16 20:40:23.065 GMT system-test-gemfire-server-1 
> <ClientMembership Event Invoker1> tid=0x58] distribution of message aborted 
> by shutdown: 
> UpdateOperation(EntryEventImpl[op=CREATE;region=/_notificationRegion_10.12.1.6<v3>41000;key=org.apache.geode.management.internal.NotificationKey@9b5d432b;callbackArg=null;originRemote=false;originMember=system-test-gemfire-server-1(system-test-gemfire-server-1:1)<v3>:41000;callbacksInvoked;id=EventID[system-test-gemfire-server-1(system-test-gemfire-server-1)<v3>:41000;threadID=2;sequenceID=5]])
> {noformat}
> The full logs and statistics for all member of the cluster can be found 
> [here|https://drive.google.com/file/d/1gAxLzxAYs70hBtrnXq6pI68K1IkAWpRt/view].
> I'm not familiar with this part of the code but, from the above logs, it 
> looks like the same reader ({{reader@6161c139-47}}) is being used by two 
> different threads at the same time?, is this expected?, might that be the 
> cause of the exception we're seeing?.



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

Reply via email to