Junegunn Choi created HBASE-29480:
-------------------------------------

             Summary: testCleanClosedWALs fails on branch-2.5
                 Key: HBASE-29480
                 URL: https://issues.apache.org/jira/browse/HBASE-29480
             Project: HBase
          Issue Type: Bug
            Reporter: Junegunn Choi


h2. Problem

I'm unable to make {{testCleanClosedWALs}} pass on branch-2.5.
{noformat}
[ERROR] Failures: 
[ERROR] 
org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamAsyncFSWAL.testCleanClosedWALs
[ERROR]   Run 1: 
TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
 expected:<0> but was:<1>
[ERROR]   Run 2: 
TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
 expected:<0> but was:<1>
[ERROR]   Run 3: 
TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
 expected:<0> but was:<1>
[ERROR]   Run 4: 
TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
 expected:<0> but was:<1>
[INFO] 
[ERROR] 
org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamFSHLog.testCleanClosedWALs
[ERROR]   Run 1: 
TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 
expected:<0> but was:<1>
[ERROR]   Run 2: 
TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 
expected:<0> but was:<1>
[ERROR]   Run 3: 
TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 
expected:<0> but was:<1>
[ERROR]   Run 4: 
TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 
expected:<0> but was:<1>
{noformat}
h2. Workaround

If I add a short sleep after {{rollWriter}} (as suggested in HBASE-25932), the 
tests pass.
{code:java}
diff --git 
a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
 
b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
index a997cd0a9d..7c47732259 100644
--- 
a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
+++ 
b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
@@ -722,6 +722,7 @@ public abstract class TestBasicWALEntryStream extends 
WALEntryStreamTestBase {
       appendToLogAndSync();
       assertNotNull(entryStream.next());
       log.rollWriter();
+      Thread.sleep(10);
       appendToLogAndSync();
       assertNotNull(entryStream.next());
       assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
{code}
h2. Analysis

When the test fails, the following log appears:
{noformat}
2025-07-29T14:21:12,524 DEBUG [Time-limited test] wal.ProtobufLogReader(447): 
Encountered a malformed edit, seeking back to last good position in file, from 
218 to 210
java.io.EOFException: Invalid PB, EOF? Ignoring; originalPosition=210, 
currentPosition=218
        at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:376)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:104) 
~[classes/:?]
        at 
org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:92) 
~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:259)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:181)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:102)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.peek(WALEntryStream.java:111)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.next(WALEntryStream.java:118)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.access$001(WALEntryStreamTestBase.java:82)
 ~[test-classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.lambda$next$0(WALEntryStreamTestBase.java:95)
 ~[test-classes/:?]
        at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:184) 
~[test-classes/:?]
        at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:135) 
~[test-classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.next(WALEntryStreamTestBase.java:94)
 ~[test-classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStream.testCleanClosedWALs(TestBasicWALEntryStream.java:727)
 ~[test-classes/:?]
{noformat}
Which led me to this code:

[https://github.com/apache/hbase/blob/c319c6f0a316446342fd9a03ba72d14e8dc7cb36/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/ProtobufLogReader.java#L372-L376]

The logic here seems to contradict the comment ("only rethrow EOF if it 
indicates an EOF") and throw EOF even when it has reached the WAL trailer. (Or 
maybe I misread the comment?)

If I change the code to _not_ throw the exception when the WAL trailer is 
found, the test passes.

While I have a limited understanding of this area, this behavior seems 
reasonable: if we have reached the trailer, there's no more entry, so we can 
just return {{{}false{}}}.

Any thoughts?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to