Junegunn Choi created HBASE-29480: ------------------------------------- Summary: testCleanClosedWALs fails on branch-2.5 Key: HBASE-29480 URL: https://issues.apache.org/jira/browse/HBASE-29480 Project: HBase Issue Type: Bug Reporter: Junegunn Choi
h2. Problem I'm unable to make {{testCleanClosedWALs}} pass on branch-2.5. {noformat} [ERROR] Failures: [ERROR] org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamAsyncFSWAL.testCleanClosedWALs [ERROR] Run 1: TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> [ERROR] Run 2: TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> [ERROR] Run 3: TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> [ERROR] Run 4: TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> [INFO] [ERROR] org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamFSHLog.testCleanClosedWALs [ERROR] Run 1: TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> [ERROR] Run 2: TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> [ERROR] Run 3: TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> [ERROR] Run 4: TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> {noformat} h2. Workaround If I add a short sleep after {{rollWriter}} (as suggested in HBASE-25932), the tests pass. {code:java} diff --git a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java index a997cd0a9d..7c47732259 100644 --- a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java +++ b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java @@ -722,6 +722,7 @@ public abstract class TestBasicWALEntryStream extends WALEntryStreamTestBase { appendToLogAndSync(); assertNotNull(entryStream.next()); log.rollWriter(); + Thread.sleep(10); appendToLogAndSync(); assertNotNull(entryStream.next()); assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs()); {code} h2. Analysis When the test fails, the following log appears: {noformat} 2025-07-29T14:21:12,524 DEBUG [Time-limited test] wal.ProtobufLogReader(447): Encountered a malformed edit, seeking back to last good position in file, from 218 to 210 java.io.EOFException: Invalid PB, EOF? Ignoring; originalPosition=210, currentPosition=218 at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:376) ~[classes/:?] at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:104) ~[classes/:?] at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:92) ~[classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:259) ~[classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:181) ~[classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:102) ~[classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.peek(WALEntryStream.java:111) ~[classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.next(WALEntryStream.java:118) ~[classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.access$001(WALEntryStreamTestBase.java:82) ~[test-classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.lambda$next$0(WALEntryStreamTestBase.java:95) ~[test-classes/:?] at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:184) ~[test-classes/:?] at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:135) ~[test-classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.next(WALEntryStreamTestBase.java:94) ~[test-classes/:?] at org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStream.testCleanClosedWALs(TestBasicWALEntryStream.java:727) ~[test-classes/:?] {noformat} Which led me to this code: [https://github.com/apache/hbase/blob/c319c6f0a316446342fd9a03ba72d14e8dc7cb36/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/ProtobufLogReader.java#L372-L376] The logic here seems to contradict the comment ("only rethrow EOF if it indicates an EOF") and throw EOF even when it has reached the WAL trailer. (Or maybe I misread the comment?) If I change the code to _not_ throw the exception when the WAL trailer is found, the test passes. While I have a limited understanding of this area, this behavior seems reasonable: if we have reached the trailer, there's no more entry, so we can just return {{{}false{}}}. Any thoughts? -- This message was sent by Atlassian Jira (v8.20.10#820010)