[ 
https://issues.apache.org/jira/browse/HBASE-29480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HBASE-29480:
-----------------------------------
    Labels: pull-request-available  (was: )

> testCleanClosedWALs fails on branch-2.5
> ---------------------------------------
>
>                 Key: HBASE-29480
>                 URL: https://issues.apache.org/jira/browse/HBASE-29480
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Junegunn Choi
>            Priority: Minor
>              Labels: pull-request-available
>
> h2. Problem
> I'm unable to make {{testCleanClosedWALs}} pass on branch-2.5.
> {noformat}
> [ERROR] Failures: 
> [ERROR] 
> org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamAsyncFSWAL.testCleanClosedWALs
> [ERROR]   Run 1: 
> TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
>  expected:<0> but was:<1>
> [ERROR]   Run 2: 
> TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
>  expected:<0> but was:<1>
> [ERROR]   Run 3: 
> TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
>  expected:<0> but was:<1>
> [ERROR]   Run 4: 
> TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
>  expected:<0> but was:<1>
> [INFO] 
> [ERROR] 
> org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamFSHLog.testCleanClosedWALs
> [ERROR]   Run 1: 
> TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 
> expected:<0> but was:<1>
> [ERROR]   Run 2: 
> TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 
> expected:<0> but was:<1>
> [ERROR]   Run 3: 
> TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 
> expected:<0> but was:<1>
> [ERROR]   Run 4: 
> TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 
> expected:<0> but was:<1>
> {noformat}
> h2. Workaround
> If I add a short sleep after {{rollWriter}} (as suggested in HBASE-25932), 
> the tests pass.
> {code:java}
> diff --git 
> a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
>  
> b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
> index a997cd0a9d..7c47732259 100644
> --- 
> a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
> +++ 
> b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
> @@ -722,6 +722,7 @@ public abstract class TestBasicWALEntryStream extends 
> WALEntryStreamTestBase {
>        appendToLogAndSync();
>        assertNotNull(entryStream.next());
>        log.rollWriter();
> +      Thread.sleep(10);
>        appendToLogAndSync();
>        assertNotNull(entryStream.next());
>        assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
> {code}
> h2. Analysis
> When the test fails, the following log appears:
> {noformat}
> 2025-07-29T14:21:12,524 DEBUG [Time-limited test] wal.ProtobufLogReader(447): 
> Encountered a malformed edit, seeking back to last good position in file, 
> from 218 to 210
> java.io.EOFException: Invalid PB, EOF? Ignoring; originalPosition=210, 
> currentPosition=218
>       at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:376)
>  ~[classes/:?]
>       at 
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:104) 
> ~[classes/:?]
>       at 
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:92) 
> ~[classes/:?]
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:259)
>  ~[classes/:?]
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:181)
>  ~[classes/:?]
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:102)
>  ~[classes/:?]
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.peek(WALEntryStream.java:111)
>  ~[classes/:?]
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.next(WALEntryStream.java:118)
>  ~[classes/:?]
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.access$001(WALEntryStreamTestBase.java:82)
>  ~[test-classes/:?]
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.lambda$next$0(WALEntryStreamTestBase.java:95)
>  ~[test-classes/:?]
>       at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:184) 
> ~[test-classes/:?]
>       at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:135) 
> ~[test-classes/:?]
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.next(WALEntryStreamTestBase.java:94)
>  ~[test-classes/:?]
>       at 
> org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStream.testCleanClosedWALs(TestBasicWALEntryStream.java:727)
>  ~[test-classes/:?]
> {noformat}
> Which led me to this code:
> [https://github.com/apache/hbase/blob/c319c6f0a316446342fd9a03ba72d14e8dc7cb36/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/ProtobufLogReader.java#L372-L376]
> The logic here seems to contradict the comment ("only rethrow EOF if it 
> indicates an EOF") and throw EOF even when it has reached the WAL trailer. 
> (Or maybe I misread the comment?)
> If I change the code to _not_ throw the exception when the WAL trailer is 
> found, the test passes.
> While I have a limited understanding of this area, this behavior seems 
> reasonable: if we have reached the trailer, there's no more entry, so we can 
> just return {{{}false{}}}.
> Any thoughts?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to