[ 
https://issues.apache.org/jira/browse/HBASE-29480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Junegunn Choi updated HBASE-29480:
----------------------------------
    Description: 
h2. Problem

I'm unable to make {{testCleanClosedWALs}} pass on branch-2.5.
{noformat}
[ERROR] Failures: 
[ERROR] 
org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamAsyncFSWAL.testCleanClosedWALs
[ERROR]   Run 1: 
TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
 expected:<0> but was:<1>
[ERROR]   Run 2: 
TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
 expected:<0> but was:<1>
[ERROR]   Run 3: 
TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
 expected:<0> but was:<1>
[ERROR]   Run 4: 
TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
 expected:<0> but was:<1>
[INFO] 
[ERROR] 
org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamFSHLog.testCleanClosedWALs
[ERROR]   Run 1: 
TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 
expected:<0> but was:<1>
[ERROR]   Run 2: 
TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 
expected:<0> but was:<1>
[ERROR]   Run 3: 
TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 
expected:<0> but was:<1>
[ERROR]   Run 4: 
TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 
expected:<0> but was:<1>
{noformat}
h2. Workaround

If I add a short sleep after {{rollWriter}} (as suggested in HBASE-25932), the 
tests pass.
{code:java}
diff --git 
a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
 
b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
index a997cd0a9d..7c47732259 100644
--- 
a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
+++ 
b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
@@ -722,6 +722,7 @@ public abstract class TestBasicWALEntryStream extends 
WALEntryStreamTestBase {
       appendToLogAndSync();
       assertNotNull(entryStream.next());
       log.rollWriter();
+      Thread.sleep(10);
       appendToLogAndSync();
       assertNotNull(entryStream.next());
       assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
{code}
h2. Analysis

When the test fails, the following log appears:
{noformat}
2025-07-29T14:21:12,524 DEBUG [Time-limited test] wal.ProtobufLogReader(447): 
Encountered a malformed edit, seeking back to last good position in file, from 
218 to 210
java.io.EOFException: Invalid PB, EOF? Ignoring; originalPosition=210, 
currentPosition=218
        at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:376)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:104) 
~[classes/:?]
        at 
org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:92) 
~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:259)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:181)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:102)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.peek(WALEntryStream.java:111)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.next(WALEntryStream.java:118)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.access$001(WALEntryStreamTestBase.java:82)
 ~[test-classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.lambda$next$0(WALEntryStreamTestBase.java:95)
 ~[test-classes/:?]
        at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:184) 
~[test-classes/:?]
        at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:135) 
~[test-classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.next(WALEntryStreamTestBase.java:94)
 ~[test-classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStream.testCleanClosedWALs(TestBasicWALEntryStream.java:727)
 ~[test-classes/:?]
{noformat}
Which led me to this code:

[https://github.com/apache/hbase/blob/c319c6f0a316446342fd9a03ba72d14e8dc7cb36/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/ProtobufLogReader.java#L372-L376]

The logic here seems to contradict the comment ("only rethrow EOF if it 
indicates an EOF") and throw EOF even when it has reached the WAL trailer. (Or 
maybe I misread the comment?)

If I change the code to _not_ throw the exception when the WAL trailer is found 
(see [https://github.com/apache/hbase/pull/7176]), the test passes.

While I have a limited understanding of this area, this behavior seems 
reasonable: if we have reached the trailer, there's no more entries, so we can 
just return {{{}false{}}}.

Any thoughts? FYI, the test passes on branch-2.

  was:
h2. Problem

I'm unable to make {{testCleanClosedWALs}} pass on branch-2.5.
{noformat}
[ERROR] Failures: 
[ERROR] 
org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamAsyncFSWAL.testCleanClosedWALs
[ERROR]   Run 1: 
TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
 expected:<0> but was:<1>
[ERROR]   Run 2: 
TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
 expected:<0> but was:<1>
[ERROR]   Run 3: 
TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
 expected:<0> but was:<1>
[ERROR]   Run 4: 
TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
 expected:<0> but was:<1>
[INFO] 
[ERROR] 
org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamFSHLog.testCleanClosedWALs
[ERROR]   Run 1: 
TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 
expected:<0> but was:<1>
[ERROR]   Run 2: 
TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 
expected:<0> but was:<1>
[ERROR]   Run 3: 
TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 
expected:<0> but was:<1>
[ERROR]   Run 4: 
TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 
expected:<0> but was:<1>
{noformat}
h2. Workaround

If I add a short sleep after {{rollWriter}} (as suggested in HBASE-25932), the 
tests pass.
{code:java}
diff --git 
a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
 
b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
index a997cd0a9d..7c47732259 100644
--- 
a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
+++ 
b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
@@ -722,6 +722,7 @@ public abstract class TestBasicWALEntryStream extends 
WALEntryStreamTestBase {
       appendToLogAndSync();
       assertNotNull(entryStream.next());
       log.rollWriter();
+      Thread.sleep(10);
       appendToLogAndSync();
       assertNotNull(entryStream.next());
       assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
{code}
h2. Analysis

When the test fails, the following log appears:
{noformat}
2025-07-29T14:21:12,524 DEBUG [Time-limited test] wal.ProtobufLogReader(447): 
Encountered a malformed edit, seeking back to last good position in file, from 
218 to 210
java.io.EOFException: Invalid PB, EOF? Ignoring; originalPosition=210, 
currentPosition=218
        at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:376)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:104) 
~[classes/:?]
        at 
org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:92) 
~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:259)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:181)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:102)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.peek(WALEntryStream.java:111)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.next(WALEntryStream.java:118)
 ~[classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.access$001(WALEntryStreamTestBase.java:82)
 ~[test-classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.lambda$next$0(WALEntryStreamTestBase.java:95)
 ~[test-classes/:?]
        at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:184) 
~[test-classes/:?]
        at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:135) 
~[test-classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.next(WALEntryStreamTestBase.java:94)
 ~[test-classes/:?]
        at 
org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStream.testCleanClosedWALs(TestBasicWALEntryStream.java:727)
 ~[test-classes/:?]
{noformat}
Which led me to this code:

[https://github.com/apache/hbase/blob/c319c6f0a316446342fd9a03ba72d14e8dc7cb36/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/ProtobufLogReader.java#L372-L376]

The logic here seems to contradict the comment ("only rethrow EOF if it 
indicates an EOF") and throw EOF even when it has reached the WAL trailer. (Or 
maybe I misread the comment?)

If I change the code to _not_ throw the exception when the WAL trailer is found 
(see https://github.com/apache/hbase/pull/7176), the test passes.

While I have a limited understanding of this area, this behavior seems 
reasonable: if we have reached the trailer, there's no more entry, so we can 
just return {{{}false{}}}.

Any thoughts?


> testCleanClosedWALs fails on branch-2.5
> ---------------------------------------
>
>                 Key: HBASE-29480
>                 URL: https://issues.apache.org/jira/browse/HBASE-29480
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Junegunn Choi
>            Priority: Minor
>              Labels: pull-request-available
>
> h2. Problem
> I'm unable to make {{testCleanClosedWALs}} pass on branch-2.5.
> {noformat}
> [ERROR] Failures: 
> [ERROR] 
> org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamAsyncFSWAL.testCleanClosedWALs
> [ERROR]   Run 1: 
> TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
>  expected:<0> but was:<1>
> [ERROR]   Run 2: 
> TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
>  expected:<0> but was:<1>
> [ERROR]   Run 3: 
> TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
>  expected:<0> but was:<1>
> [ERROR]   Run 4: 
> TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727
>  expected:<0> but was:<1>
> [INFO] 
> [ERROR] 
> org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamFSHLog.testCleanClosedWALs
> [ERROR]   Run 1: 
> TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 
> expected:<0> but was:<1>
> [ERROR]   Run 2: 
> TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 
> expected:<0> but was:<1>
> [ERROR]   Run 3: 
> TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 
> expected:<0> but was:<1>
> [ERROR]   Run 4: 
> TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 
> expected:<0> but was:<1>
> {noformat}
> h2. Workaround
> If I add a short sleep after {{rollWriter}} (as suggested in HBASE-25932), 
> the tests pass.
> {code:java}
> diff --git 
> a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
>  
> b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
> index a997cd0a9d..7c47732259 100644
> --- 
> a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
> +++ 
> b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java
> @@ -722,6 +722,7 @@ public abstract class TestBasicWALEntryStream extends 
> WALEntryStreamTestBase {
>        appendToLogAndSync();
>        assertNotNull(entryStream.next());
>        log.rollWriter();
> +      Thread.sleep(10);
>        appendToLogAndSync();
>        assertNotNull(entryStream.next());
>        assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs());
> {code}
> h2. Analysis
> When the test fails, the following log appears:
> {noformat}
> 2025-07-29T14:21:12,524 DEBUG [Time-limited test] wal.ProtobufLogReader(447): 
> Encountered a malformed edit, seeking back to last good position in file, 
> from 218 to 210
> java.io.EOFException: Invalid PB, EOF? Ignoring; originalPosition=210, 
> currentPosition=218
>       at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:376)
>  ~[classes/:?]
>       at 
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:104) 
> ~[classes/:?]
>       at 
> org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:92) 
> ~[classes/:?]
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:259)
>  ~[classes/:?]
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:181)
>  ~[classes/:?]
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:102)
>  ~[classes/:?]
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.peek(WALEntryStream.java:111)
>  ~[classes/:?]
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.next(WALEntryStream.java:118)
>  ~[classes/:?]
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.access$001(WALEntryStreamTestBase.java:82)
>  ~[test-classes/:?]
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.lambda$next$0(WALEntryStreamTestBase.java:95)
>  ~[test-classes/:?]
>       at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:184) 
> ~[test-classes/:?]
>       at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:135) 
> ~[test-classes/:?]
>       at 
> org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.next(WALEntryStreamTestBase.java:94)
>  ~[test-classes/:?]
>       at 
> org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStream.testCleanClosedWALs(TestBasicWALEntryStream.java:727)
>  ~[test-classes/:?]
> {noformat}
> Which led me to this code:
> [https://github.com/apache/hbase/blob/c319c6f0a316446342fd9a03ba72d14e8dc7cb36/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/ProtobufLogReader.java#L372-L376]
> The logic here seems to contradict the comment ("only rethrow EOF if it 
> indicates an EOF") and throw EOF even when it has reached the WAL trailer. 
> (Or maybe I misread the comment?)
> If I change the code to _not_ throw the exception when the WAL trailer is 
> found (see [https://github.com/apache/hbase/pull/7176]), the test passes.
> While I have a limited understanding of this area, this behavior seems 
> reasonable: if we have reached the trailer, there's no more entries, so we 
> can just return {{{}false{}}}.
> Any thoughts? FYI, the test passes on branch-2.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to