[ https://issues.apache.org/jira/browse/HBASE-29480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Junegunn Choi updated HBASE-29480: ---------------------------------- Description: h2. Problem I'm unable to make {{testCleanClosedWALs}} pass on branch-2.5. {noformat} [ERROR] Failures: [ERROR] org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamAsyncFSWAL.testCleanClosedWALs [ERROR] Run 1: TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> [ERROR] Run 2: TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> [ERROR] Run 3: TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> [ERROR] Run 4: TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> [INFO] [ERROR] org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamFSHLog.testCleanClosedWALs [ERROR] Run 1: TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> [ERROR] Run 2: TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> [ERROR] Run 3: TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> [ERROR] Run 4: TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> {noformat} h2. Workaround If I add a short sleep after {{rollWriter}} (as suggested in HBASE-25932), the tests pass. {code:java} diff --git a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java index a997cd0a9d..7c47732259 100644 --- a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java +++ b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java @@ -722,6 +722,7 @@ public abstract class TestBasicWALEntryStream extends WALEntryStreamTestBase { appendToLogAndSync(); assertNotNull(entryStream.next()); log.rollWriter(); + Thread.sleep(10); appendToLogAndSync(); assertNotNull(entryStream.next()); assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs()); {code} h2. Analysis When the test fails, the following log appears: {noformat} 2025-07-29T14:21:12,524 DEBUG [Time-limited test] wal.ProtobufLogReader(447): Encountered a malformed edit, seeking back to last good position in file, from 218 to 210 java.io.EOFException: Invalid PB, EOF? Ignoring; originalPosition=210, currentPosition=218 at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:376) ~[classes/:?] at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:104) ~[classes/:?] at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:92) ~[classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:259) ~[classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:181) ~[classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:102) ~[classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.peek(WALEntryStream.java:111) ~[classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.next(WALEntryStream.java:118) ~[classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.access$001(WALEntryStreamTestBase.java:82) ~[test-classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.lambda$next$0(WALEntryStreamTestBase.java:95) ~[test-classes/:?] at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:184) ~[test-classes/:?] at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:135) ~[test-classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.next(WALEntryStreamTestBase.java:94) ~[test-classes/:?] at org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStream.testCleanClosedWALs(TestBasicWALEntryStream.java:727) ~[test-classes/:?] {noformat} Which led me to this code: [https://github.com/apache/hbase/blob/c319c6f0a316446342fd9a03ba72d14e8dc7cb36/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/ProtobufLogReader.java#L372-L376] The logic here seems to contradict the comment ("only rethrow EOF if it indicates an EOF") and throw EOF even when it has reached the WAL trailer. (Or maybe I misread the comment?) If I change the code to _not_ throw the exception when the WAL trailer is found (see [https://github.com/apache/hbase/pull/7176]), the test passes. While I have a limited understanding of this area, this behavior seems reasonable: if we have reached the trailer, there's no more entries, so we can just return {{{}false{}}}. Any thoughts? FYI, the test passes on branch-2. was: h2. Problem I'm unable to make {{testCleanClosedWALs}} pass on branch-2.5. {noformat} [ERROR] Failures: [ERROR] org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamAsyncFSWAL.testCleanClosedWALs [ERROR] Run 1: TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> [ERROR] Run 2: TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> [ERROR] Run 3: TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> [ERROR] Run 4: TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> [INFO] [ERROR] org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamFSHLog.testCleanClosedWALs [ERROR] Run 1: TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> [ERROR] Run 2: TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> [ERROR] Run 3: TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> [ERROR] Run 4: TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 expected:<0> but was:<1> {noformat} h2. Workaround If I add a short sleep after {{rollWriter}} (as suggested in HBASE-25932), the tests pass. {code:java} diff --git a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java index a997cd0a9d..7c47732259 100644 --- a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java +++ b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java @@ -722,6 +722,7 @@ public abstract class TestBasicWALEntryStream extends WALEntryStreamTestBase { appendToLogAndSync(); assertNotNull(entryStream.next()); log.rollWriter(); + Thread.sleep(10); appendToLogAndSync(); assertNotNull(entryStream.next()); assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs()); {code} h2. Analysis When the test fails, the following log appears: {noformat} 2025-07-29T14:21:12,524 DEBUG [Time-limited test] wal.ProtobufLogReader(447): Encountered a malformed edit, seeking back to last good position in file, from 218 to 210 java.io.EOFException: Invalid PB, EOF? Ignoring; originalPosition=210, currentPosition=218 at org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:376) ~[classes/:?] at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:104) ~[classes/:?] at org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:92) ~[classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:259) ~[classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:181) ~[classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:102) ~[classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.peek(WALEntryStream.java:111) ~[classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.next(WALEntryStream.java:118) ~[classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.access$001(WALEntryStreamTestBase.java:82) ~[test-classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.lambda$next$0(WALEntryStreamTestBase.java:95) ~[test-classes/:?] at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:184) ~[test-classes/:?] at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:135) ~[test-classes/:?] at org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.next(WALEntryStreamTestBase.java:94) ~[test-classes/:?] at org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStream.testCleanClosedWALs(TestBasicWALEntryStream.java:727) ~[test-classes/:?] {noformat} Which led me to this code: [https://github.com/apache/hbase/blob/c319c6f0a316446342fd9a03ba72d14e8dc7cb36/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/ProtobufLogReader.java#L372-L376] The logic here seems to contradict the comment ("only rethrow EOF if it indicates an EOF") and throw EOF even when it has reached the WAL trailer. (Or maybe I misread the comment?) If I change the code to _not_ throw the exception when the WAL trailer is found (see https://github.com/apache/hbase/pull/7176), the test passes. While I have a limited understanding of this area, this behavior seems reasonable: if we have reached the trailer, there's no more entry, so we can just return {{{}false{}}}. Any thoughts? > testCleanClosedWALs fails on branch-2.5 > --------------------------------------- > > Key: HBASE-29480 > URL: https://issues.apache.org/jira/browse/HBASE-29480 > Project: HBase > Issue Type: Bug > Reporter: Junegunn Choi > Priority: Minor > Labels: pull-request-available > > h2. Problem > I'm unable to make {{testCleanClosedWALs}} pass on branch-2.5. > {noformat} > [ERROR] Failures: > [ERROR] > org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamAsyncFSWAL.testCleanClosedWALs > [ERROR] Run 1: > TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727 > expected:<0> but was:<1> > [ERROR] Run 2: > TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727 > expected:<0> but was:<1> > [ERROR] Run 3: > TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727 > expected:<0> but was:<1> > [ERROR] Run 4: > TestBasicWALEntryStreamAsyncFSWAL>TestBasicWALEntryStream.testCleanClosedWALs:727 > expected:<0> but was:<1> > [INFO] > [ERROR] > org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStreamFSHLog.testCleanClosedWALs > [ERROR] Run 1: > TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 > expected:<0> but was:<1> > [ERROR] Run 2: > TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 > expected:<0> but was:<1> > [ERROR] Run 3: > TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 > expected:<0> but was:<1> > [ERROR] Run 4: > TestBasicWALEntryStreamFSHLog>TestBasicWALEntryStream.testCleanClosedWALs:727 > expected:<0> but was:<1> > {noformat} > h2. Workaround > If I add a short sleep after {{rollWriter}} (as suggested in HBASE-25932), > the tests pass. > {code:java} > diff --git > a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java > > b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java > index a997cd0a9d..7c47732259 100644 > --- > a/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java > +++ > b/hbase-server/src/test/java/org/apache/hadoop/hbase/replication/regionserver/TestBasicWALEntryStream.java > @@ -722,6 +722,7 @@ public abstract class TestBasicWALEntryStream extends > WALEntryStreamTestBase { > appendToLogAndSync(); > assertNotNull(entryStream.next()); > log.rollWriter(); > + Thread.sleep(10); > appendToLogAndSync(); > assertNotNull(entryStream.next()); > assertEquals(0, logQueue.getMetrics().getUncleanlyClosedWALs()); > {code} > h2. Analysis > When the test fails, the following log appears: > {noformat} > 2025-07-29T14:21:12,524 DEBUG [Time-limited test] wal.ProtobufLogReader(447): > Encountered a malformed edit, seeking back to last good position in file, > from 218 to 210 > java.io.EOFException: Invalid PB, EOF? Ignoring; originalPosition=210, > currentPosition=218 > at > org.apache.hadoop.hbase.regionserver.wal.ProtobufLogReader.readNext(ProtobufLogReader.java:376) > ~[classes/:?] > at > org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:104) > ~[classes/:?] > at > org.apache.hadoop.hbase.regionserver.wal.ReaderBase.next(ReaderBase.java:92) > ~[classes/:?] > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.readNextEntryAndRecordReaderPosition(WALEntryStream.java:259) > ~[classes/:?] > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.tryAdvanceEntry(WALEntryStream.java:181) > ~[classes/:?] > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.hasNext(WALEntryStream.java:102) > ~[classes/:?] > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.peek(WALEntryStream.java:111) > ~[classes/:?] > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStream.next(WALEntryStream.java:118) > ~[classes/:?] > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.access$001(WALEntryStreamTestBase.java:82) > ~[test-classes/:?] > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.lambda$next$0(WALEntryStreamTestBase.java:95) > ~[test-classes/:?] > at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:184) > ~[test-classes/:?] > at org.apache.hadoop.hbase.Waiter.waitFor(Waiter.java:135) > ~[test-classes/:?] > at > org.apache.hadoop.hbase.replication.regionserver.WALEntryStreamTestBase$WALEntryStreamWithRetries.next(WALEntryStreamTestBase.java:94) > ~[test-classes/:?] > at > org.apache.hadoop.hbase.replication.regionserver.TestBasicWALEntryStream.testCleanClosedWALs(TestBasicWALEntryStream.java:727) > ~[test-classes/:?] > {noformat} > Which led me to this code: > [https://github.com/apache/hbase/blob/c319c6f0a316446342fd9a03ba72d14e8dc7cb36/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/wal/ProtobufLogReader.java#L372-L376] > The logic here seems to contradict the comment ("only rethrow EOF if it > indicates an EOF") and throw EOF even when it has reached the WAL trailer. > (Or maybe I misread the comment?) > If I change the code to _not_ throw the exception when the WAL trailer is > found (see [https://github.com/apache/hbase/pull/7176]), the test passes. > While I have a limited understanding of this area, this behavior seems > reasonable: if we have reached the trailer, there's no more entries, so we > can just return {{{}false{}}}. > Any thoughts? FYI, the test passes on branch-2. -- This message was sent by Atlassian Jira (v8.20.10#820010)