[ https://issues.apache.org/jira/browse/HBASE-28932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rushabh Shah updated HBASE-28932:
---------------------------------
    Description: 
The RS kept on running even though it was unable to write the replication marker for 5 minutes. But this issue is not specific to just the replication marker; it applies to the compaction marker as well as the region event markers (like open, close).

Sample exception trace:
{noformat}
2024-10-09 10:12:21,659 ERROR [regionserver/regionserver-33:60020.Chore.3] regionserver.ReplicationMarkerChore - Exception while sync'ing replication tracker edit
org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for txid=15030132, WAL system stuck?
        at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:171)
        at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:876)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.publishSyncThenBlockOnCompletion(FSHLog.java:802)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.doSync(FSHLog.java:836)
        at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.lambda$sync$3(AbstractFSWAL.java:602)
        at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:187)
        at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.sync(AbstractFSWAL.java:602)
        at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.sync(AbstractFSWAL.java:592)
        at org.apache.hadoop.hbase.regionserver.wal.WALUtil.doFullMarkerAppendTransaction(WALUtil.java:169)
        at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeMarker(WALUtil.java:146)
        at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeReplicationMarkerAndSync(WALUtil.java:230)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationMarkerChore.chore(ReplicationMarkerChore.java:99)
        at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:161)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:107)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
{noformat}
In this case there was a NameNode crash, and all the HDFS clients on the regionservers did not fail over to the new active NameNode due to our inefficient failover configuration parameters.
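A minimal sketch of the direction the summary proposes: abort the RS (via Abortable) when an internal-marker sync times out, instead of only logging and carrying on. The MarkerSyncChore class, the MarkerWriter helper, and the constructor wiring below are hypothetical names introduced purely for illustration; this is not the actual HBase code or a proposed patch.
{code:java}
// Illustrative sketch only -- not the HBASE-28932 patch. Shows how a marker-writing
// chore could abort the RegionServer instead of merely logging when the WAL sync of
// an internal marker times out. MarkerSyncChore and MarkerWriter are hypothetical.
import java.io.IOException;

import org.apache.hadoop.hbase.Abortable;
import org.apache.hadoop.hbase.ScheduledChore;
import org.apache.hadoop.hbase.Stoppable;
import org.apache.hadoop.hbase.exceptions.TimeoutIOException;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class MarkerSyncChore extends ScheduledChore {
  private static final Logger LOG = LoggerFactory.getLogger(MarkerSyncChore.class);

  /** Hypothetical abstraction over "append the marker edit to the WAL and block on sync". */
  public interface MarkerWriter {
    void writeAndSync() throws IOException;
  }

  private final Abortable abortable;       // typically the HRegionServer itself
  private final MarkerWriter markerWriter;

  public MarkerSyncChore(String name, Stoppable stopper, int periodMillis,
      Abortable abortable, MarkerWriter markerWriter) {
    super(name, stopper, periodMillis);
    this.abortable = abortable;
    this.markerWriter = markerWriter;
  }

  @Override
  protected void chore() {
    try {
      // Replication / compaction / region-event marker append + sync.
      markerWriter.writeAndSync();
    } catch (TimeoutIOException e) {
      // The WAL system is stuck (sync did not complete within the timeout):
      // abort the RS rather than keep running with an unusable WAL.
      abortable.abort("WAL sync of internal marker timed out", e);
    } catch (IOException e) {
      // Other transient failures: keep the current log-and-retry behavior.
      LOG.error("Failed to write internal marker, will retry on next chore run", e);
    }
  }
}
{code}
The same kind of guard could be applied wherever compaction markers and region event markers are appended and synced through WALUtil.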

was:
The RS kept on running even though it was unable to write the replication marker for 5 minutes. But this issue is not specific to just the replication marker; it applies to the compaction marker as well as the region event markers (like open, close).

Sample exception trace:
{noformat}
2024-10-09 10:12:21,659 ERROR [regionserver/regionserver-33:60020.Chore.3] regionserver.ReplicationMarkerChore - Exception while sync'ing replication tracker edit
org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for txid=15030132, WAL system stuck?
        at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:171)
        at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:876)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.publishSyncThenBlockOnCompletion(FSHLog.java:802)
        at org.apache.hadoop.hbase.regionserver.wal.FSHLog.doSync(FSHLog.java:836)
        at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.lambda$sync$3(AbstractFSWAL.java:602)
        at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:187)
        at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.sync(AbstractFSWAL.java:602)
        at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.sync(AbstractFSWAL.java:592)
        at org.apache.hadoop.hbase.regionserver.wal.WALUtil.doFullMarkerAppendTransaction(WALUtil.java:169)
        at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeMarker(WALUtil.java:146)
        at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeReplicationMarkerAndSync(WALUtil.java:230)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationMarkerChore.chore(ReplicationMarkerChore.java:99)
        at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:161)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
        at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
        at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
        at org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:107)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
{noformat}


> Abort RS if unable to sync internal markers.
> --------------------------------------------
>
>                 Key: HBASE-28932
>                 URL: https://issues.apache.org/jira/browse/HBASE-28932
>             Project: HBase
>          Issue Type: Bug
>          Components: wal
>    Affects Versions: 2.5.8
>            Reporter: Rushabh Shah
>            Priority: Major
>
> The RS kept on running even though it was unable to write the replication marker for 5 minutes. But this issue is not specific to just the replication marker; it applies to the compaction marker as well as the region event markers (like open, close).
> Sample exception trace:
> {noformat}
> 2024-10-09 10:12:21,659 ERROR [regionserver/regionserver-33:60020.Chore.3] regionserver.ReplicationMarkerChore - Exception while sync'ing replication tracker edit
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Failed to get sync result after 300000 ms for txid=15030132, WAL system stuck?
>         at org.apache.hadoop.hbase.regionserver.wal.SyncFuture.get(SyncFuture.java:171)
>         at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.blockOnSync(AbstractFSWAL.java:876)
>         at org.apache.hadoop.hbase.regionserver.wal.FSHLog.publishSyncThenBlockOnCompletion(FSHLog.java:802)
>         at org.apache.hadoop.hbase.regionserver.wal.FSHLog.doSync(FSHLog.java:836)
>         at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.lambda$sync$3(AbstractFSWAL.java:602)
>         at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:187)
>         at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.sync(AbstractFSWAL.java:602)
>         at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.sync(AbstractFSWAL.java:592)
>         at org.apache.hadoop.hbase.regionserver.wal.WALUtil.doFullMarkerAppendTransaction(WALUtil.java:169)
>         at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeMarker(WALUtil.java:146)
>         at org.apache.hadoop.hbase.regionserver.wal.WALUtil.writeReplicationMarkerAndSync(WALUtil.java:230)
>         at org.apache.hadoop.hbase.replication.regionserver.ReplicationMarkerChore.chore(ReplicationMarkerChore.java:99)
>         at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:161)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
>         at org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:107)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:750)
> {noformat}
> In this case there was a NameNode crash, and all the HDFS clients on the regionservers did not fail over to the new active NameNode due to our inefficient failover configuration parameters.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)