hgromer commented on PR #7084:
URL: https://github.com/apache/hbase/pull/7084#issuecomment-3027898590

   The log line indicates that the server is in a SPLITTING state, not the 
region. The SnapshotProcedure will check 
[here](https://github.com/apache/hbase/blob/3bbed010622708c95229687249b88559a281ef9f/hbase-server/src/main/java/org/apache/hadoop/hbase/master/procedure/SnapshotRegionProcedure.java#L181)
 if the target server is online and suspend the procedure if it isn't.
   
   We were in a tricky scenario where a test cluster was trying to run several 
SCPs (ServerCrashProcedure), and those SCPs were essentially blocking the 
SnapshotProcedure. I had thought the SnapshotProcedure was preventing the SCPs 
from proceeding, but it turns out the SCPs were failing while running a 
SplitWALRemoteProcedure due to memory issues:
   
   ```
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster 2025-07-01T17:01:38,573 
[KeepAlivePEWorker-15] WARN 
org.apache.hadoop.hbase.master.procedure.SplitWALRemoteProcedure: Failed split 
of 
hdfs://sandbox-hb2-a-qa:8020/hbase/WALs/na1-broad-grand-falcon.iad03.hubinternal.net,60020,1751299239767-splitting/na1-broad-grand-falcon.iad03.hubinternal.net%2C60020%2C1751299239767.1751381096903,
 retry...
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster 
org.apache.hadoop.hbase.procedure2.RemoteProcedureException: 
java.lang.Exception: Cannot reserve 262144 bytes of direct buffer memory 
(allocated: 1879042125, limit: 1879048192)
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.procedure2.RemoteProcedureException.fromProto(RemoteProcedureException.java:123)
 ~[hbase-procedure-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.master.MasterRpcServices.lambda$reportProcedureDone$4(MasterRpcServices.java:2573)
 ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
java.util.ArrayList.forEach(ArrayList.java:1596) ~[?:?]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1116) 
~[?:?]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.master.MasterRpcServices.reportProcedureDone(MasterRpcServices.java:2568)
 ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:16726)
 ~[hbase-protocol-shaded-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:443) 
~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124) 
~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:105) 
~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:85) 
~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster Caused by: 
java.lang.Exception: Cannot reserve 262144 bytes of direct buffer memory 
(allocated: 1879042125, limit: 1879048192)
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
java.nio.Bits.reserveMemory(Bits.java:178) ~[?:?]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:111) ~[?:?]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:360) ~[?:?]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.io.compress.zstd.ZstdDecompressor.<init>(ZstdDecompressor.java:46)
 ~[hbase-compression-zstd-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.io.compress.zstd.ZstdCodec.createDecompressor(ZstdCodec.java:94)
 ~[hbase-compression-zstd-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.io.compress.CodecPool.getDecompressor(CodecPool.java:166)
 ~[hbase-common-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.io.compress.Compression$Algorithm.getDecompressor(Compression.java:487)
 ~[hbase-common-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.regionserver.wal.CompressionContext$ValueCompressor.decompress(CompressionContext.java:119)
 ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.regionserver.wal.WALCellCodec$CompressedKvDecoder.readCompressedValue(WALCellCodec.java:385)
 ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.regionserver.wal.WALCellCodec$CompressedKvDecoder.parseCell(WALCellCodec.java:336)
 ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.codec.BaseDecoder.advance(BaseDecoder.java:66) 
~[hbase-common-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.wal.WALEdit.readFromCells(WALEdit.java:281) 
~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufWALStreamReader.next(ProtobufWALStreamReader.java:84)
 ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.wal.WALStreamReader.next(WALStreamReader.java:42) 
~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.wal.WALSplitter.getNextLogLine(WALSplitter.java:490) 
~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.wal.WALSplitter.splitWAL(WALSplitter.java:319) 
~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.wal.WALSplitter.splitLogFile(WALSplitter.java:200) 
~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.regionserver.SplitLogWorker.splitLog(SplitLogWorker.java:108)
 ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.regionserver.SplitWALCallable.doCall(SplitWALCallable.java:86)
 ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.procedure2.BaseRSProcedureCallable.call(BaseRSProcedureCallable.java:35)
 ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.procedure2.BaseRSProcedureCallable.call(BaseRSProcedureCallable.java:23)
 ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:56)
 ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   hmaster-sandbox-hb2-a-1-557d64787b-9cgkh hmaster     at 
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104) 
~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
   ```
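
For reference, the numbers in that exception can be checked by hand. The snippet below is only a simplified model of the reservation check, not HBase or JDK code: the class and method names are made up for illustration, and the three constants are copied from the log above. The limit of 1879048192 bytes is exactly 1792 MiB, which is presumably whatever the RegionServer JVM was given for direct memory (e.g. via `-XX:MaxDirectMemorySize`, though that is an assumption about this deployment).

```java
// Hypothetical helper, not part of HBase or the JDK: it re-does the arithmetic
// behind the "Cannot reserve ... bytes of direct buffer memory" message.
final class DirectBufferHeadroom {

    // Bytes still available under the direct memory limit.
    static long headroom(long limit, long allocated) {
        return limit - allocated;
    }

    // True when a request of the given size still fits under the limit.
    static boolean canReserve(long requested, long limit, long allocated) {
        return requested <= headroom(limit, allocated);
    }

    public static void main(String[] args) {
        long requested = 262_144L;        // from the log: 256 KiB for the zstd decompressor buffer
        long allocated = 1_879_042_125L;  // from the log
        long limit = 1_879_048_192L;      // from the log: 1792 MiB

        System.out.println("headroom bytes: " + headroom(limit, allocated));
        System.out.println("can reserve: " + canReserve(requested, limit, allocated));
    }
}
```

In other words, only 6067 bytes of direct memory were left when the `ZstdDecompressor` asked for 256 KiB, so the pool was effectively exhausted before this particular allocation was attempted.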
   
   I'm still trying to track down exactly why, but the entire host's memory was 
being eaten up by RecoveredEditsOutputSink#append. We did have a 
disproportionately high number of regions per host on the cluster, which was 
likely the cause.
   
   Importantly, the SnapshotProcedure was not the cause of these stuck SCPs.
   


