[ https://issues.apache.org/jira/browse/HBASE-29282?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17949747#comment-17949747 ]
Duo Zhang commented on HBASE-29282:
-----------------------------------
OK, I think this is the problem...
In TRSP, when updating meta through RegionStateStore, we use the current time
as the timestamp, but in RegionStateStore.mergeRegions, where we delete the
parent regions and insert the new region, we use HConstants.LATEST_TIMESTAMP.
I think the intention here is to delete everything about the parent regions,
but in fact the timestamp is replaced with the current time on the region
server side. So if the master's clock is a bit ahead of the clock of the
region server which holds meta, the delete gets a timestamp older than the
CLOSED state written by TRSP and does not cover it, and we fall into the
above scenario.
So the proper fix is to also use currentTime in RegionStateStore.mergeRegions.
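To make the clock skew case concrete, here is a minimal toy model of a single
parent region row in meta. It is only a sketch and not HBase code: ToyRow and
parentSurvivesMerge are hypothetical names made up for illustration, and the
only assumptions are that a delete marker hides cells whose timestamp is less
than or equal to its own, and that the server replaces LATEST_TIMESTAMP with
its own clock.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of one parent region row in hbase:meta. Illustration only, not HBase code. */
public class MergeTimestampSketch {
  // Stand-in for HConstants.LATEST_TIMESTAMP, which is Long.MAX_VALUE in HBase.
  static final long LATEST_TIMESTAMP = Long.MAX_VALUE;

  /** Hypothetical row: timestamps of puts and of delete markers. */
  static final class ToyRow {
    private final List<Long> puts = new ArrayList<>();
    private final List<Long> deletes = new ArrayList<>();

    // The server substitutes its own clock when the client sends LATEST_TIMESTAMP.
    void put(long ts, long serverNow) {
      puts.add(ts == LATEST_TIMESTAMP ? serverNow : ts);
    }

    void delete(long ts, long serverNow) {
      deletes.add(ts == LATEST_TIMESTAMP ? serverNow : ts);
    }

    // A cell stays visible unless some delete marker has a timestamp >= the cell's.
    boolean hasVisibleCell() {
      for (long p : puts) {
        boolean masked = false;
        for (long d : deletes) {
          if (d >= p) {
            masked = true;
            break;
          }
        }
        if (!masked) {
          return true;
        }
      }
      return false;
    }
  }

  /** Returns true if the parent region row is still visible after the merge. */
  static boolean parentSurvivesMerge(long masterNow, long metaServerNow,
      boolean useMasterTimeForDelete) {
    ToyRow parent = new ToyRow();
    // TRSP records state=CLOSED using the master's current time.
    parent.put(masterNow, metaServerNow);
    // mergeRegions then deletes the parent row.
    long deleteTs = useMasterTimeForDelete ? masterNow : LATEST_TIMESTAMP;
    parent.delete(deleteTs, metaServerNow);
    return parent.hasVisibleCell();
  }

  public static void main(String[] args) {
    long metaServerNow = 1_000_000L;
    long masterNow = metaServerNow + 50; // master clock is 50 ms ahead
    // Delete with LATEST_TIMESTAMP: stamped with the slower server clock.
    System.out.println(parentSurvivesMerge(masterNow, metaServerNow, false)); // true
    // Delete with the master's current time: covers the CLOSED state.
    System.out.println(parentSurvivesMerge(masterNow, metaServerNow, true)); // false
  }
}
```

With a 50 ms skew the first call reports that the parent row survives the
merge, which matches the regions left in CLOSED state, while the second call
shows the delete covering it once the master's time is used.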
Will open a PR and then start a new round of ITBLL with the fix in place to see
if the problem still occurs.
Thanks.
> Regions are left in CLOSED state after merging
> ----------------------------------------------
>
> Key: HBASE-29282
> URL: https://issues.apache.org/jira/browse/HBASE-29282
> Project: HBase
> Issue Type: Bug
> Components: proc-v2, Region Assignment
> Reporter: Duo Zhang
> Priority: Major
>
> When running ITBLL, some regions were left in CLOSED state for a long time
> and were finally cleaned up by CatalogJanitor.
> After checking, these regions had been merged, so they should have been
> removed from hbase:meta, but it seems they were still present in the
> hbase:meta table with CLOSED state.
> Need to dig more.
> {noformat}
> 2025-05-01T00:08:32,903 INFO [PEWorker-15] procedure2.ProcedureExecutor:
> Finished pid=3512, state=SUCCESS, hasLock=false; MergeTableRegionsProcedure
> table=IntegrationTestBigLinkedList,
> regions=[6a98dc86a491041b8d3ac584ac73c0a0, c9f07f77792feb0d8a845d6d9751f048],
> force=false in 734 msec
> 2025-05-01T00:11:26,333 WARN [master/meta02:16000.Chore.1]
> janitor.CatalogJanitor:
> overlap=IntegrationTestBigLinkedList,\x99\x99\x99\x99\x99\x99\x99\x99,1746028626716.6a98dc86a491041b8d3ac584ac73c0a0./IntegrationTestBigLinkedList,\x99\x99\x99\x99\x99\x99\x99\x99,1746028626717.24435a6eefc045cf36ddff9a30409ff1.,
>
> overlap=IntegrationTestBigLinkedList,\x99\x99\x99\x99\x99\x99\x99\x99,1746028626717.24435a6eefc045cf36ddff9a30409ff1./IntegrationTestBigLinkedList,\xA2!RV,1746028626716.c9f07f77792feb0d8a845d6d9751f048.
> 2025-05-01T00:41:40,856 WARN [master/meta02:16000.Chore.1]
> janitor.CatalogJanitor: 283c738f170f361157b470868f6ad89.,
> overlap=IntegrationTestBigLinkedList,\x91\x10\xA3\x07\x03\xAC\xC7\xC3\xCCY\xAE\xE4!1\xD1i,1746029042178.815020ca73a2679bc0c0a298e4dddfda./IntegrationTestBigLinkedList,\x91\x10\xA3\x07\x03\xAC\xC7\xC3\xCCY\xAE\xE4!1\xD1i,1746029042179.278a2eeee359488f859ac5334ee3cde0.,
>
> overlap=IntegrationTestBigLinkedList,\x91\x10\xA3\x07\x03\xAC\xC7\xC3\xCCY\xAE\xE4!1\xD1i,1746029042179.278a2eeee359488f859ac5334ee3cde0./IntegrationTestBigLinkedList,\x95U\x0D9}\xAB\xE1\x98\x80w\xED\xA7+\xF9\xA4\xED,1746029042178.b64120d20856552cd7d154b63bd2ce81.,
>
> overlap=IntegrationTestBigLinkedList,\x99\x99\x99\x99\x99\x99\x99\x99,1746028626716.6a98dc86a491041b8d3ac584ac73c0a0./IntegrationTestBigLinkedList,\x99\x99\x99\x99\x99\x99\x99\x99,1746028626717.24435a6eefc045cf36ddff9a30409ff1.,
>
> overlap=IntegrationTestBigLinkedList,\x99\x99\x99\x99\x99\x99\x99\x99,1746028626717.24435a6eefc045cf36ddff9a30409ff1./IntegrationTestBigLinkedList,\xA2!RV,1746028626716.c9f07f77792feb0d8a845d6d9751f048.
> 2025-05-01T00:42:00,853 INFO [PEWorker-12] procedure.FlushRegionProcedure:
> State of region {ENCODED => 6a98dc86a491041b8d3ac584ac73c0a0, NAME =>
> 'IntegrationTestBigLinkedList,\x99\x99\x99\x99\x99\x99\x99\x99,1746028626716.6a98dc86a491041b8d3ac584ac73c0a0.',
> STARTKEY => '\x99\x99\x99\x99\x99\x99\x99\x99', ENDKEY => '\xA2!RV'} is not
> OPEN or in transition. Skip pid=5810, ppid=5789, state=RUNNABLE,
> hasLock=true; org.apache.hadoop.hbase.master.procedure.FlushRegionProcedure
> ...
> 2025-05-01T00:44:32,339 INFO [PEWorker-3]
> procedure.MasterProcedureScheduler: Took xlock for pid=5964, ppid=5943,
> state=RUNNABLE, hasLock=false; SnapshotRegionProcedure
> 6a98dc86a491041b8d3ac584ac73c0a0
> 2025-05-01T00:44:32,340 WARN [PEWorker-3] procedure.SnapshotRegionProcedure:
> pid=5964, ppid=5943, state=RUNNABLE, hasLock=true; SnapshotRegionProcedure
> 6a98dc86a491041b8d3ac584ac73c0a0 can not run currently because region state
> of
> IntegrationTestBigLinkedList,\x99\x99\x99\x99\x99\x99\x99\x99,1746028626716.6a98dc86a491041b8d3ac584ac73c0a0.
> is CLOSED, wait 1000 ms to retry
> {noformat}
> {noformat}
> 2025-05-01 00:27:59,824 WARN [RPCClient-NioEventLoopGroup-1-2]
> org.apache.hadoop.hbase.client.AsyncNonMetaRegionLocator: Failed to locate
> region in 'IntegrationTestBigLinkedList',
> row='\xA6\x8B\x9E\xC1\xA98&K}g+7N/\xA1\x05', locateType=CURRENT
> org.apache.hadoop.hbase.HBaseIOException: No location found for
> 'IntegrationTestBigLinkedList', row='\xA6\x8B\x9E\xC1\xA98&K}g+7N/\xA1\x05',
> locateType=CURRENT
> at
> org.apache.hadoop.hbase.client.AsyncNonMetaRegionLocator.onScanNext(AsyncNonMetaRegionLocator.java:322)
> at
> org.apache.hadoop.hbase.client.AsyncNonMetaRegionLocator$1.onNext(AsyncNonMetaRegionLocator.java:437)
> at
> org.apache.hadoop.hbase.client.AsyncScanSingleRegionRpcRetryingCaller.onComplete(AsyncScanSingleRegionRpcRetryingCaller.java:535)
> at
> org.apache.hadoop.hbase.client.AsyncScanSingleRegionRpcRetryingCaller.start(AsyncScanSingleRegionRpcRetryingCaller.java:636)
> at
> org.apache.hadoop.hbase.client.AsyncRpcRetryingCallerFactory$ScanSingleRegionCallerBuilder.start(AsyncRpcRetryingCallerFactory.java:322)
> at
> org.apache.hadoop.hbase.client.AsyncClientScanner.startScan(AsyncClientScanner.java:208)
> at
> org.apache.hadoop.hbase.client.AsyncClientScanner.lambda$openScanner$2(AsyncClientScanner.java:268)
> at
> org.apache.hadoop.hbase.util.FutureUtils.lambda$addListener$0(FutureUtils.java:71)
> at
> java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
> at
> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
> at
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
> at
> java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
> at
> org.apache.hadoop.hbase.client.AsyncSingleRequestRpcRetryingCaller.lambda$call$4(AsyncSingleRequestRpcRetryingCaller.java:92)
> at
> org.apache.hadoop.hbase.util.FutureUtils.lambda$addListener$0(FutureUtils.java:71)
> at
> java.base/java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863)
> at
> java.base/java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841)
> at
> java.base/java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510)
> at
> java.base/java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147)
> at
> org.apache.hadoop.hbase.client.AsyncClientScanner.lambda$callOpenScanner$0(AsyncClientScanner.java:187)
> at
> org.apache.hbase.thirdparty.com.google.protobuf.RpcUtil$1.run(RpcUtil.java:56)
> at
> org.apache.hbase.thirdparty.com.google.protobuf.RpcUtil$1.run(RpcUtil.java:47)
> at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient.onCallFinished(AbstractRpcClient.java:400)
> at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:430)
> at
> org.apache.hadoop.hbase.ipc.AbstractRpcClient$3.run(AbstractRpcClient.java:425)
> at org.apache.hadoop.hbase.ipc.Call.callComplete(Call.java:117)
> at org.apache.hadoop.hbase.ipc.Call.setResponse(Call.java:149)
> at
> org.apache.hadoop.hbase.ipc.RpcConnection.finishCall(RpcConnection.java:396)
> at
> org.apache.hadoop.hbase.ipc.RpcConnection.readResponse(RpcConnection.java:461)
> at
> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.readResponse(NettyRpcDuplexHandler.java:125)
> at
> org.apache.hadoop.hbase.ipc.NettyRpcDuplexHandler.channelRead(NettyRpcDuplexHandler.java:140)
> at
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
> at
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:346)
> at
> org.apache.hbase.thirdparty.io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:318)
> at
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
> at
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at
> org.apache.hbase.thirdparty.io.netty.handler.timeout.IdleStateHandler.channelRead(IdleStateHandler.java:289)
> at
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:442)
> at
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
> at
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1357)
> at
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
> at
> org.apache.hbase.thirdparty.io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
> at
> org.apache.hbase.thirdparty.io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:868)
> at
> org.apache.hbase.thirdparty.io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
> at
> org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
> at
> org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:724)
> at
> org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:650)
> at
> org.apache.hbase.thirdparty.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
> at
> org.apache.hbase.thirdparty.io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
> at
> org.apache.hbase.thirdparty.io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
> at
> org.apache.hbase.thirdparty.io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.base/java.lang.Thread.run(Thread.java:840)
> {noformat}