Lynn Gallinat created GEODE-3940:
------------------------------------

             Summary: Backup can hang while trying to get a lock 
                 Key: GEODE-3940
                 URL: https://issues.apache.org/jira/browse/GEODE-3940
             Project: Geode
          Issue Type: Bug
          Components: persistence
            Reporter: Lynn Gallinat


{noformat}

Backup can hang when createKrf cannot get the compactor lock.
"Pooled Message Processor 2" #196 daemon prio=10 os_prio=0 
tid=0x00007fef9801e000 nid=0x3cf3 waiting on condition [0x00007ff094cd5000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000f1a72c60> (a 
java.util.concurrent.locks.ReentrantLock$NonfairSync)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at 
java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
        at org.apache.geode.internal.cache.Oplog.lockCompactor(Oplog.java:6046) 
   <============ trying to get compactor lock
        at org.apache.geode.internal.cache.Oplog.createKrf(Oplog.java:4157)     
   <============ in createKrf
        at org.apache.geode.internal.cache.Oplog.finishKrf(Oplog.java:7746)
        at 
org.apache.geode.internal.cache.BackupManager.backupOplog(BackupManager.java:580)
        at 
org.apache.geode.internal.cache.BackupManager.completeBackup(BackupManager.java:270)
        at 
org.apache.geode.internal.cache.BackupManager.doBackup(BackupManager.java:139)
        at 
org.apache.geode.admin.internal.FinishBackupRequest.createResponse(FinishBackupRequest.java:102)
        at 
org.apache.geode.internal.admin.remote.CliLegacyMessage.process(CliLegacyMessage.java:39)
        at 
org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374)
        at 
org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:440)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at 
org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:666)
        at 
org.apache.geode.distributed.internal.DistributionManager$4$1.run(DistributionManager.java:923)
        at java.lang.Thread.run(Thread.java:748)


The compactor thread already has the compactor lock.
It is waiting for a DiskEntry lock.

"Idle OplogCompactor" #379 daemon prio=10 os_prio=0 tid=0x00007ff020026000 
nid=0x6916 waiting for monitor entry [0x00007ff08c7ce000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at 
org.apache.geode.internal.cache.Oplog.writeOneKeyEntryForKRF(Oplog.java:3924)
        - waiting to lock <0x00000000f1c55c70> (a 
org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapStringKey2)
  <====== waiting for DiskEntry lock
        at org.apache.geode.internal.cache.Oplog.createKrf(Oplog.java:4201)     
     <======= already in createKrf
        - locked <0x00000000f1a72c30> (a 
java.util.concurrent.atomic.AtomicBoolean)
        at org.apache.geode.internal.cache.Oplog$2.run(Oplog.java:3875)
        at 
org.apache.geode.internal.cache.DiskStoreImpl$5.run(DiskStoreImpl.java:4386)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

The DiskEntry lock is held by this thread and is waiting for replies:
"PartitionedRegion Message Processor26" #238 daemon prio=10 os_prio=0 
tid=0x00007fef9c01f800 nid=0x3dd0 waiting on condition [0x00007ff08ebe9000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000f1e42200> (a 
java.util.concurrent.CountDownLatch$Sync)
        at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
        at 
org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:64)
        at 
org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:718)
        at 
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:795)
        at 
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:771)
        at 
org.apache.geode.distributed.internal.ReplyProcessor21.waitForRepliesUninterruptibly(ReplyProcessor21.java:858)
        at 
org.apache.geode.internal.cache.DistributedCacheOperation.waitForAckIfNeeded(DistributedCacheOperation.java:761)
        at 
org.apache.geode.internal.cache.DistributedCacheOperation._distribute(DistributedCacheOperation.java:658)
        at 
org.apache.geode.internal.cache.DistributedCacheOperation.startOperation(DistributedCacheOperation.java:264)
        at 
org.apache.geode.internal.cache.BucketRegion.basicPutPart2(BucketRegion.java:665)
        at 
org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2839)
        - locked <0x00000000f1c55c70> (a 
org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapStringKey2)
  <======= has the DiskEntryLock
        at 
org.apache.geode.internal.cache.BucketRegion.virtualPut(BucketRegion.java:502)
        at 
org.apache.geode.internal.cache.PartitionedRegionDataStore.putLocally(PartitionedRegionDataStore.java:1222)
        at 
org.apache.geode.internal.cache.PartitionedRegionDataStore.putLocally(PartitionedRegionDataStore.java:1205)
        at 
org.apache.geode.internal.cache.PartitionedRegionDataView.putEntryOnRemote(PartitionedRegionDataView.java:99)
        at 
org.apache.geode.internal.cache.partitioned.PutMessage.operateOnPartitionedRegion(PutMessage.java:744)
        at 
org.apache.geode.internal.cache.partitioned.PartitionMessage.process(PartitionMessage.java:333)
        at 
org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374)
        at 
org.apache.geode.distributed.internal.DistributionMessage$1.run(DistributionMessage.java:440)
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at 
org.apache.geode.distributed.internal.DistributionManager.runUntilShutdown(DistributionManager.java:666)
        at 
org.apache.geode.distributed.internal.DistributionManager$8$1.run(DistributionManager.java:1069)
        at java.lang.Thread.run(Thread.java:748)

The member replying can't because it's waiting for the backup lock.
"P2P message reader for 
rs-CommunicationsBTTest-2017-10-27-08-42-54-client-8(gemfire3_rs-CommunicationsBTTest-2017-10-27-08-42-54-client-8_14506:14506)<ec><v1>:1028
 shared ordered uid=6 port=40404" #152 daemon prio=10 os_prio=0 
tid=0x00007ff91800d800 nid=0x3bf3 waiting on condition [0x00007ffa29d4a000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000f0b63ad0> (a 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitUninterruptibly(AbstractQueuedSynchronizer.java:1976)
        at org.apache.geode.internal.cache.BackupLock.lock(BackupLock.java:95)
        at org.apache.geode.internal.cache.Oplog.basicModify(Oplog.java:4612)
        at org.apache.geode.internal.cache.Oplog.modify(Oplog.java:4450)
        at 
org.apache.geode.internal.cache.PersistentOplogSet.modify(PersistentOplogSet.java:188)
        at 
org.apache.geode.internal.cache.DiskStoreImpl.put(DiskStoreImpl.java:730)
        at org.apache.geode.internal.cache.DiskRegion.put(DiskRegion.java:351)
        at 
org.apache.geode.internal.cache.entries.DiskEntry$Helper.writeBytesToDisk(DiskEntry.java:828)
        at 
org.apache.geode.internal.cache.entries.DiskEntry$Helper.basicUpdate(DiskEntry.java:932)
        at 
org.apache.geode.internal.cache.entries.DiskEntry$Helper.update(DiskEntry.java:859)
        - locked <0x00000000f1610f70> (a 
org.apache.geode.internal.cache.DiskId$PersistenceWithIntOffset)
        at 
org.apache.geode.internal.cache.entries.AbstractDiskRegionEntry.setValue(AbstractDiskRegionEntry.java:40)
        at 
org.apache.geode.internal.cache.entries.AbstractRegionEntry.setValueWithTombstoneCheck(AbstractRegionEntry.java:307)
        at 
org.apache.geode.internal.cache.EntryEventImpl.setNewValueInRegion(EntryEventImpl.java:1651)
        at 
org.apache.geode.internal.cache.EntryEventImpl.putExistingEntry(EntryEventImpl.java:1527)
        at 
org.apache.geode.internal.cache.AbstractRegionMap.updateEntry(AbstractRegionMap.java:2993)
        at 
org.apache.geode.internal.cache.AbstractRegionMap.basicPut(AbstractRegionMap.java:2816)
        - locked <0x00000000f1610f28> (a 
org.apache.geode.internal.cache.entries.VersionedThinDiskRegionEntryOffHeapStringKey2)
        at 
org.apache.geode.internal.cache.BucketRegion.virtualPut(BucketRegion.java:502)
        at 
org.apache.geode.internal.cache.LocalRegionDataView.putEntry(LocalRegionDataView.java:152)
        at 
org.apache.geode.internal.cache.LocalRegion.basicUpdate(LocalRegion.java:5584)
        at 
org.apache.geode.internal.cache.AbstractUpdateOperation.doPutOrCreate(AbstractUpdateOperation.java:165)
        at 
org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.basicOperateOnRegion(AbstractUpdateOperation.java:272)
        at 
org.apache.geode.internal.cache.AbstractUpdateOperation$AbstractUpdateMessage.operateOnRegion(AbstractUpdateOperation.java:243)
        at 
org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.basicProcess(DistributedCacheOperation.java:1190)
        at 
org.apache.geode.internal.cache.DistributedCacheOperation$CacheOperationMessage.process(DistributedCacheOperation.java:1091)
        at 
org.apache.geode.distributed.internal.DistributionMessage.scheduleAction(DistributionMessage.java:374)
        at 
org.apache.geode.distributed.internal.DistributionMessage.schedule(DistributionMessage.java:432)
        at 
org.apache.geode.distributed.internal.DistributionManager.scheduleIncomingMessage(DistributionManager.java:3552)
        at 
org.apache.geode.distributed.internal.DistributionManager.handleIncomingDMsg(DistributionManager.java:3186)
        at 
org.apache.geode.distributed.internal.DistributionManager$MyListener.messageReceived(DistributionManager.java:4361)
        at 
org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.dispatchMessage(GMSMembershipManager.java:1127)
        at 
org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager.handleOrDeferMessage(GMSMembershipManager.java:1045)
        at 
org.apache.geode.distributed.internal.membership.gms.mgr.GMSMembershipManager$MyDCReceiver.messageReceived(GMSMembershipManager.java:408)
        at 
org.apache.geode.distributed.internal.direct.DirectChannel.receive(DirectChannel.java:714)
        at 
org.apache.geode.internal.tcp.TCPConduit.messageReceived(TCPConduit.java:874)
        at 
org.apache.geode.internal.tcp.Connection.dispatchMessage(Connection.java:3966)
        at 
org.apache.geode.internal.tcp.Connection.processNIOBuffer(Connection.java:3552)
        at 
org.apache.geode.internal.tcp.Connection.runNioReader(Connection.java:1828)
        at org.apache.geode.internal.tcp.Connection.run(Connection.java:1689)
        at java.lang.Thread.run(Thread.java:748)

Note that the replying member is also the member that initiated the backup:
"vm_4_thr_43_client5_rs-CommunicationsBTTest-2017-10-27-08-42-54-client-8_14524"
 #480 daemon prio=5 os_prio=0 tid=0x00007ff99c0b2800 nid=0x6911 waiting on 
condition [0x00007ffa208cf000]
   java.lang.Thread.State: TIMED_WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00000000f214ef08> (a 
java.util.concurrent.CountDownLatch$Sync)
        at 
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedNanos(AbstractQueuedSynchronizer.java:1037)
        at 
java.util.concurrent.locks.AbstractQueuedSynchronizer.tryAcquireSharedNanos(AbstractQueuedSynchronizer.java:1328)
        at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:277)
        at 
org.apache.geode.internal.util.concurrent.StoppableCountDownLatch.await(StoppableCountDownLatch.java:77)
        at 
org.apache.geode.distributed.internal.ReplyProcessor21.basicWait(ReplyProcessor21.java:694)
        at 
org.apache.geode.distributed.internal.ReplyProcessor21.waitForReplies(ReplyProcessor21.java:644)
        at 
org.apache.geode.distributed.internal.ReplyProcessor21.waitForReplies(ReplyProcessor21.java:624)
        at 
org.apache.geode.distributed.internal.ReplyProcessor21.waitForReplies(ReplyProcessor21.java:519)
        at 
org.apache.geode.admin.internal.FinishBackupRequest.send(FinishBackupRequest.java:80)
        at 
org.apache.geode.admin.internal.BackupDataStoreHelper.backupAllMembers(BackupDataStoreHelper.java:47)
        at 
org.apache.geode.internal.cache.BackupUtil.backupAllMembers(BackupUtil.java:50)
        at 
org.apache.geode.admin.internal.AdminDistributedSystemImpl.backupAllMembers(AdminDistributedSystemImpl.java:2315)
        at 
org.apache.geode.admin.internal.AdminDistributedSystemImpl.backupAllMembers(AdminDistributedSystemImpl.java:2310)
        at 
org.apache.geode.admin.internal.AdminDistributedSystemImpl.backupAllMembers(AdminDistributedSystemImpl.java:2299)
        at util.PersistenceUtil.doOnlineBackup(PersistenceUtil.java:627)
        at parReg.ParRegTest.doConcOpsAndVerify(ParRegTest.java:1822)
        at parReg.ParRegTest.HydraTask_doConcOpsAndVerify(ParRegTest.java:1011)
        at sun.reflect.GeneratedMethodAccessor401.invoke(Unknown Source)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at hydra.MethExecutor.execute(MethExecutor.java:182)
        at hydra.MethExecutor.execute(MethExecutor.java:150)
        at hydra.TestTask.execute(TestTask.java:191)
        at hydra.RemoteTestModule$1.run(RemoteTestModule.java:212)

The replying member is still holding the backup lock because it waits for all 
other members to reply to the FinishBackupRequest message
before it processes its backup and releases the lock:
  From FinishBackupRequest.java
  public static Map<DistributedMember, Set<PersistentID>> send(DM dm, Set 
recipients,
      File targetDir, File baselineDir, boolean abort) {
    FinishBackupRequest request = new FinishBackupRequest(targetDir, 
baselineDir, abort);
    request.setRecipients(recipients);

    FinishBackupReplyProcessor replyProcessor = new 
FinishBackupReplyProcessor(dm, recipients);
    request.msgId = replyProcessor.getProcessorId();
    dm.putOutgoing(request);
    try {
      replyProcessor.waitForReplies();   <======= wait for all other members 
(but they won't finish until this member does its work; this is where we are 
stuck)
    } catch (ReplyException e) {
      if (!(e.getCause() instanceof CancelException)) {
        throw e;
      }
    } catch (InterruptedException e) {
      e.printStackTrace();
    }
    AdminResponse response = request.createResponse((DistributionManager) dm);  
 <====== this and the following lines do this member's work
    response.setSender(dm.getDistributionManagerId());
    replyProcessor.process(response);
    return replyProcessor.results;
  }

{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to