Hi All - we're running Apache Solr Cloud 5.2.1 on an HDFS cluster that is 86% full, and some of the datanodes are considerably closer to full than others. Our indexing process is logging "Error adding log" messages, which I *think* are related to the nearly full datanodes. Could that be the case, even though HDFS as a whole still has room available?
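For what it's worth, overall HDFS capacity matters less than the fullness of the individual datanodes holding a block's replicas, so it may be worth checking per-node "DFS Used%" from `hdfs dfsadmin -report`. A quick sketch of that check (the sample report text below is fabricated for illustration; the hostnames and addresses are made up):

```python
import re

# Fabricated excerpt shaped like `hdfs dfsadmin -report` output;
# in practice you would feed in the real report text.
SAMPLE_REPORT = """\
Name: 172.16.100.11:50010 (dn1.example.com)
DFS Used%: 96.40%
Name: 172.16.100.12:50010 (dn2.example.com)
DFS Used%: 71.05%
"""

def near_full_datanodes(report: str, threshold: float = 90.0):
    """Return (datanode, used_pct) pairs whose DFS Used% >= threshold."""
    nodes = []
    name = None
    for line in report.splitlines():
        if line.startswith("Name:"):
            name = line.split("Name:", 1)[1].strip()
        m = re.match(r"DFS Used%:\s*([\d.]+)%", line)
        if m and name is not None:
            nodes.append((name, float(m.group(1))))
            name = None
    return [(n, p) for n, p in nodes if p >= threshold]

print(near_full_datanodes(SAMPLE_REPORT))
# → [('172.16.100.11:50010 (dn1.example.com)', 96.4)]
```

If some nodes really are near capacity while others have room, running the HDFS balancer (`hdfs balancer`) to even out block placement may help, independent of whatever is causing the tlog errors.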
Upon restarting Solr Cloud, we see messages such as:

ERROR - 2016-01-28 22:15:27.594; [ UNCLASS] org.apache.solr.common.SolrException; Failure to open existing log file (non fatal) hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0000000000000282501:org.apache.solr.common.SolrException: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): failed to create file /solr5.2/UNCLASS/core_node14/data/tlog/tlog.0000000000000282501 for DFSClient_NONMAPREDUCE_1371064795_29 for client 172.16.100.218 because current leaseholder is trying to recreate file.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:3075)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2905)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:3186)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:3149)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:611)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.append(AuthorizationProviderProxyClientProtocol.java:124)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:416)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
    at org.apache.solr.update.HdfsTransactionLog.<init>(HdfsTransactionLog.java:131)
    at org.apache.solr.update.HdfsUpdateLog.init(HdfsUpdateLog.java:193)
    at org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:136)
    at org.apache.solr.update.UpdateHandler.<init>(UpdateHandler.java:94)
    at org.apache.solr.update.DirectUpdateHandler2.<init>(DirectUpdateHandler2.java:99)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
    at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
    at org.apache.solr.core.SolrCore.createInstance(SolrCore.java:573)
    at org.apache.solr.core.SolrCore.createUpdateHandler(SolrCore.java:635)
    at org.apache.solr.core.SolrCore.initUpdateHandler(SolrCore.java:928)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:786)
    at org.apache.solr.core.SolrCore.<init>(SolrCore.java:658)
    at org.apache.solr.core.CoreContainer.create(CoreContainer.java:637)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:381)
    at org.apache.solr.core.CoreContainer$1.call(CoreContainer.java:375)
    at java.util.concurrent.FutureTask.run(FutureTask.java:262)
    at org.apache.solr.common.util.ExecutorUtil$MDCAwareThreadPoolExecutor$1.run(ExecutorUtil.java:148)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.protocol.AlreadyBeingCreatedException): failed to create file /solr5.2/UNCLASS/core_node14/data/tlog/tlog.0000000000000282501 for DFSClient_NONMAPREDUCE_1371064795_29 for client 172.16.100.218 because current leaseholder is trying to recreate file.
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.recoverLeaseInternal(FSNamesystem.java:3075)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInternal(FSNamesystem.java:2905)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFileInt(FSNamesystem.java:3186)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.appendFile(FSNamesystem.java:3149)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.append(NameNodeRpcServer.java:611)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.append(AuthorizationProviderProxyClientProtocol.java:124)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.append(ClientNamenodeProtocolServerSideTranslatorPB.java:416)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1060)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2086)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2082)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2080)
    at org.apache.hadoop.ipc.Client.call(Client.java:1472)
    at org.apache.hadoop.ipc.Client.call(Client.java:1403)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
    at com.sun.proxy.$Proxy11.append(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.append(ClientNamenodeProtocolTranslatorPB.java:313)
    at sun.reflect.GeneratedMethodAccessor10.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:252)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
    at com.sun.proxy.$Proxy12.append(Unknown Source)
    at org.apache.hadoop.hdfs.DFSClient.callAppend(DFSClient.java:1842)
    at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1878)
    at org.apache.hadoop.hdfs.DFSClient.append(DFSClient.java:1871)
    at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:329)
    at org.apache.hadoop.hdfs.DistributedFileSystem$4.doCall(DistributedFileSystem.java:325)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.append(DistributedFileSystem.java:325)
    at org.apache.hadoop.fs.FileSystem.append(FileSystem.java:1172)
    at org.apache.solr.update.HdfsTransactionLog.<init>(HdfsTransactionLog.java:94)
    ... 21 more

That error is then followed by a long series of lease-recovery attempts on each tlog file:

INFO - 2016-01-28 22:15:27.639; [ UNCLASS] org.apache.solr.util.FSHDFSUtils; Recovering lease on dfs file hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0000000000000282933
INFO - 2016-01-28 22:15:27.698; [ UNCLASS] org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=0 on file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0000000000000282933 after 58ms
INFO - 2016-01-28 22:15:31.821; [ UNCLASS] org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=1 on file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0000000000000282933 after 4181ms
INFO - 2016-01-28 22:16:32.971; [ UNCLASS] org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=2 on file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0000000000000282933 after 65331ms
INFO - 2016-01-28 22:17:34.638; [ UNCLASS] org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=3 on file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0000000000000282933 after 126998ms
INFO - 2016-01-28 22:18:35.764; [ UNCLASS] org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=4 on file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0000000000000282933 after 188124ms
INFO - 2016-01-28 22:19:37.114; [ UNCLASS] org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=5 on file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0000000000000282933 after 249474ms
INFO - 2016-01-28 22:20:38.629; [ UNCLASS] org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=6 on file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0000000000000282933 after 310989ms
INFO - 2016-01-28 22:21:39.751; [ UNCLASS] org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=7 on file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0000000000000282933 after 372111ms
INFO - 2016-01-28 22:22:40.854; [ UNCLASS] org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=8 on file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0000000000000282933 after 433214ms
INFO - 2016-01-28 22:23:41.981; [ UNCLASS] org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=9 on file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0000000000000282933 after 494341ms
INFO - 2016-01-28 22:24:43.088; [ UNCLASS] org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=10 on file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0000000000000282933 after 555448ms
INFO - 2016-01-28 22:25:44.808; [ UNCLASS] org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=11 on file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0000000000000282933 after 617168ms
INFO - 2016-01-28 22:26:45.934; [ UNCLASS] org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=12 on file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0000000000000282933 after 678294ms
INFO - 2016-01-28 22:27:47.036; [ UNCLASS] org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=13 on file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0000000000000282933 after 739396ms
INFO - 2016-01-28 22:28:48.504; [ UNCLASS] org.apache.solr.util.FSHDFSUtils; recoverLease=false, attempt=14 on file=hdfs://nameservice1:8020/solr5.2/UNCLASS/core_node14/data/tlog/tlog.0000000000000282933 after 800864ms

Some shards in the cluster can take hours to come back up. Solr appears to wait up to 900 seconds on each tlog file, and with 60+ tlog files per core that adds up to a very long time. Any ideas? Thank you!
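For what it's worth, the per-file delay is consistent with a lease-recovery loop that pauses briefly after the first attempt and then retries roughly once a minute until a fixed ~900-second cap. This is a rough model of the cadence read off the timestamps in the log excerpt above (the 4 s, 61 s, and 900 s constants are inferred from the log, not taken from the Solr or Hadoop source):

```python
# Rough model of the lease-recovery retry cadence visible in the log:
# a short first pause, then ~61 s between attempts, giving up at 900 s.
# All constants are inferred from the log timestamps, not from FSHDFSUtils.
FIRST_PAUSE_S = 4
SUBSEQUENT_PAUSE_S = 61
TIMEOUT_S = 900

def recovery_attempts(timeout_s: int = TIMEOUT_S):
    """Yield (attempt_number, elapsed_seconds) until the timeout is hit."""
    elapsed = 0
    attempt = 0
    while elapsed < timeout_s:
        yield attempt, elapsed
        elapsed += FIRST_PAUSE_S if attempt == 0 else SUBSEQUENT_PAUSE_S
        attempt += 1

attempts = list(recovery_attempts())
print(len(attempts))                  # → 16 attempts fit inside the 900 s window
print(60 * TIMEOUT_S / 3600)          # → 15.0 hours for 60 files, if sequential
```

So if each of 60+ tlog files burns the full timeout before Solr gives up on its lease, a multi-hour startup follows directly from the arithmetic. If the leases are genuinely stale (left over from an unclean shutdown), recovering them before starting Solr, e.g. with `hdfs debug recoverLease -path <tlog-file> -retries 3` if your Hadoop version ships that subcommand (Hadoop 2.7+), might shortcut the wait, though I'd treat that as something to test carefully first.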