Thank you, Shawn -
I think the root issue is related to some weirdness with HDFS. Log file
is here:
http://lovehorsepower.com/solr.log.4
Config is here:
http://lovehorsepower.com/solrconfig.xml
I don't see anything set to 20 seconds.
I believe the root exception is:
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File /solr7.1.0/UNCLASS_30DAYS/core_node-1684300827/data/tlog/tlog.0000000000000008930 could only be replicated to 0 nodes instead of minReplication (=1). There are 41 datanode(s) running and no node(s) are excluded in this operation.
    at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.chooseTarget4NewBlock(BlockManager.java:1724)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3449)
    at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.addBlock(NameNodeRpcServer.java:692)
    at org.apache.hadoop.hdfs.server.namenode.AuthorizationProviderProxyClientProtocol.addBlock(AuthorizationProviderProxyClientProtocol.java:217)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.addBlock(ClientNamenodeProtocolServerSideTranslatorPB.java:506)
    at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:617)
    at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
    at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1920)
    at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)
    at org.apache.hadoop.ipc.Client.call(Client.java:1504)
    at org.apache.hadoop.ipc.Client.call(Client.java:1441)
    at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:230)
    at com.sun.proxy.$Proxy11.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.addBlock(ClientNamenodeProtocolTranslatorPB.java:423)
    at sun.reflect.GeneratedMethodAccessor28.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:258)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:104)
    at com.sun.proxy.$Proxy12.addBlock(Unknown Source)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.locateFollowingBlock(DFSOutputStream.java:1860)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.nextBlockOutputStream(DFSOutputStream.java:1656)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:790)
2018-07-02 14:50:24.949 ERROR (indexFetcher-41-thread-1) [c:UNCLASS_30DAYS s:shard37 r:core_node-1684300827 x:UNCLASS_30DAYS_shard37_replica_t-1246382645] o.a.s.h.ReplicationHandler Exception in fetching index
org.apache.solr.common.SolrException: Error logging add
    at org.apache.solr.update.TransactionLog.write(TransactionLog.java:420)
    at org.apache.solr.update.UpdateLog.add(UpdateLog.java:535)
    at org.apache.solr.update.UpdateLog.add(UpdateLog.java:519)
    at org.apache.solr.update.UpdateLog.copyOverOldUpdates(UpdateLog.java:1213)
    at org.apache.solr.update.UpdateLog.copyAndSwitchToNewTlog(UpdateLog.java:1168)
    at org.apache.solr.update.UpdateLog.copyOverOldUpdates(UpdateLog.java:1155)
    at org.apache.solr.cloud.ReplicateFromLeader.lambda$startReplication$0(ReplicateFromLeader.java:100)
    at org.apache.solr.handler.ReplicationHandler.lambda$setupPolling$12(ReplicationHandler.java:1160)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
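If I'm reading that right, the namenode could not place the tlog block on even a single datanode, even though all 41 are up, which is why I think the problem is on the HDFS side rather than in Solr. I still need to rule out datanode disk space and load (hdfs dfsadmin -report should show both). For reference, these are the hdfs-site.xml properties behind that minReplication check; the values below are the stock Hadoop defaults, shown only as a sketch and not copied from our cluster:

<!-- hdfs-site.xml (sketch): properties behind the "minReplication (=1)" check.
     Values shown are the Hadoop defaults, for illustration only. -->
<property>
  <name>dfs.replication</name>
  <value>3</value>   <!-- target number of replicas per block -->
</property>
<property>
  <name>dfs.namenode.replication.min</name>
  <value>1</value>   <!-- replicas required before a block write succeeds -->
</property>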
Thank you very much for the help!
-Joe
On 7/2/2018 8:32 PM, Shawn Heisey wrote:
On 7/2/2018 1:40 PM, Joe Obernberger wrote:
Hi All - having this same problem again with a large index in HDFS. A
replica needs to recover, and it just spins retrying over and over
again. Any ideas? Is there an adjustable timeout?
Screenshot:
http://lovehorsepower.com/images/SolrShot1.jpg
There is considerably more log detail available than can be seen in the
screenshot. Can you please make your solr.log file from this server
available so we can see full error and warning log messages, and let us
know the exact Solr version that wrote the log? You'll probably need to
use a file sharing site, and make sure the file is available until after
the problem has been examined. Attachments sent to the mailing list are
almost always stripped.
Based on the timestamps in the screenshot, it is taking about 22 to 24
seconds to transfer 1750073344 bytes, which works out to right around
the 75 MB per second rate that you were configuring in your last email
thread. In order for that single large file to transfer successfully,
you're going to need a timeout of at least 40 seconds. Based on what I
see, it sounds like the timeout has been set to 20 seconds. The default
client socket timeout on replication should be about two minutes, which
would be plenty for a file of that size to transfer.
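To spell out the arithmetic (taking 1 MB as 1048576 bytes):

    1750073344 bytes / 1048576 ≈ 1669 MB
    1669 MB / 75 MB per second ≈ 22.3 seconds

That sits right in the 22-24 second window from the screenshot, so a
20 second timeout would fire just before each transfer finishes, while
a 40+ second timeout would leave comfortable headroom.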
This might be a timeout issue, but without seeing the full log and
knowing the exact version of Solr that created it, it is difficult to
know for sure where the problem might be or what can be done to fix it.
We will need that logfile. If there are multiple servers involved, we
may need logfiles from both ends of the replication.
Do you have any config in solrconfig.xml for the /replication handler
other than the maxWriteMBPerSec config you showed last time?
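For reference, a rate-limited replication handler in solrconfig.xml looks
roughly like the sketch below. The 75 matches the rate from your last
thread; the exact placement of the parameter is from memory, so treat it
as illustrative rather than authoritative:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="defaults">
    <!-- throttle replication writes; 75 is the rate from the earlier thread -->
    <str name="maxWriteMBPerSec">75</str>
  </lst>
</requestHandler>

If there is anything else in that handler definition, please include it.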
Have you configured anything (particularly a socket timeout or sotimeout
setting) to a value near 20 or 20000?
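As an example of the kind of setting I mean, a shard handler section in
solr.xml like the sketch below (values purely illustrative, not a
recommendation) would impose a hard 20 second socket cutoff:

<shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
  <int name="socketTimeout">20000</int>  <!-- milliseconds: 20000 = 20 seconds -->
  <int name="connTimeout">60000</int>
</shardHandlerFactory>

Anything similar set near 20 or 20000, whether in Solr or on the HDFS
client side, would line up with what the screenshot shows.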
Thanks,
Shawn