Hello @Amith sha <[email protected]>! I also checked the system metrics; nothing is wrong with CPU, RAM, or I/O. The only thing I found was these TCP errors (ListenDrop).
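For what it's worth, the kernel keeps its own listen-queue drop counters, which can confirm whether those ListenDrop events are accept-queue overflows on the OS side. A minimal sketch, assuming a Linux host with `netstat`/`ss` available; the sample text below is illustrative, not real output:

```shell
# Illustrative excerpt of `netstat -s` output (sample text, not a real node):
sample='    127 times the listen queue of a socket overflowed
    127 SYNs to LISTEN sockets dropped'

# On a live namenode you would run the real commands instead:
#   netstat -s | grep -i -E 'listen|overflow'   # cumulative drop counters
#   ss -ltn 'sport = :8020'                     # accept-queue depth on the RPC port

# Count the drop-related lines in the sample:
echo "$sample" | grep -c -i -E 'listen|overflow'
```

If these counters grow exactly when the client timeouts occur, the backlog on port 8020 is overflowing rather than the namenode freezing.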
@HK I'm monitoring a lot of JVM metrics like this one: "UnderReplicatedBlocks" in the bean "Hadoop:service=NameNode,name=FSNamesystem". Unfortunately, I found no under-replicated blocks when the timeout problem occurs.
Thanks for your advice; in addition to the tcpdump, I'll perform some jstacks to see what the IPC handlers are doing.

Best regards.

T@le

On Tue, Feb 22, 2022 at 04:30, HK <[email protected]> wrote:

> Hi Tale,
> Could you please take a thread dump of the namenode process and check
> what the IPC handlers are doing?
>
> We faced a similar issue when under-replication was high in the cluster,
> due to the filesystem writeLock.
>
> On Tue, 22 Feb 2022, 8:37 am Amith sha, <[email protected]> wrote:
>
>> Check your system metrics too.
>>
>> On Mon, Feb 21, 2022, 10:52 PM Tale Hive <[email protected]> wrote:
>>
>>> Yeah, the next step for me is to perform a tcpdump just when the
>>> problem occurs.
>>> I want to know whether my namenode does not accept connections because
>>> it freezes for some reason or because there are too many connections
>>> at a time.
>>>
>>> My delay is far worse than 2 s: sometimes an hdfs dfs -ls -d
>>> /user/<my-user> takes 20 s or 43 s, and rarely it is even longer than
>>> 1 minute.
>>> And during this time, the CallQueue is OK, the heap is OK; I can't
>>> find any metric which could show me a problem inside the namenode JVM.
>>>
>>> Best regards.
>>>
>>> T@le
>>>
>>> On Mon, Feb 21, 2022 at 16:32, Amith sha <[email protected]> wrote:
>>>
>>>> If you are still concerned about the delay of > 2 s, then you need to
>>>> benchmark with and without load. It will help you find the root cause
>>>> of the problem.
>>>>
>>>> On Mon, Feb 21, 2022, 1:52 PM Tale Hive <[email protected]> wrote:
>>>>
>>>>> Hello Amith.
>>>>>
>>>>> Hm, not a bad idea. If I increase the size of the listen queue and
>>>>> also increase the timeout, the combination of both may mitigate the
>>>>> problem more than just increasing the listen queue size.
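The jstack-plus-tcpdump plan could be scripted roughly as follows; a sketch only, assuming a Linux namenode host, with the process pattern and file names as my own placeholders:

```shell
# Find the NameNode pid (the class pattern is an assumption; adjust as needed).
NN_PID=$(pgrep -f 'org.apache.hadoop.hdfs.server.namenode.NameNode' | head -1)

# Take three thread dumps a few seconds apart, to see what the IPC
# "Server handler" threads are doing while the timeouts occur.
for i in 1 2 3; do
  [ -n "$NN_PID" ] && { jstack "$NN_PID" > "nn-jstack-$i.txt"; sleep 5; }
done

# Capture only connection-setup packets (SYN flag) on the RPC port so the
# pcap stays small; run as root in parallel with the jstacks:
FILTER='tcp port 8020 and tcp[tcpflags] & tcp-syn != 0'
# sudo tcpdump -i any -w nn-8020-syn.pcap "$FILTER"
```

If connections show retransmitted SYNs with no SYN-ACK while the handler threads sit idle, the problem is in the accept path rather than in request processing.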
>>>>> It won't solve the problem of acceptance speed, but it could help.
>>>>>
>>>>> Thanks for the suggestion!
>>>>>
>>>>> T@le
>>>>>
>>>>> On Mon, Feb 21, 2022 at 02:33, Amith sha <[email protected]> wrote:
>>>>>
>>>>>> org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout
>>>>>> while waiting for channel to be ready for connect.
>>>>>> The connection timed out after 20000 ms; I suspect this value is
>>>>>> very low for a namenode with 75 GB of heap usage. Can you increase
>>>>>> the value to 5 s and check the connection? To increase the value,
>>>>>> modify this property:
>>>>>> ipc.client.rpc-timeout.ms - core-site.xml (if not present, then add
>>>>>> it to core-site.xml)
>>>>>>
>>>>>> Thanks & Regards
>>>>>> Amithsha
>>>>>>
>>>>>> On Fri, Feb 18, 2022 at 9:17 PM Tale Hive <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello Tom.
>>>>>>>
>>>>>>> Sorry for my lack of answers; I don't know why Gmail puts your
>>>>>>> mail into spam -_-.
>>>>>>>
>>>>>>> To answer you:
>>>>>>>
>>>>>>> - The metrics callQueueLength, avgQueueTime, avgProcessingTime and
>>>>>>> the GC metrics are all OK
>>>>>>> - Threads are plenty sufficient (I can see the metrics for them
>>>>>>> too, and I am below 200, the number I have for the 8020 RPC server)
>>>>>>>
>>>>>>> Did you see my other answers about this problem?
>>>>>>> I would be interested in your opinion about that!
>>>>>>>
>>>>>>> Best regards.
>>>>>>>
>>>>>>> T@le
>>>>>>>
>>>>>>> On Tue, Feb 15, 2022 at 02:16, tom lee <[email protected]> wrote:
>>>>>>>
>>>>>>>> It might be helpful to analyze namenode metrics and logs.
>>>>>>>>
>>>>>>>> What about some key metrics? Examples are callQueueLength,
>>>>>>>> avgQueueTime, avgProcessingTime and GC metrics.
>>>>>>>>
>>>>>>>> In addition, is the number of threads
>>>>>>>> (dfs.namenode.service.handler.count) in the namenode sufficient?
>>>>>>>>
>>>>>>>> Hopefully this will help.
>>>>>>>>
>>>>>>>> Best regards.
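For reference, the two knobs being discussed map to core-site.xml entries along these lines. A sketch only: the 5000 ms value is just the suggestion above, and `ipc.server.listen.queue.size` with its 1024 value are assumptions to verify against your Hadoop version (the effective backlog is also capped by the kernel's net.core.somaxconn):

```xml
<!-- core-site.xml: sketch only; validate names and defaults for your version. -->
<property>
  <!-- Client-side RPC timeout in ms (5000 = the suggestion above). -->
  <name>ipc.client.rpc-timeout.ms</name>
  <value>5000</value>
</property>
<property>
  <!-- Server-side accept backlog of the RPC listener (assumed default: 128).
       net.core.somaxconn must be at least this large for it to take effect. -->
  <name>ipc.server.listen.queue.size</name>
  <value>1024</value>
</property>
```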
>>>>>>>> Tom
>>>>>>>>
>>>>>>>> On Mon, Feb 14, 2022 at 23:57, Tale Hive <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Hello.
>>>>>>>>>
>>>>>>>>> I encounter a strange problem with my namenode. I have the
>>>>>>>>> following architecture:
>>>>>>>>> - Two namenodes in HA
>>>>>>>>> - 600 datanodes
>>>>>>>>> - HDP 3.1.4
>>>>>>>>> - 150 million files and folders
>>>>>>>>>
>>>>>>>>> Sometimes, when I query the namenode with the hdfs client, I get
>>>>>>>>> a timeout error like this:
>>>>>>>>> hdfs dfs -ls -d /user/myuser
>>>>>>>>>
>>>>>>>>> 22/02/14 15:07:44 INFO retry.RetryInvocationHandler:
>>>>>>>>> org.apache.hadoop.net.ConnectTimeoutException: Call From
>>>>>>>>> <my-client-hostname>/<my-client-ip> to <active-namenode-hostname>:8020
>>>>>>>>> failed on socket timeout exception:
>>>>>>>>> org.apache.hadoop.net.ConnectTimeoutException: 20000 millis
>>>>>>>>> timeout while waiting for channel to be ready for connect. ch :
>>>>>>>>> java.nio.channels.SocketChannel[connection-pending
>>>>>>>>> remote=<active-namenode-hostname>/<active-namenode-ip>:8020];
>>>>>>>>> For more details see: http://wiki.apache.org/hadoop/SocketTimeout,
>>>>>>>>> while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over
>>>>>>>>> <active-namenode-hostname>/<active-namenode-ip>:8020 after 2
>>>>>>>>> failover attempts. Trying to failover after sleeping for 2694ms.
>>>>>>>>>
>>>>>>>>> I checked the heap of the namenode and there is no problem (I
>>>>>>>>> have 75 GB of max heap and I'm around 50 GB used).
>>>>>>>>> I checked the threads of the client RPC for the namenode and I'm
>>>>>>>>> at 200, which respects the recommendations from the Hadoop
>>>>>>>>> Operations book.
>>>>>>>>> I have serviceRPC enabled to prevent any problem which could be
>>>>>>>>> coming from datanodes or ZKFC.
>>>>>>>>> General resources seem OK: CPU usage is pretty fine, same for
>>>>>>>>> memory, network and I/O.
>>>>>>>>> No firewall is enabled on my namenodes or my client.
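The metrics mentioned in this thread are exposed by the NameNode's /jmx servlet, so they can be spot-checked with curl during an incident. A sketch, keeping the thread's hostname placeholder; the 9870 port (Hadoop 3.x default, 50070 on 2.x) and the RPC bean name are assumptions to verify on your cluster:

```shell
# Placeholder hostname from the thread; port 9870 is the Hadoop 3.x default.
NN_HTTP='http://<active-namenode-hostname>:9870'

# RPC-layer metrics (CallQueueLength, queue/processing times) -- the bean name
# is an assumption; list everything with ?qry=Hadoop:* to confirm it:
#   curl -s "$NN_HTTP/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020"

# FSNamesystem bean quoted earlier (UnderReplicatedBlocks, etc.):
#   curl -s "$NN_HTTP/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem"
```

Polling these every few seconds while reproducing the `hdfs dfs -ls` timeout would show whether the slowness ever reaches the RPC queue at all, or stops earlier at the TCP accept stage.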
>>>>>>>>>
>>>>>>>>> I was wondering what could cause this problem, please?
>>>>>>>>>
>>>>>>>>> Thank you in advance for your help!
>>>>>>>>>
>>>>>>>>> Best regards.
>>>>>>>>>
>>>>>>>>> T@le
>>>>>>>>
