If a TCP error occurs, then you need to check the network metrics. Yes, tcpdump can help you.
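For instance, a check along these lines could confirm the ListenDrop theory (a sketch only: 8020 is the RPC port from this thread, the capture must run as root on the NameNode host, and counter names vary by kernel):

```shell
# Sketch for confirming listen-queue drops on the NameNode host (8020 is the
# RPC port discussed in this thread; the capture itself needs root):
#
#   tcpdump -i any -nn -w nn-8020.pcap \
#     'tcp port 8020 and (tcp[tcpflags] & (tcp-syn|tcp-rst) != 0)'
#
# Unprivileged checks that often suffice: cumulative kernel counters for
# overflowed accept queues, and the current state of listening sockets.
netstat -s 2>/dev/null | grep -i -E 'listen|overflow' || true
ss -ltn
```

Repeated SYNs with no SYN-ACK in the capture, or the "times the listen queue of a socket overflowed" counter climbing while the timeouts happen, would point at the accept queue rather than at the JVM.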
Thanks & Regards
Amithsha

On Tue, Feb 22, 2022 at 1:29 PM Tale Hive <[email protected]> wrote:

> Hello !
>
> @Amith sha <[email protected]>
> I also checked the system metrics; nothing is wrong in CPU, RAM or IO.
> The only thing I found was these TCP errors (ListenDrop).
>
> @HK
> I'm monitoring a lot of JVM metrics like this one:
> "UnderReplicatedBlocks" in the bean
> "Hadoop:service=NameNode,name=FSNamesystem".
> Unfortunately, I found no under-replicated blocks when the timeout
> problem occurs.
> Thanks for your advice; in addition to the tcpdump, I'll perform some
> jstacks to see if I can find what the IPC handlers are doing.
>
> Best regards.
>
> T@le
>
> On Tue, Feb 22, 2022 at 04:30, HK <[email protected]> wrote:
>
>> Hi Tale,
>> Could you please take a thread dump of the namenode process, and could
>> you please check what the IPC handlers are doing?
>>
>> We faced a similar issue when under-replication was high in the cluster,
>> due to the filesystem writeLock.
>>
>> On Tue, 22 Feb 2022, 8:37 am Amith sha, <[email protected]> wrote:
>>
>>> Check your system metrics too.
>>>
>>> On Mon, Feb 21, 2022, 10:52 PM Tale Hive <[email protected]> wrote:
>>>
>>>> Yeah, the next step for me is to perform a tcpdump just when the
>>>> problem occurs.
>>>> I want to know whether my namenode does not accept connections because
>>>> it freezes for some reason, or because there are too many connections
>>>> at a time.
>>>>
>>>> My delay is far worse than 2 s: sometimes an hdfs dfs -ls -d
>>>> /user/<my-user> takes 20 s or 43 s, and rarely it is even longer than
>>>> 1 minute.
>>>> During this time, the CallQueue is OK, the heap is OK; I can't find
>>>> any metric which could show me a problem inside the namenode JVM.
>>>>
>>>> Best regards.
>>>>
>>>> T@le
>>>>
>>>> On Mon, Feb 21, 2022 at 16:32, Amith sha <[email protected]> wrote:
>>>>
>>>>> If you are still concerned about the delay of > 2 s, then you need to
>>>>> benchmark with and without load.
>>>>> It will help to find the root cause of the problem.
>>>>>
>>>>> On Mon, Feb 21, 2022, 1:52 PM Tale Hive <[email protected]> wrote:
>>>>>
>>>>>> Hello Amith.
>>>>>>
>>>>>> Hm, not a bad idea. If I increase the size of the listen queue and
>>>>>> also increase the timeout, the combination of both may mitigate the
>>>>>> problem more than just increasing the listen queue size.
>>>>>> It won't solve the problem of acceptance speed, but it could help.
>>>>>>
>>>>>> Thanks for the suggestion !
>>>>>>
>>>>>> T@le
>>>>>>
>>>>>> On Mon, Feb 21, 2022 at 02:33, Amith sha <[email protected]> wrote:
>>>>>>
>>>>>>> org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout
>>>>>>> while waiting for channel to be ready for connect.
>>>>>>> The connection timed out after 20000 ms; I suspect this value is
>>>>>>> very low for a namenode with 75 GB of heap usage. Can you increase
>>>>>>> the value to 5 sec and check the connection? To change the value,
>>>>>>> modify this property:
>>>>>>> ipc.client.rpc-timeout.ms - core-site.xml (if not present, then add
>>>>>>> it to core-site.xml)
>>>>>>>
>>>>>>> Thanks & Regards
>>>>>>> Amithsha
>>>>>>>
>>>>>>> On Fri, Feb 18, 2022 at 9:17 PM Tale Hive <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello Tom.
>>>>>>>>
>>>>>>>> Sorry for my lack of answers; I don't know why Gmail puts your
>>>>>>>> mail into spam -_-.
>>>>>>>>
>>>>>>>> To answer you :
>>>>>>>>
>>>>>>>> - The metrics callQueueLength, avgQueueTime, avgProcessingTime
>>>>>>>> and the GC metrics are all OK
>>>>>>>> - Threads are plenty sufficient (I can see the metrics for them
>>>>>>>> too, and I am below 200, the number I have for the 8020 RPC server)
>>>>>>>>
>>>>>>>> Did you see my other answers about this problem ?
>>>>>>>> I would be interested to have your opinion about them !
>>>>>>>>
>>>>>>>> Best regards.
>>>>>>>>
>>>>>>>> T@le
>>>>>>>>
>>>>>>>> On Tue, Feb 15,
>>>>>>>> 2022 at 02:16, tom lee <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> It might be helpful to analyze the namenode metrics and logs.
>>>>>>>>>
>>>>>>>>> What about some key metrics? Examples are callQueueLength,
>>>>>>>>> avgQueueTime, avgProcessingTime and the GC metrics.
>>>>>>>>>
>>>>>>>>> In addition, is the number of namenode handler threads
>>>>>>>>> (dfs.namenode.service.handler.count) sufficient?
>>>>>>>>>
>>>>>>>>> Hopefully this will help.
>>>>>>>>>
>>>>>>>>> Best regards.
>>>>>>>>> Tom
>>>>>>>>>
>>>>>>>>> On Mon, Feb 14, 2022 at 23:57, Tale Hive <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hello.
>>>>>>>>>>
>>>>>>>>>> I encounter a strange problem with my namenode. I have the
>>>>>>>>>> following architecture :
>>>>>>>>>> - Two namenodes in HA
>>>>>>>>>> - 600 datanodes
>>>>>>>>>> - HDP 3.1.4
>>>>>>>>>> - 150 million files and folders
>>>>>>>>>>
>>>>>>>>>> Sometimes, when I query the namenode with the hdfs client, I get
>>>>>>>>>> a timeout error like this :
>>>>>>>>>> hdfs dfs -ls -d /user/myuser
>>>>>>>>>>
>>>>>>>>>> 22/02/14 15:07:44 INFO retry.RetryInvocationHandler:
>>>>>>>>>> org.apache.hadoop.net.ConnectTimeoutException: Call From
>>>>>>>>>> <my-client-hostname>/<my-client-ip> to
>>>>>>>>>> <active-namenode-hostname>:8020
>>>>>>>>>> failed on socket timeout exception:
>>>>>>>>>> org.apache.hadoop.net.ConnectTimeoutException: 20000 millis
>>>>>>>>>> timeout while waiting for channel to be ready for connect. ch :
>>>>>>>>>> java.nio.channels.SocketChannel[connection-pending
>>>>>>>>>> remote=<active-namenode-hostname>/<active-namenode-ip>:8020];
>>>>>>>>>> For more details see:
>>>>>>>>>> http://wiki.apache.org/hadoop/SocketTimeout,
>>>>>>>>>> while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo
>>>>>>>>>> over <active-namenode-hostname>/<active-namenode-ip>:8020 after 2
>>>>>>>>>> failover attempts. Trying to failover after sleeping for 2694ms.
>>>>>>>>>>
>>>>>>>>>> I checked the heap of the namenode and there is no problem (I
>>>>>>>>>> have 75 GB of max heap and I'm around 50 GB used).
>>>>>>>>>> I checked the threads of the client RPC server of the namenode,
>>>>>>>>>> and I'm at 200, which respects the recommendations from the
>>>>>>>>>> Hadoop Operations book.
>>>>>>>>>> I have the service RPC server enabled to prevent any problem
>>>>>>>>>> which could be coming from datanodes or ZKFC.
>>>>>>>>>> General resources seem OK: CPU usage is pretty fine, same for
>>>>>>>>>> memory, network and IO.
>>>>>>>>>> No firewall is enabled on my namenodes nor on my client.
>>>>>>>>>>
>>>>>>>>>> I was wondering what could cause this problem, please ?
>>>>>>>>>>
>>>>>>>>>> Thank you in advance for your help !
>>>>>>>>>>
>>>>>>>>>> Best regards.
>>>>>>>>>>
>>>>>>>>>> T@le
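For the jstack idea discussed in this thread, a minimal sketch of how the dump could be inspected (assumptions: the dump was already taken on the NameNode host with `jstack <namenode-pid> > nn-threads.txt`, and handler threads follow the stock Hadoop naming "IPC Server handler N on 8020"):

```shell
# Sketch: pair each IPC handler thread's name with its state, from a thread
# dump taken with:  jstack <namenode-pid> > nn-threads.txt
# ("IPC Server handler ... on 8020" is assumed to be the handler naming.)
summarize_ipc_handlers() {
  grep -A 3 'IPC Server handler' "$1" | grep -E 'IPC Server handler|Thread.State'
}
# Usage: summarize_ipc_handlers nn-threads.txt
# Many handlers BLOCKED or WAITING on the same lock (e.g. the FSNamesystem
# write lock) would fit the under-replication theory mentioned by HK.
```

A few dumps taken seconds apart during a slow `hdfs dfs -ls` are more telling than a single one: handlers stuck on the same frame across dumps indicate where the time goes.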

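As a concrete illustration of the two tuning knobs discussed above (the client RPC timeout and the listen-queue size), the core-site.xml fragment could look like this. The values shown are placeholders for illustration only, not recommendations from this thread; the property names are those of stock Hadoop:

```xml
<!-- core-site.xml fragment; both values below are illustrative only. -->
<!-- Client-side RPC timeout, in milliseconds. -->
<property>
  <name>ipc.client.rpc-timeout.ms</name>
  <value>60000</value>
</property>
<!-- Listen backlog of the server socket (the "listenQueue" in the thread). -->
<property>
  <name>ipc.server.listen.queue.size</name>
  <value>256</value>
</property>
```

Note that a larger backlog only takes effect if the OS limit (net.core.somaxconn) is at least as large.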