Check your system metrics too.

On Mon, Feb 21, 2022, 10:52 PM Tale Hive <[email protected]> wrote:
> Yeah, the next step is for me to run a tcpdump right when the problem
> occurs. I want to know whether my namenode stops accepting connections
> because it freezes for some reason, or because there are too many
> connections at a time.
>
> My delay is far worse than 2 s: sometimes an "hdfs dfs -ls -d
> /user/<my-user>" takes 20 s or 43 s, and occasionally it even exceeds
> 1 minute. And during this time the call queue is OK, the heap is OK; I
> can't find any metric that would point to a problem inside the namenode
> JVM.
>
> Best regards.
>
> T@le
>
> On Mon, Feb 21, 2022 at 4:32 PM, Amith sha <[email protected]> wrote:
>
>> If you are still concerned about a delay of > 2 s, then you need to
>> benchmark with and without load. It will help to find the root cause of
>> the problem.
>>
>> On Mon, Feb 21, 2022, 1:52 PM Tale Hive <[email protected]> wrote:
>>
>>> Hello Amith.
>>>
>>> Hm, not a bad idea. If I increase the size of the listen queue and also
>>> increase the timeout, the combination of both may mitigate the problem
>>> more than increasing the listen queue size alone.
>>> It won't solve the problem of acceptance speed, but it could help.
>>>
>>> Thanks for the suggestion!
>>>
>>> T@le
>>>
>>> On Mon, Feb 21, 2022 at 2:33 AM, Amith sha <[email protected]> wrote:
>>>
>>>> org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout
>>>> while waiting for channel to be ready for connect.
>>>> The connection timed out after 20000 ms; I suspect this value is very
>>>> low for a namenode with 75 GB of heap usage. Can you increase the
>>>> value to 5 sec and check the connection? To increase the value, modify
>>>> the property ipc.client.rpc-timeout.ms in core-site.xml (if it is not
>>>> present, add it to core-site.xml).
>>>>
>>>> Thanks & Regards
>>>> Amithsha
>>>>
>>>> On Fri, Feb 18, 2022 at 9:17 PM Tale Hive <[email protected]> wrote:
>>>>
>>>>> Hello Tom.
>>>>>
>>>>> Sorry for my late answer, I don't know why Gmail puts your mail
>>>>> into spam -_-.
>>>>>
>>>>> To answer you:
>>>>>
>>>>> - The metrics callQueueLength, avgQueueTime, avgProcessingTime and
>>>>>   the GC metrics are all OK.
>>>>> - Threads are plenty sufficient (I can see the metrics for them as
>>>>>   well, and I am below 200, the number I have for the 8020 RPC
>>>>>   server).
>>>>>
>>>>> Did you see my other answers about this problem?
>>>>> I would be interested in your opinion on them!
>>>>>
>>>>> Best regards.
>>>>>
>>>>> T@le
>>>>>
>>>>> On Tue, Feb 15, 2022 at 2:16 AM, tom lee <[email protected]> wrote:
>>>>>
>>>>>> It might be helpful to analyze the namenode metrics and logs.
>>>>>>
>>>>>> What about some key metrics? Examples are callQueueLength,
>>>>>> avgQueueTime, avgProcessingTime and the GC metrics.
>>>>>>
>>>>>> In addition, is the number of threads
>>>>>> (dfs.namenode.service.handler.count) in the namenode sufficient?
>>>>>>
>>>>>> Hopefully this will help.
>>>>>>
>>>>>> Best regards.
>>>>>> Tom
>>>>>>
>>>>>> On Mon, Feb 14, 2022 at 11:57 PM, Tale Hive <[email protected]> wrote:
>>>>>>
>>>>>>> Hello.
>>>>>>>
>>>>>>> I encounter a strange problem with my namenode. I have the
>>>>>>> following architecture:
>>>>>>> - Two namenodes in HA
>>>>>>> - 600 datanodes
>>>>>>> - HDP 3.1.4
>>>>>>> - 150 million files and folders
>>>>>>>
>>>>>>> Sometimes, when I query the namenode with the hdfs client, I get a
>>>>>>> timeout error like this:
>>>>>>> hdfs dfs -ls -d /user/myuser
>>>>>>>
>>>>>>> 22/02/14 15:07:44 INFO retry.RetryInvocationHandler:
>>>>>>> org.apache.hadoop.net.ConnectTimeoutException: Call From
>>>>>>> <my-client-hostname>/<my-client-ip> to <active-namenode-hostname>:8020
>>>>>>> failed on socket timeout exception:
>>>>>>> org.apache.hadoop.net.ConnectTimeoutException: 20000 millis
>>>>>>> timeout while waiting for channel to be ready for connect.
>>>>>>> ch : java.nio.channels.SocketChannel[connection-pending
>>>>>>> remote=<active-namenode-hostname>/<active-namenode-ip>:8020];
>>>>>>> For more details see: http://wiki.apache.org/hadoop/SocketTimeout,
>>>>>>> while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over
>>>>>>> <active-namenode-hostname>/<active-namenode-ip>:8020 after 2
>>>>>>> failover attempts. Trying to failover after sleeping for 2694ms.
>>>>>>>
>>>>>>> I checked the heap of the namenode and there is no problem (I have
>>>>>>> 75 GB of max heap and I'm using around 50 GB).
>>>>>>> I checked the client RPC threads of the namenode and I'm at 200,
>>>>>>> which follows the recommendations from the Hadoop Operations book.
>>>>>>> I have the service RPC enabled to rule out any problem that could
>>>>>>> be coming from the datanodes or ZKFC.
>>>>>>> General resources seem OK: CPU usage is fine, and the same goes for
>>>>>>> memory, network and IO.
>>>>>>> No firewall is enabled on my namenodes or my client.
>>>>>>>
>>>>>>> I was wondering what could cause this problem, please?
>>>>>>>
>>>>>>> Thank you in advance for your help!
>>>>>>>
>>>>>>> Best regards.
>>>>>>>
>>>>>>> T@le
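[Editor's note] The tcpdump step discussed at the top of the thread aims to distinguish a namenode that accepts connections slowly from one that does not answer at all. As a rough client-side complement, the same distinction can be probed by timing the raw TCP handshake to port 8020. A minimal sketch (the hostname in the usage comment is a placeholder, not taken from the thread):

```python
import socket
import time

def time_connect(host, port, timeout=20.0):
    """Time the TCP handshake to host:port.

    A large elapsed time with no error suggests the server accepted the
    connection slowly (e.g. listen-queue pressure); a timeout error means
    the SYN was never answered within `timeout` seconds at all.
    """
    start = time.monotonic()
    try:
        # create_connection performs the full TCP handshake, then we
        # immediately close the socket; only the timing matters here.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start, None
    except OSError as exc:
        return time.monotonic() - start, exc

# Example usage (placeholder hostname):
# elapsed, err = time_connect("<active-namenode-hostname>", 8020)
# print("connect took %.3fs, error=%r" % (elapsed, err))
```

Running this in a loop while the `hdfs dfs -ls` hangs would show whether the stall is in connection acceptance or later in RPC processing.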

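[Editor's note] For reference, the knobs discussed in the thread live in core-site.xml. A sketch of the relevant properties, with illustrative values only (not recommendations); `ipc.server.listen.queue.size` is assumed to be the "listen queue" property being discussed, and `ipc.client.connect.timeout` is, if I recall the client IPC settings correctly, the setting behind the "20000 millis timeout ... ready for connect" seen in the stack trace:

```xml
<!-- core-site.xml fragment: example values only -->

<!-- Client-side connect timeout; likely the source of the 20000 ms in the trace -->
<property>
  <name>ipc.client.connect.timeout</name>
  <value>20000</value>
</property>

<!-- Client-side RPC timeout suggested by Amith above -->
<property>
  <name>ipc.client.rpc-timeout.ms</name>
  <value>120000</value>
</property>

<!-- Server socket backlog; the "listen queue" size Tale proposes to raise -->
<property>
  <name>ipc.server.listen.queue.size</name>
  <value>128</value>
</property>
```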