Hello!

@Amith sha <[email protected]>
I also checked the system metrics; nothing wrong with CPU, RAM, or IO.
The only thing I found was these TCP errors (ListenDrops).
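For reference, here is how I check those drops at the OS level (a sketch; the exact netstat counter wording varies slightly by kernel version):

```shell
# Cumulative count of SYNs dropped because a socket's listen (accept)
# queue was full -- the kernel-side view of ListenDrops
netstat -s 2>/dev/null | grep -i "listen" || true

# Kernel cap on any socket's accept backlog; the namenode's
# ipc.server.listen.queue.size is silently capped by this value
cat /proc/sys/net/core/somaxconn 2>/dev/null || true
```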

@HK
I'm monitoring a lot of JVM metrics, including this one: "UnderReplicatedBlocks"
in the bean "Hadoop:service=NameNode,name=FSNamesystem".
Unfortunately, I found no under-replicated blocks when the timeout problem
occurs.
Thanks for your advice; in addition to the tcpdump, I'll take some
jstacks to see what the IPC handlers are doing.
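For anyone following along, the jstack pass could look like this (a sketch; the grep pattern assumes Hadoop's default "IPC Server handler <n> on <port>" thread naming):

```shell
# Take a few thread dumps of the namenode a few seconds apart, so that
# handlers that are genuinely stuck show up consistently across samples
NN_PID=$(pgrep -f 'org.apache.hadoop.hdfs.server.namenode.NameNode' || true)
if [ -n "$NN_PID" ]; then
  for i in 1 2 3; do
    jstack "$NN_PID" > "nn-jstack-$i.txt"
    sleep 5
  done
  # Show the state and top stack frames of the IPC handler threads
  grep -A 2 '"IPC Server handler' nn-jstack-1.txt | head -40
else
  echo "NameNode process not found on this host"
fi
```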

Best regards.

T@le






On Tue, Feb 22, 2022 at 04:30, HK <[email protected]> wrote:

> Hi Tale,
> Could you please take a thread dump of the namenode process and check
> what the IPC handlers are doing?
>
> We faced a similar issue when under-replication was high in the cluster,
> due to the filesystem writeLock.
>
> On Tue, 22 Feb 2022, 8:37 am Amith sha, <[email protected]> wrote:
>
>> Check your system metrics too.
>>
>> On Mon, Feb 21, 2022, 10:52 PM Tale Hive <[email protected]> wrote:
>>
>>> Yeah, my next step is to perform a tcpdump just when the problem
>>> occurs.
>>> I want to know whether my namenode does not accept connections because it
>>> freezes for some reason or because there are too many connections at a time.
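>>> For the capture itself, something like this could work (a sketch; 8020 is
>>> the client RPC port from the error message, and the RUN_CAPTURE guard is
>>> just so the command is not started by accident):

```shell
# Run on the active namenode host (needs root); set RUN_CAPTURE=1 to start.
# Capturing only SYN and RST segments keeps the file small: repeated SYNs
# with no SYN-ACK coming back from port 8020 mean the accept queue is full.
if [ "${RUN_CAPTURE:-0}" = "1" ]; then
  tcpdump -i any -c 10000 -w nn-8020.pcap \
    'tcp port 8020 and (tcp[tcpflags] & (tcp-syn|tcp-rst) != 0)'
fi
```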
>>>
>>> My delay is far worse than 2 s; sometimes an hdfs dfs -ls -d
>>> /user/<my-user> takes 20 s or 43 s, and rarely it is even longer than a
>>> minute.
>>> During this time, the CallQueue is OK, the heap is OK, and I can't find
>>> any metric that would show me a problem inside the namenode JVM.
>>>
>>> Best regards.
>>>
>>> T@le
>>>
>>> On Mon, Feb 21, 2022 at 16:32, Amith sha <[email protected]>
>>> wrote:
>>>
>>>> If you are still concerned about the delay of > 2 s, then you need to
>>>> benchmark with and without load. It will help to find the root cause of
>>>> the problem.
>>>>
>>>> On Mon, Feb 21, 2022, 1:52 PM Tale Hive <[email protected]> wrote:
>>>>
>>>>> Hello Amith.
>>>>>
>>>>> Hm, not a bad idea. If I increase the size of the listen queue and also
>>>>> increase the timeout, the combination of both may mitigate the problem
>>>>> more than just increasing the listen queue size.
>>>>> It won't solve the problem of connection-acceptance speed, but it could
>>>>> help.
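>>>>> For anyone searching the archives: the listen queue is controlled by
>>>>> this core-site.xml property (a sketch; 1024 is an illustrative value,
>>>>> and the effective backlog is also capped by the kernel's
>>>>> net.core.somaxconn, so both usually need raising together):

```xml
<property>
  <name>ipc.server.listen.queue.size</name>
  <!-- Illustrative value; the default is 128, and the effective backlog
       is capped by the kernel's net.core.somaxconn setting -->
  <value>1024</value>
</property>
```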
>>>>>
>>>>> Thanks for the suggestion!
>>>>>
>>>>> T@le
>>>>>
>>>>> On Mon, Feb 21, 2022 at 02:33, Amith sha <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout
>>>>>> while waiting for channel to be ready for connect.
>>>>>> The connection timed out after 20000 ms; I suspect this value is very
>>>>>> low for a namenode with 75 GB of heap usage. Can you increase the value
>>>>>> to 5 sec and check the connection? To increase the value, modify the
>>>>>> property ipc.client.rpc-timeout.ms in core-site.xml (if it is not
>>>>>> present, then add it to core-site.xml).
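>>>>>> As a concrete example, the property could look like this in
>>>>>> core-site.xml (a sketch; 60000 is an illustrative value, in
>>>>>> milliseconds):

```xml
<property>
  <name>ipc.client.rpc-timeout.ms</name>
  <!-- Client RPC timeout in milliseconds; illustrative value.
       The default is 0, meaning no dedicated RPC timeout is applied. -->
  <value>60000</value>
</property>
```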
>>>>>>
>>>>>>
>>>>>> Thanks & Regards
>>>>>> Amithsha
>>>>>>
>>>>>>
>>>>>> On Fri, Feb 18, 2022 at 9:17 PM Tale Hive <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Hello Tom.
>>>>>>>
>>>>>>> Sorry for my lack of answers; I don't know why Gmail puts your
>>>>>>> mail into spam -_-.
>>>>>>>
>>>>>>> To answer you:
>>>>>>>
>>>>>>>    - The metrics callQueueLength, avgQueueTime, avgProcessingTime,
>>>>>>>    and the GC metrics are all OK
>>>>>>>    - Threads are plentiful (I can see their metrics too, and I am
>>>>>>>    below 200, the number I have for the 8020 RPC server)
>>>>>>>
>>>>>>> Did you see my other answers about this problem?
>>>>>>> I would be interested in your opinion on that!
>>>>>>>
>>>>>>> Best regards.
>>>>>>>
>>>>>>> T@le
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 15, 2022 at 02:16, tom lee <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> It might be helpful to analyze namenode metrics and logs.
>>>>>>>>
>>>>>>>> What about some key metrics? Examples are callQueueLength,
>>>>>>>> avgQueueTime, avgProcessingTime and GC metrics.
>>>>>>>>
>>>>>>>> In addition, is the number of threads
>>>>>>>> (dfs.namenode.service.handler.count) in the namenode sufficient?
>>>>>>>>
>>>>>>>> Hopefully this will help.
>>>>>>>>
>>>>>>>> Best regards.
>>>>>>>> Tom
>>>>>>>>
>>>>>>>> On Mon, Feb 14, 2022 at 23:57, Tale Hive <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hello.
>>>>>>>>>
>>>>>>>>> I am encountering a strange problem with my namenode. I have the
>>>>>>>>> following architecture:
>>>>>>>>> - Two namenodes in HA
>>>>>>>>> - 600 datanodes
>>>>>>>>> - HDP 3.1.4
>>>>>>>>> - 150 million files and folders
>>>>>>>>>
>>>>>>>>> Sometimes, when I query the namenode with the hdfs client, I get a
>>>>>>>>> timeout error like this:
>>>>>>>>> hdfs dfs -ls -d /user/myuser
>>>>>>>>>
>>>>>>>>> 22/02/14 15:07:44 INFO retry.RetryInvocationHandler:
>>>>>>>>> org.apache.hadoop.net.ConnectTimeoutException: Call From
>>>>>>>>> <my-client-hostname>/<my-client-ip> to <active-namenode-hostname>:8020
>>>>>>>>> failed on socket timeout exception:
>>>>>>>>>   org.apache.hadoop.net.ConnectTimeoutException: 20000 millis
>>>>>>>>> timeout while waiting for channel to be ready for connect. ch :
>>>>>>>>> java.nio.channels.SocketChannel[connection-pending
>>>>>>>>> remote=<active-namenode-hostname>/<active-namenode-ip>:8020];
>>>>>>>>>   For more details see:
>>>>>>>>> http://wiki.apache.org/hadoop/SocketTimeout,
>>>>>>>>> while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over
>>>>>>>>> <active-namenode-hostname>/<active-namenode-ip>:8020 after 2 failover
>>>>>>>>> attempts. Trying to failover after sleeping for 2694ms.
>>>>>>>>>
>>>>>>>>> I checked the heap of the namenode and there is no problem (I have
>>>>>>>>> 75 GB of max heap and around 50 GB used).
>>>>>>>>> I checked the client RPC threads of the namenode and I'm at 200,
>>>>>>>>> which follows the recommendations of the Hadoop Operations book.
>>>>>>>>> I have the service RPC enabled to prevent any problem which could be
>>>>>>>>> coming from datanodes or ZKFC.
>>>>>>>>> General resources seem OK: CPU usage is fine, same for memory,
>>>>>>>>> network, and IO.
>>>>>>>>> No firewall is enabled on my namenodes or my client.
>>>>>>>>>
>>>>>>>>> What could be causing this problem, please?
>>>>>>>>>
>>>>>>>>> Thank you in advance for your help!
>>>>>>>>>
>>>>>>>>> Best regards.
>>>>>>>>>
>>>>>>>>> T@le
>>>>>>>>>
>>>>>>>>
