Check your system metrics too.
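For those system metrics, a quick sketch of the kernel-side checks worth adding (assuming `ss` and `netstat` are available on the NameNode host; 8020 is the RPC port from the thread). Listen-queue pressure is invisible to JVM metrics but shows up in kernel counters:

```shell
# Listen-queue pressure on the NameNode host: Recv-Q vs the backlog cap
# (Send-Q column) for each listening socket, and kernel counters for SYNs
# dropped because a listen queue was full.
ss -ltn                                           # per-listener queue state
netstat -s | grep -iE 'overflow|SYNs to LISTEN'   # drops from a full queue
```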

On Mon, Feb 21, 2022, 10:52 PM Tale Hive <[email protected]> wrote:

> Yeah, next step is for me to perform a tcpdump just when the problem
> occurs.
> I want to know if my namenode does not accept connections because it
> freezes for some reason or because there are too many connections at a time.
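A minimal sketch of that capture, assuming the NameNode's interface is `eth0` (adjust for the host) and that only handshake packets matter: retransmitted SYNs point at a full listen queue, RSTs at an active refusal.

```shell
# Capture only SYN/RST traffic on the NameNode RPC port during a slow window;
# run on the NameNode host as root. eth0 is an assumed interface name.
tcpdump -i eth0 -w nn-8020.pcap \
  'tcp port 8020 and (tcp[tcpflags] & (tcp-syn|tcp-rst) != 0)'
```

The resulting pcap can then be inspected offline (e.g. with Wireshark) to see whether SYNs are answered at all during the freeze.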
>
> My delay is far worse than 2 s; sometimes an hdfs dfs -ls -d
> /user/<my-user> takes 20 s or 43 s, and rarely it is even longer than 1 minute.
> And during this time, the CallQueue is OK, the heap is OK; I can't find any
> metric that would show me a problem inside the namenode JVM.
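To line those slow calls up against a packet capture, a small timing helper can log each call's wall-clock latency (a sketch; `/user/<my-user>` is the placeholder path from the thread, and GNU `date` with `%N` support is assumed):

```shell
# Time a command in wall-clock milliseconds so slow hdfs calls can be matched
# against a packet capture. Requires GNU date (%3N = milliseconds).
probe() {
  start=$(date +%s%3N)          # epoch milliseconds before the call
  "$@" > /dev/null 2>&1
  end=$(date +%s%3N)            # epoch milliseconds after the call
  echo $((end - start))
}
# Example poll, every 10 s:
#   while true; do echo "$(date -Is) $(probe hdfs dfs -ls -d /user/<my-user>) ms"; sleep 10; done
```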
>
> Best regards.
>
> T@le
>
> On Mon, Feb 21, 2022 at 4:32 PM, Amith sha <[email protected]> wrote:
>
>> If you are still concerned about delays of more than 2 s, you should run a
>> benchmark with and without load; it will help find the root cause of the
>> problem.
>>
>> On Mon, Feb 21, 2022, 1:52 PM Tale Hive <[email protected]> wrote:
>>
>>> Hello Amith.
>>>
>>> Hm, not a bad idea. If I increase the size of the listen queue and also
>>> increase the timeout, the combination of both may mitigate the problem
>>> more than increasing the listen queue size alone.
>>> It won't solve the problem of acceptance speed, but it could help.
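One caveat with the listen-queue half of that combination (a sketch with illustrative values, not a tuning recommendation): the kernel silently caps every backlog at `net.core.somaxconn`, so raising `ipc.server.listen.queue.size` in core-site.xml only takes effect if the sysctl is at least as large.

```shell
# The effective listen backlog is min(ipc.server.listen.queue.size, somaxconn):
# the kernel silently truncates any larger request, so both must move together.
# 2048 below is illustrative only.
cat /proc/sys/net/core/somaxconn          # current kernel cap
# sysctl -w net.core.somaxconn=2048       # raise the cap (root; runtime only)
# Then set ipc.server.listen.queue.size=2048 in core-site.xml and restart the
# NameNode so its listening socket is re-created with the larger backlog.
```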
>>>
>>> Thanks for the suggestion!
>>>
>>> T@le
>>>
>>> On Mon, Feb 21, 2022 at 2:33 AM, Amith sha <[email protected]>
>>> wrote:
>>>
>>>> org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout
>>>> while waiting for channel to be ready for connect.
>>>> The connection timed out after 20000 ms; I suspect this value is too low
>>>> for a namenode with 75 GB of heap usage. Can you increase the value by
>>>> 5 sec and check the connection again? To increase it, modify the property
>>>> ipc.client.rpc-timeout.ms in core-site.xml (add it if it is not
>>>> present).
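For reference, a sketch of checking and overriding the timeout on a client node. One assumption worth flagging: the 20000 ms in the stack trace is a *connect* timeout, which in Hadoop is usually governed by `ipc.client.connect.timeout` (default 20000 ms), so both properties are shown; the 60000 value is purely illustrative.

```shell
# Inspect the effective values before changing anything.
hdfs getconf -confKey ipc.client.rpc-timeout.ms
hdfs getconf -confKey ipc.client.connect.timeout   # default 20000 ms
# Override in core-site.xml (add the property if it is not present):
# <property>
#   <name>ipc.client.connect.timeout</name>
#   <value>60000</value>   <!-- milliseconds; illustrative value -->
# </property>
```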
>>>>
>>>>
>>>> Thanks & Regards
>>>> Amithsha
>>>>
>>>>
>>>> On Fri, Feb 18, 2022 at 9:17 PM Tale Hive <[email protected]> wrote:
>>>>
>>>>> Hello Tom.
>>>>>
>>>>> Sorry for my late answer; I don't know why Gmail puts your mail
>>>>> into spam -_-.
>>>>>
>>>>> To answer you :
>>>>>
>>>>>    - The metrics callQueueLength, avgQueueTime, avgProcessingTime and
>>>>>    the GC metrics are all OK
>>>>>    - Threads are plenty sufficient (I can see the metrics for them as
>>>>>    well, and I am below 200, the number I have for the 8020 RPC server)
>>>>>
>>>>> Did you see my other answers about this problem?
>>>>> I would be interested in your opinion on it!
>>>>>
>>>>> Best regards.
>>>>>
>>>>> T@le
>>>>>
>>>>>
>>>>> On Tue, Feb 15, 2022 at 2:16 AM, tom lee <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> It might be helpful to analyze namenode metrics and logs.
>>>>>>
>>>>>> What about some key metrics? Examples are callQueueLength,
>>>>>> avgQueueTime, avgProcessingTime and GC metrics.
>>>>>>
>>>>>> In addition, is the number of handler
>>>>>> threads (dfs.namenode.service.handler.count) in the namenode sufficient?
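Those metrics can be pulled without any UI from the NameNode's JMX servlet (a sketch; port 9870 is the Hadoop 3 default HTTP port and the hostname is a placeholder from the thread):

```shell
# Query RPC metrics for the 8020 server directly over HTTP. The JMX servlet
# returns JSON; grep keeps just the fields discussed in the thread.
curl -s "http://<active-namenode-hostname>:9870/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020" \
  | grep -E '"(CallQueueLength|RpcQueueTimeAvgTime|RpcProcessingTimeAvgTime)"'
```

Polling this endpoint during a slow window shows whether queue/processing times spike at the same moment the client times out.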
>>>>>>
>>>>>> Hopefully this will help.
>>>>>>
>>>>>> Best regards.
>>>>>> Tom
>>>>>>
>>>>>> On Mon, Feb 14, 2022 at 11:57 PM, Tale Hive <[email protected]> wrote:
>>>>>>
>>>>>>> Hello.
>>>>>>>
>>>>>>> I am encountering a strange problem with my namenode. I have the
>>>>>>> following architecture:
>>>>>>> - Two namenodes in HA
>>>>>>> - 600 datanodes
>>>>>>> - HDP 3.1.4
>>>>>>> - 150 million files and folders
>>>>>>>
>>>>>>> Sometimes, when I query the namenode with the hdfs client, I get a
>>>>>>> timeout error like this:
>>>>>>> hdfs dfs -ls -d /user/myuser
>>>>>>>
>>>>>>> 22/02/14 15:07:44 INFO retry.RetryInvocationHandler:
>>>>>>> org.apache.hadoop.net.ConnectTimeoutException: Call From
>>>>>>> <my-client-hostname>/<my-client-ip> to <active-namenode-hostname>:8020
>>>>>>> failed on socket timeout exception:
>>>>>>>   org.apache.hadoop.net.ConnectTimeoutException: 20000 millis
>>>>>>> timeout while waiting for channel to be ready for connect. ch :
>>>>>>> java.nio.channels.SocketChannel[connection-pending
>>>>>>> remote=<active-namenode-hostname>/<active-namenode-ip>:8020];
>>>>>>>   For more details see:  http://wiki.apache.org/hadoop/SocketTimeout,
>>>>>>>
>>>>>>> while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over
>>>>>>> <active-namenode-hostname>/<active-namenode-ip>:8020 after 2 failover
>>>>>>> attempts. Trying to failover after sleeping for 2694ms.
>>>>>>>
>>>>>>> I checked the heap of the namenode and there is no problem (I have
>>>>>>> 75 GB of max heap and I am at around 50 GB used).
>>>>>>> I checked the client RPC threads of the namenode: I am at 200, which
>>>>>>> follows the recommendations of the Hadoop Operations book.
>>>>>>> I have the service RPC enabled to prevent any problem that could come
>>>>>>> from the datanodes or ZKFC.
>>>>>>> General resources seem OK: CPU usage is fine, same for memory,
>>>>>>> network and IO.
>>>>>>> No firewall is enabled on my namenodes or my client.
>>>>>>>
>>>>>>> What could be causing this problem, please?
>>>>>>>
>>>>>>> Thank you in advance for your help !
>>>>>>>
>>>>>>> Best regards.
>>>>>>>
>>>>>>> T@le
>>>>>>>
>>>>>>
