Re: Client to namenode Socket timeout exception - connection-pending

Tale Hive Mon, 21 Feb 2022 09:22:55 -0800

Yeah, the next step for me is to run a tcpdump right when the problem occurs.
I want to know whether my namenode fails to accept connections because it
freezes for some reason or because there are too many connections at a time.
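
Alongside the tcpdump, the kernel's listen-queue counters on the namenode
host might hint at which of the two cases applies. A minimal Python sketch,
assuming a Linux host that exposes a TcpExt section in /proc/net/netstat;
the function names are only illustrative, and the counters are host-wide,
not specific to port 8020:

```python
def parse_tcpext(text):
    """Return the TcpExt counters from /proc/net/netstat content as a dict."""
    # The file holds pairs of lines: one with field names, one with values.
    rows = [line.split() for line in text.splitlines()
            if line.startswith("TcpExt:")]
    if len(rows) < 2:
        return {}
    names, values = rows[0][1:], rows[1][1:]
    return dict(zip(names, (int(v) for v in values)))


def listen_queue_counters(path="/proc/net/netstat"):
    """Read the two counters related to a full accept (listen) queue."""
    with open(path) as f:
        counters = parse_tcpext(f.read())
    return {name: counters.get(name)
            for name in ("ListenOverflows", "ListenDrops")}
```

Sampling listen_queue_counters() before and after a slow call would show
whether these counters grow during the incident. If they do, a full accept
queue is a plausible cause; if they stay flat, a freeze looks more likely.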


My delay is far worse than 2 s: sometimes an hdfs dfs -ls -d
/user/<my-user> takes 20 s or 43 s, and occasionally even more than 1 minute.
During this time, CallQueue is OK, heap is OK, and I can't find any metric
that would point to a problem inside the namenode JVM.
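
To separate the TCP connection phase from the RPC itself, the connect time
can be measured on its own from the client. A rough Python sketch, assuming
plain TCP to the namenode RPC port; time_connect is a hypothetical helper
name, and the 20 s default only mirrors the timeout seen in the exception:

```python
import socket
import time


def time_connect(host, port, timeout=20.0):
    """Return the seconds needed to open a TCP connection, or None on failure."""
    start = time.monotonic()
    try:
        # The connection is closed as soon as it is established.
        with socket.create_connection((host, port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        # Covers timeouts, refused connections and name-resolution errors.
        return None
```

Calling this in a loop against <active-namenode-hostname> on port 8020 and
logging the slow or failed attempts would show whether the handshake itself
stalls while the namenode metrics still look healthy.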

Best regards.

T@le

On Mon, Feb 21, 2022 at 16:32, Amith sha <[email protected]> wrote:

> If you are still concerned about the delay of > 2 s, then you need to
> benchmark with and without load. That will help to find the root cause of
> the problem.
>
> On Mon, Feb 21, 2022, 1:52 PM Tale Hive <[email protected]> wrote:
>
>> Hello Amith.
>>
>> Hm, not a bad idea. If I increase both the size of the listenQueue and
>> the timeout, the combination may mitigate the problem more than just
>> increasing the listenQueue size.
>> It won't solve the problem of acceptance speed, but it could help.
>>
>> Thanks for the suggestion !
>>
>> T@le
>>
>> On Mon, Feb 21, 2022 at 02:33, Amith sha <[email protected]> wrote:
>>
>>> org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout
>>> while waiting for channel to be ready for connect.
>>> The connection timed out after 20000 ms. I suspect this value is very
>>> low for a namenode with 75 GB of heap. Can you increase the value to
>>> 5sec and check the connection? To increase the value, modify the property
>>> ipc.client.rpc-timeout.ms in core-site.xml (if it is not present, add it
>>> to core-site.xml).
>>>
>>>
>>> Thanks & Regards
>>> Amithsha
>>>
>>>
>>> On Fri, Feb 18, 2022 at 9:17 PM Tale Hive <[email protected]> wrote:
>>>
>>>> Hello Tom.
>>>>
>>>> Sorry for not answering sooner, I don't know why Gmail put your mail
>>>> into spam -_-.
>>>>
>>>> To answer your questions:
>>>>
>>>>    - The metrics callQueueLength, avgQueueTime, avgProcessingTime and
>>>>    the GC metrics are all OK
>>>>    - There are plenty of threads (I can also see the metrics for
>>>>    them and I am below 200, the number I have for the 8020 RPC server)
>>>>
>>>> Did you see my other answers about this problem ?
>>>> I would be interested to have your opinion about that !
>>>>
>>>> Best regards.
>>>>
>>>> T@le
>>>>
>>>>
>>>> On Tue, Feb 15, 2022 at 02:16, tom lee <[email protected]> wrote:
>>>>
>>>>> It might be helpful to analyze namenode metrics and logs.
>>>>>
>>>>> What about some key metrics? Examples are callQueueLength,
>>>>> avgQueueTime, avgProcessingTime and GC metrics.
>>>>>
>>>>> In addition, is the number of
>>>>> threads (dfs.namenode.service.handler.count) in the namenode sufficient?
>>>>>
>>>>> Hopefully this will help.
>>>>>
>>>>> Best regards.
>>>>> Tom
>>>>>
>>>>> On Mon, Feb 14, 2022 at 23:57, Tale Hive <[email protected]> wrote:
>>>>>
>>>>>> Hello.
>>>>>>
>>>>>> I am encountering a strange problem with my namenode. I have the
>>>>>> following architecture:
>>>>>> - Two namenodes in HA
>>>>>> - 600 datanodes
>>>>>> - HDP 3.1.4
>>>>>> - 150 million files and folders
>>>>>>
>>>>>> Sometimes, when I query the namenode with the hdfs client, I get a
>>>>>> timeout error like this:
>>>>>> hdfs dfs -ls -d /user/myuser
>>>>>>
>>>>>> 22/02/14 15:07:44 INFO retry.RetryInvocationHandler:
>>>>>> org.apache.hadoop.net.ConnectTimeoutException: Call From
>>>>>> <my-client-hostname>/<my-client-ip> to <active-namenode-hostname>:8020
>>>>>> failed on socket timeout exception:
>>>>>>   org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout
>>>>>> while waiting for channel to be ready for connect. ch :
>>>>>> java.nio.channels.SocketChannel[connection-pending
>>>>>> remote=<active-namenode-hostname>/<active-namenode-ip>:8020];
>>>>>>   For more details see:  http://wiki.apache.org/hadoop/SocketTimeout,
>>>>>>
>>>>>> while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over
>>>>>> <active-namenode-hostname>/<active-namenode-ip>:8020 after 2 failover
>>>>>> attempts. Trying to failover after sleeping for 2694ms.
>>>>>>
>>>>>> I checked the heap of the namenode and there is no problem (I have 75
>>>>>> GB of max heap, and about 50 GB is used).
>>>>>> I checked the threads of the clientRPC for the namenode and I'm at
>>>>>> 200, which follows the recommendations from the Hadoop Operations book.
>>>>>> I have serviceRPC enabled to prevent any problem which could be
>>>>>> coming from datanodes or ZKFC.
>>>>>> General resources seem OK: CPU usage is fine, and the same goes for
>>>>>> memory, network and IO.
>>>>>> No firewall is enabled on my namenodes or on my client.
>>>>>>
>>>>>> What could be causing this problem, please?
>>>>>>
>>>>>> Thank you in advance for your help !
>>>>>>
>>>>>> Best regards.
>>>>>>
>>>>>> T@le
>>>>>>
>>>>>
