If TCP errors occur, then you need to check the network metrics. Yes, tcpdump
can help you.
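Before a full capture, the ListenDrop counters mentioned earlier in the thread can be read straight from the kernel. A minimal sketch, assuming a Linux namenode host (the counters are cumulative since boot, so compare a reading before and after a freeze):

```shell
# Print the listen-queue related TcpExt counters from /proc/net/netstat.
# The file holds header/value line pairs with matching columns.
awk '/^TcpExt:/ {
       if (!hdr) { split($0, h); hdr = 1 }   # first TcpExt line: counter names
       else                                  # second TcpExt line: values
         for (i = 2; i <= NF; i++)
           if (h[i] ~ /Listen/) print h[i], $i
     }' /proc/net/netstat
```

A ListenDrops count that rises exactly during the timeouts would point at the accept queue rather than at the namenode heap.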


Thanks & Regards
Amithsha


On Tue, Feb 22, 2022 at 1:29 PM Tale Hive <[email protected]> wrote:

> Hello !
>
> @Amith sha <[email protected]>
> I also checked the system metrics; nothing is wrong with CPU, RAM, or IO.
> The only thing I found was these TCP errors (ListenDrop).
>
> @HK
> I'm monitoring a lot of JVM metrics like this one :
> "UnderReplicatedBlocks" in the bean
> "Hadoop:service=NameNode,name=FSNamesystem".
> And I found no under replicated blocks when the problem of timeout occurs,
> unfortunately.
> Thanks for your advice. In addition to the tcpdump, I'll perform some
> jstacks to see what the IPC handlers are doing.
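In case it is useful, a sketch of that jstack collection (the pgrep pattern and file names are assumptions; run it on the active namenode host while a timeout is happening):

```shell
# Take a few thread dumps of the NameNode some seconds apart, then
# summarize the states of the IPC handler threads across the dumps.
NN_PID=$(pgrep -f 'org.apache.hadoop.hdfs.server.namenode.NameNode' | head -1 || true)
if [ -n "$NN_PID" ]; then
  for i in 1 2 3; do
    jstack "$NN_PID" > "nn-jstack-$i.txt"
    sleep 10
  done
  # Handler threads are named "IPC Server handler <n> on <port>"
  grep -h -A 1 '"IPC Server handler' nn-jstack-*.txt \
    | grep 'java.lang.Thread.State' | sort | uniq -c | sort -rn
fi
```

If most handlers are BLOCKED on the same lock across all dumps, that points at the namesystem lock rather than the network.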
>
> Best regards.
>
> T@le
>
>
>
>
>
>
> On Tue, Feb 22, 2022 at 04:30, HK <[email protected]> wrote:
>
>> Hi Tale,
>> Could you please take a thread dump of the namenode process and check
>> what the IPC handlers are doing?
>>
>> We faced a similar issue when under-replication was high in the cluster,
>> due to the filesystem writeLock.
>>
>> On Tue, 22 Feb 2022, 8:37 am Amith sha, <[email protected]> wrote:
>>
>>> Check your system metrics too.
>>>
>>> On Mon, Feb 21, 2022, 10:52 PM Tale Hive <[email protected]> wrote:
>>>
>>>> Yeah, the next step for me is to perform a tcpdump just when the problem
>>>> occurs.
>>>> I want to know whether my namenode does not accept connections because it
>>>> freezes for some reason or because there are too many connections at a
>>>> time.
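A possible shape for that capture, assuming it runs on the active namenode host as root (the interface and output file name are placeholders):

```shell
# Keep the capture small: only SYN/RST segments on the RPC port.
# Stop with Ctrl-C once a timeout has been observed, then open in Wireshark.
tcpdump -i any -nn -w nn-8020.pcap \
  'tcp port 8020 and tcp[tcpflags] & (tcp-syn|tcp-rst) != 0'
```

Repeated SYN retransmissions with no SYN-ACK would mean the listen queue is full; RSTs would point at something else.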
>>>>
>>>> My delay is far worse than 2 s: sometimes an hdfs dfs -ls -d
>>>> /user/<my-user> takes 20 s or 43 s, and rarely it is even longer than 1
>>>> minute.
>>>> And during this time, CallQueue is OK, Heap is OK; I can't find any
>>>> metric which could show me a problem inside the namenode JVM.
>>>>
>>>> Best regards.
>>>>
>>>> T@le
>>>>
>>>> On Mon, Feb 21, 2022 at 16:32, Amith sha <[email protected]>
>>>> wrote:
>>>>
>>>>> If you are still concerned about the delay of > 2 s, then you need to
>>>>> benchmark with and without load. It will help to find the root cause of
>>>>> the problem.
>>>>>
>>>>> On Mon, Feb 21, 2022, 1:52 PM Tale Hive <[email protected]> wrote:
>>>>>
>>>>>> Hello Amith.
>>>>>>
>>>>>> Hm, not a bad idea. If I increase the size of the listen queue and
>>>>>> also increase the timeout, the combination of both may mitigate the
>>>>>> problem more than increasing the listen queue size alone.
>>>>>> It won't solve the problem of acceptance speed, but it could help.
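For reference, my understanding is that the listen queue discussed here is sized by ipc.server.listen.queue.size in core-site.xml, and that the effective backlog is also capped by the kernel's net.core.somaxconn, so both usually need raising together. A sketch (the value is an example, not a recommendation):

```xml
<!-- core-site.xml on the namenode; a namenode restart is required -->
<property>
  <name>ipc.server.listen.queue.size</name>
  <value>1024</value> <!-- example value; the Hadoop default is 128 -->
</property>
```

The matching kernel-side change would be something like `sysctl -w net.core.somaxconn=1024` on the namenode host.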
>>>>>>
>>>>>> Thanks for the suggestion !
>>>>>>
>>>>>> T@le
>>>>>>
>>>>>> On Mon, Feb 21, 2022 at 02:33, Amith sha <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout
>>>>>>> while waiting for channel to be ready for connect.
>>>>>>> The connection timed out after 20000 ms. I suspect this value is very
>>>>>>> low for a namenode with 75 GB of heap usage. Can you increase the
>>>>>>> value to 5 sec and check the connection? To increase the value, modify
>>>>>>> the property ipc.client.rpc-timeout.ms in core-site.xml (if it is not
>>>>>>> present, then add it to core-site.xml).
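A sketch of that client-side change (values are illustrative only). One caveat worth checking: the 20000 ms in the stack trace is the connect phase, which core-default.xml ties to ipc.client.connect.timeout (default 20000 ms), so that property may be the one actually firing here:

```xml
<!-- client-side core-site.xml; values are examples, not recommendations -->
<property>
  <name>ipc.client.rpc-timeout.ms</name>
  <value>60000</value>
</property>
<property>
  <name>ipc.client.connect.timeout</name>
  <value>60000</value> <!-- default 20000 ms, matching the error message -->
</property>
```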
>>>>>>>
>>>>>>>
>>>>>>> Thanks & Regards
>>>>>>> Amithsha
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Feb 18, 2022 at 9:17 PM Tale Hive <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hello Tom.
>>>>>>>>
>>>>>>>> Sorry for my lack of answers; I don't know why Gmail puts your
>>>>>>>> mail into spam -_-.
>>>>>>>>
>>>>>>>> To answer you :
>>>>>>>>
>>>>>>>>    - The metrics callQueueLength, avgQueueTime, avgProcessingTime
>>>>>>>>    and the GC metrics are all OK
>>>>>>>>    - Threads are plenty sufficient (I can see the metrics for them
>>>>>>>>    too, and I am below 200, the number I have for the 8020 RPC server)
>>>>>>>>
>>>>>>>> Did you see my other answers about this problem?
>>>>>>>> I would be interested in your opinion on that!
>>>>>>>>
>>>>>>>> Best regards.
>>>>>>>>
>>>>>>>> T@le
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Feb 15, 2022 at 02:16, tom lee <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> It might be helpful to analyze namenode metrics and logs.
>>>>>>>>>
>>>>>>>>> What about some key metrics? Examples are callQueueLength,
>>>>>>>>> avgQueueTime, avgProcessingTime and GC metrics.
>>>>>>>>>
>>>>>>>>> In addition, is the number of
>>>>>>>>> threads (dfs.namenode.service.handler.count) in the namenode
>>>>>>>>> sufficient?
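For reference, those metrics can be read from the namenode's JMX servlet. A sketch with placeholder hostnames (port 50070 assumes a default non-HTTPS HDP web UI):

```shell
# CallQueueLength, RpcQueueTimeAvgTime and RpcProcessingTimeAvgTime
curl -s 'http://<active-namenode-hostname>:50070/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020'
# FSNamesystem counters such as UnderReplicatedBlocks
curl -s 'http://<active-namenode-hostname>:50070/jmx?qry=Hadoop:service=NameNode,name=FSNamesystem'
```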
>>>>>>>>>
>>>>>>>>> Hopefully this will help.
>>>>>>>>>
>>>>>>>>> Best regards.
>>>>>>>>> Tom
>>>>>>>>>
>>>>>>>>> Tale Hive <[email protected]> wrote on Mon, Feb 14, 2022 at 23:57:
>>>>>>>>>
>>>>>>>>>> Hello.
>>>>>>>>>>
>>>>>>>>>> I am encountering a strange problem with my namenode. I have the
>>>>>>>>>> following architecture:
>>>>>>>>>> - Two namenodes in HA
>>>>>>>>>> - 600 datanodes
>>>>>>>>>> - HDP 3.1.4
>>>>>>>>>> - 150 million files and folders
>>>>>>>>>>
>>>>>>>>>> Sometimes, when I query the namenode with the hdfs client, I get
>>>>>>>>>> a timeout error like this:
>>>>>>>>>> hdfs dfs -ls -d /user/myuser
>>>>>>>>>>
>>>>>>>>>> 22/02/14 15:07:44 INFO retry.RetryInvocationHandler:
>>>>>>>>>> org.apache.hadoop.net.ConnectTimeoutException: Call From
>>>>>>>>>> <my-client-hostname>/<my-client-ip> to 
>>>>>>>>>> <active-namenode-hostname>:8020
>>>>>>>>>> failed on socket timeout exception:
>>>>>>>>>>   org.apache.hadoop.net.ConnectTimeoutException: 20000 millis
>>>>>>>>>> timeout while waiting for channel to be ready for connect. ch :
>>>>>>>>>> java.nio.channels.SocketChannel[connection-pending
>>>>>>>>>> remote=<active-namenode-hostname>/<active-namenode-ip>:8020];
>>>>>>>>>>   For more details see:
>>>>>>>>>> http://wiki.apache.org/hadoop/SocketTimeout,
>>>>>>>>>> while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo
>>>>>>>>>> over <active-namenode-hostname>/<active-namenode-ip>:8020 after 2 
>>>>>>>>>> failover
>>>>>>>>>> attempts. Trying to failover after sleeping for 2694ms.
>>>>>>>>>>
>>>>>>>>>> I checked the heap of the namenode and there is no problem (I
>>>>>>>>>> have 75 GB of max heap and I'm around 50 GB used).
>>>>>>>>>> I checked the threads of the client RPC server for the namenode:
>>>>>>>>>> I'm at 200, which respects the recommendations from the Hadoop
>>>>>>>>>> Operations book.
>>>>>>>>>> I have serviceRPC enabled to prevent any problem which could be
>>>>>>>>>> coming from datanodes or ZKFC.
>>>>>>>>>> General resources seem OK: CPU usage is fine, same for memory,
>>>>>>>>>> network, and IO.
>>>>>>>>>> No firewall is enabled on my namenodes or my client.
>>>>>>>>>>
>>>>>>>>>> What could be causing this problem, please?
>>>>>>>>>>
>>>>>>>>>> Thank you in advance for your help !
>>>>>>>>>>
>>>>>>>>>> Best regards.
>>>>>>>>>>
>>>>>>>>>> T@le
>>>>>>>>>>
>>>>>>>>>
