Re.

As a note, I managed to get one jstack from the standby namenode when the
problem occurred.
Here is the state of the listener thread for port 8020:
"IPC Server listener on 8020" #122 daemon prio=5 os_prio=0 tid=0x00007f8a0e84d000 nid=0x6a07e waiting on condition [0x00007f7647ae5000]
   java.lang.Thread.State: WAITING (parking)
        at sun.misc.Unsafe.park(Native Method)
        - parking to wait for  <0x00007f7a81525618> (a java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2039)
        at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:350)
        at org.apache.hadoop.ipc.Server$Listener$Reader.addConnection(Server.java:1135)
        at org.apache.hadoop.ipc.Server$Listener.doAccept(Server.java:1236)
        at org.apache.hadoop.ipc.Server$Listener.run(Server.java:1167)

   Locked ownable synchronizers:
        - None

At this time, the listenQueue was full and the number of "SYNs to LISTEN
sockets dropped" increased by almost 6,000 in 18 seconds:
25-02-2022 10:18:06:        568780747 SYNs to LISTEN sockets dropped
(...)
25-02-2022 10:18:24:        568786673 SYNs to LISTEN sockets dropped

Unfortunately, I can't find anything else about 0x00007f7a81525618 in the
jstack. The listener thread seems to be waiting for something, but I don't
know what for the moment.
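For reference, the drop counter above can be sampled with a short shell snippet; this is just a sketch around netstat -s (timestamps formatted like the log excerpt above), with a fallback line for machines without net-tools:

```shell
# Print the cumulative "SYNs to LISTEN sockets dropped" counter with a
# timestamp, in the same format as the log excerpt above.
sample_syn_drops() {
  netstat -s 2>/dev/null | grep -i 'SYNs to LISTEN sockets dropped' \
    || echo "0 SYNs to LISTEN sockets dropped (netstat unavailable)"
}
echo "$(date '+%d-%m-%Y %H:%M:%S'): $(sample_syn_drops)"
```

Running it from cron or a watch loop while the problem occurs gives the per-interval delta.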

Best regards.

T@le


On Mon, Feb 28, 2022 at 11:52, Tale Hive <[email protected]> wrote:

> Hello Gurmukh Singh.
>
> Thank you for your answers.
>
> Why a 75 GB heap size for the NN? Are you running a very large cluster?
>> 50 GB of heap used? Can you check whether you are talking about the NN heap
>> itself or about the total memory used on the server?
>> 50 GB approx means 200 million blocks? Do you have that many?
>>
>
> I have ~150 million blocks/files, and I set up this heap following the
> recommendations here:
>
> https://docs.cloudera.com/HDPDocuments/HDP2/HDP-2.3.0/bk_installing_manually_book/content/ref-80953924-1cbf-4655-9953-1e744290a6c3.1.html
>
> The formula is 20 X log base2(n); where n is the number of nodes.
>> So, if you have a thousand nodes we keep it to 200 (20 X log2(1024)=200)
>> and then approx 20 threads per thousand nodes.
>>
>
> I have 600 datanodes, which should normally put me at 20 * log2(600) ≈ 185
> threads for the ClientRPC server (the one which listens on port 8020).
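For what it's worth, the formula quoted above can be checked with a one-liner; log2(600) ≈ 9.23, so rounding 20 * log2(600) does give 185:

```shell
# Handler count = 20 * log2(n), with n = number of datanodes.
# POSIX awk has no log2(), so use log(n)/log(2); adding 0.5 before
# printf "%d" rounds to the nearest integer.
nodes=600
handlers=$(awk -v n="$nodes" 'BEGIN { printf "%d", 20 * log(n) / log(2) + 0.5 }')
echo "recommended ClientRPC handler count for $nodes datanodes: $handlers"
```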
>
> $ sysctl -n net.core.somaxconn
>>
>> $ sysctl -n net.ipv4.tcp_max_syn_backlog
>>
>> $ sysctl -n net.core.netdev_max_backlog
>>
>
> net.core.somaxconn= 8432
> net.ipv4.tcp_max_syn_backlog = 4096
> net.core.netdev_max_backlog = 2000
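One thing worth noting next to these sysctls: the kernel clamps a server's accept backlog to net.core.somaxconn, and the backlog Hadoop's RPC server requests comes from ipc.server.listen.queue.size (which, if I remember correctly, defaults to 128), so the effective queue is the minimum of the two. A sketch for checking the backlog actually in effect (port 8020 and iproute2's ss are assumptions):

```shell
# For sockets in LISTEN state, ss prints the configured accept backlog
# in the Send-Q column; compare it with the kernel-wide cap.
somaxconn=$(cat /proc/sys/net/core/somaxconn 2>/dev/null || echo 0)
echo "net.core.somaxconn = $somaxconn"
# Hypothetical port 8020; adjust for the RPC server being inspected.
ss -ltn 2>/dev/null | awk '$4 ~ /:8020$/ { print "accept backlog on 8020:", $3 }'
```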
>
>
> $ netstat -an | grep -c SYN_RECV
>>
> $ netstat -an | egrep -v "MYIP.(PORTS|IN|LISTEN)"  | wc -l
>>
>
> I'll check again and get you more information.
>
> What do you see in the JN logs? and what about ZK logs?
>> any logs in NN, ZK along the lines of "Slow sync"?
>>
> I didn't check these logs; I'm going to check them and get back to you.
>
>> What is the ZK heap?
>>
> Zookeeper heap is 4 GB.
>
> Disk latency
>> Heap
>> maxClientCnxns=800 (at least; as you have 600 nodes, you can expect a high
>> job workload)
>> jute.maxbuffer=1GB (by default it is very low, especially in a
>> kerberized env; it must be bumped up). This setting is not there in HDP by
>> default; you have to put it under custom-zoo.cfg
>>
>
> I'm going to check this also.
>
> If you can send me the NN, JN, ZK logs; more than happy to look into it.
>>
> Yes, I can; I just need time to anonymize everything.
>
> Thanks again for your help.
>
> Best regards.
>
> T@le
>
>
>
> On Thu, Feb 24, 2022 at 21:28, gurmukh singh <[email protected]>
> wrote:
>
>> Also, as you are using hive/beeline, you can fetch all the config as:
>>
>> beeline -u "JDBC URL to connect to HS2 " --outputformat=tsv2 -e 'set -v'
>> > /tmp/BeelineSet.out
>>
>> Please attach the BeelineSet.out
>>
>> On Friday, 25 February, 2022, 07:15:51 am GMT+11, gurmukh singh
>> <[email protected]> wrote:
>>
>>
>> on ZK side
>>
>> Important things:
>>
>> Disk latency
>> Heap
>> maxClientCnxns=800 (at least; as you have 600 nodes, you can expect a high
>> job workload)
>> jute.maxbuffer=1GB (by default it is very low, especially in a
>> kerberized env; it must be bumped up). This setting is not there in HDP by
>> default; you have to put it under custom-zoo.cfg
>>
>>
>> If you can send me the NN, JN, ZK logs; more than happy to look into it.
>>
>>
>>
>> On Friday, 25 February, 2022, 06:59:17 am GMT+11, gurmukh singh
>> <[email protected]> wrote:
>>
>>
>> @Tale Hive, you provided the details in the first email; I missed it.
>>
>> Can you provide me the output of the commands below from the Namenode:
>>
>> $ sysctl -n net.core.somaxconn
>>
>> $ sysctl -n net.ipv4.tcp_max_syn_backlog
>>
>> $ sysctl -n net.core.netdev_max_backlog
>>
>> $ netstat -an | grep -c SYN_RECV
>>
>> $ netstat -an | egrep -v "MYIP.(PORTS|IN|LISTEN)"  | wc -l
>>
>>
>> What do you see in the JN logs? and what about ZK logs?
>> any logs in NN, ZK along the lines of "Slow sync"?
>> What is the ZK heap?
>>
>>
>>
>> On Friday, 25 February, 2022, 06:42:31 am GMT+11, gurmukh singh <
>> [email protected]> wrote:
>>
>>
>> I checked the heap of the namenode and there is no problem (I have 75 GB
>> of max heap; I'm around 50 GB used).
>>
>>     Why a 75 GB heap size for the NN? Are you running a very large cluster?
>>     50 GB of heap used? Can you check whether you are talking about the NN
>> heap itself or about the total memory used on the server?
>>     50 GB approx means 200 million blocks? Do you have that many?
>>
>> I checked the threads of the clientRPC for the namenode and I'm at 200,
>> which respects the recommendations from the Hadoop Operations book.
>>     The formula is 20 X log base2(n); where n is the number of nodes.
>>     So, if you have a thousand nodes we keep it to 200 (20 X
>> log2(1024)=200) and then approx 20 threads per thousand nodes.
>>
>> I have serviceRPC enabled to prevent any problem which could be coming
>> from datanodes or ZKFC.
>>
>>
>> On Thursday, 24 February, 2022, 12:19:51 am GMT+11, Tale Hive <
>> [email protected]> wrote:
>>
>>
>> Hello.
>>
>> According to what I saw this morning, I am in fact in the first
>> situation:
>>
>>    - Client sends one packet with flag SYN to namenode
>>    - Namenode sends one packet with flags SYN, ACK to the client
>>    - Client sends n packets with flags PSH, ACK to the namenode, for
>>    each subfolder
>>    - Namenode sends n packets PSH, ACK to the client, for the content of
>>    each subfolder
>>
>> So the number of (PSH, ACK) packets in the tcpdump pcap file is not what
>> is filling the accept queue of the port 8020 ClientRPC server on the Namenode.
>>
>> I'm going to focus on checking the packets with the SYN flag which arrive
>> at the namenode.
>> After that, because jstack provokes an active namenode failover, I don't
>> have many more leads to follow except increasing the listenQueue again, to
>> mitigate the problem, not to solve it.
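A capture filter that keeps only pure connection attempts (SYN set, ACK clear) toward port 8020 could look like the sketch below; the interface name, packet count and timeout are illustrative:

```shell
# Match packets whose SYN flag is set and ACK flag is clear, i.e. new
# connection attempts, destined for the NameNode client RPC port.
filter='tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn and dst port 8020'
if command -v tcpdump >/dev/null 2>&1; then
  # -n: no name resolution; -c 200: stop after 200 packets; needs root.
  timeout 10 tcpdump -ni eth0 -c 200 "$filter" 2>/dev/null
else
  echo "tcpdump not installed; filter would be: $filter"
fi
```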
>>
>> Best regards.
>>
>> T@le
>>
>>
>>
>> On Wed, Feb 23, 2022 at 11:46, Tale Hive <[email protected]> wrote:
>>
>> Hello guys.
>>
>> Still investigating the tcpdump. I don't see a lot of packets with the
>> SYN flag when the listenQueue is full.
>> What I see is a lot of packets with the flag "PSH, ACK", with data inside
>> like this:
>> getListing.org.apache.hadoop.hdfs.protocol.ClientProtocol
>> /apps/hive/warehouse/<mydb>.db/<mytable>/<mypartition>
>>
>> It makes me wonder: when a client performs an hdfs dfs -ls -R <HDFS_PATH>,
>> how many SYN packets will it send to the namenode? One in total, or one per
>> subfolder?
>> Let's say I have "n" subfolders inside <HDFS_PATH>. Will we have this
>> situation:
>> - Client sends one SYN packet to Namenode
>> - Namenode sends one SYN-ACK packet to client
>> - Client sends n ACK or (PSH, ACK) packets to Namenode
>>
>> Or this situation:
>> - Client sends n SYN packets to Namenode
>> - Namenode sends n SYN-ACK packets to client
>> - Client sends n ACK or (PSH, ACK) packets
>>
>> That would mean an hdfs recursive listing on a path with a lot of
>> subfolders could harm the other clients by sending too many packets to the
>> namenode?
>>
>> About the jstack: I tried it on the namenode JVM, but it provoked a
>> failover, as the namenode was not answering at all (in particular, no
>> answer to ZKFC), and the jstack never ended; I had to kill it.
>> I don't know if a kill -3 or a jstack -F could help, but jstack -F at
>> least contains less valuable information.
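On that point: kill -3 (SIGQUIT) makes a HotSpot JVM print its thread dump to its own stdout, i.e. into the NameNode's .out file, without attaching to the process the way jstack does, so it may be gentler on a struggling namenode. A sketch (the proc_namenode pattern assumes the usual Hadoop launcher property; adjust as needed):

```shell
# Hadoop daemons are normally started with -Dproc_namenode on the command
# line (an assumption about the launcher), so pgrep -f can find the PID.
pid=$(pgrep -f 'proc_namenode' | head -n 1)
if [ -n "$pid" ]; then
  # SIGQUIT: the JVM prints a full thread dump to its stdout and keeps running.
  kill -3 "$pid"
  echo "thread dump requested for PID $pid; check the NameNode .out file"
else
  echo "no NameNode process found on this host"
fi
```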
>>
>> T@le
>>
>> On Tue, Feb 22, 2022 at 10:29, Amith sha <[email protected]> wrote:
>>
>> If TCP errors occur, then you need to check the network metrics. Yes,
>> tcpdump can help you.
>>
>>
>> Thanks & Regards
>> Amithsha
>>
>>
>> On Tue, Feb 22, 2022 at 1:29 PM Tale Hive <[email protected]> wrote:
>>
>> Hello !
>>
>> @Amith sha <[email protected]>
>> I checked also the system metrics, nothing wrong in CPU, RAM or IO.
>> The only thing I found was these TCP errors (ListenDrop).
>>
>> @HK
>> I'm monitoring a lot of JVM metrics, like this one:
>> "UnderReplicatedBlocks" in the bean
>> "Hadoop:service=NameNode,name=FSNamesystem".
>> And unfortunately, I found no under-replicated blocks when the timeout
>> problem occurs.
>> Thanks for your advice; in addition to the tcpdump, I'll perform some
>> jstacks to see if I can find out what the IPC handlers are doing.
>>
>> Best regards.
>>
>> T@le
>>
>>
>>
>>
>>
>>
>> On Tue, Feb 22, 2022 at 04:30, HK <[email protected]> wrote:
>>
>> Hi Tale,
>> Could you please take a thread dump of the namenode process, and check
>> what the IPC handlers are doing?
>>
>> We faced a similar issue when under-replication was high in the cluster,
>> due to the filesystem writeLock.
>>
>> On Tue, 22 Feb 2022, 8:37 am Amith sha, <[email protected]> wrote:
>>
>> Check your system metrics too.
>>
>> On Mon, Feb 21, 2022, 10:52 PM Tale Hive <[email protected]> wrote:
>>
>> Yeah, the next step for me is to perform a tcpdump just when the problem
>> occurs.
>> I want to know if my namenode does not accept connections because it
>> freezes for some reason or because there are too many connections at a time.
>>
>> My delay is far worse than 2 s; sometimes an hdfs dfs -ls -d
>> /user/<my-user> takes 20 s or 43 s, and rarely it is even longer than 1
>> minute.
>> And during this time, the CallQueue is OK and the heap is OK; I can't find
>> any metric which could show me a problem inside the namenode JVM.
>>
>> Best regards.
>>
>> T@le
>>
>> On Mon, Feb 21, 2022 at 16:32, Amith sha <[email protected]> wrote:
>>
>> If you are still concerned about the delay of > 2 s, then you need to
>> benchmark with and without load. It will help to find the root cause of
>> the problem.
>>
>> On Mon, Feb 21, 2022, 1:52 PM Tale Hive <[email protected]> wrote:
>>
>> Hello Amith.
>>
>> Hm, not a bad idea. If I increase the size of the listenQueue and also
>> increase the timeout, the combination of both may mitigate the problem
>> more than just increasing the listenQueue size.
>> It won't solve the problem of acceptance speed, but it could help.
>>
>> Thanks for the suggestion !
>>
>> T@le
>>
>> On Mon, Feb 21, 2022 at 02:33, Amith sha <[email protected]> wrote:
>>
>> org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout while
>> waiting for channel to be ready for connect.
>> The connection timed out after 20000 ms; I suspect this value is very low
>> for a namenode with 75 GB of heap usage. Can you increase the value to 5
>> sec and check the connection? To increase the value, modify the property
>> ipc.client.rpc-timeout.ms in core-site.xml (if it is not present, add it
>> to the core-site.xml)
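For reference, this is the shape of the core-site.xml snippet being suggested, plus a way to read back the effective value with hdfs getconf; the 60000 ms value is purely illustrative:

```shell
# The XML fragment to add to core-site.xml (illustrative value, in ms).
snippet='<property>
  <name>ipc.client.rpc-timeout.ms</name>
  <value>60000</value>
</property>'
echo "$snippet"
# Read back the effective value on a client machine (needs the hdfs CLI).
if command -v hdfs >/dev/null 2>&1; then
  hdfs getconf -confKey ipc.client.rpc-timeout.ms
else
  echo "hdfs CLI not on PATH; skipping getconf check"
fi
```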
>>
>>
>> Thanks & Regards
>> Amithsha
>>
>>
>> On Fri, Feb 18, 2022 at 9:17 PM Tale Hive <[email protected]> wrote:
>>
>> Hello Tom.
>>
>> Sorry for my lack of answers; I don't know why Gmail puts your mail
>> into spam -_-.
>>
>> To answer you :
>>
>>    - The metrics callQueueLength, avgQueueTime, avgProcessingTime and the
>>    GC metrics are all OK
>>    - Threads are plenty sufficient (I can see the metrics for them too,
>>    and I am below 200, the number I have for the 8020 RPC server)
>>
>> Did you see my other answers about this problem?
>> I would be interested to have your opinion on them!
>>
>> Best regards.
>>
>> T@le
>>
>>
>> On Tue, Feb 15, 2022 at 02:16, tom lee <[email protected]> wrote:
>>
>> It might be helpful to analyze namenode metrics and logs.
>>
>> What about some key metrics? Examples are callQueueLength, avgQueueTime,
>> avgProcessingTime and GC metrics.
>>
>> In addition, is the number of threads (dfs.namenode.service.handler.count)
>> in the namenode sufficient?
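Those RPC metrics are exposed through the NameNode's JMX servlet; a sketch for pulling them follows (hostname and HTTP port are assumptions — 9870 is the Hadoop 3 default, older stacks use 50070):

```shell
# CallQueueLength, RpcQueueTimeAvgTime and RpcProcessingTimeAvgTime live
# in the RpcActivityForPort8020 bean (one bean per RPC server port).
nn_http='http://namenode.example.com:9870'   # hypothetical host and port
url="$nn_http/jmx?qry=Hadoop:service=NameNode,name=RpcActivityForPort8020"
if command -v curl >/dev/null 2>&1; then
  curl -s --max-time 5 "$url" | grep -E 'CallQueueLength|AvgTime' \
    || echo "NameNode not reachable at $nn_http"
else
  echo "curl not installed; would fetch: $url"
fi
```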
>>
>> Hopefully this will help.
>>
>> Best regards.
>> Tom
>>
>> On Mon, Feb 14, 2022 at 23:57, Tale Hive <[email protected]> wrote:
>>
>> Hello.
>>
>> I encounter a strange problem with my namenode. I have the following
>> architecture:
>> - Two namenodes in HA
>> - 600 datanodes
>> - HDP 3.1.4
>> - 150 million files and folders
>>
>> Sometimes, when I query the namenode with the hdfs client, I get a
>> timeout error like this:
>> hdfs dfs -ls -d /user/myuser
>>
>> 22/02/14 15:07:44 INFO retry.RetryInvocationHandler:
>> org.apache.hadoop.net.ConnectTimeoutException: Call From
>> <my-client-hostname>/<my-client-ip> to <active-namenode-hostname>:8020
>> failed on socket timeout exception:
>>   org.apache.hadoop.net.ConnectTimeoutException: 20000 millis timeout
>> while waiting for channel to be ready for connect. ch :
>> java.nio.channels.SocketChannel[connection-pending
>> remote=<active-namenode-hostname>/<active-namenode-ip>:8020];
>>   For more details see:  http://wiki.apache.org/hadoop/SocketTimeout,
>> while invoking ClientNamenodeProtocolTranslatorPB.getFileInfo over
>> <active-namenode-hostname>/<active-namenode-ip>:8020 after 2 failover
>> attempts. Trying to failover after sleeping for 2694ms.
>>
>> I checked the heap of the namenode and there is no problem (I have 75 GB
>> of max heap; I'm around 50 GB used).
>> I checked the threads of the clientRPC for the namenode and I'm at 200,
>> which respects the recommendations from the Hadoop Operations book.
>> I have serviceRPC enabled to prevent any problem which could be coming
>> from datanodes or ZKFC.
>> General resources seem OK; CPU usage is pretty fine, and the same goes for
>> memory, network and IO.
>> No firewall is enabled on my namenodes nor my client.
>>
>> I was wondering what could cause this problem, please?
>>
>> Thank you in advance for your help !
>>
>> Best regards.
>>
>> T@le
>>
>>
