[
https://issues.apache.org/jira/browse/KAFKA-6544?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16357483#comment-16357483
]
Yu Yang commented on KAFKA-6544:
--------------------------------
[~cmccabe] The Kafka process is in `<defunct>` status. `sudo ls -l
/proc/$kafka_pid/fd` returns `total 0`, i.e. no open file descriptors are
listed. I am also including the `netstat -pnt` output here. Almost all
connections are in ESTABLISHED or CLOSE_WAIT state.
{code}
proc/30413/fd]# sudo ls -l /proc/30413/fd
total 0
{code}
{code}
netstat -pnt | grep "10.1.160.124:9092" | wc
116 812 11252
{code}
{code}
netstat -pnt | grep "10.1.160.124:9092"
tcp 29 0 10.1.160.124:9092 10.1.25.241:55616 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:58624 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.9.121:33894 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:53886 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:43122 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:50766 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.26.165:34282 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.79.149:47682 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.163.135:44008 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.66.116:52398 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.64.116:36656 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.207.247:51904 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.9.16:45942 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.131.15:57118 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:55974 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.214.5:33040 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:33494 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.201.139:60230 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.207.247:51792 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:42858 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:44246 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.194.26:42406 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:32902 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.169.94:35532 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.193.101:48832 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.204.225:60946 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:35772 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:46972 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:56226 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:46432 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:44436 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:48888 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:47364 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:44908 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:43060 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.10.15:39282 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.181.86:55500 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.17.191:32812 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.141.30:52024 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.76.141:51366 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:50940 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.11.196:44064 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.143.107:37116 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:37416 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.71.116:35110 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:60884 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.14.163:51768 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.15.51:49542 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.6.217:46520 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:60314 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:56516 ESTABLISHED -
tcp 0 0 10.1.160.124:9092 10.1.232.16:60754 SYN_RECV -
tcp 29 0 10.1.160.124:9092 10.1.25.241:57568 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.198.209:38446 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:38278 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.201.206:46686 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:48798 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:51958 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:40716 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.0.41:47810 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.215.172:34926 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:36104 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.193.30:49338 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:41596 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.201.28:57122 CLOSE_WAIT -
tcp 0 12 10.1.160.124:9092 10.1.150.72:36506 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.165.50:43042 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:50396 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.4.9:44952 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.98.254:36852 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.247.162:38234 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:38694 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:55794 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.138.76:56542 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:40790 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:32858 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.77.228:34292 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.203.191:55610 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:45182 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.3.215:58404 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:42014 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:46172 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:39050 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:36000 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:51330 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:44994 ESTABLISHED -
tcp 0 8 10.1.160.124:9092 10.1.150.72:46158 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:59280 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:46678 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:54272 ESTABLISHED -
tcp 0 16 10.1.160.124:9092 10.1.63.47:56546 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.79.66:34010 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:56790 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:47846 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.229.18:34272 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.2.141:44584 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:53156 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:52610 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:37628 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.203.117:41170 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:42540 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:41244 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:56308 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:51810 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:38634 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.5.51:60498 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.78.153:53942 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:33506 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:43768 ESTABLISHED -
tcp 29 0 10.1.160.124:9092 10.1.25.241:37134 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.69.17:45370 CLOSE_WAIT -
tcp 1 0 10.1.160.124:9092 10.1.83.103:57640 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.205.66:40266 CLOSE_WAIT -
tcp 29 0 10.1.160.124:9092 10.1.25.241:60058 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.245.20:54896 CLOSE_WAIT -
tcp 65 0 10.1.160.124:9092 10.1.76.47:53444 CLOSE_WAIT -
{code}
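To summarize a listing like the one above without reading it row by row, the per-state counts can be tallied with a short script. This is only a sketch in Python: the function name `tally_states` and the three-row sample are illustrative and not part of the original report. It assumes the usual `netstat -pnt` column order (Proto, Recv-Q, Send-Q, Local Address, Foreign Address, State); any line that is not a full `tcp` row, such as a wrapped PID column, is ignored.

```python
from collections import Counter


def tally_states(netstat_output: str) -> Counter:
    """Count TCP connection states in `netstat -pnt` style output."""
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # A data row has at least: proto, Recv-Q, Send-Q, local, foreign, state.
        if len(fields) >= 6 and fields[0].startswith("tcp"):
            states[fields[5]] += 1
    return states


# Illustrative sample: three rows copied from the listing in this comment.
sample = """\
tcp 29 0 10.1.160.124:9092 10.1.25.241:55616 ESTABLISHED -
tcp 65 0 10.1.160.124:9092 10.1.9.121:33894 CLOSE_WAIT -
tcp 0 0 10.1.160.124:9092 10.1.232.16:60754 SYN_RECV -
"""

for state, count in sorted(tally_states(sample).items()):
    print(state, count)
```

Feeding it the full output of `netstat -pnt | grep "10.1.160.124:9092"` should make the ESTABLISHED / CLOSE_WAIT split visible at a glance. The non-zero Recv-Q values on most rows indicate bytes that the socket received but the owning process never read.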
> kafka process should exit when it encounters "java.io.IOException: Too many
> open files"
> -----------------------------------------------------------------------------------------
>
> Key: KAFKA-6544
> URL: https://issues.apache.org/jira/browse/KAFKA-6544
> Project: Kafka
> Issue Type: Bug
> Components: admin, network
> Affects Versions: 0.10.2.1
> Reporter: Yu Yang
> Priority: Major
>
> Our Kafka cluster encountered a few disk/XFS failures on cloud VM
> instances. When a disk/XFS failure happened, the Kafka process did not exit
> gracefully. Instead, it went into "<defunct>" status, with port 9092 still
> reachable. When failures like this happen, Kafka should shut down all
> threads and exit. The following are the Kafka logs from the time of the failure:
> {code:java}
> [2018-02-08 12:52:31,764] ERROR Error while accepting connection (kafka.network.Acceptor)
> java.io.IOException: Too many open files
> at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
> at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
> at kafka.network.Acceptor.accept(SocketServer.scala:340)
> at kafka.network.Acceptor.run(SocketServer.scala:283)
> at java.lang.Thread.run(Thread.java:748)
> [2018-02-08 12:52:31,772] ERROR Error while accepting connection (kafka.network.Acceptor)
> java.io.IOException: Too many open files
> at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
> at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
> at kafka.network.Acceptor.accept(SocketServer.scala:340)
> at kafka.network.Acceptor.run(SocketServer.scala:283)
> at java.lang.Thread.run(Thread.java:748)
> [2018-02-08 12:52:31,772] ERROR Error while accepting connection (kafka.network.Acceptor)
> java.io.IOException: Too many open files
> at sun.nio.ch.ServerSocketChannelImpl.accept0(Native Method)
> at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:422)
> at sun.nio.ch.ServerSocketChannelImpl.accept(ServerSocketChannelImpl.java:250)
> at kafka.network.Acceptor.accept(SocketServer.scala:340)
> at kafka.network.Acceptor.run(SocketServer.scala:283)
> at java.lang.Thread.run(Thread.java:748)
> {code}
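The behavior requested in the issue title (exit instead of logging and retrying when accept fails with "Too many open files") can be illustrated with a small accept wrapper. This is a hedged Python analogue, not Kafka's `Acceptor` code: the names `is_fd_exhaustion`, `accept_once`, and the `on_fatal` callback are hypothetical, and a real broker would need to run its own shutdown sequence rather than call `sys.exit` directly.

```python
import errno
import sys

# errno values for "too many open files": per-process and system-wide limits.
FD_EXHAUSTION = {errno.EMFILE, errno.ENFILE}


def is_fd_exhaustion(exc: OSError) -> bool:
    """Return True if the error means the file descriptor limit was hit."""
    return exc.errno in FD_EXHAUSTION


def accept_once(server_sock, on_fatal=sys.exit):
    """Accept one connection; treat descriptor exhaustion as fatal.

    `server_sock` is any object with a socket-like accept() method.
    `on_fatal` is called with exit status 1 when descriptors run out
    (hypothetical hook; defaults to sys.exit).
    """
    try:
        return server_sock.accept()
    except OSError as exc:
        if is_fd_exhaustion(exc):
            on_fatal(1)  # with the default sys.exit this does not return
            return None
        raise  # other accept errors are left to the caller
```

Passing a custom `on_fatal` makes the fatal path easy to exercise in tests without terminating the interpreter.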