I'm reporting what appears to be a bug in the Linux kernel's epoll support. epoll sometimes fails to report an EPOLLOUT event when the other side of an AF_UNIX/SOCK_DGRAM socket is closed. This report started as a Go issue at https://golang.org/issue/23604. I've written a C program that demonstrates the same symptoms, at https://github.com/golang/go/issues/23604#issuecomment-398945027 .

The C program sets up an AF_UNIX/SOCK_DGRAM server and several identical clients, all running in non-blocking mode. All the non-blocking sockets are added to epoll, using EPOLLET. The server periodically closes and reopens its socket. The clients look for ECONNREFUSED errors on their write calls, and close and reopen their sockets when they see one.

The clients will sometimes fill up their send buffer, at which point write fails with EAGAIN. They then expect the poller to return an EPOLLOUT event to tell them when they can write again. The expectation is that either the server will read data, freeing up buffer space, or it will close the socket, which should cause the queued packets to be discarded, also freeing up buffer space. Generally the EPOLLOUT event happens. But sometimes the poller never returns such an event, and the client stalls. In the test program this is reported as a client that waits more than 20 seconds to be told to continue.

A similar bug report was made, with few details, at https://stackoverflow.com/questions/38441059/edge-triggered-epoll-for-unix-domain-socket .

I've tested the program and seen the failure on kernel 4.9.0-6-amd64. A colleague has tested the program and seen the failure on 4.18.0-smp-DEV #3 SMP @1529531011 x86_64 GNU/Linux.

If there is a better way for me to report this, please let me know.

Thanks for your attention.

Ian
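
P.S. For readers who don't want to open the linked program, the client-side pattern described above can be sketched roughly as follows. This is a simplified illustration, not the test program itself: socketpair() stands in for the server socket, the function names are made up for this sketch, and it only exercises the expected path (the reader drains the data and EPOLLOUT arrives). It is not expected to reproduce the stall.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/epoll.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: write datagrams until the non-blocking socket
 * reports EAGAIN. Returns the number sent, or -1 on any other error. */
int fill_until_eagain(int fd)
{
    char msg[1024];
    int sent = 0;

    assert(fd >= 0);
    memset(msg, 'x', sizeof msg);
    for (;;) {
        if (write(fd, msg, sizeof msg) >= 0) {
            sent++;
            continue;
        }
        if (errno == EINTR)
            continue;
        if (errno == EAGAIN || errno == EWOULDBLOCK)
            return sent;
        return -1; /* e.g. ECONNREFUSED: the real clients reopen here */
    }
}

/* Hypothetical helper: wait up to timeout_ms for EPOLLOUT.
 * Returns 1 if EPOLLOUT was reported, 0 if not, -1 on error. */
int wait_for_epollout(int epfd, int timeout_ms)
{
    struct epoll_event ev;
    int n;

    do {
        n = epoll_wait(epfd, &ev, 1, timeout_ms);
    } while (n < 0 && errno == EINTR);
    if (n <= 0)
        return n;
    return (ev.events & EPOLLOUT) ? 1 : 0;
}

/* Returns 0 if EPOLLOUT arrives after the peer drains the queue. */
int epollout_demo(void)
{
    struct epoll_event ev = { .events = EPOLLOUT | EPOLLET };
    char buf[1024];
    int sv[2], epfd, rc = -1;

    /* Assumption: a socketpair replaces the bound server socket. */
    if (socketpair(AF_UNIX, SOCK_DGRAM | SOCK_NONBLOCK, 0, sv) < 0)
        return -1;
    epfd = epoll_create1(0);
    if (epfd < 0)
        goto out_socks;
    ev.data.fd = sv[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, sv[0], &ev) < 0)
        goto out;

    /* Consume the initial "writable" edge reported at registration. */
    if (wait_for_epollout(epfd, 0) < 0)
        goto out;
    /* Client side: write until the send buffer is full. */
    if (fill_until_eagain(sv[0]) <= 0)
        goto out;
    /* Server side: read everything, which frees buffer space. */
    while (read(sv[1], buf, sizeof buf) > 0)
        ;
    /* The client now expects a fresh EPOLLOUT edge. */
    if (wait_for_epollout(epfd, 1000) == 1)
        rc = 0;
out:
    close(epfd);
out_socks:
    close(sv[0]);
    close(sv[1]);
    return rc;
}
```

Calling epollout_demo() from main() and checking for a 0 return exercises the path that normally works. The stall in the report corresponds to the final wait_for_epollout() call never seeing an event when the server closes and reopens its socket instead of reading.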