On Fri, 12 Oct 2018 09:30:40 +0200 Henning Rogge <henning.ro...@fkie.fraunhofer.de> wrote:
> Hi,
>
> I am working on a self-written routing agent
> (https://github.com/OLSR/OONF) and am stuck on a problem with netlink
> that I cannot explain with a userspace error.
>
> I am using a netlink socket for setting routes
> (RTM_NEWROUTE/RTM_DELROUTE), querying the kernel for the current routes
> in the database (via an RTM_GETROUTE dump) and for getting multicast
> messages for ongoing routing changes.
>
> After a few netlink messages I get to the point where the kernel just
> does not respond to an RTM_NEWROUTE (no error, no answer, despite the
> NLM_F_ACK flag being set)... but sometimes, when the program sends
> another route command during shutdown of the routing agent (most times
> an RTM_DELROUTE), I get a single netlink packet with a "successful"
> response for both the "missing" RTM_NEWROUTE and the new RTM_DELROUTE
> sequence number.
>
> I am testing two routing agents, each of them in a systemd-nspawn based
> container connected over a bridge on the host system on a current Debian
> Testing (kernel 4.18.0-1-amd64).
>
> I am directly using the netlink sockets, without any other userspace
> library in between.
>
> I have checked the hexdumps of a couple of netlink messages (including
> the ones just before the bug happens) by hand and they seem to be okay.
>
> When I tried to add a "netlink listener" socket for further debugging (ip
> link add nlmon0 type nlmon) the problem vanished until I removed the
> listener socket again.
>
> Any ideas how to debug this problem? Unfortunately I have no short
> example program to trigger the bug... I have seen the problem only
> rarely for years (once every couple of months), but until a few days
> ago I never managed to reproduce it.
>
> Henning Rogge

Are you reading the responses to your requests?
If you don't read the responses, the socket will get flow blocked.
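
To illustrate what reading the responses could look like, here is a minimal sketch in Python. It is not taken from OONF; the names parse_acks, drain and pending are made up for this example. It assumes the standard nlmsghdr layout from <linux/netlink.h> in native byte order and that ACKs arrive as NLMSG_ERROR messages with an error code of 0. The idea is to empty the receive queue completely every time the socket becomes readable, and to track which sequence numbers are still waiting for an ACK.

```python
import socket
import struct

NLMSG_ERROR = 2                       # message type from <linux/netlink.h>
NLMSG_HDR = struct.Struct("=IHHII")   # nlmsg_len, nlmsg_type, nlmsg_flags, nlmsg_seq, nlmsg_pid


def parse_acks(buf):
    """Return {seq: errno} for every NLMSG_ERROR message in buf (errno 0 means ACK)."""
    acks = {}
    off = 0
    while off + NLMSG_HDR.size <= len(buf):
        length, mtype, _flags, seq, _pid = NLMSG_HDR.unpack_from(buf, off)
        if length < NLMSG_HDR.size or off + length > len(buf):
            break  # truncated or malformed message, stop parsing
        if mtype == NLMSG_ERROR and length >= NLMSG_HDR.size + 4:
            (err,) = struct.unpack_from("=i", buf, off + NLMSG_HDR.size)
            acks[seq] = -err  # the kernel reports a negative errno, 0 on success
        off += (length + 3) & ~3  # advance to the next 4-byte aligned message
    return acks


def drain(sock, pending):
    """Read everything queued on sock; remove acknowledged seqs from the pending set."""
    results = {}
    while True:
        try:
            data = sock.recv(65536, socket.MSG_DONTWAIT)
        except BlockingIOError:
            break  # receive queue is empty
        if not data:
            break
        for seq, err in parse_acks(data).items():
            pending.discard(seq)
            results[seq] = err
    return results
```

With a real AF_NETLINK socket, drain() would be called whenever select/poll/epoll reports the socket as readable, and again after each batch of sends. Anything left in pending for a long time points to a request whose answer was never read or never sent.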