Hi netdev,
I've got an application that handles network traffic using various protocols. 
The application consists of a supervisor process and one or more worker 
processes. Together they implement a watchdog that lets the supervisor kill 
hung workers, or detect crashed workers and start new ones. Originally we had 
only a single worker process, and the watchdog was a UDP socket on the loopback 
address: the supervisor sends a health check to the worker and a healthy worker 
replies. When we improved the application to 
support multiple worker processes we were able to simply extend the watchdog to 
use multicast. This was accomplished with no significant change to the watchdog 
logic, i.e., just a matter of the workers joining the multicast group and 
replying with an ID when the supervisor sends to the multicast group.

The new multicast watchdog works fine except under heavy load. Using the test 
program curl-loader we ramp up to several thousand HTTP connections to the 
worker process. As the load builds, the supervisor's health check starts to 
fail intermittently, until it reaches 100% failure at peak load. The failure 
occurs when the health check is originated: sendto() fails with EINVAL. 
As the load drops, sendto() begins to succeed again. The arguments to sendto() 
do not change during the test. Using printk I have isolated the failure to 
udp_sendmsg() in net/ipv4/udp.c:

int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
                size_t len)

Within this function at this block

    /* Lockless fast path for the non-corking case. */
    if (!corkreq) {
        skb = ip_make_skb(sk, fl4, getfrag, msg->msg_iov, ulen,
                  sizeof(struct udphdr), &ipc, &rt,
                  msg->msg_flags);
        err = PTR_ERR(skb);
        if (!IS_ERR_OR_NULL(skb))
            err = udp_send_skb(skb, fl4);

        printk(KERN_ERR "%s goto out from line: %d\n", __FUNCTION__, __LINE__);
        goto out;
    }

the function udp_send_skb() is returning EINVAL.

The kernel is v3.10.0 from upstream RHEL 7.5. Can anyone offer advice before I 
proceed down the stack to look for the root cause? The behavior (failure under 
load, recovery once the load is removed) suggests contention for resources, 
but the EINVAL return code makes no sense to me given that the arguments to 
sendto() do not change. I am totally unfamiliar with this code, so any help is 
appreciated.

Thanks,
Chris
