Jakub Sitnicki wrote:
> On Thu, May 23, 2019 at 05:58 PM CEST, John Fastabend wrote:
>
> > [...]
> >
> >>
> >> Thanks for taking a look at it. Setting MSG_DONTWAIT works great for
> >> me. No more crashes in sk_stream_wait_memory. I've tested it on top of
> >> current bpf-next (f49aa1de9836). Here's my:
> >>
> >> Tested-by: Jakub Sitnicki <ja...@cloudflare.com>
> >>
> >> The actual patch I've tested is below, for completeness.
> >>
> >> BTW. I've run into another crash which I haven't seen before while
> >> testing sockmap-echo, but it looks unrelated:
> >>
> >> https://lore.kernel.org/netdev/20190522100142.28925-1-ja...@cloudflare.com/
> >>
> >> -Jakub
> >>
> >> --- 8< ---
> >>
> >> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> >> index e89be6282693..4a7c656b195b 100644
> >> --- a/net/core/skbuff.c
> >> +++ b/net/core/skbuff.c
> >> @@ -2337,6 +2337,7 @@ int skb_send_sock_locked(struct sock *sk, struct sk_buff *skb, int offset,
> >>  			kv.iov_base = skb->data + offset;
> >>  			kv.iov_len = slen;
> >>  			memset(&msg, 0, sizeof(msg));
> >> +			msg.msg_flags = MSG_DONTWAIT;
> >>
> >>  			ret = kernel_sendmsg_locked(sk, &msg, &kv, 1, slen);
> >>  			if (ret <= 0)
> >
> > I went ahead and submitted this feel free to add your signed-off-by.
>
> Thanks! The fix was all your idea :-)
If I can push the correct patch to Daniel it should be in bpf tree
soon.

> Now that those pesky crashes are gone, we plan to look into drops
> when doing echo with sockmap. Marek tried running echo-sockmap [1]
> with latest bpf-next (plus mentioned crash fixes) and reports that
> not all data bounces back:
>
> $ yes | head -c $[1024*1024] | nc -q2 192.168.1.33 1234 | wc -c
> 971832
> $ yes | head -c $[1024*1024] | nc -q2 192.168.1.33 1234 | wc -c
> 867352
> $ yes | head -c $[1024*1024] | nc -q2 192.168.1.33 1234 | wc -c
> 952648
>
> I'm trying to turn echo-sockmap into a selftest but as you can
> probably guess over loopback all works fine.

Right, sockmap when used from recvmsg with redirect is lossy. This was
a design choice I made that apparently caught a few people by
surprise. The original rationale was that in a multiplex operation,
e.g. a single ingress socket redirecting to many egress sockets,
blocking would cause head-of-line blocking on all the sockets. To
resolve this I simply dropped the packet and then allowed the flow to
continue. This pushes the logic up to the application to do retries,
etc. when it happens. FWIW, the userspace proxies I tested also had
similar points where they fell over and dropped packets.

In hindsight it probably would have made more sense to make this
behavior opt-in rather than the default. But the use case I was
solving at the time could handle drops, and it was actually NxM
sockets, with N ingress sockets and M egress sockets, so head-of-line
blocking was a real problem.

Adding a flag to turn this into a blocking op has been on my todo list
for a while. Especially when sockmap is being used as a single-ingress
to single-egress socket, blocking makes much more sense than dropping.

The place to look is sk_psock_verdict_apply(): in the __SK_REDIRECT
case there are a few checks, and notice we can fall through to a
kfree_skb(skb). This is most likely the drop you are hitting. Maybe
annotate it with a debug statement to check.
To fix this we could add a flag to _not_ drop, but instead either
enqueue the packet regardless of the test or hold it until space is
available. I even think sk_psock_strp_read could push back on the
stream parser, which would eventually push back via TCP and get the
behavior you want.

Also, I have a couple of items on my TODO list that I'll eventually
get to. First, we run without stream parsers in some Cilium use cases.
I'll push some patches to allow this in the next month or so. This
avoids the annoying stream parser prog that simply returns skb->len.
This is mostly an optimization.

A larger change I want to make at some point is to remove the backlog
workqueue altogether. Originally it was added for simplicity, but it
actually causes some latency spikes when looking at 99+ percentiles.
It really doesn't need to be there; it was a holdover from the
original architecture that got pushed upstream. If you have time and
want to, let me know if you would like to tackle removing it.

Thanks,
John