Hi Josef, On 15.12.2016 19:53, Josef Bacik wrote: > On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <t...@herbertland.com> wrote: >> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatg...@gmail.com> >> wrote: >>> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <t...@herbertland.com> >>> wrote: >>>> I think there may be some suspicious code in inet_csk_get_port. At >>>> tb_found there is: >>>> >>>> if (((tb->fastreuse > 0 && reuse) || >>>> (tb->fastreuseport > 0 && >>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>> sk->sk_reuseport && uid_eq(tb->fastuid, >>>> uid))) && >>>> smallest_size == -1) >>>> goto success; >>>> if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, >>>> tb, true)) { >>>> if ((reuse || >>>> (tb->fastreuseport > 0 && >>>> sk->sk_reuseport && >>>> >>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>> uid_eq(tb->fastuid, uid))) && >>>> smallest_size != -1 && --attempts >= 0) { >>>> spin_unlock_bh(&head->lock); >>>> goto again; >>>> } >>>> goto fail_unlock; >>>> } >>>> >>>> AFAICT there is redundancy in these two conditionals. The same clause >>>> is being checked in both: (tb->fastreuseport > 0 && >>>> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport && >>>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the >>>> first conditional should be hit, goto done, and the second will never >>>> evaluate that part to true-- unless the sk is changed (do we need >>>> READ_ONCE for sk->sk_reuseport_cb?). >>> That's an interesting point... It looks like this function also >>> changed in 4.6 from using a single local_bh_disable() at the beginning >>> with several spin_lock(&head->lock) to exclusively >>> spin_lock_bh(&head->lock) at each locking point. Perhaps the full bh >>> disable variant was preventing the timers in your stack trace from >>> running interleaved with this function before? >> >> Could be, although dropping the lock shouldn't be able to affect the >> search state. TBH, I'm a little lost in reading function, the >> SO_REUSEPORT handling is pretty complicated. For instance, >> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that >> function and also in every call to inet_csk_bind_conflict. I wonder if >> we can simply this under the assumption that SO_REUSEPORT is only >> allowed if the port number (snum) is explicitly specified. > > Ok first I have data for you Hannes, here's the time distributions > before during and after the lockup (with all the debugging in place the > box eventually recovers). I've attached it as a text file since it is > long.
Thanks a lot! > Second is I was thinking about why we would spend so much time doing the > ->owners list, and obviously it's because of the massive amount of > timewait sockets on the owners list. I wrote the following dumb patch > and tested it and the problem has disappeared completely. Now I don't > know if this is right at all, but I thought it was weird we weren't > copying the soreuseport option from the original socket onto the twsk. > Is there are reason we aren't doing this currently? Does this help > explain what is happening? Thanks, The patch is interesting and a good clue, but I am immediately a bit concerned that we don't copy/tag the socket with the uid also to keep the security properties for SO_REUSEPORT. I have to think a bit more about this. We have seen hangs during connect. I am afraid this patch wouldn't help there while also guaranteeing uniqueness. Thanks a lot!