On Thu, Dec 15, 2016 at 10:53 AM, Josef Bacik <jba...@fb.com> wrote: > On Tue, Dec 13, 2016 at 6:32 PM, Tom Herbert <t...@herbertland.com> wrote: >> >> On Tue, Dec 13, 2016 at 3:03 PM, Craig Gallek <kraigatg...@gmail.com> >> wrote: >>> >>> On Tue, Dec 13, 2016 at 3:51 PM, Tom Herbert <t...@herbertland.com> >>> wrote: >>>> >>>> I think there may be some suspicious code in inet_csk_get_port. At >>>> tb_found there is: >>>> >>>> if (((tb->fastreuse > 0 && reuse) || >>>> (tb->fastreuseport > 0 && >>>> !rcu_access_pointer(sk->sk_reuseport_cb) && >>>> sk->sk_reuseport && uid_eq(tb->fastuid, uid))) && >>>> smallest_size == -1) >>>> goto success; >>>> if (inet_csk(sk)->icsk_af_ops->bind_conflict(sk, tb, >>>> true)) { >>>> if ((reuse || >>>> (tb->fastreuseport > 0 && >>>> sk->sk_reuseport && >>>> !rcu_access_pointer(sk->sk_reuseport_cb) >>>> && >>>> uid_eq(tb->fastuid, uid))) && >>>> smallest_size != -1 && --attempts >= 0) { >>>> spin_unlock_bh(&head->lock); >>>> goto again; >>>> } >>>> goto fail_unlock; >>>> } >>>> >>>> AFAICT there is redundancy in these two conditionals. The same clause >>>> is being checked in both: (tb->fastreuseport > 0 && >>>> !rcu_access_pointer(sk->sk_reuseport_cb) && sk->sk_reuseport && >>>> uid_eq(tb->fastuid, uid))) && smallest_size == -1. If this is true the >>>> first conditional should be hit, goto done, and the second will never >>>> evaluate that part to true-- unless the sk is changed (do we need >>>> READ_ONCE for sk->sk_reuseport_cb?). >>> >>> That's an interesting point... It looks like this function also >>> changed in 4.6 from using a single local_bh_disable() at the beginning >>> with several spin_lock(&head->lock) to exclusively >>> spin_lock_bh(&head->lock) at each locking point. Perhaps the full bh >>> disable variant was preventing the timers in your stack trace from >>> running interleaved with this function before? >> >> >> Could be, although dropping the lock shouldn't be able to affect the >> search state. TBH, I'm a little lost in reading function, the >> SO_REUSEPORT handling is pretty complicated. For instance, >> rcu_access_pointer(sk->sk_reuseport_cb) is checked three times in that >> function and also in every call to inet_csk_bind_conflict. I wonder if >> we can simply this under the assumption that SO_REUSEPORT is only >> allowed if the port number (snum) is explicitly specified. > > > Ok first I have data for you Hannes, here's the time distributions before > during and after the lockup (with all the debugging in place the box > eventually recovers). I've attached it as a text file since it is long. > > Second is I was thinking about why we would spend so much time doing the > ->owners list, and obviously it's because of the massive amount of timewait > sockets on the owners list. I wrote the following dumb patch and tested it > and the problem has disappeared completely. Now I don't know if this is > right at all, but I thought it was weird we weren't copying the soreuseport > option from the original socket onto the twsk. Is there are reason we > aren't doing this currently? Does this help explain what is happening? > Thanks, > I think that would explain it. We would be walking long lists of TW sockets in inet_bind_bucket_for_each(tb, &head->chain). This should break, although now I'm wondering if there's other ways we can get into this situation. reuseport ensures that we can have long lists of sockets in a single bucket, TW sockets can make that list really long.
Tom > Josef