Hi We currently use SIOCOUTQ on AF_UNIX+SOCK_STREAM sockets to figure out whether data that we wrote to a socket is still pending or was dequeued. Preferably, we would like to know whether a specific message we queued was dequeued by the other side, but sadly SIOCOUTQ does not report the actual payload-size, but `skb->truesize`, making it impossible for us to predict how much data the other side has dequeued. Hence, we simply use the value to wait for the queue to empty.
Practically, this means under special circumstances we refrain from writing data to a socket until SIOCOUTQ returns 0. If it does not return 0, we wait to be woken-up by a following EPOLLOUT (which works fine in edge-triggered mode). This worked fine for us, until we spuriously got off-by-one errors, where SIOCOUTQ started returning 1, but did not send a wakeup / EPOLLOUT afterwards despite the counter at some point dropping to 0. The culprit seems to be sock_wfree(): [slightly simplified] void sock_wfree(struct sk_buff *skb) { struct sock *sk = skb->sk; unsigned int len = skb->truesize; if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE)) { [...] refcount_sub_and_test(len - 1, &sk->sk_wmem_alloc); sk->sk_write_space(sk); len = 1; } [...] if (refcount_sub_and_test(len, &sk->sk_wmem_alloc)) __sk_free(sk); } The kernel seems to use the `sk_wmem_alloc` counter for reference-counting. Therefore, if we check SIOCOUTQ exactly between `sk->sk_write_space()` and the following `refcount` update, we will see a counter of 1, but never get woken up afterwards. Some of our users can hit this reliably on newer arm64 machines. My question now is whether this is intended behavior? I couldn't come up with a simple fix, as I assume this overload on sk_wmem_alloc was done for performance reasons. We could acquire an sk_refcnt before sk_write_space and drop it afterwards, but it would probably need further adjustments in sk_free() and friends, and would negatively affect performance. As a workaround we assume a return value lower than 128 means effectively 0, because even a single queued skb would cause the counter to be way bigger than 128, due to the size of `struct sk_buff`. This gives us room to deal with up to 128 temporary references the kernel holds. However, this does feel slightly awkward, so I wondered whether someone has a better idea to fix this? Or whether SIOCOUTQ is just not meant to be used that way? NOTE: Our SIOCOUTQ code is used to properly account for inflight-file-descriptors. The kernel accounts them on the sender's quota, so we use this to reliably track how long an inflight FD is queued on a socket, and thus accounted on us. This way we prevent malicious users from causing us to queue too many inflight file-descriptors and thus exceeding our quota. Thanks David