On 3/10/26 23:20, Martin KaFai Lau wrote:
> On 3/6/26 6:09 AM, Michal Luczaj wrote:
>> On 3/6/26 06:01, Jiayuan Chen wrote:
>>>
>>> On 3/6/26 7:30 AM, Michal Luczaj wrote:
>>>> unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
>>>> TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
>>>> sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
>>>> socket is properly set up, which would include having a defined peer. IOW,
>>>> there's a window when unix_stream_bpf_update_proto() can be called on
>>>> socket which still has unix_peer(sk) == NULL.
>>>>
>>>> T0 bpf                              T1 connect
>>>> ------                              ----------
>>>>
>>>>                                     WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
>>>> sock_map_sk_state_allowed(sk)
>>>> ...
>>>> sk_pair = unix_peer(sk)
>>>> sock_hold(sk_pair)
>>>>                                     sock_hold(newsk)
>>>>                                     smp_mb__after_atomic()
>>>>                                     unix_peer(sk) = newsk
>>>>
>>>> BUG: kernel NULL pointer dereference, address: 0000000000000080
>>>> RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
>>>> Call Trace:
>>>> sock_map_link+0x564/0x8b0
>>>> sock_map_update_common+0x6e/0x340
>>>> sock_map_update_elem_sys+0x17d/0x240
>>>> __sys_bpf+0x26db/0x3250
>>>> __x64_sys_bpf+0x21/0x30
>>>> do_syscall_64+0x6b/0x3a0
>>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>>
>>>> Initial idea was to move peer assignment _before_ the sk_state update[1],
>>>> but that involved an additional memory barrier, and changing the hot path
>>>> was rejected. Then a check during proto update was considered[2], but a
>>>> follow-up discussion[3] concluded the root cause is sockmap taking a wrong
>>>> lock. Or, more specifically, an insufficient lock[4].
>>>>
>>>> Thus, teach sockmap about the af_unix-specific locking: af_unix protects
>>>> critical sections under unix_state_lock() operating on unix_sock::lock.
>>>>
>>>> [1]: https://lore.kernel.org/netdev/[email protected]/
>>>> [2]: https://lore.kernel.org/netdev/[email protected]/
>>>> [3]: https://lore.kernel.org/netdev/[email protected]/
>>>> [4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+...@mail.gmail.com/
>>>>
>>>> Suggested-by: Kuniyuki Iwashima <[email protected]>
>>>> Suggested-by: Martin KaFai Lau <[email protected]>
>>>> Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
>>>> Signed-off-by: Michal Luczaj <[email protected]>
>>>> ---
>>>> net/core/sock_map.c | 35 +++++++++++++++++++++++++++++------
>>>> 1 file changed, 29 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>>>> index 7ba6a7f24ccd..6109fbe6f99c 100644
>>>> --- a/net/core/sock_map.c
>>>> +++ b/net/core/sock_map.c
>>>> @@ -12,6 +12,7 @@
>>>> #include <linux/list.h>
>>>> #include <linux/jhash.h>
>>>> #include <linux/sock_diag.h>
>>>> +#include <net/af_unix.h>
>>>> #include <net/udp.h>
>>>>
>>>> struct bpf_stab {
>>>> @@ -115,19 +116,43 @@ int sock_map_prog_detach(const union bpf_attr *attr,
>>>> enum bpf_prog_type ptype)
>>>> }
>>>>
>>>> static void sock_map_sk_acquire(struct sock *sk)
>>>> - __acquires(&sk->sk_lock.slock)
>>>> {
>>>> lock_sock(sk);
>>>> +
>>>> + if (sk_is_unix(sk))
>>>> + unix_state_lock(sk);
>>>> +
>>>> rcu_read_lock();
>>>> }
>>>>
>>>
>>> This introduces a new ordering constraint: lock_sock() before
>>> unix_state_lock(). Kuniyuki flagged in the v2 review that taking
>>> lock_sock() inside unix_state_lock() in the future would create an
>>> ABBA deadlock, which is exactly why the order was settled this way. However,
>>> the thread did not reach a conclusion on whether that constraint should be
>>> documented in the code.
>>>
>>> Since there is nothing enforcing this ordering mechanically, a brief comment
>>> at sock_map_sk_acquire() would help future readers avoid introducing the
>>> inverse ordering.
>>
>> Sure, will do.
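>>
>> Something like this, perhaps (just a sketch, exact wording to be
>> refined):
>>
>>     /* Lock ordering: lock_sock(sk), then unix_state_lock(sk).
>>      * Nesting them the other way around elsewhere would set up an
>>      * ABBA deadlock against this path (and lock_sock() may sleep,
>>      * so it could not be taken under the spinlock anyway).
>>      */
>>     static void sock_map_sk_acquire(struct sock *sk)
>>     {
>>         lock_sock(sk);
>>
>>         if (sk_is_unix(sk))
>>             unix_state_lock(sk);
>>
>>         rcu_read_lock();
>>     }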
>>
>>>> static void sock_map_sk_release(struct sock *sk)
>>>> - __releases(&sk->sk_lock.slock)
>>>> {
>>>> rcu_read_unlock();
>>>> +
>>>> + if (sk_is_unix(sk))
>>>> + unix_state_unlock(sk);
>>>> +
>>>> release_sock(sk);
>>>> }
>>>>
>>>> +static inline void sock_map_sk_acquire_fast(struct sock *sk)
>>>> +{
>>>> + local_bh_disable();
>>>> + bh_lock_sock(sk);
>>>> +
>>>> + if (sk_is_unix(sk))
>>>> + unix_state_lock(sk);
>>>> +}
>>>> +
>>>
>>>
>>> v2 and v3 differ here in a way that deserves a closer look. In v2, AF_UNIX
>>> sockets took only unix_state_lock() in the fast path, skipping
>>> bh_lock_sock() entirely:
>>>
>>> /* v2 */
>>> if (sk_is_unix(sk))
>>> unix_state_lock(sk);
>>> else
>>> bh_lock_sock(sk);
>>>
>>> v3 takes both for AF_UNIX sockets.
>>>
>>> bh_lock_sock() protects sock::sk_lock.slock, whereas the state that
>>> actually needs protection here — sk_state and unix_peer() — lives under
>>> unix_sock::lock. Since unix_state_lock() is already sufficient to close
>>> the race against unix_stream_connect(), is bh_lock_sock() still doing
>>> anything useful for AF_UNIX sockets in this path?
>>
>> Yeah, good point. I think I was just trying to keep the _fast variant
>> aligned with the sleepy one, which I agree might be unnecessary.
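>>
>> If we drop the alignment, the _fast pair could go back to v2's
>> either/or (untested sketch; the _release_fast side is mirrored by
>> analogy, the name is assumed):
>>
>>     static void sock_map_sk_acquire_fast(struct sock *sk)
>>     {
>>         local_bh_disable();
>>
>>         /* sk_state and unix_peer() are protected by unix_sock::lock,
>>          * so for af_unix unix_state_lock() alone closes the race
>>          * against unix_stream_connect(); bh_lock_sock() adds nothing.
>>          */
>>         if (sk_is_unix(sk))
>>             unix_state_lock(sk);
>>         else
>>             bh_lock_sock(sk);
>>     }
>>
>>     static void sock_map_sk_release_fast(struct sock *sk)
>>     {
>>         if (sk_is_unix(sk))
>>             unix_state_unlock(sk);
>>         else
>>             bh_unlock_sock(sk);
>>
>>         local_bh_enable();
>>     }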
>
> I hope the common use case should not be calling
> bpf_map_update_elem(sockmap) only.
>
> Besides, from looking at may_update_sockmap(), I don't know if it is
> even doable (or useful) to bpf_map_update_elem(unix_sk) in
> tc/flow_dissector/xdp. One possible path is SOCK_FILTER, when looking
> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
> when bpf_map_update_elem(sockmap) support was added, iirc.
What about a situation where a unix_sk is stored in a sockmap, and then a
tc prog looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure
it's useful, but it seems doable.
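Roughly this kind of thing (hypothetical and untested; whether the
verifier accepts the looked-up socket as the update value is exactly the
open question):

    struct {
        __uint(type, BPF_MAP_TYPE_SOCKMAP);
        __uint(max_entries, 2);
        __type(key, __u32);
        __type(value, __u64);
    } sock_map SEC(".maps");

    SEC("tc")
    int move_unix_sk(struct __sk_buff *skb)
    {
        __u32 src = 0, dst = 1;
        struct bpf_sock *sk;

        /* Re-insert an already-stored socket under a different key. */
        sk = bpf_map_lookup_elem(&sock_map, &src);
        if (sk) {
            bpf_map_update_elem(&sock_map, &dst, sk, BPF_ANY);
            bpf_sk_release(sk);
        }
        return TC_ACT_OK;
    }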
> The only path left is bpf_iter, which I believe was the primary use case
> when adding bpf_map_update_elem(sockmap) support [1]. It would be nice
> to avoid bh_lock_sock() when calling from all bpf_iter (tcp/udp/unix)
> where lock_sock() has already been done. It is more for
> reading-correctness though. This just came to my mind.
> has_current_bpf_ctx() can be used to check this. sockopt_lock_sock() has
> been using it to conditionally take lock_sock() or not.
[ One clarification: bh_lock_sock() is a sock_map_update_elem() thing,
which can only be called by a bpf prog. IOW, has_current_bpf_ctx() is
always `true` in sock_map_update_elem(), right? ]
Let me know if I'm rephrasing your idea correctly: assume all bpf-context
callers hold the socket lock, or otherwise keep the socket "stable"
(meaning: "sk won't surprise the sockmap update with some breaking state
change coming from another thread"). As you said, most bpf iters already
take lock_sock(), and I have a patch that fixes sock_{map,hash}_seq_show().
Then we could try dropping that bh_lock_sock().
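
IOW, something in the spirit of sockopt_lock_sock() (sketch only):

    static void sock_map_sk_acquire_fast(struct sock *sk)
    {
        /* In bpf context the caller is assumed to hold lock_sock()
         * (or to otherwise keep sk stable), so skip bh_lock_sock();
         * same pattern as sockopt_lock_sock().
         */
        if (!has_current_bpf_ctx()) {
            local_bh_disable();
            bh_lock_sock(sk);
        }

        if (sk_is_unix(sk))
            unix_state_lock(sk);
    }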
> [ I would still keep patch 3 though. ]
Right.
> [1]: https://lore.kernel.org/bpf/[email protected]/
>
>>
>> In a parallel thread I've asked Kuniyuki if it might be better to
>> completely drop patch 2/5, which would change how we interact with
>> sock_map_close(). Let's see how it goes.
>>
>
> If patch 2 is dropped, lock_sock() is always needed for unix_sk?
For sock_map_update_elem_sys() I wanted lock_sock()+unix_state_lock(),
following Kuniyuki's suggestion to keep the locking pattern/order
consistent (the same pattern repeats when a unix bpf iter prog invokes
bpf_map_update_elem() -> sock_map_update_elem()). For
sock_map_update_elem(), no: we can't sleep there.
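
To spell out the locking I had in mind for the two update paths (sketch
only; unlock order reversed accordingly):

    /* sock_map_update_elem_sys(): syscall context, may sleep. */
    lock_sock(sk);
    if (sk_is_unix(sk))
        unix_state_lock(sk);
    /* ... update ... */

    /* sock_map_update_elem(): bpf prog context, must not sleep, so
     * no lock_sock(); spinlocks only.
     */
    local_bh_disable();
    bh_lock_sock(sk);
    if (sk_is_unix(sk))
        unix_state_lock(sk);
    /* ... update ... */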