On 3/10/26 23:20, Martin KaFai Lau wrote:
> On 3/6/26 6:09 AM, Michal Luczaj wrote:
>> On 3/6/26 06:01, Jiayuan Chen wrote:
>>>
>>> On 3/6/26 7:30 AM, Michal Luczaj wrote:
>>>> unix_stream_connect() sets sk_state (`WRITE_ONCE(sk->sk_state,
>>>> TCP_ESTABLISHED)`) _before_ it assigns a peer (`unix_peer(sk) = newsk`).
>>>> sk_state == TCP_ESTABLISHED makes sock_map_sk_state_allowed() believe that
>>>> socket is properly set up, which would include having a defined peer. IOW,
>>>> there's a window when unix_stream_bpf_update_proto() can be called on
>>>> socket which still has unix_peer(sk) == NULL.
>>>>
>>>> T0 bpf                              T1 connect
>>>> ------                              ----------
>>>>
>>>>                                     WRITE_ONCE(sk->sk_state, TCP_ESTABLISHED)
>>>> sock_map_sk_state_allowed(sk)
>>>> ...
>>>> sk_pair = unix_peer(sk)
>>>> sock_hold(sk_pair)
>>>>                                     sock_hold(newsk)
>>>>                                     smp_mb__after_atomic()
>>>>                                     unix_peer(sk) = newsk
>>>>
>>>> BUG: kernel NULL pointer dereference, address: 0000000000000080
>>>> RIP: 0010:unix_stream_bpf_update_proto+0xa0/0x1b0
>>>> Call Trace:
>>>> sock_map_link+0x564/0x8b0
>>>> sock_map_update_common+0x6e/0x340
>>>> sock_map_update_elem_sys+0x17d/0x240
>>>> __sys_bpf+0x26db/0x3250
>>>> __x64_sys_bpf+0x21/0x30
>>>> do_syscall_64+0x6b/0x3a0
>>>> entry_SYSCALL_64_after_hwframe+0x76/0x7e
>>>>
>>>> Initial idea was to move peer assignment _before_ the sk_state update[1],
>>>> but that involved an additional memory barrier, and changing the hot path
>>>> was rejected. Then a check during proto update was considered[2], but a
>>>> follow-up discussion[3] concluded the root cause is sockmap taking a wrong
>>>> lock. Or, more specifically, an insufficient lock[4].
>>>>
>>>> Thus, teach sockmap about the af_unix-specific locking: af_unix protects
>>>> critical sections under unix_state_lock() operating on unix_sock::lock.
>>>>
>>>> [1]: https://lore.kernel.org/netdev/[email protected]/
>>>> [2]: https://lore.kernel.org/netdev/[email protected]/
>>>> [3]: https://lore.kernel.org/netdev/[email protected]/
>>>> [4]: https://lore.kernel.org/netdev/CAAVpQUA+8GL_j63CaKb8hbxoL21izD58yr1NvhOhU=j+35+...@mail.gmail.com/
>>>>
>>>> Suggested-by: Kuniyuki Iwashima <[email protected]>
>>>> Suggested-by: Martin KaFai Lau <[email protected]>
>>>> Fixes: c63829182c37 ("af_unix: Implement ->psock_update_sk_prot()")
>>>> Signed-off-by: Michal Luczaj <[email protected]>
>>>> ---
>>>> net/core/sock_map.c | 35 +++++++++++++++++++++++++++++------
>>>> 1 file changed, 29 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/net/core/sock_map.c b/net/core/sock_map.c
>>>> index 7ba6a7f24ccd..6109fbe6f99c 100644
>>>> --- a/net/core/sock_map.c
>>>> +++ b/net/core/sock_map.c
>>>> @@ -12,6 +12,7 @@
>>>> #include <linux/list.h>
>>>> #include <linux/jhash.h>
>>>> #include <linux/sock_diag.h>
>>>> +#include <net/af_unix.h>
>>>> #include <net/udp.h>
>>>>
>>>> struct bpf_stab {
>>>> @@ -115,19 +116,43 @@ int sock_map_prog_detach(const union bpf_attr *attr,
>>>> enum bpf_prog_type ptype)
>>>> }
>>>>
>>>> static void sock_map_sk_acquire(struct sock *sk)
>>>> - __acquires(&sk->sk_lock.slock)
>>>> {
>>>> lock_sock(sk);
>>>> +
>>>> + if (sk_is_unix(sk))
>>>> + unix_state_lock(sk);
>>>> +
>>>> rcu_read_lock();
>>>> }
>>>>
>>>
>>> This introduces a new ordering constraint: lock_sock() before
>>> unix_state_lock(). Kuniyuki flagged in the v2 review that taking
>>> lock_sock() inside unix_state_lock() in the future would create an
>>> ABBA deadlock, which is exactly why the order was settled this way. However,
>>> the thread did not reach a conclusion on whether that constraint should be
>>> documented in the code.
>>>
>>> Since there is nothing enforcing this ordering mechanically, a brief comment
>>> at sock_map_sk_acquire() would help future readers avoid introducing the
>>> inverse ordering.
>>
>> Sure, will do.
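>>
>> Something like this, perhaps (just a sketch, exact wording to be
>> refined):
>>
>>     /* Lock ordering: lock_sock(sk), then unix_state_lock(sk).
>>      * Nesting them the other way around elsewhere would set up an
>>      * ABBA deadlock against this path (and lock_sock() may sleep,
>>      * so it could not be taken under the spinlock anyway).
>>      */
>>     static void sock_map_sk_acquire(struct sock *sk)
>>     {
>>         lock_sock(sk);
>>
>>         if (sk_is_unix(sk))
>>             unix_state_lock(sk);
>>
>>         rcu_read_lock();
>>     }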
>>
>>>> static void sock_map_sk_release(struct sock *sk)
>>>> - __releases(&sk->sk_lock.slock)
>>>> {
>>>> rcu_read_unlock();
>>>> +
>>>> + if (sk_is_unix(sk))
>>>> + unix_state_unlock(sk);
>>>> +
>>>> release_sock(sk);
>>>> }
>>>>
>>>> +static inline void sock_map_sk_acquire_fast(struct sock *sk)
>>>> +{
>>>> + local_bh_disable();
>>>> + bh_lock_sock(sk);
>>>> +
>>>> + if (sk_is_unix(sk))
>>>> + unix_state_lock(sk);
>>>> +}
>>>> +
>>>
>>>
>>> v2 and v3 differ here in a way that deserves a closer look. In v2, AF_UNIX
>>> sockets took only unix_state_lock() in the fast path, skipping
>>> bh_lock_sock() entirely:
>>>
>>> /* v2 */
>>> if (sk_is_unix(sk))
>>> unix_state_lock(sk);
>>> else
>>> bh_lock_sock(sk);
>>>
>>> v3 takes both for AF_UNIX sockets.
>>>
>>> bh_lock_sock() protects sock::sk_lock.slock, whereas the state that
>>> actually needs protection here — sk_state and unix_peer() — lives under
>>> unix_sock::lock. Since unix_state_lock() is already sufficient to close
>>> the race against unix_stream_connect(), is bh_lock_sock() still doing
>>> anything useful for AF_UNIX sockets in this path?
>>
>> Yeah, good point. I think I was just trying to keep the _fast variant
>> aligned with the sleepy one, which I agree might be unnecessary.
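>>
>> If we drop the alignment, the _fast pair could go back to v2's
>> either/or (untested sketch; the _release_fast side is mirrored by
>> analogy, the name is assumed):
>>
>>     static void sock_map_sk_acquire_fast(struct sock *sk)
>>     {
>>         local_bh_disable();
>>
>>         /* sk_state and unix_peer() are protected by unix_sock::lock,
>>          * so for af_unix unix_state_lock() alone closes the race
>>          * against unix_stream_connect(); bh_lock_sock() adds nothing.
>>          */
>>         if (sk_is_unix(sk))
>>             unix_state_lock(sk);
>>         else
>>             bh_lock_sock(sk);
>>     }
>>
>>     static void sock_map_sk_release_fast(struct sock *sk)
>>     {
>>         if (sk_is_unix(sk))
>>             unix_state_unlock(sk);
>>         else
>>             bh_unlock_sock(sk);
>>
>>         local_bh_enable();
>>     }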
>
> I hope the common use case should not be calling
> bpf_map_update_elem(sockmap) only.
>
> Besides, from looking at may_update_sockmap(), I don't know if it is
> even doable (or useful) to bpf_map_update_elem(unix_sk) in
> tc/flow_dissector/xdp. One possible path is SOCK_FILTER, when looking
> at unix_dgram_sendmsg() => sk_filter(). It was not the original use case
> when bpf_map_update_elem(sockmap) support was added, iirc.
What about a situation where a unix_sk is stored in a sockmap, and then a
tc prog looks it up and invokes bpf_map_update_elem(unix_sk)? I'm not sure
it's useful, but it seems doable.
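Roughly this kind of thing (hypothetical and untested; whether the
verifier accepts the looked-up socket as the update value is exactly the
open question):

    struct {
        __uint(type, BPF_MAP_TYPE_SOCKMAP);
        __uint(max_entries, 2);
        __type(key, __u32);
        __type(value, __u64);
    } sock_map SEC(".maps");

    SEC("tc")
    int move_unix_sk(struct __sk_buff *skb)
    {
        __u32 src = 0, dst = 1;
        struct bpf_sock *sk;

        /* Re-insert an already-stored socket under a different key. */
        sk = bpf_map_lookup_elem(&sock_map, &src);
        if (sk) {
            bpf_map_update_elem(&sock_map, &dst, sk, BPF_ANY);
            bpf_sk_release(sk);
        }
        return TC_ACT_OK;
    }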
> The only path left is bpf_iter, which I believe was the primary use case
> when adding bpf_map_update_elem(sockmap) support [1]. It would be nice
> to avoid bh_lock_sock() when calling from all bpf_iter (tcp/udp/unix)
> where lock_sock() has already been done. It is more for
> reading-correctness though. This just came to my mind.
> has_current_bpf_ctx() can be used to check this. sockopt_lock_sock() has
> been using it to conditionally take lock_sock() or not.
[ One clarification: bh_lock_sock() is a sock_map_update_elem() thing,
which can only be called by a bpf prog. IOW, has_current_bpf_ctx() is
always `true` in sock_map_update_elem(), right? ]
Let me know if I'm rephrasing your idea correctly: assume all bpf-context
callers hold the socket lock, or otherwise keep the socket "stable"
(meaning: "sk won't surprise the sockmap update with some breaking state
change coming from another thread"). As you said, most bpf iters already
take lock_sock(), and I have a patch that fixes sock_{map,hash}_seq_show().
Then we could try dropping that bh_lock_sock().
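
IOW, something in the spirit of sockopt_lock_sock() (sketch only):

    static void sock_map_sk_acquire_fast(struct sock *sk)
    {
        /* In bpf context the caller is assumed to hold lock_sock()
         * (or to otherwise keep sk stable), so skip bh_lock_sock();
         * same pattern as sockopt_lock_sock().
         */
        if (!has_current_bpf_ctx()) {
            local_bh_disable();
            bh_lock_sock(sk);
        }

        if (sk_is_unix(sk))
            unix_state_lock(sk);
    }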
> [ I would still keep patch 3 though. ]
Right.
> [1]: https://lore.kernel.org/bpf/[email protected]/
>
>>
>> In a parallel thread I've asked Kuniyuki if it might be better to
>> completely drop patch 2/5, which would change how we interact with
>> sock_map_close(). Let's see how it goes.
>>
>
> If patch 2 is dropped, lock_sock() is always needed for unix_sk?
For sock_map_update_elem_sys() I wanted lock_sock()+unix_state_lock(),
following Kuniyuki's suggestion to keep the locking pattern/order
consistent (the same pattern repeats when a unix bpf iter prog invokes
bpf_map_update_elem() -> sock_map_update_elem()). For
sock_map_update_elem(), no: we can't sleep there.
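
To spell out the locking I had in mind for the two update paths (sketch
only; unlock order reversed accordingly):

    /* sock_map_update_elem_sys(): syscall context, may sleep. */
    lock_sock(sk);
    if (sk_is_unix(sk))
        unix_state_lock(sk);
    /* ... update ... */

    /* sock_map_update_elem(): bpf prog context, must not sleep, so
     * no lock_sock(); spinlocks only.
     */
    local_bh_disable();
    bh_lock_sock(sk);
    if (sk_is_unix(sk))
        unix_state_lock(sk);
    /* ... update ... */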