On Mon, Feb 04, 2019 at 04:50:32PM -0800, Martin Lau wrote:
> On Mon, Feb 04, 2019 at 11:33:28PM +0100, Daniel Borkmann wrote:
> > Hi Martin,
> >
> > On 02/01/2019 08:03 AM, Martin KaFai Lau wrote:
> > > In kernel, it is common to check "skb->sk && sk_fullsock(skb->sk)"
> > > before accessing the fields in sock. For example, in __netdev_pick_tx:
> > >
> > > static u16 __netdev_pick_tx(struct net_device *dev, struct sk_buff *skb,
> > >                             struct net_device *sb_dev)
> > > {
> > >         /* ... */
> > >
> > >         struct sock *sk = skb->sk;
> > >
> > >         if (queue_index != new_index && sk &&
> > >             sk_fullsock(sk) &&
> > >             rcu_access_pointer(sk->sk_dst_cache))
> > >                 sk_tx_queue_set(sk, new_index);
> > >
> > >         /* ... */
> > >
> > >         return queue_index;
> > > }
> > >
> > > This patch adds a "struct bpf_sock *sk" pointer to the "struct __sk_buff"
> > > where a few of the convert_ctx_access() in filter.c has already been
> > > accessing the skb->sk sock_common's fields,
> > > e.g. sock_ops_convert_ctx_access().
> > >
> > > "__sk_buff->sk" is a PTR_TO_SOCK_COMMON_OR_NULL in the verifier.
> > > Some of the fields in "bpf_sock" will not be directly
> > > accessible through the "__sk_buff->sk" pointer. It is limited
> > > by the new "bpf_sock_common_is_valid_access()".
> > > e.g. The existing "type", "protocol", "mark" and "priority" in bpf_sock
> > > are not allowed.
> > >
> > > The newly added "struct bpf_sock *bpf_sk_fullsock(struct bpf_sock *sk)"
> > > can be used to get a sk with all accessible fields in "bpf_sock".
> > > This helper is added to both cg_skb and sched_(cls|act).
> > >
> > > int cg_skb_foo(struct __sk_buff *skb) {
> > >         struct bpf_sock *sk;
> > >         __u32 family;
> > >
> > >         sk = skb->sk;
> > >         if (!sk)
> > >                 return 1;
> > >
> > >         sk = bpf_sk_fullsock(sk);
> > >         if (!sk)
> > >                 return 1;
> > >
> > >         if (sk->family != AF_INET6 || sk->protocol != IPPROTO_TCP)
> > >                 return 1;
> > >
> > >         /* some_traffic_shaping(); */
> > >
> > >         return 1;
> > > }
> > >
> > > (1) The sk is read only
> > >
> > > (2) There is no new "struct bpf_sock_common" introduced.
> > >
> > > (3) Future kernel sock's members could be added to bpf_sock only
> > >     instead of repeatedly adding at multiple places like currently
> > >     in bpf_sock_ops_md, bpf_sock_addr_md, sk_reuseport_md...etc.
> > >
> > > (4) After "sk = skb->sk", the reg holding sk is in type
> > >     PTR_TO_SOCK_COMMON_OR_NULL.
> > >
> > > (5) After bpf_sk_fullsock(), the return type will be in type
> > >     PTR_TO_SOCKET_OR_NULL which is the same as the return type of
> > >     bpf_sk_lookup_xxx().
> > >
> > >     However, bpf_sk_fullsock() does not take refcnt. The
> > >     acquire_reference_state() is only depending on the return type now.
> > >     To avoid it, a new is_acquire_function() is checked before calling
> > >     acquire_reference_state().
> >
> > Bit unfortunate that a helper like bpf_sk_fullsock() would be needed, after
> > all this is more of an implementation detail which we would expose here to
> > the developer.
> >
> > Is there a specific reason why fetching skb->sk couldn't already be of the
> > type PTR_TO_SOCKET_OR_NULL such that the bpf_sk_fullsock() step wouldn't be
> > needed and most logic we have today could already be reused (modulo refcnt
> > avoidance)?
> Not all running context has a fullsock (PTR_TO_SOCKET_OR_NULL).
>
> Based on how sk_to_full_sk() is used (e.g. in bpf_get_socket_uid()),
> not sure a sk (e.g. tw sock) can always be traced back to a full sk.
>
> In terms of the patch implementation, it is not much difference. It is a bit
> simpler without bpf_sk_fullsock() and PTR_TO_SOCK_COMMON(_OR_NULL) but
> not a lot. I have tried both.
>
> The "fullsock" has already been exposed in another form.
> e.g. In sock_ops, the tcp_sock fields is not read if it is not a fullsock
> while other sock_common fields will still be available. The bpf_prog
> can test the sock_ops->is_fullsock for what to do.
>
> >
> > In particular, do you need the skb->sk without the full-sk part somewhere
> > (e.g. in tw socks)? Why not doing something like sk_to_full_sk() inside the
> > helper or even better as BPF ctx rewrite upon skb->sk to fetch the full sk
> > parent where you could also access remaining bpf_sock fields?
> I am thinking more on what if the bpf_prog only needs the fields from
> sock_common (e.g. the src/dst ip/port) and skb already has
> other needed info (e.g. protocol/mark/priority).
> Enforcing skb->sk must be a fullsock will unnecessarily limit those
> bpf_prog from seeing all skb.
>
> A "struct bpf_common_sock" could be added instead vs a bpf_sk_fullsock()
> tester. I think having one "struct bpf_sock" is better and less confusing.
>
> Later, for the running context that is sure to have a fullsock,
> skb->sk can directly have PTR_TO_SOCKET_OR_NULL instead of
> PTR_TO_SOCK_COMMON_OR_NULL.
>

Following up the discussion in the iovisor conf call.
One of the discussion points was: other than tw, can __sk_buff->sk
always return a fullsock (PTR_TO_SOCKET_OR_NULL)?

In the request_sock case, it is doable because it can be traced back to
the listener sock. However, that brings back the question of how to
access sock_common. In particular, how would the bpf_prog access the
sock_common fields of the request_sock itself? Those fields in the
request_sock are different from the ones in its listener sock,
e.g. the skc_daddr and skc_dport.

Also, if the sock_common fields of a tw sock are needed, it becomes
awkward: a new "struct bpf_tw_sock" would likely be needed, which is OK,
but all sock_common fields would have to be copied from bpf_sock to
bpf_tw_sock.

I think reading a sk from a ctx should return the most basic type,
PTR_TO_SOCK_COMMON_OR_NULL (unless the running ctx can guarantee
that it always has a fullsock). Currently, that is __sk_buff->sk.
Later, sock_ops->sk...etc.

Have one single 'struct bpf_sock' and limit fullsock field access
at verification time. The bpf_prog then moves down the chain based on
what it needs. It could be fullsock, tcp_sock...etc. I think that will
be the most flexible way to write a bpf_prog while also avoiding
duplicate fields in different bpf structs in uapi.
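
To make the "move down the chain" idea a bit more concrete, here is a
rough userspace model of it. It is only a sketch and not kernel or uapi
code: "struct model_sock", "enum sk_kind", model_sk_fullsock(),
match_dport() and match_protocol() are all made-up names for
illustration. In the real thing the restriction would be enforced by the
verifier at load time, not by a runtime "kind" check as done here.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model only: none of these names exist in the kernel. */
enum sk_kind { SK_FULL, SK_REQUEST, SK_TIMEWAIT };

struct model_sock {
	/* sock_common part: meaningful for every kind of sock */
	enum sk_kind kind;
	uint32_t family;
	uint32_t daddr;
	uint16_t dport;
	/*
	 * fullsock-only part. A real request/tw sock does not carry
	 * these; the model keeps one struct to mirror the "one single
	 * struct bpf_sock" idea and guards access in code instead.
	 */
	uint32_t protocol;
	uint32_t mark;
};

/* Stand-in for bpf_sk_fullsock(): NULL unless sk is a full socket. */
static const struct model_sock *
model_sk_fullsock(const struct model_sock *sk)
{
	if (!sk || sk->kind != SK_FULL)
		return NULL;
	return sk;
}

/* Needs only sock_common fields, so it works for full, request and tw. */
static int match_dport(const struct model_sock *sk, uint16_t dport)
{
	return sk && sk->dport == dport;
}

/* Needs a fullsock-only field, so it has to narrow the pointer first. */
static int match_protocol(const struct model_sock *sk, uint32_t protocol)
{
	const struct model_sock *full = model_sk_fullsock(sk);

	return full && full->protocol == protocol;
}
```

With this shape, a prog that only cares about the dst port still sees tw
and request socks, while a prog that needs "protocol" pays for the extra
NULL check only when it actually asks for the fullsock.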