On 03/14/2018 12:09 AM, Alexei Starovoitov wrote:
> On 3/13/18 3:47 PM, Eric Dumazet wrote:
>>
>> On 03/13/2018 03:37 PM, Yonghong Song wrote:
>>> Adding additional cc's:
>>> Saeed Mahameed, as this is most likely mlx5 driver related.
>>> Diptanu Gon Choudhury, who initially reported the issue.
>>>
>>> On 3/13/18 1:44 AM, Steffen Klassert wrote:
>>>> On Mon, Mar 12, 2018 at 11:25:09PM -0700, Eric Dumazet wrote:
>>>>>
>>>>> On 03/12/2018 11:08 PM, Yonghong Song wrote:
>>>>>>
>>>>>> On 3/12/18 11:04 PM, Eric Dumazet wrote:
>>>>>>>
>>>>>>> On 03/12/2018 10:45 PM, Yonghong Song wrote:
>>>>>>>> ...
>>>>>>>> Setup:
>>>>>>>> =====
>>>>>>>>
>>>>>>>> The test involves three machines:
>>>>>>>>     M_ipv6 <-> M_nat <-> M_ipv4
>>>>>>>>
>>>>>>>> M_nat does ipv4<->ipv6 address translation and then forwards
>>>>>>>> the packet to the proper destination. The control plane
>>>>>>>> configures M_nat so that it understands the virtual ipv4
>>>>>>>> address for machine M_ipv6 and the virtual ipv6 address for
>>>>>>>> machine M_ipv4.
>>>>>>>>
>>>>>>>> M_nat runs a bpf program attached to the clsact (ingress)
>>>>>>>> qdisc. The program uses bpf_skb_change_proto to do the
>>>>>>>> protocol conversion; bpf_skb_change_proto adjusts skb
>>>>>>>> header_len and len properly based on the protocol change.
>>>>>>>> After the conversion, the program makes the necessary changes
>>>>>>>> to the ethhdr and ipv4/ipv6 headers, recalculates the
>>>>>>>> checksum, and sends the packet out through bpf_redirect.
>>>>>>>>
>>>>>>>> Experiment:
>>>>>>>> ===========
>>>>>>>>
>>>>>>>> MTU: 1500B for all three machines.
>>>>>>>>
>>>>>>>> tso/lro/gro are enabled on the M_nat box.
>>>>>>>>
>>>>>>>> ping works both ways between M_ipv6 and M_ipv4.
>>>>>>>> Transferring a small file (4KB) between M_ipv6 and M_ipv4
>>>>>>>> works in both directions.
>>>>>>>> Transferring a large file (e.g., 4MB) from M_ipv6 to M_ipv4
>>>>>>>> fails with the above BUG_ON very quickly.
>>>>>>>> We did not really test from M_ipv4 to M_ipv6 with a large
>>>>>>>> file.
>>>>>>>>
>>>>>>>> The error path is likely (also from the above call stack):
>>>>>>>>     nic -> lro/gro -> bpf_program -> gso (BUG_ON)
>>>>
>>>> Just out of curiosity, are these packets created with LRO or GRO?
>>>> Usually LRO is disabled if forwarding is enabled on a machine,
>>>> because segmented LRO packets are likely corrupt.
>>>
>>> In our experiments, LRO is disabled.
>>> On mlx5, when GRO is on, the BUG_ON happens, and
>>> when GRO is off, the BUG_ON does not happen.
>>>
>>>> These packets take an alternative redirect path, so I am not sure
>>>> what happens here.
>>>>
>>>>>>>> In one of the experiments, I explicitly printed skb->len and
>>>>>>>> skb->data_len. The values are below:
>>>>>>>>     skb_segment: len 2856, data_len 2686
>>>>>>>> They should be equal to avoid the BUG.
>>>>>>>>
>>>>>>>> In another experiment, I got:
>>>>>>>>     skb_segment: len 1428, data_len 1258
>>>>>>>>
>>>>>>>> In both cases, the difference is 170 bytes. I am not sure
>>>>>>>> whether this is just a coincidence or not.
>>>>>>>>
>>>>>>>> Workaround:
>>>>>>>> ===========
>>>>>>>>
>>>>>>>> A workaround to avoid the BUG_ON is to disable lro/gro. This
>>>>>>>> way, the kernel does not receive big packets and hence gso is
>>>>>>>> not really called.
>>>>>>>>
>>>>>>>> I am not familiar with the gso code. Has anybody hit this
>>>>>>>> BUG_ON before?
>>>>>>>> Any suggestion on how to debug this?
>>>>>>>
>>>>>>> skb_segment() works if the incoming GRO packet is not modified
>>>>>>> in its geometry.
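To connect this to the numbers above: whatever skb those counters were
printed for, len - data_len is by definition skb_headlen(), so both
prints show an skb carrying 170 bytes of linear data; and a frag_list
skb with nonzero headlen is exactly what the BUG_ON in skb_segment()'s
frag walk rejects. Roughly the two pieces involved, paraphrased from
include/linux/skbuff.h and net/core/skbuff.c around v4.15, so the exact
context may differ by version:

/* include/linux/skbuff.h: linear part = total len minus paged len */
static inline unsigned int skb_headlen(const struct sk_buff *skb)
{
        return skb->len - skb->data_len;
}

/* net/core/skbuff.c, inside skb_segment(): when the requested segment
 * geometry no longer matches the incoming GRO geometry, the frag walk
 * crosses frag_list boundaries and cannot cope with a frag_list skb
 * that carries linear data:
 */
while (pos < offset + len) {
        if (i >= nfrags) {
                BUG_ON(skb_headlen(list_skb));  /* len != data_len */

                i = 0;
                nfrags = skb_shinfo(list_skb)->nr_frags;
                frag = skb_shinfo(list_skb)->frags;
                ...
        }
        ...
}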
>>>>>>>
>>>>>>> In your case it seems you had to adjust gso_size (calling
>>>>>>> skb_decrease_gso_size() or skb_increase_gso_size()), and this
>>>>>>> breaks skb_segment() badly, because the geometry changes,
>>>>>>> unless you have specific MTU/MSS restrictions.
>>>>>>>
>>>>>>> You will have to make skb_segment() more generic if you really
>>>>>>> want this.
>>>>>>
>>>>>> In net/core/filter.c, the function bpf_skb_change_proto, which
>>>>>> is called by the bpf program, does some GSO adjustment. Could
>>>>>> you help check whether it satisfies my above use case or not?
>>>>>> Thanks!
>>>>>
>>>>> As I said, this helper ends up modifying gso_size by +/- 20
>>>>> (sizeof(ipv6 header) - sizeof(ipv4 header)).
>>>>>
>>>>> So it won't work if skb_segment() is called after this change.
>>>>
>>>> Even HW TSO uses gso_size to segment the packets. Wouldn't this
>>>> result in broken packets too, if gso_size is modified on a
>>>> forwarding path?
>>>>
>>>>> Not clear why the GRO packet is not sent as is (as a TSO packet),
>>>>> since mlx4/mlx5 NICs certainly support TSO.
>>>
>>> This is a very good observation.
>>> We did the same experiment on mlx4: the same kernel, the same
>>> userspace apps, the same bpf program. The only difference is mlx4
>>> vs. mlx5.
>>> mlx4 works fine with LRO=off and GRO=on, while
>>> mlx5 fails with the same LRO/GRO configuration.
>>>
>>> On the mlx4 box, we can see TSO packets increasing as the large
>>> file scp is in progress:
>>>
>>> # ethtool -S eth0 | grep tso
>>> tso_packets: 45495
>>> # ethtool -S eth0 | grep tso
>>> tso_packets: 45865
>>> # ethtool -S eth0 | grep tso
>>> tso_packets: 46337
>>> # ethtool -S eth0 | grep tso
>>> tso_packets: 46724
>>>
>>> And using the bcc funccount tool to track the call count for
>>> skb_segment, we find it is called 0 times. Clearly, mlx4 is able
>>> to take the packet as TSO, and hence the packet never hits
>>> software segmentation.
>>>
>>> # funccount.py -i 3 'skb_segment'
>>> Tracing 1 functions for "skb_segment"... Hit Ctrl-C to end.
>>>
>>> FUNC    COUNT
>>>
>>> FUNC    COUNT
>>> ...
>>>
>>> CC Saeed Mahameed, who may help debug and provide some insights
>>> into what the problem is.
>>
>> There are many reasons why a GRO packet might need to be segmented
>> by software (skb_segment()).
>>
>> This is the step where you see the crashes, so the mlx4/mlx5
>> difference really does not matter.
>>
>> gso_size should not be dynamically changed. This needs to be fixed
>> in the eBPF helpers eventually.
>
> We have bpf_skb_proto_6_to_4(), which has been used by cilium for a
> long time. It's not clear why it's not crashing there, but we cannot
> just reject changing proto in bpf programs now.
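To make Eric's +/- 20 concrete, the GSO-relevant part of the helper
boils down to something like the following (paraphrasing
bpf_skb_proto_6_to_4() from net/core/filter.c of roughly this vintage;
the exact code may have drifted, and the 4-to-6 direction shrinks
gso_size instead):

static int bpf_skb_proto_6_to_4(struct sk_buff *skb)
{
        const u32 len_diff = sizeof(struct ipv6hdr) - sizeof(struct iphdr);
        ...
        if (skb_is_gso(skb)) {
                /* SKB_GSO_TCPV6 needs to become SKB_GSO_TCPV4. */
                if (skb_shinfo(skb)->gso_type & SKB_GSO_TCPV6) {
                        skb_shinfo(skb)->gso_type &= ~SKB_GSO_TCPV6;
                        skb_shinfo(skb)->gso_type |=  SKB_GSO_TCPV4;
                }

                /* The IPv4 header is 20 bytes smaller, so each
                 * segment may carry 20 bytes more payload.
                 */
                skb_shinfo(skb)->gso_size += len_diff;

                /* Header must be checked, and gso_segs recomputed. */
                skb_shinfo(skb)->gso_type |= SKB_GSO_DODGY;
                skb_shinfo(skb)->gso_segs = 0;
        }
        ...
}

So the payload and frag geometry of the GRO skb stay untouched while
gso_size grows by 20, which is precisely the geometry change Eric says
skb_segment() cannot cope with.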
Right, it has been used in Cilium since the helper was added. Back then
I recall I tested this back to back over ixgbe with the various offload
options toggled on/off and a BPF program implementing nat64 on each
side, running various netperf streams; later I integrated the logic
into Cilium itself. Perhaps something changed in more recent kernels as
well. I'll look closer into this issue tomorrow.

> We have to fix whatever needs to be fixed in skb_segment (if the bug
> is there) or fix whatever is necessary on the mlx5 side.
> In the bpf helper we mark the skb as SKB_GSO_DODGY, just like packets
> coming through virtio would be, so if skb_segment() needs to do
> something special with the skb, the SKB_GSO_DODGY flag is already
> there.
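For reference, what DODGY buys on the TCP side today is only a recheck
of the segment count, not of gso_size itself; sketched from memory of
tcp_gso_segment() in net/ipv4/tcp_offload.c, so details vary by kernel
version:

/* tcp_gso_segment(), after the TCP header has been pulled: */
mss = skb_shinfo(skb)->gso_size;
...
if (skb_gso_ok(skb, features | NETIF_F_GSO_ROBUST)) {
        /* Packet is from an untrusted source: only the segment count
         * is revalidated; mss/gso_size is taken at face value.
         */
        skb_shinfo(skb)->gso_segs = DIV_ROUND_UP(skb->len, mss);

        segs = NULL;    /* hardware does the actual segmentation */
        goto out;
}
...
segs = skb_segment(skb, features);      /* software fallback, BUG_ON path */

So when the device's feature set allows TSO for the skb, the validated
GSO packet goes back to hardware, and when it does not, we fall through
to skb_segment() with the modified gso_size. Why that feature check
apparently passes on mlx4 but not on mlx5 is the open question here.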