On Mon, Mar 12, 2018 at 11:25:09PM -0700, Eric Dumazet wrote:
> 
> 
> On 03/12/2018 11:08 PM, Yonghong Song wrote:
> > 
> > 
> > On 3/12/18 11:04 PM, Eric Dumazet wrote:
> > > 
> > > 
> > > On 03/12/2018 10:45 PM, Yonghong Song wrote:
> > > > ...
> > > > Setup:
> > > > =====
> > > > 
> > > > The test will involve three machines:
> > > >   M_ipv6 <-> M_nat <-> M_ipv4
> > > > 
> > > > The M_nat will do ipv4<->ipv6 address translation and then forward
> > > > the packet to the proper destination. The control plane will
> > > > configure M_nat properly so it understands the virtual ipv4 address
> > > > for machine M_ipv6, and the virtual ipv6 address for machine M_ipv4.
> > > > 
> > > > M_nat runs a bpf program, which is attached to the clsact (ingress)
> > > > qdisc. The program uses bpf_skb_change_proto to do the protocol
> > > > conversion. bpf_skb_change_proto will adjust skb header_len and len
> > > > properly based on the protocol change.
> > > > After the conversion, the program will make the proper changes to
> > > > the ethhdr and ipv4/6 header, recalculate the checksum, and send the
> > > > packet out through bpf_redirect.
> > > > 
> > > > Experiment:
> > > > ===========
> > > > 
> > > > MTU: 1500B for all three machines.
> > > > 
> > > > tso/lro/gro are enabled on the M_nat box.
> > > > 
> > > > ping works both ways between M_ipv6 and M_ipv4.
> > > > Transferring a small file (4KB) between M_ipv6 and M_ipv4 works
> > > > (both ways).
> > > > Transferring a large file (e.g., 4MB) from M_ipv6 to M_ipv4 failed
> > > > with the above BUG_ON, really fast.
> > > > Did not really test from M_ipv4 to M_ipv6 with a large file.
> > > > 
> > > > The error path is likely to be (also from the above call stack):
> > > >   nic -> lro/gro -> bpf_program -> gso (BUG_ON)
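
For anyone following along, the setup described above corresponds to
roughly the tc/clsact ingress program sketched below. This is only my
own reconstruction of what was described, not the reporter's actual
code; the egress ifindex and the header rewrite are placeholders.

  /* Minimal sketch of a clsact ingress program doing a 6->4 protocol
   * conversion and redirect, as described above.  Reconstruction only;
   * EGRESS_IFINDEX and the header rewrite are placeholders.
   */
  #include <linux/bpf.h>
  #include <linux/if_ether.h>
  #include <linux/pkt_cls.h>
  #include <bpf/bpf_helpers.h>
  #include <bpf/bpf_endian.h>

  #define EGRESS_IFINDEX 4                /* placeholder ifindex */

  SEC("classifier")
  int nat64_ingress(struct __sk_buff *skb)
  {
          /* Switch the skb from IPv6 to IPv4; this adjusts header_len,
           * len and, for GSO skbs, the gso metadata.
           */
          if (bpf_skb_change_proto(skb, bpf_htons(ETH_P_IP), 0) < 0)
                  return TC_ACT_SHOT;

          /* ... rewrite the ethhdr and build the new ipv4 header with
           * bpf_skb_store_bytes(), fix the L4 checksum with
           * bpf_l4_csum_replace() ...
           */

          return bpf_redirect(EGRESS_IFINDEX, 0);
  }

  char _license[] SEC("license") = "GPL";
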
Just out of curiosity, are these packets created with LRO or GRO?
Usually LRO is disabled if forwarding is enabled on a machine, because
resegmented LRO packets are likely corrupt. These packets take an
alternative (redirect) path here, though, so I'm not sure what happens.

> > > > 
> > > > In one of the experiments, I explicitly printed skb->len and
> > > > skb->data_len. The values are below:
> > > >   skb_segment: len 2856, data_len 2686
> > > > They should be equal to avoid the BUG.
> > > > 
> > > > In another experiment, I got:
> > > >   skb_segment: len 1428, data_len 1258
> > > > 
> > > > In both cases, the difference is 170 bytes. Not sure whether
> > > > this is just a coincidence or not.
> > > > 
> > > > Workaround:
> > > > ===========
> > > > 
> > > > A workaround to avoid the BUG_ON is to disable lro/gro. This way,
> > > > the kernel will not receive big packets and hence gso is not really
> > > > called.
> > > > 
> > > > I am not familiar with the gso code. Has anybody hit this BUG_ON
> > > > before? Any suggestion on how to debug this?
> > > 
> > > skb_segment() works if the incoming GRO packet is not modified in its
> > > geometry.
> > > 
> > > In your case it seems you had to adjust gso_size (calling
> > > skb_decrease_gso_size() or skb_increase_gso_size()), and this breaks
> > > skb_segment() badly, because the geometry changes, unless you had
> > > specific MTU/MSS restrictions.
> > > 
> > > You will have to make skb_segment() more generic if you really want
> > > this.
> > 
> > In net/core/filter.c, the function bpf_skb_change_proto, which is
> > called in the bpf program, does some GSO adjustment. Could you help
> > check whether it satisfies my above use case or not? Thanks!
> 
> As I said, this helper ends up modifying gso_size by +/- 20
> (sizeof(ipv6 header) - sizeof(ipv4 header)).
> 
> So it won't work if skb_segment() is called after this change.

Even HW TSO uses gso_size to segment the packets. Wouldn't this result
in broken packets too, if gso_size is modified on a forwarding path?

> 
> Not clear why the GRO packet is not sent as is (as a TSO packet) since
> mlx4/mlx5 NICs certainly support TSO.

If the packets are generated with GRO, there can be data chained at the
frag_list pointer. Most NICs can't offload such skbs, so if
skb_segment() can't split at the frag_list pointer, it will just
segment the packets based on gso_size.
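
To make the geometry point concrete, the adjustment Eric mentions
amounts to something like the snippet below for the 6->4 direction hit
in this report (the 4->6 direction shrinks gso_size by the same 20
bytes). This is a simplified sketch of the effect, not the exact
net/core/filter.c code, and the function name is made up for
illustration:

  /* Simplified sketch (not the exact kernel code) of the gso metadata
   * change implied by a 6->4 conversion of a GSO/GRO skb.
   */
  #include <linux/types.h>
  #include <linux/ip.h>
  #include <linux/ipv6.h>
  #include <linux/skbuff.h>

  static void sketch_gso_fixup_6_to_4(struct sk_buff *skb)
  {
          const u32 len_diff = sizeof(struct ipv6hdr) - sizeof(struct iphdr); /* 20 */
          struct skb_shared_info *shinfo = skb_shinfo(skb);

          if (!skb_is_gso(skb))
                  return;

          /* TCPv6 segments become TCPv4 segments after the conversion. */
          if (shinfo->gso_type & SKB_GSO_TCPV6) {
                  shinfo->gso_type &= ~SKB_GSO_TCPV6;
                  shinfo->gso_type |= SKB_GSO_TCPV4;
          }

          /* The smaller IPv4 header leaves 20 more bytes of room per
           * segment, so the per-segment payload size (gso_size) grows.
           */
          shinfo->gso_size += len_diff;

          /* gso_segs must be recomputed against the new gso_size. */
          shinfo->gso_segs = 0;
  }

Whether the resulting skb still matches what skb_segment() expects is
exactly the geometry issue Eric points out above.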