Hi, Song

On 2018/3/13 13:45, Yonghong Song wrote:
> Hi,
>
> One of our in-house projects, bpf-based NAT, hits a kernel BUG_ON at
> net-next function skb_segment, line 3667.
>
> 3472 struct sk_buff *skb_segment(struct sk_buff *head_skb,
> 3473                             netdev_features_t features)
> 3474 {
> 3475         struct sk_buff *segs = NULL;
> 3476         struct sk_buff *tail = NULL;
> ...
> 3665         while (pos < offset + len) {
> 3666                 if (i >= nfrags) {
> 3667                         BUG_ON(skb_headlen(list_skb));
> 3668
> 3669                         i = 0;
> 3670                         nfrags = skb_shinfo(list_skb)->nr_frags;
> 3671                         frag = skb_shinfo(list_skb)->frags;
> 3672                         frag_skb = list_skb;
> ...
>
> call stack:
> ...
> #0 [ffff883ffef034f8] machine_kexec at ffffffff81044c41
> #1 [ffff883ffef03558] __crash_kexec at ffffffff8110c525
> #2 [ffff883ffef03620] crash_kexec at ffffffff8110d5cc
> #3 [ffff883ffef03640] oops_end at ffffffff8101d7e7
> #4 [ffff883ffef03668] die at ffffffff8101deb2
> #5 [ffff883ffef03698] do_trap at ffffffff8101a700
> #6 [ffff883ffef036e8] do_error_trap at ffffffff8101abfe
> #7 [ffff883ffef037a0] do_invalid_op at ffffffff8101acd0
> #8 [ffff883ffef037b0] invalid_op at ffffffff81a00bab
>     [exception RIP: skb_segment+3044]
>     RIP: ffffffff817e4dd4  RSP: ffff883ffef03860  RFLAGS: 00010216
>     RAX: 0000000000002bf6  RBX: ffff883feb7aaa00  RCX: 0000000000000011
>     RDX: ffff883fb87910c0  RSI: 0000000000000011  RDI: ffff883feb7ab500
>     RBP: ffff883ffef03928   R8: 0000000000002ce2   R9: 00000000000027da
>     R10: 000001ea00000000  R11: 0000000000002d82  R12: ffff883f90a1ee80
>     R13: ffff883fb8791120  R14: ffff883feb7abc00  R15: 0000000000002ce2
>     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> #9 [ffff883ffef03930] tcp_gso_segment at ffffffff818713e7
> #10 [ffff883ffef03990] tcp4_gso_segment at ffffffff818717d8
> #11 [ffff883ffef039b0] inet_gso_segment at ffffffff81882c9b
> #12 [ffff883ffef03a10] skb_mac_gso_segment at ffffffff817f39b8
> #13 [ffff883ffef03a38] __skb_gso_segment at ffffffff817f3ac9
> #14 [ffff883ffef03a68] validate_xmit_skb at ffffffff817f3eed
> #15 [ffff883ffef03aa8] validate_xmit_skb_list at ffffffff817f40a2
> #16 [ffff883ffef03ad8] sch_direct_xmit at ffffffff81824efb
> #17 [ffff883ffef03b20] __qdisc_run at ffffffff818251aa
> #18 [ffff883ffef03b90] __dev_queue_xmit at ffffffff817f45ed
> #19 [ffff883ffef03c08] dev_queue_xmit at ffffffff817f4b90
> #20 [ffff883ffef03c18] __bpf_redirect at ffffffff81812b66
> #21 [ffff883ffef03c40] skb_do_redirect at ffffffff81813209
> #22 [ffff883ffef03c60] __netif_receive_skb_core at ffffffff817f310d
> #23 [ffff883ffef03cc8] __netif_receive_skb at ffffffff817f32e8
> #24 [ffff883ffef03ce8] netif_receive_skb_internal at ffffffff817f5538
> #25 [ffff883ffef03d10] napi_gro_complete at ffffffff817f56c0
> #26 [ffff883ffef03d28] dev_gro_receive at ffffffff817f5ea6
> #27 [ffff883ffef03d78] napi_gro_receive at ffffffff817f6168
> #28 [ffff883ffef03da0] mlx5e_handle_rx_cqe_mpwrq at ffffffff817381c2
> #29 [ffff883ffef03e30] mlx5e_poll_rx_cq at ffffffff817386c2
> #30 [ffff883ffef03e80] mlx5e_napi_poll at ffffffff8173926e
> #31 [ffff883ffef03ed0] net_rx_action at ffffffff817f5a6e
> #32 [ffff883ffef03f48] __softirqentry_text_start at ffffffff81c000c3
> #33 [ffff883ffef03fa8] irq_exit at ffffffff8108f515
> #34 [ffff883ffef03fb8] do_IRQ at ffffffff81a01b11
> --- <IRQ stack> ---
> bt: cannot transition from IRQ stack to current process stack:
>     IRQ stack pointer: ffff883ffef034f8
>     process stack pointer: ffffffff81a01ae9
>     current stack base: ffffc9000c5c4000
> ...
> Setup:
> =====
>
> The test involves three machines:
>     M_ipv6 <-> M_nat <-> M_ipv4
>
> The M_nat box does ipv4<->ipv6 address translation and then forwards the
> packet to the proper destination. The control plane configures M_nat so
> that it understands the virtual ipv4 address for machine M_ipv6 and the
> virtual ipv6 address for machine M_ipv4.
>
> M_nat runs a bpf program, which is attached to the clsact (ingress) qdisc.
> The program uses bpf_skb_change_proto to do the protocol conversion.
> bpf_skb_change_proto adjusts skb header_len and len properly based on the
> protocol change. After the conversion, the program makes the proper
> changes to the ethhdr and the ip4/6 header, recalculates the checksum,
> and sends the packet out through bpf_redirect.
>
> Experiment:
> ===========
>
> MTU: 1500B for all three machines.
>
> tso/lro/gro are enabled on the M_nat box.
>
> ping works both ways between M_ipv6 and M_ipv4.
> Transferring a small file (4KB) between M_ipv6 and M_ipv4 works (both ways).
> Transferring a large file (e.g., 4MB) from M_ipv6 to M_ipv4 fails with the
> above BUG_ON, really fast.
> Did not really test from M_ipv4 to M_ipv6 with a large file.
>
> The error path is likely (also from the above call stack):
>     nic -> lro/gro -> bpf_program -> gso (BUG_ON)
>
> In one of the experiments, I explicitly printed skb->len and skb->data_len.
> The values were:
>     skb_segment: len 2856, data_len 2686
> They should be equal to avoid the BUG.
>
> In another experiment, I got:
>     skb_segment: len 1428, data_len 1258
>
> In both cases, the difference is 170 bytes. Not sure whether this is just
> a coincidence or not.
>
> Workaround:
> ===========
>
> A workaround to avoid the BUG_ON is to disable lro/gro. This way, the
> kernel does not receive big packets, so gso is not really called.
>
> I am not familiar with the gso code. Has anybody hit this BUG_ON before?
> Any suggestions on how to debug this?
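For concreteness, I read the setup you describe above as roughly the sketch
below (the section name, OUT_IFINDEX, and the elided header rewrite are my
own placeholders, not taken from your mail):

#include <linux/bpf.h>
#include <linux/if_ether.h>
#include <linux/pkt_cls.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_endian.h>

#define OUT_IFINDEX 2   /* hypothetical egress device index */

SEC("classifier")
int nat46(struct __sk_buff *skb)
{
        /* bpf_skb_change_proto grows the headroom by 20B and adjusts
         * skb->len / header_len for the v4 -> v6 conversion. */
        if (bpf_skb_change_proto(skb, bpf_htons(ETH_P_IPV6), 0) < 0)
                return TC_ACT_SHOT;

        /* ... rewrite ethhdr->h_proto, build the ipv6 header from the
         * saved ipv4 fields, recompute the L4 checksum ... */

        return bpf_redirect(OUT_IFINDEX, 0);
}

char _license[] SEC("license") = "GPL";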
When the bpf program does the ipv4<->ipv6 address translation,
shinfo->gso_type may also need to be changed, for example
SKB_GSO_TCPV4 -> SKB_GSO_TCPV6. I am not sure whether there are other
fields related to gro/gso that need to be changed; you may want to debug
that. You could also call skb_mac_gso_segment with the packet right after
the address translation to narrow down this problem.

Hope this helps.

> Thanks!
>
> Yonghong
>
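P.S. To make the gso_type point concrete: below is a rough sketch of the
kind of fixup I have in mind. It mirrors what the kernel itself does for
gso skbs in its own 4->6 conversion path (bpf_skb_proto_4_to_6() in
net/core/filter.c); the wrapper name is mine, and this is illustrative
rather than a tested patch.

static void nat46_fixup_gso(struct sk_buff *skb)
{
        const u32 len_diff = sizeof(struct ipv6hdr) - sizeof(struct iphdr);
        struct skb_shared_info *shinfo = skb_shinfo(skb);

        if (!skb_is_gso(skb))
                return;

        /* SKB_GSO_TCPV4 needs to become SKB_GSO_TCPV6. */
        if (shinfo->gso_type & SKB_GSO_TCPV4) {
                shinfo->gso_type &= ~SKB_GSO_TCPV4;
                shinfo->gso_type |= SKB_GSO_TCPV6;
        }

        /* The header grew by 20B, so the per-segment payload shrinks. */
        shinfo->gso_size -= len_diff;
        /* Header must be checked, and gso_segs recomputed. */
        shinfo->gso_type |= SKB_GSO_DODGY;
        shinfo->gso_segs = 0;
}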