On Tue, 26 Feb 2019 21:02:10 +0800
Sheng Lan <lansh...@huawei.com> wrote:

> > On Mon, 25 Feb 2019 22:49:39 +0800
> > Sheng Lan <lansh...@huawei.com> wrote:
> >   
> >> From: Sheng Lan <lansh...@huawei.com>
> >> Subject: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec
> >>
> >> It can be reproduced with the following steps:
> >> 1. configure a virtio_net NIC with gso/tso on
> >> 2. configure nginx as an http server with an index file bigger than 1M bytes
> >> 3. use tc netem to produce duplicate packets and delay:
> >>    tc qdisc add dev eth0 root netem delay 100ms 10ms 30% duplicate 90%
> >> 4. on the client, repeatedly curl the nginx http server to fetch the index file
> >> 5. the BUG_ON is hit quickly
> >>
> >> [10258690.371129] kernel BUG at net/core/skbuff.c:4028!
> >> [10258690.371748] invalid opcode: 0000 [#1] SMP PTI
> >> [10258690.372094] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G        W         5.0.0-rc6 #2
> >> [10258690.372094] RSP: 0018:ffffa05797b43da0 EFLAGS: 00010202
> >> [10258690.372094] RBP: 00000000000005ea R08: 0000000000000000 R09: 00000000000005ea
> >> [10258690.372094] R10: ffffa0579334d800 R11: 00000000000002c0 R12: 0000000000000002
> >> [10258690.372094] R13: 0000000000000000 R14: ffffa05793122900 R15: ffffa0578f7cb028
> >> [10258690.372094] FS:  0000000000000000(0000) GS:ffffa05797b40000(0000) knlGS:0000000000000000
> >> [10258690.372094] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [10258690.372094] CR2: 00007f1a6dc00868 CR3: 000000001000e000 CR4: 00000000000006e0
> >> [10258690.372094] Call Trace:
> >> [10258690.372094]  <IRQ>
> >> [10258690.372094]  skb_to_sgvec+0x11/0x40
> >> [10258690.372094]  start_xmit+0x38c/0x520 [virtio_net]
> >> [10258690.372094]  dev_hard_start_xmit+0x9b/0x200
> >> [10258690.372094]  sch_direct_xmit+0xff/0x260
> >> [10258690.372094]  __qdisc_run+0x15e/0x4e0
> >> [10258690.372094]  net_tx_action+0x137/0x210
> >> [10258690.372094]  __do_softirq+0xd6/0x2a9
> >> [10258690.372094]  irq_exit+0xde/0xf0
> >> [10258690.372094]  smp_apic_timer_interrupt+0x74/0x140
> >> [10258690.372094]  apic_timer_interrupt+0xf/0x20
> >> [10258690.372094]  </IRQ>
> >>
> >> In __skb_to_sgvec, skb->len is not equal to the sum of the skb's
> >> linear data size and nonlinear data size, so the BUG_ON is triggered.
> >> The bad skb's nonlinear data size is less than skb->data_len, because
> >> the skb is a clone and part of the nonlinear data it shares with the
> >> original skb has been split off.
> >>
> >> The duplicate packet is cloned by skb_clone in netem_enqueue and may be
> >> delayed for some time in the qdisc. Because of that delay, the original
> >> skb can be pushed again later by __tcp_push_pending_frames when tcp
> >> receives new packets. In tcp_write_xmit, when tcp_mss_split_point returns
> >> a smaller limit, the original skb is fragmented and part of its nonlinear
> >> data is split off. The length of the skb cloned by netem is not updated.
> >> With a virtio_net NIC, the duplicated cloned skb is then filled into
> >> a scatter-gather list in __skb_to_sgvec and triggers the BUG_ON.
> >>
> >> Here I replace the skb_clone with skb_copy in netem_enqueue to ensure
> >> the duplicated skb's nonlinear data is independent.
> >>
> >> Signed-off-by: Sheng Lan <lansh...@huawei.com>
> >> Reported-by: Qin Ji <jiqin...@huawei.com>
> >>
> >> Fixes: 0afb51e7 ("netem: reinsert for duplication")  
> > 
> > This sounds like a bug in the other layers (either TCP or Virtio net)
> > not handling a cloned skb properly.
> >   
> 
> I have traced the path of the skb with printk. Let me walk through an
> example to make the problem clear.
> The mss value is 1448. The limit value is the split size used when tcp
> calls tso_fragment; it depends on the size of the sending congestion
> window and on the mss value.
> 
> The TCP layer transmits the index file to the client; the original skb1 is large:
> ...
> tcp_write_xmit            (skb1->data_len == 62264, limit == 2*mss == 2896)
> tso_fragment              (it needs to be fragmented by limit value)
> skb_split                 (after split, skb1->data_len == 2896, skb_shinfo(skb1)->frags[0] == 2896, skb_shinfo(skb1)->nr_frags == 1)
> ...
> netem_enqueue             (netem constructs a duplicate packet of skb1 by skb_clone)
> skb2 = skb_clone(skb1)    (skb1->data_len == skb2->data_len == 2896, skb1 and skb2 share the nonlinear data frags[0] == 2896)
> waiting 30ms              (skb1 and skb2 will be delayed in qdisc queue due to the netem delay configuration)
> 
> 
> The TCP layer receives new packets and tries to retransmit skb1:
> tcp_rcv_established
> __tcp_push_pending_frames
> tcp_write_xmit            (skb1->data_len == 2896, cwnd size decreased or packets in flight increased, causing the limit to decrease to 1*mss == 1448)
> tso_fragment              (limit value is less than skb1->data_len, skb1 will be fragmented again)
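
The sequence traced above can be sketched as a toy userspace model in C.
This is only an illustration, not kernel code: struct toy_skb, struct
toy_shinfo and the toy_* helpers are hypothetical stand-ins, and the
leftover check only loosely mirrors the length accounting that ends in the
BUG_ON in __skb_to_sgvec. It assumes a clone gets its own len/data_len but
shares the fragment sizes with the original.

```c
#include <assert.h>

/* Hypothetical stand-in for skb_shared_info: fragment sizes only. */
struct toy_shinfo {
	unsigned int nr_frags;
	unsigned int frag_size[4];
};

/* Hypothetical stand-in for struct sk_buff. */
struct toy_skb {
	unsigned int len;          /* total bytes: linear + paged */
	unsigned int data_len;     /* paged (nonlinear) bytes only */
	struct toy_shinfo *shinfo; /* shared between original and clone */
};

/* Like a clone: private length fields, shared fragment metadata. */
static struct toy_skb toy_clone(const struct toy_skb *skb)
{
	return *skb;
}

/* Trim the paged data to 'keep' bytes in place, updating only this skb. */
static void toy_split(struct toy_skb *skb, unsigned int keep)
{
	unsigned int left = keep;
	unsigned int i;

	assert(keep <= skb->data_len);
	for (i = 0; i < skb->shinfo->nr_frags; i++) {
		if (skb->shinfo->frag_size[i] > left)
			skb->shinfo->frag_size[i] = left;
		left -= skb->shinfo->frag_size[i];
	}
	skb->len -= skb->data_len - keep;
	skb->data_len = keep;
}

/* Bytes that len claims but the linear area plus frags cannot supply. */
static unsigned int toy_leftover(const struct toy_skb *skb)
{
	unsigned int headlen = skb->len - skb->data_len;
	unsigned int sum = 0;
	unsigned int i;

	for (i = 0; i < skb->shinfo->nr_frags; i++)
		sum += skb->shinfo->frag_size[i];
	return skb->len - headlen - sum;
}

/* Replays the trace: clone at 2896 bytes, then split the original at 1448. */
static unsigned int toy_demo_leftover(void)
{
	struct toy_shinfo shared = { 1, { 2896, 0, 0, 0 } };
	struct toy_skb skb1 = { 2896, 2896, &shared };
	struct toy_skb skb2 = toy_clone(&skb1);

	toy_split(&skb1, 1448);
	assert(toy_leftover(&skb1) == 0); /* the original stays consistent */
	return toy_leftover(&skb2);       /* the clone does not */
}
```

With the numbers from the trace (2896 bytes of paged data, split at 1448),
toy_demo_leftover() leaves the clone claiming 1448 bytes more than its
fragments hold, while the original stays consistent.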


Maybe the fix is to stop tso_fragment() from overwriting the fragments of a
cloned skb, by doing something like:

diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 730bc44dbad9..5fe91d0224f6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1856,7 +1856,7 @@ static int tso_fragment(struct sock *sk, enum tcp_queue tcp_queue,
        u8 flags;
 
        /* All of a TSO frame must be composed of paged data.  */
-       if (skb->len != skb->data_len)
+       if (skb->len != skb->data_len || skb_cloned(skb))
                return tcp_fragment(sk, tcp_queue, skb, len, mss_now, gfp);
 
        buff = sk_stream_alloc_skb(sk, 0, gfp, true);
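
The intent of that check can be sketched in a toy userspace model. This is
an illustration under assumptions, not the kernel code: all toy_* names are
hypothetical, each skb has a single fragment for brevity, and the model
assumes the fallback path gives the skb private fragment metadata before
splitting (loosely what an unclone step would do).

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for skb_shared_info with a single fragment. */
struct toy_shinfo {
	int dataref;            /* how many skbs point at this metadata */
	unsigned int frag_size;
};

/* Hypothetical stand-in for struct sk_buff. */
struct toy_skb {
	unsigned int len;
	unsigned int data_len;
	struct toy_shinfo *shinfo;
};

static int toy_cloned(const struct toy_skb *skb)
{
	return skb->shinfo->dataref > 1;
}

static struct toy_skb toy_clone(struct toy_skb *skb)
{
	skb->shinfo->dataref++;
	return *skb;
}

/* Give skb private metadata if it is shared. Returns 0 on success. */
static int toy_unclone(struct toy_skb *skb)
{
	struct toy_shinfo *priv;

	if (!toy_cloned(skb))
		return 0;
	priv = malloc(sizeof(*priv));
	if (!priv)
		return -1;
	*priv = *skb->shinfo;
	priv->dataref = 1;
	skb->shinfo->dataref--;
	skb->shinfo = priv;
	return 0;
}

/* Split that never edits metadata another skb still references. */
static int toy_guarded_split(struct toy_skb *skb, unsigned int keep)
{
	assert(keep <= skb->data_len);
	if (toy_cloned(skb) && toy_unclone(skb))
		return -1;
	skb->shinfo->frag_size = keep;
	skb->len -= skb->data_len - keep;
	skb->data_len = keep;
	return 0;
}

/* Clone at 2896 bytes, split the original at 1448, inspect the clone. */
static unsigned int toy_guard_demo(void)
{
	struct toy_shinfo shared = { 1, 2896 };
	struct toy_skb skb1 = { 2896, 2896, &shared };
	struct toy_skb skb2 = toy_clone(&skb1);
	unsigned int leftover;

	if (toy_guarded_split(&skb1, 1448))
		return ~0u;
	/* Bytes the clone claims beyond what its fragment holds. */
	leftover = skb2.data_len - skb2.shinfo->frag_size;
	free(skb1.shinfo); /* skb1 now owns a heap-allocated private copy */
	return leftover;
}
```

In this model the clone keeps seeing a 2896-byte fragment after the
original is split, so its length and its fragment size still agree.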
