On 2019/2/26 23:52, Eric Dumazet wrote:
> 
> 
> On 02/26/2019 05:02 AM, Sheng Lan wrote:
>>
>>
>>
>>> On Mon, 25 Feb 2019 22:49:39 +0800
>>> Sheng Lan <lansh...@huawei.com> wrote:
>>>
>>>> From: Sheng Lan <lansh...@huawei.com>
>>>> Subject: [PATCH] net: netem: fix skb length BUG_ON in __skb_to_sgvec
>>>>
>>>> It can be reproduced by following steps:
>>>> 1. virtio_net NIC is configured with gso/tso on
>>>> 2. configure nginx as http server with an index file bigger than 1M bytes
>>>> 3. use tc netem to produce duplicate packets and delay:
>>>>    tc qdisc add dev eth0 root netem delay 100ms 10ms 30% duplicate 90%
>>>> 4. continually curl the nginx http server to get index file on client
>>>> 5. BUG_ON is seen quickly
>>>>
>>>> [10258690.371129] kernel BUG at net/core/skbuff.c:4028!
>>>> [10258690.371748] invalid opcode: 0000 [#1] SMP PTI
>>>> [10258690.372094] CPU: 5 PID: 0 Comm: swapper/5 Tainted: G        W         5.0.0-rc6 #2
>>>> [10258690.372094] RSP: 0018:ffffa05797b43da0 EFLAGS: 00010202
>>>> [10258690.372094] RBP: 00000000000005ea R08: 0000000000000000 R09: 00000000000005ea
>>>> [10258690.372094] R10: ffffa0579334d800 R11: 00000000000002c0 R12: 0000000000000002
>>>> [10258690.372094] R13: 0000000000000000 R14: ffffa05793122900 R15: ffffa0578f7cb028
>>>> [10258690.372094] FS:  0000000000000000(0000) GS:ffffa05797b40000(0000) knlGS:0000000000000000
>>>> [10258690.372094] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>>> [10258690.372094] CR2: 00007f1a6dc00868 CR3: 000000001000e000 CR4: 00000000000006e0
>>>> [10258690.372094] Call Trace:
>>>> [10258690.372094]  <IRQ>
>>>> [10258690.372094]  skb_to_sgvec+0x11/0x40
>>>> [10258690.372094]  start_xmit+0x38c/0x520 [virtio_net]
>>>> [10258690.372094]  dev_hard_start_xmit+0x9b/0x200
>>>> [10258690.372094]  sch_direct_xmit+0xff/0x260
>>>> [10258690.372094]  __qdisc_run+0x15e/0x4e0
>>>> [10258690.372094]  net_tx_action+0x137/0x210
>>>> [10258690.372094]  __do_softirq+0xd6/0x2a9
>>>> [10258690.372094]  irq_exit+0xde/0xf0
>>>> [10258690.372094]  smp_apic_timer_interrupt+0x74/0x140
>>>> [10258690.372094]  apic_timer_interrupt+0xf/0x20
>>>> [10258690.372094]  </IRQ>
>>>>
>>>> In __skb_to_sgvec, the skb->len is not equal to the sum of the skb's
>>>> linear data size and nonlinear data size, thus BUG_ON triggered. The
>>>> bad skb's nonlinear data size is less than skb->data_len, because the
>>>> skb is cloned and a part of related cloned skb's nonlinear data is
>>>> split off.
>>>>
>>>> Duplicate packet is cloned by skb_clone in netem_enqueue and may be delayed
>>>> some time in qdisc. Due to the delay time, the original skb will be pushed
>>>> again later in __tcp_push_pending_frames when tcp receives new packets.
>>>> In tcp_write_xmit, when the tcp_mss_split_point returns a smaller limit,
>>>> the original skb will be fragmented and the skb's nonlinear data will be
>>>> split off. The length of the skb cloned by netem will not be updated.
>>>> When we use virtio_net NIC, the duplicated cloned skb will be filled into
>>>> a scatter-gather list in __skb_to_sgvec and trigger the BUG_ON.
>>>>
>>>> Here I replace the skb_clone with skb_copy in netem_enqueue to ensure
>>>> the duplicated skb's nonlinear data is independent.
>>>>
>>>> Signed-off-by: Sheng Lan <lansh...@huawei.com>
>>>> Reported-by: Qin Ji <jiqin...@huawei.com>
>>>>
>>>> Fixes: 0afb51e7 ("netem: reinsert for duplication")
>>>
>>> This sounds like a bug in the other layers (either TCP or Virtio net)
>>> not handling a cloned skb properly.
>>>
>>
>> I traced the path of the skb with printk. Let me walk through an example to
>> make the problem clear.
>> The MSS is 1448. The limit is the split size used when TCP calls
>> tso_fragment; it depends on the size of the sending congestion window and
>> on the MSS.
>>
>> The TCP layer transmits the index file to the client; the original skb1 is large:
>> ...
>> tcp_write_xmit            (skb1->data_len == 62264, limit == 2*mss == 2896)
>> tso_fragment              (skb1 must be fragmented at the limit)
>> skb_split                 (after the split, skb1->data_len == 2896, skb_shinfo(skb1)->frags[0] == 2896, skb_shinfo(skb1)->nr_frags == 1)
>> ...
>> netem_enqueue             (netem constructs a duplicate of skb1 with skb_clone)
>> skb2 = skb_clone(skb1)    (skb1->data_len == skb2->data_len == 2896; skb1 and skb2 share the nonlinear data, frags[0] == 2896)
>> waiting 30ms              (skb1 and skb2 are delayed in the qdisc queue by the netem delay configuration)
>>
>>
>> The TCP layer receives new packets and tries to retransmit skb1:
>> tcp_rcv_established
>> __tcp_push_pending_frames
>> tcp_write_xmit            (skb1->data_len == 2896; cwnd decreased or packets in flight increased, so the limit dropped to 1*mss == 1448)
> 
> tcp_write_xmit() only deals with packets in the write queue;
> they were never sent. There cannot be any clone of them by definition, since
> skbs in the TCP write queue are private to the TCP stack.
> 
> Once a packet is sent, the master skb is moved to the rtx rb-tree,
> while the clone goes through the lower stacks.
> 
> When/if a retransmit is due, we always make sure there is no clone of it;
> look at the various calls to skb_unclone().

I traced again and found that the skb had not been sent: the master skb was
still in the write queue because tcp_transmit_skb() returned 1 (NET_XMIT_DROP),
so it could be transmitted again.
The NET_XMIT_DROP value comes from netem_enqueue(), which returns it when the
qdisc queue length has reached the queue limit.

In netem_enqueue() the skb is cloned before NET_XMIT_DROP is returned, so the
master skb stays in the write queue even though netem_enqueue() has already
cloned it. As a result the master skb can be transmitted again and fragmented
again while a clone of it still exists.
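
To illustrate why fragmenting the master skb while a clone exists is harmful,
here is a minimal userspace sketch. It is a toy model, not kernel code: the
toy_* types and functions are hypothetical stand-ins that keep only the
lengths and a shared fragment table, and all data is assumed to be paged
(no linear part), as in the 2896/1448 example quoted above.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_FRAGS 4

/* Toy stand-in for the shared info: shared by an skb and all its clones. */
struct toy_shinfo {
    int nr_frags;
    int frag_size[TOY_MAX_FRAGS];
};

/* Toy stand-in for an skb: private lengths, pointer to shared frags. */
struct toy_skb {
    int len;                /* total bytes */
    int data_len;           /* bytes held in frags */
    struct toy_shinfo *sh;  /* shared with clones */
};

/* A clone copies the lengths but shares the fragment table. */
struct toy_skb toy_clone(const struct toy_skb *skb)
{
    return *skb;
}

/* Keep only the first `limit` paged bytes, trimming the shared frags. */
void toy_split(struct toy_skb *skb, int limit)
{
    int headlen = skb->len - skb->data_len;
    int remaining = limit;
    int i;

    assert(limit >= 0 && limit <= skb->data_len);
    for (i = 0; i < skb->sh->nr_frags && remaining > 0; i++) {
        if (skb->sh->frag_size[i] > remaining)
            skb->sh->frag_size[i] = remaining;
        remaining -= skb->sh->frag_size[i];
    }
    skb->sh->nr_frags = i;
    /* Only this skb's lengths are updated; clones keep the old values. */
    skb->data_len = limit;
    skb->len = headlen + limit;
}

/* The length check: linear bytes plus frag bytes must equal len. */
bool toy_len_consistent(const struct toy_skb *skb)
{
    int sum = skb->len - skb->data_len;

    for (int i = 0; i < skb->sh->nr_frags; i++)
        sum += skb->sh->frag_size[i];
    return sum == skb->len;
}
```

In this model, if skb1 has len == data_len == 2896 with one 2896-byte frag and
skb2 = toy_clone(&skb1), then after toy_split(&skb1, 1448) skb1 is still
consistent, but skb2 is not: it still claims 2896 bytes while the shared frag
now holds 1448.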

I think there is a risk that tso_fragment() gets a cloned skb whenever the skb
is cloned by a lower layer.
I tried to fix it by moving the limit check that returns the error before the
skb_clone() in netem_enqueue(), and it works.
The packet corruption code in netem_enqueue() can return NET_XMIT_DROP too. To
fix this completely, should I also move the corruption code before the
skb_clone()?
Please correct me if I am wrong; I would appreciate your advice.
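
The effect of the reordering can be sketched with a small userspace model.
This is only an illustration under simplifying assumptions: the toy_* names
are hypothetical, the duplicate is queued directly instead of being
re-inserted at the root qdisc, and delay and corruption are ignored.

```c
#include <assert.h>
#include <stdbool.h>

enum { TOY_XMIT_SUCCESS = 0, TOY_XMIT_DROP = 1 };

/* Toy stand-in for the netem qdisc; not a kernel type. */
struct toy_qdisc {
    int qlen;           /* packets currently queued */
    int limit;          /* queue limit */
    int clones_queued;  /* duplicates sharing data with the sender's skb */
};

/* Queue the duplicate if there is room (simplified re-insert). */
static void toy_queue_clone(struct toy_qdisc *q)
{
    if (q->qlen < q->limit) {
        q->qlen++;
        q->clones_queued++;
    }
}

/* Order before the patch: duplicate first, check the limit afterwards. */
int toy_enqueue_clone_first(struct toy_qdisc *q, bool duplicate)
{
    assert(q->limit > 0);
    if (duplicate)
        toy_queue_clone(q);
    if (q->qlen >= q->limit)
        return TOY_XMIT_DROP;   /* a clone may already be queued */
    q->qlen++;
    return TOY_XMIT_SUCCESS;
}

/* Order with the patch: check the limit first, duplicate afterwards. */
int toy_enqueue_check_first(struct toy_qdisc *q, bool duplicate)
{
    assert(q->limit > 0);
    if (q->qlen >= q->limit)
        return TOY_XMIT_DROP;   /* nothing has been cloned yet */
    if (duplicate)
        toy_queue_clone(q);
    q->qlen++;                  /* may exceed the limit by one in this model */
    return TOY_XMIT_SUCCESS;
}
```

With qlen == 0 and limit == 1, toy_enqueue_clone_first() returns
TOY_XMIT_DROP while a clone sits in the queue, which is the case that lets the
sender fragment a shared skb. toy_enqueue_check_first() never reports a drop
after queuing a clone in this model.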

Thanks

diff --git a/net/sched/sch_netem.c b/net/sched/sch_netem.c
index 75046ec..615a341 100644
--- a/net/sched/sch_netem.c
+++ b/net/sched/sch_netem.c
@@ -474,6 +474,9 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
        if (q->latency || q->jitter || q->rate)
                skb_orphan_partial(skb);

+       if (unlikely(sch->q.qlen >= sch->limit))
+               return qdisc_drop_all(skb, sch, to_free);
+
        /*
         * If we need to duplicate packet, then re-insert at top of the
         * qdisc tree, since parent queuer expects that only one
@@ -521,9 +524,6 @@ static int netem_enqueue(struct sk_buff *skb, struct Qdisc *sch,
                        1<<(prandom_u32() % 8);
        }

-       if (unlikely(sch->q.qlen >= sch->limit))
-               return qdisc_drop_all(skb, sch, to_free);
-
        qdisc_qstats_backlog_inc(sch, skb);

        cb = netem_skb_cb(skb);
-- 
