>On Tue, 29 Nov 2022 06:17:02 +0000 Lin Liu wrote:
>> A NAPI instance is set up for each network sring (shared ring) to poll
>> data into the kernel. The sring to the source host is destroyed before
>> live migration, and a new sring to the target host is set up after live
>> migration. The NAPI for the old sring is not deleted until the new sring
>> to the target host is set up after migration. With busy_poll/busy_read
>> enabled, the NAPI can be polled before it is deleted when the VM resumes.
>>
>> [50116.602938] BUG: unable to handle kernel NULL pointer dereference at
>> 0000000000000008
>> [50116.603047] IP: xennet_poll+0xae/0xd20
>> [50116.603090] PGD 0 P4D 0
>> [50116.603118] Oops: 0000 [#1] SMP PTI
>> [50116.604624] Call Trace:
>> [50116.604674]  ? finish_task_switch+0x71/0x230
>> [50116.604745]  ? timerqueue_del+0x1d/0x40
>> [50116.604807]  ? hrtimer_try_to_cancel+0xb5/0x110
>> [50116.604882]  ? xennet_alloc_rx_buffers+0x2a0/0x2a0
>> [50116.604958]  napi_busy_loop+0xdb/0x270
>> [50116.605017]  sock_poll+0x87/0x90
>> [50116.605066]  do_sys_poll+0x26f/0x580
>> [50116.605125]  ? tracing_map_insert+0x1d4/0x2f0
>> [50116.605196]  ? event_hist_trigger+0x14a/0x260
>
>You can trim all the ' ? ' entries from the stack trace,
>and the time stamps, FWIW. Makes it easier to read.

Sure, will do in the next version.

>> [50116.613598]  ? finish_task_switch+0x71/0x230
>> [50116.614131]  ? __schedule+0x256/0x890
>> [50116.614640]  ? recalc_sigpending+0x1b/0x50
>> [50116.615144]  ? xen_sched_clock+0x15/0x20
>> [50116.615643]  ? __rb_reserve_next+0x12d/0x140
>> [50116.616138]  ? ring_buffer_lock_reserve+0x123/0x3d0
>> [50116.616634]  ? event_triggers_call+0x87/0xb0
>> [50116.617138]  ? trace_event_buffer_commit+0x1c4/0x210
>> [50116.617625]  ? xen_clocksource_get_cycles+0x15/0x20
>> [50116.618112]  ? ktime_get_ts64+0x51/0xf0
>> [50116.618578]  SyS_ppoll+0x160/0x1a0
>> [50116.619029]  ? SyS_ppoll+0x160/0x1a0
>> [50116.619475]  do_syscall_64+0x73/0x130
>> [50116.619901]  entry_SYSCALL_64_after_hwframe+0x41/0xa6
>> ...
>> [50116.806230] RIP: xennet_poll+0xae/0xd20 RSP: ffffb4f041933900
>> [50116.806772] CR2: 0000000000000008
>> [50116.807337] ---[ end trace f8601785b354351c ]---
>>
>> The xen frontend should remove the NAPIs for the old srings before live
>> migration, as the srings they are bound to are destroyed.
>>
>> There is a tiny window between the srings being set to NULL and the
>> NAPIs being disabled; this is safe because the NAPI threads are still
>> frozen at that time.
>>
>
>Since this is a fix please add a Fixes tag, and add [PATCH net]
>to the subject.
>

Will do in the next version.
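
For reference, the Fixes tag follows the usual kernel convention; the hash
and subject below are placeholders, since the offending commit is not named
in this thread:

    Fixes: 123456789abc ("xen-netfront: subject of the offending commit")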

>> diff --git a/drivers/net/xen-netfront.c b/drivers/net/xen-netfront.c
>> index 9af2b027c19c..dc404e05970c 100644
>> --- a/drivers/net/xen-netfront.c
>> +++ b/drivers/net/xen-netfront.c
>> @@ -1862,6 +1862,12 @@ static int netfront_resume(struct xenbus_device *dev)
>>        netif_tx_unlock_bh(info->netdev);
>>
>>        xennet_disconnect_backend(info);
>> +
>> +     rtnl_lock();
>> +     if (info->queues)
>> +             xennet_destroy_queues(info);
>> +     rtnl_unlock();

>Now all callers of xennet_disconnect_backend() destroy queues soon
>after, can we just move the destroy queues into disconnect ?

After the sring is destroyed, the queue and the NAPI bound to it should
also be destroyed, so yes, destroying the queues could be part of
xennet_disconnect_backend(). However, some callers of
xennet_disconnect_backend() hold rtnl_lock while others do not, so I think
it is simpler to keep them separate.
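
To make the locking asymmetry concrete, here is a rough sketch (not part of
the posted patch, and the caller classification is only an assumption) of
what folding the queue teardown into xennet_disconnect_backend() might look
like; having to second-guess the rtnl lock is what makes the separate,
explicit calls in each caller simpler:

/* Rough sketch only: rtnl_is_locked() reports that *someone* holds the
 * rtnl lock, not that the current task does, so a check like this is
 * fragile compared to letting each caller take rtnl_lock() explicitly.
 */
static void xennet_disconnect_backend(struct netfront_info *info)
{
        /* ... existing ring/irq teardown ... */

        if (info->queues) {
                if (rtnl_is_locked()) {
                        /* assumed: caller already under rtnl */
                        xennet_destroy_queues(info);
                } else {
                        rtnl_lock();
                        xennet_destroy_queues(info);
                        rtnl_unlock();
                }
        }
}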
>
>>        return 0;
>>  }
>>
