Re: [PATCH v3 2/2] vhost-vdpa: fix page pinning leakage in error path

2020-10-29 Thread si-wei liu



On 10/29/2020 2:53 PM, Michael S. Tsirkin wrote:

On Thu, Oct 15, 2020 at 01:17:14PM -0700, si-wei liu wrote:

On 10/15/2020 6:11 AM, Michael S. Tsirkin wrote:

On Thu, Oct 15, 2020 at 02:15:32PM +0800, Jason Wang wrote:

On 2020/10/14 上午7:42, si-wei liu wrote:

So what I suggest is to fix the pinning leakage first and do the
possible optimization on top (which is still questionable to me).

OK. Unfortunately, this was picked and got merged in upstream. So I will
post a follow up patch set to 1) revert the commit to the original
__get_free_page() implementation, and 2) fix the accounting and leakage
on top. Will it be fine?

Fine.

Thanks

Fine by me too.


Thanks, Michael & Jason. I will post the fix shortly. Stay tuned.

-Siwei

did I miss the patch?

You didn't, sorry. I was looking into a way to speed up the boot time 
for large memory guest by multi-threading the page pinning process, and 
it turns out I need more time on that. Will post the fix I have now 
soon, hopefully tomorrow.


-Siwei




Re: [PATCH v3 2/2] vhost-vdpa: fix page pinning leakage in error path

2020-11-02 Thread si-wei liu



On 10/29/2020 2:53 PM, Michael S. Tsirkin wrote:

On Thu, Oct 15, 2020 at 01:17:14PM -0700, si-wei liu wrote:

On 10/15/2020 6:11 AM, Michael S. Tsirkin wrote:

On Thu, Oct 15, 2020 at 02:15:32PM +0800, Jason Wang wrote:

On 2020/10/14 上午7:42, si-wei liu wrote:

So what I suggest is to fix the pinning leakage first and do the
possible optimization on top (which is still questionable to me).

OK. Unfortunately, this was picked and got merged in upstream. So I will
post a follow up patch set to 1) revert the commit to the original
__get_free_page() implementation, and 2) fix the accounting and leakage
on top. Will it be fine?

Fine.

Thanks

Fine by me too.


Thanks, Michael & Jason. I will post the fix shortly. Stay tuned.

-Siwei

did I miss the patch?


The patch had been posted last Friday. See this thread:

https://lore.kernel.org/virtualization/1604043944-4897-2-git-send-email-si-wei@oracle.com/

-Siwei


Re: [PATCH net-next v5 1/4] virtio_ring: enable premapped mode whatever use_dma_api

2024-08-13 Thread Si-Wei Liu



Turning out this below commit to unconditionally enable premapped 
virtio-net:


commit f9dac92ba9081062a6477ee015bd3b8c5914efc4
Author: Xuan Zhuo 
Date:   Sat May 11 11:14:01 2024 +0800

leads to regression on VM with no ACCESS_PLATFORM, and with the sysctl 
value of:


- net.core.high_order_alloc_disable=1

which could see reliable crashes or scp failure (scp a file 100M in size 
to VM):


[  332.079333] __vm_enough_memory: pid: 18440, comm: sshd, bytes: 
5285790347661783040 not enough memory for the allocation

[  332.079651] [ cut here ]
[  332.079655] kernel BUG at mm/mmap.c:3514!
[  332.080095] invalid opcode:  [#1] PREEMPT SMP NOPTI
[  332.080826] CPU: 18 PID: 18440 Comm: sshd Kdump: loaded Not tainted 
6.10.0-2.x86_64 #2
[  332.081514] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.0-4.module+el8.9.0+90173+a3f3e83a 04/01/2014

[  332.082451] RIP: 0010:exit_mmap+0x3a1/0x3b0
[  332.082871] Code: be 01 00 00 00 48 89 df e8 0c 94 fe ff eb d7 be 01 
00 00 00 48 89 df e8 5d 98 fe ff eb be 31 f6 48 89 df e8 31 99 fe ff eb 
a8 <0f> 0b e8 68 bc ae 00 0f 1f 84 00 00 00 00 00 90 90 90 90 90 90 90

[  332.084230] RSP: 0018:9988b1c8f948 EFLAGS: 00010293
[  332.084635] RAX: 0406 RBX: 8d47583e7380 RCX: 

[  332.085171] RDX:  RSI:  RDI: 

[  332.085699] RBP: 008f R08:  R09: 

[  332.086233] R10:  R11:  R12: 
8d47583e7430
[  332.086761] R13: 8d47583e73c0 R14: 0406 R15: 
000495ae650dda58
[  332.087300] FS:  7ff443899980() GS:8df1c570() 
knlGS:

[  332.087888] CS:  0010 DS:  ES:  CR0: 80050033
[  332.088334] CR2: 55a42d30b730 CR3: 0102e956a004 CR4: 
00770ef0

[  332.088867] PKRU: 5554
[  332.089114] Call Trace:
[  332.089349] 
[  332.089556]  ? die+0x36/0x90
[  332.089818]  ? do_trap+0xed/0x110
[  332.090110]  ? exit_mmap+0x3a1/0x3b0
[  332.090411]  ? do_error_trap+0x6a/0xa0
[  332.090722]  ? exit_mmap+0x3a1/0x3b0
[  332.091029]  ? exc_invalid_op+0x50/0x80
[  332.091348]  ? exit_mmap+0x3a1/0x3b0
[  332.091648]  ? asm_exc_invalid_op+0x1a/0x20
[  332.091998]  ? exit_mmap+0x3a1/0x3b0
[  332.092299]  ? exit_mmap+0x1d6/0x3b0
[  332.092604] __mmput+0x3e/0x130
[  332.092882] dup_mm.constprop.0+0x10c/0x110
[  332.093226] copy_process+0xbd0/0x1570
[  332.093539] kernel_clone+0xbf/0x430
[  332.093838]  ? syscall_exit_work+0x103/0x130
[  332.094197] __do_sys_clone+0x66/0xa0
[  332.094506]  do_syscall_64+0x8c/0x1d0
[  332.094814]  ? srso_alias_return_thunk+0x5/0xfbef5
[  332.095198]  ? audit_reset_context+0x232/0x310
[  332.095558]  ? srso_alias_return_thunk+0x5/0xfbef5
[  332.095936]  ? syscall_exit_work+0x103/0x130
[  332.096288]  ? srso_alias_return_thunk+0x5/0xfbef5
[  332.096668]  ? syscall_exit_to_user_mode+0x7d/0x220
[  332.097059]  ? srso_alias_return_thunk+0x5/0xfbef5
[  332.097436]  ? do_syscall_64+0xba/0x1d0
[  332.097752]  ? srso_alias_return_thunk+0x5/0xfbef5
[  332.098137]  ? syscall_exit_to_user_mode+0x7d/0x220
[  332.098525]  ? srso_alias_return_thunk+0x5/0xfbef5
[  332.098903]  ? do_syscall_64+0xba/0x1d0
[  332.099227]  ? srso_alias_return_thunk+0x5/0xfbef5
[  332.099606]  ? __audit_filter_op+0xbe/0x140
[  332.099943]  ? srso_alias_return_thunk+0x5/0xfbef5
[  332.100328]  ? srso_alias_return_thunk+0x5/0xfbef5
[  332.100706]  ? srso_alias_return_thunk+0x5/0xfbef5
[  332.101089]  ? srso_alias_return_thunk+0x5/0xfbef5
[  332.101468]  ? wp_page_reuse+0x8e/0xb0
[  332.101779]  ? srso_alias_return_thunk+0x5/0xfbef5
[  332.102163]  ? do_wp_page+0xe6/0x470
[  332.102465]  ? srso_alias_return_thunk+0x5/0xfbef5
[  332.102843]  ? __handle_mm_fault+0x5ff/0x720
[  332.103197]  ? srso_alias_return_thunk+0x5/0xfbef5
[  332.103574]  ? __count_memcg_events+0x4d/0xd0
[  332.103938]  ? srso_alias_return_thunk+0x5/0xfbef5
[  332.104323]  ? count_memcg_events.constprop.0+0x26/0x50
[  332.104729]  ? srso_alias_return_thunk+0x5/0xfbef5
[  332.105114]  ? handle_mm_fault+0xae/0x320
[  332.105442]  ? srso_alias_return_thunk+0x5/0xfbef5
[  332.105820]  ? do_user_addr_fault+0x31f/0x6c0
[  332.106181]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  332.106576] RIP: 0033:0x7ff43f8f9a73
[  332.106876] Code: db 0f 85 28 01 00 00 64 4c 8b 0c 25 10 00 00 00 45 
31 c0 4d 8d 91 d0 02 00 00 31 d2 31 f6 bf 11
00 20 01 b8 38 00 00 00 0f 05 <48> 3d 00 f0 ff ff 0f 87 b9 00 00 00 41 
89 c5 85 c0 0f 85 c6 00 00
[  332.108163] RSP: 002b:7ffc690909b0 EFLAGS: 0246 ORIG_RAX: 
0038
[  332.108719] RAX: ffda RBX:  RCX: 
7ff43f8f9a73
[  332.109253] RDX:  RSI:  RDI: 
01200011
[  332.109782] RBP:  R08:  R09: 
7ff443899980
[  332.110313] R10: 7ff443899c50 R11: 0246 R12: 
0002
[  332.110842] R13: 562e56cd4780 R14: 

Re: [PATCH net-next v5 1/4] virtio_ring: enable premapped mode whatever use_dma_api

2024-08-13 Thread Si-Wei Liu

Hi Michael,

I'll look for someone else from Oracle to help you on this, as the 
relevant team already did verify internally that reverting all 4 patches 
from this series could help address the regression. Just reverting one 
single commit won't help.


  9719f039d328 virtio_net: remove the misleading comment
  defd28aa5acb virtio_net: rx remove premapped failover code
  a377ae542d8d virtio_net: big mode skip the unmap check
  f9dac92ba908 virtio_ring: enable premapped mode whatever use_dma_api

In case I fail to get someone to help, could you work with Darren 
(cc'ed) directly? He could reach out to the corresponding team in Oracle 
to help with testing.


Thanks,
-Siwei


On 8/13/2024 12:46 PM, Michael S. Tsirkin wrote:

Want to post a patchset to revert?





Re: [PATCH net-next v5 1/4] virtio_ring: enable premapped mode whatever use_dma_api

2024-08-19 Thread Si-Wei Liu

Hi,

May I know if this is really an intended fix to post officially, or just 
a workaround/probe to make the offset in page_frag happy when 
net_high_order_alloc_disable is true? In case it's the former, even 
though this could fix the issue, I would assume clamping to a smaller 
page_frag than a regular page size for every buffer may have certain 
performance regression for the merge-able buffer case? Can you justify 
the performance impact with some benchmark runs with larger MTU and 
merge-able rx buffers to prove the regression is negligible? You would 
need to compare against where you don't have the inadvertent 
virtnet_rq_dma cost on any page i.e. getting all 4 patches of this 
series reverted. Both tests with net_high_order_alloc_disable set to on 
and off are needed.


Thanks,
-Siwei

On 8/17/2024 6:20 AM, Xuan Zhuo wrote:

Hi, guys, I have a fix patch for this.
Could anybody test it?

Thanks.

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index af474cc191d0..426d68c2d01d 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -2492,13 +2492,15 @@ static unsigned int get_mergeable_buf_len(struct 
receive_queue *rq,
  {
 struct virtnet_info *vi = rq->vq->vdev->priv;
 const size_t hdr_len = vi->hdr_len;
-   unsigned int len;
+   unsigned int len, max_len;
+
+   max_len = PAGE_SIZE - ALIGN(sizeof(struct virtnet_rq_dma), 
L1_CACHE_BYTES);

 if (room)
-   return PAGE_SIZE - room;
+   return max_len - room;

 len = hdr_len + clamp_t(unsigned int, ewma_pkt_len_read(avg_pkt_len),
-   rq->min_buf_len, PAGE_SIZE - hdr_len);
+   rq->min_buf_len, max_len - hdr_len);

 return ALIGN(len, L1_CACHE_BYTES);
  }





Re: [PATCH net] virtio-net: fix overflow inside virtnet_rq_alloc

2024-08-20 Thread Si-Wei Liu




On 8/20/2024 12:19 AM, Xuan Zhuo wrote:

leads to regression on VM with the sysctl value of:

- net.core.high_order_alloc_disable=1

which could see reliable crashes or scp failure (scp a file 100M in size
to VM):

The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning
of a new frag. When the frag size is larger than PAGE_SIZE,
everything is fine. However, if the frag is only one page and the
total size of the buffer and virtnet_rq_dma is larger than one page, an
overflow may occur. In this case, if an overflow is possible, I adjust
the buffer size. If net.core.high_order_alloc_disable=1, the maximum
buffer size is 4096 - 16. If net.core.high_order_alloc_disable=0, only
the first buffer of the frag is affected.

Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever use_dma_api")
Reported-by: "Si-Wei Liu" 
Closes: 
http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
Signed-off-by: Xuan Zhuo 
---
  drivers/net/virtio_net.c | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c6af18948092..e5286a6da863 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -918,9 +918,6 @@ static void *virtnet_rq_alloc(struct receive_queue *rq, u32 
size, gfp_t gfp)
void *buf, *head;
dma_addr_t addr;
  
-	if (unlikely(!skb_page_frag_refill(size, alloc_frag, gfp)))

-   return NULL;
-
head = page_address(alloc_frag->page);
  
  	dma = head;

@@ -2421,6 +2418,9 @@ static int add_recvbuf_small(struct virtnet_info *vi, 
struct receive_queue *rq,
len = SKB_DATA_ALIGN(len) +
  SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
  
+	if (unlikely(!skb_page_frag_refill(len, &rq->alloc_frag, gfp)))

+   return -ENOMEM;
+
Do you want to document the assumption that small packet case won't end 
up crossing the page frag boundary unlike the mergeable case? Add a 
comment block to explain or a WARN_ON() check against potential overflow 
would work with me.



buf = virtnet_rq_alloc(rq, len, gfp);
if (unlikely(!buf))
return -ENOMEM;
@@ -2521,6 +2521,12 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi,
 */
len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room);
  
+	if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, gfp)))

+   return -ENOMEM;
+
+   if (!alloc_frag->offset && len + room + sizeof(struct virtnet_rq_dma) > 
alloc_frag->size)
+   len -= sizeof(struct virtnet_rq_dma);
+
This could address my previous concern for possibly regressing every 
buffer size for the mergeable case, thanks. Though I still don't get why 
carving up a small chunk from page_frag for storing the virtnet_rq_dma 
metadata, this would cause perf regression on certain MTU size that 
happens to end up with one more base page (and an extra descriptor as 
well) to be allocated compared to the previous code without the extra 
virtnet_rq_dma content. How hard would it be to allocate a dedicated 
struct to store the related information without affecting the (size of) 
datapath pages?


FWIW, out of the code review perspective, I've looked up the past 
conversations but didn't see comprehensive benchmark was done before 
removing the old code and making premap the sole default mode. Granted 
this would reduce the footprint of additional code and the associated 
maintaining cost immediately, but I would assume at least there should 
have been thorough performance runs upfront to guarantee no regression 
is seen with every possible use case, or the negative effect is 
comparatively negligible even though there's slight regression in some 
limited case. If that kind of perf measurement hadn't been done before 
getting accepted/merged, I think at least it should allow both modes to 
coexist for a while such that every user could gauge the performance effect.


Thanks,
-Siwei


buf = virtnet_rq_alloc(rq, len + room, gfp);
if (unlikely(!buf))
return -ENOMEM;





Re: [PATCH net] virtio-net: fix overflow inside virtnet_rq_alloc

2024-08-20 Thread Si-Wei Liu




On 8/20/2024 1:09 PM, Michael S. Tsirkin wrote:

On Tue, Aug 20, 2024 at 12:44:46PM -0700, Si-Wei Liu wrote:


On 8/20/2024 12:19 AM, Xuan Zhuo wrote:

leads to regression on VM with the sysctl value of:

- net.core.high_order_alloc_disable=1

which could see reliable crashes or scp failure (scp a file 100M in size
to VM):

The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning
of a new frag. When the frag size is larger than PAGE_SIZE,
everything is fine. However, if the frag is only one page and the
total size of the buffer and virtnet_rq_dma is larger than one page, an
overflow may occur. In this case, if an overflow is possible, I adjust
the buffer size. If net.core.high_order_alloc_disable=1, the maximum
buffer size is 4096 - 16. If net.core.high_order_alloc_disable=0, only
the first buffer of the frag is affected.

Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever use_dma_api")
Reported-by: "Si-Wei Liu" 
Closes: 
http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
Signed-off-by: Xuan Zhuo 
---
   drivers/net/virtio_net.c | 12 +---
   1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c6af18948092..e5286a6da863 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -918,9 +918,6 @@ static void *virtnet_rq_alloc(struct receive_queue *rq, u32 
size, gfp_t gfp)
void *buf, *head;
dma_addr_t addr;
-   if (unlikely(!skb_page_frag_refill(size, alloc_frag, gfp)))
-   return NULL;
-
head = page_address(alloc_frag->page);
dma = head;
@@ -2421,6 +2418,9 @@ static int add_recvbuf_small(struct virtnet_info *vi, 
struct receive_queue *rq,
len = SKB_DATA_ALIGN(len) +
  SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+   if (unlikely(!skb_page_frag_refill(len, &rq->alloc_frag, gfp)))
+   return -ENOMEM;
+

Do you want to document the assumption that small packet case won't end up
crossing the page frag boundary unlike the mergeable case? Add a comment
block to explain or a WARN_ON() check against potential overflow would work
with me.


buf = virtnet_rq_alloc(rq, len, gfp);
if (unlikely(!buf))
return -ENOMEM;
@@ -2521,6 +2521,12 @@ static int add_recvbuf_mergeable(struct virtnet_info *vi,
 */
len = get_mergeable_buf_len(rq, &rq->mrg_avg_pkt_len, room);
+   if (unlikely(!skb_page_frag_refill(len + room, alloc_frag, gfp)))
+   return -ENOMEM;
+
+   if (!alloc_frag->offset && len + room + sizeof(struct virtnet_rq_dma) > 
alloc_frag->size)
+   len -= sizeof(struct virtnet_rq_dma);
+

This could address my previous concern for possibly regressing every buffer
size for the mergeable case, thanks. Though I still don't get why carving up
a small chunk from page_frag for storing the virtnet_rq_dma metadata, this
would cause perf regression on certain MTU size

4Kbyte MTU exactly?
Close to that, excluding all headers upfront (depending on virtio 
features and header layout). The size of struct virtnet_rq_dma is now 16 
bytes, this could lead to performance impact on roughly: 16 / 4096 = 0.4 
% across all MTU sizes, more obviously to be seen with order-0 page 
allocations.


-Siwei




that happens to end up with
one more base page (and an extra descriptor as well) to be allocated
compared to the previous code without the extra virtnet_rq_dma content. How
hard would it be to allocate a dedicated struct to store the related
information without affecting the (size of) datapath pages?

FWIW, out of the code review perspective, I've looked up the past
conversations but didn't see comprehensive benchmark was done before
removing the old code and making premap the sole default mode. Granted this
would reduce the footprint of additional code and the associated maintaining
cost immediately, but I would assume at least there should have been
thorough performance runs upfront to guarantee no regression is seen with
every possible use case, or the negative effect is comparatively negligible
even though there's slight regression in some limited case. If that kind of
perf measurement hadn't been done before getting accepted/merged, I think at
least it should allow both modes to coexist for a while such that every user
could gauge the performance effect.

Thanks,
-Siwei


buf = virtnet_rq_alloc(rq, len + room, gfp);
if (unlikely(!buf))
return -ENOMEM;





Re: [PATCH net] virtio-net: fix overflow inside virtnet_rq_alloc

2024-08-28 Thread Si-Wei Liu
Just in case Xuan missed the last email while his email server kept 
rejecting incoming emails in the last week.: the patch doesn't seem fix 
the regression.


Xuan, given this is not very hard to reproduce and we have clearly 
stated how to, could you try to get the patch verified in house before 
posting to upstream? Or you were unable to reproduce locally?


Thanks,
-Siwei

On 8/21/2024 9:47 AM, Darren Kenny wrote:

Hi Michael,

On Tuesday, 2024-08-20 at 12:50:39 -04, Michael S. Tsirkin wrote:

On Tue, Aug 20, 2024 at 03:19:13PM +0800, Xuan Zhuo wrote:

leads to regression on VM with the sysctl value of:

- net.core.high_order_alloc_disable=1





which could see reliable crashes or scp failure (scp a file 100M in size
to VM):

The issue is that the virtnet_rq_dma takes up 16 bytes at the beginning
of a new frag. When the frag size is larger than PAGE_SIZE,
everything is fine. However, if the frag is only one page and the
total size of the buffer and virtnet_rq_dma is larger than one page, an
overflow may occur. In this case, if an overflow is possible, I adjust
the buffer size. If net.core.high_order_alloc_disable=1, the maximum
buffer size is 4096 - 16. If net.core.high_order_alloc_disable=0, only
the first buffer of the frag is affected.

Fixes: f9dac92ba908 ("virtio_ring: enable premapped mode whatever use_dma_api")
Reported-by: "Si-Wei Liu" 
Closes: 
http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com
Signed-off-by: Xuan Zhuo 


Darren, could you pls test and confirm?

Unfortunately with this change I seem to still get a panic as soon as I start a
download using wget:

[  144.055630] Kernel panic - not syncing: corrupted stack end detected inside 
scheduler
[  144.056249] CPU: 8 PID: 37894 Comm: sleep Kdump: loaded Not tainted 
6.10.0-1.el8uek.x86_64 #2
[  144.056850] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 
1.16.0-4.module+el8.9.0+90173+a3f3e83a 04/01/2014
[  144.057585] Call Trace:
[  144.057791]  
[  144.057973]  panic+0x347/0x370
[  144.058223]  schedule_debug.isra.0+0xfb/0x100
[  144.058565]  __schedule+0x58/0x6a0
[  144.058838]  ? refill_stock+0x26/0x50
[  144.059120]  ? srso_alias_return_thunk+0x5/0xfbef5
[  144.059473]  do_task_dead+0x42/0x50
[  144.059752]  do_exit+0x31e/0x4b0
[  144.060011]  ? __audit_syscall_entry+0xee/0x150
[  144.060352]  do_group_exit+0x30/0x80
[  144.060633]  __x64_sys_exit_group+0x18/0x20
[  144.060946]  do_syscall_64+0x8c/0x1c0
[  144.061228]  ? srso_alias_return_thunk+0x5/0xfbef5
[  144.061570]  ? __audit_filter_op+0xbe/0x140
[  144.061873]  ? srso_alias_return_thunk+0x5/0xfbef5
[  144.062204]  ? audit_reset_context+0x232/0x310
[  144.062514]  ? srso_alias_return_thunk+0x5/0xfbef5
[  144.062851]  ? syscall_exit_work+0x103/0x130
[  144.063148]  ? srso_alias_return_thunk+0x5/0xfbef5
[  144.063473]  ? syscall_exit_to_user_mode+0x77/0x220
[  144.063813]  ? srso_alias_return_thunk+0x5/0xfbef5
[  144.064142]  ? do_syscall_64+0xb9/0x1c0
[  144.064411]  ? srso_alias_return_thunk+0x5/0xfbef5
[  144.064747]  ? do_syscall_64+0xb9/0x1c0
[  144.065018]  ? srso_alias_return_thunk+0x5/0xfbef5
[  144.065345]  ? do_read_fault+0x109/0x1b0
[  144.065628]  ? srso_alias_return_thunk+0x5/0xfbef5
[  144.065961]  ? do_fault+0x1aa/0x2f0
[  144.066212]  ? handle_pte_fault+0x102/0x1a0
[  144.066503]  ? srso_alias_return_thunk+0x5/0xfbef5
[  144.066836]  ? __handle_mm_fault+0x5ed/0x710
[  144.067137]  ? srso_alias_return_thunk+0x5/0xfbef5
[  144.067464]  ? __count_memcg_events+0x72/0x110
[  144.067779]  ? srso_alias_return_thunk+0x5/0xfbef5
[  144.068106]  ? count_memcg_events.constprop.0+0x26/0x50
[  144.068457]  ? srso_alias_return_thunk+0x5/0xfbef5
[  144.068788]  ? handle_mm_fault+0xae/0x320
[  144.069068]  ? srso_alias_return_thunk+0x5/0xfbef5
[  144.069395]  ? do_user_addr_fault+0x34a/0x6b0
[  144.069708]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  144.070049] RIP: 0033:0x7fc5524f9c66
[  144.070307] Code: Unable to access opcode bytes at 0x7fc5524f9c3c.
[  144.070720] RSP: 002b:7ffee052beb8 EFLAGS: 0246 ORIG_RAX: 
00e7
[  144.071214] RAX: ffda RBX: 7fc5527bb860 RCX: 7fc5524f9c66
[  144.071684] RDX:  RSI: 003c RDI: 
[  144.072146] RBP:  R08: 00e7 R09: ff78
[  144.072608] R10: 7ffee052bdef R11: 0246 R12: 7fc5527bb860
[  144.073076] R13: 0002 R14: 7fc5527c4528 R15: 
[  144.073543]  
[  144.074780] Kernel Offset: 0x37c0 from 0x8100 (relocation 
range: 0x8000-0xbfff)

Thanks,

Darren.


---
  drivers/net/virtio_net.c | 12 +---
  1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index c6af18948092..e5286a6da863 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -918,9 +918,6 @@ static void *virtnet_rq_alloc(struct rec

Re: [PATCH 0/3] Revert "virtio_net: rx enable premapped mode by default"

2024-09-11 Thread Si-Wei Liu




On 9/11/2024 7:22 AM, Michael S. Tsirkin wrote:

Thanks a lot!
Could you retest Xuan Zhuo original patch

Which one? I thought Darren already did so?

-Siwei

  just to make sure it does
not fix the issue?

On Wed, Sep 11, 2024 at 03:18:55PM +0100, Darren Kenny wrote:

For the record, I got a chance to test these changes and confirmed that
they resolved the issue for me when applied on 6.11-rc7.

Tested-by: Darren Kenny 

Thanks,

Darren.

PS - I'll try get to looking at the other potential fix when I have time.

On Tuesday, 2024-09-10 at 08:12:06 -04, Michael S. Tsirkin wrote:

On Fri, Sep 06, 2024 at 08:31:34PM +0800, Xuan Zhuo wrote:

Regression: 
http://lore.kernel.org/all/8b20cc28-45a9-4643-8e87-ba164a540...@oracle.com

I still think that the patch can fix the problem, I hope Darren can re-test it
or give me more info.

 
http://lore.kernel.org/all/20240820071913.68004-1-xuanz...@linux.alibaba.com

If that can not work or Darren can not reply in time, Michael you can try this
patch set.

Just making sure netdev maintainers see this, this patch is for net.


Thanks.

Xuan Zhuo (3):
   Revert "virtio_net: rx remove premapped failover code"
   Revert "virtio_net: big mode skip the unmap check"
   virtio_net: disable premapped mode by default

  drivers/net/virtio_net.c | 95 +++-
  1 file changed, 46 insertions(+), 49 deletions(-)

--
2.32.0.3.g01195cf9f





Re: [PATCH v1] vdpa/mlx5: Restore the hardware used index after change map

2021-02-16 Thread Si-Wei Liu




On 2/10/2021 7:45 AM, Eli Cohen wrote:

On Wed, Feb 10, 2021 at 12:59:03AM -0800, Si-Wei Liu wrote:


On 2/9/2021 7:53 PM, Jason Wang wrote:

On 2021/2/10 上午10:30, Si-Wei Liu wrote:


On 2/8/2021 10:37 PM, Jason Wang wrote:

On 2021/2/9 下午2:12, Eli Cohen wrote:

On Tue, Feb 09, 2021 at 11:20:14AM +0800, Jason Wang wrote:

On 2021/2/8 下午6:04, Eli Cohen wrote:

On Mon, Feb 08, 2021 at 05:04:27PM +0800, Jason Wang wrote:

On 2021/2/8 下午2:37, Eli Cohen wrote:

On Mon, Feb 08, 2021 at 12:27:18PM +0800, Jason Wang wrote:

On 2021/2/6 上午7:07, Si-Wei Liu wrote:

On 2/3/2021 11:36 PM, Eli Cohen wrote:

When a change of memory map
occurs, the hardware resources
are destroyed
and then re-created again with
the new memory map. In such
case, we need
to restore the hardware
available and used indices. The
driver failed to
restore the used index which is added here.

Also, since the driver also
fails to reset the available and
used
indices upon device reset, fix
this here to avoid regression
caused by
the fact that used index may not be zero upon device reset.

Fixes: 1a86b377aa21 ("vdpa/mlx5:
Add VDPA driver for supported
mlx5
devices")
Signed-off-by: Eli Cohen
---
v0 -> v1:
Clear indices upon device reset

      drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 ++
      1 file changed, 18 insertions(+)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 88dde3455bfd..b5fe6d2ad22f 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -87,6 +87,7 @@ struct mlx5_vq_restore_info {
      u64 device_addr;
      u64 driver_addr;
      u16 avail_index;
+    u16 used_index;
      bool ready;
      struct vdpa_callback cb;
      bool restore;
@@ -121,6 +122,7 @@ struct mlx5_vdpa_virtqueue {
      u32 virtq_id;
      struct mlx5_vdpa_net *ndev;
      u16 avail_idx;
+    u16 used_idx;
      int fw_state;
        /* keep last in the struct */
@@ -804,6 +806,7 @@ static int
create_virtqueue(struct
mlx5_vdpa_net
*ndev, struct mlx5_vdpa_virtque
        obj_context =
MLX5_ADDR_OF(create_virtio_net_q_in,
in,
obj_context);
 
MLX5_SET(virtio_net_q_object,

obj_context, hw_available_index,
mvq->avail_idx);
+    MLX5_SET(virtio_net_q_object, obj_context, hw_used_index,
mvq->used_idx);
      MLX5_SET(virtio_net_q_object, obj_context,
queue_feature_bit_mask_12_3,
get_features_12_3(ndev->mvdev.actual_features));
      vq_ctx =
MLX5_ADDR_OF(virtio_net_q_object,
obj_context,
virtio_q_context);
@@ -1022,6 +1025,7 @@ static int
connect_qps(struct mlx5_vdpa_net
*ndev, struct mlx5_vdpa_virtqueue *m
      struct mlx5_virtq_attr {
      u8 state;
      u16 available_index;
+    u16 used_index;
      };
        static int
query_virtqueue(struct
mlx5_vdpa_net *ndev, struct
mlx5_vdpa_virtqueue *mvq,
@@ -1052,6 +1056,7 @@ static int query_virtqueue(struct
mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueu
      memset(attr, 0, sizeof(*attr));
      attr->state =
MLX5_GET(virtio_net_q_object,
obj_context, state);
      attr->available_index = MLX5_GET(virtio_net_q_object,
obj_context, hw_available_index);
+    attr->used_index =
MLX5_GET(virtio_net_q_object,
obj_context,
hw_used_index);
      kfree(out);
      return 0;
      @@ -1535,6 +1540,16 @@
static void
teardown_virtqueues(struct
mlx5_vdpa_net *ndev)
      }
      }
      +static void clear_virtqueues(struct mlx5_vdpa_net *ndev)
+{
+    int i;
+
+    for (i = ndev->mvdev.max_vqs - 1; i >= 0; i--) {
+    ndev->vqs[i].avail_idx = 0;
+    ndev->vqs[i].used_idx = 0;
+    }
+}
+
      /* TODO: cross-endian support */
      static inline bool
mlx5_vdpa_is_little_endian(struct
mlx5_vdpa_dev
*mvdev)
      {
@@ -1610,6 +1625,7 @@ static int save_channel_info(struct
mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqu
      return err;
        ri->avail_index = attr.available_index;
+    ri->used_index = attr.used_index;
      ri->ready = mvq->ready;
      ri->num_ent = mvq->num_ent;
      ri->desc_addr = mvq->desc_addr;
@@ -1654,6 +1670,7 @@ static void restore_channels_info(struct
mlx5_vdpa_net *ndev)
      continue;
        mvq->avail_idx = ri->avail_index;
+    mvq->used_idx = ri->used_index;
      mvq->ready = ri->ready;
      mvq->num_ent = ri->num_ent;
      mvq->desc_addr = ri->desc_addr;
@@ -1768,6 +1785,7 @@ static void mlx5_vdpa_set_status(struct
vdpa_device *vdev, u8 status)
      if (!status) {
 
mlx5_vdpa_info(mvdev,

"performing device reset\n");
      teardown_driver(ndev);
+    clear_virtqueues(ndev);

The clearing looks fine at the first
glance, as it aligns with the other
state cleanups floating around at
the same place. However, the thing
is
get_vq_state() is supposed to be
called right after to get sync'ed

Re: [PATCH 1/2] vdpa/mlx5: Fix suspend/resume index restoration

2021-02-17 Thread Si-Wei Liu




On 2/16/2021 8:20 AM, Eli Cohen wrote:

When we suspend the VM, the VDPA interface will be reset. When the VM is
resumed again, clear_virtqueues() will clear the available and used
indices resulting in hardware virqtqueue objects becoming out of sync.
We can avoid this function alltogether since qemu will clear them if
required, e.g. when the VM went through a reboot.

Moreover, since the hw available and used indices should always be
identical on query and should be restored to the same value same value
for virtqueues that complete in order, we set the single value provided
by set_vq_state(). In get_vq_state() we return the value of hardware
used index.

Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices")
Signed-off-by: Eli Cohen 

Acked-by: Si-Wei Liu 


---
  drivers/vdpa/mlx5/net/mlx5_vnet.c | 17 -
  1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index b8e9d525d66c..a51b0f86afe2 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1169,6 +1169,7 @@ static void suspend_vq(struct mlx5_vdpa_net *ndev, struct 
mlx5_vdpa_virtqueue *m
return;
}
mvq->avail_idx = attr.available_index;
+   mvq->used_idx = attr.used_index;
  }
  
  static void suspend_vqs(struct mlx5_vdpa_net *ndev)

@@ -1426,6 +1427,7 @@ static int mlx5_vdpa_set_vq_state(struct vdpa_device 
*vdev, u16 idx,
return -EINVAL;
}
  
+	mvq->used_idx = state->avail_index;

mvq->avail_idx = state->avail_index;
return 0;
  }
@@ -1443,7 +1445,7 @@ static int mlx5_vdpa_get_vq_state(struct vdpa_device 
*vdev, u16 idx, struct vdpa
 * that cares about emulating the index after vq is stopped.
 */
if (!mvq->initialized) {
-   state->avail_index = mvq->avail_idx;
+   state->avail_index = mvq->used_idx;
return 0;
}
  
@@ -1452,7 +1454,7 @@ static int mlx5_vdpa_get_vq_state(struct vdpa_device *vdev, u16 idx, struct vdpa

mlx5_vdpa_warn(mvdev, "failed to query virtqueue\n");
return err;
}
-   state->avail_index = attr.available_index;
+   state->avail_index = attr.used_index;
return 0;
  }
  
@@ -1532,16 +1534,6 @@ static void teardown_virtqueues(struct mlx5_vdpa_net *ndev)

}
  }
  
-static void clear_virtqueues(struct mlx5_vdpa_net *ndev)

-{
-   int i;
-
-   for (i = ndev->mvdev.max_vqs - 1; i >= 0; i--) {
-   ndev->vqs[i].avail_idx = 0;
-   ndev->vqs[i].used_idx = 0;
-   }
-}
-
  /* TODO: cross-endian support */
  static inline bool mlx5_vdpa_is_little_endian(struct mlx5_vdpa_dev *mvdev)
  {
@@ -1777,7 +1769,6 @@ static void mlx5_vdpa_set_status(struct vdpa_device 
*vdev, u8 status)
if (!status) {
mlx5_vdpa_info(mvdev, "performing device reset\n");
teardown_driver(ndev);
-   clear_virtqueues(ndev);
mlx5_vdpa_destroy_mr(&ndev->mvdev);
ndev->mvdev.status = 0;
++mvdev->generation;




Re: [PATCH 1/2] vdpa/mlx5: Fix suspend/resume index restoration

2021-02-17 Thread Si-Wei Liu




On 2/17/2021 1:20 PM, Michael S. Tsirkin wrote:

On Wed, Feb 17, 2021 at 11:42:48AM -0800, Si-Wei Liu wrote:


On 2/16/2021 8:20 AM, Eli Cohen wrote:

When we suspend the VM, the VDPA interface will be reset. When the VM is
resumed again, clear_virtqueues() will clear the available and used
indices resulting in hardware virqtqueue objects becoming out of sync.
We can avoid this function alltogether since qemu will clear them if
required, e.g. when the VM went through a reboot.

Moreover, since the hw available and used indices should always be
identical on query and should be restored to the same value same value
for virtqueues that complete in order, we set the single value provided
by set_vq_state(). In get_vq_state() we return the value of hardware
used index.

Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices")
Signed-off-by: Eli Cohen 

Acked-by: Si-Wei Liu 


Seems to also fix b35ccebe3ef76168aa2edaa35809c0232cb3578e, right?

I think so. It should have both "Fixes" tags.

-Siwei



---
   drivers/vdpa/mlx5/net/mlx5_vnet.c | 17 -
   1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index b8e9d525d66c..a51b0f86afe2 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1169,6 +1169,7 @@ static void suspend_vq(struct mlx5_vdpa_net *ndev, struct 
mlx5_vdpa_virtqueue *m
return;
}
mvq->avail_idx = attr.available_index;
+   mvq->used_idx = attr.used_index;
   }
   static void suspend_vqs(struct mlx5_vdpa_net *ndev)
@@ -1426,6 +1427,7 @@ static int mlx5_vdpa_set_vq_state(struct vdpa_device 
*vdev, u16 idx,
return -EINVAL;
}
+   mvq->used_idx = state->avail_index;
mvq->avail_idx = state->avail_index;
return 0;
   }
@@ -1443,7 +1445,7 @@ static int mlx5_vdpa_get_vq_state(struct vdpa_device 
*vdev, u16 idx, struct vdpa
 * that cares about emulating the index after vq is stopped.
 */
if (!mvq->initialized) {
-   state->avail_index = mvq->avail_idx;
+   state->avail_index = mvq->used_idx;
return 0;
}
@@ -1452,7 +1454,7 @@ static int mlx5_vdpa_get_vq_state(struct vdpa_device 
*vdev, u16 idx, struct vdpa
mlx5_vdpa_warn(mvdev, "failed to query virtqueue\n");
return err;
}
-   state->avail_index = attr.available_index;
+   state->avail_index = attr.used_index;
return 0;
   }
@@ -1532,16 +1534,6 @@ static void teardown_virtqueues(struct mlx5_vdpa_net 
*ndev)
}
   }
-static void clear_virtqueues(struct mlx5_vdpa_net *ndev)
-{
-   int i;
-
-   for (i = ndev->mvdev.max_vqs - 1; i >= 0; i--) {
-   ndev->vqs[i].avail_idx = 0;
-   ndev->vqs[i].used_idx = 0;
-   }
-}
-
   /* TODO: cross-endian support */
   static inline bool mlx5_vdpa_is_little_endian(struct mlx5_vdpa_dev *mvdev)
   {
@@ -1777,7 +1769,6 @@ static void mlx5_vdpa_set_status(struct vdpa_device 
*vdev, u8 status)
if (!status) {
mlx5_vdpa_info(mvdev, "performing device reset\n");
teardown_driver(ndev);
-   clear_virtqueues(ndev);
mlx5_vdpa_destroy_mr(&ndev->mvdev);
ndev->mvdev.status = 0;
++mvdev->generation;




Re: [PATCH v2 3/3] vdpa/mlx5: defer clear_virtqueues to until DRIVER_OK

2021-02-17 Thread Si-Wei Liu




On 2/16/2021 7:21 AM, Eli Cohen wrote:

On Thu, Feb 11, 2021 at 09:33:14AM +0200, Eli Cohen wrote:

On Wed, Feb 10, 2021 at 01:48:00PM -0800, Si-Wei Liu wrote:

While virtq is stopped,  get_vq_state() is supposed to
be  called to  get  sync'ed  with  the latest internal
avail_index from device. The saved avail_index is used
to restate  the virtq  once device is started.  Commit
b35ccebe3ef7 introduced the clear_virtqueues() routine
to  reset  the saved  avail_index,  however, the index
gets cleared a bit earlier before get_vq_state() tries
to read it. This would cause consistency problems when
virtq is restarted, e.g. through a series of link down
and link up events. We  could  defer  the  clearing of
avail_index  to  until  the  device  is to be started,
i.e. until  VIRTIO_CONFIG_S_DRIVER_OK  is set again in
set_status().

Fixes: b35ccebe3ef7 ("vdpa/mlx5: Restore the hardware used index after change 
map")
Signed-off-by: Si-Wei Liu 
Acked-by: Jason Wang 

Acked-by: Eli Cohen 


I take it back. I think we don't need to clear the indexes at all. In
case we need to restore indexes we'll get the right values through
set_vq_state(). If we suspend the virtqueue due to VM being suspended,
qemu will query first and will provide the the queried value. In case of
VM reboot, it will provide 0 in set_vq_state().

I am sending a patch that addresses both reboot and suspend.
With set_vq_state() repurposed to restoring used_index I'm fine with 
this approach.


Do I have to repost a v3 of this series while dropping the 3rd patch?

-Siwei



---
  drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 7c1f789..ce6aae8 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1777,7 +1777,6 @@ static void mlx5_vdpa_set_status(struct vdpa_device 
*vdev, u8 status)
if (!status) {
mlx5_vdpa_info(mvdev, "performing device reset\n");
teardown_driver(ndev);
-   clear_virtqueues(ndev);
mlx5_vdpa_destroy_mr(&ndev->mvdev);
ndev->mvdev.status = 0;
++mvdev->generation;
@@ -1786,6 +1785,7 @@ static void mlx5_vdpa_set_status(struct vdpa_device 
*vdev, u8 status)
  
  	if ((status ^ ndev->mvdev.status) & VIRTIO_CONFIG_S_DRIVER_OK) {

if (status & VIRTIO_CONFIG_S_DRIVER_OK) {
+   clear_virtqueues(ndev);
err = setup_driver(ndev);
if (err) {
mlx5_vdpa_warn(mvdev, "failed to setup 
driver\n");
--
1.8.3.1





Re: [PATCH v1] vdpa/mlx5: Restore the hardware used index after change map

2021-02-18 Thread Si-Wei Liu




On 2/17/2021 8:44 PM, Jason Wang wrote:


On 2021/2/10 下午4:59, Si-Wei Liu wrote:



On 2/9/2021 7:53 PM, Jason Wang wrote:


On 2021/2/10 上午10:30, Si-Wei Liu wrote:



On 2/8/2021 10:37 PM, Jason Wang wrote:


On 2021/2/9 下午2:12, Eli Cohen wrote:

On Tue, Feb 09, 2021 at 11:20:14AM +0800, Jason Wang wrote:

On 2021/2/8 下午6:04, Eli Cohen wrote:

On Mon, Feb 08, 2021 at 05:04:27PM +0800, Jason Wang wrote:

On 2021/2/8 下午2:37, Eli Cohen wrote:

On Mon, Feb 08, 2021 at 12:27:18PM +0800, Jason Wang wrote:

On 2021/2/6 上午7:07, Si-Wei Liu wrote:

On 2/3/2021 11:36 PM, Eli Cohen wrote:
When a change of memory map occurs, the hardware resources 
are destroyed
and then re-created again with the new memory map. In such 
case, we need
to restore the hardware available and used indices. The 
driver failed to

restore the used index which is added here.

Also, since the driver also fails to reset the available 
and used
indices upon device reset, fix this here to avoid 
regression caused by

the fact that used index may not be zero upon device reset.

Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for 
supported mlx5

devices")
Signed-off-by: Eli Cohen
---
v0 -> v1:
Clear indices upon device reset

     drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 
++

     1 file changed, 18 insertions(+)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 88dde3455bfd..b5fe6d2ad22f 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -87,6 +87,7 @@ struct mlx5_vq_restore_info {
     u64 device_addr;
     u64 driver_addr;
     u16 avail_index;
+    u16 used_index;
     bool ready;
     struct vdpa_callback cb;
     bool restore;
@@ -121,6 +122,7 @@ struct mlx5_vdpa_virtqueue {
     u32 virtq_id;
     struct mlx5_vdpa_net *ndev;
     u16 avail_idx;
+    u16 used_idx;
     int fw_state;
       /* keep last in the struct */
@@ -804,6 +806,7 @@ static int create_virtqueue(struct 
mlx5_vdpa_net

*ndev, struct mlx5_vdpa_virtque
       obj_context = 
MLX5_ADDR_OF(create_virtio_net_q_in, in,

obj_context);
     MLX5_SET(virtio_net_q_object, obj_context, 
hw_available_index,

mvq->avail_idx);
+    MLX5_SET(virtio_net_q_object, obj_context, 
hw_used_index,

mvq->used_idx);
     MLX5_SET(virtio_net_q_object, obj_context,
queue_feature_bit_mask_12_3,
get_features_12_3(ndev->mvdev.actual_features));
     vq_ctx = MLX5_ADDR_OF(virtio_net_q_object, 
obj_context,

virtio_q_context);
@@ -1022,6 +1025,7 @@ static int connect_qps(struct 
mlx5_vdpa_net

*ndev, struct mlx5_vdpa_virtqueue *m
     struct mlx5_virtq_attr {
     u8 state;
     u16 available_index;
+    u16 used_index;
     };
       static int query_virtqueue(struct mlx5_vdpa_net 
*ndev, struct

mlx5_vdpa_virtqueue *mvq,
@@ -1052,6 +1056,7 @@ static int query_virtqueue(struct
mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueu
     memset(attr, 0, sizeof(*attr));
     attr->state = MLX5_GET(virtio_net_q_object, 
obj_context, state);
     attr->available_index = 
MLX5_GET(virtio_net_q_object,

obj_context, hw_available_index);
+    attr->used_index = MLX5_GET(virtio_net_q_object, 
obj_context,

hw_used_index);
     kfree(out);
     return 0;
     @@ -1535,6 +1540,16 @@ static void 
teardown_virtqueues(struct

mlx5_vdpa_net *ndev)
     }
     }
     +static void clear_virtqueues(struct mlx5_vdpa_net 
*ndev)

+{
+    int i;
+
+    for (i = ndev->mvdev.max_vqs - 1; i >= 0; i--) {
+    ndev->vqs[i].avail_idx = 0;
+    ndev->vqs[i].used_idx = 0;
+    }
+}
+
     /* TODO: cross-endian support */
     static inline bool mlx5_vdpa_is_little_endian(struct 
mlx5_vdpa_dev

*mvdev)
     {
@@ -1610,6 +1625,7 @@ static int save_channel_info(struct
mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqu
     return err;
       ri->avail_index = attr.available_index;
+    ri->used_index = attr.used_index;
     ri->ready = mvq->ready;
     ri->num_ent = mvq->num_ent;
     ri->desc_addr = mvq->desc_addr;
@@ -1654,6 +1670,7 @@ static void 
restore_channels_info(struct

mlx5_vdpa_net *ndev)
     continue;
       mvq->avail_idx = ri->avail_index;
+    mvq->used_idx = ri->used_index;
     mvq->ready = ri->ready;
     mvq->num_ent = ri->num_ent;
     mvq->desc_addr = ri->desc_addr;
@@ -1768,6 +1785,7 @@ static void mlx5_vdpa_set_status(struct
vdpa_device *vdev, u8 status)
     if (!status) {
     mlx5_vdpa_info(mvdev, "performing device 
reset\n");

     teardown_driver(ndev);
+    clear_virtqueues(ndev);
The clearing looks fine at the first glance, as it aligns 
with the other
state cleanups floating around at the same place. However, 
the thing is
get_vq_state() is supposed to be called right after to get 
sync'ed with
the latest internal

[PATCH] vdpa/mlx5: set_features should allow reset to zero

2021-02-19 Thread Si-Wei Liu
Commit 452639a64ad8 ("vdpa: make sure set_features is invoked
for legacy") made an exception for legacy guests to reset
features to 0, when config space is accessed before features
are set. We should relieve the verify_min_features() check
and allow features reset to 0 for this case.

It's worth noting that not just legacy guests could access
config space before features are set. For instance, when
feature VIRTIO_NET_F_MTU is advertised some modern driver
will try to access and validate the MTU present in the config
space before virtio features are set. Rejecting reset to 0
prematurely causes correct MTU and link status unable to load
for the very first config space access, rendering issues like
guest showing inaccurate MTU value, or failure to reject
out-of-range MTU.

Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices")
Signed-off-by: Si-Wei Liu 
---
 drivers/vdpa/mlx5/net/mlx5_vnet.c | 15 +--
 1 file changed, 1 insertion(+), 14 deletions(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 7c1f789..540dd67 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1490,14 +1490,6 @@ static u64 mlx5_vdpa_get_features(struct vdpa_device 
*vdev)
return mvdev->mlx_features;
 }
 
-static int verify_min_features(struct mlx5_vdpa_dev *mvdev, u64 features)
-{
-   if (!(features & BIT_ULL(VIRTIO_F_ACCESS_PLATFORM)))
-   return -EOPNOTSUPP;
-
-   return 0;
-}
-
 static int setup_virtqueues(struct mlx5_vdpa_net *ndev)
 {
int err;
@@ -1558,18 +1550,13 @@ static int mlx5_vdpa_set_features(struct vdpa_device 
*vdev, u64 features)
 {
struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);
struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev);
-   int err;
 
print_features(mvdev, features, true);
 
-   err = verify_min_features(mvdev, features);
-   if (err)
-   return err;
-
ndev->mvdev.actual_features = features & ndev->mvdev.mlx_features;
ndev->config.mtu = cpu_to_mlx5vdpa16(mvdev, ndev->mtu);
ndev->config.status |= cpu_to_mlx5vdpa16(mvdev, VIRTIO_NET_S_LINK_UP);
-   return err;
+   return 0;
 }
 
 static void mlx5_vdpa_set_config_cb(struct vdpa_device *vdev, struct 
vdpa_callback *cb)
-- 
1.8.3.1



Re: [PATCH v1] vdpa/mlx5: Restore the hardware used index after change map

2021-02-19 Thread Si-Wei Liu




On 2/18/2021 7:10 PM, Jason Wang wrote:


On 2021/2/18 8:43 下午, Si-Wei Liu wrote:



On 2/17/2021 8:44 PM, Jason Wang wrote:


On 2021/2/10 下午4:59, Si-Wei Liu wrote:



On 2/9/2021 7:53 PM, Jason Wang wrote:


On 2021/2/10 上午10:30, Si-Wei Liu wrote:



On 2/8/2021 10:37 PM, Jason Wang wrote:


On 2021/2/9 下午2:12, Eli Cohen wrote:

On Tue, Feb 09, 2021 at 11:20:14AM +0800, Jason Wang wrote:

On 2021/2/8 下午6:04, Eli Cohen wrote:

On Mon, Feb 08, 2021 at 05:04:27PM +0800, Jason Wang wrote:

On 2021/2/8 下午2:37, Eli Cohen wrote:

On Mon, Feb 08, 2021 at 12:27:18PM +0800, Jason Wang wrote:

On 2021/2/6 上午7:07, Si-Wei Liu wrote:

On 2/3/2021 11:36 PM, Eli Cohen wrote:
When a change of memory map occurs, the hardware 
resources are destroyed
and then re-created again with the new memory map. In 
such case, we need
to restore the hardware available and used indices. The 
driver failed to

restore the used index which is added here.

Also, since the driver also fails to reset the available 
and used
indices upon device reset, fix this here to avoid 
regression caused by

the fact that used index may not be zero upon device reset.

Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for 
supported mlx5

devices")
Signed-off-by: Eli Cohen
---
v0 -> v1:
Clear indices upon device reset

     drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 
++

     1 file changed, 18 insertions(+)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 88dde3455bfd..b5fe6d2ad22f 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -87,6 +87,7 @@ struct mlx5_vq_restore_info {
     u64 device_addr;
     u64 driver_addr;
     u16 avail_index;
+    u16 used_index;
     bool ready;
     struct vdpa_callback cb;
     bool restore;
@@ -121,6 +122,7 @@ struct mlx5_vdpa_virtqueue {
     u32 virtq_id;
     struct mlx5_vdpa_net *ndev;
     u16 avail_idx;
+    u16 used_idx;
     int fw_state;
       /* keep last in the struct */
@@ -804,6 +806,7 @@ static int create_virtqueue(struct 
mlx5_vdpa_net

*ndev, struct mlx5_vdpa_virtque
       obj_context = 
MLX5_ADDR_OF(create_virtio_net_q_in, in,

obj_context);
     MLX5_SET(virtio_net_q_object, obj_context, 
hw_available_index,

mvq->avail_idx);
+    MLX5_SET(virtio_net_q_object, obj_context, 
hw_used_index,

mvq->used_idx);
     MLX5_SET(virtio_net_q_object, obj_context,
queue_feature_bit_mask_12_3,
get_features_12_3(ndev->mvdev.actual_features));
     vq_ctx = MLX5_ADDR_OF(virtio_net_q_object, 
obj_context,

virtio_q_context);
@@ -1022,6 +1025,7 @@ static int connect_qps(struct 
mlx5_vdpa_net

*ndev, struct mlx5_vdpa_virtqueue *m
     struct mlx5_virtq_attr {
     u8 state;
     u16 available_index;
+    u16 used_index;
     };
       static int query_virtqueue(struct mlx5_vdpa_net 
*ndev, struct

mlx5_vdpa_virtqueue *mvq,
@@ -1052,6 +1056,7 @@ static int query_virtqueue(struct
mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueu
     memset(attr, 0, sizeof(*attr));
     attr->state = MLX5_GET(virtio_net_q_object, 
obj_context, state);
     attr->available_index = 
MLX5_GET(virtio_net_q_object,

obj_context, hw_available_index);
+    attr->used_index = MLX5_GET(virtio_net_q_object, 
obj_context,

hw_used_index);
     kfree(out);
     return 0;
     @@ -1535,6 +1540,16 @@ static void 
teardown_virtqueues(struct

mlx5_vdpa_net *ndev)
     }
     }
     +static void clear_virtqueues(struct mlx5_vdpa_net 
*ndev)

+{
+    int i;
+
+    for (i = ndev->mvdev.max_vqs - 1; i >= 0; i--) {
+    ndev->vqs[i].avail_idx = 0;
+    ndev->vqs[i].used_idx = 0;
+    }
+}
+
     /* TODO: cross-endian support */
     static inline bool 
mlx5_vdpa_is_little_endian(struct mlx5_vdpa_dev

*mvdev)
     {
@@ -1610,6 +1625,7 @@ static int save_channel_info(struct
mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqu
     return err;
       ri->avail_index = attr.available_index;
+    ri->used_index = attr.used_index;
     ri->ready = mvq->ready;
     ri->num_ent = mvq->num_ent;
     ri->desc_addr = mvq->desc_addr;
@@ -1654,6 +1670,7 @@ static void 
restore_channels_info(struct

mlx5_vdpa_net *ndev)
     continue;
       mvq->avail_idx = ri->avail_index;
+    mvq->used_idx = ri->used_index;
     mvq->ready = ri->ready;
     mvq->num_ent = ri->num_ent;
     mvq->desc_addr = ri->desc_addr;
@@ -1768,6 +1785,7 @@ static void 
mlx5_vdpa_set_status(struct

vdpa_device *vdev, u8 status)
     if (!status) {
     mlx5_vdpa_info(mvdev, "performing device 
reset\n");

     teardown_driver(ndev);
+    clear_virtqueues(ndev);
The clearing looks fine at the first glance, as it aligns 
with the other
state cleanups floating around at the same place. 
However, the thing is
get_vq_st

Re: [PATCH 1/2] vdpa/mlx5: Fix suspend/resume index restoration

2021-02-19 Thread Si-Wei Liu




On 2/17/2021 11:42 AM, Si-Wei Liu wrote:



On 2/16/2021 8:20 AM, Eli Cohen wrote:

When we suspend the VM, the VDPA interface will be reset. When the VM is
resumed again, clear_virtqueues() will clear the available and used
indices resulting in hardware virqtqueue objects becoming out of sync.
We can avoid this function alltogether since qemu will clear them if
required, e.g. when the VM went through a reboot.

Moreover, since the hw available and used indices should always be
identical on query and should be restored to the same value same value
for virtqueues that complete in order, we set the single value provided
by set_vq_state(). In get_vq_state() we return the value of hardware
used index.

Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 
devices")

Signed-off-by: Eli Cohen 

Acked-by: Si-Wei Liu 


I'd take it back for the moment, according to Jason's latest comment. 
Discussion still going.





---
  drivers/vdpa/mlx5/net/mlx5_vnet.c | 17 -
  1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c

index b8e9d525d66c..a51b0f86afe2 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1169,6 +1169,7 @@ static void suspend_vq(struct mlx5_vdpa_net 
*ndev, struct mlx5_vdpa_virtqueue *m

  return;
  }
  mvq->avail_idx = attr.available_index;
+    mvq->used_idx = attr.used_index;
  }
    static void suspend_vqs(struct mlx5_vdpa_net *ndev)
@@ -1426,6 +1427,7 @@ static int mlx5_vdpa_set_vq_state(struct 
vdpa_device *vdev, u16 idx,

  return -EINVAL;
  }
  +    mvq->used_idx = state->avail_index;
  mvq->avail_idx = state->avail_index;
This is where things starts getting interesting. According to Jason, the 
original expectation of this API would be to restore the internal 
last_avail_index in the hardware. With Mellanox network vDPA hardware 
implementation, there's no such last_avail_index thing in the hardware 
but only a single last_used_index representing both indices, which 
should always be the same and in sync with each other. So from migration 
point of view, it appears logical to restore the saved last_avail_index 
to the last_used_index in the hardware, right? But what is the point to 
restore mvq->avail_idx?


Actually, this code path is being repurposed to address the index reset 
problem in the device reset scenario. Where the mvq->avail_idx and 
mvq->used_idx are both reset to 0 once device is reset. This is a bit 
crossing the boundary to me.




  return 0;
  }
@@ -1443,7 +1445,7 @@ static int mlx5_vdpa_get_vq_state(struct 
vdpa_device *vdev, u16 idx, struct vdpa

   * that cares about emulating the index after vq is stopped.
   */
  if (!mvq->initialized) {
-    state->avail_index = mvq->avail_idx;
+    state->avail_index = mvq->used_idx;
This is where the last_used_index gets loaded from the hardware (when 
device is stopped).



  return 0;
  }
  @@ -1452,7 +1454,7 @@ static int mlx5_vdpa_get_vq_state(struct 
vdpa_device *vdev, u16 idx, struct vdpa

  mlx5_vdpa_warn(mvdev, "failed to query virtqueue\n");
  return err;
  }
-    state->avail_index = attr.available_index;
+    state->avail_index = attr.used_index;

This code path never gets called from userspace (when device is running).

-Siwei


  return 0;
  }
  @@ -1532,16 +1534,6 @@ static void teardown_virtqueues(struct 
mlx5_vdpa_net *ndev)

  }
  }
  -static void clear_virtqueues(struct mlx5_vdpa_net *ndev)
-{
-    int i;
-
-    for (i = ndev->mvdev.max_vqs - 1; i >= 0; i--) {
-    ndev->vqs[i].avail_idx = 0;
-    ndev->vqs[i].used_idx = 0;
-    }
-}
-
  /* TODO: cross-endian support */
  static inline bool mlx5_vdpa_is_little_endian(struct mlx5_vdpa_dev 
*mvdev)

  {
@@ -1777,7 +1769,6 @@ static void mlx5_vdpa_set_status(struct 
vdpa_device *vdev, u8 status)

  if (!status) {
  mlx5_vdpa_info(mvdev, "performing device reset\n");
  teardown_driver(ndev);
-    clear_virtqueues(ndev);
  mlx5_vdpa_destroy_mr(&ndev->mvdev);
  ndev->mvdev.status = 0;
  ++mvdev->generation;






Re: [PATCH v1] vdpa/mlx5: Restore the hardware used index after change map

2021-02-19 Thread Si-Wei Liu




On 2/19/2021 6:38 PM, Jason Wang wrote:



Right now the value is exposed to userspace via GET_VRING_BASE, so 
only last_avail_idx is synced. If we need sync last_used_idx, we 
should also sync pending indices which requires more thoughts.
Technically it doesn't sound right - crossing the boundary a bit even 
with simplified form of assumption. But depending on how userspace 
could make use of this API, it doesn't seem it breaks existing 
functionality for the moment.



I don't get here, maybe you can explain a little bit more? 

Please refer to the email I just sent.

https://lore.kernel.org/lkml/033b0806-4037-5755-a1fa-91dbb58ba...@oracle.com/

-Siwei


Re: [PATCH] vdpa/mlx5: set_features should allow reset to zero

2021-02-22 Thread Si-Wei Liu




On 2/21/2021 8:14 PM, Jason Wang wrote:


On 2021/2/19 7:54 下午, Si-Wei Liu wrote:

Commit 452639a64ad8 ("vdpa: make sure set_features is invoked
for legacy") made an exception for legacy guests to reset
features to 0, when config space is accessed before features
are set. We should relieve the verify_min_features() check
and allow features reset to 0 for this case.

It's worth noting that not just legacy guests could access
config space before features are set. For instance, when
feature VIRTIO_NET_F_MTU is advertised some modern driver
will try to access and validate the MTU present in the config
space before virtio features are set.



This looks like a spec violation:

"

The following driver-read-only field, mtu only exists if 
VIRTIO_NET_F_MTU is set. This field specifies the maximum MTU for the 
driver to use.

"

Do we really want to workaround this?


Isn't the commit 452639a64ad8 itself is a workaround for legacy guest?

I think the point is, since there's legacy guest we'd have to support, 
this host side workaround is unavoidable. Although I agree the violating 
driver should be fixed (yes, it's in today's upstream kernel which 
exists for a while now).


-Siwei



Thanks



Rejecting reset to 0
prematurely causes correct MTU and link status unable to load
for the very first config space access, rendering issues like
guest showing inaccurate MTU value, or failure to reject
out-of-range MTU.

Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 
devices")

Signed-off-by: Si-Wei Liu 
---
  drivers/vdpa/mlx5/net/mlx5_vnet.c | 15 +--
  1 file changed, 1 insertion(+), 14 deletions(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c

index 7c1f789..540dd67 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1490,14 +1490,6 @@ static u64 mlx5_vdpa_get_features(struct 
vdpa_device *vdev)

  return mvdev->mlx_features;
  }
  -static int verify_min_features(struct mlx5_vdpa_dev *mvdev, u64 
features)

-{
-    if (!(features & BIT_ULL(VIRTIO_F_ACCESS_PLATFORM)))
-    return -EOPNOTSUPP;
-
-    return 0;
-}
-
  static int setup_virtqueues(struct mlx5_vdpa_net *ndev)
  {
  int err;
@@ -1558,18 +1550,13 @@ static int mlx5_vdpa_set_features(struct 
vdpa_device *vdev, u64 features)

  {
  struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);
  struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev);
-    int err;
    print_features(mvdev, features, true);
  -    err = verify_min_features(mvdev, features);
-    if (err)
-    return err;
-
  ndev->mvdev.actual_features = features & ndev->mvdev.mlx_features;
  ndev->config.mtu = cpu_to_mlx5vdpa16(mvdev, ndev->mtu);
  ndev->config.status |= cpu_to_mlx5vdpa16(mvdev, 
VIRTIO_NET_S_LINK_UP);

-    return err;
+    return 0;
  }
    static void mlx5_vdpa_set_config_cb(struct vdpa_device *vdev, 
struct vdpa_callback *cb)






Re: [PATCH] vdpa/mlx5: set_features should allow reset to zero

2021-02-22 Thread Si-Wei Liu




On 2/21/2021 11:34 PM, Michael S. Tsirkin wrote:

On Mon, Feb 22, 2021 at 12:14:17PM +0800, Jason Wang wrote:

On 2021/2/19 7:54 下午, Si-Wei Liu wrote:

Commit 452639a64ad8 ("vdpa: make sure set_features is invoked
for legacy") made an exception for legacy guests to reset
features to 0, when config space is accessed before features
are set. We should relieve the verify_min_features() check
and allow features reset to 0 for this case.

It's worth noting that not just legacy guests could access
config space before features are set. For instance, when
feature VIRTIO_NET_F_MTU is advertised some modern driver
will try to access and validate the MTU present in the config
space before virtio features are set.


This looks like a spec violation:

"

The following driver-read-only field, mtu only exists if VIRTIO_NET_F_MTU is
set.
This field specifies the maximum MTU for the driver to use.
"

Do we really want to workaround this?

Thanks

And also:

The driver MUST follow this sequence to initialize a device:
1. Reset the device.
2. Set the ACKNOWLEDGE status bit: the guest OS has noticed the device.
3. Set the DRIVER status bit: the guest OS knows how to drive the device.
4. Read device feature bits, and write the subset of feature bits understood by 
the OS and driver to the
device. During this step the driver MAY read (but MUST NOT write) the 
device-specific configuration
fields to check that it can support the device before accepting it.
5. Set the FEATURES_OK status bit. The driver MUST NOT accept new feature bits 
after this step.
6. Re-read device status to ensure the FEATURES_OK bit is still set: otherwise, 
the device does not
support our subset of features and the device is unusable.
7. Perform device-specific setup, including discovery of virtqueues for the 
device, optional per-bus setup,
reading and possibly writing the device’s virtio configuration space, and 
population of virtqueues.
8. Set the DRIVER_OK status bit. At this point the device is “live”.


so accessing config space before FEATURES_OK is a spec violation, right?
It is, but it's not relevant to what this commit tries to address. I 
thought the legacy guest still needs to be supported.


Having said, a separate patch has to be posted to fix the guest driver 
issue where this discrepancy is introduced to virtnet_validate() (since 
commit fe36cbe067). But it's not technically related to this patch.


-Siwei





Rejecting reset to 0
prematurely causes correct MTU and link status unable to load
for the very first config space access, rendering issues like
guest showing inaccurate MTU value, or failure to reject
out-of-range MTU.

Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices")
Signed-off-by: Si-Wei Liu 
---
   drivers/vdpa/mlx5/net/mlx5_vnet.c | 15 +--
   1 file changed, 1 insertion(+), 14 deletions(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 7c1f789..540dd67 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1490,14 +1490,6 @@ static u64 mlx5_vdpa_get_features(struct vdpa_device 
*vdev)
return mvdev->mlx_features;
   }
-static int verify_min_features(struct mlx5_vdpa_dev *mvdev, u64 features)
-{
-   if (!(features & BIT_ULL(VIRTIO_F_ACCESS_PLATFORM)))
-   return -EOPNOTSUPP;
-
-   return 0;
-}
-
   static int setup_virtqueues(struct mlx5_vdpa_net *ndev)
   {
int err;
@@ -1558,18 +1550,13 @@ static int mlx5_vdpa_set_features(struct vdpa_device 
*vdev, u64 features)
   {
struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);
struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev);
-   int err;
print_features(mvdev, features, true);
-   err = verify_min_features(mvdev, features);
-   if (err)
-   return err;
-
ndev->mvdev.actual_features = features & ndev->mvdev.mlx_features;
ndev->config.mtu = cpu_to_mlx5vdpa16(mvdev, ndev->mtu);
ndev->config.status |= cpu_to_mlx5vdpa16(mvdev, VIRTIO_NET_S_LINK_UP);
-   return err;
+   return 0;
   }
   static void mlx5_vdpa_set_config_cb(struct vdpa_device *vdev, struct 
vdpa_callback *cb)




Re: [PATCH] vdpa/mlx5: set_features should allow reset to zero

2021-02-23 Thread Si-Wei Liu




On 2/23/2021 5:26 AM, Michael S. Tsirkin wrote:

On Tue, Feb 23, 2021 at 10:03:57AM +0800, Jason Wang wrote:

On 2021/2/23 9:12 上午, Si-Wei Liu wrote:


On 2/21/2021 11:34 PM, Michael S. Tsirkin wrote:

On Mon, Feb 22, 2021 at 12:14:17PM +0800, Jason Wang wrote:

On 2021/2/19 7:54 下午, Si-Wei Liu wrote:

Commit 452639a64ad8 ("vdpa: make sure set_features is invoked
for legacy") made an exception for legacy guests to reset
features to 0, when config space is accessed before features
are set. We should relieve the verify_min_features() check
and allow features reset to 0 for this case.

It's worth noting that not just legacy guests could access
config space before features are set. For instance, when
feature VIRTIO_NET_F_MTU is advertised some modern driver
will try to access and validate the MTU present in the config
space before virtio features are set.

This looks like a spec violation:

"

The following driver-read-only field, mtu only exists if
VIRTIO_NET_F_MTU is
set.
This field specifies the maximum MTU for the driver to use.
"

Do we really want to workaround this?

Thanks

And also:

The driver MUST follow this sequence to initialize a device:
1. Reset the device.
2. Set the ACKNOWLEDGE status bit: the guest OS has noticed the device.
3. Set the DRIVER status bit: the guest OS knows how to drive the
device.
4. Read device feature bits, and write the subset of feature bits
understood by the OS and driver to the
device. During this step the driver MAY read (but MUST NOT write)
the device-specific configuration
fields to check that it can support the device before accepting it.
5. Set the FEATURES_OK status bit. The driver MUST NOT accept new
feature bits after this step.
6. Re-read device status to ensure the FEATURES_OK bit is still set:
otherwise, the device does not
support our subset of features and the device is unusable.
7. Perform device-specific setup, including discovery of virtqueues
for the device, optional per-bus setup,
reading and possibly writing the device’s virtio configuration
space, and population of virtqueues.
8. Set the DRIVER_OK status bit. At this point the device is “live”.


so accessing config space before FEATURES_OK is a spec violation, right?

It is, but it's not relevant to what this commit tries to address. I
thought the legacy guest still needs to be supported.

Having said, a separate patch has to be posted to fix the guest driver
issue where this discrepancy is introduced to virtnet_validate() (since
commit fe36cbe067). But it's not technically related to this patch.

-Siwei


I think it's a bug to read config space in validate, we should move it to
virtnet_probe().

Thanks

I take it back, reading but not writing seems to be explicitly allowed by spec.
So our way to detect a legacy guest is bogus, need to think what is
the best way to handle this.
Then maybe revert commit fe36cbe067 and friends, and have QEMU detect 
legacy guest? Supposedly only config space write access needs to be 
guarded before setting FEATURES_OK.


-Siwie




Rejecting reset to 0
prematurely causes correct MTU and link status unable to load
for the very first config space access, rendering issues like
guest showing inaccurate MTU value, or failure to reject
out-of-range MTU.

Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for
supported mlx5 devices")
Signed-off-by: Si-Wei Liu 
---
    drivers/vdpa/mlx5/net/mlx5_vnet.c | 15 +--
    1 file changed, 1 insertion(+), 14 deletions(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 7c1f789..540dd67 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1490,14 +1490,6 @@ static u64
mlx5_vdpa_get_features(struct vdpa_device *vdev)
    return mvdev->mlx_features;
    }
-static int verify_min_features(struct mlx5_vdpa_dev *mvdev,
u64 features)
-{
-    if (!(features & BIT_ULL(VIRTIO_F_ACCESS_PLATFORM)))
-    return -EOPNOTSUPP;
-
-    return 0;
-}
-
    static int setup_virtqueues(struct mlx5_vdpa_net *ndev)
    {
    int err;
@@ -1558,18 +1550,13 @@ static int
mlx5_vdpa_set_features(struct vdpa_device *vdev, u64
features)
    {
    struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);
    struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev);
-    int err;
    print_features(mvdev, features, true);
-    err = verify_min_features(mvdev, features);
-    if (err)
-    return err;
-
    ndev->mvdev.actual_features = features &
ndev->mvdev.mlx_features;
    ndev->config.mtu = cpu_to_mlx5vdpa16(mvdev, ndev->mtu);
    ndev->config.status |= cpu_to_mlx5vdpa16(mvdev,
VIRTIO_NET_S_LINK_UP);
-    return err;
+    return 0;
    }
    static void mlx5_vdpa_set_config_cb(struct vdpa_device
*vdev, struct vdpa_callback *cb)




Re: [PATCH] vdpa/mlx5: set_features should allow reset to zero

2021-02-24 Thread Si-Wei Liu




On 2/23/2021 9:04 PM, Michael S. Tsirkin wrote:

On Tue, Feb 23, 2021 at 11:35:57AM -0800, Si-Wei Liu wrote:


On 2/23/2021 5:26 AM, Michael S. Tsirkin wrote:

On Tue, Feb 23, 2021 at 10:03:57AM +0800, Jason Wang wrote:

On 2021/2/23 9:12 上午, Si-Wei Liu wrote:

On 2/21/2021 11:34 PM, Michael S. Tsirkin wrote:

On Mon, Feb 22, 2021 at 12:14:17PM +0800, Jason Wang wrote:

On 2021/2/19 7:54 下午, Si-Wei Liu wrote:

Commit 452639a64ad8 ("vdpa: make sure set_features is invoked
for legacy") made an exception for legacy guests to reset
features to 0, when config space is accessed before features
are set. We should relieve the verify_min_features() check
and allow features reset to 0 for this case.

It's worth noting that not just legacy guests could access
config space before features are set. For instance, when
feature VIRTIO_NET_F_MTU is advertised some modern driver
will try to access and validate the MTU present in the config
space before virtio features are set.

This looks like a spec violation:

"

The following driver-read-only field, mtu only exists if
VIRTIO_NET_F_MTU is
set.
This field specifies the maximum MTU for the driver to use.
"

Do we really want to workaround this?

Thanks

And also:

The driver MUST follow this sequence to initialize a device:
1. Reset the device.
2. Set the ACKNOWLEDGE status bit: the guest OS has noticed the device.
3. Set the DRIVER status bit: the guest OS knows how to drive the
device.
4. Read device feature bits, and write the subset of feature bits
understood by the OS and driver to the
device. During this step the driver MAY read (but MUST NOT write)
the device-specific configuration
fields to check that it can support the device before accepting it.
5. Set the FEATURES_OK status bit. The driver MUST NOT accept new
feature bits after this step.
6. Re-read device status to ensure the FEATURES_OK bit is still set:
otherwise, the device does not
support our subset of features and the device is unusable.
7. Perform device-specific setup, including discovery of virtqueues
for the device, optional per-bus setup,
reading and possibly writing the device’s virtio configuration
space, and population of virtqueues.
8. Set the DRIVER_OK status bit. At this point the device is “live”.


so accessing config space before FEATURES_OK is a spec violation, right?

It is, but it's not relevant to what this commit tries to address. I
thought the legacy guest still needs to be supported.

Having said, a separate patch has to be posted to fix the guest driver
issue where this discrepancy is introduced to virtnet_validate() (since
commit fe36cbe067). But it's not technically related to this patch.

-Siwei

I think it's a bug to read config space in validate, we should move it to
virtnet_probe().

Thanks

I take it back, reading but not writing seems to be explicitly allowed by spec.
So our way to detect a legacy guest is bogus, need to think what is
the best way to handle this.

Then maybe revert commit fe36cbe067 and friends, and have QEMU detect legacy
guest? Supposedly only config space write access needs to be guarded before
setting FEATURES_OK.

-Siwie

Detecting it isn't enough though, we will need a new ioctl to notify
the kernel that it's a legacy guest. Ugh :(
Well, although I think adding an ioctl is doable, may I know what the 
use case there will be for kernel to leverage such info directly? Is 
there a case QEMU can't do with dedicate ioctls later if there's indeed 
differentiation (legacy v.s. modern) needed?


One of the reason I asked is if this ioctl becomes a mandate for 
vhost-vdpa kernel. QEMU would reject initialize vhost-vdpa if doesn't 
see this ioctl coming?


If it's optional, suppose the kernel may need it only when it becomes 
necessary?


Thanks,
-Siwei






Rejecting reset to 0
prematurely causes correct MTU and link status unable to load
for the very first config space access, rendering issues like
guest showing inaccurate MTU value, or failure to reject
out-of-range MTU.

Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for
supported mlx5 devices")
Signed-off-by: Si-Wei Liu 
---
     drivers/vdpa/mlx5/net/mlx5_vnet.c | 15 +--
     1 file changed, 1 insertion(+), 14 deletions(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 7c1f789..540dd67 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1490,14 +1490,6 @@ static u64
mlx5_vdpa_get_features(struct vdpa_device *vdev)
     return mvdev->mlx_features;
     }
-static int verify_min_features(struct mlx5_vdpa_dev *mvdev,
u64 features)
-{
-    if (!(features & BIT_ULL(VIRTIO_F_ACCESS_PLATFORM)))
-    return -EOPNOTSUPP;
-
-    return 0;
-}
-
     static int setup_virtqueues(struct mlx5_vdpa_net *ndev)
     {
     int err;
@@ -1558,18 +1550,13 @@ static int
mlx5_vdpa_set_features(struct vdpa_device *vdev, u64

Re: [PATCH] vdpa/mlx5: set_features should allow reset to zero

2021-02-25 Thread Si-Wei Liu



Hi Michael,

Are you okay to live without this ioctl for now? I think QEMU is the one 
that needs to be fixed and will have to be made legacy guest aware. I 
think the kernel can just honor the feature negotiation result done by 
QEMU and do as what's told to.Will you agree?


If it's fine, I would proceed to reverting commit fe36cbe067 and related 
code in question from the kernel.


Thanks,
-Siwei

On 2/24/2021 10:24 AM, Si-Wei Liu wrote:

Detecting it isn't enough though, we will need a new ioctl to notify
the kernel that it's a legacy guest. Ugh :(
Well, although I think adding an ioctl is doable, may I know what the 
use case there will be for kernel to leverage such info directly? Is 
there a case QEMU can't do with dedicate ioctls later if there's 
indeed differentiation (legacy v.s. modern) needed?


One of the reason I asked is if this ioctl becomes a mandate for 
vhost-vdpa kernel. QEMU would reject initialize vhost-vdpa if doesn't 
see this ioctl coming?


If it's optional, suppose the kernel may need it only when it becomes 
necessary?


Thanks,
-Siwei




Re: [PATCH 1/2] vdpa/mlx5: Avoid unnecessary query virtqueue

2021-02-01 Thread Si-Wei Liu
On Thu, Jan 28, 2021 at 5:46 AM Eli Cohen  wrote:
>
> suspend_vq should only suspend the VQ on not save the current available
> index. This is done when a change of map occurs when the driver calls
> save_channel_info().

Hmmm, suspend_vq() is also called by teardown_vq(), the latter of
which doesn't save the available index as save_channel_info() doesn't
get called in that path at all. How does it handle the case that
aget_vq_state() is called from userspace (e.g. QEMU) while the
hardware VQ object was torn down, but userspace still wants to access
the queue index?

Refer to 
https://lore.kernel.org/netdev/1601583511-15138-1-git-send-email-si-wei@oracle.com/

vhost VQ 0 ring restore failed: -1: Resource temporarily unavailable (11)
vhost VQ 1 ring restore failed: -1: Resource temporarily unavailable (11)

QEMU will complain with the above warning while VM is being rebooted
or shut down.

Looks to me either the kernel driver should cover this requirement, or
the userspace has to bear the burden in saving the index and not call
into kernel if VQ is destroyed.

-Siwei


>
> Signed-off-by: Eli Cohen 
> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c | 8 
>  1 file changed, 8 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 88dde3455bfd..549ded074ff3 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -1148,8 +1148,6 @@ static int setup_vq(struct mlx5_vdpa_net *ndev, struct 
> mlx5_vdpa_virtqueue *mvq)
>
>  static void suspend_vq(struct mlx5_vdpa_net *ndev, struct 
> mlx5_vdpa_virtqueue *mvq)
>  {
> -   struct mlx5_virtq_attr attr;
> -
> if (!mvq->initialized)
> return;
>
> @@ -1158,12 +1156,6 @@ static void suspend_vq(struct mlx5_vdpa_net *ndev, 
> struct mlx5_vdpa_virtqueue *m
>
> if (modify_virtqueue(ndev, mvq, 
> MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND))
> mlx5_vdpa_warn(&ndev->mvdev, "modify to suspend failed\n");
> -
> -   if (query_virtqueue(ndev, mvq, &attr)) {
> -   mlx5_vdpa_warn(&ndev->mvdev, "failed to query virtqueue\n");
> -   return;
> -   }
> -   mvq->avail_idx = attr.available_index;
>  }
>
>  static void suspend_vqs(struct mlx5_vdpa_net *ndev)
> --
> 2.29.2
>


Re: [PATCH 1/2] vdpa/mlx5: Avoid unnecessary query virtqueue

2021-02-01 Thread Si-Wei Liu
On Mon, Feb 1, 2021 at 10:51 AM Si-Wei Liu  wrote:
>
> On Thu, Jan 28, 2021 at 5:46 AM Eli Cohen  wrote:
> >
> > suspend_vq should only suspend the VQ on not save the current available
> > index. This is done when a change of map occurs when the driver calls
> > save_channel_info().
>
> Hmmm, suspend_vq() is also called by teardown_vq(), the latter of
> which doesn't save the available index as save_channel_info() doesn't
> get called in that path at all. How does it handle the case that
> aget_vq_state() is called from userspace (e.g. QEMU) while the
> hardware VQ object was torn down, but userspace still wants to access
> the queue index?
>
> Refer to 
> https://lore.kernel.org/netdev/1601583511-15138-1-git-send-email-si-wei@oracle.com/
>
> vhost VQ 0 ring restore failed: -1: Resource temporarily unavailable (11)
> vhost VQ 1 ring restore failed: -1: Resource temporarily unavailable (11)
>
> QEMU will complain with the above warning while VM is being rebooted
> or shut down.
>
> Looks to me either the kernel driver should cover this requirement, or
> the userspace has to bear the burden in saving the index and not call
> into kernel if VQ is destroyed.

Actually, the userspace doesn't have the insights whether virt queue
will be destroyed if just changing the device status via set_status().
Looking at other vdpa driver in tree i.e. ifcvf it doesn't behave like
so. Hence this still looks to me to be Mellanox specifics and
mlx5_vdpa implementation detail that shouldn't expose to userspace.
>
> -Siwei
>
>
> >
> > Signed-off-by: Eli Cohen 
> > ---
> >  drivers/vdpa/mlx5/net/mlx5_vnet.c | 8 
> >  1 file changed, 8 deletions(-)
> >
> > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> > b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > index 88dde3455bfd..549ded074ff3 100644
> > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > @@ -1148,8 +1148,6 @@ static int setup_vq(struct mlx5_vdpa_net *ndev, 
> > struct mlx5_vdpa_virtqueue *mvq)
> >
> >  static void suspend_vq(struct mlx5_vdpa_net *ndev, struct 
> > mlx5_vdpa_virtqueue *mvq)
> >  {
> > -   struct mlx5_virtq_attr attr;
> > -
> > if (!mvq->initialized)
> > return;
> >
> > @@ -1158,12 +1156,6 @@ static void suspend_vq(struct mlx5_vdpa_net *ndev, 
> > struct mlx5_vdpa_virtqueue *m
> >
> > if (modify_virtqueue(ndev, mvq, 
> > MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND))
> > mlx5_vdpa_warn(&ndev->mvdev, "modify to suspend failed\n");
> > -
> > -   if (query_virtqueue(ndev, mvq, &attr)) {
> > -   mlx5_vdpa_warn(&ndev->mvdev, "failed to query virtqueue\n");
> > -   return;
> > -   }
> > -   mvq->avail_idx = attr.available_index;
> >  }
> >
> >  static void suspend_vqs(struct mlx5_vdpa_net *ndev)
> > --
> > 2.29.2
> >


Re: [PATCH 1/2] vdpa/mlx5: Avoid unnecessary query virtqueue

2021-02-01 Thread Si-Wei Liu
On Mon, Feb 1, 2021 at 7:13 PM Jason Wang  wrote:
>
>
> On 2021/2/2 上午3:17, Si-Wei Liu wrote:
> > On Mon, Feb 1, 2021 at 10:51 AM Si-Wei Liu  wrote:
> >> On Thu, Jan 28, 2021 at 5:46 AM Eli Cohen  wrote:
> >>> suspend_vq should only suspend the VQ on not save the current available
> >>> index. This is done when a change of map occurs when the driver calls
> >>> save_channel_info().
> >> Hmmm, suspend_vq() is also called by teardown_vq(), the latter of
> >> which doesn't save the available index as save_channel_info() doesn't
> >> get called in that path at all. How does it handle the case that
> >> aget_vq_state() is called from userspace (e.g. QEMU) while the
> >> hardware VQ object was torn down, but userspace still wants to access
> >> the queue index?
> >>
> >> Refer to 
> >> https://lore.kernel.org/netdev/1601583511-15138-1-git-send-email-si-wei@oracle.com/
> >>
> >> vhost VQ 0 ring restore failed: -1: Resource temporarily unavailable (11)
> >> vhost VQ 1 ring restore failed: -1: Resource temporarily unavailable (11)
> >>
> >> QEMU will complain with the above warning while VM is being rebooted
> >> or shut down.
> >>
> >> Looks to me either the kernel driver should cover this requirement, or
> >> the userspace has to bear the burden in saving the index and not call
> >> into kernel if VQ is destroyed.
> > Actually, the userspace doesn't have the insights whether virt queue
> > will be destroyed if just changing the device status via set_status().
> > Looking at other vdpa driver in tree i.e. ifcvf it doesn't behave like
> > so. Hence this still looks to me to be Mellanox specifics and
> > mlx5_vdpa implementation detail that shouldn't expose to userspace.
>
>
> So I think we can simply drop this patch?

Yep, I think so. To be honest I don't know why it has anything to do
with the memory hotplug issue.

-Siwei

>
> Thanks
>
>
> >> -Siwei
> >>
> >>
> >>> Signed-off-by: Eli Cohen 
> >>> ---
> >>>   drivers/vdpa/mlx5/net/mlx5_vnet.c | 8 
> >>>   1 file changed, 8 deletions(-)
> >>>
> >>> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> >>> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> >>> index 88dde3455bfd..549ded074ff3 100644
> >>> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> >>> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> >>> @@ -1148,8 +1148,6 @@ static int setup_vq(struct mlx5_vdpa_net *ndev, 
> >>> struct mlx5_vdpa_virtqueue *mvq)
> >>>
> >>>   static void suspend_vq(struct mlx5_vdpa_net *ndev, struct 
> >>> mlx5_vdpa_virtqueue *mvq)
> >>>   {
> >>> -   struct mlx5_virtq_attr attr;
> >>> -
> >>>  if (!mvq->initialized)
> >>>  return;
> >>>
> >>> @@ -1158,12 +1156,6 @@ static void suspend_vq(struct mlx5_vdpa_net *ndev, 
> >>> struct mlx5_vdpa_virtqueue *m
> >>>
> >>>  if (modify_virtqueue(ndev, mvq, 
> >>> MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND))
> >>>  mlx5_vdpa_warn(&ndev->mvdev, "modify to suspend 
> >>> failed\n");
> >>> -
> >>> -   if (query_virtqueue(ndev, mvq, &attr)) {
> >>> -   mlx5_vdpa_warn(&ndev->mvdev, "failed to query 
> >>> virtqueue\n");
> >>> -   return;
> >>> -   }
> >>> -   mvq->avail_idx = attr.available_index;
> >>>   }
> >>>
> >>>   static void suspend_vqs(struct mlx5_vdpa_net *ndev)
> >>> --
> >>> 2.29.2
> >>>
>


Re: [PATCH 1/2] vdpa/mlx5: Avoid unnecessary query virtqueue

2021-02-02 Thread Si-Wei Liu
Thanks Eli and Jason for clarifications. See inline.

On Mon, Feb 1, 2021 at 11:06 PM Eli Cohen  wrote:
>
> On Tue, Feb 02, 2021 at 02:02:25PM +0800, Jason Wang wrote:
> >
> > On 2021/2/2 下午12:15, Si-Wei Liu wrote:
> > > On Mon, Feb 1, 2021 at 7:13 PM Jason Wang  wrote:
> > > >
> > > > On 2021/2/2 上午3:17, Si-Wei Liu wrote:
> > > > > On Mon, Feb 1, 2021 at 10:51 AM Si-Wei Liu  
> > > > > wrote:
> > > > > > On Thu, Jan 28, 2021 at 5:46 AM Eli Cohen  wrote:
> > > > > > > suspend_vq should only suspend the VQ on not save the current 
> > > > > > > available
> > > > > > > index. This is done when a change of map occurs when the driver 
> > > > > > > calls
> > > > > > > save_channel_info().
> > > > > > Hmmm, suspend_vq() is also called by teardown_vq(), the latter of
> > > > > > which doesn't save the available index as save_channel_info() 
> > > > > > doesn't
> > > > > > get called in that path at all. How does it handle the case that
> > > > > > aget_vq_state() is called from userspace (e.g. QEMU) while the
> > > > > > hardware VQ object was torn down, but userspace still wants to 
> > > > > > access
> > > > > > the queue index?
> > > > > >
> > > > > > Refer to 
> > > > > > https://lore.kernel.org/netdev/1601583511-15138-1-git-send-email-si-wei@oracle.com/
> > > > > >
> > > > > > vhost VQ 0 ring restore failed: -1: Resource temporarily 
> > > > > > unavailable (11)
> > > > > > vhost VQ 1 ring restore failed: -1: Resource temporarily 
> > > > > > unavailable (11)
> > > > > >
> > > > > > QEMU will complain with the above warning while VM is being rebooted
> > > > > > or shut down.
> > > > > >
> > > > > > Looks to me either the kernel driver should cover this requirement, 
> > > > > > or
> > > > > > the userspace has to bear the burden in saving the index and not 
> > > > > > call
> > > > > > into kernel if VQ is destroyed.
> > > > > Actually, the userspace doesn't have the insights whether virt queue
> > > > > will be destroyed if just changing the device status via set_status().
> > > > > Looking at other vdpa driver in tree i.e. ifcvf it doesn't behave like
> > > > > so. Hence this still looks to me to be Mellanox specifics and
> > > > > mlx5_vdpa implementation detail that shouldn't expose to userspace.
> > > >
> > > > So I think we can simply drop this patch?
> > > Yep, I think so. To be honest I don't know why it has anything to do
> > > with the memory hotplug issue.
> >
> >
> > Eli may know more, my understanding is that, during memory hotplut, qemu
> > need to propagate new memory mappings via set_map(). For mellanox, it means
> > it needs to rebuild memory keys, so the virtqueue needs to be suspended.
> >
>
> I think Siwei was asking why the first patch was related to the hotplug
> issue.

I was thinking how consistency is assured when saving/restoring this
h/w avail_index against the one in the virtq memory, particularly in
the region_add/.region_del() context (e.g. the hotplug case). Problem
is I don't see explicit memory barrier when guest thread updates the
avail_index, how does the device make sure the h/w avail_index is
uptodate while guest may race with updating the virtq's avail_index in
the mean while? Maybe I completely miss something obvious?

Thanks,
-Siwei

>
> But you're correct. When memory is added, I get a new memory map. This
> requires me to build a new memory key object which covers the new memory
> map. Since the virtqueue objects are referencing this memory key, I need
> to destroy them and build new ones referncing the new memory key.
>
> > Thanks
> >
> >
> > >
> > > -Siwei
> > >
> > > > Thanks
> > > >
> > > >
> > > > > > -Siwei
> > > > > >
> > > > > >
> > > > > > > Signed-off-by: Eli Cohen 
> > > > > > > ---
> > > > > > >drivers/vdpa/mlx5/net/mlx5_vnet.c | 8 
> > > > > > >1 file changed, 8 deletions(-)
> > > > > > >
> > > > > > > diff --gi

Re: [PATCH] vdpa/mlx5: Restore the hardware used index after change map

2021-02-02 Thread Si-Wei Liu
On Tue, Feb 2, 2021 at 6:34 AM Eli Cohen  wrote:
>
> When a change of memory map occurs, the hardware resources are destroyed
> and then re-created again with the new memory map. In such case, we need
> to restore the hardware available and used indices. The driver failed to
> restore the used index which is added here.
>
> Fixes 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices")
> Signed-off-by: Eli Cohen 
> ---
> This patch is being sent again a single patch the fixes hot memory
> addtion to a qemy process.
>
>  drivers/vdpa/mlx5/net/mlx5_vnet.c | 7 +++
>  1 file changed, 7 insertions(+)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 88dde3455bfd..839f57c64a6f 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -87,6 +87,7 @@ struct mlx5_vq_restore_info {
> u64 device_addr;
> u64 driver_addr;
> u16 avail_index;
> +   u16 used_index;
> bool ready;
> struct vdpa_callback cb;
> bool restore;
> @@ -121,6 +122,7 @@ struct mlx5_vdpa_virtqueue {
> u32 virtq_id;
> struct mlx5_vdpa_net *ndev;
> u16 avail_idx;
> +   u16 used_idx;
> int fw_state;
>
> /* keep last in the struct */
> @@ -804,6 +806,7 @@ static int create_virtqueue(struct mlx5_vdpa_net *ndev, 
> struct mlx5_vdpa_virtque
>
> obj_context = MLX5_ADDR_OF(create_virtio_net_q_in, in, obj_context);
> MLX5_SET(virtio_net_q_object, obj_context, hw_available_index, 
> mvq->avail_idx);
> +   MLX5_SET(virtio_net_q_object, obj_context, hw_used_index, 
> mvq->used_idx);

The saved indexes will apply to the new virtqueue object whenever it
is created. In virtio spec, these indexes will reset back to zero when
the virtio device is reset. But I don't see how it's done today. IOW,
I don't see where avail_idx and used_idx get cleared from the mvq for
device reset via set_status().

-Siwei


> MLX5_SET(virtio_net_q_object, obj_context, 
> queue_feature_bit_mask_12_3,
>  get_features_12_3(ndev->mvdev.actual_features));
> vq_ctx = MLX5_ADDR_OF(virtio_net_q_object, obj_context, 
> virtio_q_context);
> @@ -1022,6 +1025,7 @@ static int connect_qps(struct mlx5_vdpa_net *ndev, 
> struct mlx5_vdpa_virtqueue *m
>  struct mlx5_virtq_attr {
> u8 state;
> u16 available_index;
> +   u16 used_index;
>  };
>
>  static int query_virtqueue(struct mlx5_vdpa_net *ndev, struct 
> mlx5_vdpa_virtqueue *mvq,
> @@ -1052,6 +1056,7 @@ static int query_virtqueue(struct mlx5_vdpa_net *ndev, 
> struct mlx5_vdpa_virtqueu
> memset(attr, 0, sizeof(*attr));
> attr->state = MLX5_GET(virtio_net_q_object, obj_context, state);
> attr->available_index = MLX5_GET(virtio_net_q_object, obj_context, 
> hw_available_index);
> +   attr->used_index = MLX5_GET(virtio_net_q_object, obj_context, 
> hw_used_index);
> kfree(out);
> return 0;
>
> @@ -1610,6 +1615,7 @@ static int save_channel_info(struct mlx5_vdpa_net 
> *ndev, struct mlx5_vdpa_virtqu
> return err;
>
> ri->avail_index = attr.available_index;
> +   ri->used_index = attr.used_index;
> ri->ready = mvq->ready;
> ri->num_ent = mvq->num_ent;
> ri->desc_addr = mvq->desc_addr;
> @@ -1654,6 +1660,7 @@ static void restore_channels_info(struct mlx5_vdpa_net 
> *ndev)
> continue;
>
> mvq->avail_idx = ri->avail_index;
> +   mvq->used_idx = ri->used_index;
> mvq->ready = ri->ready;
> mvq->num_ent = ri->num_ent;
> mvq->desc_addr = ri->desc_addr;
> --
> 2.29.2
>


Re: [PATCH 1/2] vdpa/mlx5: Avoid unnecessary query virtqueue

2021-02-02 Thread Si-Wei Liu
On Tue, Feb 2, 2021 at 1:23 AM Eli Cohen  wrote:
>
> On Tue, Feb 02, 2021 at 12:38:51AM -0800, Si-Wei Liu wrote:
> > Thanks Eli and Jason for clarifications. See inline.
> >
> > On Mon, Feb 1, 2021 at 11:06 PM Eli Cohen  wrote:
> > >
> > > On Tue, Feb 02, 2021 at 02:02:25PM +0800, Jason Wang wrote:
> > > >
> > > > On 2021/2/2 下午12:15, Si-Wei Liu wrote:
> > > > > On Mon, Feb 1, 2021 at 7:13 PM Jason Wang  wrote:
> > > > > >
> > > > > > On 2021/2/2 上午3:17, Si-Wei Liu wrote:
> > > > > > > On Mon, Feb 1, 2021 at 10:51 AM Si-Wei Liu 
> > > > > > >  wrote:
> > > > > > > > On Thu, Jan 28, 2021 at 5:46 AM Eli Cohen  
> > > > > > > > wrote:
> > > > > > > > > suspend_vq should only suspend the VQ on not save the current 
> > > > > > > > > available
> > > > > > > > > index. This is done when a change of map occurs when the 
> > > > > > > > > driver calls
> > > > > > > > > save_channel_info().
> > > > > > > > Hmmm, suspend_vq() is also called by teardown_vq(), the latter 
> > > > > > > > of
> > > > > > > > which doesn't save the available index as save_channel_info() 
> > > > > > > > doesn't
> > > > > > > > get called in that path at all. How does it handle the case that
> > > > > > > > aget_vq_state() is called from userspace (e.g. QEMU) while the
> > > > > > > > hardware VQ object was torn down, but userspace still wants to 
> > > > > > > > access
> > > > > > > > the queue index?
> > > > > > > >
> > > > > > > > Refer to 
> > > > > > > > https://lore.kernel.org/netdev/1601583511-15138-1-git-send-email-si-wei@oracle.com/
> > > > > > > >
> > > > > > > > vhost VQ 0 ring restore failed: -1: Resource temporarily 
> > > > > > > > unavailable (11)
> > > > > > > > vhost VQ 1 ring restore failed: -1: Resource temporarily 
> > > > > > > > unavailable (11)
> > > > > > > >
> > > > > > > > QEMU will complain with the above warning while VM is being 
> > > > > > > > rebooted
> > > > > > > > or shut down.
> > > > > > > >
> > > > > > > > Looks to me either the kernel driver should cover this 
> > > > > > > > requirement, or
> > > > > > > > the userspace has to bear the burden in saving the index and 
> > > > > > > > not call
> > > > > > > > into kernel if VQ is destroyed.
> > > > > > > Actually, the userspace doesn't have the insights whether virt 
> > > > > > > queue
> > > > > > > will be destroyed if just changing the device status via 
> > > > > > > set_status().
> > > > > > > Looking at other vdpa driver in tree i.e. ifcvf it doesn't behave 
> > > > > > > like
> > > > > > > so. Hence this still looks to me to be Mellanox specifics and
> > > > > > > mlx5_vdpa implementation detail that shouldn't expose to 
> > > > > > > userspace.
> > > > > >
> > > > > > So I think we can simply drop this patch?
> > > > > Yep, I think so. To be honest I don't know why it has anything to do
> > > > > with the memory hotplug issue.
> > > >
> > > >
> > > > Eli may know more, my understanding is that, during memory hotplut, qemu
> > > > need to propagate new memory mappings via set_map(). For mellanox, it 
> > > > means
> > > > it needs to rebuild memory keys, so the virtqueue needs to be suspended.
> > > >
> > >
> > > I think Siwei was asking why the first patch was related to the hotplug
> > > issue.
> >
> > I was thinking how consistency is assured when saving/restoring this
> > h/w avail_index against the one in the virtq memory, particularly in
> > the region_add/.region_del() context (e.g. the hotplug case). Problem
> > is I don't see explicit memory barrier when guest thread updates the
> > avail_index, how does the device make sure the

Re: [PATCH] vdpa/mlx5: set_features should allow reset to zero

2021-03-01 Thread Si-Wei Liu




On 2/28/2021 1:27 PM, Michael S. Tsirkin wrote:

On Thu, Feb 25, 2021 at 04:56:42PM -0800, Si-Wei Liu wrote:

Hi Michael,

Are you okay to live without this ioctl for now? I think QEMU is the one
that needs to be fixed and will have to be made legacy guest aware. I think
the kernel can just honor the feature negotiation result done by QEMU and do
as what's told to.Will you agree?

If it's fine, I would proceed to reverting commit fe36cbe067 and related
code in question from the kernel.

Thanks,
-Siwei


Not really, I don't see why that's a good idea.  fe36cbe067 is the code
checking MTU before FEATURES_OK. Spec explicitly allows that.


Alright, but what I meant was this commit
452639a64ad8 ("vdpa: make sure set_features is invoked for legacy").

But I got why you need it in another email (for BE host/guest).

-Siwei


Re: [PATCH] vdpa/mlx5: Restore the hardware used index after change map

2021-02-03 Thread Si-Wei Liu
On Tue, Feb 2, 2021 at 10:48 PM Eli Cohen  wrote:
>
> On Tue, Feb 02, 2021 at 09:14:02AM -0800, Si-Wei Liu wrote:
> > On Tue, Feb 2, 2021 at 6:34 AM Eli Cohen  wrote:
> > >
> > > When a change of memory map occurs, the hardware resources are destroyed
> > > and then re-created again with the new memory map. In such case, we need
> > > to restore the hardware available and used indices. The driver failed to
> > > restore the used index which is added here.
> > >
> > > Fixes 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 
> > > devices")
> > > Signed-off-by: Eli Cohen 
> > > ---
> > > This patch is being sent again a single patch the fixes hot memory
> > > addtion to a qemy process.
> > >
> > >  drivers/vdpa/mlx5/net/mlx5_vnet.c | 7 +++
> > >  1 file changed, 7 insertions(+)
> > >
> > > diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> > > b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > index 88dde3455bfd..839f57c64a6f 100644
> > > --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> > > @@ -87,6 +87,7 @@ struct mlx5_vq_restore_info {
> > > u64 device_addr;
> > > u64 driver_addr;
> > > u16 avail_index;
> > > +   u16 used_index;
> > > bool ready;
> > > struct vdpa_callback cb;
> > > bool restore;
> > > @@ -121,6 +122,7 @@ struct mlx5_vdpa_virtqueue {
> > > u32 virtq_id;
> > > struct mlx5_vdpa_net *ndev;
> > > u16 avail_idx;
> > > +   u16 used_idx;
> > > int fw_state;
> > >
> > > /* keep last in the struct */
> > > @@ -804,6 +806,7 @@ static int create_virtqueue(struct mlx5_vdpa_net 
> > > *ndev, struct mlx5_vdpa_virtque
> > >
> > > obj_context = MLX5_ADDR_OF(create_virtio_net_q_in, in, 
> > > obj_context);
> > > MLX5_SET(virtio_net_q_object, obj_context, hw_available_index, 
> > > mvq->avail_idx);
> > > +   MLX5_SET(virtio_net_q_object, obj_context, hw_used_index, 
> > > mvq->used_idx);
> >
> > The saved indexes will apply to the new virtqueue object whenever it
> > is created. In virtio spec, these indexes will reset back to zero when
> > the virtio device is reset. But I don't see how it's done today. IOW,
> > I don't see where avail_idx and used_idx get cleared from the mvq for
> > device reset via set_status().
> >
>
> Right, but this is not strictly related to this patch. I will post
> another patch to fix this.

Better to post these two patches in a series.Or else it may cause VM
reboot problem as that is where the device gets reset. The avail_index
did not as the correct value will be written to by driver right after,
but used_idx introduced by this patch is supplied by device hence this
patch alone would introduce regression.

>
> BTW, can you describe a secnario that would cause a reset (through
> calling set_status()) that happens after the VQ has been used?

You can try reboot the guest, that'll be the easy way to test.

-Siwei

>
> > -Siwei
> >
> >
> > > MLX5_SET(virtio_net_q_object, obj_context, 
> > > queue_feature_bit_mask_12_3,
> > >  get_features_12_3(ndev->mvdev.actual_features));
> > > vq_ctx = MLX5_ADDR_OF(virtio_net_q_object, obj_context, 
> > > virtio_q_context);
> > > @@ -1022,6 +1025,7 @@ static int connect_qps(struct mlx5_vdpa_net *ndev, 
> > > struct mlx5_vdpa_virtqueue *m
> > >  struct mlx5_virtq_attr {
> > > u8 state;
> > > u16 available_index;
> > > +   u16 used_index;
> > >  };
> > >
> > >  static int query_virtqueue(struct mlx5_vdpa_net *ndev, struct 
> > > mlx5_vdpa_virtqueue *mvq,
> > > @@ -1052,6 +1056,7 @@ static int query_virtqueue(struct mlx5_vdpa_net 
> > > *ndev, struct mlx5_vdpa_virtqueu
> > > memset(attr, 0, sizeof(*attr));
> > > attr->state = MLX5_GET(virtio_net_q_object, obj_context, state);
> > > attr->available_index = MLX5_GET(virtio_net_q_object, 
> > > obj_context, hw_available_index);
> > > +   attr->used_index = MLX5_GET(virtio_net_q_object, obj_context, 
> > > hw_used_index);
> > > kfree(out);
> > > return 0;
> > >
> > > @@ -1610,6 +1615,7 @@ static int save_channel_info(struct mlx5_vdpa_net 
> > > *ndev, struct mlx5_vdpa_virtqu
> > > return err;
> > >
> > > ri->avail_index = attr.available_index;
> > > +   ri->used_index = attr.used_index;
> > > ri->ready = mvq->ready;
> > > ri->num_ent = mvq->num_ent;
> > > ri->desc_addr = mvq->desc_addr;
> > > @@ -1654,6 +1660,7 @@ static void restore_channels_info(struct 
> > > mlx5_vdpa_net *ndev)
> > > continue;
> > >
> > > mvq->avail_idx = ri->avail_index;
> > > +   mvq->used_idx = ri->used_index;
> > > mvq->ready = ri->ready;
> > > mvq->num_ent = ri->num_ent;
> > > mvq->desc_addr = ri->desc_addr;
> > > --
> > > 2.29.2
> > >


Re: [PATCH 1/2] vdpa/mlx5: Avoid unnecessary query virtqueue

2021-02-03 Thread Si-Wei Liu
On Tue, Feb 2, 2021 at 9:16 PM Jason Wang  wrote:
>
>
> On 2021/2/3 上午1:54, Si-Wei Liu wrote:
> > On Tue, Feb 2, 2021 at 1:23 AM Eli Cohen  wrote:
> >> On Tue, Feb 02, 2021 at 12:38:51AM -0800, Si-Wei Liu wrote:
> >>> Thanks Eli and Jason for clarifications. See inline.
> >>>
> >>> On Mon, Feb 1, 2021 at 11:06 PM Eli Cohen  wrote:
> >>>> On Tue, Feb 02, 2021 at 02:02:25PM +0800, Jason Wang wrote:
> >>>>> On 2021/2/2 下午12:15, Si-Wei Liu wrote:
> >>>>>> On Mon, Feb 1, 2021 at 7:13 PM Jason Wang  wrote:
> >>>>>>> On 2021/2/2 上午3:17, Si-Wei Liu wrote:
> >>>>>>>> On Mon, Feb 1, 2021 at 10:51 AM Si-Wei Liu  
> >>>>>>>> wrote:
> >>>>>>>>> On Thu, Jan 28, 2021 at 5:46 AM Eli Cohen  wrote:
> >>>>>>>>>> suspend_vq should only suspend the VQ on not save the current 
> >>>>>>>>>> available
> >>>>>>>>>> index. This is done when a change of map occurs when the driver 
> >>>>>>>>>> calls
> >>>>>>>>>> save_channel_info().
> >>>>>>>>> Hmmm, suspend_vq() is also called by teardown_vq(), the latter of
> >>>>>>>>> which doesn't save the available index as save_channel_info() 
> >>>>>>>>> doesn't
> >>>>>>>>> get called in that path at all. How does it handle the case that
> >>>>>>>>> aget_vq_state() is called from userspace (e.g. QEMU) while the
> >>>>>>>>> hardware VQ object was torn down, but userspace still wants to 
> >>>>>>>>> access
> >>>>>>>>> the queue index?
> >>>>>>>>>
> >>>>>>>>> Refer to 
> >>>>>>>>> https://lore.kernel.org/netdev/1601583511-15138-1-git-send-email-si-wei@oracle.com/
> >>>>>>>>>
> >>>>>>>>> vhost VQ 0 ring restore failed: -1: Resource temporarily 
> >>>>>>>>> unavailable (11)
> >>>>>>>>> vhost VQ 1 ring restore failed: -1: Resource temporarily 
> >>>>>>>>> unavailable (11)
> >>>>>>>>>
> >>>>>>>>> QEMU will complain with the above warning while VM is being rebooted
> >>>>>>>>> or shut down.
> >>>>>>>>>
> >>>>>>>>> Looks to me either the kernel driver should cover this requirement, 
> >>>>>>>>> or
> >>>>>>>>> the userspace has to bear the burden in saving the index and not 
> >>>>>>>>> call
> >>>>>>>>> into kernel if VQ is destroyed.
> >>>>>>>> Actually, the userspace doesn't have the insights whether virt queue
> >>>>>>>> will be destroyed if just changing the device status via 
> >>>>>>>> set_status().
> >>>>>>>> Looking at other vdpa driver in tree i.e. ifcvf it doesn't behave 
> >>>>>>>> like
> >>>>>>>> so. Hence this still looks to me to be Mellanox specifics and
> >>>>>>>> mlx5_vdpa implementation detail that shouldn't expose to userspace.
> >>>>>>> So I think we can simply drop this patch?
> >>>>>> Yep, I think so. To be honest I don't know why it has anything to do
> >>>>>> with the memory hotplug issue.
> >>>>>
> >>>>> Eli may know more, my understanding is that, during memory hotplut, qemu
> >>>>> need to propagate new memory mappings via set_map(). For mellanox, it 
> >>>>> means
> >>>>> it needs to rebuild memory keys, so the virtqueue needs to be suspended.
> >>>>>
> >>>> I think Siwei was asking why the first patch was related to the hotplug
> >>>> issue.
> >>> I was thinking how consistency is assured when saving/restoring this
> >>> h/w avail_index against the one in the virtq memory, particularly in
> >>> the region_add/.region_del() context (e.g. the hotplug case). Problem
> >>> is I don't see explicit memory barrier when guest thread updates the
> >>> avail_index, how does the devic

Re: [PATCH v1] vdpa/mlx5: Restore the hardware used index after change map

2021-02-05 Thread Si-Wei Liu




On 2/3/2021 11:36 PM, Eli Cohen wrote:

When a change of memory map occurs, the hardware resources are destroyed
and then re-created again with the new memory map. In such case, we need
to restore the hardware available and used indices. The driver failed to
restore the used index which is added here.

Also, since the driver also fails to reset the available and used
indices upon device reset, fix this here to avoid regression caused by
the fact that used index may not be zero upon device reset.

Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices")
Signed-off-by: Eli Cohen 
---
v0 -> v1:
Clear indices upon device reset

  drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 ++
  1 file changed, 18 insertions(+)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 88dde3455bfd..b5fe6d2ad22f 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -87,6 +87,7 @@ struct mlx5_vq_restore_info {
u64 device_addr;
u64 driver_addr;
u16 avail_index;
+   u16 used_index;
bool ready;
struct vdpa_callback cb;
bool restore;
@@ -121,6 +122,7 @@ struct mlx5_vdpa_virtqueue {
u32 virtq_id;
struct mlx5_vdpa_net *ndev;
u16 avail_idx;
+   u16 used_idx;
int fw_state;
  
  	/* keep last in the struct */

@@ -804,6 +806,7 @@ static int create_virtqueue(struct mlx5_vdpa_net *ndev, 
struct mlx5_vdpa_virtque
  
  	obj_context = MLX5_ADDR_OF(create_virtio_net_q_in, in, obj_context);

MLX5_SET(virtio_net_q_object, obj_context, hw_available_index, 
mvq->avail_idx);
+   MLX5_SET(virtio_net_q_object, obj_context, hw_used_index, 
mvq->used_idx);
MLX5_SET(virtio_net_q_object, obj_context, queue_feature_bit_mask_12_3,
 get_features_12_3(ndev->mvdev.actual_features));
vq_ctx = MLX5_ADDR_OF(virtio_net_q_object, obj_context, 
virtio_q_context);
@@ -1022,6 +1025,7 @@ static int connect_qps(struct mlx5_vdpa_net *ndev, struct 
mlx5_vdpa_virtqueue *m
  struct mlx5_virtq_attr {
u8 state;
u16 available_index;
+   u16 used_index;
  };
  
  static int query_virtqueue(struct mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueue *mvq,

@@ -1052,6 +1056,7 @@ static int query_virtqueue(struct mlx5_vdpa_net *ndev, 
struct mlx5_vdpa_virtqueu
memset(attr, 0, sizeof(*attr));
attr->state = MLX5_GET(virtio_net_q_object, obj_context, state);
attr->available_index = MLX5_GET(virtio_net_q_object, obj_context, 
hw_available_index);
+   attr->used_index = MLX5_GET(virtio_net_q_object, obj_context, 
hw_used_index);
kfree(out);
return 0;
  
@@ -1535,6 +1540,16 @@ static void teardown_virtqueues(struct mlx5_vdpa_net *ndev)

}
  }
  
+static void clear_virtqueues(struct mlx5_vdpa_net *ndev)

+{
+   int i;
+
+   for (i = ndev->mvdev.max_vqs - 1; i >= 0; i--) {
+   ndev->vqs[i].avail_idx = 0;
+   ndev->vqs[i].used_idx = 0;
+   }
+}
+
  /* TODO: cross-endian support */
  static inline bool mlx5_vdpa_is_little_endian(struct mlx5_vdpa_dev *mvdev)
  {
@@ -1610,6 +1625,7 @@ static int save_channel_info(struct mlx5_vdpa_net *ndev, 
struct mlx5_vdpa_virtqu
return err;
  
  	ri->avail_index = attr.available_index;

+   ri->used_index = attr.used_index;
ri->ready = mvq->ready;
ri->num_ent = mvq->num_ent;
ri->desc_addr = mvq->desc_addr;
@@ -1654,6 +1670,7 @@ static void restore_channels_info(struct mlx5_vdpa_net 
*ndev)
continue;
  
  		mvq->avail_idx = ri->avail_index;

+   mvq->used_idx = ri->used_index;
mvq->ready = ri->ready;
mvq->num_ent = ri->num_ent;
mvq->desc_addr = ri->desc_addr;
@@ -1768,6 +1785,7 @@ static void mlx5_vdpa_set_status(struct vdpa_device 
*vdev, u8 status)
if (!status) {
mlx5_vdpa_info(mvdev, "performing device reset\n");
teardown_driver(ndev);
+   clear_virtqueues(ndev);
The clearing looks fine at the first glance, as it aligns with the other 
state cleanups floating around at the same place. However, the thing is 
get_vq_state() is supposed to be called right after to get sync'ed with 
the latest internal avail_index from device while vq is stopped. The 
index was saved in the driver software at vq suspension, but before the 
virtq object is destroyed. We shouldn't clear the avail_index too early.


Possibly it can be postponed to where VIRTIO_CONFIG_S_DRIVER_OK gets set 
again, i.e. right before the setup_driver() in mlx5_vdpa_set_status()?


-Siwei


mlx5_vdpa_destroy_mr(&ndev->mvdev);
ndev->mvdev.status = 0;
ndev->mvdev.mlx_features = 0;




[PATCH 1/3] mlx5_vdpa: should exclude header length and fcs from mtu

2021-02-06 Thread Si-Wei Liu
When feature VIRTIO_NET_F_MTU is negotiated on mlx5_vdpa,
22 extra bytes worth of MTU length is shown in guest.
This is because the mlx5_query_port_max_mtu API returns
the "hardware" MTU value, which does not just contain the
Ethernet payload, but includes extra lengths starting
from the Ethernet header up to the FCS altogether.

Fix the MTU so packets won't get dropped silently.

Signed-off-by: Si-Wei Liu 
---
 drivers/vdpa/mlx5/core/mlx5_vdpa.h |  4 
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 15 ++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h 
b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
index 08f742f..b6cc53b 100644
--- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h
+++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
@@ -4,9 +4,13 @@
 #ifndef __MLX5_VDPA_H__
 #define __MLX5_VDPA_H__
 
+#include 
+#include 
 #include 
 #include 
 
+#define MLX5V_ETH_HARD_MTU (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN)
+
 struct mlx5_vdpa_direct_mr {
u64 start;
u64 end;
diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index dc88559..b8416c4 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1907,6 +1907,19 @@ static int mlx5_get_vq_irq(struct vdpa_device *vdv, u16 
idx)
.free = mlx5_vdpa_free,
 };
 
+static int query_mtu(struct mlx5_core_dev *mdev, u16 *mtu)
+{
+   u16 hw_mtu;
+   int err;
+
+   err = mlx5_query_nic_vport_mtu(mdev, &hw_mtu);
+   if (err)
+   return err;
+
+   *mtu = hw_mtu - MLX5V_ETH_HARD_MTU;
+   return 0;
+}
+
 static int alloc_resources(struct mlx5_vdpa_net *ndev)
 {
struct mlx5_vdpa_net_resources *res = &ndev->res;
@@ -1992,7 +2005,7 @@ static int mlx5v_probe(struct auxiliary_device *adev,
init_mvqs(ndev);
mutex_init(&ndev->reslock);
config = &ndev->config;
-   err = mlx5_query_nic_vport_mtu(mdev, &ndev->mtu);
+   err = query_mtu(mdev, &ndev->mtu);
if (err)
goto err_mtu;
 
-- 
1.8.3.1



[PATCH 2/3] mlx5_vdpa: fix feature negotiation across device reset

2021-02-06 Thread Si-Wei Liu
The mlx_features denotes the capability for which
set of virtio features is supported by device. In
principle, this field needs not be cleared during
virtio device reset, as this capability is static
and does not change across reset.

In fact, the current code may have the assumption
that mlx_features can be reloaded from firmware
via the .get_features ops after device is reset
(via the .set_status ops), which is unfortunately
not true. The userspace VMM might save a copy
of backend capable features and won't call into
kernel again to get it on reset. This causes all
virtio features getting disabled on newly created
virtqs after device reset, while guest would hold
mismatched view of available features. For e.g.,
the guest may still assume tx checksum offload
is available after reset and feature negotiation,
causing frames with bogus (incomplete) checksum
transmitted on the wire.

Signed-off-by: Si-Wei Liu 
---
 drivers/vdpa/mlx5/net/mlx5_vnet.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index b8416c4..aa6f8cd 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1788,7 +1788,6 @@ static void mlx5_vdpa_set_status(struct vdpa_device 
*vdev, u8 status)
clear_virtqueues(ndev);
mlx5_vdpa_destroy_mr(&ndev->mvdev);
ndev->mvdev.status = 0;
-   ndev->mvdev.mlx_features = 0;
++mvdev->generation;
return;
}
-- 
1.8.3.1



[PATCH 3/3] mlx5_vdpa: defer clear_virtqueues to until DRIVER_OK

2021-02-06 Thread Si-Wei Liu
While virtq is stopped,  get_vq_state() is supposed to
be  called to  get  sync'ed  with  the latest internal
avail_index from device. The saved avail_index is used
to restate  the virtq  once device is started.  Commit
b35ccebe3ef7 introduced the clear_virtqueues() routine
to  reset  the saved  avail_index,  however, the index
gets cleared a bit earlier before get_vq_state() tries
to read it. This would cause consistency problems when
virtq is restarted, e.g. through a series of link down
and link up events. We  could  defer  the  clearing of
avail_index  to  until  the  device  is to be started,
i.e. until  VIRTIO_CONFIG_S_DRIVER_OK  is set again in
set_status().

Fixes: b35ccebe3ef7 ("vdpa/mlx5: Restore the hardware used index after change 
map")
Signed-off-by: Si-Wei Liu 
---
 drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index aa6f8cd..444ab58 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1785,7 +1785,6 @@ static void mlx5_vdpa_set_status(struct vdpa_device 
*vdev, u8 status)
if (!status) {
mlx5_vdpa_info(mvdev, "performing device reset\n");
teardown_driver(ndev);
-   clear_virtqueues(ndev);
mlx5_vdpa_destroy_mr(&ndev->mvdev);
ndev->mvdev.status = 0;
++mvdev->generation;
@@ -1794,6 +1793,7 @@ static void mlx5_vdpa_set_status(struct vdpa_device 
*vdev, u8 status)
 
if ((status ^ ndev->mvdev.status) & VIRTIO_CONFIG_S_DRIVER_OK) {
if (status & VIRTIO_CONFIG_S_DRIVER_OK) {
+   clear_virtqueues(ndev);
err = setup_driver(ndev);
if (err) {
mlx5_vdpa_warn(mvdev, "failed to setup 
driver\n");
-- 
1.8.3.1



Re: [PATCH 2/3] mlx5_vdpa: fix feature negotiation across device reset

2021-02-08 Thread Si-Wei Liu




On 2/7/2021 9:35 PM, Eli Cohen wrote:

On Sat, Feb 06, 2021 at 04:29:23AM -0800, Si-Wei Liu wrote:

The mlx_features denotes the capability for which
set of virtio features is supported by device. In
principle, this field needs not be cleared during
virtio device reset, as this capability is static
and does not change across reset.

In fact, the current code may have the assumption
that mlx_features can be reloaded from firmware
via the .get_features ops after device is reset
(via the .set_status ops), which is unfortunately
not true. The userspace VMM might save a copy
of backend capable features and won't call into
kernel again to get it on reset. This causes all
virtio features getting disabled on newly created
virtqs after device reset, while guest would hold
mismatched view of available features. For e.g.,
the guest may still assume tx checksum offload
is available after reset and feature negotiation,
causing frames with bogus (incomplete) checksum
transmitted on the wire.

Signed-off-by: Si-Wei Liu 
---
  drivers/vdpa/mlx5/net/mlx5_vnet.c | 1 -
  1 file changed, 1 deletion(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index b8416c4..aa6f8cd 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1788,7 +1788,6 @@ static void mlx5_vdpa_set_status(struct vdpa_device 
*vdev, u8 status)
clear_virtqueues(ndev);
mlx5_vdpa_destroy_mr(&ndev->mvdev);
ndev->mvdev.status = 0;
-   ndev->mvdev.mlx_features = 0;
++mvdev->generation;
return;
}

Since we assume that device capabilities don't change, I think I would
get the features through a call done in mlx5v_probe after the netdev
object is created and change mlx5_vdpa_get_features() to just return
ndev->mvdev.mlx_features.
Yep, it makes sense. Will post a revised patch. If vdpa tool allows 
reconfiguration post probing, the code has to be reconciled then.




Did you actually see this issue in action? If you did, can you share
with us how you trigerred this?
Issue is indeed seen in action. The mismatched tx-checksum offload as 
described in the commit message was one of such examples. You would need 
a guest reboot though (triggering device reset via the .set_status ops 
and zero'ed mlx_features was loaded to deduce new actual_features for 
creating the h/w virtq object) for repro.


-Siwei



--
1.8.3.1





Re: [PATCH 3/3] mlx5_vdpa: defer clear_virtqueues to until DRIVER_OK

2021-02-08 Thread Si-Wei Liu




On 2/7/2021 9:48 PM, Eli Cohen wrote:

On Sat, Feb 06, 2021 at 04:29:24AM -0800, Si-Wei Liu wrote:

While virtq is stopped,  get_vq_state() is supposed to
be  called to  get  sync'ed  with  the latest internal
avail_index from device. The saved avail_index is used
to restate  the virtq  once device is started.  Commit
b35ccebe3ef7 introduced the clear_virtqueues() routine
to  reset  the saved  avail_index,  however, the index
gets cleared a bit earlier before get_vq_state() tries
to read it. This would cause consistency problems when
virtq is restarted, e.g. through a series of link down
and link up events. We  could  defer  the  clearing of
avail_index  to  until  the  device  is to be started,
i.e. until  VIRTIO_CONFIG_S_DRIVER_OK  is set again in
set_status().


Not sure I understand the scenario. You are talking about reset of the
device followed by up/down events on the interface. How can you trigger
this?
Currently it's not possible to trigger link up/down events with upstream 
QEMU due lack of config/control interrupt implementation. And live 
migration could be another scenario that cannot be satisfied because of 
inconsistent queue state. They share the same root of cause as captured 
here.


-Siwei




Fixes: b35ccebe3ef7 ("vdpa/mlx5: Restore the hardware used index after change 
map")
Signed-off-by: Si-Wei Liu 
---
  drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index aa6f8cd..444ab58 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1785,7 +1785,6 @@ static void mlx5_vdpa_set_status(struct vdpa_device 
*vdev, u8 status)
if (!status) {
mlx5_vdpa_info(mvdev, "performing device reset\n");
teardown_driver(ndev);
-   clear_virtqueues(ndev);
mlx5_vdpa_destroy_mr(&ndev->mvdev);
ndev->mvdev.status = 0;
++mvdev->generation;
@@ -1794,6 +1793,7 @@ static void mlx5_vdpa_set_status(struct vdpa_device 
*vdev, u8 status)
  
  	if ((status ^ ndev->mvdev.status) & VIRTIO_CONFIG_S_DRIVER_OK) {

if (status & VIRTIO_CONFIG_S_DRIVER_OK) {
+   clear_virtqueues(ndev);
err = setup_driver(ndev);
if (err) {
mlx5_vdpa_warn(mvdev, "failed to setup 
driver\n");
--
1.8.3.1





Re: [PATCH 3/3] mlx5_vdpa: defer clear_virtqueues to until DRIVER_OK

2021-02-09 Thread Si-Wei Liu




On 2/8/2021 7:37 PM, Jason Wang wrote:


On 2021/2/6 下午8:29, Si-Wei Liu wrote:

While virtq is stopped,  get_vq_state() is supposed to
be  called to  get  sync'ed  with  the latest internal
avail_index from device. The saved avail_index is used
to restate  the virtq  once device is started.  Commit
b35ccebe3ef7 introduced the clear_virtqueues() routine
to  reset  the saved  avail_index,  however, the index
gets cleared a bit earlier before get_vq_state() tries
to read it. This would cause consistency problems when
virtq is restarted, e.g. through a series of link down
and link up events. We  could  defer  the  clearing of
avail_index  to  until  the  device  is to be started,
i.e. until  VIRTIO_CONFIG_S_DRIVER_OK  is set again in
set_status().

Fixes: b35ccebe3ef7 ("vdpa/mlx5: Restore the hardware used index 
after change map")

Signed-off-by: Si-Wei Liu 
---
  drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c

index aa6f8cd..444ab58 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1785,7 +1785,6 @@ static void mlx5_vdpa_set_status(struct 
vdpa_device *vdev, u8 status)

  if (!status) {
  mlx5_vdpa_info(mvdev, "performing device reset\n");
  teardown_driver(ndev);
-    clear_virtqueues(ndev);
  mlx5_vdpa_destroy_mr(&ndev->mvdev);
  ndev->mvdev.status = 0;
  ++mvdev->generation;
@@ -1794,6 +1793,7 @@ static void mlx5_vdpa_set_status(struct 
vdpa_device *vdev, u8 status)

    if ((status ^ ndev->mvdev.status) & VIRTIO_CONFIG_S_DRIVER_OK) {
  if (status & VIRTIO_CONFIG_S_DRIVER_OK) {
+    clear_virtqueues(ndev);



Rethink about this. As mentioned in another thread, this in fact 
breaks set_vq_state().  (See vhost_virtqueue_start() -> 
vhost_vdpa_set_vring_base() in qemu codes).

I assume that the clearing for vhost-vdpa would be done via (qemu code),

vhost_dev_start()->vhost_vdpa_dev_start()->vhost_vdpa_call(status | 
VIRTIO_CONFIG_S_DRIVER_OK)


which is _after_ vhost_virtqueue_start() gets called to restore the 
avail_idx to h/w in vhost_dev_start(). What am I missing here?


-Siwei




The issue is that the avail idx is forgot, we need keep it.

Thanks



  err = setup_driver(ndev);
  if (err) {
  mlx5_vdpa_warn(mvdev, "failed to setup driver\n");






Re: [PATCH v1] vdpa/mlx5: Restore the hardware used index after change map

2021-02-09 Thread Si-Wei Liu




On 2/8/2021 10:37 PM, Jason Wang wrote:


On 2021/2/9 下午2:12, Eli Cohen wrote:

On Tue, Feb 09, 2021 at 11:20:14AM +0800, Jason Wang wrote:

On 2021/2/8 下午6:04, Eli Cohen wrote:

On Mon, Feb 08, 2021 at 05:04:27PM +0800, Jason Wang wrote:

On 2021/2/8 下午2:37, Eli Cohen wrote:

On Mon, Feb 08, 2021 at 12:27:18PM +0800, Jason Wang wrote:

On 2021/2/6 上午7:07, Si-Wei Liu wrote:

On 2/3/2021 11:36 PM, Eli Cohen wrote:
When a change of memory map occurs, the hardware resources are 
destroyed
and then re-created again with the new memory map. In such 
case, we need
to restore the hardware available and used indices. The driver 
failed to

restore the used index which is added here.

Also, since the driver also fails to reset the available and used
indices upon device reset, fix this here to avoid regression 
caused by

the fact that used index may not be zero upon device reset.

Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported 
mlx5

devices")
Signed-off-by: Eli Cohen
---
v0 -> v1:
Clear indices upon device reset

     drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 ++
     1 file changed, 18 insertions(+)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 88dde3455bfd..b5fe6d2ad22f 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -87,6 +87,7 @@ struct mlx5_vq_restore_info {
     u64 device_addr;
     u64 driver_addr;
     u16 avail_index;
+    u16 used_index;
     bool ready;
     struct vdpa_callback cb;
     bool restore;
@@ -121,6 +122,7 @@ struct mlx5_vdpa_virtqueue {
     u32 virtq_id;
     struct mlx5_vdpa_net *ndev;
     u16 avail_idx;
+    u16 used_idx;
     int fw_state;
       /* keep last in the struct */
@@ -804,6 +806,7 @@ static int create_virtqueue(struct 
mlx5_vdpa_net

*ndev, struct mlx5_vdpa_virtque
       obj_context = MLX5_ADDR_OF(create_virtio_net_q_in, in,
obj_context);
     MLX5_SET(virtio_net_q_object, obj_context, 
hw_available_index,

mvq->avail_idx);
+    MLX5_SET(virtio_net_q_object, obj_context, hw_used_index,
mvq->used_idx);
     MLX5_SET(virtio_net_q_object, obj_context,
queue_feature_bit_mask_12_3,
get_features_12_3(ndev->mvdev.actual_features));
     vq_ctx = MLX5_ADDR_OF(virtio_net_q_object, obj_context,
virtio_q_context);
@@ -1022,6 +1025,7 @@ static int connect_qps(struct mlx5_vdpa_net
*ndev, struct mlx5_vdpa_virtqueue *m
     struct mlx5_virtq_attr {
     u8 state;
     u16 available_index;
+    u16 used_index;
     };
       static int query_virtqueue(struct mlx5_vdpa_net *ndev, 
struct

mlx5_vdpa_virtqueue *mvq,
@@ -1052,6 +1056,7 @@ static int query_virtqueue(struct
mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueu
     memset(attr, 0, sizeof(*attr));
     attr->state = MLX5_GET(virtio_net_q_object, 
obj_context, state);

     attr->available_index = MLX5_GET(virtio_net_q_object,
obj_context, hw_available_index);
+    attr->used_index = MLX5_GET(virtio_net_q_object, 
obj_context,

hw_used_index);
     kfree(out);
     return 0;
     @@ -1535,6 +1540,16 @@ static void 
teardown_virtqueues(struct

mlx5_vdpa_net *ndev)
     }
     }
     +static void clear_virtqueues(struct mlx5_vdpa_net *ndev)
+{
+    int i;
+
+    for (i = ndev->mvdev.max_vqs - 1; i >= 0; i--) {
+    ndev->vqs[i].avail_idx = 0;
+    ndev->vqs[i].used_idx = 0;
+    }
+}
+
     /* TODO: cross-endian support */
     static inline bool mlx5_vdpa_is_little_endian(struct 
mlx5_vdpa_dev

*mvdev)
     {
@@ -1610,6 +1625,7 @@ static int save_channel_info(struct
mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqu
     return err;
       ri->avail_index = attr.available_index;
+    ri->used_index = attr.used_index;
     ri->ready = mvq->ready;
     ri->num_ent = mvq->num_ent;
     ri->desc_addr = mvq->desc_addr;
@@ -1654,6 +1670,7 @@ static void restore_channels_info(struct
mlx5_vdpa_net *ndev)
     continue;
       mvq->avail_idx = ri->avail_index;
+    mvq->used_idx = ri->used_index;
     mvq->ready = ri->ready;
     mvq->num_ent = ri->num_ent;
     mvq->desc_addr = ri->desc_addr;
@@ -1768,6 +1785,7 @@ static void mlx5_vdpa_set_status(struct
vdpa_device *vdev, u8 status)
     if (!status) {
     mlx5_vdpa_info(mvdev, "performing device reset\n");
     teardown_driver(ndev);
+    clear_virtqueues(ndev);
The clearing looks fine at the first glance, as it aligns with 
the other
state cleanups floating around at the same place. However, the 
thing is
get_vq_state() is supposed to be called right after to get 
sync'ed with
the latest internal avail_index from device while vq is 
stopped. The
index was saved in the driver software at vq suspension, but 
before the
virtq object is destroyed. We shouldn't clear the avai

Re: [PATCH v1] vdpa/mlx5: Restore the hardware used index after change map

2021-02-10 Thread Si-Wei Liu




On 2/9/2021 7:53 PM, Jason Wang wrote:


On 2021/2/10 上午10:30, Si-Wei Liu wrote:



On 2/8/2021 10:37 PM, Jason Wang wrote:


On 2021/2/9 下午2:12, Eli Cohen wrote:

On Tue, Feb 09, 2021 at 11:20:14AM +0800, Jason Wang wrote:

On 2021/2/8 下午6:04, Eli Cohen wrote:

On Mon, Feb 08, 2021 at 05:04:27PM +0800, Jason Wang wrote:

On 2021/2/8 下午2:37, Eli Cohen wrote:

On Mon, Feb 08, 2021 at 12:27:18PM +0800, Jason Wang wrote:

On 2021/2/6 上午7:07, Si-Wei Liu wrote:

On 2/3/2021 11:36 PM, Eli Cohen wrote:
When a change of memory map occurs, the hardware resources 
are destroyed
and then re-created again with the new memory map. In such 
case, we need
to restore the hardware available and used indices. The 
driver failed to

restore the used index which is added here.

Also, since the driver also fails to reset the available and 
used
indices upon device reset, fix this here to avoid regression 
caused by

the fact that used index may not be zero upon device reset.

Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for 
supported mlx5

devices")
Signed-off-by: Eli Cohen
---
v0 -> v1:
Clear indices upon device reset

     drivers/vdpa/mlx5/net/mlx5_vnet.c | 18 ++
     1 file changed, 18 insertions(+)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 88dde3455bfd..b5fe6d2ad22f 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -87,6 +87,7 @@ struct mlx5_vq_restore_info {
     u64 device_addr;
     u64 driver_addr;
     u16 avail_index;
+    u16 used_index;
     bool ready;
     struct vdpa_callback cb;
     bool restore;
@@ -121,6 +122,7 @@ struct mlx5_vdpa_virtqueue {
     u32 virtq_id;
     struct mlx5_vdpa_net *ndev;
     u16 avail_idx;
+    u16 used_idx;
     int fw_state;
       /* keep last in the struct */
@@ -804,6 +806,7 @@ static int create_virtqueue(struct 
mlx5_vdpa_net

*ndev, struct mlx5_vdpa_virtque
       obj_context = 
MLX5_ADDR_OF(create_virtio_net_q_in, in,

obj_context);
     MLX5_SET(virtio_net_q_object, obj_context, 
hw_available_index,

mvq->avail_idx);
+    MLX5_SET(virtio_net_q_object, obj_context, hw_used_index,
mvq->used_idx);
     MLX5_SET(virtio_net_q_object, obj_context,
queue_feature_bit_mask_12_3,
get_features_12_3(ndev->mvdev.actual_features));
     vq_ctx = MLX5_ADDR_OF(virtio_net_q_object, 
obj_context,

virtio_q_context);
@@ -1022,6 +1025,7 @@ static int connect_qps(struct 
mlx5_vdpa_net

*ndev, struct mlx5_vdpa_virtqueue *m
     struct mlx5_virtq_attr {
     u8 state;
     u16 available_index;
+    u16 used_index;
     };
       static int query_virtqueue(struct mlx5_vdpa_net 
*ndev, struct

mlx5_vdpa_virtqueue *mvq,
@@ -1052,6 +1056,7 @@ static int query_virtqueue(struct
mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqueu
     memset(attr, 0, sizeof(*attr));
     attr->state = MLX5_GET(virtio_net_q_object, 
obj_context, state);

     attr->available_index = MLX5_GET(virtio_net_q_object,
obj_context, hw_available_index);
+    attr->used_index = MLX5_GET(virtio_net_q_object, 
obj_context,

hw_used_index);
     kfree(out);
     return 0;
     @@ -1535,6 +1540,16 @@ static void 
teardown_virtqueues(struct

mlx5_vdpa_net *ndev)
     }
     }
     +static void clear_virtqueues(struct mlx5_vdpa_net *ndev)
+{
+    int i;
+
+    for (i = ndev->mvdev.max_vqs - 1; i >= 0; i--) {
+    ndev->vqs[i].avail_idx = 0;
+    ndev->vqs[i].used_idx = 0;
+    }
+}
+
     /* TODO: cross-endian support */
     static inline bool mlx5_vdpa_is_little_endian(struct 
mlx5_vdpa_dev

*mvdev)
     {
@@ -1610,6 +1625,7 @@ static int save_channel_info(struct
mlx5_vdpa_net *ndev, struct mlx5_vdpa_virtqu
     return err;
       ri->avail_index = attr.available_index;
+    ri->used_index = attr.used_index;
     ri->ready = mvq->ready;
     ri->num_ent = mvq->num_ent;
     ri->desc_addr = mvq->desc_addr;
@@ -1654,6 +1670,7 @@ static void restore_channels_info(struct
mlx5_vdpa_net *ndev)
     continue;
       mvq->avail_idx = ri->avail_index;
+    mvq->used_idx = ri->used_index;
     mvq->ready = ri->ready;
     mvq->num_ent = ri->num_ent;
     mvq->desc_addr = ri->desc_addr;
@@ -1768,6 +1785,7 @@ static void mlx5_vdpa_set_status(struct
vdpa_device *vdev, u8 status)
     if (!status) {
     mlx5_vdpa_info(mvdev, "performing device 
reset\n");

     teardown_driver(ndev);
+    clear_virtqueues(ndev);
The clearing looks fine at the first glance, as it aligns 
with the other
state cleanups floating around at the same place. However, 
the thing is
get_vq_state() is supposed to be called right after to get 
sync'ed with
the latest internal avail_index from device while vq is 
stopped. The
index was saved in the driver software a

[PATCH v2 2/3] vdpa/mlx5: fix feature negotiation across device reset

2021-02-10 Thread Si-Wei Liu
The mlx_features denotes the capability for which
set of virtio features is supported by device. In
principle, this field needs not be cleared during
virtio device reset, as this capability is static
and does not change across reset.

In fact, the current code may have the assumption
that mlx_features can be reloaded from firmware
via the .get_features ops after device is reset
(via the .set_status ops), which is unfortunately
not true. The userspace VMM might save a copy
of backend capable features and won't call into
kernel again to get it on reset. This causes all
virtio features getting disabled on newly created
virtqs after device reset, while guest would hold
mismatched view of available features. For e.g.,
the guest may still assume tx checksum offload
is available after reset and feature negotiation,
causing frames with bogus (incomplete) checksum
transmitted on the wire.

Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices")
Signed-off-by: Si-Wei Liu 
---
 drivers/vdpa/mlx5/net/mlx5_vnet.c | 25 +++--
 1 file changed, 15 insertions(+), 10 deletions(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index b8416c4..7c1f789 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1486,16 +1486,8 @@ static u64 mlx_to_vritio_features(u16 dev_features)
 static u64 mlx5_vdpa_get_features(struct vdpa_device *vdev)
 {
struct mlx5_vdpa_dev *mvdev = to_mvdev(vdev);
-   struct mlx5_vdpa_net *ndev = to_mlx5_vdpa_ndev(mvdev);
-   u16 dev_features;
 
-   dev_features = MLX5_CAP_DEV_VDPA_EMULATION(mvdev->mdev, 
device_features_bits_mask);
-   ndev->mvdev.mlx_features = mlx_to_vritio_features(dev_features);
-   if (MLX5_CAP_DEV_VDPA_EMULATION(mvdev->mdev, virtio_version_1_0))
-   ndev->mvdev.mlx_features |= BIT_ULL(VIRTIO_F_VERSION_1);
-   ndev->mvdev.mlx_features |= BIT_ULL(VIRTIO_F_ACCESS_PLATFORM);
-   print_features(mvdev, ndev->mvdev.mlx_features, false);
-   return ndev->mvdev.mlx_features;
+   return mvdev->mlx_features;
 }
 
 static int verify_min_features(struct mlx5_vdpa_dev *mvdev, u64 features)
@@ -1788,7 +1780,6 @@ static void mlx5_vdpa_set_status(struct vdpa_device 
*vdev, u8 status)
clear_virtqueues(ndev);
mlx5_vdpa_destroy_mr(&ndev->mvdev);
ndev->mvdev.status = 0;
-   ndev->mvdev.mlx_features = 0;
++mvdev->generation;
return;
}
@@ -1907,6 +1898,19 @@ static int mlx5_get_vq_irq(struct vdpa_device *vdv, u16 
idx)
.free = mlx5_vdpa_free,
 };
 
+static void query_virtio_features(struct mlx5_vdpa_net *ndev)
+{
+   struct mlx5_vdpa_dev *mvdev = &ndev->mvdev;
+   u16 dev_features;
+
+   dev_features = MLX5_CAP_DEV_VDPA_EMULATION(mvdev->mdev, 
device_features_bits_mask);
+   mvdev->mlx_features = mlx_to_vritio_features(dev_features);
+   if (MLX5_CAP_DEV_VDPA_EMULATION(mvdev->mdev, virtio_version_1_0))
+   mvdev->mlx_features |= BIT_ULL(VIRTIO_F_VERSION_1);
+   mvdev->mlx_features |= BIT_ULL(VIRTIO_F_ACCESS_PLATFORM);
+   print_features(mvdev, mvdev->mlx_features, false);
+}
+
 static int query_mtu(struct mlx5_core_dev *mdev, u16 *mtu)
 {
u16 hw_mtu;
@@ -2005,6 +2009,7 @@ static int mlx5v_probe(struct auxiliary_device *adev,
init_mvqs(ndev);
mutex_init(&ndev->reslock);
config = &ndev->config;
+   query_virtio_features(ndev);
err = query_mtu(mdev, &ndev->mtu);
if (err)
goto err_mtu;
-- 
1.8.3.1



[PATCH v2 3/3] vdpa/mlx5: defer clear_virtqueues to until DRIVER_OK

2021-02-10 Thread Si-Wei Liu
While virtq is stopped,  get_vq_state() is supposed to
be  called to  get  sync'ed  with  the latest internal
avail_index from device. The saved avail_index is used
to restate  the virtq  once device is started.  Commit
b35ccebe3ef7 introduced the clear_virtqueues() routine
to  reset  the saved  avail_index,  however, the index
gets cleared a bit earlier before get_vq_state() tries
to read it. This would cause consistency problems when
virtq is restarted, e.g. through a series of link down
and link up events. We  could  defer  the  clearing of
avail_index  to  until  the  device  is to be started,
i.e. until  VIRTIO_CONFIG_S_DRIVER_OK  is set again in
set_status().

Fixes: b35ccebe3ef7 ("vdpa/mlx5: Restore the hardware used index after change 
map")
Signed-off-by: Si-Wei Liu 
Acked-by: Jason Wang 
---
 drivers/vdpa/mlx5/net/mlx5_vnet.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 7c1f789..ce6aae8 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1777,7 +1777,6 @@ static void mlx5_vdpa_set_status(struct vdpa_device 
*vdev, u8 status)
if (!status) {
mlx5_vdpa_info(mvdev, "performing device reset\n");
teardown_driver(ndev);
-   clear_virtqueues(ndev);
mlx5_vdpa_destroy_mr(&ndev->mvdev);
ndev->mvdev.status = 0;
++mvdev->generation;
@@ -1786,6 +1785,7 @@ static void mlx5_vdpa_set_status(struct vdpa_device 
*vdev, u8 status)
 
if ((status ^ ndev->mvdev.status) & VIRTIO_CONFIG_S_DRIVER_OK) {
if (status & VIRTIO_CONFIG_S_DRIVER_OK) {
+   clear_virtqueues(ndev);
err = setup_driver(ndev);
if (err) {
mlx5_vdpa_warn(mvdev, "failed to setup 
driver\n");
-- 
1.8.3.1



[PATCH v2 0/3] mlx5_vdpa bug fixes

2021-02-10 Thread Si-Wei Liu
This set attempts to fix a few independent issues recently found
in mlx5_vdpa driver. Please find details for each in the commit
message.

Patch 1 and patch 3 are already Ack'ed by Jason Wang. Patch 2 is
reworked to move virtio feature capability query to mlx5v_probe()
as suggested by Eli.

--
v1->v2: move feature capability query to probing (Eli)

Si-Wei Liu (3):
  vdpa/mlx5: should exclude header length and fcs from mtu
  vdpa/mlx5: fix feature negotiation across device reset
  vdpa/mlx5: defer clear_virtqueues to until DRIVER_OK

 drivers/vdpa/mlx5/core/mlx5_vdpa.h |  4 
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 42 +++---
 2 files changed, 34 insertions(+), 12 deletions(-)

-- 
1.8.3.1



[PATCH v2 1/3] vdpa/mlx5: should exclude header length and fcs from mtu

2021-02-10 Thread Si-Wei Liu
When feature VIRTIO_NET_F_MTU is negotiated on mlx5_vdpa,
22 extra bytes worth of MTU length is shown in guest.
This is because the mlx5_query_port_max_mtu API returns
the "hardware" MTU value, which does not just contain the
 Ethernet payload, but includes extra lengths starting
from the Ethernet header up to the FCS altogether.

Fix the MTU so packets won't get dropped silently.

Fixes: 1a86b377aa21 ("vdpa/mlx5: Add VDPA driver for supported mlx5 devices")
Signed-off-by: Si-Wei Liu 
Acked-by: Jason Wang 
Acked-by: Eli Cohen 
---
 drivers/vdpa/mlx5/core/mlx5_vdpa.h |  4 
 drivers/vdpa/mlx5/net/mlx5_vnet.c  | 15 ++-
 2 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/drivers/vdpa/mlx5/core/mlx5_vdpa.h 
b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
index 08f742f..b6cc53b 100644
--- a/drivers/vdpa/mlx5/core/mlx5_vdpa.h
+++ b/drivers/vdpa/mlx5/core/mlx5_vdpa.h
@@ -4,9 +4,13 @@
 #ifndef __MLX5_VDPA_H__
 #define __MLX5_VDPA_H__
 
+#include 
+#include 
 #include 
 #include 
 
+#define MLX5V_ETH_HARD_MTU (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN)
+
 struct mlx5_vdpa_direct_mr {
u64 start;
u64 end;
diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index dc88559..b8416c4 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1907,6 +1907,19 @@ static int mlx5_get_vq_irq(struct vdpa_device *vdv, u16 
idx)
.free = mlx5_vdpa_free,
 };
 
+static int query_mtu(struct mlx5_core_dev *mdev, u16 *mtu)
+{
+   u16 hw_mtu;
+   int err;
+
+   err = mlx5_query_nic_vport_mtu(mdev, &hw_mtu);
+   if (err)
+   return err;
+
+   *mtu = hw_mtu - MLX5V_ETH_HARD_MTU;
+   return 0;
+}
+
 static int alloc_resources(struct mlx5_vdpa_net *ndev)
 {
struct mlx5_vdpa_net_resources *res = &ndev->res;
@@ -1992,7 +2005,7 @@ static int mlx5v_probe(struct auxiliary_device *adev,
init_mvqs(ndev);
mutex_init(&ndev->reslock);
config = &ndev->config;
-   err = mlx5_query_nic_vport_mtu(mdev, &ndev->mtu);
+   err = query_mtu(mdev, &ndev->mtu);
if (err)
goto err_mtu;
 
-- 
1.8.3.1



Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-02-21 Thread si-wei liu




On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:

On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:

Sorry for replying to this ancient thread. There was some remaining
issue that I don't think the initial net_failover patch got addressed
cleanly, see:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268

The renaming of 'eth0' to 'ens4' fails because the udev userspace was
not specifically writtten for such kernel automatic enslavement.
Specifically, if it is a bond or team, the slave would typically get
renamed *before* virtual device gets created, that's what udev can
control (without getting netdev opened early by the other part of
kernel) and other userspace components for e.g. initramfs,
init-scripts can coordinate well in between. The in-kernel
auto-enslavement of net_failover breaks this userspace convention,
which don't provides a solution if user care about consistent naming
on the slave netdevs specifically.

Previously this issue had been specifically called out when IFF_HIDDEN
and the 1-netdev was proposed, but no one gives out a solution to this
problem ever since. Please share your mind how to proceed and solve
this userspace issue if netdev does not welcome a 1-netdev model.

Above says:

there's no motivation in the systemd/udevd community at
this point to refactor the rename logic and make it work well with
3-netdev.

What would the fix be? Skip slave devices?

There's nothing user can get if just skipping slave devices - the name 
is still unchanged and unpredictable e.g. eth0, or eth1 the next reboot, 
while the rest may conform to the naming scheme (ens3 and such). There's 
no way one can fix this in userspace alone - when the failover is 
created the enslaved netdev was opened by the kernel earlier than the 
userspace is made aware of, and there's no negotiation protocol for 
kernel to know when userspace has done initial renaming of the 
interface. I would expect netdev list should at least provide the 
direction in general for how this can be solved...


-Siwei




Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-02-21 Thread si-wei liu




On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:



On 2/21/2019 7:33 PM, si-wei liu wrote:



On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:

On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:

Sorry for replying to this ancient thread. There was some remaining
issue that I don't think the initial net_failover patch got addressed
cleanly, see:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268

The renaming of 'eth0' to 'ens4' fails because the udev userspace was
not specifically writtten for such kernel automatic enslavement.
Specifically, if it is a bond or team, the slave would typically get
renamed *before* virtual device gets created, that's what udev can
control (without getting netdev opened early by the other part of
kernel) and other userspace components for e.g. initramfs,
init-scripts can coordinate well in between. The in-kernel
auto-enslavement of net_failover breaks this userspace convention,
which don't provides a solution if user care about consistent naming
on the slave netdevs specifically.

Previously this issue had been specifically called out when IFF_HIDDEN
and the 1-netdev was proposed, but no one gives out a solution to this
problem ever since. Please share your mind how to proceed and solve
this userspace issue if netdev does not welcome a 1-netdev model.

Above says:

there's no motivation in the systemd/udevd community at
this point to refactor the rename logic and make it work well with
3-netdev.

What would the fix be? Skip slave devices?

There's nothing user can get if just skipping slave devices - the 
name is still unchanged and unpredictable e.g. eth0, or eth1 the next 
reboot, while the rest may conform to the naming scheme (ens3 and 
such). There's no way one can fix this in userspace alone - when the 
failover is created the enslaved netdev was opened by the kernel 
earlier than the userspace is made aware of, and there's no 
negotiation protocol for kernel to know when userspace has done 
initial renaming of the interface. I would expect netdev list should 
at least provide the direction in general for how this can be solved...



Is there an issue if slave device names are not predictable? The user/admin 
scripts are expected
to only work with the master failover device.

Where does this expectation come from?

Admin users may have ethtool or tc configurations that need to deal with 
predictable interface name. Third-party app which was built upon 
specifying certain interface name can't be modified to chase dynamic names.


Specifically, we have pre-canned image that uses ethtool to fine tune VF 
offload settings post boot for specific workload. Those images won't 
work well if the name is constantly changing just after couple rounds of 
live migration.



Moreover, you were suggesting hiding the lower slave devices anyway. There was 
some discussion
about moving them to a hidden network namespace so that they are not visible 
from the default namespace.
I looked into this sometime back, but did not find the right kernel api to 
create a network namespace within
kernel. If so, we could use this mechanism to simulate a 1-netdev model.
Yes, that's one possible implementation (IMHO the key is to make 
1-netdev model as much transparent to a real NIC as possible, while a 
hidden netns is just the vehicle). However, I recall there was 
resistance around this discussion that even the concept of hiding itself 
is a taboo for Linux netdev. I would like to summon potential 
alternatives before concluding 1-netdev is the only solution too soon.


Thanks,
-Siwei




-Siwei






Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-02-25 Thread si-wei liu



On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:

On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:


On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:


On 2/21/2019 7:33 PM, si-wei liu wrote:


On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:

On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:

Sorry for replying to this ancient thread. There was some remaining
issue that I don't think the initial net_failover patch got addressed
cleanly, see:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268

The renaming of 'eth0' to 'ens4' fails because the udev userspace was
not specifically writtten for such kernel automatic enslavement.
Specifically, if it is a bond or team, the slave would typically get
renamed *before* virtual device gets created, that's what udev can
control (without getting netdev opened early by the other part of
kernel) and other userspace components for e.g. initramfs,
init-scripts can coordinate well in between. The in-kernel
auto-enslavement of net_failover breaks this userspace convention,
which don't provides a solution if user care about consistent naming
on the slave netdevs specifically.

Previously this issue had been specifically called out when IFF_HIDDEN
and the 1-netdev was proposed, but no one gives out a solution to this
problem ever since. Please share your mind how to proceed and solve
this userspace issue if netdev does not welcome a 1-netdev model.

Above says:

 there's no motivation in the systemd/udevd community at
 this point to refactor the rename logic and make it work well with
 3-netdev.

What would the fix be? Skip slave devices?


There's nothing user can get if just skipping slave devices - the
name is still unchanged and unpredictable e.g. eth0, or eth1 the
next reboot, while the rest may conform to the naming scheme (ens3
and such). There's no way one can fix this in userspace alone - when
the failover is created the enslaved netdev was opened by the kernel
earlier than the userspace is made aware of, and there's no
negotiation protocol for kernel to know when userspace has done
initial renaming of the interface. I would expect netdev list should
at least provide the direction in general for how this can be
solved...


I was just wondering what did you mean when you said
"refactor the rename logic and make it work well with 3-netdev" -
was there a proposal udev rejected?
No. I never believed this particular issue can be fixed in userspace 
alone. Previously someone had said it could be, but I never see any work 
or relevant discussion ever happened in various userspace communities 
(for e.g. dracut, initramfs-tools, systemd, udev, and NetworkManager). 
IMHO the root of the issue derives from the kernel, it makes more sense 
to start from netdev, work out and decide on a solution: see what can be 
done in the kernel in order to fix it, then after that engage userspace 
community for the feasibility...



Anyway, can we write a time diagram for what happens in which order that
leads to failure?  That would help look for triggers that we can tie
into, or add new ones.



See attached diagram.







Is there an issue if slave device names are not predictable? The user/admin 
scripts are expected
to only work with the master failover device.

Where does this expectation come from?

Admin users may have ethtool or tc configurations that need to deal with
predictable interface name. Third-party app which was built upon specifying
certain interface name can't be modified to chase dynamic names.

Specifically, we have pre-canned image that uses ethtool to fine tune VF
offload settings post boot for specific workload. Those images won't work
well if the name is constantly changing just after couple rounds of live
migration.

It should be possible to specify the ethtool configuration on the
master and have it automatically propagated to the slave.

BTW this is something we should look at IMHO.
I was elaborating a few examples that the expectation and assumption 
that user/admin scripts only deal with master failover device is 
incorrect. It had never been taken good care of, although I did try to 
emphasize it from the very beginning.


Basically what you said about propagating the ethtool configuration down 
to the slave is the key pursuance of 1-netdev model. However, what I am 
seeking now is any alternative that can also fix the specific udev 
rename problem, before concluding that 1-netdev is the only solution. 
Generally a 1-netdev scheme would take time to implement, while I'm 
trying to find a way out to fix this particular naming problem under 
3-netdev.





Moreover, you were suggesting hiding the lower slave devices anyway. There was 
some discussion
about moving them to a hidden network namespace so that they are not visible 
from the default namespace.
I looked into this sometime back, but did not find the right kernel api to 
create a

Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-02-26 Thread si-wei liu




On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:

On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:


On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:

On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:

On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:

On 2/21/2019 7:33 PM, si-wei liu wrote:

On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:

On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:

Sorry for replying to this ancient thread. There was some remaining
issue that I don't think the initial net_failover patch got addressed
cleanly, see:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268

The renaming of 'eth0' to 'ens4' fails because the udev userspace was
not specifically writtten for such kernel automatic enslavement.
Specifically, if it is a bond or team, the slave would typically get
renamed *before* virtual device gets created, that's what udev can
control (without getting netdev opened early by the other part of
kernel) and other userspace components for e.g. initramfs,
init-scripts can coordinate well in between. The in-kernel
auto-enslavement of net_failover breaks this userspace convention,
which don't provides a solution if user care about consistent naming
on the slave netdevs specifically.

Previously this issue had been specifically called out when IFF_HIDDEN
and the 1-netdev was proposed, but no one gives out a solution to this
problem ever since. Please share your mind how to proceed and solve
this userspace issue if netdev does not welcome a 1-netdev model.

Above says:

  there's no motivation in the systemd/udevd community at
  this point to refactor the rename logic and make it work well with
  3-netdev.

What would the fix be? Skip slave devices?


There's nothing user can get if just skipping slave devices - the
name is still unchanged and unpredictable e.g. eth0, or eth1 the
next reboot, while the rest may conform to the naming scheme (ens3
and such). There's no way one can fix this in userspace alone - when
the failover is created the enslaved netdev was opened by the kernel
earlier than the userspace is made aware of, and there's no
negotiation protocol for kernel to know when userspace has done
initial renaming of the interface. I would expect netdev list should
at least provide the direction in general for how this can be
solved...

I was just wondering what did you mean when you said
"refactor the rename logic and make it work well with 3-netdev" -
was there a proposal udev rejected?

No. I never believed this particular issue can be fixed in userspace alone.
Previously someone had said it could be, but I never see any work or
relevant discussion ever happened in various userspace communities (for e.g.
dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
of the issue derives from the kernel, it makes more sense to start from
netdev, work out and decide on a solution: see what can be done in the
kernel in order to fix it, then after that engage userspace community for
the feasibility...


Anyway, can we write a time diagram for what happens in which order that
leads to failure?  That would help look for triggers that we can tie
into, or add new ones.


See attached diagram.






Is there an issue if slave device names are not predictable? The user/admin 
scripts are expected
to only work with the master failover device.

Where does this expectation come from?

Admin users may have ethtool or tc configurations that need to deal with
predictable interface name. Third-party app which was built upon specifying
certain interface name can't be modified to chase dynamic names.

Specifically, we have pre-canned image that uses ethtool to fine tune VF
offload settings post boot for specific workload. Those images won't work
well if the name is constantly changing just after couple rounds of live
migration.

It should be possible to specify the ethtool configuration on the
master and have it automatically propagated to the slave.

BTW this is something we should look at IMHO.

I was elaborating a few examples that the expectation and assumption that
user/admin scripts only deal with master failover device is incorrect. It
had never been taken good care of, although I did try to emphasize it from
the very beginning.

Basically what you said about propagating the ethtool configuration down to
the slave is the key pursuance of 1-netdev model. However, what I am seeking
now is any alternative that can also fix the specific udev rename problem,
before concluding that 1-netdev is the only solution. Generally a 1-netdev
scheme would take time to implement, while I'm trying to find a way out to
fix this particular naming problem under 3-netdev.


Moreover, you were suggesting hiding the lower slave devices anyway. There was 
some discussion
about moving them to a hidden network namespace so that they are not visible 
from the default names

Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-02-26 Thread si-wei liu




On 2/25/2019 6:05 PM, Michael S. Tsirkin wrote:

On Mon, Feb 25, 2019 at 05:39:12PM -0800, Stephen Hemminger wrote:

Moreover, you were suggesting hiding the lower slave devices anyway. There was 
some discussion
about moving them to a hidden network namespace so that they are not visible 
from the default namespace.
I looked into this sometime back, but did not find the right kernel api to 
create a network namespace within
kernel. If so, we could use this mechanism to simulate a 1-netdev model.

Yes, that's one possible implementation (IMHO the key is to make 1-netdev
model as much transparent to a real NIC as possible, while a hidden netns is
just the vehicle). However, I recall there was resistance around this
discussion that even the concept of hiding itself is a taboo for Linux
netdev. I would like to summon potential alternatives before concluding
1-netdev is the only solution too soon.

Thanks,
-Siwei

Your scripts would not work at all then, right?

At this point we don't claim images with such usage as SR-IOV live
migrate-able. We would flag it as live migrate-able until this ethtool
config issue is fully addressed and a transparent live migration
solution emerges in upstream eventually.

The hyper-v netvsc with 1-dev model uses a timeout to allow  udev to do its 
rename.
I proposed a patch to key state change off of the udev rename, but that patch 
was
rejected.

Of course that would mean nothing works without udev - was
that the objection? Could you help me find that discussion pls?
Yeah, the kernel should work with and without udev rename - typically 
the kernel is agnostic of upcoming rename. User may opt out for kernel 
provided names (particularly on older distros) then no rename would ever 
happen.


I ever thought about this approach but didn't think it would fit. But, 
what is the historical reason that prevents slave from being renamed 
after being opened? Could we specialize a code path for this kernel 
created device, as net_failover shouldn't carry over any history burden?



Thanks,
-Siwei




Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-02-27 Thread si-wei liu




On 2/27/2019 1:57 PM, Stephen Hemminger wrote:

On Tue, 26 Feb 2019 16:17:21 -0800
si-wei liu  wrote:


On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:

On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:

On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:

On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:

On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:

On 2/21/2019 7:33 PM, si-wei liu wrote:

On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:

On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:

Sorry for replying to this ancient thread. There was some remaining
issue that I don't think the initial net_failover patch got addressed
cleanly, see:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268

The renaming of 'eth0' to 'ens4' fails because the udev userspace was
not specifically writtten for such kernel automatic enslavement.
Specifically, if it is a bond or team, the slave would typically get
renamed *before* virtual device gets created, that's what udev can
control (without getting netdev opened early by the other part of
kernel) and other userspace components for e.g. initramfs,
init-scripts can coordinate well in between. The in-kernel
auto-enslavement of net_failover breaks this userspace convention,
which don't provides a solution if user care about consistent naming
on the slave netdevs specifically.

Previously this issue had been specifically called out when IFF_HIDDEN
and the 1-netdev was proposed, but no one gives out a solution to this
problem ever since. Please share your mind how to proceed and solve
this userspace issue if netdev does not welcome a 1-netdev model.

Above says:

   there's no motivation in the systemd/udevd community at
   this point to refactor the rename logic and make it work well with
   3-netdev.

What would the fix be? Skip slave devices?
  

There's nothing user can get if just skipping slave devices - the
name is still unchanged and unpredictable e.g. eth0, or eth1 the
next reboot, while the rest may conform to the naming scheme (ens3
and such). There's no way one can fix this in userspace alone - when
the failover is created the enslaved netdev was opened by the kernel
earlier than the userspace is made aware of, and there's no
negotiation protocol for kernel to know when userspace has done
initial renaming of the interface. I would expect netdev list should
at least provide the direction in general for how this can be
solved...

I was just wondering what did you mean when you said
"refactor the rename logic and make it work well with 3-netdev" -
was there a proposal udev rejected?

No. I never believed this particular issue can be fixed in userspace alone.
Previously someone had said it could be, but I never see any work or
relevant discussion ever happened in various userspace communities (for e.g.
dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
of the issue derives from the kernel, it makes more sense to start from
netdev, work out and decide on a solution: see what can be done in the
kernel in order to fix it, then after that engage userspace community for
the feasibility...
  

Anyway, can we write a time diagram for what happens in which order that
leads to failure?  That would help look for triggers that we can tie
into, or add new ones.
  

See attached diagram.
  


  

Is there an issue if slave device names are not predictable? The user/admin 
scripts are expected
to only work with the master failover device.

Where does this expectation come from?

Admin users may have ethtool or tc configurations that need to deal with
predictable interface name. Third-party app which was built upon specifying
certain interface name can't be modified to chase dynamic names.

Specifically, we have pre-canned image that uses ethtool to fine tune VF
offload settings post boot for specific workload. Those images won't work
well if the name is constantly changing just after couple rounds of live
migration.

It should be possible to specify the ethtool configuration on the
master and have it automatically propagated to the slave.

BTW this is something we should look at IMHO.

I was elaborating a few examples that the expectation and assumption that
user/admin scripts only deal with master failover device is incorrect. It
had never been taken good care of, although I did try to emphasize it from
the very beginning.

Basically what you said about propagating the ethtool configuration down to
the slave is the key pursuance of 1-netdev model. However, what I am seeking
now is any alternative that can also fix the specific udev rename problem,
before concluding that 1-netdev is the only solution. Generally a 1-netdev
scheme would take time to implement, while I'm trying to find a way out to
fix this particular naming problem under 3-netdev.
  

Moreover, you were suggesting hiding the lower slave devices anyway. There was 

Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-02-27 Thread si-wei liu




On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:

On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:


On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:

On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:

On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:

On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:

On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:

On 2/21/2019 7:33 PM, si-wei liu wrote:

On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:

On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:

Sorry for replying to this ancient thread. There was some remaining
issue that I don't think the initial net_failover patch got addressed
cleanly, see:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268

The renaming of 'eth0' to 'ens4' fails because the udev userspace was
not specifically writtten for such kernel automatic enslavement.
Specifically, if it is a bond or team, the slave would typically get
renamed *before* virtual device gets created, that's what udev can
control (without getting netdev opened early by the other part of
kernel) and other userspace components for e.g. initramfs,
init-scripts can coordinate well in between. The in-kernel
auto-enslavement of net_failover breaks this userspace convention,
which don't provides a solution if user care about consistent naming
on the slave netdevs specifically.

Previously this issue had been specifically called out when IFF_HIDDEN
and the 1-netdev was proposed, but no one gives out a solution to this
problem ever since. Please share your mind how to proceed and solve
this userspace issue if netdev does not welcome a 1-netdev model.

Above says:

   there's no motivation in the systemd/udevd community at
   this point to refactor the rename logic and make it work well with
   3-netdev.

What would the fix be? Skip slave devices?


There's nothing user can get if just skipping slave devices - the
name is still unchanged and unpredictable e.g. eth0, or eth1 the
next reboot, while the rest may conform to the naming scheme (ens3
and such). There's no way one can fix this in userspace alone - when
the failover is created the enslaved netdev was opened by the kernel
earlier than the userspace is made aware of, and there's no
negotiation protocol for kernel to know when userspace has done
initial renaming of the interface. I would expect netdev list should
at least provide the direction in general for how this can be
solved...

I was just wondering what did you mean when you said
"refactor the rename logic and make it work well with 3-netdev" -
was there a proposal udev rejected?

No. I never believed this particular issue can be fixed in userspace alone.
Previously someone had said it could be, but I never see any work or
relevant discussion ever happened in various userspace communities (for e.g.
dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
of the issue derives from the kernel, it makes more sense to start from
netdev, work out and decide on a solution: see what can be done in the
kernel in order to fix it, then after that engage userspace community for
the feasibility...


Anyway, can we write a time diagram for what happens in which order that
leads to failure?  That would help look for triggers that we can tie
into, or add new ones.


See attached diagram.





Is there an issue if slave device names are not predictable? The user/admin 
scripts are expected
to only work with the master failover device.

Where does this expectation come from?

Admin users may have ethtool or tc configurations that need to deal with
predictable interface name. Third-party app which was built upon specifying
certain interface name can't be modified to chase dynamic names.

Specifically, we have pre-canned image that uses ethtool to fine tune VF
offload settings post boot for specific workload. Those images won't work
well if the name is constantly changing just after couple rounds of live
migration.

It should be possible to specify the ethtool configuration on the
master and have it automatically propagated to the slave.

BTW this is something we should look at IMHO.

I was elaborating a few examples that the expectation and assumption that
user/admin scripts only deal with master failover device is incorrect. It
had never been taken good care of, although I did try to emphasize it from
the very beginning.

Basically what you said about propagating the ethtool configuration down to
the slave is the key pursuance of 1-netdev model. However, what I am seeking
now is any alternative that can also fix the specific udev rename problem,
before concluding that 1-netdev is the only solution. Generally a 1-netdev
scheme would take time to implement, while I'm trying to find a way out to
fix this particular naming problem under 3-netdev.


Moreover, you were suggesting hiding the lower slave devices anyway. There was 
some

Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-02-27 Thread si-wei liu




On 2/27/2019 3:50 PM, Michael S. Tsirkin wrote:

On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:


On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:

On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:

On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:

On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:

On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:

On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:

On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:

On 2/21/2019 7:33 PM, si-wei liu wrote:

On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:

On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:

Sorry for replying to this ancient thread. There was some remaining
issue that I don't think the initial net_failover patch got addressed
cleanly, see:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268

The renaming of 'eth0' to 'ens4' fails because the udev userspace was
not specifically writtten for such kernel automatic enslavement.
Specifically, if it is a bond or team, the slave would typically get
renamed *before* virtual device gets created, that's what udev can
control (without getting netdev opened early by the other part of
kernel) and other userspace components for e.g. initramfs,
init-scripts can coordinate well in between. The in-kernel
auto-enslavement of net_failover breaks this userspace convention,
which don't provides a solution if user care about consistent naming
on the slave netdevs specifically.

Previously this issue had been specifically called out when IFF_HIDDEN
and the 1-netdev was proposed, but no one gives out a solution to this
problem ever since. Please share your mind how to proceed and solve
this userspace issue if netdev does not welcome a 1-netdev model.

Above says:

there's no motivation in the systemd/udevd community at
this point to refactor the rename logic and make it work well with
3-netdev.

What would the fix be? Skip slave devices?


There's nothing user can get if just skipping slave devices - the
name is still unchanged and unpredictable e.g. eth0, or eth1 the
next reboot, while the rest may conform to the naming scheme (ens3
and such). There's no way one can fix this in userspace alone - when
the failover is created the enslaved netdev was opened by the kernel
earlier than the userspace is made aware of, and there's no
negotiation protocol for kernel to know when userspace has done
initial renaming of the interface. I would expect netdev list should
at least provide the direction in general for how this can be
solved...

I was just wondering what did you mean when you said
"refactor the rename logic and make it work well with 3-netdev" -
was there a proposal udev rejected?

No. I never believed this particular issue can be fixed in userspace alone.
Previously someone had said it could be, but I never see any work or
relevant discussion ever happened in various userspace communities (for e.g.
dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
of the issue derives from the kernel, it makes more sense to start from
netdev, work out and decide on a solution: see what can be done in the
kernel in order to fix it, then after that engage userspace community for
the feasibility...


Anyway, can we write a time diagram for what happens in which order that
leads to failure?  That would help look for triggers that we can tie
into, or add new ones.


See attached diagram.




Is there an issue if slave device names are not predictable? The user/admin 
scripts are expected
to only work with the master failover device.

Where does this expectation come from?

Admin users may have ethtool or tc configurations that need to deal with
predictable interface name. Third-party app which was built upon specifying
certain interface name can't be modified to chase dynamic names.

Specifically, we have pre-canned image that uses ethtool to fine tune VF
offload settings post boot for specific workload. Those images won't work
well if the name is constantly changing just after couple rounds of live
migration.

It should be possible to specify the ethtool configuration on the
master and have it automatically propagated to the slave.

BTW this is something we should look at IMHO.

I was elaborating a few examples that the expectation and assumption that
user/admin scripts only deal with master failover device is incorrect. It
had never been taken good care of, although I did try to emphasize it from
the very beginning.

Basically what you said about propagating the ethtool configuration down to
the slave is the key pursuance of 1-netdev model. However, what I am seeking
now is any alternative that can also fix the specific udev rename problem,
before concluding that 1-netdev is the only solution. Generally a 1-netdev
scheme would take time to implement, while I'm trying to find a way out to
fix this particular namin

Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-02-28 Thread si-wei liu




On 2/27/2019 4:41 PM, Michael S. Tsirkin wrote:

On Wed, Feb 27, 2019 at 04:38:00PM -0800, si-wei liu wrote:


On 2/27/2019 3:50 PM, Michael S. Tsirkin wrote:

On Wed, Feb 27, 2019 at 03:34:56PM -0800, si-wei liu wrote:

On 2/27/2019 2:38 PM, Michael S. Tsirkin wrote:

On Tue, Feb 26, 2019 at 04:17:21PM -0800, si-wei liu wrote:

On 2/25/2019 6:08 PM, Michael S. Tsirkin wrote:

On Mon, Feb 25, 2019 at 04:58:07PM -0800, si-wei liu wrote:

On 2/22/2019 7:14 AM, Michael S. Tsirkin wrote:

On Thu, Feb 21, 2019 at 11:55:11PM -0800, si-wei liu wrote:

On 2/21/2019 11:00 PM, Samudrala, Sridhar wrote:

On 2/21/2019 7:33 PM, si-wei liu wrote:

On 2/21/2019 5:39 PM, Michael S. Tsirkin wrote:

On Thu, Feb 21, 2019 at 05:14:44PM -0800, Siwei Liu wrote:

Sorry for replying to this ancient thread. There was some remaining
issue that I don't think the initial net_failover patch got addressed
cleanly, see:

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1815268

The renaming of 'eth0' to 'ens4' fails because the udev userspace was
not specifically writtten for such kernel automatic enslavement.
Specifically, if it is a bond or team, the slave would typically get
renamed *before* virtual device gets created, that's what udev can
control (without getting netdev opened early by the other part of
kernel) and other userspace components for e.g. initramfs,
init-scripts can coordinate well in between. The in-kernel
auto-enslavement of net_failover breaks this userspace convention,
which don't provides a solution if user care about consistent naming
on the slave netdevs specifically.

Previously this issue had been specifically called out when IFF_HIDDEN
and the 1-netdev was proposed, but no one gives out a solution to this
problem ever since. Please share your mind how to proceed and solve
this userspace issue if netdev does not welcome a 1-netdev model.

Above says:

 there's no motivation in the systemd/udevd community at
 this point to refactor the rename logic and make it work well with
 3-netdev.

What would the fix be? Skip slave devices?


There's nothing user can get if just skipping slave devices - the
name is still unchanged and unpredictable e.g. eth0, or eth1 the
next reboot, while the rest may conform to the naming scheme (ens3
and such). There's no way one can fix this in userspace alone - when
the failover is created the enslaved netdev was opened by the kernel
earlier than the userspace is made aware of, and there's no
negotiation protocol for kernel to know when userspace has done
initial renaming of the interface. I would expect netdev list should
at least provide the direction in general for how this can be
solved...

I was just wondering what did you mean when you said
"refactor the rename logic and make it work well with 3-netdev" -
was there a proposal udev rejected?

No. I never believed this particular issue can be fixed in userspace alone.
Previously someone had said it could be, but I never see any work or
relevant discussion ever happened in various userspace communities (for e.g.
dracut, initramfs-tools, systemd, udev, and NetworkManager). IMHO the root
of the issue derives from the kernel, it makes more sense to start from
netdev, work out and decide on a solution: see what can be done in the
kernel in order to fix it, then after that engage userspace community for
the feasibility...


Anyway, can we write a time diagram for what happens in which order that
leads to failure?  That would help look for triggers that we can tie
into, or add new ones.


See attached diagram.


Is there an issue if slave device names are not predictable? The user/admin 
scripts are expected
to only work with the master failover device.

Where does this expectation come from?

Admin users may have ethtool or tc configurations that need to deal with
predictable interface name. Third-party app which was built upon specifying
certain interface name can't be modified to chase dynamic names.

Specifically, we have pre-canned image that uses ethtool to fine tune VF
offload settings post boot for specific workload. Those images won't work
well if the name is constantly changing just after couple rounds of live
migration.

It should be possible to specify the ethtool configuration on the
master and have it automatically propagated to the slave.

BTW this is something we should look at IMHO.

I was elaborating a few examples that the expectation and assumption that
user/admin scripts only deal with master failover device is incorrect. It
had never been taken good care of, although I did try to emphasize it from
the very beginning.

Basically what you said about propagating the ethtool configuration down to
the slave is the key pursuance of 1-netdev model. However, what I am seeking
now is any alternative that can also fix the specific udev rename problem,
before concluding that 1-netdev is the only solution. Generally a 1

Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-02-28 Thread si-wei liu




On 2/28/2019 6:26 AM, Michael S. Tsirkin wrote:

On Thu, Feb 28, 2019 at 01:32:12AM -0800, si-wei liu wrote:

Will the
change break userspace further?

-Siwei

Didn't you show userspace is already broken. You can't "further
break it", rename already fails.

It's a race, userspace tends to give slave a user(space) desired name but
sometimes may fail due to this race. Today if failover master is not up,
rename would succeed anyway. While what you proposed prohibits user from
providing a name in all circumstances if I understand you correctly. That's
what I meant of breaking userspace further. On the other hand, you seem to
tighten the kernel default naming to udev predictable names, which is
derived from only recent systemd-udevd, while there exists many possible
userspace naming schemes out of that. Users today who deliberately chooses
to disable predictable naming (net.ifnames=0 biosdevname=0) and fall back to
kernel provided names would expect the ethX pattern, with this change
admin/user scripts which matches the ethX pattern could potentially break.

Whatever crashes with a name not matching ethX will crash on the
standby interface *anyway*.
With udev predictable naming disabled they should not. It's not hard for 
user to look for device attribute to persistent the name well, in a 
consistent and reliable way.




So I think what you are saying is that someone might have already
written scripts and gotten them to work on v4.17 when STANDBY was
included and these scripts rely on ethX. Now these scripts
will break.
The controversial part is the new kernel naming pattern. Initially I 
thought there shouldn't be such crazy scripts relying on the pattern, 
but when I worked on cloud-init it I realized that there's already a lot 
of software taking assumption around the 'eth0' name. In the past I've 
seen random scripts that parses the ethX name assumes (incorrectly) the 
name ends up with digits, or even the digits and name are 1:1 mapped. Of 
course, you can say these are bugs in scripts themselves.


Anyway, I'll let others in the netdev to comment on this new scheme, 
maybe that's the concern of merely myself. The good part of your 
proposal is that we can get consistent slave name, which still plays its 
role until we move towards making slave names less relevant, i.e. 
ideally a 1-netdev model. I think we both agree that the master matters 
more than the slave names.




Maybe it is still early enough (just half a year passed) that the
number of these users would be small.  So how about a kernel config
option and maybe a module parameter to rename the primary?  People can
then opt in to the old broken behaviour.
Were I could I would ask  why a similar opt-in (kernel config or module 
parameter) couldn't be implemented to open up the rename restriction on 
slave, net_failover in particular. What I felt about this rename 
restriction was more because of historical reason than anything else, 
while net_failover is comparatively a new type of link that we are now 
designing proper use case it should support, and can get it shaped to 
whatever it fits. My personal view is that the slave can't be renamed 
when master is running is just implementation details that got 
incorrectly exposed to userspace apps for many years. It's old behavior 
with historical reason for sure, but I don't think this applies to 
net_failover.


(FWIW as one previous bond maintainer for another OS, we relieved the 
rename restriction slaves 13 year ago, while no single complaint or 
issue was ever raised because of this change over the years, neither 
from the customers of tens of millions of installation base, nor the 
FOSS software running atop. Of course, Linux is different so that 
experience doesn't count.)


Thanks,
-Siwei




Re: [virtio-dev] Re: net_failover slave udev renaming (was Re: [RFC PATCH net-next v6 4/4] netvsc: refactor notifier/event handling code to use the bypass framework)

2019-03-01 Thread si-wei liu




On 3/1/2019 5:27 AM, Michael S. Tsirkin wrote:

On Thu, Feb 28, 2019 at 05:30:56PM -0800, si-wei liu wrote:


On 2/28/2019 6:26 AM, Michael S. Tsirkin wrote:

On Thu, Feb 28, 2019 at 01:32:12AM -0800, si-wei liu wrote:

Will the
change break userspace further?

-Siwei

Didn't you show userspace is already broken. You can't "further
break it", rename already fails.

It's a race, userspace tends to give slave a user(space) desired name but
sometimes may fail due to this race. Today if failover master is not up,
rename would succeed anyway. While what you proposed prohibits user from
providing a name in all circumstances if I understand you correctly. That's
what I meant of breaking userspace further. On the other hand, you seem to
tighten the kernel default naming to udev predictable names, which is
derived from only recent systemd-udevd, while there exists many possible
userspace naming schemes out of that. Users today who deliberately chooses
to disable predictable naming (net.ifnames=0 biosdevname=0) and fall back to
kernel provided names would expect the ethX pattern, with this change
admin/user scripts which matches the ethX pattern could potentially break.

Whatever crashes with a name not matching ethX will crash on the
standby interface *anyway*.

With udev predictable naming disabled they should not. It's not hard for
user to look for device attribute to persistent the name well, in a
consistent and reliable way.

Well that's special code for failover already. So far we just
taught userspace to skip renaming slave interfaces.
I think today kernel provided names never collapse, e.g. master gets 
eth0 then standby will get eth1. It's the userspace specified name that 
suffers name clashing, mostly the default predictable naming pattern 
from systemd-udevd.


Kernel should not assume there's only one naming pattern in userspace. 
Users can customize naming with udev rules in /etc which do not conform 
to the default udevd pattern at all. It's pretty legitimate use case.






So I think what you are saying is that someone might have already
written scripts and gotten them to work on v4.17 when STANDBY was
included and these scripts rely on ethX. Now these scripts
will break.

The controversial part is the new kernel naming pattern. Initially I thought
there shouldn't be such crazy scripts relying on the pattern, but when I
worked on cloud-init it I realized that there's already a lot of software
taking assumption around the 'eth0' name. In the past I've seen random
scripts that parses the ethX name assumes (incorrectly) the name ends up
with digits, or even the digits and name are 1:1 mapped. Of course, you can
say these are bugs in scripts themselves.

No what I say is that they will crash on rename of standby too.
What do you mean crashing on standby rename? First off, if master is not 
up, rename on both standby and primary should not fail. If master is up, 
the standby should be named before userspace brings up the master, so 
what's the issue you talked about?


Thanks,
-Siwei




Anyway, I'll let others in the netdev to comment on this new scheme, maybe
that's the concern of merely myself. The good part of your proposal is that
we can get consistent slave name, which still plays its role until we move
towards making slave names less relevant, i.e. ideally a 1-netdev model. I
think we both agree that the master matters more than the slave names.

Maybe it is still early enough (just half a year passed) that the
number of these users would be small.  So how about a kernel config
option and maybe a module parameter to rename the primary?  People can
then opt in to the old broken behaviour.

Were I could I would ask  why a similar opt-in (kernel config or module
parameter) couldn't be implemented to open up the rename restriction on
slave, net_failover in particular. What I felt about this rename restriction
was more because of historical reason than anything else, while net_failover
is comparatively a new type of link that we are now designing proper use
case it should support, and can get it shaped to whatever it fits. My
personal view is that the slave can't be renamed when master is running is
just implementation details that got incorrectly exposed to userspace apps
for many years. It's old behavior with historical reason for sure, but I
don't think this applies to net_failover.

(FWIW as one previous bond maintainer for another OS, we relieved the rename
restriction slaves 13 year ago, while no single complaint or issue was ever
raised because of this change over the years, neither from the customers of
tens of millions of installation base, nor the FOSS software running atop.
Of course, Linux is different so that experience doesn't count.)

Thanks,
-Siwei





[RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces

2019-03-04 Thread Si-Wei Liu
When a netdev appears through hot plug then gets enslaved by a failover
master that is already up and running, the slave will be opened
right away after getting enslaved. Today there's a race that userspace
(udev) may fail to rename the slave if the kernel (net_failover)
opens the slave earlier than when the userspace rename happens.
Unlike bond or team, the primary slave of failover can't be renamed by
userspace ahead of time, since the kernel initiated auto-enslavement is
unable to, or rather, is never meant to be synchronized with the rename
request from userspace.

As the failover slave interfaces are not designed to be operated
directly by userspace apps: IP configuration, filter rules with
regard to network traffic passing and etc., should all be done on master
interface. In general, userspace apps only care about the
name of master interface, while slave names are less important as long
as admin users can see reliable names that may carry
other information describing the netdev. For e.g., they can infer that
"ens3nsby" is a standby slave of "ens3", while for a
name like "eth0" they can't tell which master it belongs to.

Historically the name of IFF_UP interface can't be changed because
there might be admin script or management software that is already
relying on such behavior and assumes that the slave name can't be
changed once UP. But failover is special: with the in-kernel
auto-enslavement mechanism, the userspace expectation for device
enumeration and bring-up order is already broken. Previously initramfs
and various userspace config tools were modified to bypass failover
slaves because of auto-enslavement and duplicate MAC address. Similarly,
in case that users care about seeing reliable slave name, the new type
of failover slaves needs to be taken care of specifically in userspace
anyway.

For that to work, now introduce a module-level tunable,
"slave_rename_ok" that allows users to lift up the rename restriction on
failover slave which is already UP. Although it's possible this change
potentially break userspace component (most likely configuration scripts
or management software) that assumes slave name can't be changed while
UP, it's relatively a limited and controllable set among all userspace
components, which can be fixed specifically to work with the new naming
behavior of the failover slave. Userspace component interacting with
slaves should be changed to operate on failover master instead, as the
failover slave is dynamic in nature which may come and go at any point.
The goal is to make the role of failover slaves less relevant, and
all userspace should only deal with master in the long run. The default
for the "slave_rename_ok" is set to true(1). If userspace doesn't have
the right support in place meanwhile users don't care about reliable
userspace naming, the value can be set to false(0).

Signed-off-by: si-wei@oracle.com
Reviewed-by: Liran Alon 
---
 include/linux/netdevice.h |  3 +++
 net/core/dev.c|  3 ++-
 net/core/failover.c   | 11 +--
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 857f8ab..6d9e4e0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1487,6 +1487,7 @@ struct net_device_ops {
  * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
  * @IFF_FAILOVER: device is a failover master device
  * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
+ * @IFF_SLAVE_RENAME_OK: rename is allowed while slave device is running
  */
 enum netdev_priv_flags {
IFF_802_1Q_VLAN = 1<<0,
@@ -1518,6 +1519,7 @@ enum netdev_priv_flags {
IFF_NO_RX_HANDLER   = 1<<26,
IFF_FAILOVER= 1<<27,
IFF_FAILOVER_SLAVE  = 1<<28,
+   IFF_SLAVE_RENAME_OK = 1<<29,
 };
 
 #define IFF_802_1Q_VLANIFF_802_1Q_VLAN
@@ -1548,6 +1550,7 @@ enum netdev_priv_flags {
 #define IFF_NO_RX_HANDLER  IFF_NO_RX_HANDLER
 #define IFF_FAILOVER   IFF_FAILOVER
 #define IFF_FAILOVER_SLAVE IFF_FAILOVER_SLAVE
+#define IFF_SLAVE_RENAME_OKIFF_SLAVE_RENAME_OK
 
 /**
  * struct net_device - The DEVICE structure.
diff --git a/net/core/dev.c b/net/core/dev.c
index 722d50d..ae070de 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1180,7 +1180,8 @@ int dev_change_name(struct net_device *dev, const char 
*newname)
BUG_ON(!dev_net(dev));
 
net = dev_net(dev);
-   if (dev->flags & IFF_UP)
+   if (dev->flags & IFF_UP &&
+   !(dev->priv_flags & IFF_SLAVE_RENAME_OK))
return -EBUSY;
 
write_seqcount_begin(&devnet_rename_seq);
diff --git a/net/core/failover.c b/net/core/failover.c
index 4a92a98..1fd8bbb 100644
--- a/net/core/failover.c
+++ b/net/core/failover.c
@@ -16,6 +16,11 @@
 
 static LIST_HEAD(failover_list);
 static DEFINE_SPINLOCK(failover_lo

[RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces

2019-03-04 Thread Si-Wei Liu
When a netdev appears through hot plug then gets enslaved by a failover
master that is already up and running, the slave will be opened
right away after getting enslaved. Today there's a race that userspace
(udev) may fail to rename the slave if the kernel (net_failover)
opens the slave earlier than when the userspace rename happens.
Unlike bond or team, the primary slave of failover can't be renamed by
userspace ahead of time, since the kernel initiated auto-enslavement is
unable to, or rather, is never meant to be synchronized with the rename
request from userspace.

As the failover slave interfaces are not designed to be operated
directly by userspace apps: IP configuration, filter rules with
regard to network traffic passing and etc., should all be done on master
interface. In general, userspace apps only care about the
name of master interface, while slave names are less important as long
as admin users can see reliable names that may carry
other information describing the netdev. For e.g., they can infer that
"ens3nsby" is a standby slave of "ens3", while for a
name like "eth0" they can't tell which master it belongs to.

Historically the name of IFF_UP interface can't be changed because
there might be admin script or management software that is already
relying on such behavior and assumes that the slave name can't be
changed once UP. But failover is special: with the in-kernel
auto-enslavement mechanism, the userspace expectation for device
enumeration and bring-up order is already broken. Previously initramfs
and various userspace config tools were modified to bypass failover
slaves because of auto-enslavement and duplicate MAC address. Similarly,
in case that users care about seeing reliable slave name, the new type
of failover slaves needs to be taken care of specifically in userspace
anyway.

For that to work, now introduce a module-level tunable,
"slave_rename_ok" that allows users to lift up the rename restriction on
failover slave which is already UP. Although it's possible this change
potentially break userspace component (most likely configuration scripts
or management software) that assumes slave name can't be changed while
UP, it's relatively a limited and controllable set among all userspace
components, which can be fixed specifically to work with the new naming
behavior of the failover slave. Userspace component interacting with
slaves should be changed to operate on failover master instead, as the
failover slave is dynamic in nature which may come and go at any point.
The goal is to make the role of failover slaves less relevant, and
all userspace should only deal with master in the long run. The default
for the "slave_rename_ok" is set to true(1). If userspace doesn't have
the right support in place meanwhile users don't care about reliable
userspace naming, the value can be set to false(0).

Signed-off-by: si-wei@oracle.com
Reviewed-by: Liran Alon 
---
 include/linux/netdevice.h |  3 +++
 net/core/dev.c|  3 ++-
 net/core/failover.c   | 11 +--
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 857f8ab..6d9e4e0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1487,6 +1487,7 @@ struct net_device_ops {
  * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
  * @IFF_FAILOVER: device is a failover master device
  * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
+ * @IFF_SLAVE_RENAME_OK: rename is allowed while slave device is running
  */
 enum netdev_priv_flags {
IFF_802_1Q_VLAN = 1<<0,
@@ -1518,6 +1519,7 @@ enum netdev_priv_flags {
IFF_NO_RX_HANDLER   = 1<<26,
IFF_FAILOVER= 1<<27,
IFF_FAILOVER_SLAVE  = 1<<28,
+   IFF_SLAVE_RENAME_OK = 1<<29,
 };
 
 #define IFF_802_1Q_VLANIFF_802_1Q_VLAN
@@ -1548,6 +1550,7 @@ enum netdev_priv_flags {
 #define IFF_NO_RX_HANDLER  IFF_NO_RX_HANDLER
 #define IFF_FAILOVER   IFF_FAILOVER
 #define IFF_FAILOVER_SLAVE IFF_FAILOVER_SLAVE
+#define IFF_SLAVE_RENAME_OKIFF_SLAVE_RENAME_OK
 
 /**
  * struct net_device - The DEVICE structure.
diff --git a/net/core/dev.c b/net/core/dev.c
index 722d50d..ae070de 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1180,7 +1180,8 @@ int dev_change_name(struct net_device *dev, const char 
*newname)
BUG_ON(!dev_net(dev));
 
net = dev_net(dev);
-   if (dev->flags & IFF_UP)
+   if (dev->flags & IFF_UP &&
+   !(dev->priv_flags & IFF_SLAVE_RENAME_OK))
return -EBUSY;
 
write_seqcount_begin(&devnet_rename_seq);
diff --git a/net/core/failover.c b/net/core/failover.c
index 4a92a98..1fd8bbb 100644
--- a/net/core/failover.c
+++ b/net/core/failover.c
@@ -16,6 +16,11 @@
 
 static LIST_HEAD(failover_list);
 static DEFINE_SPINLOCK(failover_lo

[RFC PATCH net] failover: allow name change on IFF_UP slave interfaces

2019-03-04 Thread Si-Wei Liu
When a netdev appears through hot plug then gets enslaved by a failover
master that is already up and running, the slave will be opened
right away after getting enslaved. Today there's a race that userspace
(udev) may fail to rename the slave if the kernel (net_failover)
opens the slave earlier than when the userspace rename happens.
Unlike bond or team, the primary slave of failover can't be renamed by
userspace ahead of time, since the kernel initiated auto-enslavement is
unable to, or rather, is never meant to be synchronized with the rename
request from userspace.

As the failover slave interfaces are not designed to be operated
directly by userspace apps: IP configuration, filter rules with
regard to network traffic passing and etc., should all be done on master
interface. In general, userspace apps only care about the
name of master interface, while slave names are less important as long
as admin users can see reliable names that may carry
other information describing the netdev. For e.g., they can infer that
"ens3nsby" is a standby slave of "ens3", while for a
name like "eth0" they can't tell which master it belongs to.

Historically the name of IFF_UP interface can't be changed because
there might be admin script or management software that is already
relying on such behavior and assumes that the slave name can't be
changed once UP. But failover is special: with the in-kernel
auto-enslavement mechanism, the userspace expectation for device
enumeration and bring-up order is already broken. Previously initramfs
and various userspace config tools were modified to bypass failover
slaves because of auto-enslavement and duplicate MAC address. Similarly,
in case that users care about seeing reliable slave name, the new type
of failover slaves needs to be taken care of specifically in userspace
anyway.

For that to work, now introduce a module-level tunable,
"slave_rename_ok" that allows users to lift up the rename restriction on
failover slave which is already UP. Although it's possible this change
potentially break userspace component (most likely configuration scripts
or management software) that assumes slave name can't be changed while
UP, it's relatively a limited and controllable set among all userspace
components, which can be fixed specifically to work with the new naming
behavior of the failover slave. Userspace component interacting with
slaves should be changed to operate on failover master instead, as the
failover slave is dynamic in nature which may come and go at any point.
The goal is to make the role of failover slaves less relevant, and
all userspace should only deal with master in the long run. The default
for the "slave_rename_ok" is set to true(1). If userspace doesn't have
the right support in place meanwhile users don't care about reliable
userspace naming, the value can be set to false(0).

Signed-off-by: si-wei@oracle.com
Reviewed-by: Liran Alon 
---
 include/linux/netdevice.h |  3 +++
 net/core/dev.c|  3 ++-
 net/core/failover.c   | 11 +--
 3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 857f8ab..6d9e4e0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1487,6 +1487,7 @@ struct net_device_ops {
  * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
  * @IFF_FAILOVER: device is a failover master device
  * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
+ * @IFF_SLAVE_RENAME_OK: rename is allowed while slave device is running
  */
 enum netdev_priv_flags {
IFF_802_1Q_VLAN = 1<<0,
@@ -1518,6 +1519,7 @@ enum netdev_priv_flags {
IFF_NO_RX_HANDLER   = 1<<26,
IFF_FAILOVER= 1<<27,
IFF_FAILOVER_SLAVE  = 1<<28,
+   IFF_SLAVE_RENAME_OK = 1<<29,
 };
 
 #define IFF_802_1Q_VLANIFF_802_1Q_VLAN
@@ -1548,6 +1550,7 @@ enum netdev_priv_flags {
 #define IFF_NO_RX_HANDLER  IFF_NO_RX_HANDLER
 #define IFF_FAILOVER   IFF_FAILOVER
 #define IFF_FAILOVER_SLAVE IFF_FAILOVER_SLAVE
+#define IFF_SLAVE_RENAME_OKIFF_SLAVE_RENAME_OK
 
 /**
  * struct net_device - The DEVICE structure.
diff --git a/net/core/dev.c b/net/core/dev.c
index 722d50d..ae070de 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1180,7 +1180,8 @@ int dev_change_name(struct net_device *dev, const char 
*newname)
BUG_ON(!dev_net(dev));
 
net = dev_net(dev);
-   if (dev->flags & IFF_UP)
+   if (dev->flags & IFF_UP &&
+   !(dev->priv_flags & IFF_SLAVE_RENAME_OK))
return -EBUSY;
 
write_seqcount_begin(&devnet_rename_seq);
diff --git a/net/core/failover.c b/net/core/failover.c
index 4a92a98..1fd8bbb 100644
--- a/net/core/failover.c
+++ b/net/core/failover.c
@@ -16,6 +16,11 @@
 
 static LIST_HEAD(failover_list);
 static DEFINE_SPINLOCK(failover_lo

Re: [RFC PATCH net] failover: allow name change on IFF_UP slave interfaces

2019-03-04 Thread si-wei liu
Please disregard patch emails previously sent with similar subject. The 
patch target is set to net rather than net-next. Any discussion or 
comment around this patch should go after this email.


-Siwei


On 3/4/2019 4:53 PM, Si-Wei Liu wrote:

When a netdev appears through hot plug then gets enslaved by a failover
master that is already up and running, the slave will be opened
right away after getting enslaved. Today there's a race that userspace
(udev) may fail to rename the slave if the kernel (net_failover)
opens the slave earlier than when the userspace rename happens.
Unlike bond or team, the primary slave of failover can't be renamed by
userspace ahead of time, since the kernel initiated auto-enslavement is
unable to, or rather, is never meant to be synchronized with the rename
request from userspace.

As the failover slave interfaces are not designed to be operated
directly by userspace apps: IP configuration, filter rules with
regard to network traffic passing and etc., should all be done on master
interface. In general, userspace apps only care about the
name of master interface, while slave names are less important as long
as admin users can see reliable names that may carry
other information describing the netdev. For e.g., they can infer that
"ens3nsby" is a standby slave of "ens3", while for a
name like "eth0" they can't tell which master it belongs to.

Historically the name of IFF_UP interface can't be changed because
there might be admin script or management software that is already
relying on such behavior and assumes that the slave name can't be
changed once UP. But failover is special: with the in-kernel
auto-enslavement mechanism, the userspace expectation for device
enumeration and bring-up order is already broken. Previously initramfs
and various userspace config tools were modified to bypass failover
slaves because of auto-enslavement and duplicate MAC address. Similarly,
in case that users care about seeing reliable slave name, the new type
of failover slaves needs to be taken care of specifically in userspace
anyway.

For that to work, now introduce a module-level tunable,
"slave_rename_ok" that allows users to lift up the rename restriction on
failover slave which is already UP. Although it's possible this change
potentially break userspace component (most likely configuration scripts
or management software) that assumes slave name can't be changed while
UP, it's relatively a limited and controllable set among all userspace
components, which can be fixed specifically to work with the new naming
behavior of the failover slave. Userspace component interacting with
slaves should be changed to operate on failover master instead, as the
failover slave is dynamic in nature which may come and go at any point.
The goal is to make the role of failover slaves less relevant, and
all userspace should only deal with master in the long run. The default
for the "slave_rename_ok" is set to true(1). If userspace doesn't have
the right support in place meanwhile users don't care about reliable
userspace naming, the value can be set to false(0).

Signed-off-by: si-wei@oracle.com
Reviewed-by: Liran Alon 
---
  include/linux/netdevice.h |  3 +++
  net/core/dev.c|  3 ++-
  net/core/failover.c   | 11 +--
  3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 857f8ab..6d9e4e0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1487,6 +1487,7 @@ struct net_device_ops {
   * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
   * @IFF_FAILOVER: device is a failover master device
   * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
+ * @IFF_SLAVE_RENAME_OK: rename is allowed while slave device is running
   */
  enum netdev_priv_flags {
IFF_802_1Q_VLAN = 1<<0,
@@ -1518,6 +1519,7 @@ enum netdev_priv_flags {
IFF_NO_RX_HANDLER   = 1<<26,
IFF_FAILOVER= 1<<27,
IFF_FAILOVER_SLAVE  = 1<<28,
+   IFF_SLAVE_RENAME_OK = 1<<29,
  };
  
  #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN

@@ -1548,6 +1550,7 @@ enum netdev_priv_flags {
  #define IFF_NO_RX_HANDLER IFF_NO_RX_HANDLER
  #define IFF_FAILOVER  IFF_FAILOVER
  #define IFF_FAILOVER_SLAVEIFF_FAILOVER_SLAVE
+#define IFF_SLAVE_RENAME_OKIFF_SLAVE_RENAME_OK
  
  /**

   *struct net_device - The DEVICE structure.
diff --git a/net/core/dev.c b/net/core/dev.c
index 722d50d..ae070de 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1180,7 +1180,8 @@ int dev_change_name(struct net_device *dev, const char 
*newname)
BUG_ON(!dev_net(dev));
  
  	net = dev_net(dev);

-   if (dev->flags & IFF_UP)
+   

Re: [RFC PATCH net] failover: allow name change on IFF_UP slave interfaces

2019-03-04 Thread si-wei liu
Sorry for multiple sends. The patch is exactly same. I added a few 
people who were missing int the cc lines in the first attemt. And 
corrected the subject line in the second attempt.


-Siwei


On 3/4/2019 6:04 PM, David Miller wrote:

Why did you send this three times?

What's different in each of these copies?





Re: [RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces

2019-03-05 Thread si-wei liu




On 3/4/2019 6:33 PM, Michael S. Tsirkin wrote:

On Mon, Mar 04, 2019 at 07:50:59PM -0500, Si-Wei Liu wrote:

When a netdev appears through hot plug then gets enslaved by a failover
master that is already up and running, the slave will be opened
right away after getting enslaved. Today there's a race that userspace
(udev) may fail to rename the slave if the kernel (net_failover)
opens the slave earlier than when the userspace rename happens.
Unlike bond or team, the primary slave of failover can't be renamed by
userspace ahead of time, since the kernel initiated auto-enslavement is
unable to, or rather, is never meant to be synchronized with the rename
request from userspace.

As the failover slave interfaces are not designed to be operated
directly by userspace apps: IP configuration, filter rules with
regard to network traffic passing and etc., should all be done on master
interface. In general, userspace apps only care about the
name of master interface, while slave names are less important as long
as admin users can see reliable names that may carry
other information describing the netdev. For e.g., they can infer that
"ens3nsby" is a standby slave of "ens3", while for a
name like "eth0" they can't tell which master it belongs to.

Historically the name of IFF_UP interface can't be changed because
there might be admin script or management software that is already
relying on such behavior and assumes that the slave name can't be
changed once UP. But failover is special: with the in-kernel
auto-enslavement mechanism, the userspace expectation for device
enumeration and bring-up order is already broken. Previously initramfs
and various userspace config tools were modified to bypass failover
slaves because of auto-enslavement and duplicate MAC address. Similarly,
in case that users care about seeing reliable slave name, the new type
of failover slaves needs to be taken care of specifically in userspace
anyway.

For that to work, now introduce a module-level tunable,
"slave_rename_ok" that allows users to lift up the rename restriction on
failover slave which is already UP. Although it's possible this change
potentially break userspace component (most likely configuration scripts
or management software) that assumes slave name can't be changed while
UP, it's relatively a limited and controllable set among all userspace
components, which can be fixed specifically to work with the new naming
behavior of the failover slave. Userspace component interacting with
slaves should be changed to operate on failover master instead, as the
failover slave is dynamic in nature which may come and go at any point.
The goal is to make the role of failover slaves less relevant, and
all userspace should only deal with master in the long run. The default
for the "slave_rename_ok" is set to true(1). If userspace doesn't have
the right support in place meanwhile users don't care about reliable
userspace naming, the value can be set to false(0).

Signed-off-by: si-wei@oracle.com
Reviewed-by: Liran Alon 

Not sure which of the versions I should reply to.

Sorry for multiple copies sent. It's fine to reply to this one.



I have a vague idea: would it work to *not* set
IFF_UP on slave devices at all?
Hmm, I ever thought about this option, and it appears this solution is 
more invasive than required to convert existing scripts, despite the 
controversy of introducing internal netdev state to differentiate user 
visible state. Either we disallow slave to be brought up by user, or to 
not set IFF_UP flag but instead use the internal one, could end up with 
substantial behavioral change that breaks scripts. Consider any admin 
script that does `ip link set dev ... up' successfully just assumes the 
link is up and subsequent operation can be done as usual. While it *may* 
work for dracut (yet to be verified), I'm a bit concerned that there are 
more scripts to be converted than those that don't follow volatile 
failover slave names. It's technically doable, but may not worth the 
effort (in terms of porting existing scripts/apps).


Thanks
-Siwei



Would this reduce the chances of existing scripts such as dracut being
confused?

And this leaves open the option for scripts to address
slaves by checking some custom attribute.


---
  include/linux/netdevice.h |  3 +++
  net/core/dev.c|  3 ++-
  net/core/failover.c   | 11 +--
  3 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 857f8ab..6d9e4e0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1487,6 +1487,7 @@ struct net_device_ops {
   * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
   * @IFF_FAILOVER: device is a failover master device
   * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device

Re: [RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces

2019-03-05 Thread si-wei liu




On 3/5/2019 11:24 AM, Stephen Hemminger wrote:

On Tue, 5 Mar 2019 11:19:32 -0800
si-wei liu  wrote:


I have a vague idea: would it work to *not* set
IFF_UP on slave devices at all?

Hmm, I ever thought about this option, and it appears this solution is
more invasive than required to convert existing scripts, despite the
controversy of introducing internal netdev state to differentiate user
visible state. Either we disallow slave to be brought up by user, or to
not set IFF_UP flag but instead use the internal one, could end up with
substantial behavioral change that breaks scripts. Consider any admin
script that does `ip link set dev ... up' successfully just assumes the
link is up and subsequent operation can be done as usual. While it *may*
work for dracut (yet to be verified), I'm a bit concerned that there are
more scripts to be converted than those that don't follow volatile
failover slave names. It's technically doable, but may not worth the
effort (in terms of porting existing scripts/apps).

Thanks
-Siwei

Won't work for most devices.  Many devices turn off PHY and link layer
if not IFF_UP
True, that's what I said about introducing internal state for those 
driver and other kernel component. Very invasive change indeed.


-Siwei


Re: [RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces

2019-03-05 Thread si-wei liu




On 3/5/2019 12:28 PM, Michael S. Tsirkin wrote:

On Tue, Mar 05, 2019 at 11:19:32AM -0800, si-wei liu wrote:


On 3/4/2019 6:33 PM, Michael S. Tsirkin wrote:

On Mon, Mar 04, 2019 at 07:50:59PM -0500, Si-Wei Liu wrote:

When a netdev appears through hot plug then gets enslaved by a failover
master that is already up and running, the slave will be opened
right away after getting enslaved. Today there's a race that userspace
(udev) may fail to rename the slave if the kernel (net_failover)
opens the slave earlier than when the userspace rename happens.
Unlike bond or team, the primary slave of failover can't be renamed by
userspace ahead of time, since the kernel initiated auto-enslavement is
unable to, or rather, is never meant to be synchronized with the rename
request from userspace.

As the failover slave interfaces are not designed to be operated
directly by userspace apps: IP configuration, filter rules with
regard to network traffic passing and etc., should all be done on master
interface. In general, userspace apps only care about the
name of master interface, while slave names are less important as long
as admin users can see reliable names that may carry
other information describing the netdev. For e.g., they can infer that
"ens3nsby" is a standby slave of "ens3", while for a
name like "eth0" they can't tell which master it belongs to.

Historically the name of IFF_UP interface can't be changed because
there might be admin script or management software that is already
relying on such behavior and assumes that the slave name can't be
changed once UP. But failover is special: with the in-kernel
auto-enslavement mechanism, the userspace expectation for device
enumeration and bring-up order is already broken. Previously initramfs
and various userspace config tools were modified to bypass failover
slaves because of auto-enslavement and duplicate MAC address. Similarly,
in case that users care about seeing reliable slave name, the new type
of failover slaves needs to be taken care of specifically in userspace
anyway.

For that to work, now introduce a module-level tunable,
"slave_rename_ok" that allows users to lift up the rename restriction on
failover slave which is already UP. Although it's possible this change
potentially break userspace component (most likely configuration scripts
or management software) that assumes slave name can't be changed while
UP, it's relatively a limited and controllable set among all userspace
components, which can be fixed specifically to work with the new naming
behavior of the failover slave. Userspace component interacting with
slaves should be changed to operate on failover master instead, as the
failover slave is dynamic in nature which may come and go at any point.
The goal is to make the role of failover slaves less relevant, and
all userspace should only deal with master in the long run. The default
for the "slave_rename_ok" is set to true(1). If userspace doesn't have
the right support in place meanwhile users don't care about reliable
userspace naming, the value can be set to false(0).

Signed-off-by: si-wei@oracle.com
Reviewed-by: Liran Alon 

Not sure which of the versions I should reply to.

Sorry for multiple copies sent. It's fine to reply to this one.


I have a vague idea: would it work to *not* set
IFF_UP on slave devices at all?

Hmm, I ever thought about this option, and it appears this solution is more
invasive than required to convert existing scripts, despite the controversy
of introducing internal netdev state to differentiate user visible state.
Either we disallow slave to be brought up by user, or to not set IFF_UP flag
but instead use the internal one, could end up with substantial behavioral
change that breaks scripts. Consider any admin script that does `ip link set
dev ... up' successfully just assumes the link is up and subsequent
operation can be done as usual. While it *may* work for dracut (yet to be
verified), I'm a bit concerned that there are more scripts to be converted
than those that don't follow volatile failover slave names. It's technically
doable, but may not worth the effort (in terms of porting existing
scripts/apps).

Thanks
-Siwei


Right. Advantage could be that we prevent all kind of
misconfigurations e.g. when one has a route on a slave.
The fix for the slave route problem is already there in dracut. The ship 
has sailed, no matter how seamless upstream thought failover could work 
with the existing userspace. I would rather avoid introducing more 
breakage to userspace if there's simple yet less intrusive way to fix 
the rename issue itself.


-Siwei




Would this reduce the chances of existing scripts such as dracut being
confused?

And this leaves open the option for scripts to address
slaves by checking some custom attribute.


---
   include/linux/netdevice.h |  3 +++
   net/core/

Re: [RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces

2019-03-05 Thread si-wei liu




On 3/5/2019 4:06 PM, Michael S. Tsirkin wrote:

On Tue, Mar 05, 2019 at 11:35:50AM -0800, si-wei liu wrote:


On 3/5/2019 11:24 AM, Stephen Hemminger wrote:

On Tue, 5 Mar 2019 11:19:32 -0800
si-wei liu  wrote:


I have a vague idea: would it work to *not* set
IFF_UP on slave devices at all?

Hmm, I ever thought about this option, and it appears this solution is
more invasive than required to convert existing scripts, despite the
controversy of introducing internal netdev state to differentiate user
visible state. Either we disallow slave to be brought up by user, or to
not set IFF_UP flag but instead use the internal one, could end up with
substantial behavioral change that breaks scripts. Consider any admin
script that does `ip link set dev ... up' successfully just assumes the
link is up and subsequent operation can be done as usual.

How would it work when carrier is off?


While it *may*

work for dracut (yet to be verified), I'm a bit concerned that there are
more scripts to be converted than those that don't follow volatile
failover slave names. It's technically doable, but may not worth the
effort (in terms of porting existing scripts/apps).

Thanks
-Siwei

Won't work for most devices.  Many devices turn off PHY and link layer
if not IFF_UP

True, that's what I said about introducing internal state for those driver
and other kernel component. Very invasive change indeed.

-Siwei

Well I did say it's vague.
How about hiding IFF_UP from dev_get_flags (and probably
__dev_change_flags)?

Any different? This has small footprint for the kernel change for sure, 
while the discrepancy is still there. Anyone who writes code for IFF_UP 
will not notice IFF_FAILOVER_SLAVE.


Not to mention more userspace "fixup" work has to be done due to this 
change.


-Siwei





Re: [RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces

2019-03-05 Thread si-wei liu




On 3/5/2019 4:36 PM, Michael S. Tsirkin wrote:

On Tue, Mar 05, 2019 at 04:20:50PM -0800, si-wei liu wrote:


On 3/5/2019 4:06 PM, Michael S. Tsirkin wrote:

On Tue, Mar 05, 2019 at 11:35:50AM -0800, si-wei liu wrote:

On 3/5/2019 11:24 AM, Stephen Hemminger wrote:

On Tue, 5 Mar 2019 11:19:32 -0800
si-wei liu  wrote:


I have a vague idea: would it work to *not* set
IFF_UP on slave devices at all?

Hmm, I ever thought about this option, and it appears this solution is
more invasive than required to convert existing scripts, despite the
controversy of introducing internal netdev state to differentiate user
visible state. Either we disallow slave to be brought up by user, or to
not set IFF_UP flag but instead use the internal one, could end up with
substantial behavioral change that breaks scripts. Consider any admin
script that does `ip link set dev ... up' successfully just assumes the
link is up and subsequent operation can be done as usual.

How would it work when carrier is off?


While it *may*

work for dracut (yet to be verified), I'm a bit concerned that there are
more scripts to be converted than those that don't follow volatile
failover slave names. It's technically doable, but may not worth the
effort (in terms of porting existing scripts/apps).

Thanks
-Siwei

Won't work for most devices.  Many devices turn off PHY and link layer
if not IFF_UP

True, that's what I said about introducing internal state for those driver
and other kernel component. Very invasive change indeed.

-Siwei

Well I did say it's vague.
How about hiding IFF_UP from dev_get_flags (and probably
__dev_change_flags)?


Any different? This has small footprint for the kernel change for sure,
while the discrepancy is still there. Anyone who writes code for IFF_UP will
not notice IFF_FAILOVER_SLAVE.

Not to mention more userspace "fixup" work has to be done due to this
change.

-Siwei



Point is it's ok since most userspace should just ignore slaves
- hopefully it will just ignore it since it already
ignores interfaces that are down.
Admin script thought the interface could be bright up and do further 
operations without checking the UP flag. It doesn't look to be a 
reliable way of prohibit userspace from operating against slaves.


-Siwei





Re: [RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces

2019-03-05 Thread si-wei liu




On 3/5/2019 10:43 PM, Michael S. Tsirkin wrote:

On Tue, Mar 05, 2019 at 04:51:00PM -0800, si-wei liu wrote:


On 3/5/2019 4:36 PM, Michael S. Tsirkin wrote:

On Tue, Mar 05, 2019 at 04:20:50PM -0800, si-wei liu wrote:

On 3/5/2019 4:06 PM, Michael S. Tsirkin wrote:

On Tue, Mar 05, 2019 at 11:35:50AM -0800, si-wei liu wrote:

On 3/5/2019 11:24 AM, Stephen Hemminger wrote:

On Tue, 5 Mar 2019 11:19:32 -0800
si-wei liu  wrote:


I have a vague idea: would it work to *not* set
IFF_UP on slave devices at all?

Hmm, I ever thought about this option, and it appears this solution is
more invasive than required to convert existing scripts, despite the
controversy of introducing internal netdev state to differentiate user
visible state. Either we disallow slave to be brought up by user, or to
not set IFF_UP flag but instead use the internal one, could end up with
substantial behavioral change that breaks scripts. Consider any admin
script that does `ip link set dev ... up' successfully just assumes the
link is up and subsequent operation can be done as usual.

How would it work when carrier is off?


While it *may*

work for dracut (yet to be verified), I'm a bit concerned that there are
more scripts to be converted than those that don't follow volatile
failover slave names. It's technically doable, but may not worth the
effort (in terms of porting existing scripts/apps).

Thanks
-Siwei

Won't work for most devices.  Many devices turn off PHY and link layer
if not IFF_UP

True, that's what I said about introducing internal state for those driver
and other kernel component. Very invasive change indeed.

-Siwei

Well I did say it's vague.
How about hiding IFF_UP from dev_get_flags (and probably
__dev_change_flags)?


Any different? This has small footprint for the kernel change for sure,
while the discrepancy is still there. Anyone who writes code for IFF_UP will
not notice IFF_FAILOVER_SLAVE.

Not to mention more userspace "fixup" work has to be done due to this
change.

-Siwei



Point is it's ok since most userspace should just ignore slaves
- hopefully it will just ignore it since it already
ignores interfaces that are down.

Admin script thought the interface could be bright up and do further
operations without checking the UP flag.

These scripts then would be broken  on any box with multiple interfaces
since not all of these would have carrier.
Consider a script executing `ifconfig ... up' and once succeeds runs 
tcpdump or some other command relying on UP interface. It's quite common 
that those scripts don't check the UP flag but instead just rely on the 
well-known fact that the command exits with 0 meaning the interface 
should be UP. This change might well break scripts of that kind.






It doesn't look to be a reliable
way of prohibit userspace from operating against slaves.

-Siwei



This does not mean we shouldn't make an effort to disable broken
configurations.

I am not arguing against your patch. Not at all. I see better
hiding of slaves as a separate enhancement.
I understand, but my point is we should try to minimize unnecessary side 
impact to the current usage for whatever "hiding" effort we can make. 
It's hard to find a tradeoff sometimes.





Acked-by: Michael S. Tsirkin 



Thank you.

-Siwei


Re: [RFC PATCH net-next] failover: allow name change on IFF_UP slave interfaces

2019-03-06 Thread si-wei liu




On 3/5/2019 11:23 PM, Michael S. Tsirkin wrote:

On Tue, Mar 05, 2019 at 11:15:06PM -0800, si-wei liu wrote:


On 3/5/2019 10:43 PM, Michael S. Tsirkin wrote:

On Tue, Mar 05, 2019 at 04:51:00PM -0800, si-wei liu wrote:

On 3/5/2019 4:36 PM, Michael S. Tsirkin wrote:

On Tue, Mar 05, 2019 at 04:20:50PM -0800, si-wei liu wrote:

On 3/5/2019 4:06 PM, Michael S. Tsirkin wrote:

On Tue, Mar 05, 2019 at 11:35:50AM -0800, si-wei liu wrote:

On 3/5/2019 11:24 AM, Stephen Hemminger wrote:

On Tue, 5 Mar 2019 11:19:32 -0800
si-wei liu  wrote:


I have a vague idea: would it work to *not* set
IFF_UP on slave devices at all?

Hmm, I ever thought about this option, and it appears this solution is
more invasive than required to convert existing scripts, despite the
controversy of introducing internal netdev state to differentiate user
visible state. Either we disallow slave to be brought up by user, or to
not set IFF_UP flag but instead use the internal one, could end up with
substantial behavioral change that breaks scripts. Consider any admin
script that does `ip link set dev ... up' successfully just assumes the
link is up and subsequent operation can be done as usual.

How would it work when carrier is off?


While it *may*

work for dracut (yet to be verified), I'm a bit concerned that there are
more scripts to be converted than those that don't follow volatile
failover slave names. It's technically doable, but may not worth the
effort (in terms of porting existing scripts/apps).

Thanks
-Siwei

Won't work for most devices.  Many devices turn off PHY and link layer
if not IFF_UP

True, that's what I said about introducing internal state for those driver
and other kernel component. Very invasive change indeed.

-Siwei

Well I did say it's vague.
How about hiding IFF_UP from dev_get_flags (and probably
__dev_change_flags)?


Any different? This has small footprint for the kernel change for sure,
while the discrepancy is still there. Anyone who writes code for IFF_UP will
not notice IFF_FAILOVER_SLAVE.

Not to mention more userspace "fixup" work has to be done due to this
change.

-Siwei



Point is it's ok since most userspace should just ignore slaves
- hopefully it will just ignore it since it already
ignores interfaces that are down.

Admin script thought the interface could be bright up and do further
operations without checking the UP flag.

These scripts then would be broken  on any box with multiple interfaces
since not all of these would have carrier.

Consider a script executing `ifconfig ... up' and once succeeds runs tcpdump
or some other command relying on UP interface. It's quite common that those
scripts don't check the UP flag but instead just rely on the well-known fact
that the command exits with 0 meaning the interface should be UP. This
change might well break scripts of that kind.

I am sorry I don't get it. Could you give an example
of a script that works now but would be broken?


https://github.com/torvalds/linux/blob/master/tools/testing/selftests/net/netdevice.sh#L27
https://github.com/WPO-Foundation/wptagent/blob/master/internal/adb.py#L443
https://github.com/openstack/steth/blob/master/steth/agent/api.py#L134

There are more if you keep searching.

-Siwei







It doesn't look to be a reliable
way of prohibit userspace from operating against slaves.

-Siwei



This does not mean we shouldn't make an effort to disable broken
configurations.

I am not arguing against your patch. Not at all. I see better
hiding of slaves as a separate enhancement.

I understand, but my point is we should try to minimize unnecessary side
impact to the current usage for whatever "hiding" effort we can make. It's
hard to find a tradeoff sometimes.

Yes if some userspace made an assumption and it worked, we should keep
it working I think. I don't necessarily agree we should worry too much
about theoretical issues. In half a year since the feature got merged
it's unlikely there are millions of slightly different scripts using it.



Acked-by: Michael S. Tsirkin 



Thank you.

-Siwei




[PATCH net v2] failover: allow name change on IFF_UP slave interfaces

2019-03-06 Thread Si-Wei Liu
When a netdev appears through hot plug then gets enslaved by a failover
master that is already up and running, the slave will be opened
right away after getting enslaved. Today there's a race that userspace
(udev) may fail to rename the slave if the kernel (net_failover)
opens the slave earlier than when the userspace rename happens.
Unlike bond or team, the primary slave of failover can't be renamed by
userspace ahead of time, since the kernel initiated auto-enslavement is
unable to, or rather, is never meant to be synchronized with the rename
request from userspace.

As the failover slave interfaces are not designed to be operated
directly by userspace apps: IP configuration, filter rules with
regard to network traffic passing and etc., should all be done on master
interface. In general, userspace apps only care about the
name of master interface, while slave names are less important as long
as admin users can see reliable names that may carry
other information describing the netdev. For e.g., they can infer that
"ens3nsby" is a standby slave of "ens3", while for a
name like "eth0" they can't tell which master it belongs to.

Historically the name of IFF_UP interface can't be changed because
there might be admin script or management software that is already
relying on such behavior and assumes that the slave name can't be
changed once UP. But failover is special: with the in-kernel
auto-enslavement mechanism, the userspace expectation for device
enumeration and bring-up order is already broken. Previously initramfs
and various userspace config tools were modified to bypass failover
slaves because of auto-enslavement and duplicate MAC address. Similarly,
in case that users care about seeing reliable slave name, the new type
of failover slaves needs to be taken care of specifically in userspace
anyway.

It's less risky to lift up the rename restriction on failover slave
which is already UP. Although it's possible this change may potentially
break userspace component (most likely configuration scripts or
management software) that assumes slave name can't be changed while
UP, it's relatively a limited and controllable set among all userspace
components, which can be fixed specifically to work with the new naming
behavior of failover slaves. Userspace component interacting with
slaves should be changed to operate on failover master instead, as the
failover slave is dynamic in nature which may come and go at any point.
The goal is to make the role of failover slaves less relevant, and
all userspace should only deal with master in the long run.

Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module")
Signed-off-by: Si-Wei Liu 
Reviewed-by: Liran Alon 
Acked-by: Michael S. Tsirkin 

---
v1 -> v2:
- Drop configurable module parameter (Sridhar)


 include/linux/netdevice.h | 3 +++
 net/core/dev.c| 3 ++-
 net/core/failover.c   | 6 +++---
 3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 857f8ab..6d9e4e0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1487,6 +1487,7 @@ struct net_device_ops {
  * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
  * @IFF_FAILOVER: device is a failover master device
  * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
+ * @IFF_SLAVE_RENAME_OK: rename is allowed while slave device is running
  */
 enum netdev_priv_flags {
IFF_802_1Q_VLAN = 1<<0,
@@ -1518,6 +1519,7 @@ enum netdev_priv_flags {
IFF_NO_RX_HANDLER   = 1<<26,
IFF_FAILOVER= 1<<27,
IFF_FAILOVER_SLAVE  = 1<<28,
+   IFF_SLAVE_RENAME_OK = 1<<29,
 };
 
 #define IFF_802_1Q_VLANIFF_802_1Q_VLAN
@@ -1548,6 +1550,7 @@ enum netdev_priv_flags {
 #define IFF_NO_RX_HANDLER  IFF_NO_RX_HANDLER
 #define IFF_FAILOVER   IFF_FAILOVER
 #define IFF_FAILOVER_SLAVE IFF_FAILOVER_SLAVE
+#define IFF_SLAVE_RENAME_OKIFF_SLAVE_RENAME_OK
 
 /**
  * struct net_device - The DEVICE structure.
diff --git a/net/core/dev.c b/net/core/dev.c
index 722d50d..ae070de 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1180,7 +1180,8 @@ int dev_change_name(struct net_device *dev, const char 
*newname)
BUG_ON(!dev_net(dev));
 
net = dev_net(dev);
-   if (dev->flags & IFF_UP)
+   if (dev->flags & IFF_UP &&
+   !(dev->priv_flags & IFF_SLAVE_RENAME_OK))
return -EBUSY;
 
write_seqcount_begin(&devnet_rename_seq);
diff --git a/net/core/failover.c b/net/core/failover.c
index 4a92a98..34c5c87 100644
--- a/net/core/failover.c
+++ b/net/core/failover.c
@@ -80,14 +80,14 @@ static int failover_slave_register(struct

Re: [PATCH net v2] failover: allow name change on IFF_UP slave interfaces

2019-03-06 Thread si-wei liu




On 3/6/2019 8:13 PM, Samudrala, Sridhar wrote:


On 3/6/2019 7:08 PM, Si-Wei Liu wrote:

When a netdev appears through hot plug then gets enslaved by a failover
master that is already up and running, the slave will be opened
right away after getting enslaved. Today there's a race that userspace
(udev) may fail to rename the slave if the kernel (net_failover)
opens the slave earlier than when the userspace rename happens.
Unlike bond or team, the primary slave of failover can't be renamed by
userspace ahead of time, since the kernel initiated auto-enslavement is
unable to, or rather, is never meant to be synchronized with the rename
request from userspace.

As the failover slave interfaces are not designed to be operated
directly by userspace apps: IP configuration, filter rules with
regard to network traffic passing and etc., should all be done on master
interface. In general, userspace apps only care about the
name of master interface, while slave names are less important as long
as admin users can see reliable names that may carry
other information describing the netdev. For e.g., they can infer that
"ens3nsby" is a standby slave of "ens3", while for a
name like "eth0" they can't tell which master it belongs to.

Historically the name of IFF_UP interface can't be changed because
there might be admin script or management software that is already
relying on such behavior and assumes that the slave name can't be
changed once UP. But failover is special: with the in-kernel
auto-enslavement mechanism, the userspace expectation for device
enumeration and bring-up order is already broken. Previously initramfs
and various userspace config tools were modified to bypass failover
slaves because of auto-enslavement and duplicate MAC address. Similarly,
in case that users care about seeing reliable slave name, the new type
of failover slaves needs to be taken care of specifically in userspace
anyway.

It's less risky to lift up the rename restriction on failover slave
which is already UP. Although it's possible this change may potentially
break userspace component (most likely configuration scripts or
management software) that assumes slave name can't be changed while
UP, it's relatively a limited and controllable set among all userspace
components, which can be fixed specifically to work with the new naming
behavior of failover slaves. Userspace component interacting with
slaves should be changed to operate on failover master instead, as the
failover slave is dynamic in nature which may come and go at any point.
The goal is to make the role of failover slaves less relevant, and
all userspace should only deal with master in the long run.

Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module")
Signed-off-by: Si-Wei Liu 
Reviewed-by: Liran Alon 
Acked-by: Michael S. Tsirkin 

---
v1 -> v2:
- Drop configurable module parameter (Sridhar)


  include/linux/netdevice.h | 3 +++
  net/core/dev.c| 3 ++-
  net/core/failover.c   | 6 +++---
  3 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 857f8ab..6d9e4e0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1487,6 +1487,7 @@ struct net_device_ops {
   * @IFF_NO_RX_HANDLER: device doesn't support the rx_handler hook
   * @IFF_FAILOVER: device is a failover master device
   * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master 
device
+ * @IFF_SLAVE_RENAME_OK: rename is allowed while slave device is 
running

   */
  enum netdev_priv_flags {
  IFF_802_1Q_VLAN= 1<<0,
@@ -1518,6 +1519,7 @@ enum netdev_priv_flags {
  IFF_NO_RX_HANDLER= 1<<26,
  IFF_FAILOVER= 1<<27,
  IFF_FAILOVER_SLAVE= 1<<28,
+IFF_SLAVE_RENAME_OK= 1<<29,
  };
#define IFF_802_1Q_VLANIFF_802_1Q_VLAN
@@ -1548,6 +1550,7 @@ enum netdev_priv_flags {
  #define IFF_NO_RX_HANDLERIFF_NO_RX_HANDLER
  #define IFF_FAILOVERIFF_FAILOVER
  #define IFF_FAILOVER_SLAVEIFF_FAILOVER_SLAVE
+#define IFF_SLAVE_RENAME_OKIFF_SLAVE_RENAME_OK
/**
   *struct net_device - The DEVICE structure.
diff --git a/net/core/dev.c b/net/core/dev.c
index 722d50d..ae070de 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1180,7 +1180,8 @@ int dev_change_name(struct net_device *dev, 
const char *newname)

  BUG_ON(!dev_net(dev));
net = dev_net(dev);
-if (dev->flags & IFF_UP)
+if (dev->flags & IFF_UP &&
+!(dev->priv_flags & IFF_SLAVE_RENAME_OK))
  return -EBUSY;


Without the configurable module parameter, i think we don't even need
the new SLAVE_RENAME_OK private flag.
Can't we simply check for IFF_FAILOVER_SLAVE ?
I'd prefer keeping this flag for now, even though without configurab

[RFC PATCH 3/3] virtio_net: make lower netdevs for virtio_bypass hidden

2018-04-01 Thread Si-Wei Liu
We should move virtio_bypass to a 1-upper-with-2-hidden-lower
driver model for greater compatibility with regard to preserving
userpsace API and ABI.

On the other hand, technically virtio_bypass should make stricter
check before automatically enslaving the corresponding virtual
function or passthrough device. It's more reasonable to pair
virtio_bypass instance with a VF or passthrough device 1:1,
rather than rely on searching for a random non-virtio netdev with
exact same MAC address. One possible way of doing it is to bind
virtio_bypass explicitly to a guest pci device by specifying
its  and : location. Changing BACKUP feature
to take these configs into account, such that verifying target
device for auto-enslavement no longer relies on the MAC address.

Signed-off-by: Si-Wei Liu 
---
 drivers/net/virtio_net.c| 159 
 include/uapi/linux/virtio_net.h |   2 +
 2 files changed, 148 insertions(+), 13 deletions(-)

diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index f850cf6..c54a5bd 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -77,6 +77,8 @@ struct virtnet_stats {
u64 rx_packets;
 };
 
+static struct workqueue_struct *virtnet_bypass_wq;
+
 /* Internal representation of a send virtqueue */
 struct send_queue {
/* Virtqueue associated with this send _queue */
@@ -196,6 +198,13 @@ struct padded_vnet_hdr {
char padding[4];
 };
 
+struct virtnet_bypass_task {
+   struct work_struct  work;
+   unsigned long   event; 
+   struct net_device   *child_netdev;
+   struct net_device   *bypass_netdev;
+};
+
 /* Converting between virtqueue no. and kernel tx/rx queue no.
  * 0:rx0 1:tx0 2:rx1 3:tx1 ... 2N:rxN 2N+1:txN 2N+2:cvq
  */
@@ -2557,6 +2566,11 @@ struct virtnet_bypass_info {
 
/* spinlock while updating stats */
spinlock_t stats_lock;
+
+   int bus;
+   int slot;
+   int function;
+
 };
 
 static void virtnet_bypass_child_open(struct net_device *dev,
@@ -2822,10 +2836,13 @@ static void virtnet_bypass_ethtool_get_drvinfo(struct 
net_device *dev,
.get_link_ksettings = virtnet_bypass_ethtool_get_link_ksettings,
 };
 
-static struct net_device *get_virtnet_bypass_bymac(struct net *net,
-  const u8 *mac)
+static struct net_device *
+get_virtnet_bypass_bymac(struct net_device *child_netdev)
 {
+   struct net *net = dev_net(child_netdev);
struct net_device *dev;
+   struct virtnet_bypass_info *vbi;
+   int devfn;
 
ASSERT_RTNL();
 
@@ -2833,7 +2850,29 @@ static struct net_device 
*get_virtnet_bypass_bymac(struct net *net,
if (dev->netdev_ops != &virtnet_bypass_netdev_ops)
continue;   /* not a virtnet_bypass device */
 
-   if (ether_addr_equal(mac, dev->perm_addr))
+   if (!ether_addr_equal(child_netdev->dev_addr, dev->perm_addr))
+   continue;   /* not matching MAC address */
+
+   if (!child_netdev->dev.parent)
+   continue;
+
+   /* Is child_netdev a backup netdev ? */
+   if (child_netdev->dev.parent == dev->dev.parent)
+   return dev;
+
+   /* Avoid non pci devices as active netdev */
+   if (!dev_is_pci(child_netdev->dev.parent))
+   continue;
+
+   vbi = netdev_priv(dev);
+   devfn = PCI_DEVFN(vbi->slot, vbi->function);
+
+   netdev_info(dev, "bus %d slot %d func %d",
+   vbi->bus, vbi->slot, vbi->function);
+
+   /* Need to match :. */
+   if (pci_get_bus_and_slot(vbi->bus, devfn) ==
+   to_pci_dev(child_netdev->dev.parent))
return dev;
}
 
@@ -2878,10 +2917,61 @@ static rx_handler_result_t 
virtnet_bypass_handle_frame(struct sk_buff **pskb)
return RX_HANDLER_ANOTHER;
 }
 
+static int virtnet_bypass_pregetname_child(struct net_device *child_netdev)
+{
+   struct net_device *dev;
+
+   if (child_netdev->addr_len != ETH_ALEN)
+   return NOTIFY_DONE;
+
+   /* We will use the MAC address to locate the virtnet_bypass netdev
+* to associate with the child netdev. If we don't find a matching
+* bypass netdev, move on.
+*/
+   dev = get_virtnet_bypass_bymac(child_netdev);
+   if (!dev)
+   return NOTIFY_DONE;
+
+   if (child_netdev->dev.parent &&
+   child_netdev->dev.parent != dev->dev.parent);
+   netdev_set_hidden(child_netdev);
+
+   return NOTIFY_OK;
+}
+
+static void virtnet_bypass_task_fn(struct work_struct *work)
+{
+   struct virtnet_bypass_task *task;
+   struct net_device *child_netdev;
+   int rc;
+
+  

[RFC PATCH 2/3] netdev: kernel-only IFF_HIDDEN netdevice

2018-04-01 Thread Si-Wei Liu
Hidden netdevice is not visible to userspace such that
typical network utilites e.g. ip, ifconfig and et al,
cannot sense its existence or configure it. Internally
hidden netdev may associate with an upper level netdev
that userspace has access to. Although userspace cannot
manipulate the lower netdev directly, user may control
or configure the underlying hidden device through the
upper-level netdev. For identification purpose, the
kobject for hidden netdev still presents in the sysfs
hierarchy, however, no uevent message will be generated
when the sysfs entry is created, modified or destroyed.

For that end, a separate namescope needs to be carved
out for IFF_HIDDEN netdevs. As of now netdev name that
starts with colon i.e. ':' is invalid in userspace,
since socket ioctls such as SIOCGIFCONF use ':' as the
separator for ifname. The absence of namescope started
with ':' can rightly be used as the namescope for
the kernel-only IFF_HIDDEN netdevs.

Signed-off-by: Si-Wei Liu 
---
 include/linux/netdevice.h   |  12 ++
 include/net/net_namespace.h |   2 +
 net/core/dev.c  | 281 ++--
 net/core/net_namespace.c|   1 +
 4 files changed, 263 insertions(+), 33 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index ef789e1..1a70f3a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1380,6 +1380,7 @@ struct net_device_ops {
  * @IFF_PHONY_HEADROOM: the headroom value is controlled by an external
  * entity (i.e. the master device for bridged veth)
  * @IFF_MACSEC: device is a MACsec device
+ * @IFF_HIDDEN: device is not visible to userspace
  */
 enum netdev_priv_flags {
IFF_802_1Q_VLAN = 1<<0,
@@ -1410,6 +1411,7 @@ enum netdev_priv_flags {
IFF_RXFH_CONFIGURED = 1<<25,
IFF_PHONY_HEADROOM  = 1<<26,
IFF_MACSEC  = 1<<27,
+   IFF_HIDDEN  = 1<<28,
 };
 
 #define IFF_802_1Q_VLANIFF_802_1Q_VLAN
@@ -1439,6 +1441,7 @@ enum netdev_priv_flags {
 #define IFF_TEAM   IFF_TEAM
 #define IFF_RXFH_CONFIGUREDIFF_RXFH_CONFIGURED
 #define IFF_MACSEC IFF_MACSEC
+#define IFF_HIDDEN IFF_HIDDEN
 
 /**
  * struct net_device - The DEVICE structure.
@@ -1659,6 +1662,7 @@ enum netdev_priv_flags {
 struct net_device {
charname[IFNAMSIZ];
struct hlist_node   name_hlist;
+   struct hlist_node   name_cmpl_hlist;
struct dev_ifalias  __rcu *ifalias;
/*
 *  I/O specific fields
@@ -1680,6 +1684,7 @@ struct net_device {
unsigned long   state;
 
struct list_headdev_list;
+   struct list_headdev_cmpl_list;
struct list_headnapi_list;
struct list_headunreg_list;
struct list_headclose_list;
@@ -2326,6 +2331,7 @@ struct netdev_lag_lower_state_info {
 #define NETDEV_UDP_TUNNEL_PUSH_INFO0x001C
 #define NETDEV_UDP_TUNNEL_DROP_INFO0x001D
 #define NETDEV_CHANGE_TX_QUEUE_LEN 0x001E
+#define NETDEV_PRE_GETNAME 0x001F
 
 int register_netdevice_notifier(struct notifier_block *nb);
 int unregister_netdevice_notifier(struct notifier_block *nb);
@@ -2393,6 +2399,8 @@ static inline void netdev_notifier_info_init(struct 
netdev_notifier_info *info,
for_each_netdev_rcu(&init_net, slave)   \
if (netdev_master_upper_dev_get_rcu(slave) == (bond))
 #define net_device_entry(lh)   list_entry(lh, struct net_device, dev_list)
+#define for_each_netdev_complete(net, d)   \
+   list_for_each_entry(d, &(net)->dev_cmpl_head, dev_cmpl_list)
 
 static inline struct net_device *next_net_device(struct net_device *dev)
 {
@@ -2462,6 +2470,10 @@ static inline void unregister_netdevice(struct 
net_device *dev)
unregister_netdevice_queue(dev, NULL);
 }
 
+void netdev_set_hidden(struct net_device *dev);
+int hide_netdevice(struct net_device *dev);
+void unhide_netdevice(struct net_device *dev);
+
 int netdev_refcnt_read(const struct net_device *dev);
 void free_netdev(struct net_device *dev);
 void netdev_freemem(struct net_device *dev);
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index 0490084..f9ce9b4 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -80,7 +80,9 @@ struct net {
struct sock *genl_sock;
 
struct list_headdev_base_head;
+   struct list_headdev_cmpl_head;
struct hlist_head   *dev_name_head;
+   struct hlist_head   *dev_name_cmpl_head;
struct hlist_head   *dev_index_head;
unsigned intdev_base_seq;   /* protected by rtnl_mutex */
int ifindex;
diff --gi

[RFC PATCH 1/3] qemu: virtio-bypass should explicitly bind to a passthrough device

2018-04-01 Thread Si-Wei Liu
The new backup option allows guest virtio-bypass driver to explicitly
bind to a corresponding passthrough instance, which is identifiable by
the :. notation. MAC address is still validated
in the guest but not the only criteria for pairing two devices.
MAC address is more a matter of network configuration than a (virtual)
device identifier, the latter of which needs to be unique as part of
VM configuration. Techinically it's possible there exists more than
one device in the guest configured with the same MAC, but each belongs
to completely isolated network.

The direct benefit as a result of the explicit binding (or pairing),
apparently, is the prohibition of improper binding or malicious pairing
due to any flexiblility in terms of guest MAC address config.

What's more important, the indicator of guest device location can even
be used as a means to reserve the slot for the corresponding passthrough
device in the PCI bus tree if such device is temporarily absent, but
yet to be hot plugged into the VM. We'd need to preserve the slot for
the passthrough device to which virtio-bypass is bound, such that once
it is plugged out as a result of migration we can ensure the slot
wouldn't be occupied by other devices, and any user-space application
assumes consistent device location in the bus tree still works.

The usage for the backup option is as follows:

   -device virtio-net-pci, ... ,backup=:[.]

for e.g.

   -device virtio-net-pci,id=net0,mac=52:54:00:e0:58:80,backup=pci.2:0x3
   ...
   -device vfio-pci,host=02:10.1,id=hostdev0,bus=pci.2,addr=0x3

Signed-off-by: Si-Wei Liu 
---
 hw/net/virtio-net.c | 29 -
 include/hw/pci/pci.h|  3 ++
 include/hw/virtio/virtio-net.h  |  2 +
 include/standard-headers/linux/virtio_net.h |  1 +
 qdev-monitor.c  | 64 +
 5 files changed, 97 insertions(+), 2 deletions(-)

diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index de31b1b98c..a36b169958 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -26,6 +26,7 @@
 #include "qapi-event.h"
 #include "hw/virtio/virtio-access.h"
 #include "migration/misc.h"
+#include "hw/pci/pci.h"
 
 #define VIRTIO_NET_VM_VERSION11
 
@@ -61,6 +62,8 @@ static VirtIOFeature feature_sizes[] = {
  .end = endof(struct virtio_net_config, max_virtqueue_pairs)},
 {.flags = 1 << VIRTIO_NET_F_MTU,
  .end = endof(struct virtio_net_config, mtu)},
+{.flags = 1 << VIRTIO_NET_F_BACKUP,
+ .end = endof(struct virtio_net_config, bsf2backup)},
 {}
 };
 
@@ -84,10 +87,24 @@ static void virtio_net_get_config(VirtIODevice *vdev, 
uint8_t *config)
 {
 VirtIONet *n = VIRTIO_NET(vdev);
 struct virtio_net_config netcfg;
+uint16_t busdevfn;
 
 virtio_stw_p(vdev, &netcfg.status, n->status);
 virtio_stw_p(vdev, &netcfg.max_virtqueue_pairs, n->max_queues);
 virtio_stw_p(vdev, &netcfg.mtu, n->net_conf.mtu);
+if (n->net_conf.backup) {
+/* Below function should not fail as the backup ID string
+ * has been validated when device is being realized.
+ * Until guest starts to run we can can get to the
+ * effective bus num in use from pci config space where
+ * guest had written to.
+ */
+pci_get_busdevfn_by_id(n->net_conf.backup, &busdevfn,
+   NULL, NULL);
+busdevfn <<= 8;
+busdevfn |= (n->backup_devfn & 0xFF);
+virtio_stw_p(vdev, &netcfg.bsf2backup, busdevfn);
+}
 memcpy(netcfg.mac, n->mac, ETH_ALEN);
 memcpy(config, &netcfg, n->config_size);
 }
@@ -1935,11 +1952,19 @@ static void virtio_net_device_realize(DeviceState *dev, 
Error **errp)
 VirtIODevice *vdev = VIRTIO_DEVICE(dev);
 VirtIONet *n = VIRTIO_NET(dev);
 NetClientState *nc;
+uint16_t bdevfn;
 int i;
 
 if (n->net_conf.mtu) {
 n->host_features |= (0x1 << VIRTIO_NET_F_MTU);
 }
+if (n->net_conf.backup) {
+if (pci_get_busdevfn_by_id(n->net_conf.backup, NULL,
+   &bdevfn, errp))
+return;
+n->backup_devfn = bdevfn;
+n->host_features |= (0x1 << VIRTIO_NET_F_BACKUP);
+}
 
 virtio_net_set_config_size(n, n->host_features);
 virtio_init(vdev, "virtio-net", VIRTIO_ID_NET, n->config_size);
@@ -2160,8 +2185,8 @@ static Property virtio_net_properties[] = {
 DEFINE_PROP_UINT16("host_mtu", VirtIONet, net_conf.mtu, 0),
 DEFINE_PROP_BOOL("x-mtu-bypass-backend", VirtIONet, mtu_bypass_backend,
  true),
-DEFINE_PROP_BIT("backup", VirtIONet, host_features,
- VIRTIO_NET_F_BACKUP, false),
+DEFINE_PROP_STRING("backup", VirtIONet, net_conf.backu

[RFC PATCH 0/3] Userspace compatible driver model for virtio_bypass

2018-04-01 Thread Si-Wei Liu
This RFC patch series attempts to hide the lower netdevs for virtio_bypass
from userspace visibility, and tighten up the association between virtio_bypass
and the lower passthrough netdev to be enslaved by binding to a specific device
identifier explicitly. This in turn has the benefits of taking the merit of the
2-netdev driver model from netvsc (userspace compliance) to a perfect sense,
while keeping the internal implementation still a 3-netdev model. There's no
loss of feature such as XDP, and continously adding improvements for performance
and features thanks to the good bypass nature of the 3-netdev model are also
possible in the long run.

As said, this change should make the code sharing between netvsc and 
virtio_bypass
easier and more approachable, as I think the concerns Stephen pointed out was
mainly regarding userspace compatibility and not the hardware offloading
tunables on the VF slave that had to be exposed to netvsc users today, if I'm
not mistaken.

Jiri expressed concerns around the weak check depending on MAC address only
during enslavement and we really need to do strict checks more than that. With
the change to requiring user explicitly specifying the passthrough device
to which virtio_bypass is expected to be bound, virtio_bypass now would match
device based on the PCI slot info in device tree, rather than rely on MAC
address inadvertently. In addition, the PCI slot info passed in will be helpful
to accommodate udevd to name the virtio_bypass interface specifically, making
a transparent and automatic upgrade from existing VF setup to virtio_bypass
possible (expect udevd patch to come later on).

Since I'd like to get the discussion going as early as possible, this series
just shows essential changes to a minimal set. Although not included in the
series, I would like to remind ahead that a few neccessary pieces must be built
upon the assumption of hidden lower netdevs and explicit binding. Such as
sysfs interfaces for udev's naming of virtio_bypass interace. Such as passing
down HW offloading configs to the active lower slave, and making it persistent
across live migration. And so on..

The current patch series is based on Sridhar's v4 patch "Enable virtio to act
as a backup for a passthru device", but I can resync anyway to his upcoming
version once posted.


Si-Wei Liu (1):
  qemu: virtio-bypass should explicitly bind to a passthrough device

 hw/net/virtio-net.c | 29 -
 include/hw/pci/pci.h|  3 ++
 include/hw/virtio/virtio-net.h  |  2 +
 include/standard-headers/linux/virtio_net.h |  1 +
 qdev-monitor.c  | 64 +
 5 files changed, 97 insertions(+), 2 deletions(-)

Si-Wei Liu (2):
  netdev: kernel-only IFF_HIDDEN netdevice
  virtio_net: make lower netdevs for virtio_bypass hidden

 drivers/net/virtio_net.c| 159 +--
 include/linux/netdevice.h   |  12 ++
 include/net/net_namespace.h |   2 +
 include/uapi/linux/virtio_net.h |   2 +
 net/core/dev.c  | 281 +++-
 net/core/net_namespace.c|   1 +
 6 files changed, 411 insertions(+), 46 deletions(-)

-- 
1.8.3.1



Re: [PATCH net v8] failover: allow name change on IFF_UP slave interfaces

2019-04-09 Thread si-wei liu




On 4/9/2019 9:13 AM, Michael S. Tsirkin wrote:

On Mon, Apr 08, 2019 at 07:45:27PM -0400, Si-Wei Liu wrote:

When a netdev appears through hot plug then gets enslaved by a failover
master that is already up and running, the slave will be opened
right away after getting enslaved. Today there's a race that userspace
(udev) may fail to rename the slave if the kernel (net_failover)
opens the slave earlier than when the userspace rename happens.
Unlike bond or team, the primary slave of failover can't be renamed by
userspace ahead of time, since the kernel initiated auto-enslavement is
unable to, or rather, is never meant to be synchronized with the rename
request from userspace.

As the failover slave interfaces are not designed to be operated
directly by userspace apps: IP configuration, filter rules with
regard to network traffic passing and etc., should all be done on master
interface. In general, userspace apps only care about the
name of master interface, while slave names are less important as long
as admin users can see reliable names that may carry
other information describing the netdev. For e.g., they can infer that
"ens3nsby" is a standby slave of "ens3", while for a
name like "eth0" they can't tell which master it belongs to.

Historically the name of IFF_UP interface can't be changed because
there might be admin script or management software that is already
relying on such behavior and assumes that the slave name can't be
changed once UP. But failover is special: with the in-kernel
auto-enslavement mechanism, the userspace expectation for device
enumeration and bring-up order is already broken. Previously initramfs
and various userspace config tools were modified to bypass failover
slaves because of auto-enslavement and duplicate MAC address. Similarly,
in case that users care about seeing reliable slave name, the new type
of failover slaves needs to be taken care of specifically in userspace
anyway.

It's less risky to lift up the rename restriction on failover slave
which is already UP. Although it's possible this change may potentially
break userspace component (most likely configuration scripts or
management software) that assumes slave name can't be changed while
UP, it's relatively a limited and controllable set among all userspace
components, which can be fixed specifically to listen for the rename
events on failover slaves. Userspace component interacting with slaves
is expected to be changed to operate on failover master interface
instead, as the failover slave is dynamic in nature which may come and
go at any point.  The goal is to make the role of failover slaves less
relevant, and userspace components should only deal with failover master
in the long run.

Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module")
Signed-off-by: Si-Wei Liu 
Reviewed-by: Liran Alon 


OK let's start with that. We can always add more events.
Yep. Nothing is impossible but this is the least we should do. We can 
add more events based on *real use case*.



-Siwei



--
v1 -> v2:
- Drop configurable module parameter (Sridhar)

v2 -> v3:
- Drop additional IFF_SLAVE_RENAME_OK flag (Sridhar)
- Send down and up events around rename (Michael S. Tsirkin)

v3 -> v4:
- Simplify notification to be sent (Stephen Hemminger)

v4 -> v5:
- Sync up code with latest net-next (Sridhar)
- Use proper structure initialization (Stephen, Jiri)

v5 -> v6:
- Make the property of live name change a generic flag (Stephen)

v6 -> v7:
- Remove NETDEV_CHANGE notification that is deemed unnecessary
   (Stephen)

v7 -> v8:
- Fix commit message that references link up/down events still
   (Michael)
---
  include/linux/netdevice.h |  3 +++
  net/core/dev.c| 16 +++-
  net/core/failover.c   |  6 +++---
  3 files changed, 21 insertions(+), 4 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 78f5ec4e..ea9a63f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1498,6 +1498,7 @@ struct net_device_ops {
   * @IFF_FAILOVER: device is a failover master device
   * @IFF_FAILOVER_SLAVE: device is lower dev of a failover master device
   * @IFF_L3MDEV_RX_HANDLER: only invoke the rx handler of L3 master device
+ * @IFF_LIVE_RENAME_OK: rename is allowed while device is up and running
   */
  enum netdev_priv_flags {
IFF_802_1Q_VLAN = 1<<0,
@@ -1530,6 +1531,7 @@ enum netdev_priv_flags {
IFF_FAILOVER= 1<<27,
IFF_FAILOVER_SLAVE  = 1<<28,
IFF_L3MDEV_RX_HANDLER   = 1<<29,
+   IFF_LIVE_RENAME_OK  = 1<<30,
  };
  
  #define IFF_802_1Q_VLAN			IFF_802_1Q_VLAN

@@ -1561,6 +1563,7 @@ enum netdev_priv_flags {
  #define IFF_FAILOVER  IFF_FAILOVER
  #define IFF_FAILOVER_SLAVEIFF_FAILOVER_SLAVE
  #d

Re: [PATCH v3 2/2] vhost-vdpa: fix page pinning leakage in error path

2020-10-13 Thread si-wei liu



On 10/9/2020 7:27 PM, Jason Wang wrote:


On 2020/10/3 下午1:02, Si-Wei Liu wrote:

Pinned pages are not properly accounted particularly when
mapping error occurs on IOTLB update. Clean up dangling
pinned pages for the error path. As the inflight pinned
pages, specifically for memory region that strides across
multiple chunks, would need more than one free page for
book keeping and accounting. For simplicity, pin pages
for all memory in the IOVA range in one go rather than
have multiple pin_user_pages calls to make up the entire
region. This way it's easier to track and account the
pages already mapped, particularly for clean-up in the
error path.

Fixes: 4c8cf31885f6 ("vhost: introduce vDPA-based backend")
Signed-off-by: Si-Wei Liu 
---
Changes in v3:
- Factor out vhost_vdpa_map() change to a separate patch

Changes in v2:
- Fix incorrect target SHA1 referenced

  drivers/vhost/vdpa.c | 119 
++-

  1 file changed, 71 insertions(+), 48 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 0f27919..dad41dae 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -595,21 +595,19 @@ static int 
vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,

  struct vhost_dev *dev = &v->vdev;
  struct vhost_iotlb *iotlb = dev->iotlb;
  struct page **page_list;
-unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
+struct vm_area_struct **vmas;
  unsigned int gup_flags = FOLL_LONGTERM;
-unsigned long npages, cur_base, map_pfn, last_pfn = 0;
-unsigned long locked, lock_limit, pinned, i;
+unsigned long map_pfn, last_pfn = 0;
+unsigned long npages, lock_limit;
+unsigned long i, nmap = 0;
  u64 iova = msg->iova;
+long pinned;
  int ret = 0;
if (vhost_iotlb_itree_first(iotlb, msg->iova,
  msg->iova + msg->size - 1))
  return -EEXIST;
  -page_list = (struct page **) __get_free_page(GFP_KERNEL);
-if (!page_list)
-return -ENOMEM;
-
  if (msg->perm & VHOST_ACCESS_WO)
  gup_flags |= FOLL_WRITE;
  @@ -617,61 +615,86 @@ static int 
vhost_vdpa_process_iotlb_update(struct vhost_vdpa *v,

  if (!npages)
  return -EINVAL;
  +page_list = kvmalloc_array(npages, sizeof(struct page *), 
GFP_KERNEL);

+vmas = kvmalloc_array(npages, sizeof(struct vm_area_struct *),
+  GFP_KERNEL);



This will result high order memory allocation which was what the code 
tried to avoid originally.


Using an unlimited size will cause a lot of side effects consider VM 
or userspace may try to pin several TB of memory.
Hmmm, that's a good point. Indeed, if the guest memory demand is huge or 
the host system is running short of free pages, kvmalloc will be 
problematic and less efficient than the __get_free_page implementation.






+if (!page_list || !vmas) {
+ret = -ENOMEM;
+goto free;
+}



Any reason that you want to use vmas?
Without providing custom vmas, it's subject to high order allocation 
failure. While page_list and vmas can now fallback to virtual memory 
allocation if need be.






+
  mmap_read_lock(dev->mm);
  -locked = atomic64_add_return(npages, &dev->mm->pinned_vm);
  lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
-if (locked > lock_limit) {
+if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) {
  ret = -ENOMEM;
-goto out;
+goto unlock;
  }
  -cur_base = msg->uaddr & PAGE_MASK;
-iova &= PAGE_MASK;
+pinned = pin_user_pages(msg->uaddr & PAGE_MASK, npages, gup_flags,
+page_list, vmas);
+if (npages != pinned) {
+if (pinned < 0) {
+ret = pinned;
+} else {
+unpin_user_pages(page_list, pinned);
+ret = -ENOMEM;
+}
+goto unlock;
+}
  -while (npages) {
-pinned = min_t(unsigned long, npages, list_size);
-ret = pin_user_pages(cur_base, pinned,
- gup_flags, page_list, NULL);
-if (ret != pinned)
-goto out;
-
-if (!last_pfn)
-map_pfn = page_to_pfn(page_list[0]);
-
-for (i = 0; i < ret; i++) {
-unsigned long this_pfn = page_to_pfn(page_list[i]);
-u64 csize;
-
-if (last_pfn && (this_pfn != last_pfn + 1)) {
-/* Pin a contiguous chunk of memory */
-csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
-if (vhost_vdpa_map(v, iova, csize,
-   map_pfn << PAGE_SHIFT,
-   msg->perm))
-goto out;
-map_pfn = this_pfn;
-iova += csize;
+iova &= PAGE_MASK;
+map_pfn = page_to_pfn(page_list[0]);
+
+/* One more iterati

Re: [PATCH v3 2/2] vhost-vdpa: fix page pinning leakage in error path

2020-10-15 Thread si-wei liu



On 10/15/2020 6:11 AM, Michael S. Tsirkin wrote:

On Thu, Oct 15, 2020 at 02:15:32PM +0800, Jason Wang wrote:

On 2020/10/14 上午7:42, si-wei liu wrote:


So what I suggest is to fix the pinning leakage first and do the
possible optimization on top (which is still questionable to me).

OK. Unfortunately, this was picked and got merged in upstream. So I will
post a follow up patch set to 1) revert the commit to the original
__get_free_page() implementation, and 2) fix the accounting and leakage
on top. Will it be fine?


Fine.

Thanks

Fine by me too.


Thanks, Michael & Jason. I will post the fix shortly. Stay tuned.

-Siwei



[PATCH] vdpa/mlx5: should keep avail_index despite device status

2020-10-01 Thread Si-Wei Liu
A VM with mlx5 vDPA has below warnings while being reset:

vhost VQ 0 ring restore failed: -1: Resource temporarily unavailable (11)
vhost VQ 1 ring restore failed: -1: Resource temporarily unavailable (11)

We should allow userspace emulating the virtio device be
able to get to vq's avail_index, regardless of vDPA device
status. Save the index that was last seen when virtq was
stopped, so that userspace doesn't complain.

Signed-off-by: Si-Wei Liu 
---
 drivers/vdpa/mlx5/net/mlx5_vnet.c | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
b/drivers/vdpa/mlx5/net/mlx5_vnet.c
index 70676a6..74264e59 100644
--- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
+++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
@@ -1133,15 +1133,17 @@ static void suspend_vq(struct mlx5_vdpa_net *ndev, 
struct mlx5_vdpa_virtqueue *m
if (!mvq->initialized)
return;
 
-   if (query_virtqueue(ndev, mvq, &attr)) {
-   mlx5_vdpa_warn(&ndev->mvdev, "failed to query virtqueue\n");
-   return;
-   }
if (mvq->fw_state != MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY)
return;
 
if (modify_virtqueue(ndev, mvq, MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND))
mlx5_vdpa_warn(&ndev->mvdev, "modify to suspend failed\n");
+
+   if (query_virtqueue(ndev, mvq, &attr)) {
+   mlx5_vdpa_warn(&ndev->mvdev, "failed to query virtqueue\n");
+   return;
+   }
+   mvq->avail_idx = attr.available_index;
 }
 
 static void suspend_vqs(struct mlx5_vdpa_net *ndev)
@@ -1411,8 +1413,14 @@ static int mlx5_vdpa_get_vq_state(struct vdpa_device 
*vdev, u16 idx, struct vdpa
struct mlx5_virtq_attr attr;
int err;
 
-   if (!mvq->initialized)
-   return -EAGAIN;
+   /* If the virtq object was destroyed, use the value saved at
+* the last minute of suspend_vq. This caters for userspace
+* that cares about emulating the index after vq is stopped.
+*/
+   if (!mvq->initialized) {
+   state->avail_index = mvq->avail_idx;
+   return 0;
+   }
 
err = query_virtqueue(ndev, mvq, &attr);
if (err) {
-- 
1.8.3.1



[PATCH] vhost-vdpa: fix page pinning leakage in error path

2020-10-01 Thread Si-Wei Liu
Pinned pages are not properly accounted particularly when
mapping error occurs on IOTLB update. Clean up dangling
pinned pages for the error path. As the inflight pinned
pages, specifically for memory region that strides across
multiple chunks, would need more than one free page for
book keeping and accounting. For simplicity, pin pages
for all memory in the IOVA range in one go rather than
have multiple pin_user_pages calls to make up the entire
region. This way it's easier to track and account the
pages already mapped, particularly for clean-up in the
error path.

Fixes: 20453a45fb06 ("vhost: introduce vDPA-based backend")
Signed-off-by: Si-Wei Liu 
---
 drivers/vhost/vdpa.c | 121 +++
 1 file changed, 73 insertions(+), 48 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 796fe97..abc4aa2 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -565,6 +565,8 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
  perm_to_iommu_flags(perm));
}
 
+   if (r)
+   vhost_iotlb_del_range(dev->iotlb, iova, iova + size - 1);
return r;
 }
 
@@ -592,21 +594,19 @@ static int vhost_vdpa_process_iotlb_update(struct 
vhost_vdpa *v,
struct vhost_dev *dev = &v->vdev;
struct vhost_iotlb *iotlb = dev->iotlb;
struct page **page_list;
-   unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
+   struct vm_area_struct **vmas;
unsigned int gup_flags = FOLL_LONGTERM;
-   unsigned long npages, cur_base, map_pfn, last_pfn = 0;
-   unsigned long locked, lock_limit, pinned, i;
+   unsigned long map_pfn, last_pfn = 0;
+   unsigned long npages, lock_limit;
+   unsigned long i, nmap = 0;
u64 iova = msg->iova;
+   long pinned;
int ret = 0;
 
if (vhost_iotlb_itree_first(iotlb, msg->iova,
msg->iova + msg->size - 1))
return -EEXIST;
 
-   page_list = (struct page **) __get_free_page(GFP_KERNEL);
-   if (!page_list)
-   return -ENOMEM;
-
if (msg->perm & VHOST_ACCESS_WO)
gup_flags |= FOLL_WRITE;
 
@@ -614,61 +614,86 @@ static int vhost_vdpa_process_iotlb_update(struct 
vhost_vdpa *v,
if (!npages)
return -EINVAL;
 
+   page_list = kvmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
+   vmas = kvmalloc_array(npages, sizeof(struct vm_area_struct *),
+ GFP_KERNEL);
+   if (!page_list || !vmas) {
+   ret = -ENOMEM;
+   goto free;
+   }
+
mmap_read_lock(dev->mm);
 
-   locked = atomic64_add_return(npages, &dev->mm->pinned_vm);
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
-   if (locked > lock_limit) {
+   if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) {
ret = -ENOMEM;
-   goto out;
+   goto unlock;
}
 
-   cur_base = msg->uaddr & PAGE_MASK;
-   iova &= PAGE_MASK;
+   pinned = pin_user_pages(msg->uaddr & PAGE_MASK, npages, gup_flags,
+   page_list, vmas);
+   if (npages != pinned) {
+   if (pinned < 0) {
+   ret = pinned;
+   } else {
+   unpin_user_pages(page_list, pinned);
+   ret = -ENOMEM;
+   }
+   goto unlock;
+   }
 
-   while (npages) {
-   pinned = min_t(unsigned long, npages, list_size);
-   ret = pin_user_pages(cur_base, pinned,
-gup_flags, page_list, NULL);
-   if (ret != pinned)
-   goto out;
-
-   if (!last_pfn)
-   map_pfn = page_to_pfn(page_list[0]);
-
-   for (i = 0; i < ret; i++) {
-   unsigned long this_pfn = page_to_pfn(page_list[i]);
-   u64 csize;
-
-   if (last_pfn && (this_pfn != last_pfn + 1)) {
-   /* Pin a contiguous chunk of memory */
-   csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
-   if (vhost_vdpa_map(v, iova, csize,
-  map_pfn << PAGE_SHIFT,
-  msg->perm))
-   goto out;
-   map_pfn = this_pfn;
-   iova += csize;
+   iova &= PAGE_MASK;
+   map_pfn = page_to_pfn(page_list[0]);
+
+   /* One more iteration to avoid extra vdpa_map() call out of loop. */
+   for (i = 0; i <= npages; i++) {
+   unsigned long this_pfn;

[PATCH v2] vhost-vdpa: fix page pinning leakage in error path

2020-10-01 Thread Si-Wei Liu
Pinned pages are not properly accounted particularly when
mapping error occurs on IOTLB update. Clean up dangling
pinned pages for the error path. As the inflight pinned
pages, specifically for memory region that strides across
multiple chunks, would need more than one free page for
book keeping and accounting. For simplicity, pin pages
for all memory in the IOVA range in one go rather than
have multiple pin_user_pages calls to make up the entire
region. This way it's easier to track and account the
pages already mapped, particularly for clean-up in the
error path.

Fixes: 4c8cf31885f6 ("vhost: introduce vDPA-based backend")
Signed-off-by: Si-Wei Liu 
---
Changes in v2:
- Fix incorrect target SHA1 referenced

 drivers/vhost/vdpa.c | 121 +++
 1 file changed, 73 insertions(+), 48 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 796fe97..abc4aa2 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -565,6 +565,8 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
  perm_to_iommu_flags(perm));
}
 
+   if (r)
+   vhost_iotlb_del_range(dev->iotlb, iova, iova + size - 1);
return r;
 }
 
@@ -592,21 +594,19 @@ static int vhost_vdpa_process_iotlb_update(struct 
vhost_vdpa *v,
struct vhost_dev *dev = &v->vdev;
struct vhost_iotlb *iotlb = dev->iotlb;
struct page **page_list;
-   unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
+   struct vm_area_struct **vmas;
unsigned int gup_flags = FOLL_LONGTERM;
-   unsigned long npages, cur_base, map_pfn, last_pfn = 0;
-   unsigned long locked, lock_limit, pinned, i;
+   unsigned long map_pfn, last_pfn = 0;
+   unsigned long npages, lock_limit;
+   unsigned long i, nmap = 0;
u64 iova = msg->iova;
+   long pinned;
int ret = 0;
 
if (vhost_iotlb_itree_first(iotlb, msg->iova,
msg->iova + msg->size - 1))
return -EEXIST;
 
-   page_list = (struct page **) __get_free_page(GFP_KERNEL);
-   if (!page_list)
-   return -ENOMEM;
-
if (msg->perm & VHOST_ACCESS_WO)
gup_flags |= FOLL_WRITE;
 
@@ -614,61 +614,86 @@ static int vhost_vdpa_process_iotlb_update(struct 
vhost_vdpa *v,
if (!npages)
return -EINVAL;
 
+   page_list = kvmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
+   vmas = kvmalloc_array(npages, sizeof(struct vm_area_struct *),
+ GFP_KERNEL);
+   if (!page_list || !vmas) {
+   ret = -ENOMEM;
+   goto free;
+   }
+
mmap_read_lock(dev->mm);
 
-   locked = atomic64_add_return(npages, &dev->mm->pinned_vm);
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
-   if (locked > lock_limit) {
+   if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) {
ret = -ENOMEM;
-   goto out;
+   goto unlock;
}
 
-   cur_base = msg->uaddr & PAGE_MASK;
-   iova &= PAGE_MASK;
+   pinned = pin_user_pages(msg->uaddr & PAGE_MASK, npages, gup_flags,
+   page_list, vmas);
+   if (npages != pinned) {
+   if (pinned < 0) {
+   ret = pinned;
+   } else {
+   unpin_user_pages(page_list, pinned);
+   ret = -ENOMEM;
+   }
+   goto unlock;
+   }
 
-   while (npages) {
-   pinned = min_t(unsigned long, npages, list_size);
-   ret = pin_user_pages(cur_base, pinned,
-gup_flags, page_list, NULL);
-   if (ret != pinned)
-   goto out;
-
-   if (!last_pfn)
-   map_pfn = page_to_pfn(page_list[0]);
-
-   for (i = 0; i < ret; i++) {
-   unsigned long this_pfn = page_to_pfn(page_list[i]);
-   u64 csize;
-
-   if (last_pfn && (this_pfn != last_pfn + 1)) {
-   /* Pin a contiguous chunk of memory */
-   csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
-   if (vhost_vdpa_map(v, iova, csize,
-  map_pfn << PAGE_SHIFT,
-  msg->perm))
-   goto out;
-   map_pfn = this_pfn;
-   iova += csize;
+   iova &= PAGE_MASK;
+   map_pfn = page_to_pfn(page_list[0]);
+
+   /* One more iteration to avoid extra vdpa_map() call out of loop. */
+   for (i = 0; i <= npa

Re: [PATCH] vdpa/mlx5: should keep avail_index despite device status

2020-10-02 Thread Si-Wei Liu
+ Eli.

On Thu, Oct 1, 2020 at 2:02 PM Si-Wei Liu  wrote:
>
> A VM with mlx5 vDPA has below warnings while being reset:
>
> vhost VQ 0 ring restore failed: -1: Resource temporarily unavailable (11)
> vhost VQ 1 ring restore failed: -1: Resource temporarily unavailable (11)
>
> We should allow userspace emulating the virtio device be
> able to get to vq's avail_index, regardless of vDPA device
> status. Save the index that was last seen when virtq was
> stopped, so that userspace doesn't complain.
>
> Signed-off-by: Si-Wei Liu 
> ---
>  drivers/vdpa/mlx5/net/mlx5_vnet.c | 20 ++--
>  1 file changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/vdpa/mlx5/net/mlx5_vnet.c 
> b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> index 70676a6..74264e59 100644
> --- a/drivers/vdpa/mlx5/net/mlx5_vnet.c
> +++ b/drivers/vdpa/mlx5/net/mlx5_vnet.c
> @@ -1133,15 +1133,17 @@ static void suspend_vq(struct mlx5_vdpa_net *ndev, 
> struct mlx5_vdpa_virtqueue *m
> if (!mvq->initialized)
> return;
>
> -   if (query_virtqueue(ndev, mvq, &attr)) {
> -   mlx5_vdpa_warn(&ndev->mvdev, "failed to query virtqueue\n");
> -   return;
> -   }
> if (mvq->fw_state != MLX5_VIRTIO_NET_Q_OBJECT_STATE_RDY)
> return;
>
> if (modify_virtqueue(ndev, mvq, 
> MLX5_VIRTIO_NET_Q_OBJECT_STATE_SUSPEND))
> mlx5_vdpa_warn(&ndev->mvdev, "modify to suspend failed\n");
> +
> +   if (query_virtqueue(ndev, mvq, &attr)) {
> +   mlx5_vdpa_warn(&ndev->mvdev, "failed to query virtqueue\n");
> +   return;
> +   }
> +   mvq->avail_idx = attr.available_index;
>  }
>
>  static void suspend_vqs(struct mlx5_vdpa_net *ndev)
> @@ -1411,8 +1413,14 @@ static int mlx5_vdpa_get_vq_state(struct vdpa_device 
> *vdev, u16 idx, struct vdpa
> struct mlx5_virtq_attr attr;
> int err;
>
> -   if (!mvq->initialized)
> -   return -EAGAIN;
> +   /* If the virtq object was destroyed, use the value saved at
> +* the last minute of suspend_vq. This caters for userspace
> +* that cares about emulating the index after vq is stopped.
> +*/
> +   if (!mvq->initialized) {
> +   state->avail_index = mvq->avail_idx;
> +   return 0;
> +   }
>
> err = query_virtqueue(ndev, mvq, &attr);
> if (err) {
> --
> 1.8.3.1
>


[PATCH v3 1/2] vhost-vdpa: fix vhost_vdpa_map() on error condition

2020-10-02 Thread Si-Wei Liu
vhost_vdpa_map() should remove the iotlb entry just added
if the corresponding mapping fails to set up properly.

Fixes: 4c8cf31885f6 ("vhost: introduce vDPA-based backend")
Signed-off-by: Si-Wei Liu 
---
 drivers/vhost/vdpa.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 796fe97..0f27919 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -565,6 +565,9 @@ static int vhost_vdpa_map(struct vhost_vdpa *v,
  perm_to_iommu_flags(perm));
}
 
+   if (r)
+   vhost_iotlb_del_range(dev->iotlb, iova, iova + size - 1);
+
return r;
 }
 
-- 
1.8.3.1



[PATCH v3 0/2] vhost-vdpa mapping error path fixes

2020-10-02 Thread Si-Wei Liu
Commit 4c8cf31885f6 ("vhost: introduce vDPA-based backend")
has following issues in the failure path of IOTLB update:

1) vhost_vdpa_map() does not clean up dangling iotlb entry
   upon mapping failure

2) vhost_vdpa_process_iotlb_update() has leakage of pinned
   pages in case of vhost_vdpa_map() failure

This patchset attempts to address the above issues.

Changes in v3:
- Factor out changes in vhost_vdpa_map() and the fix for
  page pinning leak to separate patches (Jason)

---
Si-Wei Liu (2):
  vhost-vdpa: fix vhost_vdpa_map() on error condition
  vhost-vdpa: fix page pinning leakage in error path

 drivers/vhost/vdpa.c | 122 +++
 1 file changed, 74 insertions(+), 48 deletions(-)

-- 
1.8.3.1



[PATCH v3 2/2] vhost-vdpa: fix page pinning leakage in error path

2020-10-02 Thread Si-Wei Liu
Pinned pages are not properly accounted particularly when
mapping error occurs on IOTLB update. Clean up dangling
pinned pages for the error path. As the inflight pinned
pages, specifically for memory region that strides across
multiple chunks, would need more than one free page for
book keeping and accounting. For simplicity, pin pages
for all memory in the IOVA range in one go rather than
have multiple pin_user_pages calls to make up the entire
region. This way it's easier to track and account the
pages already mapped, particularly for clean-up in the
error path.

Fixes: 4c8cf31885f6 ("vhost: introduce vDPA-based backend")
Signed-off-by: Si-Wei Liu 
---
Changes in v3:
- Factor out vhost_vdpa_map() change to a separate patch

Changes in v2:
- Fix incorrect target SHA1 referenced

 drivers/vhost/vdpa.c | 119 ++-
 1 file changed, 71 insertions(+), 48 deletions(-)

diff --git a/drivers/vhost/vdpa.c b/drivers/vhost/vdpa.c
index 0f27919..dad41dae 100644
--- a/drivers/vhost/vdpa.c
+++ b/drivers/vhost/vdpa.c
@@ -595,21 +595,19 @@ static int vhost_vdpa_process_iotlb_update(struct 
vhost_vdpa *v,
struct vhost_dev *dev = &v->vdev;
struct vhost_iotlb *iotlb = dev->iotlb;
struct page **page_list;
-   unsigned long list_size = PAGE_SIZE / sizeof(struct page *);
+   struct vm_area_struct **vmas;
unsigned int gup_flags = FOLL_LONGTERM;
-   unsigned long npages, cur_base, map_pfn, last_pfn = 0;
-   unsigned long locked, lock_limit, pinned, i;
+   unsigned long map_pfn, last_pfn = 0;
+   unsigned long npages, lock_limit;
+   unsigned long i, nmap = 0;
u64 iova = msg->iova;
+   long pinned;
int ret = 0;
 
if (vhost_iotlb_itree_first(iotlb, msg->iova,
msg->iova + msg->size - 1))
return -EEXIST;
 
-   page_list = (struct page **) __get_free_page(GFP_KERNEL);
-   if (!page_list)
-   return -ENOMEM;
-
if (msg->perm & VHOST_ACCESS_WO)
gup_flags |= FOLL_WRITE;
 
@@ -617,61 +615,86 @@ static int vhost_vdpa_process_iotlb_update(struct 
vhost_vdpa *v,
if (!npages)
return -EINVAL;
 
+   page_list = kvmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
+   vmas = kvmalloc_array(npages, sizeof(struct vm_area_struct *),
+ GFP_KERNEL);
+   if (!page_list || !vmas) {
+   ret = -ENOMEM;
+   goto free;
+   }
+
mmap_read_lock(dev->mm);
 
-   locked = atomic64_add_return(npages, &dev->mm->pinned_vm);
lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
-
-   if (locked > lock_limit) {
+   if (npages + atomic64_read(&dev->mm->pinned_vm) > lock_limit) {
ret = -ENOMEM;
-   goto out;
+   goto unlock;
}
 
-   cur_base = msg->uaddr & PAGE_MASK;
-   iova &= PAGE_MASK;
+   pinned = pin_user_pages(msg->uaddr & PAGE_MASK, npages, gup_flags,
+   page_list, vmas);
+   if (npages != pinned) {
+   if (pinned < 0) {
+   ret = pinned;
+   } else {
+   unpin_user_pages(page_list, pinned);
+   ret = -ENOMEM;
+   }
+   goto unlock;
+   }
 
-   while (npages) {
-   pinned = min_t(unsigned long, npages, list_size);
-   ret = pin_user_pages(cur_base, pinned,
-gup_flags, page_list, NULL);
-   if (ret != pinned)
-   goto out;
-
-   if (!last_pfn)
-   map_pfn = page_to_pfn(page_list[0]);
-
-   for (i = 0; i < ret; i++) {
-   unsigned long this_pfn = page_to_pfn(page_list[i]);
-   u64 csize;
-
-   if (last_pfn && (this_pfn != last_pfn + 1)) {
-   /* Pin a contiguous chunk of memory */
-   csize = (last_pfn - map_pfn + 1) << PAGE_SHIFT;
-   if (vhost_vdpa_map(v, iova, csize,
-  map_pfn << PAGE_SHIFT,
-  msg->perm))
-   goto out;
-   map_pfn = this_pfn;
-   iova += csize;
+   iova &= PAGE_MASK;
+   map_pfn = page_to_pfn(page_list[0]);
+
+   /* One more iteration to avoid extra vdpa_map() call out of loop. */
+   for (i = 0; i <= npages; i++) {
+   unsigned long this_pfn;
+   u64 csize;
+
+   /* The last chunk may have no valid PFN next t