linux-source-2.6.16: inconsistent device detection: eth0<=>eth1
Subject: linux-source-2.6.16: inconsistent device detection: eth0<=>eth1
Package: linux-source-2.6.16
Version: 2.6.16-2
Severity: normal

I have two ethernet cards on my box, an onboard SiS 7012 and a PCI NE2000-compatible. When I boot the system, sometimes the SiS is detected as eth0 and the PCI card as eth1, and sometimes the SiS is eth1 and the PCI card eth0, *without my changing anything*. I just reboot and it's different. This is a big pain in the butt, because I have DHCP on the SiS connection and a separate internal net with a static address on the PCI card, and the configuration in /etc/network/interfaces is based upon specifying eth?.

Attached are two dmesg outputs with the corresponding detections. Run a diff on them. Also attached is the output of lspci -vvv.

[EMAIL PROTECTED]:/proc$ cat /proc/version
Linux version 2.6.16 (Version: spleen.060402.2) ([EMAIL PROTECTED]) (gcc version 4.0.3 20051201 (prerelease) (Debian 4.0.2-5)) #1 Sun Apr 2 21:13:53 CEST 2006

cheers,
-j

-- System Information:
Debian Release: 3.1
  APT prefers testing
  APT policy: (990, 'testing')
Architecture: i386 (i686)
Shell: /bin/sh linked to /bin/bash
Kernel: Linux 2.6.16
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)

Versions of packages linux-source-2.6.16 depends on:
ii  binutils  2.16.1cvs20060117-1  The GNU assembler, linker and bina
ii  bzip2     1.0.2-7              high-quality block-sorting file co

Versions of packages linux-source-2.6.16 recommends:
ii  gcc                   4:4.0.2-1        The GNU C compiler
ii  libc6-dev [libc-dev]  2.3.6-3          GNU C Library: Development Librari
ii  make                  3.80+3.81.rc2-1  The GNU version of the "make" util

-- no debconf information

Linux version 2.6.16 (Version: spleen.060402.2) ([EMAIL PROTECTED]) (gcc version 4.0.3 20051201 (prerelease) (Debian 4.0.2-5)) #1 Sun Apr 2 21:13:53 CEST 2006
BIOS-provided physical RAM map:
 BIOS-e820: - 0009fc00 (usable)
 BIOS-e820: 0009fc00 - 000a (reserved)
 BIOS-e820: 000f - 0010 (reserved)
 BIOS-e820: 0010 - 2fff (usable)
 BIOS-e820: 2fff - 2fff8000 (ACPI data)
 BIOS-e820: 2fff8000 - 3000 (ACPI NVS)
 BIOS-e820: fec0 - fec01000 (reserved)
 BIOS-e820: fee0 - fee01000 (reserved)
 BIOS-e820: ffee - fff0 (reserved)
 BIOS-e820: fffc - 0001 (reserved)
767MB LOWMEM available.
On node 0 totalpages: 196592
  DMA zone: 4096 pages, LIFO batch:0
  DMA32 zone: 0 pages, LIFO batch:0
  Normal zone: 192496 pages, LIFO batch:31
  HighMem zone: 0 pages, LIFO batch:0
DMI 2.3 present.
ACPI: RSDP (v000 AMI ) @ 0x000fa340
ACPI: RSDT (v001 AMIINT SiS735XX 0x1000 MSFT 0x010b) @ 0x2fff
ACPI: FADT (v001 AMIINT SiS735XX 0x1000 MSFT 0x010b) @ 0x2fff0030
ACPI: DSDT (v001 SiS 735 0x0100 MSFT 0x010d) @ 0x
ACPI: PM-Timer IO Port: 0x808
Allocating PCI resources starting at 4000 (gap: 3000:cec0)
Built 1 zonelists
Kernel command line: root=/dev/hda6 ro vga=0x030c pci=routeirq
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 65536 bytes)
Detected 1526.957 MHz processor.
Using pmtmr for high-res timesource
Console: colour VGA+ 132x60
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Memory: 776572k/786368k available (1590k kernel code, 9312k reserved, 449k data, 124k init, 0k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine.. 3056.64 BogoMIPS (lpj=6113294)
Mount-cache hash table entries: 512
CPU: After generic identify, caps: 0383f9ff c1c3f9ff
CPU: After vendor identify, caps: 0383f9ff c1c3f9ff
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 256K (64 bytes/line)
CPU: After all inits, caps: 0383f9ff c1c3f9ff 0020
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: AMD Athlon(tm) XP 1800+ stepping 02
Checking 'hlt' instruction... OK.
ACPI: setting ELCR to 0200 (from 1c20)
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: PCI BIOS revision 2.10 entry at 0xfdb01, last bus=1
PCI: Using configuration type 1
ACPI: Subsystem revision 20060127
ACPI: Interpreter enabled
ACPI: Using PIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (:00)
PCI: Probing PCI hardware (bus 00)
Uncovering SIS18 that hid as a SIS503 (compatible=1)
Boot video device is :01:00.0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: Power Resour
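The nondeterministic eth0/eth1 ordering described in this report is usually pinned down from user space by matching each NIC's MAC address to a fixed name with a udev rule. A minimal sketch follows; the MAC addresses are placeholders, not values from the report, and udev releases of that era used SYSFS{address} where modern udev uses ATTR{address}:

```
# /etc/udev/rules.d/70-persistent-net.rules  (placeholder MAC addresses)
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:11:22:33:44:55", NAME="eth0"
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="66:77:88:99:aa:bb", NAME="eth1"
```

With the names pinned this way, the eth0/eth1 stanzas in /etc/network/interfaces keep matching the same physical card across reboots.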
Re: [PATCH 3/3] printk: implement support for extended console drivers
On Mon, Jun 29, 2015 at 11:49:14AM -0400, Tejun Heo wrote:
> On Mon, Jun 29, 2015 at 05:47:49PM +0200, Geert Uytterhoeven wrote:
> > > netconsole itself is optional & modular. I'm not sure making further
> > > splits is called for, especially given the use cases.
> >
> > It could be a hidden option, selected by its users (e.g. netconsole).
>
> Hmmm... what do you mean?

config PRINTK_BITS_THAT_NETCONSOLE_NEEDS
	(no help text or prompt)

config NETCONSOLE
	select PRINTK_BITS_THAT_NETCONSOLE_NEEDS

That would ensure that the bits added to printk don't get compiled in
unless CONFIG_NETCONSOLE=y.

- Josh Triplett
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] printk: implement support for extended console drivers
On Mon, Jun 29, 2015 at 12:13:55PM -0400, Tejun Heo wrote:
> On Mon, Jun 29, 2015 at 06:11:40PM +0200, Geert Uytterhoeven wrote:
> > On Mon, Jun 29, 2015 at 5:49 PM, Tejun Heo wrote:
> > > On Mon, Jun 29, 2015 at 05:47:49PM +0200, Geert Uytterhoeven wrote:
> > >> > netconsole itself is optional & modular. I'm not sure making further
> > >> > splits is called for, especially given the use cases.
> > >>
> > >> It could be a hidden option, selected by its users (e.g. netconsole).
> > >
> > > Hmmm... what do you mean?
> >
> > init/Kconfig:
> >
> > config PRINTK_EXT_LOG
> > 	bool
> >
> > drivers/net/Kconfig:
> >
> > config NETCONSOLE
> > 	tristate "Network console logging support"
> > 	select PRINTK_EXT_LOG
> >
> > kernel/printk/printk.c:
> >
> > void console_unlock(void)
> > {
> > #ifdef CONFIG_PRINTK_EXT_LOG
> > 	static char ext_text[CONSOLE_EXT_LOG_MAX];
> > #endif
>
> OIC, hmmm... yeah, I think doing it on-demand would be better but will
> try to find out which way is better.

Allocating the buffer dynamically is fine, but in that case the code to
do so should ideally be compiled out. Since printk is used by almost
*all* kernels, while netconsole is not, it's more critical to avoid
letting printk's size balloon.

- Josh Triplett
[PATCH] tcp: add new tcp_mtu_probe_floor sysctl
The current implementation of TCP MTU probing can considerably
underestimate the MTU on lossy connections, allowing the MSS to get down
to 48. We have found that in almost all of these cases on our networks
these paths can handle much larger MTUs, meaning the connections are
being artificially limited. Even though TCP MTU probing can raise the
MSS back up, we have seen this not to be the case, causing connections
to be "stuck" with an MSS of 48 when heavy loss is present.

Prior to pushing out this change we could not keep TCP MTU probing
enabled b/c of the above reasons. Now with a reasonable floor set we've
had it enabled for the past 6 months.

The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
administrators the ability to control the floor of MSS probing.

Signed-off-by: Josh Hunt
---
 Documentation/networking/ip-sysctl.txt | 6 ++++++
 include/net/netns/ipv4.h               | 1 +
 net/ipv4/sysctl_net_ipv4.c             | 9 +++++++++
 net/ipv4/tcp_ipv4.c                    | 1 +
 net/ipv4/tcp_timer.c                   | 2 +-
 5 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index df33674799b5..49e95f438ed7 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -256,6 +256,12 @@ tcp_base_mss - INTEGER
 	Path MTU discovery (MTU probing). If MTU probing is enabled,
 	this is the initial MSS used by the connection.
 
+tcp_mtu_probe_floor - INTEGER
+	If MTU probing is enabled this caps the minimum MSS used for search_low
+	for the connection.
+
+	Default : 48
+
 tcp_min_snd_mss - INTEGER
 	TCP SYN and SYNACK messages usually advertise an ADVMSS option,
 	as described in RFC 1122 and RFC 6691.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index bc24a8ec1ce5..c0c0791b1912 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -116,6 +116,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_l3mdev_accept;
 #endif
 	int sysctl_tcp_mtu_probing;
+	int sysctl_tcp_mtu_probe_floor;
 	int sysctl_tcp_base_mss;
 	int sysctl_tcp_min_snd_mss;
 	int sysctl_tcp_probe_threshold;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 0b980e841927..59ded25acd04 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -820,6 +820,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra2		= &tcp_min_snd_mss_max,
 	},
 	{
+		.procname	= "tcp_mtu_probe_floor",
+		.data		= &init_net.ipv4.sysctl_tcp_mtu_probe_floor,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &tcp_min_snd_mss_min,
+		.extra2		= &tcp_min_snd_mss_max,
+	},
+	{
 		.procname	= "tcp_probe_threshold",
 		.data		= &init_net.ipv4.sysctl_tcp_probe_threshold,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d57641cb3477..e0a372676329 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2637,6 +2637,7 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS;
 	net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
 	net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
+	net->ipv4.sysctl_tcp_mtu_probe_floor = TCP_MIN_SND_MSS;
 	net->ipv4.sysctl_tcp_keepalive_time = TCP_KEEPALIVE_TIME;
 	net->ipv4.sysctl_tcp_keepalive_probes = TCP_KEEPALIVE_PROBES;
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index c801cd37cc2a..dbd9d2d0ee63 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -154,7 +154,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
 	} else {
 		mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
 		mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
-		mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
+		mss = max(mss, net->ipv4.sysctl_tcp_mtu_probe_floor);
 		mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
 		icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
 	}
-- 
2.7.4
Re: [PATCH] tcp: add new tcp_mtu_probe_floor sysctl
On 7/27/19 12:05 AM, Eric Dumazet wrote:
> On Sat, Jul 27, 2019 at 4:23 AM Josh Hunt wrote:
>> The current implementation of TCP MTU probing can considerably
>> underestimate the MTU on lossy connections allowing the MSS to get down
>> to 48. We have found that in almost all of these cases on our networks
>> these paths can handle much larger MTUs meaning the connections are
>> being artificially limited. Even though TCP MTU probing can raise the
>> MSS back up we have seen this not to be the case causing connections to
>> be "stuck" with an MSS of 48 when heavy loss is present.
>>
>> Prior to pushing out this change we could not keep TCP MTU probing
>> enabled b/c of the above reasons. Now with a reasonble floor set we've
>> had it enabled for the past 6 months.
>
> And what reasonable value have you used ???

Reasonable for some may not be reasonable for others hence the new sysctl :)

We're currently running with a fairly high value based off of the v6 min
MTU minus headers and options, etc. We went conservative with our
setting initially as it seemed a reasonable first step when re-enabling
TCP MTU probing, since with no configurable floor we saw a # of cases
where connections were using severely reduced mss b/c of loss and not
b/c of actual path restriction. I plan to reevaluate the setting at some
point, but since the probing method is still the same it means the same
clients who got stuck with mss of 48 before will land at whatever floor
we set. Looking forward we are interested in trying to improve TCP MTU
probing so it does not penalize clients like this.

A suggestion for a more reasonable floor default would be 512, which is
the same as the min_pmtu. Given both mechanisms are trying to achieve
the same goal it seems like they should have a similar min/floor.

>> The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
>> administrators the ability to control the floor of MSS probing.
Re: [PATCH] tcp: add new tcp_mtu_probe_floor sysctl
On 7/28/19 6:54 AM, Eric Dumazet wrote:
> [...]
Re: [PATCH] tcp: add new tcp_mtu_probe_floor sysctl
On 7/28/19 2:14 PM, Josh Hunt wrote:
> [...]
Re: [PATCH] tcp: add new tcp_mtu_probe_floor sysctl
On 7/29/19 6:12 AM, Neal Cardwell wrote:
> [...]
[PATCH v2 1/2] tcp: add new tcp_mtu_probe_floor sysctl
The current implementation of TCP MTU probing can considerably
underestimate the MTU on lossy connections, allowing the MSS to get down
to 48. We have found that in almost all of these cases on our networks
these paths can handle much larger MTUs, meaning the connections are
being artificially limited. Even though TCP MTU probing can raise the
MSS back up, we have seen this not to be the case, causing connections
to be "stuck" with an MSS of 48 when heavy loss is present.

Prior to pushing out this change we could not keep TCP MTU probing
enabled b/c of the above reasons. Now with a reasonable floor set we've
had it enabled for the past 6 months.

The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
administrators the ability to control the floor of MSS probing.

Signed-off-by: Josh Hunt
---
 Documentation/networking/ip-sysctl.txt | 6 ++++++
 include/net/netns/ipv4.h               | 1 +
 net/ipv4/sysctl_net_ipv4.c             | 9 +++++++++
 net/ipv4/tcp_ipv4.c                    | 1 +
 net/ipv4/tcp_timer.c                   | 2 +-
 5 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index df33674799b5..49e95f438ed7 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -256,6 +256,12 @@ tcp_base_mss - INTEGER
 	Path MTU discovery (MTU probing). If MTU probing is enabled,
 	this is the initial MSS used by the connection.
 
+tcp_mtu_probe_floor - INTEGER
+	If MTU probing is enabled this caps the minimum MSS used for search_low
+	for the connection.
+
+	Default : 48
+
 tcp_min_snd_mss - INTEGER
 	TCP SYN and SYNACK messages usually advertise an ADVMSS option,
 	as described in RFC 1122 and RFC 6691.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index bc24a8ec1ce5..c0c0791b1912 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -116,6 +116,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_l3mdev_accept;
 #endif
 	int sysctl_tcp_mtu_probing;
+	int sysctl_tcp_mtu_probe_floor;
 	int sysctl_tcp_base_mss;
 	int sysctl_tcp_min_snd_mss;
 	int sysctl_tcp_probe_threshold;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 0b980e841927..59ded25acd04 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -820,6 +820,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra2		= &tcp_min_snd_mss_max,
 	},
 	{
+		.procname	= "tcp_mtu_probe_floor",
+		.data		= &init_net.ipv4.sysctl_tcp_mtu_probe_floor,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &tcp_min_snd_mss_min,
+		.extra2		= &tcp_min_snd_mss_max,
+	},
+	{
 		.procname	= "tcp_probe_threshold",
 		.data		= &init_net.ipv4.sysctl_tcp_probe_threshold,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d57641cb3477..e0a372676329 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2637,6 +2637,7 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS;
 	net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
 	net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
+	net->ipv4.sysctl_tcp_mtu_probe_floor = TCP_MIN_SND_MSS;
 	net->ipv4.sysctl_tcp_keepalive_time = TCP_KEEPALIVE_TIME;
 	net->ipv4.sysctl_tcp_keepalive_probes = TCP_KEEPALIVE_PROBES;
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index c801cd37cc2a..dbd9d2d0ee63 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -154,7 +154,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
 	} else {
 		mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
 		mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
-		mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
+		mss = max(mss, net->ipv4.sysctl_tcp_mtu_probe_floor);
 		mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
 		icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
 	}
-- 
2.7.4
[PATCH v2 2/2] tcp: Update TCP_BASE_MSS comment
TCP_BASE_MSS is used as the default initial MSS value when MTU probing is enabled. Update the comment to reflect this.

Suggested-by: Neal Cardwell
Signed-off-by: Josh Hunt
---
 include/net/tcp.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 81e8ade1e6e4..9e9fbfaf052b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -64,7 +64,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 /* Minimal accepted MSS. It is (60+60+8) - (20+20). */
 #define TCP_MIN_MSS		88U
 
-/* The least MTU to use for probing */
+/* The initial MTU to use for probing */
 #define TCP_BASE_MSS		1024
 
 /* probing interval, default to 10 minutes as per RFC4821 */
-- 
2.7.4
Re: [PATCH v2 1/2] tcp: add new tcp_mtu_probe_floor sysctl
On 8/7/19 11:12 PM, Eric Dumazet wrote: On 8/8/19 1:52 AM, Josh Hunt wrote: The current implementation of TCP MTU probing can considerably underestimate the MTU on lossy connections allowing the MSS to get down to 48. We have found that in almost all of these cases on our networks these paths can handle much larger MTUs meaning the connections are being artificially limited. Even though TCP MTU probing can raise the MSS back up we have seen this not to be the case causing connections to be "stuck" with an MSS of 48 when heavy loss is present. Prior to pushing out this change we could not keep TCP MTU probing enabled b/c of the above reasons. Now with a reasonble floor set we've had it enabled for the past 6 months. I am still sad to see you do not share what is a reasonable value and let everybody guess. It seems to be a top-secret value. Haha, no sorry I didn't mean for it to come across like that. We are currently setting tcp_base_mss to 1348 and tcp_mtu_probe_floor to 1208. I thought I mentioned it in our earlier mails, but I guess I did not put the exact #s. 1348 was derived after analyzing common MTU we see across our networks and noticing that an MTU of around 1400 would cover a very large % (sorry I don't have the #s handy) of those paths. 1400 - 20 - 20 - 12 = 1348. For the floor we based it off the v6 min MTU of 1280 and subtracted out headers, etc, so 1280 - 40 - 20 - 12 = 1208. Using a floor of 1280 MTU matches what the RFC suggests in section 7.7 for v6 connections, we're just applying that to v4 right now as well. I guess that brings up a good point, would per-IP proto floor sysctls be an option here for upstream (the RFC suggests different floors for v4 and v6)? For now I'm not sure it makes sense b/c of the problems we see with lossy connections, but in the future if that can be resolved it seems like it would give some more flexibility. I'd like to investigate this all further at some point to see if we can make it work better for lossy connections. 
It looks like one of the problems is there are a # of conditions which cause us to not probe in the upward direction. I'm not sure if any of those can be relaxed/changed and if so would help these connections out. Thanks Josh
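The MSS derivations quoted above can be restated explicitly. This is just the arithmetic from the mail (IP header + TCP header + timestamp option subtracted from the chosen MTU), not kernel code:

```python
TCP_HDR = 20    # base TCP header
TS_OPT = 12     # TCP timestamp option
IPV4_HDR = 20
IPV6_HDR = 40

def mss_from_mtu(mtu, ip_hdr):
    """MSS remaining after the IP header, TCP header and timestamp option."""
    return mtu - ip_hdr - TCP_HDR - TS_OPT

# tcp_base_mss derived from a ~1400-byte common-path MTU over IPv4
print(mss_from_mtu(1400, IPV4_HDR))  # 1348

# tcp_mtu_probe_floor derived from the IPv6 minimum MTU of 1280
print(mss_from_mtu(1280, IPV6_HDR))  # 1208
```

The same 1280-byte floor is applied to v4 paths here, as noted, even though RFC 4821 suggests different per-protocol minimums.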
Re: [PATCH v2 2/2] tcp: Update TCP_BASE_MSS comment
On 8/7/19 11:13 PM, Eric Dumazet wrote: On 8/8/19 1:52 AM, Josh Hunt wrote: TCP_BASE_MSS is used as the default initial MSS value when MTU probing is enabled. Update the comment to reflect this. Suggested-by: Neal Cardwell Signed-off-by: Josh Hunt --- Signed-off-by: Eric Dumazet Dave I forgot to tag these at net-next. Do I need to resubmit a v3 with net-next in the subject? Thanks Josh
Re: [PATCH v2 2/2] tcp: Update TCP_BASE_MSS comment
On 8/9/19 11:43 AM, David Miller wrote: From: Josh Hunt Date: Fri, 9 Aug 2019 11:38:05 -0700 I forgot to tag these at net-next. Do I need to resubmit a v3 with net-next in the subject? No need. Thanks
Re: [PATCH net-next] gso: enable udp gso for virtual devices
On 6/26/19 4:41 PM, Willem de Bruijn wrote: On Wed, Jun 26, 2019 at 3:17 PM Jason Baron wrote: On 6/14/19 4:53 PM, Jason Baron wrote: On 6/13/19 5:20 PM, Willem de Bruijn wrote: @@ -237,6 +237,7 @@ static inline int find_next_netdev_feature(u64 feature, unsigned long start) NETIF_F_GSO_GRE_CSUM | \ NETIF_F_GSO_IPXIP4 | \ NETIF_F_GSO_IPXIP6 | \ +NETIF_F_GSO_UDP_L4 | \ NETIF_F_GSO_UDP_TUNNEL | \ NETIF_F_GSO_UDP_TUNNEL_CSUM) Are you adding this to NETIF_F_GSO_ENCAP_ALL? Wouldn't it make more sense to add it to NETIF_F_GSO_SOFTWARE? Yes, I'm adding to NETIF_F_GSO_ENCAP_ALL (not very clear from the context). I will fix the commit log. In: 83aa025 udp: add gso support to virtual devices, the support was also added to NETIF_F_GSO_ENCAP_ALL (although subsequently reverted due to UDP GRO not being in place), so I wonder what the reason was for that? That was probably just a bad choice on my part. It worked in practice, but if NETIF_F_GSO_SOFTWARE works the same without unexpected side effects, then I agree that it is the better choice. That choice does appear to change behavior when sending over tunnel devices. Might it send tunneled GSO packets over loopback? I set up a test case using fou tunneling through a bridge device using the udpgso_bench_tx test where packets are not received correctly if NETIF_F_GSO_UDP_L4 is added to NETIF_F_GSO_SOFTWARE. If I have it added to NETIF_F_GSO_ENCAP_ALL, it does work correctly. So there are more fixes required to include it in NETIF_F_GSO_SOFTWARE. The use-case I have only requires it to be in NETIF_F_GSO_ENCAP_ALL, but if it needs to go in NETIF_F_GSO_SOFTWARE, I can look at what's required more next week. Hi, I haven't had a chance to investigate what goes wrong with including NETIF_F_GSO_UDP_L4 in NETIF_F_GSO_SOFTWARE - but I was just wondering if people are ok with NETIF_F_GSO_UDP_L4 being added to NETIF_F_GSO_ENCAP_ALL and not NETIF_F_GSO_SOFTWARE (ie the original patch as posted)? 
As I mentioned that is sufficient for my use-case, and its how Willem originally proposed this. Indeed, based on the previous discussion this sounds fine to me. Willem Are you OK to ACK this? If not, is there something else you'd rather see here? Thanks Josh
Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
On 4/2/21 12:25 PM, Jiri Kosina wrote: On Thu, 3 Sep 2020, John Fastabend wrote: At this point I fear we could consider reverting the NOLOCK stuff. I personally would hate doing so, but it looks like NOLOCK benefits are outweighed by its issues. I agree, NOLOCK brings more pains than gains. There are many race conditions hidden in generic qdisc layer, another one is enqueue vs. reset which is being discussed in another thread. Sure. Seems they crept in over time. I had some plans to write a lockless HTB implementation. But with fq+EDT with BPF it seems that it is no longer needed, we have a more generic/better solution. So I dropped it. Also most folks should really be using fq, fq_codel, etc. by default anyways. Using pfifo_fast alone is not ideal IMO. Half a year later, we still have the NOLOCK implementation present, and pfifo_fast still does set the TCQ_F_NOLOCK flag on itself. And we've just been bitten by this very same race which appears to be still unfixed, with single packet being stuck in pfifo_fast qdisc basically indefinitely due to this very race that this whole thread began with back in 2019. Unless there are (a) any nice ideas how to solve this in an elegant way without (re-)introducing extra spinlock (Cong's fix) or (b) any objections to revert as per the argumentation above I'll be happy to send a revert of the whole NOLOCK implementation next week. Jiri If you have a reproducer can you try https://lkml.org/lkml/2021/3/24/1485 ? If that doesn't work I think your suggestion of reverting nolock makes sense to me. We've moved to using fq as our default now b/c of this bug. Josh
Re: BUG: KASAN: stack-out-of-bounds in unwind_next_frame+0x1df5/0x2650
On Wed, Feb 03, 2021 at 09:46:55AM -0800, Ivan Babrou wrote: > > Can you pretty please not line-wrap console output? It's unreadable. > > GMail doesn't make it easy, I'll send a link to a pastebin next time. > Let me know if you'd like me to regenerate the decoded stack. > > > > edfd9b7838ba5e47f19ad8466d0565aba5c59bf0 is the first bad commit > > > commit edfd9b7838ba5e47f19ad8466d0565aba5c59bf0 > > > > Not sure what tree you're on, but that's not the upstream commit. > > I mentioned that it's a rebased core-static_call-2020-10-12 tag and > added a link to the upstream hash right below. > > > > Author: Steven Rostedt (VMware) > > > Date: Tue Aug 18 15:57:52 2020 +0200 > > > > > > tracepoint: Optimize using static_call() > > > > > > > There's a known issue with that patch, can you try: > > > > http://lkml.kernel.org/r/20210202220121.435051...@goodmis.org > > I've tried it on top of core-static_call-2020-10-12 tag rebased on top > of v5.9 (to make it reproducible), and the patch did not help. Do I > need to apply the whole series or something else? Can you recreate with this patch, and add "unwind_debug" to the cmdline? It will spit out a bunch of stack data. From: Josh Poimboeuf Subject: [PATCH] Subject: [PATCH] x86/unwind: Add 'unwind_debug' cmdline option Sometimes the one-line ORC unwinder warnings aren't very helpful. Take the existing frame pointer unwind_dump() and make it useful for all unwinders. I don't want to be too aggressive about enabling the dumps, so for now they're only enabled with the use of a new 'unwind_debug' cmdline option. When enabled, it will dump the full contents of the stack when an error condition is encountered, or when dump_stack() is called. 
Signed-off-by: Josh Poimboeuf --- .../admin-guide/kernel-parameters.txt | 6 +++ arch/x86/include/asm/unwind.h | 3 ++ arch/x86/kernel/dumpstack.c | 39 ++ arch/x86/kernel/unwind_frame.c| 51 +++ arch/x86/kernel/unwind_orc.c | 5 +- 5 files changed, 58 insertions(+), 46 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 3d6604a949f8..d29689aa62a2 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -5521,6 +5521,12 @@ unknown_nmi_panic [X86] Cause panic on unknown NMI. + unwind_debug[X86-64] + Enable unwinder debug output. This can be + useful for debugging certain unwinder error + conditions, including corrupt stacks and + bad/missing unwinder metadata. + usbcore.authorized_default= [USB] Default USB device authorization: (default -1 = authorized except for wireless USB, diff --git a/arch/x86/include/asm/unwind.h b/arch/x86/include/asm/unwind.h index 70fc159ebe69..5101d7ef7912 100644 --- a/arch/x86/include/asm/unwind.h +++ b/arch/x86/include/asm/unwind.h @@ -123,4 +123,7 @@ static inline bool task_on_another_cpu(struct task_struct *task) #endif } +extern bool unwind_debug __ro_after_init; +void unwind_dump(struct unwind_state *state); + #endif /* _ASM_X86_UNWIND_H */ diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c index 299c20f0a38b..febfd5b7f62a 100644 --- a/arch/x86/kernel/dumpstack.c +++ b/arch/x86/kernel/dumpstack.c @@ -29,6 +29,42 @@ static int die_counter; static struct pt_regs exec_summary_regs; +bool unwind_debug __ro_after_init; +static int __init unwind_debug_cmdline(char *str) +{ + unwind_debug = true; + return 0; +} +early_param("unwind_debug", unwind_debug_cmdline); + +void unwind_dump(struct unwind_state *state) +{ + unsigned long word, *sp; + struct stack_info stack_info = {0}; + unsigned long visit_mask = 0; + + printk_deferred("unwinder dump: stack type:%d next_sp:%p mask:0x%lx 
graph_idx:%d\n", + state->stack_info.type, state->stack_info.next_sp, + state->stack_mask, state->graph_idx); + + sp = state->task == current ? __builtin_frame_address(0) + : (void *)state->task->thread.sp; + + for (; sp; sp = PTR_ALIGN(stack_info.next_sp, sizeof(long))) { + if (get_stack_info(sp, state->task, &stack_info, &visit_mask)) + break; + + for (; sp < stack_info.end; sp++) { + + word = READ_ONCE_NOCHECK(*sp); + + printk_deferred
Re: BUG: KASAN: stack-out-of-bounds in unwind_next_frame+0x1df5/0x2650
On Wed, Feb 03, 2021 at 02:41:53PM -0800, Ivan Babrou wrote: > On Wed, Feb 3, 2021 at 11:05 AM Josh Poimboeuf wrote: > > > > On Wed, Feb 03, 2021 at 09:46:55AM -0800, Ivan Babrou wrote: > > > > Can you pretty please not line-wrap console output? It's unreadable. > > > > > > GMail doesn't make it easy, I'll send a link to a pastebin next time. > > > Let me know if you'd like me to regenerate the decoded stack. > > > > > > > > edfd9b7838ba5e47f19ad8466d0565aba5c59bf0 is the first bad commit > > > > > commit edfd9b7838ba5e47f19ad8466d0565aba5c59bf0 > > > > > > > > Not sure what tree you're on, but that's not the upstream commit. > > > > > > I mentioned that it's a rebased core-static_call-2020-10-12 tag and > > > added a link to the upstream hash right below. > > > > > > > > Author: Steven Rostedt (VMware) > > > > > Date: Tue Aug 18 15:57:52 2020 +0200 > > > > > > > > > > tracepoint: Optimize using static_call() > > > > > > > > > > > > > There's a known issue with that patch, can you try: > > > > > > > > http://lkml.kernel.org/r/20210202220121.435051...@goodmis.org > > > > > > I've tried it on top of core-static_call-2020-10-12 tag rebased on top > > > of v5.9 (to make it reproducible), and the patch did not help. Do I > > > need to apply the whole series or something else? > > > > Can you recreate with this patch, and add "unwind_debug" to the cmdline? > > It will spit out a bunch of stack data. > > Here's the three I'm building: > > * https://github.com/bobrik/linux/tree/ivan/static-call-5.9 > > It contains: > > * v5.9 tag as the base > * static_call-2020-10-12 tag > * dm-crypt patches to reproduce the issue with KASAN > * x86/unwind: Add 'unwind_debug' cmdline option > * tracepoint: Fix race between tracing and removing tracepoint > > The very same issue can be reproduced on 5.10.11 with no patches, > but I'm going with 5.9, since it boils down to static call changes. 
> > Here's the decoded stack from the kernel with unwind debug enabled: > > * https://gist.github.com/bobrik/ed052ac0ae44c880f3170299ad4af56b > > See my first email for the exact commands that trigger this. Thanks. Do you happen to have the original dmesg, before running it through the post-processing script? I assume you're using decode_stacktrace.sh? It could use some improvement, it's stripping the function offset. Also spaces are getting inserted in odd places, messing the alignment. [ 137.291837][C0] 88809c409858: d7c4f3ce817a1700 (0xd7c4f3ce817a1700) [ 137.291837][C0] 88809c409860: (0x0) [ 137.291839][ C0] 88809c409868: (0x) [ 137.291841][ C0] 88809c409870: a4f01a52 unwind_next_frame (arch/x86/kernel/unwind_orc.c:380 arch/x86/kernel/unwind_orc.c:553) [ 137.291843][ C0] 88809c409878: a4f01a52 unwind_next_frame (arch/x86/kernel/unwind_orc.c:380 arch/x86/kernel/unwind_orc.c:553) [ 137.291844][C0] 88809c409880: 88809c409ac8 (0x88809c409ac8) [ 137.291845][C0] 88809c409888: 0086 (0x86) -- Josh
Re: BUG: KASAN: stack-out-of-bounds in unwind_next_frame+0x1df5/0x2650
On Wed, Feb 03, 2021 at 03:30:35PM -0800, Ivan Babrou wrote: > > > > Can you recreate with this patch, and add "unwind_debug" to the cmdline? > > > > It will spit out a bunch of stack data. > > > > > > Here's the three I'm building: > > > > > > * https://github.com/bobrik/linux/tree/ivan/static-call-5.9 > > > > > > It contains: > > > > > > * v5.9 tag as the base > > > * static_call-2020-10-12 tag > > > * dm-crypt patches to reproduce the issue with KASAN > > > * x86/unwind: Add 'unwind_debug' cmdline option > > > * tracepoint: Fix race between tracing and removing tracepoint > > > > > > The very same issue can be reproduced on 5.10.11 with no patches, > > > but I'm going with 5.9, since it boils down to static call changes. > > > > > > Here's the decoded stack from the kernel with unwind debug enabled: > > > > > > * https://gist.github.com/bobrik/ed052ac0ae44c880f3170299ad4af56b > > > > > > See my first email for the exact commands that trigger this. > > > > Thanks. Do you happen to have the original dmesg, before running it > > through the post-processing script? > > Yes, here it is: > > * https://gist.github.com/bobrik/8c13e6a02555fb21cadabb74cdd6f9ab It appears the unwinder is getting lost in crypto code. No idea what this has to do with static calls though. Or maybe you're seeing multiple issues. Does this fix it? 
diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index a31de0c6ccde..36c55341137c 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -2,7 +2,14 @@
 #
 # x86 crypto algorithms
 
-OBJECT_FILES_NON_STANDARD := y
+OBJECT_FILES_NON_STANDARD_sha256-avx2-asm.o		:= y
+OBJECT_FILES_NON_STANDARD_sha512-ssse3-asm.o		:= y
+OBJECT_FILES_NON_STANDARD_sha512-avx-asm.o		:= y
+OBJECT_FILES_NON_STANDARD_sha512-avx2-asm.o		:= y
+OBJECT_FILES_NON_STANDARD_crc32c-pcl-intel-asm_64.o	:= y
+OBJECT_FILES_NON_STANDARD_camellia-aesni-avx2-asm_64.o	:= y
+OBJECT_FILES_NON_STANDARD_sha1_avx2_x86_64_asm.o	:= y
+OBJECT_FILES_NON_STANDARD_sha1_ni_asm.o			:= y
 
 obj-$(CONFIG_CRYPTO_GLUE_HELPER_X86) += glue_helper.o
 
diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index 5fee47956f3b..59c36b88954f 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -237,8 +237,8 @@ define_reg j %j
 .noaltmacro
 .endm
 
-# need to push 4 registers into stack to maintain
-STACK_OFFSET = 8*4
+# need to push 5 registers into stack to maintain
+STACK_OFFSET = 8*5
 
 TMP1 =   16*0    # Temporary storage for AAD
 TMP2 =   16*1    # Temporary storage for AES State 2 (State 1 is stored in an XMM register)
@@ -257,6 +257,8 @@ VARIABLE_OFFSET = 16*8
 
 .macro FUNC_SAVE
         #the number of pushes must equal STACK_OFFSET
+        push    %rbp
+        mov     %rsp, %rbp
         push    %r12
         push    %r13
         push    %r14
@@ -271,12 +273,14 @@ VARIABLE_OFFSET = 16*8
 .endm
 
 .macro FUNC_RESTORE
+        add     $VARIABLE_OFFSET, %rsp
         mov     %r14, %rsp
 
         pop     %r15
         pop     %r14
         pop     %r13
         pop     %r12
+        pop     %rbp
 .endm
 
 # Encryption of a single block
Re: BUG: KASAN: stack-out-of-bounds in unwind_next_frame+0x1df5/0x2650
On Wed, Feb 03, 2021 at 04:52:42PM -0800, Ivan Babrou wrote: > We also have the following stack that doesn't touch any crypto: > > * https://gist.github.com/bobrik/40e2559add2f0b26ae39da30dc451f1e Can you also run this through decode_stacktrace.sh? Both are useful (until I submit a fix for decode_stacktrace.sh). > I cannot reproduce this one, and it took 2 days of uptime for it to > happen. Is there anything I can do to help diagnose it? Can you run with the same unwind_debug patch+cmdline when you try to recreate this one? In the meantime I'll look at the available data. -- Josh
Re: BUG: KASAN: stack-out-of-bounds in unwind_next_frame+0x1df5/0x2650
On Wed, Feb 03, 2021 at 09:44:48PM -0500, Steven Rostedt wrote: > > > [ 128.441287][C0] RIP: 0010:skcipher_walk_next > > > (crypto/skcipher.c:322 crypto/skcipher.c:384) > > Why do we have an RIP in skcipher_walk_next, if its the unwinder that > had a bug? Or are they related? > > Or did skcipher_walk_next trigger something in KASAN which did a stack > walk via the unwinder, and that caused another issue? It was interrupted by an IRQ, which then called kfree(), which then called kasan_save_stack(), which then called the unwinder, which then read "out-of-bounds" between stack frames. In this case it was because of some crypto code missing ORC annotations. > Looking at the unwinder code in question, we have: > > static bool deref_stack_regs(struct unwind_state *state, unsigned long addr, > unsigned long *ip, unsigned long *sp) > { > struct pt_regs *regs = (struct pt_regs *)addr; > > /* x86-32 support will be more complicated due to the ®s->sp hack > */ > BUILD_BUG_ON(IS_ENABLED(CONFIG_X86_32)); > > if (!stack_access_ok(state, addr, sizeof(struct pt_regs))) > return false; > > *ip = regs->ip; > *sp = regs->sp; <- pointer to here > return true; > } > > and the caller of the above static function: > > case UNWIND_HINT_TYPE_REGS: > if (!deref_stack_regs(state, sp, &state->ip, &state->sp)) { > orc_warn_current("can't access registers at %pB\n", > (void *)orig_ip); > goto err; > } > > > Could it possibly be that there's some magic canary on the stack that > causes KASAN to trigger if you read it? Right, the unwinder isn't allowed to read between stack frames. 
In fact, you read my mind, I was looking at the other warning in network code: [160676.598929][C4] asm_common_interrupt+0x1e/0x40 [160676.608966][C4] RIP: 0010:0xc17d814c [160676.618812][C4] Code: 8b 4c 24 40 4c 8b 44 24 48 48 8b 7c 24 70 48 8b 74 24 68 48 8b 54 24 60 48 8b 4c 24 58 48 8b 44 24 50 48 81 c4 a8 00 00 00 9d 20 27 af 8f ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 [160676.649371][C4] RSP: 0018:8893dfd4f620 EFLAGS: 0282 [160676.661073][C4] RAX: RBX: 8881be9c9c80 RCX: [160676.674788][C4] RDX: dc00 RSI: 000b RDI: 8881be9c9c80 [160676.688508][C4] RBP: 8881be9c9ce0 R08: R09: 8881908c4c97 [160676.702249][C4] R10: ed1032118992 R11: 88818a4ce68c R12: 8881be9c9eea [160676.716000][C4] R13: 8881be9c9c92 R14: 8880063ba5ac R15: 8880063ba5a8 [160676.729895][C4] ? tcp_set_state+0x5/0x620 [160676.740426][C4] ? tcp_fin+0xeb/0x5a0 [160676.750287][C4] ? tcp_data_queue+0x1e78/0x4ce0 [160676.761089][C4] ? tcp_urg+0x76/0xc50 This line gives a big clue: [160676.608966][C4] RIP: 0010:0xc17d814c That address, without a function name, most likely means that it was running in some generated code (mostly likely BPF) when it got interrupted. Right now, the ORC unwinder tries to fall back to frame pointers when it encounters generated code: orc = orc_find(state->signal ? state->ip : state->ip - 1); if (!orc) /* * As a fallback, try to assume this code uses a frame pointer. * This is useful for generated code, like BPF, which ORC * doesn't know about. This is just a guess, so the rest of * the unwind is no longer considered reliable. */ orc = &orc_fp_entry; state->error = true; } Because the ORC unwinder is guessing from that point onward, it's possible for it to read the KASAN stack redzone, if the generated code hasn't set up frame pointers. So the best fix may be for the unwinder to just always bypass KASAN when reading the stack. The unwinder has a mechanism for detecting and warning about out-of-bounds, and KASAN is short-circuiting that. 
This should hopefully get rid of *all* the KASAN unwinder warnings, both crypto and networking.

diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index 040194d079b6..1f69a23a4715 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -376,8 +376,8 @@ static bool deref_stack_regs(struct unwind_state *state, unsigned long addr,
 	if (!stack_access_ok(state, addr, sizeof(struct pt_regs)))
 		return false;
 
-	*ip = regs->ip;
-	*sp = regs->sp;
+	*ip = READ_ONCE_NOCHECK(regs->ip);
+	*sp = READ_ONCE_NOCHECK(regs->sp);
 
 	return true;
 }
 
@@ -389,8 +389,8 @@ static bool deref_stack_iret_regs(struct unwind_state *state, unsigned long addr
 	if (!stack_access_ok(state, addr, IRET_FRAME_SIZE))
Re: BUG: KASAN: stack-out-of-bounds in unwind_next_frame+0x1df5/0x2650
On Thu, Feb 04, 2021 at 11:51:44AM -0800, Ivan Babrou wrote: > > .macro FUNC_SAVE > > #the number of pushes must equal STACK_OFFSET > > + push%rbp > > + mov %rsp, %rbp > > push%r12 > > push%r13 > > push%r14 > > @@ -271,12 +273,14 @@ VARIABLE_OFFSET = 16*8 > > .endm > > > > .macro FUNC_RESTORE > > +add $VARIABLE_OFFSET, %rsp > > mov %r14, %rsp > > > > pop %r15 > > pop %r14 > > pop %r13 > > pop %r12 > > + pop %rbp > > .endm > > > > # Encryption of a single block > > > > This patch seems to fix the following warning: > > [ 147.995699][C0] WARNING: stack going in the wrong direction? at > glue_xts_req_128bit+0x21f/0x6f0 [glue_helper] > > Or at least I cannot see it anymore when combined with your other > patch, not sure if it did the trick by itself. > > This sounds like a good reason to send them both. Ok, that's what I expected. The other patch fixed the unwinder failure mode to be the above (harmless) unwinder warning, instead of a disruptive KASAN failure. This patch fixes the specific underlying crypto unwinding metadata issue. I'll definitely be sending both fixes. The improved failure mode patch will come first because it's more urgent and lower risk. -- Josh
Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
Hi Jike On 8/20/20 12:43 AM, Jike Song wrote: Hi Josh, We met possibly the same problem when testing nvidia/mellanox's GPUDirect RDMA product, we found that changing NET_SCH_DEFAULT to DEFAULT_FQ_CODEL mitigated the problem, having no idea why. Maybe you can also have a try? We also did something similar where we've switched over to using the fq scheduler everywhere for now. We believe the bug is in the nolock code which only pfifo_fast uses atm, but we've been unable to come up with a satisfactory solution. Besides, our testing is pretty complex, do you have a quick test to reproduce it? Unfortunately we don't have a simple test case either. Our current reproducer is complex as well, although it would seem like we should be able to come up with something where you have maybe 2 threads trying to send on the same tx queue running pfifo_fast every few hundred milliseconds and not much else/no other tx traffic on that queue. IIRC we believe the scenario is when one thread is in the process of dequeuing a packet while another is enqueuing, the enqueue-er (word? :)) sees the dequeue is in progress and so does not xmit the packet assuming the dequeue operation will take care of it. However b/c the dequeue is in the process of completing it doesn't and the newly enqueued packet stays in the qdisc until another packet is enqueued pushing both out. Given that we have a workaround with using fq or any other qdisc not named pfifo_fast this has gotten bumped down in priority for us. I would like to work on a reproducer at some point, but won't likely be for a few weeks :( Josh
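The suspected interleaving described above (enqueue-er sees the dequeue in progress and trusts it to flush, but the dequeue is already past its final empty check) can be modeled with a toy sketch. This is a deterministic illustration of the hypothesis, not the actual pfifo_fast/qdisc code; the `running` flag stands in for the qdisc's run-owner state:

```python
from collections import deque

class ToyNolockQdisc:
    """Toy model of the suspected NOLOCK race; 'running' marks a CPU
    currently draining the queue (roughly qdisc_run_begin/end)."""
    def __init__(self):
        self.queue = deque()
        self.running = False

    def enqueue_and_run(self, pkt, sent):
        self.queue.append(pkt)
        if self.running:
            # Another CPU owns the qdisc; trust it to flush our packet.
            return
        self.running = True
        while self.queue:
            sent.append(self.queue.popleft())
        self.running = False

sent = []
q = ToyNolockQdisc()

# Interleaving from the report: CPU A has drained the queue and is between
# its final empty check and clearing 'running' when CPU B enqueues.
q.running = True                 # CPU A: mid-dequeue, queue already empty
q.enqueue_and_run("pkt", sent)   # CPU B: sees running set, skips xmit
q.running = False                # CPU A: finishes without re-checking

print(sent)           # [] -- nothing was transmitted
print(list(q.queue))  # ['pkt'] -- stuck until the next enqueue pushes it out
```

The packet stays queued indefinitely unless a later enqueue restarts the dequeue loop, matching the observed behavior.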
[PATCH 5/9] x86/bpf: Support SIB byte generation
In preparation for using R12 indexing instructions in BPF JIT code, add support for generating the x86 SIB byte.

Signed-off-by: Josh Poimboeuf
---
 arch/x86/net/bpf_jit_comp.c | 69 ++++++++++++++++++++++++++++++-----------
 1 file changed, 54 insertions(+), 15 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 485692d4b163..e649f977f8e1 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -143,6 +143,12 @@ static bool is_axreg(u32 reg)
 	return reg == BPF_REG_0;
 }
 
+static bool is_sib_reg(u32 reg)
+{
+	/* R12 isn't used yet */
+	return false;
+}
+
 /* Add modifiers if 'reg' maps to x86-64 registers R8..R15 */
 static u8 add_1mod(u8 byte, u32 reg)
 {
@@ -779,10 +785,19 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
 		case BPF_ST | BPF_MEM | BPF_DW:
 			EMIT2(add_1mod(0x48, dst_reg), 0xC7);
 
-st:			if (is_imm8(insn->off))
-				EMIT2(add_1reg(0x40, dst_reg), insn->off);
+st:
+			if (is_imm8(insn->off))
+				EMIT1(add_1reg(0x40, dst_reg));
+			else
+				EMIT1(add_1reg(0x80, dst_reg));
+
+			if (is_sib_reg(dst_reg))
+				EMIT1(add_1reg(0x20, dst_reg));
+
+			if (is_imm8(insn->off))
+				EMIT1(insn->off);
 			else
-				EMIT1_off32(add_1reg(0x80, dst_reg), insn->off);
+				EMIT(insn->off, 4);
 
 			EMIT(imm32, bpf_size_to_x86_bytes(BPF_SIZE(insn->code)));
 			break;
@@ -811,11 +826,19 @@ st:	if (is_imm8(insn->off))
 			goto stx;
 		case BPF_STX | BPF_MEM | BPF_DW:
 			EMIT2(add_2mod(0x48, dst_reg, src_reg), 0x89);
-stx:		if (is_imm8(insn->off))
-				EMIT2(add_2reg(0x40, dst_reg, src_reg), insn->off);
+stx:
+			if (is_imm8(insn->off))
+				EMIT1(add_2reg(0x40, dst_reg, src_reg));
 			else
-				EMIT1_off32(add_2reg(0x80, dst_reg, src_reg),
-					    insn->off);
+				EMIT1(add_2reg(0x80, dst_reg, src_reg));
+
+			if (is_sib_reg(dst_reg))
+				EMIT1(add_1reg(0x20, dst_reg));
+
+			if (is_imm8(insn->off))
+				EMIT1(insn->off);
+			else
+				EMIT(insn->off, 4);
 			break;
 
 			/* LDX: dst_reg = *(u8*)(src_reg + off) */
@@ -837,16 +860,24 @@ stx:	if (is_imm8(insn->off))
 		case BPF_LDX | BPF_MEM | BPF_DW:
 			/* Emit 'mov rax, qword ptr [rax+0x14]' */
 			EMIT2(add_2mod(0x48, src_reg, dst_reg), 0x8B);
-ldx:		/*
+ldx:
+			/*
 			 * If insn->off == 0 we can save one extra byte, but
 			 * special case of x86 R13 which always needs an offset
 			 * is not worth the hassle
 			 */
 			if (is_imm8(insn->off))
-				EMIT2(add_2reg(0x40, src_reg, dst_reg), insn->off);
+				EMIT1(add_2reg(0x40, src_reg, dst_reg));
+			else
+				EMIT1(add_2reg(0x80, src_reg, dst_reg));
+
+			if (is_sib_reg(src_reg))
+				EMIT1(add_1reg(0x20, src_reg));
+
+			if (is_imm8(insn->off))
+				EMIT1(insn->off);
 			else
-				EMIT1_off32(add_2reg(0x80, src_reg, dst_reg),
-					    insn->off);
+				EMIT(insn->off, 4);
 			break;
 
 			/* STX XADD: lock *(u32*)(dst_reg + off) += src_reg */
@@ -859,11 +890,19 @@ stx:	if (is_imm8(insn->off))
 			goto xadd;
 		case BPF_STX | BPF_XADD | BPF_DW:
 			EMIT3(0xF0, add_2mod(0x48, dst_reg, src_reg), 0x01);
-xadd:		if (is_imm8(insn->off))
-				EMIT2(add_2reg(0x40, dst_reg, src_reg), insn->off);
+xadd:
+			if (is_imm8(insn->off))
+				EMIT1(add_2reg(0x40, dst_reg, src_reg));
+			else
+				EMIT1(add_2reg(0x80, dst_reg, src_reg));
+
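The encoding scheme the patch implements can be sketched outside the JIT. This is a toy model for illustration only (REX prefixes are ignored): with mod=01 (8-bit displacement), a base register whose low three bits are 100b (RSP/R12) forces a SIB byte, which is exactly what the patch's `EMIT1(add_1reg(0x20, reg))` produces (0x20 | (reg & 7) = 0x24 for R12 with no index):

```python
def modrm(mod, reg, rm):
    # ModRM byte: mod (2 bits) | reg (3 bits) | rm (3 bits)
    return (mod << 6) | ((reg & 7) << 3) | (rm & 7)

def encode_mem_operand(reg, base, off):
    """ModRM/SIB/disp8 bytes for a [base + off8] memory operand.

    Hypothetical helper mirroring the patch's emit order: ModRM first,
    then a SIB byte only when the base's low bits are 100b (RSP/R12).
    """
    out = [modrm(0b01, reg, base)]             # mod=01: disp8 follows
    if base & 7 == 0b100:                      # RSP/R12 require a SIB byte
        out.append((0b100 << 3) | (base & 7))  # scale=0, index=100 (none)
    out.append(off & 0xff)
    return out

RAX, R12 = 0, 12

# [rax + 0x14]: no SIB byte needed
print([hex(b) for b in encode_mem_operand(RAX, RAX, 0x14)])
# ['0x40', '0x14']

# [r12 + 0x14]: r12's low bits are 100b, so SIB byte 0x24 is emitted
print([hex(b) for b in encode_mem_operand(RAX, R12, 0x14)])
# ['0x44', '0x24', '0x14']
```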
[PATCH RESEND 4.9.y] net: check before dereferencing netdev_ops during busy poll
init_dummy_netdev() leaves its netdev_ops pointer zeroed. This leads to a NULL pointer dereference when sk_busy_loop fires against an iwlwifi wireless adapter and checks napi->dev->netdev_ops->ndo_busy_poll. Avoid this by ensuring napi->dev->netdev_ops is valid before following the pointer, avoiding the following panic when busy polling on a dummy netdev: BUG: unable to handle kernel NULL pointer dereference at 00c8 IP: [] sk_busy_loop+0x92/0x2f0 Call Trace: [] ? uart_write_room+0x74/0xf0 [] sock_poll+0x99/0xa0 [] do_sys_poll+0x2e2/0x520 [] ? get_page_from_freelist+0x3bc/0xa30 [] ? update_curr+0x62/0x140 [] ? __slab_free+0xa1/0x2a0 [] ? __slab_free+0xa1/0x2a0 [] ? skb_free_head+0x21/0x30 [] ? poll_initwait+0x50/0x50 [] ? kmem_cache_free+0x1c6/0x1e0 [] ? uart_write+0x124/0x1d0 [] ? remove_wait_queue+0x4d/0x60 [] ? __wake_up+0x44/0x50 [] ? tty_write_unlock+0x31/0x40 [] ? tty_ldisc_deref+0x16/0x20 [] ? tty_write+0x1e0/0x2f0 [] ? process_echoes+0x80/0x80 [] ? __vfs_write+0x2b/0x130 [] ? vfs_write+0x15a/0x1a0 [] SyS_poll+0x75/0x100 [] entry_SYSCALL_64_fastpath+0x24/0xcf Commit 79e7fff47b7b ("net: remove support for per driver ndo_busy_poll()") indirectly fixed this upstream in linux-4.11 by removing the offending pointer usage. No other users of napi->dev touch its netdev_ops. Fixes: ce6aea93f751 ("net: network drivers no longer need to implement ndo_busy_poll()") # 4.9.y Signed-off-by: Josh Elsasser Reviewed-by: Eric Dumazet Tested-by: Matteo Croce --- No changes since V2[1], resent as per discussiond on -stable[2]. I hope this is the correct way to send net fixes for older LTS releases, I'm going off of the latest netdev FAQ: For earlier stable releases, each stable branch maintainer is supposed to take care of them. If you find any patch is missing from an earlier stable branch, please notify sta...@vger.kernel.org with either a commit ID or a formal patch backported, and CC Dave and other relevant networking developers. 
[1]: https://patchwork.ozlabs.org/patch/884986/
[2]: https://lore.kernel.org/stable/CAGnkfhx3ykbEsW+=FtpMFWU=_vnie7rppywpwqa1s1hpmxj...@mail.gmail.com/

 net/core/dev.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 4e10bae5e3da..f693afe608d7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5083,7 +5083,10 @@ bool sk_busy_loop(struct sock *sk, int nonblock)
 		goto out;
 
 	/* Note: ndo_busy_poll method is optional in linux-4.5 */
-	busy_poll = napi->dev->netdev_ops->ndo_busy_poll;
+	if (napi->dev->netdev_ops)
+		busy_poll = napi->dev->netdev_ops->ndo_busy_poll;
+	else
+		busy_poll = NULL;
 
 	do {
 		rc = 0;
-- 
2.20.1
[PATCH 4.4.y] net: check before dereferencing netdev_ops during busy poll
init_dummy_netdev() leaves its netdev_ops pointer zeroed. This leads to a NULL pointer dereference when sk_busy_loop fires against an iwlwifi wireless adapter and checks napi->dev->netdev_ops->ndo_busy_poll. Avoid this by ensuring napi->dev->netdev_ops is valid before following the pointer, avoiding the following panic when busy polling on a dummy netdev: BUG: unable to handle kernel NULL pointer dereference at 00c8 IP: [] sk_busy_loop+0x92/0x2f0 Call Trace: [] ? uart_write_room+0x74/0xf0 [] sock_poll+0x99/0xa0 [] do_sys_poll+0x2e2/0x520 [] ? get_page_from_freelist+0x3bc/0xa30 [] ? update_curr+0x62/0x140 [] ? __slab_free+0xa1/0x2a0 [] ? __slab_free+0xa1/0x2a0 [] ? skb_free_head+0x21/0x30 [] ? poll_initwait+0x50/0x50 [] ? kmem_cache_free+0x1c6/0x1e0 [] ? uart_write+0x124/0x1d0 [] ? remove_wait_queue+0x4d/0x60 [] ? __wake_up+0x44/0x50 [] ? tty_write_unlock+0x31/0x40 [] ? tty_ldisc_deref+0x16/0x20 [] ? tty_write+0x1e0/0x2f0 [] ? process_echoes+0x80/0x80 [] ? __vfs_write+0x2b/0x130 [] ? vfs_write+0x15a/0x1a0 [] SyS_poll+0x75/0x100 [] entry_SYSCALL_64_fastpath+0x24/0xcf Commit 79e7fff47b7b ("net: remove support for per driver ndo_busy_poll()") indirectly fixed this upstream in linux-4.11 by removing the offending pointer usage. No other users of napi->dev touch its netdev_ops. Fixes: 8b80cda536ea ("net: rename include/net/ll_poll.h to include/net/busy_poll.h") # 4.4.y Signed-off-by: Josh Elsasser --- This is a straightforward backport of the 4.9.y fix[1] for this crash, which doesn't apply to the older LTS releases. Only build-tested on 4.4.y, as I don't have access to wireless hardware and firmware that runs on older LTS kernels. 
[1]: https://lore.kernel.org/stable/20190701234143.72631-1-jelsas...@appneta.com/T/#u

 include/net/busy_poll.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 1d67fb6b23a0..6d238506d49b 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -93,7 +93,7 @@ static inline bool sk_busy_loop(struct sock *sk, int nonblock)
 		goto out;

 	ops = napi->dev->netdev_ops;
-	if (!ops->ndo_busy_poll)
+	if (!ops || !ops->ndo_busy_poll)
 		goto out;

 	do {
-- 
2.20.1
Re: [PATCH RESEND 4.9.y] net: check before dereferencing netdev_ops during busy poll
On Jul 1, 2019, at 6:51 PM, David Miller wrote:

> I just tried to apply this with "git am" to the current v4.19 -stable
> branch and it failed.

This is only needed for the v4.9 stable kernel; ndo_busy_poll (and this NPE) went away in kernel 4.11. Sorry, I probably should have called that out more explicitly.
Re: [PATCH] rtl_nic: add firmware rtl8125a-3
On Mon, Aug 26, 2019 at 6:23 PM Heiner Kallweit wrote: > > This adds firmware rtl8125a-3 for Realtek's 2.5Gbps chip RTL8125. > > Signed-off-by: Heiner Kallweit > --- > Firmware file was provided by Realtek and they asked me to submit it. Can we get a Signed-off-by from someone at Realtek then? josh > The related extension to r8169 driver will be submitted in the next days. > --- > WHENCE| 3 +++ > rtl_nic/rtl8125a-3.fw | Bin 0 -> 3456 bytes > 2 files changed, 3 insertions(+) > create mode 100644 rtl_nic/rtl8125a-3.fw > > diff --git a/WHENCE b/WHENCE > index fb12924..dbec18a 100644 > --- a/WHENCE > +++ b/WHENCE > @@ -2906,6 +2906,9 @@ Version: 0.0.2 > File: rtl_nic/rtl8107e-2.fw > Version: 0.0.2 > > +File: rtl_nic/rtl8125a-3.fw > +Version: 0.0.1 > + > Licence: > * Copyright © 2011-2013, Realtek Semiconductor Corporation > * > diff --git a/rtl_nic/rtl8125a-3.fw b/rtl_nic/rtl8125a-3.fw > new file mode 100644 > index > ..fac635263f92e8d9734456b75932b2088edd5ef9 > GIT binary patch > literal 3456 > zcmb7G4@{Kj8Gr9M&hwp*MPE#webM6qsqQiuC(hXPi?+wDL#V(Z*4#K%FD{ > zd#PH+tTJbqZG?a`)*5G*m3F2zmN`7hx;2*IKiVu;;~Hwa_E@^5+Z^ooav$qyWa|!| > z{J!V^^FHtMe%~vE5S!~a<M6|pO=$6Z+-!F`d2|JjuT=?tX!6t > zo1eG1ysol-V@-KgU0(T?#@f6Efr9d!!2E*zz=G_mCu@CyK;goi!iD+b1yk6B2%b&6 > z7edS;xkzqO0?9-2l9EW0ltM}+rIFG}86+PmljJ95fhB~6Z~~0y3JYX}?Z^&0uqy1t > zny?FN!)}}nC*Ym12kkF<@t&E4_=v=ynF9AnDz2DmaE+uRv6r!*@!^m!6NmhMOzq8r > zcySgS;n{HZ&ViVjO+Em7C z8F2PIf~lcpz}RxchgOhZNnAy~1V! 
z;l`&BHaB5(q!u4F*5UicX4Dw$I > zZZG}Y@U*d?z6a>rPW?gZU!wjH^_^&crVIA-hauiRf}*mc@O*L%b>cXFD^8&Io|9br > zFS+(#A z<^?z-1MK-A9Fa>HXt;u_<`7nle1Pws`y;xZ4b$%$RvXt*Vtj-(#xP2a8yGS^#rwu* > z7&RlXNB)8`;|q+Lf8+C)SZDkL{T(+MYZPk@p$0naYDvhUdK;W-&~&K<5x2TvCa4ES > zJSq_Os=`o`>Ti(hqM4#RkyLfbOj8MwbamOxQ0|CNT`@D2E8 z9LCG0O+ej0lA~sw%T-;^=BaOn@)@g8tu_{^65~O&#t5oXW3d`Ciq!i?u^KfEWsf|f > z%8X@d%v{dr6>6QaQuTMNV*C=d)+lAYWhy1Kp7A%(ze4qPRH@`pHTfshkXfVRB2TgY > zO=`+Wt(q39Q=W61)m5XOc8&DkO5CPgp(bTNzg>y9p~8jDs$zJj5M4( > zG`vTNq`hitUz@`B_Nx)&fWp3Z<>))8?4g&GICDrH_jW38Z z6ix)=-SB%nIOXwTePI%gc_p@upI;Gdo~F;Tmw)5`y%^_39nX1Ki-J6~L1FKz7NSjv > z_`nkPz3?L$w%rohr-(vgBF1 z-4-4vre2o#VTHs{P{PCaz|dle45F5=t7KfRt3*ceh=r$#TAy7mkxtwHW#j)EHgmp) > zO)hO_GH>cb5{`=!(@#j)h_-jxgCneiD9Hcd;ix`Q>rX~yy2c{beS3_|m>9m87;BeK > z9^<~->L`kd5sZmZuw?QWNw>vliHU)j7&EQ4-f1m1#+c~6-j5U9k9~bn**#pV?$N|- > zq$Nr8T#$I{J?c3t2VJ-F9=Aj_^{tkE`!wz?XF~Vaag5Xw_4`&lQTP1kQQmM$_|B0( > zOkxdl@0Tb}k#O zV)t~$Ac`@GiSytUlb?qfh~`cEKhXXQ`gkQS+oQgJX8gniiK%+szlqBBQMQ+LjIoYA > z7Pea0V&QHJcUss?e1U!-enM;`#@W7FhmW$!&b34|Z>oiU3%_IGY6~}8_<=_875$5K > z-p<>elW^0nO-Xb$v)?-<5?*G%4-kJO5#)Zu+T&X8&rjZA4DRX1phN*@#Ln3Z5`&x> > zM^Cgz;x5{-ci}0VyQ8FT#^zi&IL{9H?xHU!(>!!e${wT4$GqBa3H>xiG*Va3d7hBl > z$-lAV)_T3WsTa}QwwT;{)^yFKtnb^bPxFJExzl=W&oejIewS6dj^AhH>tVh&*5~e` > zja9dg7?{KsZ_$^!dgjpQL9gf03g&nvnzMv6vp=S9DYVsnn`y<1qkR|cAF7hLwo&4_ > z$0b%#ug_6NWfb*$Il}q#a(!Ob6=ePMXrpB-v<#HJEP3C!v!5@ zAjU7*@~eyS)8C3a`2}PA^u1Eoi5NfK?_u^^&&MtM&ozI+{=2wFEidM}?R=ite~tg7 > zpYGYyoP*`0xugV=o@1RyFwcMDQ>Obe;jicCb=vBh2MhU4 z8^49!$+$sGAily`auZvb-$fkYEIElgD0dJaC)$bQXUsw`@urBL? z@#-RprR2w-mqwqPKV$B{bA8MiN1HEyEpap~@gZZ_*el!TTw^JF*(K5a0Q2#@rtsUg > zt6SovowDhaJ zME%~Yvas61t;E=S%Z@SO6V|;&h-h!3cbdD!=(z6g@jH#a_vpS&+;={={DQpi2 DeEY6F > > literal 0 > HcmV?d1 > > -- > 2.23.0 >
[PATCH iproute2-next] ss: add option to print socket information on one line
Multi-line output in ss makes it difficult to search for things with grep. This new option will make it easier to find sockets matching certain criteria with simple grep commands. Example without option: $ ss -emoitn State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 127.0.0.1:13265 127.0.0.1:36743 uid:1974 ino:48271 sk:1 <-> skmem:(r0,rb2227595,t0,tb2626560,f0,w0,o0,bl0,d0) ts sack reno wscale:7,7 rto:211 rtt:10.245/16.616 ato:40 mss:65483 cwnd:10 bytes_acked:41865496 bytes_received:21580440 segs_out:242496 segs_in:351446 data_segs_out:242495 data_segs_in:242495 send 511.3Mbps lastsnd:2383 lastrcv:2383 lastack:2342 pacing_rate 1022.6Mbps rcv_rtt:92427.6 rcv_space:43725 minrtt:0.007 Example with new option: $ ss -emoitnO StateRecv-Q Send-Q Local Address:PortPeer Address:Port ESTAB0 0 127.0.0.1:13265 127.0.0.1:36743 uid:1974 ino:48271 sk:1 <-> skmem:(r0,rb2227595,t0,tb2626560,f0,w0,o0,bl0,d0) ts sack reno wscale:7,7 rto:211 rtt:10.067/16.429 ato:40 mss:65483 pmtu:65535 rcvmss:536 advmss:65483 cwnd:10 bytes_sent:41868244 bytes_acked:41868244 bytes_received:21581866 segs_out:242512 segs_in:351469 data_segs_out:242511 data_segs_in:242511 send 520.4Mbps lastsnd:14355 lastrcv:14355 lastack:14314 pacing_rate 1040.7Mbps delivery_rate 74837.7Mbps delivered:242512 app_limited busy:1861946ms rcv_rtt:92427.6 rcv_space:43725 rcv_ssthresh:43690 minrtt:0.007 Signed-off-by: Josh Hunt --- man/man8/ss.8 | 3 +++ misc/ss.c | 51 +-- 2 files changed, 44 insertions(+), 10 deletions(-) diff --git a/man/man8/ss.8 b/man/man8/ss.8 index 03a3dcc6c9c3..dd5ebc96ec03 100644 --- a/man/man8/ss.8 +++ b/man/man8/ss.8 @@ -24,6 +24,9 @@ Output version information. .B \-H, \-\-no-header Suppress header line. .TP +.B \-O, \-\-one-line +Print each socket's data on a single line. +.TP .B \-n, \-\-numeric Do not try to resolve service names. 
.TP diff --git a/misc/ss.c b/misc/ss.c index 9cb3ee19e542..3a193db7d17f 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -121,6 +121,7 @@ static int follow_events; static int sctp_ino; static int show_tipcinfo; static int show_tos; +int oneline = 0; enum col_id { COL_NETID, @@ -3053,7 +3054,8 @@ static int inet_show_sock(struct nlmsghdr *nlh, } if (show_mem || (show_tcpinfo && s->type != IPPROTO_UDP)) { - out("\n\t"); + if (!oneline) + out("\n\t"); if (s->type == IPPROTO_SCTP) sctp_show_info(nlh, r, tb); else @@ -3973,7 +3975,10 @@ static int packet_show_sock(struct nlmsghdr *nlh, void *arg) if (show_details) { if (pinfo) { - out("\n\tver:%d", pinfo->pdi_version); + if (oneline) + out(" ver:%d", pinfo->pdi_version); + else + out("\n\tver:%d", pinfo->pdi_version); out(" cpy_thresh:%d", pinfo->pdi_copy_thresh); out(" flags( "); if (pinfo->pdi_flags & PDI_RUNNING) @@ -3991,19 +3996,28 @@ static int packet_show_sock(struct nlmsghdr *nlh, void *arg) out(" )"); } if (ring_rx) { - out("\n\tring_rx("); + if (oneline) + out(" ring_rx("); + else + out("\n\tring_rx("); packet_show_ring(ring_rx); out(")"); } if (ring_tx) { - out("\n\tring_tx("); + if (oneline) + out(" ring_tx("); + else + out("\n\tring_tx("); packet_show_ring(ring_tx); out(")"); } if (has_fanout) { uint16_t type = (fanout >> 16) & 0x; - out("\n\tfanout("); + if (oneline) + out(" fanout("); + else + out("\n\tfanout("); out("id:%d,", fanout & 0x); out("type:"); @@ -4032,7 +4046,10 @@ static int packet_show_sock(struct nlmsghdr *nlh, void *arg) int num = RTA_PAYLOAD(tb[PACKET_DIAG_FILTER]) /
Re: [PATCH iproute2-next] ss: add option to print socket information on one line
On 4/25/19 3:59 PM, Stephen Hemminger wrote: On Thu, 25 Apr 2019 17:21:48 -0400 Josh Hunt wrote: Multi-line output in ss makes it difficult to search for things with grep. This new option will make it easier to find sockets matching certain criteria with simple grep commands. Example without option: $ ss -emoitn State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 127.0.0.1:13265 127.0.0.1:36743 uid:1974 ino:48271 sk:1 <-> skmem:(r0,rb2227595,t0,tb2626560,f0,w0,o0,bl0,d0) ts sack reno wscale:7,7 rto:211 rtt:10.245/16.616 ato:40 mss:65483 cwnd:10 bytes_acked:41865496 bytes_received:21580440 segs_out:242496 segs_in:351446 data_segs_out:242495 data_segs_in:242495 send 511.3Mbps lastsnd:2383 lastrcv:2383 lastack:2342 pacing_rate 1022.6Mbps rcv_rtt:92427.6 rcv_space:43725 minrtt:0.007 Example with new option: $ ss -emoitnO StateRecv-Q Send-Q Local Address:PortPeer Address:Port ESTAB0 0 127.0.0.1:13265 127.0.0.1:36743 uid:1974 ino:48271 sk:1 <-> skmem:(r0,rb2227595,t0,tb2626560,f0,w0,o0,bl0,d0) ts sack reno wscale:7,7 rto:211 rtt:10.067/16.429 ato:40 mss:65483 pmtu:65535 rcvmss:536 advmss:65483 cwnd:10 bytes_sent:41868244 bytes_acked:41868244 bytes_received:21581866 segs_out:242512 segs_in:351469 data_segs_out:242511 data_segs_in:242511 send 520.4Mbps lastsnd:14355 lastrcv:14355 lastack:14314 pacing_rate 1040.7Mbps delivery_rate 74837.7Mbps delivered:242512 app_limited busy:1861946ms rcv_rtt:92427.6 rcv_space:43725 rcv_ssthresh:43690 minrtt:0.007 Signed-off-by: Josh Hunt I agree with this, but probably time to give ss a json output as well. OK sure. Do you want to take both in a series or will you take this now with a separate json patch later? Josh
Re: [net-next 01/12] i40e: replace switch-statement to speed-up retpoline-enabled builds
On Apr 29, 2019, at 12:16 PM, Jeff Kirsher wrote:

> From: Björn Töpel
>
> GCC will generate jump tables for switch-statements with more than 5
> case statements. An entry into the jump table is an indirect call,
> which means that for CONFIG_RETPOLINE builds, this is rather
> expensive.
>
> This commit replaces the switch-statement that acts on the XDP program
> result with an if-clause.

Apologies for the noise, but is this patch still required after the recent threshold bump[0] and later removal[1] of switch-case jump table generation when building with CONFIG_RETPOLINE?

[0]: https://lore.kernel.org/patchwork/patch/1044863/
[1]: https://lore.kernel.org/patchwork/patch/1054472/

If nothing else the commit message no longer seems accurate.

Regards,

--
Josh
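For context, the trade-off under discussion can be sketched in plain C. This is an illustration only, not the i40e code: GCC may lower a switch with enough cases to a jump table (an indirect branch, which retpolines make expensive), while the equivalent if-chain compiles to direct, predictable branches. Both functions compute the same result; only the generated control flow differs.

```c
#include <assert.h>

/* Hypothetical action codes, standing in for XDP program results. */
enum act { ACT_PASS, ACT_TX, ACT_REDIRECT, ACT_DROP, ACT_ABORTED, ACT_UNKNOWN };

/* Six cases: enough that GCC may emit a jump table for this switch. */
int handle_switch(enum act a)
{
	switch (a) {
	case ACT_PASS:		return 0;
	case ACT_TX:		return 1;
	case ACT_REDIRECT:	return 2;
	case ACT_DROP:		return 3;
	case ACT_ABORTED:	return 4;
	default:		return -1;
	}
}

/* Same semantics as an if-chain: only direct conditional branches. */
int handle_if(enum act a)
{
	if (a == ACT_PASS)
		return 0;
	if (a == ACT_TX)
		return 1;
	if (a == ACT_REDIRECT)
		return 2;
	if (a == ACT_DROP)
		return 3;
	if (a == ACT_ABORTED)
		return 4;
	return -1;
}
```

Whether the switch actually becomes a jump table depends on the compiler version and flags, which is exactly why the question above about newer GCC behavior matters.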
Re: [PATCH iproute2-next] ss: add option to print socket information on one line
On 4/30/19 11:30 AM, David Ahern wrote: On 4/25/19 3:21 PM, Josh Hunt wrote: @@ -4877,6 +4903,7 @@ static void _usage(FILE *dest) "\n" " -K, --kill forcibly close sockets, display what was closed\n" " -H, --no-header Suppress header line\n" +" -O, --one-line socket's data printed on a single line\n" "\n" " -A, --query=QUERY, --socket=QUERY\n" " QUERY := {all|inet|tcp|udp|raw|unix|unix_dgram|unix_stream|unix_seqpacket|packet|netlink|vsock_stream|vsock_dgram|tipc}[,QUERY]\n" @@ -5003,6 +5030,7 @@ static const struct option long_opts[] = { { "kill", 0, 0, 'K' }, { "no-header", 0, 0, 'H' }, { "xdp", 0, 0, OPT_XDPSOCK}, + { "one-line", 0, 0, 'O' }, shame 'o' can not be used for consistency with ip, but we can have both use 'oneline' as the long option without the '-'. Yeah, thanks David. I saw that the other tools use --oneline, so I'll send an update with that changed. Josh
Re: [PATCH iproute2-next] ss: add option to print socket information on one line
On 4/30/19 11:31 AM, Josh Hunt wrote: On 4/30/19 11:30 AM, David Ahern wrote: On 4/25/19 3:21 PM, Josh Hunt wrote: @@ -4877,6 +4903,7 @@ static void _usage(FILE *dest) "\n" " -K, --kill forcibly close sockets, display what was closed\n" " -H, --no-header Suppress header line\n" +" -O, --one-line socket's data printed on a single line\n" "\n" " -A, --query=QUERY, --socket=QUERY\n" " QUERY := {all|inet|tcp|udp|raw|unix|unix_dgram|unix_stream|unix_seqpacket|packet|netlink|vsock_stream|vsock_dgram|tipc}[,QUERY]\n" @@ -5003,6 +5030,7 @@ static const struct option long_opts[] = { { "kill", 0, 0, 'K' }, { "no-header", 0, 0, 'H' }, { "xdp", 0, 0, OPT_XDPSOCK}, + { "one-line", 0, 0, 'O' }, shame 'o' can not be used for consistency with ip, but we can have both use 'oneline' as the long option without the '-'. Yeah, thanks David. I saw that the other tools use --oneline, so I'll send an update with that changed. Actually, David can you clarify what you meant by "use 'oneline' as the long option without the '-'."? Thanks Josh
Re: [PATCH iproute2-next] ss: add option to print socket information on one line
On 4/30/19 12:41 PM, David Ahern wrote:
> On 4/30/19 12:55 PM, Josh Hunt wrote:
>> Actually, David can you clarify what you meant by "use 'oneline' as the
>> long option without the '-'."?
>
> for your patch: 1,$s/one-line/oneline/
>
> ip has -oneline which is most likely used as 'ip -o'. having ss with
> --one-line vs --oneline is at least consistent to the level it can be.

OK, that's what I thought you meant, but wanted to confirm. Thanks!

Josh
[PATCH v2] ss: add option to print socket information on one line
Multi-line output in ss makes it difficult to search for things with grep. This new option will make it easier to find sockets matching certain criteria with simple grep commands. Example without option: $ ss -emoitn State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 127.0.0.1:13265 127.0.0.1:36743 uid:1974 ino:48271 sk:1 <-> skmem:(r0,rb2227595,t0,tb2626560,f0,w0,o0,bl0,d0) ts sack reno wscale:7,7 rto:211 rtt:10.245/16.616 ato:40 mss:65483 cwnd:10 bytes_acked:41865496 bytes_received:21580440 segs_out:242496 segs_in:351446 data_segs_out:242495 data_segs_in:242495 send 511.3Mbps lastsnd:2383 lastrcv:2383 lastack:2342 pacing_rate 1022.6Mbps rcv_rtt:92427.6 rcv_space:43725 minrtt:0.007 Example with new option: $ ss -emoitnO StateRecv-Q Send-Q Local Address:PortPeer Address:Port ESTAB0 0 127.0.0.1:13265 127.0.0.1:36743 uid:1974 ino:48271 sk:1 <-> skmem:(r0,rb2227595,t0,tb2626560,f0,w0,o0,bl0,d0) ts sack reno wscale:7,7 rto:211 rtt:10.067/16.429 ato:40 mss:65483 pmtu:65535 rcvmss:536 advmss:65483 cwnd:10 bytes_sent:41868244 bytes_acked:41868244 bytes_received:21581866 segs_out:242512 segs_in:351469 data_segs_out:242511 data_segs_in:242511 send 520.4Mbps lastsnd:14355 lastrcv:14355 lastack:14314 pacing_rate 1040.7Mbps delivery_rate 74837.7Mbps delivered:242512 app_limited busy:1861946ms rcv_rtt:92427.6 rcv_space:43725 rcv_ssthresh:43690 minrtt:0.007 Signed-off-by: Josh Hunt --- v1 -> v2 * Update long option to --oneline to match other ip tools as per David. man/man8/ss.8 | 3 +++ misc/ss.c | 51 +-- 2 files changed, 44 insertions(+), 10 deletions(-) diff --git a/man/man8/ss.8 b/man/man8/ss.8 index 03a3dcc6c9c3..9054fab9be69 100644 --- a/man/man8/ss.8 +++ b/man/man8/ss.8 @@ -24,6 +24,9 @@ Output version information. .B \-H, \-\-no-header Suppress header line. .TP +.B \-O, \-\-oneline +Print each socket's data on a single line. +.TP .B \-n, \-\-numeric Do not try to resolve service names. 
.TP diff --git a/misc/ss.c b/misc/ss.c index 9cb3ee19e542..e8e7b62eb4a5 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -121,6 +121,7 @@ static int follow_events; static int sctp_ino; static int show_tipcinfo; static int show_tos; +int oneline = 0; enum col_id { COL_NETID, @@ -3053,7 +3054,8 @@ static int inet_show_sock(struct nlmsghdr *nlh, } if (show_mem || (show_tcpinfo && s->type != IPPROTO_UDP)) { - out("\n\t"); + if (!oneline) + out("\n\t"); if (s->type == IPPROTO_SCTP) sctp_show_info(nlh, r, tb); else @@ -3973,7 +3975,10 @@ static int packet_show_sock(struct nlmsghdr *nlh, void *arg) if (show_details) { if (pinfo) { - out("\n\tver:%d", pinfo->pdi_version); + if (oneline) + out(" ver:%d", pinfo->pdi_version); + else + out("\n\tver:%d", pinfo->pdi_version); out(" cpy_thresh:%d", pinfo->pdi_copy_thresh); out(" flags( "); if (pinfo->pdi_flags & PDI_RUNNING) @@ -3991,19 +3996,28 @@ static int packet_show_sock(struct nlmsghdr *nlh, void *arg) out(" )"); } if (ring_rx) { - out("\n\tring_rx("); + if (oneline) + out(" ring_rx("); + else + out("\n\tring_rx("); packet_show_ring(ring_rx); out(")"); } if (ring_tx) { - out("\n\tring_tx("); + if (oneline) + out(" ring_tx("); + else + out("\n\tring_tx("); packet_show_ring(ring_tx); out(")"); } if (has_fanout) { uint16_t type = (fanout >> 16) & 0x; - out("\n\tfanout("); + if (oneline) + out(" fanout("); + else + out("\n\tfanout("); out("id:%d,", fanout & 0x); out("type:"); @@ -4032,7 +4046,10 @@ static int packet_show_sock(struct nlmsghdr *nlh, void *arg)
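As a side note on the diff above: the oneline decision is open-coded as an if/else at every output site. A small helper could make that choice in one place. The sketch below is a hypothetical userspace reduction (out_block() is not part of the actual patch, and the real ss uses a more elaborate column-buffer out()):

```c
#include <assert.h>
#include <string.h>

/* Minimal stand-ins for ss's global flag and output buffer. */
static int oneline;
static char outbuf[256];

static void out(const char *s)
{
	/* Append to the buffer, leaving room for the terminating NUL. */
	strncat(outbuf, s, sizeof(outbuf) - strlen(outbuf) - 1);
}

/* One place that decides between " " (single-line mode) and "\n\t"
 * (the usual indented continuation), instead of an if/else per site. */
static void out_block(const char *label)
{
	out(oneline ? " " : "\n\t");
	out(label);
}
```

Each `if (oneline) out(" ...") else out("\n\t...")` site in the diff would collapse to a single `out_block("ring_rx(")`-style call.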
Re: [PATCH] can: check for null sk before deferencing it via the call to sock_net
On Fri, Sep 8, 2017 at 1:46 PM, Oliver Hartkopp wrote: > > > On 09/08/2017 05:02 PM, Colin King wrote: >> >> From: Colin Ian King >> >> The assignment of net via call sock_net will dereference sk. This >> is performed before a sanity null check on sk, so there could be >> a potential null dereference on the sock_net call if sk is null. >> Fix this by assigning net after the sk null check. Also replace >> the sk == NULL with the more usual !sk idiom. >> >> Detected by CoverityScan CID#1431862 ("Dereference before null check") >> >> Fixes: 384317ef4187 ("can: network namespace support for CAN_BCM >> protocol") >> Signed-off-by: Colin Ian King > > > Acked-by: Oliver Hartkopp I don't see this one queued up in the net or net-next trees. Did it fall through the cracks or did it get queued up elsewhere? Seems like it's a good candidate to get into 4.14? josh > > > Thanks Collin! > > >> --- >> net/can/bcm.c | 5 +++-- >> 1 file changed, 3 insertions(+), 2 deletions(-) >> >> diff --git a/net/can/bcm.c b/net/can/bcm.c >> index 47a8748d953a..a3791674b8ce 100644 >> --- a/net/can/bcm.c >> +++ b/net/can/bcm.c >> @@ -1493,13 +1493,14 @@ static int bcm_init(struct sock *sk) >> static int bcm_release(struct socket *sock) >> { >> struct sock *sk = sock->sk; >> - struct net *net = sock_net(sk); >> + struct net *net; >> struct bcm_sock *bo; >> struct bcm_op *op, *next; >> - if (sk == NULL) >> + if (!sk) >> return 0; >> + net = sock_net(sk); >> bo = bcm_sk(sk); >> /* remove bcm_ops, timer, rx_unregister(), etc. */ >> >
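The fix above boils down to an ordering rule: nothing that dereferences sk may run before the NULL check. A minimal userspace sketch of the same shape (the mock_* names are hypothetical stand-ins, not kernel API):

```c
#include <assert.h>
#include <stddef.h>

struct mock_net { int id; };
struct mock_sock { struct mock_net *net; };

/* Unconditionally dereferences its argument, like sock_net(sk). */
static struct mock_net *mock_net_of(struct mock_sock *sk)
{
	return sk->net;
}

/* Mirrors the fixed bcm_release(): check sk first, only then call the
 * helper that dereferences it. Initializing net at declaration time,
 * as the buggy version did, would dereference a NULL sk. */
int mock_release(struct mock_sock *sk)
{
	struct mock_net *net;

	if (!sk)
		return 0;

	net = mock_net_of(sk);
	return net->id;
}
```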
Re: [PATCH 0/3] Remove accidental VLA usage
On Wed, Mar 07, 2018 at 07:30:44PM -0800, Kees Cook wrote: > This series adds SIMPLE_MAX() to be used in places where a stack array > is actually fixed, but the compiler still warns about VLA usage due to > confusion caused by the safety checks in the max() macro. > > I'm sending these via -mm since that's where I've introduced SIMPLE_MAX(), > and they should all have no operational differences. What if we instead simplify the max() macro's type checking so that GCC can more easily fold the array size constants? The below patch seems to work: diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 3fd291503576..ec863726da29 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -782,42 +782,32 @@ ftrace_vprintk(const char *fmt, va_list ap) static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { } #endif /* CONFIG_TRACING */ -/* - * min()/max()/clamp() macros that also do - * strict type-checking.. See the - * "unnecessary" pointer comparison. - */ -#define __min(t1, t2, min1, min2, x, y) ({ \ - t1 min1 = (x); \ - t2 min2 = (y); \ - (void) (&min1 == &min2);\ - min1 < min2 ? min1 : min2; }) +extern long __error_incompatible_types_in_min_macro; +extern long __error_incompatible_types_in_max_macro; + +#define __min(t1, t2, x, y)\ + __builtin_choose_expr(__builtin_types_compatible_p(t1, t2), \ + (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\ + (t1)__error_incompatible_types_in_min_macro) /** * min - return minimum of two values of the same or compatible types * @x: first value * @y: second value */ -#define min(x, y) \ - __min(typeof(x), typeof(y), \ - __UNIQUE_ID(min1_), __UNIQUE_ID(min2_), \ - x, y) +#define min(x, y) __min(typeof(x), typeof(y), x, y)\ -#define __max(t1, t2, max1, max2, x, y) ({ \ - t1 max1 = (x); \ - t2 max2 = (y); \ - (void) (&max1 == &max2);\ - max1 > max2 ? max1 : max2; }) +#define __max(t1, t2, x, y)\ + __builtin_choose_expr(__builtin_types_compatible_p(t1, t2), \ + (t1)(x) > (t2)(y) ? 
(t1)(x) : (t2)(y),\ + (t1)__error_incompatible_types_in_max_macro) /** * max - return maximum of two values of the same or compatible types * @x: first value * @y: second value */ -#define max(x, y) \ - __max(typeof(x), typeof(y), \ - __UNIQUE_ID(max1_), __UNIQUE_ID(max2_), \ - x, y) +#define max(x, y) __max(typeof(x), typeof(y), x, y) /** * min3 - return minimum of three values @@ -869,10 +859,7 @@ static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { } * @x: first value * @y: second value */ -#define min_t(type, x, y) \ - __min(type, type, \ - __UNIQUE_ID(min1_), __UNIQUE_ID(min2_), \ - x, y) +#define min_t(type, x, y) __min(type, type, x, y) /** * max_t - return maximum of two values, using the specified type @@ -880,10 +867,7 @@ static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { } * @x: first value * @y: second value */ -#define max_t(type, x, y) \ - __max(type, type, \ - __UNIQUE_ID(min1_), __UNIQUE_ID(min2_), \ - x, y) +#define max_t(type, x, y) __max(type, type, x, y) \ /** * clamp_t - return a value clamped to a given range using a given type
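To see why this removes the false-positive VLA warnings, here is the same idea reduced to a hedged userspace demo (simple_max/MAX_CONST are illustrative names, not the proposed kernel macros): because __builtin_choose_expr() selects its branch at parse time and the chosen branch contains no statement expression, the result stays an integer constant expression and can size a true fixed-size array.

```c
#include <assert.h>

/* Referencing this symbol in the rejected branch turns a type mismatch
 * into a link error, loosely mimicking the patch's error externs. */
extern long __simple_max_type_mismatch;

#define simple_max(t1, t2, x, y)					\
	__builtin_choose_expr(__builtin_types_compatible_p(t1, t2),	\
		(t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y),			\
		__simple_max_type_mismatch)

#define MAX_CONST(x, y) simple_max(__typeof__(x), __typeof__(y), x, y)

int buf_size(void)
{
	/* Constant-folded by the compiler: a real fixed-size array,
	 * not a VLA, so -Wvla stays quiet. */
	char buf[MAX_CONST(16, 32)];

	return (int)sizeof(buf);
}
```

The old ({ t1 min1 = (x); ... }) form forced a statement expression, which can never be a constant expression, hence the spurious VLA diagnostics.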
Re: [PATCH 0/3] Remove accidental VLA usage
On Thu, Mar 08, 2018 at 10:02:01AM -0800, Kees Cook wrote: > On Thu, Mar 8, 2018 at 7:02 AM, Josh Poimboeuf wrote: > > On Wed, Mar 07, 2018 at 07:30:44PM -0800, Kees Cook wrote: > >> This series adds SIMPLE_MAX() to be used in places where a stack array > >> is actually fixed, but the compiler still warns about VLA usage due to > >> confusion caused by the safety checks in the max() macro. > >> > >> I'm sending these via -mm since that's where I've introduced SIMPLE_MAX(), > >> and they should all have no operational differences. > > > > What if we instead simplify the max() macro's type checking so that GCC > > can more easily fold the array size constants? The below patch seems to > > work: > > Oh magic! Very nice. I couldn't figure out how to do this when I > stared at it. Yes, let me respin. (I assume I can add your S-o-b?) I'm going to be traveling for a few days, so I bequeath the patch to you. You can add my SOB. I agree with Steve's suggestion to run it through 0-day. > > -Kees > > > > > diff --git a/include/linux/kernel.h b/include/linux/kernel.h > > index 3fd291503576..ec863726da29 100644 > > --- a/include/linux/kernel.h > > +++ b/include/linux/kernel.h > > @@ -782,42 +782,32 @@ ftrace_vprintk(const char *fmt, va_list ap) > > static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { } > > #endif /* CONFIG_TRACING */ > > > > -/* > > - * min()/max()/clamp() macros that also do > > - * strict type-checking.. See the > > - * "unnecessary" pointer comparison. > > - */ > > -#define __min(t1, t2, min1, min2, x, y) ({ \ > > - t1 min1 = (x); \ > > - t2 min2 = (y); \ > > - (void) (&min1 == &min2);\ > > - min1 < min2 ? min1 : min2; }) > > +extern long __error_incompatible_types_in_min_macro; > > +extern long __error_incompatible_types_in_max_macro; > > + > > +#define __min(t1, t2, x, y)\ > > + __builtin_choose_expr(__builtin_types_compatible_p(t1, t2), \ > > + (t1)(x) < (t2)(y) ? 
(t1)(x) : (t2)(y),\ > > + (t1)__error_incompatible_types_in_min_macro) > > > > /** > > * min - return minimum of two values of the same or compatible types > > * @x: first value > > * @y: second value > > */ > > -#define min(x, y) \ > > - __min(typeof(x), typeof(y), \ > > - __UNIQUE_ID(min1_), __UNIQUE_ID(min2_), \ > > - x, y) > > +#define min(x, y) __min(typeof(x), typeof(y), x, y)\ > > > > -#define __max(t1, t2, max1, max2, x, y) ({ \ > > - t1 max1 = (x); \ > > - t2 max2 = (y); \ > > - (void) (&max1 == &max2);\ > > - max1 > max2 ? max1 : max2; }) > > +#define __max(t1, t2, x, y)\ > > + __builtin_choose_expr(__builtin_types_compatible_p(t1, t2), \ > > + (t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y),\ > > + (t1)__error_incompatible_types_in_max_macro) > > > > /** > > * max - return maximum of two values of the same or compatible types > > * @x: first value > > * @y: second value > > */ > > -#define max(x, y) \ > > - __max(typeof(x), typeof(y), \ > > - __UNIQUE_ID(max1_), __UNIQUE_ID(max2_), \ > > - x, y) > > +#define max(x, y) __max(typeof(x), typeof(y), x, y) > > > > /** > > * min3 - return minimum of three values > > @@ -869,10 +859,7 @@ static inline void ftrace_dump(enum ftrace_dump_mode > > oops_dump_mode) { } > > * @x: first value > > * @y: second value > > */ > > -#define min_t(type, x, y) \ > > - __min(type, type, \ > > - __UNIQUE_ID(min1_), __UNIQUE_ID(min2_), \ > > - x, y) > > +#define min_t(type, x, y) __min(type, type, x, y) > > > > /** > > * max_t - return maximum of two values, using the specified type > > @@ -880,10 +867,7 @@ static inline void ftrace_dump(enum ftrace_dump_mode > > oops_dump_mode) { } > > * @x: first value > > * @y: second value > > */ > > -#define max_t(type, x, y) \ > > - __max(type, type, \ > > - __UNIQUE_ID(min1_), __UNIQUE_ID(min2_), \ > > - x, y) > > +#define max_t(type, x, y) __max(type, type, x, y) > > \ > > > > /** > > * clamp_t - return a value clamped to a given range using a given type > > > > -- > Kees Cook > Pixel Security -- 
Josh
[PATCH 1/1] net: check dev->reg_state before deref of napi netdev_ops
init_dummy_netdev() leaves its netdev_ops pointer zeroed. This leads to a NULL pointer dereference when sk_busy_loop fires against an iwlwifi wireless adapter and checks napi->dev->netdev_ops->ndo_busy_poll. Avoid this by ensuring that napi->dev is not a dummy device before dereferencing napi dev's netdev_ops, preventing the following panic: BUG: unable to handle kernel NULL pointer dereference at 00c8 IP: [] sk_busy_loop+0x92/0x2f0 Call Trace: [] ? uart_write_room+0x74/0xf0 [] sock_poll+0x99/0xa0 [] do_sys_poll+0x2e2/0x520 [] ? get_page_from_freelist+0x3bc/0xa30 [] ? update_curr+0x62/0x140 [] ? __slab_free+0xa1/0x2a0 [] ? __slab_free+0xa1/0x2a0 [] ? skb_free_head+0x21/0x30 [] ? poll_initwait+0x50/0x50 [] ? kmem_cache_free+0x1c6/0x1e0 [] ? uart_write+0x124/0x1d0 [] ? remove_wait_queue+0x4d/0x60 [] ? __wake_up+0x44/0x50 [] ? tty_write_unlock+0x31/0x40 [] ? tty_ldisc_deref+0x16/0x20 [] ? tty_write+0x1e0/0x2f0 [] ? process_echoes+0x80/0x80 [] ? __vfs_write+0x2b/0x130 [] ? vfs_write+0x15a/0x1a0 [] SyS_poll+0x75/0x100 [] entry_SYSCALL_64_fastpath+0x24/0xcf Commit 79e7fff47b7b ("net: remove support for per driver ndo_busy_poll()") indirectly fixed this upstream in linux-4.11 by removing the offending pointer usage. No other users of napi->dev touch its netdev_ops. Fixes: 060212928670 ("net: add low latency socket poll") Fixes: ce6aea93f751 ("net: network drivers no longer need to implement ndo_busy_poll()") - 4.9.y Signed-off-by: Josh Elsasser --- net/core/dev.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/net/core/dev.c b/net/core/dev.c index 8898618bf341..d0f67d544587 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -5042,7 +5042,10 @@ bool sk_busy_loop(struct sock *sk, int nonblock) goto out; /* Note: ndo_busy_poll method is optional in linux-4.5 */ - busy_poll = napi->dev->netdev_ops->ndo_busy_poll; + if (napi->dev->reg_state != NETREG_DUMMY) + busy_poll = napi->dev->netdev_ops->ndo_busy_poll; + else + busy_poll = NULL; do { rc = 0; -- 2.11.0
[PATCH 0/1] net: avoid a kernel panic during sk_busy_loop
Hi Dave,

I stumbled across a reproducible kernel panic while playing around with busy_poll on a Linux 4.9.86 kernel. There's an unfortunate interaction between init_dummy_netdev, which doesn't bother to fill in netdev_ops, and sk_busy_loop, which assumes netdev_ops is a valid pointer.

To reproduce on the device under test (DUT), I did:

  $ ip addr show dev wlan0
  8: wlan0: mtu 1500 qdisc mq [...]
      inet 172.16.122.6/23 brd 172.16.123.255 scope global wlan0
  $ sysctl -w net.core.busy_read=50
  $ nc -l 172.16.122.6 5001

Then transmitted some data to this socket from a second host:

  $ echo "foo" | nc 172.16.122.6 5001

The DUT immediately hits a kernel panic.

I've attached a patch that applies cleanly to the 4.9.87 stable release. This fix isn't necessary for net/net-next (ndo_busy_poll was removed in linux-4.11), but a further backport of this commit is likely required for any stable releases older than linux-4.5.

I hope this is the right way to raise something like this. I couldn't find a clear answer from the -stable and netdev howtos for bugs against features that no longer exist in mainline.

Thanks,
Josh
Re: [PATCH 1/1] qed: Add firmware 8.33.11.0
On Feb 28, Rahul Verma wrote:
> This patch adds a new qed firmware with fixes and added
> support for new features.
> - Support VLAN remove action in steering flow.
> - Optimized the FW flow and several bug fixes.
> - Allow VXLAN steering.
> - Supports T10DIF and SRQ.
> - Support for port redirection for RDMA bonding
>
> Signed-off-by: Rahul Verma
> ---
>  WHENCE | 1 +
>  qed/qed_init_values_zipped-8.33.11.0.bin | Bin 0 -> 852456 bytes
>  2 files changed, 1 insertion(+)
>  create mode 100755 qed/qed_init_values_zipped-8.33.11.0.bin

I had to fix up a small conflict in WHENCE, but no big deal. Applied and pushed out.

josh
Re: [PATCH linux-firmware] Mellanox: Add new mlxsw_spectrum firmware 13.1620.192
On Tue, Feb 27, 2018 at 3:51 AM, Tal Bar wrote:
> This new firmware contains:
> - Support for auto-neg disable mode
>
> Signed-off-by: Tal Bar
> ---
>  WHENCE | 1 +
>  mellanox/mlxsw_spectrum-13.1620.192.mfa2 | Bin 0 -> 1091848 bytes
>  2 files changed, 1 insertion(+)
>  create mode 100644 mellanox/mlxsw_spectrum-13.1620.192.mfa2

Applied and pushed out. Thanks.

josh
Re: [PATCH 1/1] net: check dev->reg_state before deref of napi netdev_ops
On Mon, Mar 12, 2018 at 4:17 PM, Cong Wang wrote:
> On Sun, Mar 11, 2018 at 12:22 PM, Josh Elsasser wrote:
>> init_dummy_netdev() leaves its netdev_ops pointer zeroed. This leads
>> to a NULL pointer dereference when sk_busy_loop fires against an iwlwifi
>> wireless adapter and checks napi->dev->netdev_ops->ndo_busy_poll.
>>
>> Avoid this by ensuring that napi->dev is not a dummy device before
>> dereferencing napi dev's netdev_ops, preventing the following panic:
>
> Hmm, how about just checking ->netdev_ops? Checking reg_state looks
> odd, although works.

Fair point. I was trying to differentiate between an unexpected NULL pointer and a dummy netdev, but I guess it was clever at the expense of readability. I'll push up a v2 that just does the obvious.
[PATCH 0/1] net: avoid a kernel panic during sk_busy_loop
V2: just check napi->dev->netdev_ops instead of getting clever with the netdev registration state.

Original cover letter:

Hi Dave,

I stumbled across a reproducible kernel panic while playing around with busy_poll on a Linux 4.9.86 kernel. There's an unfortunate interaction between init_dummy_netdev, which doesn't bother to fill in netdev_ops, and sk_busy_loop, which assumes netdev_ops is a valid pointer.

To reproduce on the device under test (DUT), I did:

  $ ip addr show dev wlan0
  8: wlan0: mtu 1500 qdisc mq [...]
      inet 172.16.122.6/23 brd 172.16.123.255 scope global wlan0
  $ sysctl -w net.core.busy_read=50
  $ nc -l 172.16.122.6 5001

Then transmitted some data to this socket from a second host:

  $ echo "foo" | nc 172.16.122.6 5001

The DUT immediately hits a kernel panic.

I've attached a patch that applies cleanly to the 4.9.87 stable release. This fix isn't necessary for net/net-next (ndo_busy_poll was removed in linux-4.11), but a further backport of this commit is likely required for any stable releases older than linux-4.5.

I hope this is the right way to raise something like this. I couldn't find a clear answer from the -stable and netdev howtos on how to handle bugs in features that no longer exist in mainline.

Thanks,
Josh
[PATCH v2 1/1] net: check before dereferencing netdev_ops during busy poll
init_dummy_netdev() leaves its netdev_ops pointer zeroed. This leads
to a NULL pointer dereference when sk_busy_loop fires against an iwlwifi
wireless adapter and checks napi->dev->netdev_ops->ndo_busy_poll.

Avoid this by ensuring napi->dev->netdev_ops is valid before following
the pointer, avoiding the following panic when busy polling on a dummy
netdev:

  BUG: unable to handle kernel NULL pointer dereference at 00c8
  IP: [] sk_busy_loop+0x92/0x2f0
  Call Trace:
   [] ? uart_write_room+0x74/0xf0
   [] sock_poll+0x99/0xa0
   [] do_sys_poll+0x2e2/0x520
   [] ? get_page_from_freelist+0x3bc/0xa30
   [] ? update_curr+0x62/0x140
   [] ? __slab_free+0xa1/0x2a0
   [] ? __slab_free+0xa1/0x2a0
   [] ? skb_free_head+0x21/0x30
   [] ? poll_initwait+0x50/0x50
   [] ? kmem_cache_free+0x1c6/0x1e0
   [] ? uart_write+0x124/0x1d0
   [] ? remove_wait_queue+0x4d/0x60
   [] ? __wake_up+0x44/0x50
   [] ? tty_write_unlock+0x31/0x40
   [] ? tty_ldisc_deref+0x16/0x20
   [] ? tty_write+0x1e0/0x2f0
   [] ? process_echoes+0x80/0x80
   [] ? __vfs_write+0x2b/0x130
   [] ? vfs_write+0x15a/0x1a0
   [] SyS_poll+0x75/0x100
   [] entry_SYSCALL_64_fastpath+0x24/0xcf

Commit 79e7fff47b7b ("net: remove support for per driver ndo_busy_poll()")
indirectly fixed this upstream in linux-4.11 by removing the offending
pointer usage. No other users of napi->dev touch its netdev_ops.

Fixes: 060212928670 ("net: add low latency socket poll")
Fixes: ce6aea93f751 ("net: network drivers no longer need to implement ndo_busy_poll()") - 4.9.y
Signed-off-by: Josh Elsasser
---
 net/core/dev.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 8898618bf341..1f50c131ed15 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5042,7 +5042,10 @@ bool sk_busy_loop(struct sock *sk, int nonblock)
 		goto out;

 	/* Note: ndo_busy_poll method is optional in linux-4.5 */
-	busy_poll = napi->dev->netdev_ops->ndo_busy_poll;
+	if (napi->dev->netdev_ops)
+		busy_poll = napi->dev->netdev_ops->ndo_busy_poll;
+	else
+		busy_poll = NULL;

 	do {
 		rc = 0;
-- 
2.11.0
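The shape of the fix can be mocked up in userspace. In the sketch below, struct ops and struct dev are illustrative stand-ins for net_device_ops and net_device (not the kernel structures): a dummy device leaves its ops table NULL, so the poll path must test the table pointer before chasing it for the optional busy-poll hook.

```c
#include <assert.h>
#include <stddef.h>

struct ops { int (*busy_poll)(void); };
struct dev { const struct ops *netdev_ops; };

/* A sample driver hook, standing in for a real ndo_busy_poll. */
static int sample_busy_poll(void)
{
	return 7;
}

static const struct ops sample_ops = { .busy_poll = sample_busy_poll };

/* Mirrors the patched sk_busy_loop(): check the ops table before
 * dereferencing it; a NULL hook means "fall back to the generic path",
 * represented here by returning -1. */
int poll_once(const struct dev *d)
{
	int (*busy_poll)(void);

	if (d->netdev_ops)
		busy_poll = d->netdev_ops->busy_poll;
	else
		busy_poll = NULL;	/* dummy device: no ops table at all */

	return busy_poll ? busy_poll() : -1;
}
```

Without the `if (d->netdev_ops)` guard, the dummy-device case is exactly the NULL dereference in the trace above.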
Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()
On Sat, Mar 17, 2018 at 01:07:32PM -0700, Kees Cook wrote: > On Sat, Mar 17, 2018 at 11:52 AM, Linus Torvalds > wrote: > > So the above is completely insane, bit there is actually a chance that > > using that completely crazy "x -> sizeof(char[x])" conversion actually > > helps, because it really does have a (very odd) evaluation-time > > change. sizeof() has to be evaluated as part of the constant > > expression evaluation, in ways that "__builtin_constant_p()" isn't > > specified to be done. > > > > But it is also definitely me grasping at straws. If that doesn't work > > for 4.4, there's nothing else I can possibly see. > > No luck! :( gcc 4.4 refuses to play along. And, hilariously, not only > does it not change the complaint about __builtin_choose_expr(), it > also thinks that's a VLA now. > > ./include/linux/mm.h: In function ‘get_mm_hiwater_rss’: > ./include/linux/mm.h:1567: warning: variable length array is used > ./include/linux/mm.h:1567: error: first argument to > ‘__builtin_choose_expr’ not a constant > > 6.8 is happy with it (of course). > > I do think the earlier version (without the > sizeof-hiding-builting_constant_p) provides a template for a > const_max() that both you and Rasmus would be happy with, though! I thought we were dropping support for 4.4 (for other reasons). Isn't it 4.6 we should be looking at? -- Josh
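For context, the property being fought over is that an array dimension must be an integer constant expression or the compiler emits a VLA. A heavily simplified stand-in for the const_max() under discussion is a plain ternary; this sketch deliberately leaves out the thread's real difficulty (rejecting non-constant arguments and the `__builtin_constant_p()`/`sizeof(char[x])` tricks), and all names here are invented:

```c
#include <assert.h>

/*
 * Simplified const-max: when both arguments are integer constant
 * expressions, the ternary result is too, so an array sized by it is
 * not a VLA. Note the classic double-evaluation hazard for arguments
 * with side effects -- part of why the kernel wants a fancier macro.
 */
#define CONST_MAX(a, b) ((a) > (b) ? (a) : (b))

static char buf[CONST_MAX(16, 32)]; /* constant size, not a VLA */

static int max_int(int a, int b)
{
    return CONST_MAX(a, b); /* also usable with runtime values */
}
```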
Re: WARNING: kernel stack frame pointer at ffff880156a5fea0 in bash:2103 has bad value 00007ffec7d87e50
On Tue, Sep 26, 2017 at 11:51:31PM +0200, Richard Weinberger wrote: > Alexei, > > CC'ing Josh and Ingo. > > Am Dienstag, 26. September 2017, 06:09:02 CEST schrieb Alexei Starovoitov: > > On Mon, Sep 25, 2017 at 11:23:31PM +0200, Richard Weinberger wrote: > > > Hi! > > > > > > While playing with bcc's opensnoop tool on Linux 4.14-rc2 I managed to > > > trigger this splat: > > > > > > [ 297.629773] WARNING: kernel stack frame pointer at 880156a5fea0 in > > > bash:2103 has bad value 7ffec7d87e50 > > > [ 297.629777] unwind stack type:0 next_sp: (null) mask:0x6 > > > graph_idx:0 > > > [ 297.629783] 88015b207ae0: 88015b207b68 (0x88015b207b68) > > > [ 297.629790] 88015b207ae8: b163c00e > > > (__save_stack_trace+0x6e/ > > > 0xd0) > > > [ 297.629792] 88015b207af0: ... > > > [ 297.629795] 88015b207af8: 880156a58000 (0x880156a58000) > > > [ 297.629799] 88015b207b00: 880156a6 (0x880156a6) > > > [ 297.629800] 88015b207b08: ... > > > [ 297.629803] 88015b207b10: 0006 (0x6) > > > [ 297.629806] 88015b207b18: 880151b02700 (0x880151b02700) > > > [ 297.629809] 88015b207b20: 0101 (0x101) > > > [ 297.629812] 88015b207b28: 880156a5fea0 (0x880156a5fea0) > > > [ 297.629815] 88015b207b30: 88015b207ae0 (0x88015b207ae0) > > > [ 297.629818] 88015b207b38: c0050282 (0xc0050282) > > > [ 297.629819] 88015b207b40: ... 
> > > [ 297.629822] 88015b207b48: 0100 (0x100) > > > [ 297.629825] 88015b207b50: 880157b98280 (0x880157b98280) > > > [ 297.629828] 88015b207b58: 880157b98380 (0x880157b98380) > > > [ 297.629831] 88015b207b60: 88015ad2b500 (0x88015ad2b500) > > > [ 297.629834] 88015b207b68: 88015b207b78 (0x88015b207b78) > > > [ 297.629838] 88015b207b70: b163c086 > > > (save_stack_trace+0x16/0x20) [ 297.629841] 88015b207b78: > > > 88015b207da8 (0x88015b207da8) [ 297.629847] 88015b207b80: > > > b18a8ed6 (save_stack+0x46/0xd0) [ 297.629850] 88015b207b88: > > > 004c (0x4c) > > > [ 297.629852] 88015b207b90: 88015b207ba0 (0x88015b207ba0) > > > [ 297.629855] 88015b207b98: 8801 (0x8801) > > > [ 297.629859] 88015b207ba0: b163c086 > > > (save_stack_trace+0x16/0x20) [ 297.629864] 88015b207ba8: > > > b18a8ed6 (save_stack+0x46/0xd0) [ 297.629868] 88015b207bb0: > > > b18a9752 (kasan_slab_free+0x72/0xc0) > > Thanks for the report! > > I'm not sure I understand what's going on here. > > It seems you have kasan enabled and it's trying to do save_stack() > > and something crashing? > > I don't see any bpf related helpers in the stack trace. > > Which architecture is this? and .config ? > > Is bpf jit enabled? If so, make sure that net.core.bpf_jit_kallsyms=1 > > I found some time to dig a little further. > It seems to happen only when CONFIG_DEBUG_SPINLOCK is enabled, please see the > attached .config. The JIT is off. > KAsan is also not involved at all, the regular stack saving machinery from > the > trace framework initiates the stack unwinder. > > The issue arises as soon as in pre_handler_kretprobe() > raw_spin_lock_irqsave() > is being called. > It happens on all releases that have commit c32c47c68a0a ("x86/unwind: Warn > on > bad frame pointer"). > Interestingly it does not happen when I run > samples/kprobes/kretprobe_example.ko. So, BPF must be involved somehow. > > Here is another variant of the warning, it matches the attached .config: I can take a look at it. 
Unfortunately, for these types of issues I often need the vmlinux file to be able to make sense of the unwinder dump. So if you happen to have somewhere to copy the vmlinux to, that would be helpful. Or if you give me your GCC version I can try to rebuild it locally. -- Josh
Re: WARNING: kernel stack frame pointer at ffff880156a5fea0 in bash:2103 has bad value 00007ffec7d87e50
On Wed, Sep 27, 2017 at 08:51:22AM +0200, Richard Weinberger wrote:
> Am Mittwoch, 27. September 2017, 00:42:46 CEST schrieb Josh Poimboeuf:
> > > Here is another variant of the warning, it matches the attached .config:
> > I can take a look at it. Unfortunately, for these types of issues I
> > often need the vmlinux file to be able to make sense of the unwinder
> > dump. So if you happen to have somewhere to copy the vmlinux to, that
> > would be helpful. Or if you give me your GCC version I can try to
> > rebuild it locally.
>
> There you go:
> http://git.infradead.org/~rw/bpf_splat/vmlinux.xz

Thanks. Can you test this fix?

diff --git a/arch/x86/kernel/kprobes/common.h b/arch/x86/kernel/kprobes/common.h
index db2182d63ed0..3fc0f9a794cb 100644
--- a/arch/x86/kernel/kprobes/common.h
+++ b/arch/x86/kernel/kprobes/common.h
@@ -3,6 +3,15 @@
 
 /* Kprobes and Optprobes common header */
 
+#include
+
+#ifdef CONFIG_FRAME_POINTER
+# define SAVE_RBP_STRING " push %" _ASM_BP "\n" \
+			 " mov %" _ASM_SP ", %" _ASM_BP "\n"
+#else
+# define SAVE_RBP_STRING " push %" _ASM_BP "\n"
+#endif
+
 #ifdef CONFIG_X86_64
 #define SAVE_REGS_STRING \
 	/* Skip cs, ip, orig_ax. */ \
@@ -17,7 +26,7 @@
 	" pushq %r10\n" \
 	" pushq %r11\n" \
 	" pushq %rbx\n" \
-	" pushq %rbp\n" \
+	SAVE_RBP_STRING \
 	" pushq %r12\n" \
 	" pushq %r13\n" \
 	" pushq %r14\n" \
@@ -48,7 +57,7 @@
 	" pushl %es\n" \
 	" pushl %ds\n" \
 	" pushl %eax\n" \
-	" pushl %ebp\n" \
+	SAVE_RBP_STRING \
 	" pushl %edi\n" \
 	" pushl %esi\n" \
 	" pushl %edx\n" \
Re: [PATCH] rtlwifi: rtl8723de: Add firmware for new driver/device
On Tue, Jan 2, 2018 at 7:44 PM, Larry Finger wrote: > A driver for the RTL8723DE is nearing submission to staging. This commit > supplies > the firmware for it. > > Signed-off-by: Larry Finger > Cc: Ping-Ke Shih > --- > WHENCE | 9 + > rtlwifi/rtl8723defw.bin | Bin 0 -> 27726 bytes > 2 files changed, 9 insertions(+) > create mode 100644 rtlwifi/rtl8723defw.bin Applied and pushed out. Thanks. josh
Re: [PATCH 00/18] prevent bounds-check bypass via speculative execution
On Tue, Jan 09, 2018 at 11:44:05AM -0800, Dan Williams wrote: > On Tue, Jan 9, 2018 at 11:34 AM, Jiri Kosina wrote: > > On Fri, 5 Jan 2018, Dan Williams wrote: > > > > [ ... snip ... ] > >> Andi Kleen (1): > >> x86, barrier: stop speculation for failed access_ok > >> > >> Dan Williams (13): > >> x86: implement nospec_barrier() > >> [media] uvcvideo: prevent bounds-check bypass via speculative > >> execution > >> carl9170: prevent bounds-check bypass via speculative execution > >> p54: prevent bounds-check bypass via speculative execution > >> qla2xxx: prevent bounds-check bypass via speculative execution > >> cw1200: prevent bounds-check bypass via speculative execution > >> Thermal/int340x: prevent bounds-check bypass via speculative > >> execution > >> ipv6: prevent bounds-check bypass via speculative execution > >> ipv4: prevent bounds-check bypass via speculative execution > >> vfs, fdtable: prevent bounds-check bypass via speculative execution > >> net: mpls: prevent bounds-check bypass via speculative execution > >> udf: prevent bounds-check bypass via speculative execution > >> userns: prevent bounds-check bypass via speculative execution > >> > >> Mark Rutland (4): > >> asm-generic/barrier: add generic nospec helpers > >> Documentation: document nospec helpers > >> arm64: implement nospec_ptr() > >> arm: implement nospec_ptr() > > > > So considering the recent publication of [1], how come we all of a sudden > > don't need the barriers in ___bpf_prog_run(), namely for LD_IMM_DW and > > LDX_MEM_##SIZEOP, and something comparable for eBPF JIT? > > > > Is this going to be handled in eBPF in some other way? > > > > Without that in place, and considering Jann Horn's paper, it would seem > > like PTI doesn't really lock it down fully, right? > > Here is the latest (v3) bpf fix: > > https://patchwork.ozlabs.org/patch/856645/ > > I currently have v2 on my 'nospec' branch and will move that to v3 for > the next update, unless it goes upstream before then. 
That patch seems specific to CONFIG_BPF_SYSCALL. Is the bpf() syscall the only attack vector? Or are there other ways to run bpf programs that we should be worried about? -- Josh
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Fri, Jan 05, 2018 at 06:52:07PM -0800, Linus Torvalds wrote: > On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams wrote: > > From: Andi Kleen > > > > When access_ok fails we should always stop speculating. > > Add the required barriers to the x86 access_ok macro. > > Honestly, this seems completely bogus. > > The description is pure garbage afaik. > > The fact is, we have to stop speculating when access_ok() does *not* > fail - because that's when we'll actually do the access. And it's that > access that needs to be non-speculative. > > That actually seems to be what the code does (it stops speculation > when __range_not_ok() returns false, but access_ok() is > !__range_not_ok()). But the explanation is crap, and dangerous. The description also seems to be missing the "why", as it's not self-evident (to me, at least). Isn't this (access_ok/uaccess_begin/ASM_STAC) too early for the lfence? i.e., wouldn't the pattern be: get_user(uval, uptr); if (uval < array_size) { lfence(); foo = a2[a1[uval] * 256]; } Shouldn't the lfence come much later, *after* reading the variable and comparing it and branching accordingly? -- Josh
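The clamp-after-compare pattern Josh sketches here (barrier between the bounds check and the dependent load) is what the kernel later shipped as array_index_nospec(), which masks the index instead of relying only on lfence. A user-space sketch of the masking variant — illustrative only, not the kernel implementation:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch in the spirit of the kernel's later array_index_nospec():
 * after the architectural bounds check, force the index to 0 on any
 * (speculatively) out-of-bounds path via a branch-free mask.
 */
static size_t index_nospec(size_t index, size_t size)
{
    /* All-ones mask when index < size, all-zeroes otherwise. */
    size_t mask = (size_t)0 - (index < size);

    return index & mask;
}

static const int table[4] = { 10, 20, 30, 40 };

static int lookup(const int *array, size_t size, size_t index)
{
    if (index >= size)
        return -1; /* architectural bounds check */

    /* Even if this load were speculated past the check above, the
     * masked index cannot reach attacker-controlled offsets. */
    return array[index_nospec(index, size)];
}
```

The mask is computed from the same comparison the branch uses, so a mispredicted branch still sees a clamped index rather than the raw one.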
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Tue, Jan 09, 2018 at 01:47:09PM -0800, Dan Williams wrote: > On Tue, Jan 9, 2018 at 1:41 PM, Josh Poimboeuf wrote: > > On Fri, Jan 05, 2018 at 06:52:07PM -0800, Linus Torvalds wrote: > >> On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams > >> wrote: > >> > From: Andi Kleen > >> > > >> > When access_ok fails we should always stop speculating. > >> > Add the required barriers to the x86 access_ok macro. > >> > >> Honestly, this seems completely bogus. > >> > >> The description is pure garbage afaik. > >> > >> The fact is, we have to stop speculating when access_ok() does *not* > >> fail - because that's when we'll actually do the access. And it's that > >> access that needs to be non-speculative. > >> > >> That actually seems to be what the code does (it stops speculation > >> when __range_not_ok() returns false, but access_ok() is > >> !__range_not_ok()). But the explanation is crap, and dangerous. > > > > The description also seems to be missing the "why", as it's not > > self-evident (to me, at least). > > > > Isn't this (access_ok/uaccess_begin/ASM_STAC) too early for the lfence? > > > > i.e., wouldn't the pattern be: > > > > get_user(uval, uptr); > > if (uval < array_size) { > > lfence(); > > foo = a2[a1[uval] * 256]; > > } > > > > Shouldn't the lfence come much later, *after* reading the variable and > > comparing it and branching accordingly? > > The goal of putting the lfence in uaccess_begin() is to prevent > speculation past access_ok(). Right, but what's the purpose of preventing speculation past access_ok()? -- Josh
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Tue, Jan 09, 2018 at 01:59:04PM -0800, Dan Williams wrote: > > Right, but what's the purpose of preventing speculation past > > access_ok()? > > Caution. It's the same rationale for the nospec_array_ptr() patches. > If we, kernel community, can identify any possible speculation past a > bounds check we should inject a speculation mitigation. Unless there's > a way to be 100% certain that the first unwanted speculation can be > turned into a gadget later on in the instruction stream, err on the > side of shutting it down early. I'm all for being cautious. The nospec_array_ptr() patches are fine, and they make sense in light of the variant 1 CVE. But that still doesn't answer my question. I haven't seen *any* rationale for this patch. It would be helpful to at least describe what's being protected against, even if it's hypothetical. How can we review it if the commit log doesn't describe its purpose? -- Josh
Re: [PATCH v2 1/1] qed: Add firmware 8.37.2.0
On Mon, May 21, 2018 at 10:25 AM Rahul Verma wrote: > This patch add a new qed firmware with fixes and added > support for new features. > -Fix aRFS for tunneled traffic without inner IP. > -Fix chip configuration which may fail under heavy traffic conditions. > -Fix iSCSI recovery flow. > -Support receiving any-VNI in VXLAN and GENEVE RX classification. > -Support additional RoCE statistics for error cases. > -Support RoCE T10DIF. > v1->v2: > Added signoff for Ariel Elior. > Signed-off-by: Rahul Verma > Signed-off-by: Ariel Elior > --- > WHENCE | 1 + > qed/qed_init_values_zipped-8.37.2.0.bin | Bin 0 -> 867472 bytes > 2 files changed, 1 insertion(+) > create mode 100755 qed/qed_init_values_zipped-8.37.2.0.bin Applied and pushed out. josh
[PATCH] net: qmi_wwan: Add Netgear Aircard 779S
Add support for Netgear Aircard 779S

Signed-off-by: Josh Hill
---
 drivers/net/usb/qmi_wwan.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 42565dd..0946808 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -1103,6 +1103,7 @@ static const struct usb_device_id products[] = {
 	{QMI_FIXED_INTF(0x05c6, 0x920d, 5)},
 	{QMI_QUIRK_SET_DTR(0x05c6, 0x9625, 4)},	/* YUGA CLM920-NC5 */
 	{QMI_FIXED_INTF(0x0846, 0x68a2, 8)},
+	{QMI_FIXED_INTF(0x0846, 0x68d3, 8)},	/* Netgear Aircard 779S */
 	{QMI_FIXED_INTF(0x12d1, 0x140c, 1)},	/* Huawei E173 */
 	{QMI_FIXED_INTF(0x12d1, 0x14ac, 1)},	/* Huawei E1820 */
 	{QMI_FIXED_INTF(0x1435, 0xd181, 3)},	/* Wistron NeWeb D18Q1 */
-- 
2.7.4
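One-line ID additions like this work because the driver probes by walking a (vendor, product) table whose entry also records the QMI interface number. A toy version of that lookup — the structures are simplified stand-ins, not the real usb_device_id types:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a usb_device_id table entry. */
struct id_entry {
    unsigned short vendor;
    unsigned short product;
    int iface; /* QMI interface number, as in QMI_FIXED_INTF() */
};

static const struct id_entry products[] = {
    { 0x0846, 0x68a2, 8 },
    { 0x0846, 0x68d3, 8 }, /* Netgear Aircard 779S (the new entry) */
    { 0x12d1, 0x140c, 1 }, /* Huawei E173 */
};

/* Return the interface number for a device, or -1 if unsupported. */
static int match_iface(unsigned short vendor, unsigned short product)
{
    for (size_t i = 0; i < sizeof(products) / sizeof(products[0]); i++)
        if (products[i].vendor == vendor && products[i].product == product)
            return products[i].iface;
    return -1;
}
```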
Re: pull request Cavium liquidio vswitch firmware v1.7.2
On Wed, Jun 6, 2018 at 1:13 PM Felix Manlunas wrote: > > On Mon, May 21, 2018 at 10:39:20AM -0700, Felix Manlunas wrote: > > The following changes since commit 2a9b2cf50fb32e36e4fc1586c2f6f1421913b553: > > > > Merge branch 'for-upstreaming-v1.7.2' of > > https://github.com/felix-cavium/linux-firmware (2018-05-18 08:35:22 -0400) > > > > are available in the git repository at: > > > > https://github.com/felix-cavium/linux-firmware.git > > for-upstreaming-v1.7.2-vsw > > > > for you to fetch changes up to 0e193ca65d8b064502d61163597bf14eef81710f: > > > > linux-firmware: liquidio: update vswitch firmware to v1.7.2 (2018-05-19 > > 23:29:03 -0700) > > > > Signed-off-by: Manish Awasthi > > Signed-off-by: Felix Manlunas > > > > Felix Manlunas (1): > > linux-firmware: liquidio: update vswitch firmware to v1.7.2 > > > > WHENCE| 2 +- > > liquidio/lio_23xx_vsw.bin | Bin 19922416 -> 20434408 bytes > > 2 files changed, 1 insertion(+), 1 deletion(-) > > Hello Maintainers of linux-firmware.git, > > Any feedback about this submission? We sent it two weeks ago, but we > haven't heard anything. Thanks for the ping. I missed this one and then confused it with an older pull request with a similar subject line. I've pulled and pushed out now. josh
Re: [linux-firmware][pull request] ice: Add package file for Intel E800 series driver
On Mon, Sep 23, 2019 at 5:56 PM Jeff Kirsher wrote: > > From: Tony Nguyen > > The ice driver must load a package file to the firmware to utilize full > functionality; add the package file to /lib/firmware/intel/ice/ddp. Also > add a symlink, ice.pkg, so the driver can refer to the package by a > consistent name. > > Signed-off-by: Tony Nguyen > Tested-by: Andrew Bowers > Signed-off-by: Jeff Kirsher > --- > The following are changes since commit > 417a9c6e197a8d3eec792494efc87a2b42f76324: > amdgpu: add initial navi10 firmware > and are available in the git repository at: > git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/firmware dev-queue > > LICENSE.ice | 39 ++ > WHENCE| 9 > intel/ice/ddp/ice-1.3.4.0.pkg | Bin 0 -> 577796 bytes > intel/ice/ddp/ice.pkg | 1 + > 4 files changed, 49 insertions(+) > create mode 100644 LICENSE.ice > create mode 100644 intel/ice/ddp/ice-1.3.4.0.pkg > create mode 12 intel/ice/ddp/ice.pkg Thanks. I added another commit on top to remove the symlink and fix the WHENCE file because we merged a commit that changes how symlinks are handled. All applied and pushed out. josh
[PATCH 3/3] i40e: Add UDP segmentation offload support
Based on a series from Alexander Duyck this change adds UDP segmentation
offload support to the i40e driver.

CC: Alexander Duyck
CC: Willem de Bruijn
Signed-off-by: Josh Hunt
---
 drivers/net/ethernet/intel/i40e/i40e_main.c |  1 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 12 +++++++++---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 6031223eafab..56f8c52cbba1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -12911,6 +12911,7 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
 			  NETIF_F_GSO_IPXIP6		|
 			  NETIF_F_GSO_UDP_TUNNEL	|
 			  NETIF_F_GSO_UDP_TUNNEL_CSUM	|
+			  NETIF_F_GSO_UDP_L4		|
 			  NETIF_F_SCTP_CRC		|
 			  NETIF_F_RXHASH		|
 			  NETIF_F_RXCSUM		|
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e3f29dc8b290..0b32f04a6255 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2960,10 +2960,16 @@ static int i40e_tso(struct i40e_tx_buffer *first, u8 *hdr_len,
 	/* remove payload length from inner checksum */
 	paylen = skb->len - l4_offset;
-	csum_replace_by_diff(&l4.tcp->check, (__force __wsum)htonl(paylen));
-	/* compute length of segmentation header */
-	*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+	if (skb->csum_offset == offsetof(struct tcphdr, check)) {
+		csum_replace_by_diff(&l4.tcp->check, (__force __wsum)htonl(paylen));
+		/* compute length of segmentation header */
+		*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+	} else {
+		csum_replace_by_diff(&l4.udp->check, (__force __wsum)htonl(paylen));
+		/* compute length of segmentation header */
+		*hdr_len = sizeof(*l4.udp) + l4_offset;
+	}
 
 	/* pull values out of skb_shinfo */
 	gso_size = skb_shinfo(skb)->gso_size;
-- 
2.7.4
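The dispatch in the hunk keys off skb->csum_offset: the checksum field sits at byte offset 16 in a TCP header but offset 6 in a UDP header, and the segmentation-header length differs accordingly (TCP's is variable via doff, UDP's is a fixed 8 bytes). A user-space sketch of that distinction — the struct layouts mimic the RFC header formats, not the kernel's tcphdr/udphdr:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal RFC-style header layouts, stand-ins for tcphdr/udphdr. */
struct tcp_hdr {
    uint16_t source, dest;
    uint32_t seq, ack_seq;
    uint16_t flags_doff; /* data offset lives in the high bits */
    uint16_t window;
    uint16_t check;      /* byte offset 16 */
    uint16_t urg_ptr;
};

struct udp_hdr {
    uint16_t source, dest;
    uint16_t len;
    uint16_t check;      /* byte offset 6 */
};

/*
 * Segmentation-header length as computed in the patch: TCP uses
 * doff * 4 (options can extend the header), UDP is always
 * sizeof(struct udp_hdr) == 8 bytes, each added to the L4 offset.
 */
static size_t seg_hdr_len(size_t csum_offset, unsigned doff, size_t l4_offset)
{
    if (csum_offset == offsetof(struct tcp_hdr, check))
        return doff * 4 + l4_offset;
    return sizeof(struct udp_hdr) + l4_offset;
}
```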
[PATCH 1/3] igb: Add UDP segmentation offload support
Based on a series from Alexander Duyck this change adds UDP segmentation
offload support to the igb driver.

CC: Alexander Duyck
CC: Willem de Bruijn
Signed-off-by: Josh Hunt
---
 drivers/net/ethernet/intel/igb/e1000_82575.h |  1 +
 drivers/net/ethernet/intel/igb/igb_main.c    | 23 +++++++++++++++++------
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_82575.h b/drivers/net/ethernet/intel/igb/e1000_82575.h
index 6ad775b1a4c5..63ec253ac788 100644
--- a/drivers/net/ethernet/intel/igb/e1000_82575.h
+++ b/drivers/net/ethernet/intel/igb/e1000_82575.h
@@ -127,6 +127,7 @@ struct e1000_adv_tx_context_desc {
 };
 
 #define E1000_ADVTXD_MACLEN_SHIFT	9  /* Adv ctxt desc mac len shift */
+#define E1000_ADVTXD_TUCMD_L4T_UDP	0x /* L4 Packet TYPE of UDP */
 #define E1000_ADVTXD_TUCMD_IPV4	0x0400 /* IP Packet Type: 1=IPv4 */
 #define E1000_ADVTXD_TUCMD_L4T_TCP	0x0800 /* L4 Packet TYPE of TCP */
 #define E1000_ADVTXD_TUCMD_L4T_SCTP	0x1000 /* L4 packet TYPE of SCTP */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 105b0624081a..5eabfac5a18d 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2516,6 +2516,7 @@ igb_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(mac_hdr_len > IGB_MAX_MAC_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_HW_VLAN_CTAG_TX |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -2524,6 +2525,7 @@ igb_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(network_hdr_len > IGB_MAX_NETWORK_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -3120,7 +3122,7 @@ static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		NETIF_F_HW_CSUM;
 
 	if (hw->mac.type >= e1000_82576)
-		netdev->features |= NETIF_F_SCTP_CRC;
+		netdev->features |= NETIF_F_SCTP_CRC | NETIF_F_GSO_UDP_L4;
 
 	if (hw->mac.type >= e1000_i350)
 		netdev->features |= NETIF_F_HW_TC;
@@ -5694,6 +5696,7 @@ static int igb_tso(struct igb_ring *tx_ring,
 	} ip;
 	union {
 		struct tcphdr *tcp;
+		struct udphdr *udp;
 		unsigned char *hdr;
 	} l4;
 	u32 paylen, l4_offset;
@@ -5713,7 +5716,8 @@ static int igb_tso(struct igb_ring *tx_ring,
 	l4.hdr = skb_checksum_start(skb);
 
 	/* ADV DTYP TUCMD MKRLOC/ISCSIHEDLEN */
-	type_tucmd = E1000_ADVTXD_TUCMD_L4T_TCP;
+	type_tucmd = (skb->csum_offset == offsetof(struct tcphdr, check)) ?
+		     E1000_ADVTXD_TUCMD_L4T_TCP : E1000_ADVTXD_TUCMD_L4T_UDP;
 
 	/* initialize outer IP header fields */
 	if (ip.v4->version == 4) {
@@ -5741,12 +5745,19 @@ static int igb_tso(struct igb_ring *tx_ring,
 	/* determine offset of inner transport header */
 	l4_offset = l4.hdr - skb->data;
 
-	/* compute length of segmentation header */
-	*hdr_len = (l4.tcp->doff * 4) + l4_offset;
-
 	/* remove payload length from inner checksum */
 	paylen = skb->len - l4_offset;
-	csum_replace_by_diff(&l4.tcp->check, htonl(paylen));
+	if (type_tucmd & E1000_ADVTXD_TUCMD_L4T_TCP) {
+		/* compute length of segmentation header */
+		*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+		csum_replace_by_diff(&l4.tcp->check,
+				     (__force __wsum)htonl(paylen));
+	} else {
+		/* compute length of segmentation header */
+		*hdr_len = sizeof(*l4.udp) + l4_offset;
+		csum_replace_by_diff(&l4.udp->check,
+				     (__force __wsum)htonl(paylen));
+	}
 
 	/* update gso size and bytecount with header size */
 	first->gso_segs = skb_shinfo(skb)->gso_segs;
-- 
2.7.4
[PATCH 0/3] igb, ixgbe, i40e UDP segmentation offload support
Alexander Duyck posted a series in 2018 proposing adding UDP segmentation
offload support to ixgbe and ixgbevf, but those patches were never accepted:

https://lore.kernel.org/netdev/20180504003556.4769.11407.stgit@localhost.localdomain/

This series is a repost of his ixgbe patch along with a similar change to
the igb and i40e drivers. Testing using the udpgso_bench_tx benchmark shows
a noticeable performance improvement with these changes applied. All #s
below were run with:

udpgso_bench_tx -C 1 -4 -D 172.25.43.133 -z -l 30 -u -S 0 -s $pkt_size

igb::

SW GSO (ethtool -K eth0 tx-udp-segmentation off):
$pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
1472       120143.64   113   81263    81263   83.55  1.35
2944       120160.09   114   40638    40638   62.88  1.81
5888       120160.64   114   20319    20319   43.59  2.61
11776      120160.76   114   10160    10160   37.52  3.03
23552      120159.25   114   5080     5080    34.75  3.28
47104      120160.55   114   2540     2540    32.83  3.47
61824      120160.56   114   1935     1935    32.09  3.55

HW GSO offload (ethtool -K eth0 tx-udp-segmentation on):
$pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
1472       120144.65   113   81264    81264   83.03  1.36
2944       120161.56   114   40638    40638   41     2.78
5888       120160.23   114   20319    20319   23.76  4.79
11776      120161.16   114   10160    10160   15.82  7.20
23552      120156.45   114   5079     5079    12.8   8.90
47104      120159.33   114   2540     2540    8.82   12.92
61824      120158.43   114   1935     1935    8.24   13.83

ixgbe::

SW GSO:
$pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
1472       1070565.90  1015  724112   724112  100    10.15
2944       1201579.19  1140  406342   406342  95.69  11.91
5888       1201217.55  1140  203185   203185  55.38  20.58
11776      1201613.49  1140  101588   101588  42.15  27.04
23552      1201631.32  1140  50795    50795   35.97  31.69
47104      1201626.38  1140  25397    25397   33.51  34.01
61824      1201625.52  1140  19350    19350   32.83  34.72

HW GSO Offload:
$pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
1472       1058681.25  1004  715954   715954  100    10.04
2944       1201730.86  1134  404254   404254  61.28  18.50
5888       1201776.61  1131  201608   201608  30.25  37.38
11776      1201795.90  1130  100676   100676  16.63  67.94
23552      1201807.90  1129  50304    50304   10.07  112.11
47104      1201748.35  1128  25143    25143   6.8    165.88
61824      1200770.45  1128  19140    19140   5.38   209.66

i40e::

SW GSO:
$pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
1472       650122.83   616   439362   439362  100    6.16
2944       943993.53   895   319042   319042  100    8.95
5888       1199751.90  1138  202857   202857  82.51  13.79
11776      1200288.08  1139  101477   101477  64.34  17.70
23552      1201596.56  1140  50793    50793   59.74  19.08
47104      1201597.98  1140  25396    25396   56.31  20.24
61824      1201610.43  1140  19350    19350   55.48  20.54

HW GSO offload:
$pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
1472       657424.83   623   444653   444653  100    6.23
2944       1201242.87  1139  406226   406226  91.45  12.45
5888       1201739.95  1140  203199   203199  57.46  19.83
11776      1201557.36  1140  101584   101584  36.83  30.95
23552      1201525.17  1140  50790    50790   23.86  47.77
47104      1201514.54  1140  25394    25394   17.45  65.32
61824      1201478.91  1140  19348    19348   14.79  77.07

I was not sure how to properly attribute Alexander on the ixgbe patch so
please adjust this as necessary. Thanks!

Josh Hunt (3):
  igb: Add UDP segmentation offload support
  ixgbe: Add UDP segmentation offload support
  i40e: Add UDP segmentation offload support

 drivers/net/ethernet/intel/i40e/i40e_main.c  |  1 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c  | 12 +++++++++---
 drivers/net/ethernet/intel/igb/e1000_82575.h |  1 +
 drivers/net/ethernet/intel
[PATCH 2/3] ixgbe: Add UDP segmentation offload support
Repost from a series by Alexander Duyck to add UDP segmentation offload
support to the ixgbe driver:

https://lore.kernel.org/netdev/20180504003916.4769.66271.stgit@localhost.localdomain/

CC: Alexander Duyck
CC: Willem de Bruijn
Signed-off-by: Josh Hunt
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 1ce2397306b9..2b01d264e5ce 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7946,6 +7946,7 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 	} ip;
 	union {
 		struct tcphdr *tcp;
+		struct udphdr *udp;
 		unsigned char *hdr;
 	} l4;
 	u32 paylen, l4_offset;
@@ -7969,6 +7970,9 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 	l4.hdr = skb_checksum_start(skb);
 
 	/* ADV DTYP TUCMD MKRLOC/ISCSIHEDLEN */
+	type_tucmd = (skb->csum_offset == offsetof(struct tcphdr, check)) ?
+		     IXGBE_ADVTXD_TUCMD_L4T_TCP : IXGBE_ADVTXD_TUCMD_L4T_UDP;
+
 	type_tucmd = IXGBE_ADVTXD_TUCMD_L4T_TCP;
 
 	/* initialize outer IP header fields */
@@ -7999,12 +8003,20 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 	/* determine offset of inner transport header */
 	l4_offset = l4.hdr - skb->data;
 
-	/* compute length of segmentation header */
-	*hdr_len = (l4.tcp->doff * 4) + l4_offset;
-
 	/* remove payload length from inner checksum */
 	paylen = skb->len - l4_offset;
-	csum_replace_by_diff(&l4.tcp->check, (__force __wsum)htonl(paylen));
+
+	if (type_tucmd & IXGBE_ADVTXD_TUCMD_L4T_TCP) {
+		/* compute length of segmentation header */
+		*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+		csum_replace_by_diff(&l4.tcp->check,
+				     (__force __wsum)htonl(paylen));
+	} else {
+		/* compute length of segmentation header */
+		*hdr_len = sizeof(*l4.udp) + l4_offset;
+		csum_replace_by_diff(&l4.udp->check,
+				     (__force __wsum)htonl(paylen));
+	}
 
 	/* update gso size and bytecount with header size */
 	first->gso_segs = skb_shinfo(skb)->gso_segs;
@@ -10190,6 +10202,7 @@ ixgbe_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(mac_hdr_len > IXGBE_MAX_MAC_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_HW_VLAN_CTAG_TX |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -10198,6 +10211,7 @@ ixgbe_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(network_hdr_len > IXGBE_MAX_NETWORK_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -10907,7 +10921,7 @@ static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 				 IXGBE_GSO_PARTIAL_FEATURES;
 
 	if (hw->mac.type >= ixgbe_mac_82599EB)
-		netdev->features |= NETIF_F_SCTP_CRC;
+		netdev->features |= NETIF_F_SCTP_CRC | NETIF_F_GSO_UDP_L4;
 
 #ifdef CONFIG_IXGBE_IPSEC
 #define IXGBE_ESP_FEATURES	(NETIF_F_HW_ESP | \
-- 
2.7.4
Re: [PATCH 3/3] i40e: Add UDP segmentation offload support
On 10/9/19 5:39 PM, Samudrala, Sridhar wrote: On 10/9/2019 3:06 PM, Josh Hunt wrote: Based on a series from Alexander Duyck this change adds UDP segmentation offload support to the i40e driver. CC: Alexander Duyck CC: Willem de Bruijn Signed-off-by: Josh Hunt --- drivers/net/ethernet/intel/i40e/i40e_main.c | 1 + drivers/net/ethernet/intel/i40e/i40e_txrx.c | 12 +--- 2 files changed, 10 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 6031223eafab..56f8c52cbba1 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -12911,6 +12911,7 @@ static int i40e_config_netdev(struct i40e_vsi *vsi) NETIF_F_GSO_IPXIP6 | NETIF_F_GSO_UDP_TUNNEL | NETIF_F_GSO_UDP_TUNNEL_CSUM | + NETIF_F_GSO_UDP_L4 | NETIF_F_SCTP_CRC | NETIF_F_RXHASH | NETIF_F_RXCSUM | diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c index e3f29dc8b290..0b32f04a6255 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c @@ -2960,10 +2960,16 @@ static int i40e_tso(struct i40e_tx_buffer *first, u8 *hdr_len, /* remove payload length from inner checksum */ paylen = skb->len - l4_offset; - csum_replace_by_diff(&l4.tcp->check, (__force __wsum)htonl(paylen)); - /* compute length of segmentation header */ - *hdr_len = (l4.tcp->doff * 4) + l4_offset; + if (skb->csum_offset == offsetof(struct tcphdr, check)) { Isn't it more relevant to check for gso_type rather than base this on the csum_offset? Thanks Sridhar for the review. Yeah I think you're right. I will change this on all 3 patches. Josh
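Sridhar's suggestion — dispatch on the segmentation type the stack actually requested rather than inferring it from csum_offset — would look roughly like this. The flag values and names below are simplified stand-ins, not the kernel's SKB_GSO_* definitions:

```c
#include <assert.h>

/* Illustrative GSO type flags, stand-ins for the kernel's SKB_GSO_*. */
enum {
    GSO_TCPV4  = 1 << 0,
    GSO_TCPV6  = 1 << 4,
    GSO_UDP_L4 = 1 << 16,
};

/*
 * Keying off the gso_type bitmask states the intent directly: "this
 * skb is UDP GSO", instead of the indirect "its checksum field is at
 * the UDP offset".
 */
static int is_udp_seg(unsigned int gso_type)
{
    return (gso_type & GSO_UDP_L4) != 0;
}
```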
Re: [PATCH 2/3] ixgbe: Add UDP segmentation offload support
On 10/9/19 3:06 PM, Josh Hunt wrote: Repost from a series by Alexander Duyck to add UDP segmentation offload support to the igb driver: https://lore.kernel.org/netdev/20180504003916.4769.66271.stgit@localhost.localdomain/ CC: Alexander Duyck CC: Willem de Bruijn Signed-off-by: Josh Hunt --- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 24 +++- 1 file changed, 19 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 1ce2397306b9..2b01d264e5ce 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -7946,6 +7946,7 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring, } ip; union { struct tcphdr *tcp; + struct udphdr *udp; unsigned char *hdr; } l4; u32 paylen, l4_offset; @@ -7969,6 +7970,9 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring, l4.hdr = skb_checksum_start(skb); /* ADV DTYP TUCMD MKRLOC/ISCSIHEDLEN */ + type_tucmd = (skb->csum_offset == offsetof(struct tcphdr, check)) ? + IXGBE_ADVTXD_TUCMD_L4T_TCP : IXGBE_ADVTXD_TUCMD_L4T_UDP; + type_tucmd = IXGBE_ADVTXD_TUCMD_L4T_TCP; Copy/paste bug ^^. Will fix in v2. Josh
Re: [PATCH 0/3] igb, ixgbe, i40e UDP segmentation offload support
On 10/9/19 3:44 PM, Alexander Duyck wrote:
> On Wed, Oct 9, 2019 at 3:08 PM Josh Hunt wrote:
>> Alexander Duyck posted a series in 2018 proposing adding UDP segmentation
>> offload support to ixgbe and ixgbevf, but those patches were never
>> accepted:
>> https://lore.kernel.org/netdev/20180504003556.4769.11407.stgit@localhost.localdomain/
>>
>> This series is a repost of his ixgbe patch along with a similar change to
>> the igb and i40e drivers. Testing using the udpgso_bench_tx benchmark
>> shows a noticeable performance improvement with these changes applied.
>>
>> All #s below were run with:
>> udpgso_bench_tx -C 1 -4 -D 172.25.43.133 -z -l 30 -u -S 0 -s $pkt_size
>>
>> igb::
>> SW GSO (ethtool -K eth0 tx-udp-segmentation off):
>> $pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
>> 1472       120143.64   113   81263    81263   83.55  1.35
>> 2944       120160.09   114   40638    40638   62.88  1.81
>> 5888       120160.64   114   20319    20319   43.59  2.61
>> 11776      120160.76   114   10160    10160   37.52  3.03
>> 23552      120159.25   114   5080     5080    34.75  3.28
>> 47104      120160.55   114   2540     2540    32.83  3.47
>> 61824      120160.56   114   1935     1935    32.09  3.55
>>
>> HW GSO offload (ethtool -K eth0 tx-udp-segmentation on):
>> $pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
>> 1472       120144.65   113   81264    81264   83.03  1.36
>> 2944       120161.56   114   40638    40638   41     2.78
>> 5888       120160.23   114   20319    20319   23.76  4.79
>> 11776      120161.16   114   10160    10160   15.82  7.20
>> 23552      120156.45   114   5079     5079    12.8   8.90
>> 47104      120159.33   114   2540     2540    8.82   12.92
>> 61824      120158.43   114   1935     1935    8.24   13.83
>>
>> ixgbe::
>> SW GSO:
>> $pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
>> 1472       1070565.90  1015  724112   724112  100    10.15
>> 2944       1201579.19  1140  406342   406342  95.69  11.91
>> 5888       1201217.55  1140  203185   203185  55.38  20.58
>> 11776      1201613.49  1140  101588   101588  42.15  27.04
>> 23552      1201631.32  1140  50795    50795   35.97  31.69
>> 47104      1201626.38  1140  25397    25397   33.51  34.01
>> 61824      1201625.52  1140  19350    19350   32.83  34.72
>>
>> HW GSO Offload:
>> $pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
>> 1472       1058681.25  1004  715954   715954  100    10.04
>> 2944       1201730.86  1134  404254   404254  61.28  18.50
>> 5888       1201776.61  1131  201608   201608  30.25  37.38
>> 11776      1201795.90  1130  100676   100676  16.63  67.94
>> 23552      1201807.90  1129  50304    50304   10.07  112.11
>> 47104      1201748.35  1128  25143    25143   6.8    165.88
>> 61824      1200770.45  1128  19140    19140   5.38   209.66
>>
>> i40e::
>> SW GSO:
>> $pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
>> 1472       650122.83   616   439362   439362  100    6.16
>> 2944       943993.53   895   319042   319042  100    8.95
>> 5888       1199751.90  1138  202857   202857  82.51  13.79
>> 11776      1200288.08  1139  101477   101477  64.34  17.70
>> 23552      1201596.56  1140  50793    50793   59.74  19.08
>> 47104      1201597.98  1140  25396    25396   56.31  20.24
>> 61824      1201610.43  1140  19350    19350   55.48  20.54
>>
>> HW GSO offload:
>> $pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
>> 1472       657424.83   623   444653   444653  100    6.23
>> 2944       1201242.87  1139  406226   406226  91.45  12.45
>> 5888       1201739.95  1140  203199   203199  57.46  19.83
>> 11776      1201557.36  1140  101584   101584  36.83  30.95
>> 23552      1201525.17  1140  50790    50790   23.86  47.77
>> 47104      1201514.54  1140  25394    25394   17.45  65.32
>> 61824      1201478.91  1140  19348    19348   14.79  77.07
>>
>> I was not sure how to properly attribute Alexander on the ixgbe patch so
>> please adjust this as necessary.
>
> For the ixgbe patch I would be good with:
> Suggested-by: Alexander Duyck
>
> The big hurdle for this will be validation. I know that there are some
> parts such as the 82598 in the case of the ixgbe driver or 82575 in the
> case of igb that didn't support the feature
Re: [PATCH 0/3] igb, ixgbe, i40e UDP segmentation offload support
On 10/10/19 2:32 PM, Alexander Duyck wrote:
> On Thu, Oct 10, 2019 at 2:17 PM Josh Hunt wrote:
>> On 10/9/19 3:44 PM, Alexander Duyck wrote:
>>> On Wed, Oct 9, 2019 at 3:08 PM Josh Hunt wrote:
>>>> [cover letter and udpgso_bench_tx benchmark numbers snipped; quoted in
>>>> full in the earlier reply above]
>>>
>>> For the ixgbe patch I would be good with:
>>> Suggested-by: Alexander Duyck
>>>
>>> The big hurdle for this will be validation. I know that there are some
>>> parts such as the 82598
[PATCH v2 1/3] igb: Add UDP segmentation offload support
Based on a series from Alexander Duyck this change adds UDP segmentation
offload support to the igb driver.

CC: Alexander Duyck
CC: Willem de Bruijn
Signed-off-by: Josh Hunt
---
 drivers/net/ethernet/intel/igb/e1000_82575.h |  1 +
 drivers/net/ethernet/intel/igb/igb_main.c    | 23 +--
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_82575.h b/drivers/net/ethernet/intel/igb/e1000_82575.h
index 6ad775b1a4c5..63ec253ac788 100644
--- a/drivers/net/ethernet/intel/igb/e1000_82575.h
+++ b/drivers/net/ethernet/intel/igb/e1000_82575.h
@@ -127,6 +127,7 @@ struct e1000_adv_tx_context_desc {
 };

 #define E1000_ADVTXD_MACLEN_SHIFT    9      /* Adv ctxt desc mac len shift */
+#define E1000_ADVTXD_TUCMD_L4T_UDP   0x     /* L4 Packet TYPE of UDP */
 #define E1000_ADVTXD_TUCMD_IPV4      0x0400 /* IP Packet Type: 1=IPv4 */
 #define E1000_ADVTXD_TUCMD_L4T_TCP   0x0800 /* L4 Packet TYPE of TCP */
 #define E1000_ADVTXD_TUCMD_L4T_SCTP  0x1000 /* L4 packet TYPE of SCTP */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 105b0624081a..4131bc8b079e 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2516,6 +2516,7 @@ igb_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(mac_hdr_len > IGB_MAX_MAC_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_HW_VLAN_CTAG_TX |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -2524,6 +2525,7 @@ igb_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(network_hdr_len > IGB_MAX_NETWORK_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -3120,7 +3122,7 @@ static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		NETIF_F_HW_CSUM;

 	if (hw->mac.type >= e1000_82576)
-		netdev->features |= NETIF_F_SCTP_CRC;
+		netdev->features |= NETIF_F_SCTP_CRC | NETIF_F_GSO_UDP_L4;

 	if (hw->mac.type >= e1000_i350)
 		netdev->features |= NETIF_F_HW_TC;
@@ -5694,6 +5696,7 @@ static int igb_tso(struct igb_ring *tx_ring,
 	} ip;
 	union {
 		struct tcphdr *tcp;
+		struct udphdr *udp;
 		unsigned char *hdr;
 	} l4;
 	u32 paylen, l4_offset;
@@ -5713,7 +5716,8 @@ static int igb_tso(struct igb_ring *tx_ring,
 	l4.hdr = skb_checksum_start(skb);

 	/* ADV DTYP TUCMD MKRLOC/ISCSIHEDLEN */
-	type_tucmd = E1000_ADVTXD_TUCMD_L4T_TCP;
+	type_tucmd = (!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)) ?
+		     E1000_ADVTXD_TUCMD_L4T_TCP : E1000_ADVTXD_TUCMD_L4T_UDP;

 	/* initialize outer IP header fields */
 	if (ip.v4->version == 4) {
@@ -5741,12 +5745,19 @@ static int igb_tso(struct igb_ring *tx_ring,
 	/* determine offset of inner transport header */
 	l4_offset = l4.hdr - skb->data;

-	/* compute length of segmentation header */
-	*hdr_len = (l4.tcp->doff * 4) + l4_offset;
-
 	/* remove payload length from inner checksum */
 	paylen = skb->len - l4_offset;
-	csum_replace_by_diff(&l4.tcp->check, htonl(paylen));
+	if (type_tucmd & E1000_ADVTXD_TUCMD_L4T_TCP) {
+		/* compute length of segmentation header */
+		*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+		csum_replace_by_diff(&l4.tcp->check,
+				     (__force __wsum)htonl(paylen));
+	} else {
+		/* compute length of segmentation header */
+		*hdr_len = sizeof(*l4.udp) + l4_offset;
+		csum_replace_by_diff(&l4.udp->check,
+				     (__force __wsum)htonl(paylen));
+	}

 	/* update gso size and bytecount with header size */
 	first->gso_segs = skb_shinfo(skb)->gso_segs;
--
2.7.4
[PATCH v2 2/3] ixgbe: Add UDP segmentation offload support
Repost from a series by Alexander Duyck to add UDP segmentation offload
support to the igb driver:
https://lore.kernel.org/netdev/20180504003916.4769.66271.stgit@localhost.localdomain/

CC: Alexander Duyck
CC: Willem de Bruijn
Suggested-by: Alexander Duyck
Signed-off-by: Josh Hunt
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 1ce2397306b9..7d50c1a4a3be 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7946,6 +7946,7 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 	} ip;
 	union {
 		struct tcphdr *tcp;
+		struct udphdr *udp;
 		unsigned char *hdr;
 	} l4;
 	u32 paylen, l4_offset;
@@ -7969,7 +7970,8 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 	l4.hdr = skb_checksum_start(skb);

 	/* ADV DTYP TUCMD MKRLOC/ISCSIHEDLEN */
-	type_tucmd = IXGBE_ADVTXD_TUCMD_L4T_TCP;
+	type_tucmd = (!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)) ?
+		     IXGBE_ADVTXD_TUCMD_L4T_TCP : IXGBE_ADVTXD_TUCMD_L4T_UDP;

 	/* initialize outer IP header fields */
 	if (ip.v4->version == 4) {
@@ -7999,12 +8001,20 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 	/* determine offset of inner transport header */
 	l4_offset = l4.hdr - skb->data;

-	/* compute length of segmentation header */
-	*hdr_len = (l4.tcp->doff * 4) + l4_offset;
-
 	/* remove payload length from inner checksum */
 	paylen = skb->len - l4_offset;
+
+	if (type_tucmd & IXGBE_ADVTXD_TUCMD_L4T_TCP) {
+		/* compute length of segmentation header */
+		*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+		csum_replace_by_diff(&l4.tcp->check,
+				     (__force __wsum)htonl(paylen));
+	} else {
+		/* compute length of segmentation header */
+		*hdr_len = sizeof(*l4.udp) + l4_offset;
+		csum_replace_by_diff(&l4.udp->check,
+				     (__force __wsum)htonl(paylen));
+	}
-	csum_replace_by_diff(&l4.tcp->check, (__force __wsum)htonl(paylen));

 	/* update gso size and bytecount with header size */
 	first->gso_segs = skb_shinfo(skb)->gso_segs;
@@ -10190,6 +10200,7 @@ ixgbe_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(mac_hdr_len > IXGBE_MAX_MAC_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_HW_VLAN_CTAG_TX |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -10198,6 +10209,7 @@ ixgbe_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(network_hdr_len > IXGBE_MAX_NETWORK_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -10907,7 +10919,7 @@ static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		IXGBE_GSO_PARTIAL_FEATURES;

 	if (hw->mac.type >= ixgbe_mac_82599EB)
-		netdev->features |= NETIF_F_SCTP_CRC;
+		netdev->features |= NETIF_F_SCTP_CRC | NETIF_F_GSO_UDP_L4;

 #ifdef CONFIG_IXGBE_IPSEC
 #define IXGBE_ESP_FEATURES	(NETIF_F_HW_ESP | \
--
2.7.4
[PATCH v2 0/3] igb, ixgbe, i40e UDP segmentation offload support
[...] 91.67      1139  25394  25394  17.14  66.45
61824  1201598.88  1140  19350  19350  15.11  75.44

Josh Hunt (3):
  igb: Add UDP segmentation offload support
  ixgbe: Add UDP segmentation offload support
  i40e: Add UDP segmentation offload support

 drivers/net/ethernet/intel/i40e/i40e_main.c   |  1 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 12 +---
 drivers/net/ethernet/intel/igb/e1000_82575.h  |  1 +
 drivers/net/ethernet/intel/igb/igb_main.c     | 23 +--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 24 ++--
 5 files changed, 46 insertions(+), 15 deletions(-)

--
2.7.4
[PATCH v2 3/3] i40e: Add UDP segmentation offload support
Based on a series from Alexander Duyck this change adds UDP segmentation
offload support to the i40e driver.

CC: Alexander Duyck
CC: Willem de Bruijn
Signed-off-by: Josh Hunt
---
 drivers/net/ethernet/intel/i40e/i40e_main.c |  1 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 12 +---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 6031223eafab..56f8c52cbba1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -12911,6 +12911,7 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
 			  NETIF_F_GSO_IPXIP6		|
 			  NETIF_F_GSO_UDP_TUNNEL	|
 			  NETIF_F_GSO_UDP_TUNNEL_CSUM	|
+			  NETIF_F_GSO_UDP_L4		|
 			  NETIF_F_SCTP_CRC		|
 			  NETIF_F_RXHASH		|
 			  NETIF_F_RXCSUM		|
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e3f29dc8b290..2250284e27fd 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2960,10 +2960,16 @@ static int i40e_tso(struct i40e_tx_buffer *first, u8 *hdr_len,
 	/* remove payload length from inner checksum */
 	paylen = skb->len - l4_offset;
-	csum_replace_by_diff(&l4.tcp->check, (__force __wsum)htonl(paylen));
-	/* compute length of segmentation header */
-	*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+	if (!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)) {
+		csum_replace_by_diff(&l4.tcp->check, (__force __wsum)htonl(paylen));
+		/* compute length of segmentation header */
+		*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+	} else {
+		csum_replace_by_diff(&l4.udp->check, (__force __wsum)htonl(paylen));
+		/* compute length of segmentation header */
+		*hdr_len = sizeof(*l4.udp) + l4_offset;
+	}

 	/* pull values out of skb_shinfo */
 	gso_size = skb_shinfo(skb)->gso_size;
--
2.7.4
Re: [PATCH 0/3] igb, ixgbe, i40e UDP segmentation offload support
On 10/10/19 5:21 PM, Brown, Aaron F wrote:
> Adding Andrew as he is most likely going to be testing this patch.
> Unfortunately my mail server flags attached scripts as potential threats
> and strips them out. Can you resend it as a tar file? I don't believe
> it's smart enough to open up tar and flag it as a script.

Hi Aaron

It looks like the netdev archive has the file. Can you try grabbing it
from here?
https://lore.kernel.org/netdev/0e0e706c-4ce9-c27a-af55-339b4eb6d...@akamai.com/2-udpgso_bench.sh

If that doesn't work I can try your tar workaround.

Thanks
Josh
Re: [PATCH v2 1/3] igb: Add UDP segmentation offload support
On 10/11/19 8:29 AM, Alexander Duyck wrote:
> On Thu, 2019-10-10 at 20:25 -0400, Josh Hunt wrote:
>> Based on a series from Alexander Duyck this change adds UDP segmentation
>> offload support to the igb driver.
>>
>> [quoted patch trimmed to the hunk under discussion; the full v2 igb
>> patch appears earlier in this thread]
>>
>> @@ -5713,7 +5716,8 @@ static int igb_tso(struct igb_ring *tx_ring,
>>  	l4.hdr = skb_checksum_start(skb);
>>
>>  	/* ADV DTYP TUCMD MKRLOC/ISCSIHEDLEN */
>> -	type_tucmd = E1000_ADVTXD_TUCMD_L4T_TCP;
>> +	type_tucmd = (!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)) ?
>> +		     E1000_ADVTXD_TUCMD_L4T_TCP : E1000_ADVTXD_TUCMD_L4T_UDP;
>
> The logic here seems a bit convoluted. Instead of testing for
> '!SKB_GSO_UDP_L4' why not just make L4T_UDP the first option and drop
> the '!'? That will make the TCP offload the default case rather than the
> UDP offload. The same applies to the other 2 patches.

OK sure. I think I was trying to keep the TCP case first for some reason,
but agree that I should probably just reverse the order. Will send a v3.

Josh
[PATCH v3 3/3] i40e: Add UDP segmentation offload support
Based on a series from Alexander Duyck this change adds UDP segmentation
offload support to the i40e driver.

CC: Alexander Duyck
CC: Willem de Bruijn
Signed-off-by: Josh Hunt
---
 drivers/net/ethernet/intel/i40e/i40e_main.c |  1 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 12 +---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 6031223eafab..56f8c52cbba1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -12911,6 +12911,7 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
 			  NETIF_F_GSO_IPXIP6		|
 			  NETIF_F_GSO_UDP_TUNNEL	|
 			  NETIF_F_GSO_UDP_TUNNEL_CSUM	|
+			  NETIF_F_GSO_UDP_L4		|
 			  NETIF_F_SCTP_CRC		|
 			  NETIF_F_RXHASH		|
 			  NETIF_F_RXCSUM		|
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e3f29dc8b290..b8496037ef7f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2960,10 +2960,16 @@ static int i40e_tso(struct i40e_tx_buffer *first, u8 *hdr_len,
 	/* remove payload length from inner checksum */
 	paylen = skb->len - l4_offset;
-	csum_replace_by_diff(&l4.tcp->check, (__force __wsum)htonl(paylen));
-	/* compute length of segmentation header */
-	*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+	if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) {
+		csum_replace_by_diff(&l4.udp->check, (__force __wsum)htonl(paylen));
+		/* compute length of segmentation header */
+		*hdr_len = sizeof(*l4.udp) + l4_offset;
+	} else {
+		csum_replace_by_diff(&l4.tcp->check, (__force __wsum)htonl(paylen));
+		/* compute length of segmentation header */
+		*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+	}

 	/* pull values out of skb_shinfo */
 	gso_size = skb_shinfo(skb)->gso_size;
--
2.7.4
[PATCH v3 2/3] ixgbe: Add UDP segmentation offload support
Repost from a series by Alexander Duyck to add UDP segmentation offload
support to the igb driver:
https://lore.kernel.org/netdev/20180504003916.4769.66271.stgit@localhost.localdomain/

CC: Alexander Duyck
CC: Willem de Bruijn
Suggested-by: Alexander Duyck
Signed-off-by: Josh Hunt
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 1ce2397306b9..6c9edd272c7a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7946,6 +7946,7 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 	} ip;
 	union {
 		struct tcphdr *tcp;
+		struct udphdr *udp;
 		unsigned char *hdr;
 	} l4;
 	u32 paylen, l4_offset;
@@ -7969,7 +7970,8 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 	l4.hdr = skb_checksum_start(skb);

 	/* ADV DTYP TUCMD MKRLOC/ISCSIHEDLEN */
-	type_tucmd = IXGBE_ADVTXD_TUCMD_L4T_TCP;
+	type_tucmd = (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) ?
+		     IXGBE_ADVTXD_TUCMD_L4T_UDP : IXGBE_ADVTXD_TUCMD_L4T_TCP;

 	/* initialize outer IP header fields */
 	if (ip.v4->version == 4) {
@@ -7999,12 +8001,20 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 	/* determine offset of inner transport header */
 	l4_offset = l4.hdr - skb->data;

-	/* compute length of segmentation header */
-	*hdr_len = (l4.tcp->doff * 4) + l4_offset;
-
 	/* remove payload length from inner checksum */
 	paylen = skb->len - l4_offset;
-	csum_replace_by_diff(&l4.tcp->check, (__force __wsum)htonl(paylen));
+
+	if (type_tucmd & IXGBE_ADVTXD_TUCMD_L4T_TCP) {
+		/* compute length of segmentation header */
+		*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+		csum_replace_by_diff(&l4.tcp->check,
+				     (__force __wsum)htonl(paylen));
+	} else {
+		/* compute length of segmentation header */
+		*hdr_len = sizeof(*l4.udp) + l4_offset;
+		csum_replace_by_diff(&l4.udp->check,
+				     (__force __wsum)htonl(paylen));
+	}

 	/* update gso size and bytecount with header size */
 	first->gso_segs = skb_shinfo(skb)->gso_segs;
@@ -10190,6 +10200,7 @@ ixgbe_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(mac_hdr_len > IXGBE_MAX_MAC_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_HW_VLAN_CTAG_TX |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -10198,6 +10209,7 @@ ixgbe_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(network_hdr_len > IXGBE_MAX_NETWORK_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -10907,7 +10919,7 @@ static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		IXGBE_GSO_PARTIAL_FEATURES;

 	if (hw->mac.type >= ixgbe_mac_82599EB)
-		netdev->features |= NETIF_F_SCTP_CRC;
+		netdev->features |= NETIF_F_SCTP_CRC | NETIF_F_GSO_UDP_L4;

 #ifdef CONFIG_IXGBE_IPSEC
 #define IXGBE_ESP_FEATURES	(NETIF_F_HW_ESP | \
--
2.7.4
[PATCH v3 0/3] igb, ixgbe, i40e UDP segmentation offload support
[...] 1140  50794  50794  24.23  47.04
47104  1201491.67  1139  25394  25394  17.14  66.45
61824  1201598.88  1140  19350  19350  15.11  75.44

Josh Hunt (3):
  igb: Add UDP segmentation offload support
  ixgbe: Add UDP segmentation offload support
  i40e: Add UDP segmentation offload support

 drivers/net/ethernet/intel/i40e/i40e_main.c   |  1 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 12 +---
 drivers/net/ethernet/intel/igb/e1000_82575.h  |  1 +
 drivers/net/ethernet/intel/igb/igb_main.c     | 23 +--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 24 ++--
 5 files changed, 46 insertions(+), 15 deletions(-)

--
2.7.4
[PATCH v3 1/3] igb: Add UDP segmentation offload support
Based on a series from Alexander Duyck this change adds UDP segmentation
offload support to the igb driver.

CC: Alexander Duyck
CC: Willem de Bruijn
Signed-off-by: Josh Hunt
---
 drivers/net/ethernet/intel/igb/e1000_82575.h |  1 +
 drivers/net/ethernet/intel/igb/igb_main.c    | 23 +--
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_82575.h b/drivers/net/ethernet/intel/igb/e1000_82575.h
index 6ad775b1a4c5..63ec253ac788 100644
--- a/drivers/net/ethernet/intel/igb/e1000_82575.h
+++ b/drivers/net/ethernet/intel/igb/e1000_82575.h
@@ -127,6 +127,7 @@ struct e1000_adv_tx_context_desc {
 };

 #define E1000_ADVTXD_MACLEN_SHIFT    9      /* Adv ctxt desc mac len shift */
+#define E1000_ADVTXD_TUCMD_L4T_UDP   0x     /* L4 Packet TYPE of UDP */
 #define E1000_ADVTXD_TUCMD_IPV4      0x0400 /* IP Packet Type: 1=IPv4 */
 #define E1000_ADVTXD_TUCMD_L4T_TCP   0x0800 /* L4 Packet TYPE of TCP */
 #define E1000_ADVTXD_TUCMD_L4T_SCTP  0x1000 /* L4 packet TYPE of SCTP */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 105b0624081a..b65e8a063aef 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2516,6 +2516,7 @@ igb_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(mac_hdr_len > IGB_MAX_MAC_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_HW_VLAN_CTAG_TX |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -2524,6 +2525,7 @@ igb_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(network_hdr_len > IGB_MAX_NETWORK_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -3120,7 +3122,7 @@ static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		NETIF_F_HW_CSUM;

 	if (hw->mac.type >= e1000_82576)
-		netdev->features |= NETIF_F_SCTP_CRC;
+		netdev->features |= NETIF_F_SCTP_CRC | NETIF_F_GSO_UDP_L4;

 	if (hw->mac.type >= e1000_i350)
 		netdev->features |= NETIF_F_HW_TC;
@@ -5694,6 +5696,7 @@ static int igb_tso(struct igb_ring *tx_ring,
 	} ip;
 	union {
 		struct tcphdr *tcp;
+		struct udphdr *udp;
 		unsigned char *hdr;
 	} l4;
 	u32 paylen, l4_offset;
@@ -5713,7 +5716,8 @@ static int igb_tso(struct igb_ring *tx_ring,
 	l4.hdr = skb_checksum_start(skb);

 	/* ADV DTYP TUCMD MKRLOC/ISCSIHEDLEN */
-	type_tucmd = E1000_ADVTXD_TUCMD_L4T_TCP;
+	type_tucmd = (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) ?
+		     E1000_ADVTXD_TUCMD_L4T_UDP : E1000_ADVTXD_TUCMD_L4T_TCP;

 	/* initialize outer IP header fields */
 	if (ip.v4->version == 4) {
@@ -5741,12 +5745,19 @@ static int igb_tso(struct igb_ring *tx_ring,
 	/* determine offset of inner transport header */
 	l4_offset = l4.hdr - skb->data;

-	/* compute length of segmentation header */
-	*hdr_len = (l4.tcp->doff * 4) + l4_offset;
-
 	/* remove payload length from inner checksum */
 	paylen = skb->len - l4_offset;
-	csum_replace_by_diff(&l4.tcp->check, htonl(paylen));
+	if (type_tucmd & E1000_ADVTXD_TUCMD_L4T_TCP) {
+		/* compute length of segmentation header */
+		*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+		csum_replace_by_diff(&l4.tcp->check,
+				     (__force __wsum)htonl(paylen));
+	} else {
+		/* compute length of segmentation header */
+		*hdr_len = sizeof(*l4.udp) + l4_offset;
+		csum_replace_by_diff(&l4.udp->check,
+				     (__force __wsum)htonl(paylen));
+	}

 	/* update gso size and bytecount with header size */
 	first->gso_segs = skb_shinfo(skb)->gso_segs;
--
2.7.4
Re: [Intel-wired-lan] [PATCH v3 2/3] ixgbe: Add UDP segmentation offload support
On Thu, Oct 17, 2019 at 5:33 AM Bowers, AndrewX wrote:
> > -----Original Message-----
> > From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> > Behalf Of Josh Hunt
> > Sent: Friday, October 11, 2019 9:54 AM
> > To: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org; Kirsher, Jeffrey T
> > Cc: Duyck, Alexander H ; will...@google.com; Josh Hunt ; alexander.h.du...@linux.intel.com
> > Subject: [Intel-wired-lan] [PATCH v3 2/3] ixgbe: Add UDP segmentation offload support
> >
> > Repost from a series by Alexander Duyck to add UDP segmentation offload
> > support to the igb driver:
> > https://lore.kernel.org/netdev/20180504003916.4769.66271.stgit@localhost.localdomain/
> >
> > CC: Alexander Duyck
> > CC: Willem de Bruijn
> > Suggested-by: Alexander Duyck
> > Signed-off-by: Josh Hunt
> > ---
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 24 ++--
> >  1 file changed, 18 insertions(+), 6 deletions(-)
>
> Tested-by: Andrew Bowers

Andrew, thanks for testing. Were you able to validate the igb driver?

Thanks
--
Josh
Re: Crash when receiving FIN-ACK in TCP_FIN_WAIT1 state
On Mon, Oct 21, 2019 at 10:42 AM Subash Abhinov Kasiviswanathan wrote:
> > Please give us a pointer to the exact git tree and sha1.
> >
> > I do not analyze TCP stack problems without an exact starting point,
> > or at least a crystal ball, which I do not have.
>
> Hi Eric
>
> We are at this commit - Merge 4.19.75 into android-4.19-q.
> https://android.googlesource.com/kernel/common/+/cfe571e421ab873403a8f75413e04ecf5bf7e393
>
> --
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> a Linux Foundation Collaborative Project

We were investigating a problem in this area a few months back (it turned
out to be something else), but were wondering if retrans_out, lost_out,
and sacked_out should be cleared in tcp_rtx_queue_purge()? I meant to get
back to it, but it got buried and this thread just jogged my memory...

--
Josh
Re: [PATCH firmware] rtl_nic: add firmware files for RTL8153
On Tue, Oct 22, 2019 at 11:40 PM Hayes Wang wrote:
> This adds the firmware for Realtek RTL8153 based USB Ethernet adapters.
>
> 1. Fix compatible issue for Asmedia hub.
> 2. Fix compatible issue for Compal platform.
> 3. Fix sometimes the device is lost after rebooting.
> 4. Improve the compatibility for EEE.
>
> Signed-off-by: Hayes Wang
> ---
>  WHENCE                | 11 +++
>  rtl_nic/rtl8153a-2.fw | Bin 0 -> 1768 bytes
>  rtl_nic/rtl8153a-3.fw | Bin 0 -> 1440 bytes
>  rtl_nic/rtl8153a-4.fw | Bin 0 -> 712 bytes
>  rtl_nic/rtl8153b-2.fw | Bin 0 -> 1088 bytes
>  5 files changed, 11 insertions(+)
>  create mode 100644 rtl_nic/rtl8153a-2.fw
>  create mode 100644 rtl_nic/rtl8153a-3.fw
>  create mode 100644 rtl_nic/rtl8153a-4.fw
>  create mode 100644 rtl_nic/rtl8153b-2.fw

Applied and pushed out.

josh
Re: [PATCH net] ixgbe: recognize 1000BaseLX SFP modules as 1Gbps
Bjørn Mork wrote:
> Not that it matters much I guess, but I think LX SFPs were unsupported
> at that time. The LX support appears to have been added under the radar
> while refactoring ixgbe_setup_sfp_modules_X550em in commit e23f33367882
> ("ixgbe: Fix 1G and 10G link stability for X550EM_x SFP+")

Looks like you're right. Want me to respin with an additional "Fixes" tag?

- Josh
[PATCH net] ixgbe: recognize 1000BaseLX SFP modules as 1Gbps
Add the two 1000BaseLX enum values to the X550's check for 1Gbps modules, allowing the core driver code to establish a link over this SFP type. This is done by the out-of-tree driver but the fix wasn't in mainline.

Fixes: 6a14ee0cfb19 ("ixgbe: Add X550 support function pointers")
Signed-off-by: Josh Elsasser
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
index 10dbaf4f6e80..9c42f741ed5e 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
@@ -2262,7 +2262,9 @@ static s32 ixgbe_get_link_capabilities_X550em(struct ixgbe_hw *hw,
 	*autoneg = false;
 
 	if (hw->phy.sfp_type == ixgbe_sfp_type_1g_sx_core0 ||
-	    hw->phy.sfp_type == ixgbe_sfp_type_1g_sx_core1) {
+	    hw->phy.sfp_type == ixgbe_sfp_type_1g_sx_core1 ||
+	    hw->phy.sfp_type == ixgbe_sfp_type_1g_lx_core0 ||
+	    hw->phy.sfp_type == ixgbe_sfp_type_1g_lx_core1) {
 		*speed = IXGBE_LINK_SPEED_1GB_FULL;
 		return 0;
 	}
-- 
2.19.1
[PATCH net] rhashtable: avoid reschedule loop after rapid growth and shrink
When running workloads with large bursts of fragmented packets, we've seen a few machines stuck returning -EEXIST from rht_shrink() and endlessly rescheduling their hash table's deferred work, pegging a CPU core.

Root cause is commit da20420f83ea ("rhashtable: Add nested tables"), which stops ignoring the return code of rhashtable_shrink() and the reallocs used to grow the hashtable. This uncovers a bug in the shrink logic, where the "needs to shrink" check runs against the last table but the actual shrink operation runs on the first bucket_table in the hashtable (see below):

  +------+      +--------------+        +--------------+
  |  ht  |      | "first" tbl  |        |  "last" tbl  |
  | -tbl +----->| -future_tbl  +------->| -future_tbl  +---> NULL
  +------+      +--------------+        +--------------+
                       ^^^                     ^^^
          used by rhashtable_shrink()   used by rht_shrink_below_30()

A rehash then stalls out when the last table needs to shrink, the first table has more elements than the target size, and rht_shrink() hits a non-NULL future_tbl and returns -EEXIST. This skips the item rehashing and kicks off a reschedule loop, as no forward progress can be made while the rhashtable needs to shrink.

Extend rhashtable_shrink() with a "tbl" param to avoid endless exit-and-reschedules after hitting the EEXIST, allowing it to check a future_tbl pointer that can actually be non-NULL and make forward progress when the hashtable needs to shrink.

Fixes: da20420f83ea ("rhashtable: Add nested tables")
Signed-off-by: Josh Elsasser
---
 lib/rhashtable.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 852ffa5160f1..98e91f9544fa 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -377,9 +377,9 @@ static int rhashtable_rehash_alloc(struct rhashtable *ht,
  * It is valid to have concurrent insertions and deletions protected by per
  * bucket locks or concurrent RCU protected lookups and traversals.
  */
-static int rhashtable_shrink(struct rhashtable *ht)
+static int rhashtable_shrink(struct rhashtable *ht,
+			     struct bucket_table *old_tbl)
 {
-	struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht);
 	unsigned int nelems = atomic_read(&ht->nelems);
 	unsigned int size = 0;
 
@@ -412,7 +412,7 @@ static void rht_deferred_worker(struct work_struct *work)
 	if (rht_grow_above_75(ht, tbl))
 		err = rhashtable_rehash_alloc(ht, tbl, tbl->size * 2);
 	else if (ht->p.automatic_shrinking && rht_shrink_below_30(ht, tbl))
-		err = rhashtable_shrink(ht);
+		err = rhashtable_shrink(ht, tbl);
 	else if (tbl->nest)
 		err = rhashtable_rehash_alloc(ht, tbl, tbl->size);
-- 
2.19.1
Re: [v2 PATCH] rhashtable: Still do rehash when we get EEXIST
On Jan 23, 2019, at 7:08 PM, Herbert Xu wrote: > Thanks for catching this! > > Although I think we should fix this in a different way. The problem > here is that the shrink cannot proceed because there was a previous > rehash that is still incomplete. We should wait for its completion > and then reattempt a shrinnk should it still be necessary. > > So something like this: SGTM. I can't test this right now because our VM server's down after a power outage this evening, but I tried a similar patch that swallowed the -EEXIST err and even with that oversight the hashtable dodged the reschedule loop. - Josh
Re: Bug in MACSec - stops passing traffic after approx 5TB
I've got wpa_supplicant working with macsec on Fedora, my test bed has shuffled 16 billion packets so far without interruption. I am a bit concerned that I've just pushed the resource exhaustion issue down the road though, looking at the output of ip macsec show I see four SAs for TX and RX, it appears to negotiate a new pair every 3 to 3.5 billion packets. It doesn't appear to be ripping down old SAs. What happens when available SA slots run out? Joshua Coombs GWI office 207-494-2140 www.gwi.net On Mon, Oct 15, 2018 at 11:45 AM Josh Coombs wrote: > > And confirmed, starting with a high packet number results in a very > short testbed run, 296 packets and then nothing, just as you surmised. > Sorry for raising the alarm falsely. Looks like I need to roll my own > build of wpa_supplicant as the ubuntu builds don't include the macsec > driver, haven't tested Gentoo's ebuilds yet to see if they do. > > Josh Coombs > > On Sun, Oct 14, 2018 at 4:52 PM Josh Coombs wrote: > > > > On Sun, Oct 14, 2018 at 4:24 PM Sabrina Dubroca > > wrote: > > > > > > 2018-10-14, 10:59:31 -0400, Josh Coombs wrote: > > > > I initially mistook this for a traffic control issue, but after > > > > stripping the test beds down to just the MACSec component, I can still > > > > replicate the issue. After approximately 5TB of transfer / 4 billion > > > > packets over a MACSec link it stops passing traffic. > > > > > > I think you're just hitting packet number exhaustion. After 2^32 > > > packets, the packet number would wrap to 0 and start being reused, > > > which breaks the crypto used by macsec. Before this point, you have to > > > add a new SA, and tell the macsec device to switch to it. > > > > I had not considered that, I naively thought as long as I didn't > > specify a replay window, it'd roll the PN over on it's own and life > > would be good. I'll test that theory tomorrow, should be easy to > > prove out. > > > > > That's why you should be using wpa_supplicant. 
It will monitor the > > > growth of the packet number, and handle the rekey for you. > > > > Thank you for the heads up, I'll read up on this as well. > > > > Josh C
Re: Bug in MACSec - stops passing traffic after approx 5TB
I see it reusing SAs, so I'm good. Joshua Coombs On Wed, Oct 17, 2018 at 9:45 AM Josh Coombs wrote: > > I've got wpa_supplicant working with macsec on Fedora, my test bed has > shuffled 16 billion packets so far without interruption. I am a bit > concerned that I've just pushed the resource exhaustion issue down the > road though, looking at the output of ip macsec show I see four SAs > for TX and RX, it appears to negotiate a new pair every 3 to 3.5 > billion packets. It doesn't appear to be ripping down old SAs. What > happens when available SA slots run out? > > Joshua Coombs > GWI > > office 207-494-2140 > www.gwi.net > > On Mon, Oct 15, 2018 at 11:45 AM Josh Coombs wrote: > > > > And confirmed, starting with a high packet number results in a very > > short testbed run, 296 packets and then nothing, just as you surmised. > > Sorry for raising the alarm falsely. Looks like I need to roll my own > > build of wpa_supplicant as the ubuntu builds don't include the macsec > > driver, haven't tested Gentoo's ebuilds yet to see if they do. > > > > Josh Coombs > > > > On Sun, Oct 14, 2018 at 4:52 PM Josh Coombs wrote: > > > > > > On Sun, Oct 14, 2018 at 4:24 PM Sabrina Dubroca > > > wrote: > > > > > > > > 2018-10-14, 10:59:31 -0400, Josh Coombs wrote: > > > > > I initially mistook this for a traffic control issue, but after > > > > > stripping the test beds down to just the MACSec component, I can still > > > > > replicate the issue. After approximately 5TB of transfer / 4 billion > > > > > packets over a MACSec link it stops passing traffic. > > > > > > > > I think you're just hitting packet number exhaustion. After 2^32 > > > > packets, the packet number would wrap to 0 and start being reused, > > > > which breaks the crypto used by macsec. Before this point, you have to > > > > add a new SA, and tell the macsec device to switch to it. 
> > > > > > I had not considered that, I naively thought as long as I didn't > > > specify a replay window, it'd roll the PN over on it's own and life > > > would be good. I'll test that theory tomorrow, should be easy to > > > prove out. > > > > > > > That's why you should be using wpa_supplicant. It will monitor the > > > > growth of the packet number, and handle the rekey for you. > > > > > > Thank you for the heads up, I'll read up on this as well. > > > > > > Josh C
Re: [PATCH linux-firmware] linux-firmware: liquidio: fix GPL compliance issue
On Wed, Oct 17, 2018 at 5:09 PM John W. Linville wrote: > > On Wed, Oct 17, 2018 at 07:34:42PM +, Manlunas, Felix wrote: > > On Fri, Sep 28, 2018 at 04:50:51PM -0700, Felix Manlunas wrote: > > > Part of the code inside the lio_vsw_23xx.bin firmware image is under GPL, > > > but the LICENCE.cavium file neglects to indicate that. However, > > > LICENCE.cavium does correctly specify the license that covers the other > > > Cavium firmware images that do not contain any GPL code. > > > > > > Fix the GPL compliance issue by adding a new file, > > > LICENCE.cavium_liquidio, > > > which correctly shows the GPL boilerplate. This new file specifies the > > > licenses for all liquidio firmware, including the ones that do not have > > > GPL code. > > > > > > Change the liquidio section of WHENCE to point to LICENCE.cavium_liquidio. > > > > > > Reported-by: Florian Weimer > > > Signed-off-by: Manish Awasthi > > > Signed-off-by: Manoj Panicker > > > Signed-off-by: Faisal Masood > > > Signed-off-by: Felix Manlunas > > > --- > > > LICENCE.cavium_liquidio | 429 > > > > > > WHENCE | 2 +- > > > 2 files changed, 430 insertions(+), 1 deletion(-) > > > create mode 100644 LICENCE.cavium_liquidio > > > > Hello Maintainers of linux-firmware.git, > > > > Any feedback about this patch? > > I would prefer to see an offer that included a defined URL for anyone > to download the source for the kernel in question without having to > announce themselves. The "send an email to i...@cavium.com" offer may > (or may not) be sufficient for the letter of the law. But it seems > both fragile and prone to subjective frustrations and delays for > users to obtain the sources at some future date. I agree with John here, but I also believe the patch is better than the current text in the upstream repo. I've committed it and pushed it out. If there are improvements to be made on source availability, we can take those in a different patch. Thank you for taking this seriously and responding quickly. josh
Re: [v2 PATCH] rhashtable: Still do rehash when we get EEXIST
On Jan 23, 2019, at 7:40 PM, Josh Elsasser wrote: > On Jan 23, 2019, at 7:08 PM, Herbert Xu wrote: > >> Thanks for catching this! >> >> Although I think we should fix this in a different way. The problem >> here is that the shrink cannot proceed because there was a previous >> rehash that is still incomplete. We should wait for its completion >> and then reattempt a shrinnk should it still be necessary. > > I can't test this right now because our VM server's down Got one of the poor little reproducer VMs back up and running and loaded up this patch. Works like a charm. For the v2 patch, you can add my: Tested-by: Josh Elsasser
[PATCH net] net: set default network namespace in init_dummy_netdev()
Assign a default net namespace to netdevs created by init_dummy_netdev(). Fixes a NULL pointer dereference caused by busy-polling a socket bound to an iwlwifi wireless device, which bumps the per-net BUSYPOLLRXPACKETS stat if napi_poll() received packets:

  BUG: unable to handle kernel NULL pointer dereference at 0190
  IP: napi_busy_loop+0xd6/0x200
  Call Trace:
   sock_poll+0x5e/0x80
   do_sys_poll+0x324/0x5a0
   SyS_poll+0x6c/0xf0
   do_syscall_64+0x6b/0x1f0
   entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: 7db6b048da3b ("net: Commonize busy polling code to focus on napi_id instead of socket")
Signed-off-by: Josh Elsasser
---
 net/core/dev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 82f20022259d..d1043d49979c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8712,7 +8712,9 @@ int init_dummy_netdev(struct net_device *dev)
 	set_bit(__LINK_STATE_PRESENT, &dev->state);
 	set_bit(__LINK_STATE_START, &dev->state);
 
+	/* napi_busy_loop stats accounting wants this */
+	dev_net_set(dev, &init_net);
+
 	/* Note : We dont allocate pcpu_refcnt for dummy devices,
 	 * because users of this 'device' dont need to change
 	 * its refcount.
-- 
2.19.1
Re: [Intel-wired-lan] [PATCH net] ixgbe: recognize 1000BaseLX SFP modules as 1Gbps
Jeff Kirsher wrote: > No need to re-spin the patch, just reply with the additional "Fixes" > tag and if patchwork does not pick it up, I will add it to the patch I > have in my tree for validation and review. Thanks, let's try that. Fixes: e23f33367882 ("ixgbe: Fix 1G and 10G link stability for X550EM_x SFP+")
Re: [PATCH linux-firmware] Mellanox: Add new mlxsw_spectrum firmware 13.1910.622
On Thu, Dec 13, 2018 at 3:45 PM Shalom Toledo wrote:
>
> This new firmware contains:
>  * New packet traps for discarded packets
>  * Secure firmware flash bug fix
>  * Fence mechanism bug fix
>  * TCAM RMA bug fix
>
> Signed-off-by: Shalom Toledo

Applied and pushed out.

josh
Possible bug in traffic control?
multicast off
sysctl -w net.ipv6.conf."$dif".autoconf=0
sysctl -w net.ipv6.conf."$dif".accept_ra=0
sysctl -w net.ipv4.conf."$dif".arp_ignore=8
sysctl -w net.ipv4.conf."$dif".rp_filter=0
ip link set "$eif" down promisc on arp off multicast off
sysctl -w net.ipv6.conf."$eif".autoconf=0
sysctl -w net.ipv6.conf."$eif".accept_ra=0
sysctl -w net.ipv4.conf."$eif".arp_ignore=8
sysctl -w net.ipv4.conf."$eif".rp_filter=0

# Set up traffic mirroring:
echo "* Start Port Mirror"
# sif to eif
tc qdisc add dev "$sif" ingress
tc filter add dev "$sif" parent : \
    protocol all \
    u32 match u8 0 0 \
    action mirred egress mirror dev "$eif"
# eif to sif
tc qdisc add dev "$eif" ingress
tc filter add dev "$eif" parent : \
    protocol all \
    u32 match u8 0 0 \
    action mirred egress mirror dev "$sif"

# Bring up the interfaces:
echo "* Light tunnel NICS"
ip link set "$sif" up
ip link set "$dif" up
ip link set "$eif" up
echo " --=[ MACSec Up ]=--"

---
Josh Coombs
Re: Possible bug in traffic control?
2.3 billion 1 byte packets failed to re-create the bug. To try and simplify the setup I removed macsec from the equation, using a single host in the middle as the bridge. Interestingly, rather than 1.3Gbits a second in both directions, it ran around 8Mbits a second. Switching the filter from u32 to matchall didn't change the performance. Going back to the four machine test bed, again removing macsec and just bridging through radically decreased the throughput to around 8Mbits. Flip on macsec for the bridge and 1.3Gbits? On Tue, Oct 9, 2018 at 11:58 AM Josh Coombs wrote: > > Hello all, I'm looking for some guidance in chasing what I believe to > be a bug in kernel traffic control filters. If I'm pinging the wrong > list let me know. > > I have a homebrew MACSec bridge setup using two pairs of PCs. I > establish a MACSec link between them, and then use TC to bridge a > second ethernet interface over the MACSec link. The second interface > is connected to a Juniper switch at each end, and I'm using LACP over > the links to bond them up for redundancy. It turns out I need that > redundancy as after awhile one pair of bridges will stop flowing > packets in one direction. I've since replicated this failure with a > group of VMs as well. > > My test setup to replicate the failure inside ESXi: > - Two MACSec bridge VMs, A and Z > - Two IPerf VMs, A and Z > My VMs are currently built using Ubuntu Server 18.04 to be quick, no > additional packages are required outside of iperf3. Kernel ver as > shipped currently is 4.15.0-36. I highly advise using a CPU with AES > instruction support as MACSec eats CPU without it and will take longer > to reproduce the symptoms. 
> > - A 'MACSec Bridge' network > - A 'A Side link' network > - A 'Z Side link' network > In ESXi I used a dedicated vSwitch, 9000 MTU (to allow full 1500 eth > packets + MACSec to pass on the bridge) and the security policy is > full open (allow promiscuous, allow forged, allow mac changes) as > we're abusing the networks as direct point to point links. If using > physical machines, just cable up, my example script bumps the MTU as > required. > > The MACSec boxes have two ethernet interfaces each. One pair is on > the MACSec Bridge network. The other interfaces go to the A and Z > IPerf boxes respectively via their dedicated networks. A and Z need > their interfaces configured with IPs in a common subnet, such as > 192.168.0.1/30 and 192.168.0.2/30. > > My script sets up MACSec, tweaks MTUs, and touches a few sysctls to > turn the involved interfaces into silent actors. It then uses TC to > start the actual bridging. From there I've been firing up iperf 3 > sessions in both directions between A and Z to hammer the bridge until > it fails. When it does, I can see packets stop being bridged in one > direction on one MACSec host, but not the other. The second host > continues to flow packets in both directions. Nothing is logged to > dmesg when this fault occurs. The fault seems to occur at roughly the > same packet / traffic amount each time. On my main application it's > after approximately 2.5TB of traffic (random mix of sizes) and with my > test bed it was after 5.5TB of 1500 byte packets. > > On the impacted MACSec node, watching interface packet counters via > ifconfig and actual traffic with tcpdump I can see packets coming in > MACSec and going out the host interface, the host reply coming in but > not showing up on the MACSec interface to cross the bridge. Clearing > out the tc filter and qdisc and re-adding does not restore traffic > flow. 
> > There is a PPA with 4.18 available for Ubuntu that I'm going to test > with next to see if that makes a difference in behavior. In the mean > time I'd appreciate any suggestions on how to diagnose this. > > My MACSec bridge setup script, update sif, dif, the keys and rxmac to > match your setup. The rxmac is the mac addy of the remote bridge > interface. Keys need to be flipped between systems. > --- > #!/bin/bash > > # Interfaces: > # sif = Ingress physical interface (Source) > # dif = Egress physical interface (Dest) > # eif = Encrypted interface > sif=eno2 > dif=enp1s0f0 > eif=macsec0 > > # MACSec Keys: > # txkey = Transmit (Local) key > # rxkey = Receive (Remote) key > # rxmac = Receive (Remote) MAC addy > txkey= > rxkey= > rxmac=00:11:22:33:44:55 > > # Use jumbo frames for macsec to allow full 1500 MTU passthrough: > echo "* MTU update" > ip link set "$sif" mtu 9000 > ip link set "$dif" mtu 9000 > > # Bring up macsec:
Re: Possible bug in traffic control?
I'm actually leaning towards macsec now. I'm at 6TB transferred in a double hop, no macsec over the bridge setup without triggering the fault. I'm going to let it continue to churn and set up a second testbed that JUST uses macsec without traffic control bridging to see if I can trip the issue there. That should determine if it's macsec itself, or an interaction between macsec and traffic control. Joshua Coombs GWI office 207-494-2140 www.gwi.net On Wed, Oct 10, 2018 at 12:39 PM Cong Wang wrote: > > On Wed, Oct 10, 2018 at 8:54 AM Josh Coombs wrote: > > > > 2.3 billion 1 byte packets failed to re-create the bug. To try and > > simplify the setup I removed macsec from the equation, using a single > > host in the middle as the bridge. Interestingly, rather than 1.3Gbits > > a second in both directions, it ran around 8Mbits a second. Switching > > the filter from u32 to matchall didn't change the performance. Going > > back to the four machine test bed, again removing macsec and just > > bridging through radically decreased the throughput to around 8Mbits. > > Flip on macsec for the bridge and 1.3Gbits? > > This is a great narrow down! We can rule out macsec for guilty. > > Can you share a minimum reproducer for this problem? If so I can take > a look.
Re: Possible bug in traffic control?
I've been able to narrow the scope down, the issue is with macsec itself. I setup two hosts with a macsec link between them, and let a couple iperf3 sessions blast traffic across. At approximately 4.2 billion packets / 6TB data transferred one end stopped transmitting packets. Doing a tcpdump on the impacted node's macsec0 interface shows packets coming in from the remote node, in this case arp requests, and arp replies from the local host, but watching the interface counters for macsec0 no packets are being recorded as transmitting. Again, nothing in dmesg implying an error. Deleting the macsec interface via ip link delete macsec0 and re-creating it gets traffic flowing again without a reboot. Meanwhile my traffic control bridge without macsec has shuffled 19TB via 22 billion packets and not skipped a beat, so it appears my initial assumption of it being the culprit was wrong. To replicate, setup two hosts with a direct ethernet link between each other. - Bring up macsec between the two hosts, setup a dedicated /30 on the link. - Use iperf3 or another traffic generating tool over the /30, one session for each direction. - Wait for traffic to stop. My test bed is on Ubuntu Server 18.04 currently, kernel 4.15.0-36. I'm going to spin up a vanilla kernel on 4.15 and then -current to see if this is an Ubuntu-ism from their patches, specific to 4.15, or a general issue with macsec. 
The script I used on each host (keys, rxmacs and IPs updated as appropriate):

#!/bin/bash

# Interfaces:
# dif = Egress physical interface (Dest)
# eif = Encrypted interface
dif=ens224
eif=macsec0

# MACSec Keys:
# txkey = Transmit (Local) key
# rxkey = Receive (Remote) key
# rxmac = Receive (Remote) MAC addy
txkey=60995924232808431491190820961556
rxkey=87345530111733181210202106249824
rxmac=00:0c:29:c5:95:df

# Clear any existing IP config
ifconfig $dif 0.0.0.0

# Bring up macsec:
echo "* Enable MACSec"
modprobe macsec
ip link add link "$dif" "$eif" type macsec
ip macsec add "$eif" tx sa 0 pn 1 on key 02 "$txkey"
ip macsec add "$eif" rx address "$rxmac" port 1
ip macsec add "$eif" rx address "$rxmac" port 1 sa 0 pn 1 on key 01 "$rxkey"
ip link set "$eif" type macsec encrypt on

# Bring up the interfaces:
echo "* Light tunnel NICS"
ip link set "$dif" up
ip link set "$eif" up

# Set IP
ifconfig $eif 192.168.211.1/30

echo " --=[ MACSec Up ]=--"

On Thu, Oct 11, 2018 at 10:05 AM Josh Coombs wrote: > > I'm actually leaning towards macsec now. I'm at 6TB transferred in a > double hop, no macsec over the bridge setup without triggering the > fault. I'm going to let it continue to churn and setup a second > testbed that JUST uses macsec without traffic control bridging to see > if I can trip the issue there.That should determine if it's macsec > itself, or an interaction between macsec and traffic control. > > Joshua Coombs > GWI > > office 207-494-2140 > www.gwi.net > > On Wed, Oct 10, 2018 at 12:39 PM Cong Wang wrote: > > > > On Wed, Oct 10, 2018 at 8:54 AM Josh Coombs wrote: > > > > > > 2.3 billion 1 byte packets failed to re-create the bug. To try and > > > simplify the setup I removed macsec from the equation, using a single > > > host in the middle as the bridge. Interestingly, rather than 1.3Gbits > > > a second in both directions, it ran around 8Mbits a second. Switching > > > the filter from u32 to matchall didn't change the performance.
Going > > > back to the four machine test bed, again removing macsec and just > > > bridging through radically decreased the throughput to around 8Mbits. > > > Flip on macsec for the bridge and 1.3Gbits? > > > > This is a great narrow down! We can rule out macsec for guilty. > > > > Can you share a minimum reproducer for this problem? If so I can take > > a look.
Bug in MACSec - stops passing traffic after approx 5TB
I initially mistook this for a traffic control issue, but after stripping the test beds down to just the MACSec component, I can still replicate the issue. After approximately 5TB of transfer / 4 billion packets over a MACSec link it stops passing traffic. I have replicated this now on both Ubuntu Server 18.04 with their patched 4.15 kernel and Gentoo with a vanilla 4.18.13 kernel. As noted before, my test setup consists of two machines with a direct ethernet connection to each other. MACSec is set up on the link, along with a /30 network. I then hammer the link with iperf3 in both directions. On a gigabit link it takes me one to two days to trip the bug. Nothing is logged to dmesg. If I remove and re-add the MACSec link, or reboot the machines, traffic flow resumes. I can replicate on physical hardware or in VMs. How should I proceed to help diag and correct this? To replicate, set up two machines with a direct ethernet connection. If simulating in ESXi, set up a dedicated vSwitch; allow promiscuous, forged and MAC changes should be enabled, MTU increased to 9000; then set up a dedicated port group and VLAN to simulate the direct connection.
I use the following script to set up the MACSec connection, adjusting keys, rxmac, interface and IPs as appropriate:

---
The script I used on each host (keys, rxmacs and IPs updated as appropriate):

#!/bin/bash

# Interfaces:
# dif = Egress physical interface (Dest)
# eif = Encrypted interface
dif=ens224
eif=macsec0

# MACSec Keys:
# txkey = Transmit (Local) key
# rxkey = Receive (Remote) key
# rxmac = Receive (Remote) MAC addy
txkey=60995924232808431491190820961556
rxkey=87345530111733181210202106249824
rxmac=00:0c:29:c5:95:df

# Clear any existing IP config
ifconfig $dif 0.0.0.0

# Bring up macsec:
echo "* Enable MACSec"
modprobe macsec
ip link add link "$dif" "$eif" type macsec
ip macsec add "$eif" tx sa 0 pn 1 on key 02 "$txkey"
ip macsec add "$eif" rx address "$rxmac" port 1
ip macsec add "$eif" rx address "$rxmac" port 1 sa 0 pn 1 on key 01 "$rxkey"
ip link set "$eif" type macsec encrypt on

# Bring up the interfaces:
echo "* Light tunnel NICS"
ip link set "$dif" up
ip link set "$eif" up

# Set IP
ifconfig $eif 192.168.211.1/30

Once you can ping across the link, use iperf3 or a similar network stress tool to flood the link with traffic in both directions and wait for the bug to trigger.

Josh Coombs
Re: Bug in MACSec - stops passing traffic after approx 5TB
On Sun, Oct 14, 2018 at 4:24 PM Sabrina Dubroca wrote: > > 2018-10-14, 10:59:31 -0400, Josh Coombs wrote: > > I initially mistook this for a traffic control issue, but after > > stripping the test beds down to just the MACSec component, I can still > > replicate the issue. After approximately 5TB of transfer / 4 billion > > packets over a MACSec link it stops passing traffic. > > I think you're just hitting packet number exhaustion. After 2^32 > packets, the packet number would wrap to 0 and start being reused, > which breaks the crypto used by macsec. Before this point, you have to > add a new SA, and tell the macsec device to switch to it. I had not considered that, I naively thought as long as I didn't specify a replay window, it'd roll the PN over on it's own and life would be good. I'll test that theory tomorrow, should be easy to prove out. > That's why you should be using wpa_supplicant. It will monitor the > growth of the packet number, and handle the rekey for you. Thank you for the heads up, I'll read up on this as well. Josh C
Re: Bug in MACSec - stops passing traffic after approx 5TB
And confirmed, starting with a high packet number results in a very short testbed run, 296 packets and then nothing, just as you surmised. Sorry for raising the alarm falsely. Looks like I need to roll my own build of wpa_supplicant as the ubuntu builds don't include the macsec driver, haven't tested Gentoo's ebuilds yet to see if they do. Josh Coombs On Sun, Oct 14, 2018 at 4:52 PM Josh Coombs wrote: > > On Sun, Oct 14, 2018 at 4:24 PM Sabrina Dubroca wrote: > > > > 2018-10-14, 10:59:31 -0400, Josh Coombs wrote: > > > I initially mistook this for a traffic control issue, but after > > > stripping the test beds down to just the MACSec component, I can still > > > replicate the issue. After approximately 5TB of transfer / 4 billion > > > packets over a MACSec link it stops passing traffic. > > > > I think you're just hitting packet number exhaustion. After 2^32 > > packets, the packet number would wrap to 0 and start being reused, > > which breaks the crypto used by macsec. Before this point, you have to > > add a new SA, and tell the macsec device to switch to it. > > I had not considered that, I naively thought as long as I didn't > specify a replay window, it'd roll the PN over on it's own and life > would be good. I'll test that theory tomorrow, should be easy to > prove out. > > > That's why you should be using wpa_supplicant. It will monitor the > > growth of the packet number, and handle the rekey for you. > > Thank you for the heads up, I'll read up on this as well. > > Josh C
Re: [PATCH 1/1] bnx2x: Add FW 7.13.11.0.
On Sat, Feb 9, 2019 at 11:06 PM Rahul Verma wrote: > > From: Rahul Verma > > This patch adds new FW for bnx2x, which adds the following: > - TX VLAN filtering support. > - Enable TPA only for packets without VLAN. > > It also addresses few critical issues, > - Fairness algorithm misbehaviour when minimum bandwidth configured >for all PFs. > - Error recovery issue on TAPE devices. > - FW not discarding FIP frames that are not designated to PF. > - Kernel driver initialization failure after preboot driver. > - VxLAN stops working after sending inner IP fragmented traffic. > - Issues in the following FW flows: > SD VLAN update, TX packet drop, packet padding flow, vlan add/remove. > > Signed-off-by: Sudarsana Reddy Kalluru > Signed-off-by: Ariel Elior > Signed-off-by: Rahul Verma Applied and pushed out. josh
Re: pull-request Cavium liquidio firmware v1.7.2
On Wed, May 9, 2018 at 12:54 AM Felix Manlunas wrote:
> The following changes since commit 8fc2d4e55685bf73b6f7752383da9067404a74bb:
>
>   Merge git://git.marvell.com/mwifiex-firmware (2018-05-07 09:09:40 -0400)
>
> are available in the git repository at:
>
>   https://github.com/felix-cavium/linux-firmware.git for-upstreaming-v1.7.2
>
> for you to fetch changes up to d3b6941e1a85cbff895a92aa9e36b50deaeac970:
>
>   linux-firmware: liquidio: update firmware to v1.7.2 (2018-05-08 19:02:41 -0700)
>
> Signed-off-by: Felix Manlunas
>
> Felix Manlunas (1):
>   linux-firmware: liquidio: update firmware to v1.7.2
>
>  WHENCE                     | 8
>  liquidio/lio_210nv_nic.bin | Bin 1265368 -> 1281464 bytes
>  liquidio/lio_210sv_nic.bin | Bin 1163128 -> 1179352 bytes
>  liquidio/lio_23xx_nic.bin  | Bin 1271456 -> 1287264 bytes
>  liquidio/lio_410nv_nic.bin | Bin 1265368 -> 1281464 bytes
>  5 files changed, 4 insertions(+), 4 deletions(-)

Pulled and pushed out. Thanks.

josh
Re: [PATCH] bpf: Tweak BPF jump table optimizations for objtool compatibility
On Wed, May 06, 2020 at 05:03:57PM -0700, Alexei Starovoitov wrote: > > > > > diff --git a/include/linux/compiler-gcc.h > > > > > b/include/linux/compiler-gcc.h > > > > > index d7ee4c6bad48..05104c3cc033 100644 > > > > > --- a/include/linux/compiler-gcc.h > > > > > +++ b/include/linux/compiler-gcc.h > > > > > @@ -171,4 +171,4 @@ > > > > > #define __diag_GCC_8(s) > > > > > #endif > > > > > > > > > > -#define __no_fgcse __attribute__((optimize("-fno-gcse"))) > > > > > +#define __no_fgcse > > > > > __attribute__((optimize("-fno-gcse,-fno-omit-frame-pointer"))) > > > > > -- > > > > > 2.23.0 > > > > > > > > > > I've tested it with gcc 8,9,10 and clang 11 with FP=y and with ORC=y. > > > > > All works. > > > > > I think it's safer to go with frame pointers even for ORC=y > > > > > considering > > > > > all the pain this issue had caused. Even if objtool gets confused > > > > > again > > > > > in the future __bpf_prog_run() will have frame pointers and kernel > > > > > stack > > > > > unwinding can fall back from ORC to FP for that frame. > > > > > wdyt? > > > > > > > > It seems dangerous to me. The GCC manual recommends against it. > > > > > > The manual can says that it's broken. That won't stop the world from > > > using it. > > > Just google projects that are using it. For example: qt, lz4, unreal > > > engine, etc > > > Telling compiler to disable gcse via flag is a guaranteed way to avoid > > > that optimization that breaks objtool whereas messing with C code is > > > nothing > > > but guess work. gcc can still do gcse. > > > > But the manual's right, it is broken. How do you know other important > > flags won't also be stripped? > > What flags are you worried about? > I've checked that important things like -mno-red-zone, -fsanitize are > preserved. It's not any specific flags I'm worried about, it's all of them. There are a lot of possibilities, with all the different configs, and arches. 
Flags are usually added for a good reason, so one randomly missing flag
could have unforeseen results. And I don't have any visibility into how
GCC decides which flags to drop, and when. But the docs aren't comforting.

Even if things seem to work now, that could (silently) change at any point
in time. This time objtool warned about the missing frame pointer, but
that's not necessarily going to happen for other flags.

If we go this route, I would much rather do -fno-gcse on a file-wide basis.

-- 
Josh
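[Editorial note: for readers unfamiliar with the attribute being debated, a
minimal sketch in plain C follows. This is illustrative code, not from the
kernel; the function and macro names are invented. Per the thread, GCC's
"optimize" attribute replaces the function's optimization state rather than
appending to the command-line flags, which is the hazard Josh describes.]

```c
#include <assert.h>

/* The per-function form being debated: GCC's "optimize" attribute.
 * Per the GCC docs it starts from a blank slate rather than appending
 * to the command-line options, so other flags may be silently dropped.
 * Clang parses but ignores this attribute, hence the guard. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_GCSE __attribute__((optimize("-fno-gcse")))
#else
#define NO_GCSE
#endif

/* Illustrative function, not kernel code. */
static long NO_GCSE sum(const long *v, int n)
{
	long total = 0;
	for (int i = 0; i < n; i++)
		total += v[i];
	return total;
}
```

The file-wide alternative Josh prefers would drop the attribute entirely and
pass the flag for the whole translation unit instead, e.g. `gcc -fno-gcse
core.c`, or in kbuild terms something along the lines of
`CFLAGS_core.o += -fno-gcse` in the Makefile (a sketch of the idea, not a
line from this patch).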
[PATCH] bpf: Tweak BPF jump table optimizations for objtool compatibility
Objtool decodes instructions and follows all potential code branches
within a function.  But it's not an emulator, so it doesn't track
register values.  For that reason, it usually can't follow
intra-function indirect branches, unless they're using a jump table
which follows a certain format (e.g., GCC switch statement jump tables).

In most cases, the generated code for the BPF jump table looks a lot
like a GCC jump table, so objtool can follow it.  However, with
RETPOLINE=n, GCC keeps the jump table address in a register, and then
does 160+ indirect jumps with it.  When objtool encounters the indirect
jumps, it can't tell which jump table is being used (or even whether
they might be sibling calls instead).

This was fixed before by disabling an optimization in ___bpf_prog_run(),
using the "optimize" function attribute.  However, that attribute is bad
news.  It doesn't append options to the command-line arguments.  Instead
it starts from a blank slate.  And according to recent GCC documentation
it's not recommended for production use.  So revert the previous fix:

  3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")

With that reverted, solve the original problem in a different way by
getting rid of the "goto select_insn" indirection, and instead just goto
the jump table directly.  This simplifies the code a bit and helps GCC
generate saner code for the jump table branches, at least in the
RETPOLINE=n case.

But, in the RETPOLINE=y case, this simpler code actually causes GCC to
generate far worse code, ballooning the function text size by +40%.  So
leave that code the way it was.  In fact Alexei prefers to leave *all*
the code the way it was, except where needed by objtool.  So even
non-x86 RETPOLINE=n code will continue to have "goto select_insn".

This stuff is crazy voodoo, and far from ideal.  But it works for now.
Eventually, there's a plan to create a compiler plugin for annotating
jump tables.  That will make this a lot less fragile.
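[Editorial note: the "direct" jump-table dispatch described above can be
sketched in plain C with GCC's labels-as-values extension. This is a toy
interpreter, not the kernel's ___bpf_prog_run(); the opcodes, struct, and
function names are invented. It shows the RETPOLINE=n form, where every
CONT jumps through the table itself rather than through a shared label.]

```c
#include <assert.h>

/* Toy opcodes and instruction format, invented for illustration. */
enum { OP_ADD, OP_MUL, OP_EXIT };
struct insn { int code; long imm; };

static long run(const struct insn *insn)
{
	/* Jump table of label addresses (GCC/Clang "labels as values"). */
	static const void *jumptable[] = {
		[OP_ADD]  = &&ADD,
		[OP_MUL]  = &&MUL,
		[OP_EXIT] = &&EXIT,
	};
	long acc = 0;

/* "Direct" dispatch: each CONT is its own indirect jump through the
 * table, analogous to the RETPOLINE=n branch added by the patch. */
#define CONT ({ insn++; goto *jumptable[insn->code]; })

	goto *jumptable[insn->code];

ADD:	acc += insn->imm; CONT;
MUL:	acc *= insn->imm; CONT;
EXIT:	return acc;
#undef CONT
}
```

With many handlers (the real interpreter has 160+), this form gives GCC one
indirect jump per CONT site, which is what objtool sees and, when the table
address stays in a register, cannot resolve.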
Fixes: 3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
Reported-by: Randy Dunlap
Reported-by: Arnd Bergmann
Signed-off-by: Josh Poimboeuf
---
 include/linux/compiler-gcc.h   |  2 --
 include/linux/compiler_types.h |  4 ----
 kernel/bpf/core.c              | 10 +++++++---
 3 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
index cf294faec2f8..2c8583eb5de8 100644
--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -176,5 +176,3 @@
 #else
 #define __diag_GCC_8(s)
 #endif
-
-#define __no_fgcse __attribute__((optimize("-fno-gcse")))
diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index e970f97a7fcb..58105f1deb79 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -203,10 +203,6 @@ struct ftrace_likely_data {
 #define asm_inline asm
 #endif

-#ifndef __no_fgcse
-# define __no_fgcse
-#endif
-
 /* Are two types/vars the same type (ignoring qualifiers)? */
 #define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 916f5132a984..eec470c598ad 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1364,7 +1364,7 @@ u64 __weak bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr)
  *
  * Decode and execute eBPF instructions.
  */
-static u64 __no_fgcse ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
+static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
 {
 #define BPF_INSN_2_LBL(x, y)	[BPF_##x | BPF_##y] = &&x##_##y
 #define BPF_INSN_3_LBL(x, y, z)	[BPF_##x | BPF_##y | BPF_##z] = &&x##_##y##_##z
@@ -1384,11 +1384,15 @@ static u64 __no_fgcse ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
 #undef BPF_INSN_2_LBL
 	u32 tail_call_cnt = 0;

+#if defined(CONFIG_X86_64) && !defined(CONFIG_RETPOLINE)
+#define CONT	 ({ insn++; goto *jumptable[insn->code]; })
+#define CONT_JMP ({ insn++; goto *jumptable[insn->code]; })
+#else
 #define CONT	 ({ insn++; goto select_insn; })
 #define CONT_JMP ({ insn++; goto select_insn; })
-
 select_insn:
 	goto *jumptable[insn->code];
+#endif

 	/* ALU */
 #define ALU(OPCODE, OP)		\
@@ -1547,7 +1551,7 @@ static u64 __no_fgcse ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
 	 * where arg1_type is ARG_PTR_TO_CTX.
 	 */
 	insn = prog->insnsi;
-	goto select_insn;
+	CONT;
 out:
 	CONT;
 }
-- 
2.21.1
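[Editorial note: for contrast with the direct form, the "select_insn"
indirection that the patch keeps for the RETPOLINE=y and non-x86 cases
funnels every CONT through a single dispatch point. Again a toy sketch with
invented opcodes and names, not kernel code.]

```c
#include <assert.h>

enum { OP_ADD, OP_MUL, OP_EXIT };	/* invented opcodes */
struct insn { int code; long imm; };

static long run_indirect(const struct insn *insn)
{
	static const void *jumptable[] = {
		[OP_ADD]  = &&ADD,
		[OP_MUL]  = &&MUL,
		[OP_EXIT] = &&EXIT,
	};
	long acc = 0;

/* All CONTs funnel through one label, so the function contains exactly
 * one indirect jump -- the form kept when RETPOLINE=y, where each
 * indirect jump would otherwise become a costly retpoline thunk. */
#define CONT ({ insn++; goto select_insn; })

select_insn:
	goto *jumptable[insn->code];

ADD:	acc += insn->imm; CONT;
MUL:	acc *= insn->imm; CONT;
EXIT:	return acc;
#undef CONT
}
```

Both forms compute the same result; they differ only in how many indirect
jump sites the compiler emits, which is what objtool and retpolines care
about.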
Re: [PATCH] bpf: Tweak BPF jump table optimizations for objtool compatibility
On Fri, May 01, 2020 at 12:09:30PM -0700, Alexei Starovoitov wrote:
> On Thu, Apr 30, 2020 at 02:07:43PM -0500, Josh Poimboeuf wrote:
> > Objtool decodes instructions and follows all potential code branches
> > within a function.  But it's not an emulator, so it doesn't track
> > register values.  For that reason, it usually can't follow
> > intra-function indirect branches, unless they're using a jump table
> > which follows a certain format (e.g., GCC switch statement jump tables).
> >
> > In most cases, the generated code for the BPF jump table looks a lot
> > like a GCC jump table, so objtool can follow it.  However, with
> > RETPOLINE=n, GCC keeps the jump table address in a register, and then
> > does 160+ indirect jumps with it.  When objtool encounters the indirect
> > jumps, it can't tell which jump table is being used (or even whether
> > they might be sibling calls instead).
> >
> > This was fixed before by disabling an optimization in ___bpf_prog_run(),
> > using the "optimize" function attribute.  However, that attribute is bad
> > news.  It doesn't append options to the command-line arguments.  Instead
> > it starts from a blank slate.  And according to recent GCC documentation
> > it's not recommended for production use.  So revert the previous fix:
> >
> >   3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
> >
> > With that reverted, solve the original problem in a different way by
> > getting rid of the "goto select_insn" indirection, and instead just goto
> > the jump table directly.  This simplifies the code a bit and helps GCC
> > generate saner code for the jump table branches, at least in the
> > RETPOLINE=n case.
> >
> > But, in the RETPOLINE=y case, this simpler code actually causes GCC to
> > generate far worse code, ballooning the function text size by +40%.  So
> > leave that code the way it was.  In fact Alexei prefers to leave *all*
> > the code the way it was, except where needed by objtool.  So even
> > non-x86 RETPOLINE=n code will continue to have "goto select_insn".
> >
> > This stuff is crazy voodoo, and far from ideal.  But it works for now.
> > Eventually, there's a plan to create a compiler plugin for annotating
> > jump tables.  That will make this a lot less fragile.
>
> I don't like this commit log.
> Here you're saying that the code recognized by objtool is sane and good
> whereas well optimized gcc code is somehow voodoo and bad.
> That is just wrong.

I have no idea what you're talking about.  Are you saying that ballooning
the function text size by 40% is well optimized GCC code?  It seems like a
bug to me.  That's the only place I said anything bad about GCC code.

When I said "this stuff is crazy voodoo" I was referring to the patch
itself.  I agree it's horrible; it's just the best approach we're able to
come up with at the moment.

If any of that isn't clear then can you provide specifics about what seems
wrong?

-- 
Josh