linux-source-2.6.16: inconsistent device detection: eth0<=>eth1
Subject: linux-source-2.6.16: inconsistent device detection: eth0<=>eth1
Package: linux-source-2.6.16
Version: 2.6.16-2
Severity: normal

I have two ethernet cards on my box, an onboard SiS 7012 and a PCI NE2000-compatible. When I boot the system, sometimes the SiS is detected as eth0 and the PCI card as eth1, and sometimes the SiS is eth1 and the PCI card eth0, *without my changing anything*. I just reboot and it's different. This is a big pain in the butt, because I have DHCP on the SiS connection and a separate internal net with a static address on the PCI card, and the configuration in /etc/network/interfaces is based upon specifying eth?.

Attached are two dmesg outputs with the corresponding detections. Run a diff on them. Also attached is the output of lspci -vvv.

[EMAIL PROTECTED]:/proc$ cat /proc/version
Linux version 2.6.16 (Version: spleen.060402.2) ([EMAIL PROTECTED]) (gcc version 4.0.3 20051201 (prerelease) (Debian 4.0.2-5)) #1 Sun Apr 2 21:13:53 CEST 2006

cheers,
-j

-- System Information:
Debian Release: 3.1
  APT prefers testing
  APT policy: (990, 'testing')
Architecture: i386 (i686)
Shell: /bin/sh linked to /bin/bash
Kernel: Linux 2.6.16
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8)

Versions of packages linux-source-2.6.16 depends on:
ii  binutils  2.16.1cvs20060117-1  The GNU assembler, linker and bina
ii  bzip2     1.0.2-7              high-quality block-sorting file co

Versions of packages linux-source-2.6.16 recommends:
ii  gcc                   4:4.0.2-1        The GNU C compiler
ii  libc6-dev [libc-dev]  2.3.6-3          GNU C Library: Development Librari
ii  make                  3.80+3.81.rc2-1  The GNU version of the "make" util

-- no debconf information

Linux version 2.6.16 (Version: spleen.060402.2) ([EMAIL PROTECTED]) (gcc version 4.0.3 20051201 (prerelease) (Debian 4.0.2-5)) #1 Sun Apr 2 21:13:53 CEST 2006
BIOS-provided physical RAM map:
 BIOS-e820: - 0009fc00 (usable)
 BIOS-e820: 0009fc00 - 000a (reserved)
 BIOS-e820: 000f - 0010 (reserved)
 BIOS-e820: 0010 - 2fff (usable)
 BIOS-e820: 2fff - 2fff8000 (ACPI data)
 BIOS-e820: 2fff8000 - 3000 (ACPI NVS)
 BIOS-e820: fec0 - fec01000 (reserved)
 BIOS-e820: fee0 - fee01000 (reserved)
 BIOS-e820: ffee - fff0 (reserved)
 BIOS-e820: fffc - 0001 (reserved)
767MB LOWMEM available.
On node 0 totalpages: 196592
  DMA zone: 4096 pages, LIFO batch:0
  DMA32 zone: 0 pages, LIFO batch:0
  Normal zone: 192496 pages, LIFO batch:31
  HighMem zone: 0 pages, LIFO batch:0
DMI 2.3 present.
ACPI: RSDP (v000 AMI ) @ 0x000fa340
ACPI: RSDT (v001 AMIINT SiS735XX 0x1000 MSFT 0x010b) @ 0x2fff
ACPI: FADT (v001 AMIINT SiS735XX 0x1000 MSFT 0x010b) @ 0x2fff0030
ACPI: DSDT (v001 SiS 735 0x0100 MSFT 0x010d) @ 0x
ACPI: PM-Timer IO Port: 0x808
Allocating PCI resources starting at 4000 (gap: 3000:cec0)
Built 1 zonelists
Kernel command line: root=/dev/hda6 ro vga=0x030c pci=routeirq
Enabling fast FPU save and restore... done.
Enabling unmasked SIMD FPU exception support... done.
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 65536 bytes)
Detected 1526.957 MHz processor.
Using pmtmr for high-res timesource
Console: colour VGA+ 132x60
Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
Memory: 776572k/786368k available (1590k kernel code, 9312k reserved, 449k data, 124k init, 0k highmem)
Checking if this processor honours the WP bit even in supervisor mode... Ok.
Calibrating delay using timer specific routine.. 3056.64 BogoMIPS (lpj=6113294)
Mount-cache hash table entries: 512
CPU: After generic identify, caps: 0383f9ff c1c3f9ff
CPU: After vendor identify, caps: 0383f9ff c1c3f9ff
CPU: L1 I Cache: 64K (64 bytes/line), D cache 64K (64 bytes/line)
CPU: L2 Cache: 256K (64 bytes/line)
CPU: After all inits, caps: 0383f9ff c1c3f9ff 0020
Intel machine check architecture supported.
Intel machine check reporting enabled on CPU#0.
CPU: AMD Athlon(tm) XP 1800+ stepping 02
Checking 'hlt' instruction... OK.
ACPI: setting ELCR to 0200 (from 1c20)
NET: Registered protocol family 16
ACPI: bus type pci registered
PCI: PCI BIOS revision 2.10 entry at 0xfdb01, last bus=1
PCI: Using configuration type 1
ACPI: Subsystem revision 20060127
ACPI: Interpreter enabled
ACPI: Using PIC for interrupt routing
ACPI: PCI Root Bridge [PCI0] (:00)
PCI: Probing PCI hardware (bus 00)
Uncovering SIS18 that hid as a SIS503 (compatible=1)
Boot video device is :01:00.0
ACPI: PCI Interrupt Routing Table [\_SB_.PCI0._PRT]
ACPI: Power Resour
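The nondeterministic eth0/eth1 ordering described in this report is usually pinned down from user space by matching each NIC's MAC address to a fixed name with a udev rule. A minimal sketch follows; the MAC addresses are placeholders, not values from the report, and udev releases of that era used SYSFS{address} where modern udev uses ATTR{address}:

```
# /etc/udev/rules.d/70-persistent-net.rules  (placeholder MAC addresses)
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="00:11:22:33:44:55", NAME="eth0"
SUBSYSTEM=="net", ACTION=="add", ATTR{address}=="66:77:88:99:aa:bb", NAME="eth1"
```

With the names pinned this way, the eth0/eth1 stanzas in /etc/network/interfaces keep matching the same physical card across reboots.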
Re: [PATCH 3/3] printk: implement support for extended console drivers
On Mon, Jun 29, 2015 at 11:49:14AM -0400, Tejun Heo wrote:
> On Mon, Jun 29, 2015 at 05:47:49PM +0200, Geert Uytterhoeven wrote:
> > > netconsole itself is optional & modular. I'm not sure making further
> > > splits is called for, especially given the use cases.
> >
> > It could be a hidden option, selected by its users (e.g. netconsole).
>
> Hmmm... what do you mean?

config PRINTK_BITS_THAT_NETCONSOLE_NEEDS
	(no help text or prompt)

config NETCONSOLE
	select PRINTK_BITS_THAT_NETCONSOLE_NEEDS

That would ensure that the bits added to printk don't get compiled in
unless CONFIG_NETCONSOLE=y.

- Josh Triplett
--
To unsubscribe from this list: send the line "unsubscribe netdev" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/3] printk: implement support for extended console drivers
On Mon, Jun 29, 2015 at 12:13:55PM -0400, Tejun Heo wrote:
> On Mon, Jun 29, 2015 at 06:11:40PM +0200, Geert Uytterhoeven wrote:
> > On Mon, Jun 29, 2015 at 5:49 PM, Tejun Heo wrote:
> > > On Mon, Jun 29, 2015 at 05:47:49PM +0200, Geert Uytterhoeven wrote:
> > >> > netconsole itself is optional & modular. I'm not sure making further
> > >> > splits is called for, especially given the use cases.
> > >>
> > >> It could be a hidden option, selected by its users (e.g. netconsole).
> > >
> > > Hmmm... what do you mean?
> >
> > init/Kconfig:
> >
> > config PRINTK_EXT_LOG
> > 	bool
> >
> > drivers/net/Kconfig:
> >
> > config NETCONSOLE
> > 	tristate "Network console logging support"
> > 	select PRINTK_EXT_LOG
> >
> > kernel/printk/printk.c:
> >
> > void console_unlock(void)
> > {
> > #ifdef CONFIG_PRINTK_EXT_LOG
> > 	static char ext_text[CONSOLE_EXT_LOG_MAX];
> > #endif
>
> OIC, hmmm... yeah, I think doing it on-demand would be better but will
> try to find out which way is better.

Allocating the buffer dynamically is fine, but in that case the code to
do so should ideally be compiled out. Since printk is used by almost
*all* kernels, while netconsole is not, it's more critical to avoid
letting printk's size balloon.

- Josh Triplett
[PATCH] tcp: add new tcp_mtu_probe_floor sysctl
The current implementation of TCP MTU probing can considerably
underestimate the MTU on lossy connections, allowing the MSS to get down
to 48. We have found that in almost all of these cases on our networks
these paths can handle much larger MTUs, meaning the connections are
being artificially limited. Even though TCP MTU probing can raise the
MSS back up, we have seen this not to be the case, causing connections
to be "stuck" with an MSS of 48 when heavy loss is present.

Prior to pushing out this change we could not keep TCP MTU probing
enabled b/c of the above reasons. Now with a reasonable floor set we've
had it enabled for the past 6 months.

The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
administrators the ability to control the floor of MSS probing.

Signed-off-by: Josh Hunt
---
 Documentation/networking/ip-sysctl.txt | 6 ++++++
 include/net/netns/ipv4.h               | 1 +
 net/ipv4/sysctl_net_ipv4.c             | 9 +++++++++
 net/ipv4/tcp_ipv4.c                    | 1 +
 net/ipv4/tcp_timer.c                   | 2 +-
 5 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index df33674799b5..49e95f438ed7 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -256,6 +256,12 @@ tcp_base_mss - INTEGER
 	Path MTU discovery (MTU probing). If MTU probing is enabled,
 	this is the initial MSS used by the connection.
 
+tcp_mtu_probe_floor - INTEGER
+	If MTU probing is enabled this caps the minimum MSS used for search_low
+	for the connection.
+
+	Default : 48
+
 tcp_min_snd_mss - INTEGER
 	TCP SYN and SYNACK messages usually advertise an ADVMSS option,
 	as described in RFC 1122 and RFC 6691.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index bc24a8ec1ce5..c0c0791b1912 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -116,6 +116,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_l3mdev_accept;
 #endif
 	int sysctl_tcp_mtu_probing;
+	int sysctl_tcp_mtu_probe_floor;
 	int sysctl_tcp_base_mss;
 	int sysctl_tcp_min_snd_mss;
 	int sysctl_tcp_probe_threshold;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 0b980e841927..59ded25acd04 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -820,6 +820,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra2		= &tcp_min_snd_mss_max,
 	},
 	{
+		.procname	= "tcp_mtu_probe_floor",
+		.data		= &init_net.ipv4.sysctl_tcp_mtu_probe_floor,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &tcp_min_snd_mss_min,
+		.extra2		= &tcp_min_snd_mss_max,
+	},
+	{
 		.procname	= "tcp_probe_threshold",
 		.data		= &init_net.ipv4.sysctl_tcp_probe_threshold,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d57641cb3477..e0a372676329 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2637,6 +2637,7 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS;
 	net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
 	net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
+	net->ipv4.sysctl_tcp_mtu_probe_floor = TCP_MIN_SND_MSS;
 	net->ipv4.sysctl_tcp_keepalive_time = TCP_KEEPALIVE_TIME;
 	net->ipv4.sysctl_tcp_keepalive_probes = TCP_KEEPALIVE_PROBES;
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index c801cd37cc2a..dbd9d2d0ee63 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -154,7 +154,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
 	} else {
 		mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
 		mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
-		mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
+		mss = max(mss, net->ipv4.sysctl_tcp_mtu_probe_floor);
 		mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
 		icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
 	}
-- 
2.7.4
Re: [PATCH] tcp: add new tcp_mtu_probe_floor sysctl
On 7/27/19 12:05 AM, Eric Dumazet wrote:
> On Sat, Jul 27, 2019 at 4:23 AM Josh Hunt wrote:
>> The current implementation of TCP MTU probing can considerably
>> underestimate the MTU on lossy connections allowing the MSS to get down
>> to 48. We have found that in almost all of these cases on our networks
>> these paths can handle much larger MTUs meaning the connections are
>> being artificially limited. Even though TCP MTU probing can raise the
>> MSS back up we have seen this not to be the case causing connections to
>> be "stuck" with an MSS of 48 when heavy loss is present.
>>
>> Prior to pushing out this change we could not keep TCP MTU probing
>> enabled b/c of the above reasons. Now with a reasonble floor set we've
>> had it enabled for the past 6 months.
>
> And what reasonable value have you used ???

Reasonable for some may not be reasonable for others hence the new sysctl :)

We're currently running with a fairly high value based off of the v6 min
MTU minus headers and options, etc. We went conservative with our
setting initially as it seemed a reasonable first step when re-enabling
TCP MTU probing, since with no configurable floor we saw a # of cases
where connections were using severely reduced mss b/c of loss and not
b/c of actual path restriction. I plan to reevaluate the setting at some
point, but since the probing method is still the same it means the same
clients who got stuck with mss of 48 before will land at whatever floor
we set. Looking forward we are interested in trying to improve TCP MTU
probing so it does not penalize clients like this.

A suggestion for a more reasonable floor default would be 512, which is
the same as the min_pmtu. Given both mechanisms are trying to achieve
the same goal it seems like they should have a similar min/floor.

>> The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
>> administrators the ability to control the floor of MSS probing.
Re: [PATCH] tcp: add new tcp_mtu_probe_floor sysctl
On 7/28/19 6:54 AM, Eric Dumazet wrote:
> [...]
Re: [PATCH] tcp: add new tcp_mtu_probe_floor sysctl
On 7/28/19 2:14 PM, Josh Hunt wrote:
> [...]
Re: [PATCH] tcp: add new tcp_mtu_probe_floor sysctl
On 7/29/19 6:12 AM, Neal Cardwell wrote:
> [...]
[PATCH v2 1/2] tcp: add new tcp_mtu_probe_floor sysctl
The current implementation of TCP MTU probing can considerably
underestimate the MTU on lossy connections, allowing the MSS to get down
to 48. We have found that in almost all of these cases on our networks
these paths can handle much larger MTUs, meaning the connections are
being artificially limited. Even though TCP MTU probing can raise the
MSS back up, we have seen this not to be the case, causing connections
to be "stuck" with an MSS of 48 when heavy loss is present.

Prior to pushing out this change we could not keep TCP MTU probing
enabled b/c of the above reasons. Now with a reasonable floor set we've
had it enabled for the past 6 months.

The new sysctl will still default to TCP_MIN_SND_MSS (48), but gives
administrators the ability to control the floor of MSS probing.

Signed-off-by: Josh Hunt
---
 Documentation/networking/ip-sysctl.txt | 6 ++++++
 include/net/netns/ipv4.h               | 1 +
 net/ipv4/sysctl_net_ipv4.c             | 9 +++++++++
 net/ipv4/tcp_ipv4.c                    | 1 +
 net/ipv4/tcp_timer.c                   | 2 +-
 5 files changed, 18 insertions(+), 1 deletion(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index df33674799b5..49e95f438ed7 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -256,6 +256,12 @@ tcp_base_mss - INTEGER
 	Path MTU discovery (MTU probing). If MTU probing is enabled,
 	this is the initial MSS used by the connection.
 
+tcp_mtu_probe_floor - INTEGER
+	If MTU probing is enabled this caps the minimum MSS used for search_low
+	for the connection.
+
+	Default : 48
+
 tcp_min_snd_mss - INTEGER
 	TCP SYN and SYNACK messages usually advertise an ADVMSS option,
 	as described in RFC 1122 and RFC 6691.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index bc24a8ec1ce5..c0c0791b1912 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -116,6 +116,7 @@ struct netns_ipv4 {
 	int sysctl_tcp_l3mdev_accept;
 #endif
 	int sysctl_tcp_mtu_probing;
+	int sysctl_tcp_mtu_probe_floor;
 	int sysctl_tcp_base_mss;
 	int sysctl_tcp_min_snd_mss;
 	int sysctl_tcp_probe_threshold;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 0b980e841927..59ded25acd04 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -820,6 +820,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra2		= &tcp_min_snd_mss_max,
 	},
 	{
+		.procname	= "tcp_mtu_probe_floor",
+		.data		= &init_net.ipv4.sysctl_tcp_mtu_probe_floor,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &tcp_min_snd_mss_min,
+		.extra2		= &tcp_min_snd_mss_max,
+	},
+	{
 		.procname	= "tcp_probe_threshold",
 		.data		= &init_net.ipv4.sysctl_tcp_probe_threshold,
 		.maxlen		= sizeof(int),
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index d57641cb3477..e0a372676329 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2637,6 +2637,7 @@ static int __net_init tcp_sk_init(struct net *net)
 	net->ipv4.sysctl_tcp_min_snd_mss = TCP_MIN_SND_MSS;
 	net->ipv4.sysctl_tcp_probe_threshold = TCP_PROBE_THRESHOLD;
 	net->ipv4.sysctl_tcp_probe_interval = TCP_PROBE_INTERVAL;
+	net->ipv4.sysctl_tcp_mtu_probe_floor = TCP_MIN_SND_MSS;
 	net->ipv4.sysctl_tcp_keepalive_time = TCP_KEEPALIVE_TIME;
 	net->ipv4.sysctl_tcp_keepalive_probes = TCP_KEEPALIVE_PROBES;
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index c801cd37cc2a..dbd9d2d0ee63 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -154,7 +154,7 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)
 	} else {
 		mss = tcp_mtu_to_mss(sk, icsk->icsk_mtup.search_low) >> 1;
 		mss = min(net->ipv4.sysctl_tcp_base_mss, mss);
-		mss = max(mss, 68 - tcp_sk(sk)->tcp_header_len);
+		mss = max(mss, net->ipv4.sysctl_tcp_mtu_probe_floor);
 		mss = max(mss, net->ipv4.sysctl_tcp_min_snd_mss);
 		icsk->icsk_mtup.search_low = tcp_mss_to_mtu(sk, mss);
 	}
-- 
2.7.4
[PATCH v2 2/2] tcp: Update TCP_BASE_MSS comment
TCP_BASE_MSS is used as the default initial MSS value when MTU probing is enabled. Update the comment to reflect this.

Suggested-by: Neal Cardwell
Signed-off-by: Josh Hunt
---
 include/net/tcp.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 81e8ade1e6e4..9e9fbfaf052b 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -64,7 +64,7 @@ void tcp_time_wait(struct sock *sk, int state, int timeo);
 /* Minimal accepted MSS. It is (60+60+8) - (20+20). */
 #define TCP_MIN_MSS		88U
 
-/* The least MTU to use for probing */
+/* The initial MTU to use for probing */
 #define TCP_BASE_MSS		1024
 
 /* probing interval, default to 10 minutes as per RFC4821 */
-- 
2.7.4
Re: [PATCH v2 1/2] tcp: add new tcp_mtu_probe_floor sysctl
On 8/7/19 11:12 PM, Eric Dumazet wrote: On 8/8/19 1:52 AM, Josh Hunt wrote: The current implementation of TCP MTU probing can considerably underestimate the MTU on lossy connections allowing the MSS to get down to 48. We have found that in almost all of these cases on our networks these paths can handle much larger MTUs meaning the connections are being artificially limited. Even though TCP MTU probing can raise the MSS back up we have seen this not to be the case causing connections to be "stuck" with an MSS of 48 when heavy loss is present. Prior to pushing out this change we could not keep TCP MTU probing enabled b/c of the above reasons. Now with a reasonble floor set we've had it enabled for the past 6 months. I am still sad to see you do not share what is a reasonable value and let everybody guess. It seems to be a top-secret value. Haha, no sorry I didn't mean for it to come across like that. We are currently setting tcp_base_mss to 1348 and tcp_mtu_probe_floor to 1208. I thought I mentioned it in our earlier mails, but I guess I did not put the exact #s. 1348 was derived after analyzing common MTU we see across our networks and noticing that an MTU of around 1400 would cover a very large % (sorry I don't have the #s handy) of those paths. 1400 - 20 - 20 - 12 = 1348. For the floor we based it off the v6 min MTU of 1280 and subtracted out headers, etc, so 1280 - 40 - 20 - 12 = 1208. Using a floor of 1280 MTU matches what the RFC suggests in section 7.7 for v6 connections, we're just applying that to v4 right now as well. I guess that brings up a good point, would per-IP proto floor sysctls be an option here for upstream (the RFC suggests different floors for v4 and v6)? For now I'm not sure it makes sense b/c of the problems we see with lossy connections, but in the future if that can be resolved it seems like it would give some more flexibility. I'd like to investigate this all further at some point to see if we can make it work better for lossy connections. 
It looks like one of the problems is there are a # of conditions which cause us to not probe in the upward direction. I'm not sure if any of those can be relaxed/changed and if so would help these connections out. Thanks Josh
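The MSS derivations quoted above can be restated explicitly. This is just the arithmetic from the mail (IP header + TCP header + timestamp option subtracted from the chosen MTU), not kernel code:

```python
TCP_HDR = 20    # base TCP header
TS_OPT = 12     # TCP timestamp option
IPV4_HDR = 20
IPV6_HDR = 40

def mss_from_mtu(mtu, ip_hdr):
    """MSS remaining after the IP header, TCP header and timestamp option."""
    return mtu - ip_hdr - TCP_HDR - TS_OPT

# tcp_base_mss derived from a ~1400-byte common-path MTU over IPv4
print(mss_from_mtu(1400, IPV4_HDR))  # 1348

# tcp_mtu_probe_floor derived from the IPv6 minimum MTU of 1280
print(mss_from_mtu(1280, IPV6_HDR))  # 1208
```

The same 1280-byte floor is applied to v4 paths here, as noted, even though RFC 4821 suggests different per-protocol minimums.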
Re: [PATCH v2 2/2] tcp: Update TCP_BASE_MSS comment
On 8/7/19 11:13 PM, Eric Dumazet wrote: On 8/8/19 1:52 AM, Josh Hunt wrote: TCP_BASE_MSS is used as the default initial MSS value when MTU probing is enabled. Update the comment to reflect this. Suggested-by: Neal Cardwell Signed-off-by: Josh Hunt --- Signed-off-by: Eric Dumazet Dave I forgot to tag these at net-next. Do I need to resubmit a v3 with net-next in the subject? Thanks Josh
Re: [PATCH v2 2/2] tcp: Update TCP_BASE_MSS comment
On 8/9/19 11:43 AM, David Miller wrote: From: Josh Hunt Date: Fri, 9 Aug 2019 11:38:05 -0700 I forgot to tag these at net-next. Do I need to resubmit a v3 with net-next in the subject? No need. Thanks
Re: [PATCH net-next] gso: enable udp gso for virtual devices
On 6/26/19 4:41 PM, Willem de Bruijn wrote: On Wed, Jun 26, 2019 at 3:17 PM Jason Baron wrote: On 6/14/19 4:53 PM, Jason Baron wrote: On 6/13/19 5:20 PM, Willem de Bruijn wrote: @@ -237,6 +237,7 @@ static inline int find_next_netdev_feature(u64 feature, unsigned long start) NETIF_F_GSO_GRE_CSUM | \ NETIF_F_GSO_IPXIP4 | \ NETIF_F_GSO_IPXIP6 | \ +NETIF_F_GSO_UDP_L4 | \ NETIF_F_GSO_UDP_TUNNEL | \ NETIF_F_GSO_UDP_TUNNEL_CSUM) Are you adding this to NETIF_F_GSO_ENCAP_ALL? Wouldn't it make more sense to add it to NETIF_F_GSO_SOFTWARE? Yes, I'm adding to NETIF_F_GSO_ENCAP_ALL (not very clear from the context). I will fix the commit log. In: 83aa025 udp: add gso support to virtual devices, the support was also added to NETIF_F_GSO_ENCAP_ALL (although subsequently reverted due to UDP GRO not being in place), so I wonder what the reason was for that? That was probably just a bad choice on my part. It worked in practice, but if NETIF_F_GSO_SOFTWARE works the same without unexpected side effects, then I agree that it is the better choice. That choice does appear to change behavior when sending over tunnel devices. Might it send tunneled GSO packets over loopback? I set up a test case using fou tunneling through a bridge device using the udpgso_bench_tx test where packets are not received correctly if NETIF_F_GSO_UDP_L4 is added to NETIF_F_GSO_SOFTWARE. If I have it added to NETIF_F_GSO_ENCAP_ALL, it does work correctly. So there are more fixes required to include it in NETIF_F_GSO_SOFTWARE. The use-case I have only requires it to be in NETIF_F_GSO_ENCAP_ALL, but if it needs to go in NETIF_F_GSO_SOFTWARE, I can look at what's required more next week. Hi, I haven't had a chance to investigate what goes wrong with including NETIF_F_GSO_UDP_L4 in NETIF_F_GSO_SOFTWARE - but I was just wondering if people are ok with NETIF_F_GSO_UDP_L4 being added to NETIF_F_GSO_ENCAP_ALL and not NETIF_F_GSO_SOFTWARE (ie the original patch as posted)? 
As I mentioned that is sufficient for my use-case, and its how Willem originally proposed this. Indeed, based on the previous discussion this sounds fine to me. Willem Are you OK to ACK this? If not, is there something else you'd rather see here? Thanks Josh
Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
On 4/2/21 12:25 PM, Jiri Kosina wrote: On Thu, 3 Sep 2020, John Fastabend wrote: At this point I fear we could consider reverting the NOLOCK stuff. I personally would hate doing so, but it looks like NOLOCK benefits are outweighed by its issues. I agree, NOLOCK brings more pains than gains. There are many race conditions hidden in generic qdisc layer, another one is enqueue vs. reset which is being discussed in another thread. Sure. Seems they crept in over time. I had some plans to write a lockless HTB implementation. But with fq+EDT with BPF it seems that it is no longer needed, we have a more generic/better solution. So I dropped it. Also most folks should really be using fq, fq_codel, etc. by default anyways. Using pfifo_fast alone is not ideal IMO. Half a year later, we still have the NOLOCK implementation present, and pfifo_fast still does set the TCQ_F_NOLOCK flag on itself. And we've just been bitten by this very same race which appears to be still unfixed, with single packet being stuck in pfifo_fast qdisc basically indefinitely due to this very race that this whole thread began with back in 2019. Unless there are (a) any nice ideas how to solve this in an elegant way without (re-)introducing extra spinlock (Cong's fix) or (b) any objections to revert as per the argumentation above I'll be happy to send a revert of the whole NOLOCK implementation next week. Jiri If you have a reproducer can you try https://lkml.org/lkml/2021/3/24/1485 ? If that doesn't work I think your suggestion of reverting nolock makes sense to me. We've moved to using fq as our default now b/c of this bug. Josh
Re: BUG: KASAN: stack-out-of-bounds in unwind_next_frame+0x1df5/0x2650
On Wed, Feb 03, 2021 at 09:46:55AM -0800, Ivan Babrou wrote: > > Can you pretty please not line-wrap console output? It's unreadable. > > GMail doesn't make it easy, I'll send a link to a pastebin next time. > Let me know if you'd like me to regenerate the decoded stack. > > > > edfd9b7838ba5e47f19ad8466d0565aba5c59bf0 is the first bad commit > > > commit edfd9b7838ba5e47f19ad8466d0565aba5c59bf0 > > > > Not sure what tree you're on, but that's not the upstream commit. > > I mentioned that it's a rebased core-static_call-2020-10-12 tag and > added a link to the upstream hash right below. > > > > Author: Steven Rostedt (VMware) > > > Date: Tue Aug 18 15:57:52 2020 +0200 > > > > > > tracepoint: Optimize using static_call() > > > > > > > There's a known issue with that patch, can you try: > > > > http://lkml.kernel.org/r/20210202220121.435051...@goodmis.org > > I've tried it on top of core-static_call-2020-10-12 tag rebased on top > of v5.9 (to make it reproducible), and the patch did not help. Do I > need to apply the whole series or something else? Can you recreate with this patch, and add "unwind_debug" to the cmdline? It will spit out a bunch of stack data. From: Josh Poimboeuf Subject: [PATCH] Subject: [PATCH] x86/unwind: Add 'unwind_debug' cmdline option Sometimes the one-line ORC unwinder warnings aren't very helpful. Take the existing frame pointer unwind_dump() and make it useful for all unwinders. I don't want to be too aggressive about enabling the dumps, so for now they're only enabled with the use of a new 'unwind_debug' cmdline option. When enabled, it will dump the full contents of the stack when an error condition is encountered, or when dump_stack() is called. 
Signed-off-by: Josh Poimboeuf --- .../admin-guide/kernel-parameters.txt | 6 +++ arch/x86/include/asm/unwind.h | 3 ++ arch/x86/kernel/dumpstack.c | 39 ++ arch/x86/kernel/unwind_frame.c| 51 +++ arch/x86/kernel/unwind_orc.c | 5 +- 5 files changed, 58 insertions(+), 46 deletions(-) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 3d6604a949f8..d29689aa62a2 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -5521,6 +5521,12 @@ unknown_nmi_panic [X86] Cause panic on unknown NMI. + unwind_debug[X86-64] + Enable unwinder debug output. This can be + useful for debugging certain unwinder error + conditions, including corrupt stacks and + bad/missing unwinder metadata. + usbcore.authorized_default= [USB] Default USB device authorization: (default -1 = authorized except for wireless USB, diff --git a/arch/x86/include/asm/unwind.h b/arch/x86/include/asm/unwind.h index 70fc159ebe69..5101d7ef7912 100644 --- a/arch/x86/include/asm/unwind.h +++ b/arch/x86/include/asm/unwind.h @@ -123,4 +123,7 @@ static inline bool task_on_another_cpu(struct task_struct *task) #endif } +extern bool unwind_debug __ro_after_init; +void unwind_dump(struct unwind_state *state); + #endif /* _ASM_X86_UNWIND_H */ diff --git a/arch/x86/kernel/dumpstack.c b/arch/x86/kernel/dumpstack.c index 299c20f0a38b..febfd5b7f62a 100644 --- a/arch/x86/kernel/dumpstack.c +++ b/arch/x86/kernel/dumpstack.c @@ -29,6 +29,42 @@ static int die_counter; static struct pt_regs exec_summary_regs; +bool unwind_debug __ro_after_init; +static int __init unwind_debug_cmdline(char *str) +{ + unwind_debug = true; + return 0; +} +early_param("unwind_debug", unwind_debug_cmdline); + +void unwind_dump(struct unwind_state *state) +{ + unsigned long word, *sp; + struct stack_info stack_info = {0}; + unsigned long visit_mask = 0; + + printk_deferred("unwinder dump: stack type:%d next_sp:%p mask:0x%lx 
graph_idx:%d\n", + state->stack_info.type, state->stack_info.next_sp, + state->stack_mask, state->graph_idx); + + sp = state->task == current ? __builtin_frame_address(0) + : (void *)state->task->thread.sp; + + for (; sp; sp = PTR_ALIGN(stack_info.next_sp, sizeof(long))) { + if (get_stack_info(sp, state->task, &stack_info, &visit_mask)) + break; + + for (; sp < stack_info.end; sp++) { + + word = READ_ONCE_NOCHECK(*sp); + + printk_deferred
Re: BUG: KASAN: stack-out-of-bounds in unwind_next_frame+0x1df5/0x2650
On Wed, Feb 03, 2021 at 02:41:53PM -0800, Ivan Babrou wrote: > On Wed, Feb 3, 2021 at 11:05 AM Josh Poimboeuf wrote: > > > > On Wed, Feb 03, 2021 at 09:46:55AM -0800, Ivan Babrou wrote: > > > > Can you pretty please not line-wrap console output? It's unreadable. > > > > > > GMail doesn't make it easy, I'll send a link to a pastebin next time. > > > Let me know if you'd like me to regenerate the decoded stack. > > > > > > > > edfd9b7838ba5e47f19ad8466d0565aba5c59bf0 is the first bad commit > > > > > commit edfd9b7838ba5e47f19ad8466d0565aba5c59bf0 > > > > > > > > Not sure what tree you're on, but that's not the upstream commit. > > > > > > I mentioned that it's a rebased core-static_call-2020-10-12 tag and > > > added a link to the upstream hash right below. > > > > > > > > Author: Steven Rostedt (VMware) > > > > > Date: Tue Aug 18 15:57:52 2020 +0200 > > > > > > > > > > tracepoint: Optimize using static_call() > > > > > > > > > > > > > There's a known issue with that patch, can you try: > > > > > > > > http://lkml.kernel.org/r/20210202220121.435051...@goodmis.org > > > > > > I've tried it on top of core-static_call-2020-10-12 tag rebased on top > > > of v5.9 (to make it reproducible), and the patch did not help. Do I > > > need to apply the whole series or something else? > > > > Can you recreate with this patch, and add "unwind_debug" to the cmdline? > > It will spit out a bunch of stack data. > > Here's the three I'm building: > > * https://github.com/bobrik/linux/tree/ivan/static-call-5.9 > > It contains: > > * v5.9 tag as the base > * static_call-2020-10-12 tag > * dm-crypt patches to reproduce the issue with KASAN > * x86/unwind: Add 'unwind_debug' cmdline option > * tracepoint: Fix race between tracing and removing tracepoint > > The very same issue can be reproduced on 5.10.11 with no patches, > but I'm going with 5.9, since it boils down to static call changes. 
> > Here's the decoded stack from the kernel with unwind debug enabled: > > * https://gist.github.com/bobrik/ed052ac0ae44c880f3170299ad4af56b > > See my first email for the exact commands that trigger this. Thanks. Do you happen to have the original dmesg, before running it through the post-processing script? I assume you're using decode_stacktrace.sh? It could use some improvement, it's stripping the function offset. Also spaces are getting inserted in odd places, messing the alignment. [ 137.291837][C0] 88809c409858: d7c4f3ce817a1700 (0xd7c4f3ce817a1700) [ 137.291837][C0] 88809c409860: (0x0) [ 137.291839][ C0] 88809c409868: (0x) [ 137.291841][ C0] 88809c409870: a4f01a52 unwind_next_frame (arch/x86/kernel/unwind_orc.c:380 arch/x86/kernel/unwind_orc.c:553) [ 137.291843][ C0] 88809c409878: a4f01a52 unwind_next_frame (arch/x86/kernel/unwind_orc.c:380 arch/x86/kernel/unwind_orc.c:553) [ 137.291844][C0] 88809c409880: 88809c409ac8 (0x88809c409ac8) [ 137.291845][C0] 88809c409888: 0086 (0x86) -- Josh
Re: BUG: KASAN: stack-out-of-bounds in unwind_next_frame+0x1df5/0x2650
On Wed, Feb 03, 2021 at 03:30:35PM -0800, Ivan Babrou wrote: > > > > Can you recreate with this patch, and add "unwind_debug" to the cmdline? > > > > It will spit out a bunch of stack data. > > > > > > Here's the three I'm building: > > > > > > * https://github.com/bobrik/linux/tree/ivan/static-call-5.9 > > > > > > It contains: > > > > > > * v5.9 tag as the base > > > * static_call-2020-10-12 tag > > > * dm-crypt patches to reproduce the issue with KASAN > > > * x86/unwind: Add 'unwind_debug' cmdline option > > > * tracepoint: Fix race between tracing and removing tracepoint > > > > > > The very same issue can be reproduced on 5.10.11 with no patches, > > > but I'm going with 5.9, since it boils down to static call changes. > > > > > > Here's the decoded stack from the kernel with unwind debug enabled: > > > > > > * https://gist.github.com/bobrik/ed052ac0ae44c880f3170299ad4af56b > > > > > > See my first email for the exact commands that trigger this. > > > > Thanks. Do you happen to have the original dmesg, before running it > > through the post-processing script? > > Yes, here it is: > > * https://gist.github.com/bobrik/8c13e6a02555fb21cadabb74cdd6f9ab It appears the unwinder is getting lost in crypto code. No idea what this has to do with static calls though. Or maybe you're seeing multiple issues. Does this fix it? 
diff --git a/arch/x86/crypto/Makefile b/arch/x86/crypto/Makefile
index a31de0c6ccde..36c55341137c 100644
--- a/arch/x86/crypto/Makefile
+++ b/arch/x86/crypto/Makefile
@@ -2,7 +2,14 @@
 #
 # x86 crypto algorithms
 
-OBJECT_FILES_NON_STANDARD := y
+OBJECT_FILES_NON_STANDARD_sha256-avx2-asm.o		:= y
+OBJECT_FILES_NON_STANDARD_sha512-ssse3-asm.o		:= y
+OBJECT_FILES_NON_STANDARD_sha512-avx-asm.o		:= y
+OBJECT_FILES_NON_STANDARD_sha512-avx2-asm.o		:= y
+OBJECT_FILES_NON_STANDARD_crc32c-pcl-intel-asm_64.o	:= y
+OBJECT_FILES_NON_STANDARD_camellia-aesni-avx2-asm_64.o	:= y
+OBJECT_FILES_NON_STANDARD_sha1_avx2_x86_64_asm.o	:= y
+OBJECT_FILES_NON_STANDARD_sha1_ni_asm.o			:= y
 
 obj-$(CONFIG_CRYPTO_GLUE_HELPER_X86) += glue_helper.o
 
diff --git a/arch/x86/crypto/aesni-intel_avx-x86_64.S b/arch/x86/crypto/aesni-intel_avx-x86_64.S
index 5fee47956f3b..59c36b88954f 100644
--- a/arch/x86/crypto/aesni-intel_avx-x86_64.S
+++ b/arch/x86/crypto/aesni-intel_avx-x86_64.S
@@ -237,8 +237,8 @@ define_reg j %j
 .noaltmacro
 .endm
 
-# need to push 4 registers into stack to maintain
-STACK_OFFSET = 8*4
+# need to push 5 registers into stack to maintain
+STACK_OFFSET = 8*5
 
 TMP1 =   16*0    # Temporary storage for AAD
 TMP2 =   16*1    # Temporary storage for AES State 2 (State 1 is stored in an XMM register)
@@ -257,6 +257,8 @@ VARIABLE_OFFSET = 16*8
 
 .macro FUNC_SAVE
         #the number of pushes must equal STACK_OFFSET
+        push    %rbp
+        mov     %rsp, %rbp
         push    %r12
         push    %r13
         push    %r14
@@ -271,12 +273,14 @@ VARIABLE_OFFSET = 16*8
 .endm
 
 .macro FUNC_RESTORE
+        add     $VARIABLE_OFFSET, %rsp
         mov     %r14, %rsp
 
         pop     %r15
         pop     %r14
         pop     %r13
         pop     %r12
+        pop     %rbp
 .endm
 
 # Encryption of a single block
Re: BUG: KASAN: stack-out-of-bounds in unwind_next_frame+0x1df5/0x2650
On Wed, Feb 03, 2021 at 04:52:42PM -0800, Ivan Babrou wrote: > We also have the following stack that doesn't touch any crypto: > > * https://gist.github.com/bobrik/40e2559add2f0b26ae39da30dc451f1e Can you also run this through decode_stacktrace.sh? Both are useful (until I submit a fix for decode_stacktrace.sh). > I cannot reproduce this one, and it took 2 days of uptime for it to > happen. Is there anything I can do to help diagnose it? Can you run with the same unwind_debug patch+cmdline when you try to recreate this one? In the meantime I'll look at the available data. -- Josh
Re: BUG: KASAN: stack-out-of-bounds in unwind_next_frame+0x1df5/0x2650
On Wed, Feb 03, 2021 at 09:44:48PM -0500, Steven Rostedt wrote: > > > [ 128.441287][C0] RIP: 0010:skcipher_walk_next > > > (crypto/skcipher.c:322 crypto/skcipher.c:384) > > Why do we have an RIP in skcipher_walk_next, if its the unwinder that > had a bug? Or are they related? > > Or did skcipher_walk_next trigger something in KASAN which did a stack > walk via the unwinder, and that caused another issue? It was interrupted by an IRQ, which then called kfree(), which then called kasan_save_stack(), which then called the unwinder, which then read "out-of-bounds" between stack frames. In this case it was because of some crypto code missing ORC annotations. > Looking at the unwinder code in question, we have: > > static bool deref_stack_regs(struct unwind_state *state, unsigned long addr, > unsigned long *ip, unsigned long *sp) > { > struct pt_regs *regs = (struct pt_regs *)addr; > > /* x86-32 support will be more complicated due to the ®s->sp hack > */ > BUILD_BUG_ON(IS_ENABLED(CONFIG_X86_32)); > > if (!stack_access_ok(state, addr, sizeof(struct pt_regs))) > return false; > > *ip = regs->ip; > *sp = regs->sp; <- pointer to here > return true; > } > > and the caller of the above static function: > > case UNWIND_HINT_TYPE_REGS: > if (!deref_stack_regs(state, sp, &state->ip, &state->sp)) { > orc_warn_current("can't access registers at %pB\n", > (void *)orig_ip); > goto err; > } > > > Could it possibly be that there's some magic canary on the stack that > causes KASAN to trigger if you read it? Right, the unwinder isn't allowed to read between stack frames. 
In fact, you read my mind, I was looking at the other warning in network code: [160676.598929][C4] asm_common_interrupt+0x1e/0x40 [160676.608966][C4] RIP: 0010:0xc17d814c [160676.618812][C4] Code: 8b 4c 24 40 4c 8b 44 24 48 48 8b 7c 24 70 48 8b 74 24 68 48 8b 54 24 60 48 8b 4c 24 58 48 8b 44 24 50 48 81 c4 a8 00 00 00 9d 20 27 af 8f ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 00 00 [160676.649371][C4] RSP: 0018:8893dfd4f620 EFLAGS: 0282 [160676.661073][C4] RAX: RBX: 8881be9c9c80 RCX: [160676.674788][C4] RDX: dc00 RSI: 000b RDI: 8881be9c9c80 [160676.688508][C4] RBP: 8881be9c9ce0 R08: R09: 8881908c4c97 [160676.702249][C4] R10: ed1032118992 R11: 88818a4ce68c R12: 8881be9c9eea [160676.716000][C4] R13: 8881be9c9c92 R14: 8880063ba5ac R15: 8880063ba5a8 [160676.729895][C4] ? tcp_set_state+0x5/0x620 [160676.740426][C4] ? tcp_fin+0xeb/0x5a0 [160676.750287][C4] ? tcp_data_queue+0x1e78/0x4ce0 [160676.761089][C4] ? tcp_urg+0x76/0xc50 This line gives a big clue: [160676.608966][C4] RIP: 0010:0xc17d814c That address, without a function name, most likely means that it was running in some generated code (mostly likely BPF) when it got interrupted. Right now, the ORC unwinder tries to fall back to frame pointers when it encounters generated code: orc = orc_find(state->signal ? state->ip : state->ip - 1); if (!orc) /* * As a fallback, try to assume this code uses a frame pointer. * This is useful for generated code, like BPF, which ORC * doesn't know about. This is just a guess, so the rest of * the unwind is no longer considered reliable. */ orc = &orc_fp_entry; state->error = true; } Because the ORC unwinder is guessing from that point onward, it's possible for it to read the KASAN stack redzone, if the generated code hasn't set up frame pointers. So the best fix may be for the unwinder to just always bypass KASAN when reading the stack. The unwinder has a mechanism for detecting and warning about out-of-bounds, and KASAN is short-circuiting that. 
This should hopefully get rid of *all* the KASAN unwinder warnings, both crypto and networking.

diff --git a/arch/x86/kernel/unwind_orc.c b/arch/x86/kernel/unwind_orc.c
index 040194d079b6..1f69a23a4715 100644
--- a/arch/x86/kernel/unwind_orc.c
+++ b/arch/x86/kernel/unwind_orc.c
@@ -376,8 +376,8 @@ static bool deref_stack_regs(struct unwind_state *state, unsigned long addr,
 	if (!stack_access_ok(state, addr, sizeof(struct pt_regs)))
 		return false;
 
-	*ip = regs->ip;
-	*sp = regs->sp;
+	*ip = READ_ONCE_NOCHECK(regs->ip);
+	*sp = READ_ONCE_NOCHECK(regs->sp);
 
 	return true;
 }
 
@@ -389,8 +389,8 @@ static bool deref_stack_iret_regs(struct unwind_state *state, unsigned long addr
 	if (!stack_access_ok(state, addr, IRET_FRAME_SIZE))
Re: BUG: KASAN: stack-out-of-bounds in unwind_next_frame+0x1df5/0x2650
On Thu, Feb 04, 2021 at 11:51:44AM -0800, Ivan Babrou wrote: > > .macro FUNC_SAVE > > #the number of pushes must equal STACK_OFFSET > > + push%rbp > > + mov %rsp, %rbp > > push%r12 > > push%r13 > > push%r14 > > @@ -271,12 +273,14 @@ VARIABLE_OFFSET = 16*8 > > .endm > > > > .macro FUNC_RESTORE > > +add $VARIABLE_OFFSET, %rsp > > mov %r14, %rsp > > > > pop %r15 > > pop %r14 > > pop %r13 > > pop %r12 > > + pop %rbp > > .endm > > > > # Encryption of a single block > > > > This patch seems to fix the following warning: > > [ 147.995699][C0] WARNING: stack going in the wrong direction? at > glue_xts_req_128bit+0x21f/0x6f0 [glue_helper] > > Or at least I cannot see it anymore when combined with your other > patch, not sure if it did the trick by itself. > > This sounds like a good reason to send them both. Ok, that's what I expected. The other patch fixed the unwinder failure mode to be the above (harmless) unwinder warning, instead of a disruptive KASAN failure. This patch fixes the specific underlying crypto unwinding metadata issue. I'll definitely be sending both fixes. The improved failure mode patch will come first because it's more urgent and lower risk. -- Josh
Re: Packet gets stuck in NOLOCK pfifo_fast qdisc
Hi Jike On 8/20/20 12:43 AM, Jike Song wrote: Hi Josh, We met possibly the same problem when testing nvidia/mellanox's GPUDirect RDMA product, we found that changing NET_SCH_DEFAULT to DEFAULT_FQ_CODEL mitigated the problem, having no idea why. Maybe you can also have a try? We also did something similar where we've switched over to using the fq scheduler everywhere for now. We believe the bug is in the nolock code which only pfifo_fast uses atm, but we've been unable to come up with a satisfactory solution. Besides, our testing is pretty complex, do you have a quick test to reproduce it? Unfortunately we don't have a simple test case either. Our current reproducer is complex as well, although it would seem like we should be able to come up with something where you have maybe 2 threads trying to send on the same tx queue running pfifo_fast every few hundred milliseconds and not much else/no other tx traffic on that queue. IIRC we believe the scenario is when one thread is in the process of dequeuing a packet while another is enqueuing, the enqueue-er (word? :)) sees the dequeue is in progress and so does not xmit the packet assuming the dequeue operation will take care of it. However b/c the dequeue is in the process of completing it doesn't and the newly enqueued packet stays in the qdisc until another packet is enqueued pushing both out. Given that we have a workaround with using fq or any other qdisc not named pfifo_fast this has gotten bumped down in priority for us. I would like to work on a reproducer at some point, but won't likely be for a few weeks :( Josh
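The suspected interleaving described above (enqueue-er sees the dequeue in progress and trusts it to flush, but the dequeue is already past its final empty check) can be modeled with a toy sketch. This is a deterministic illustration of the hypothesis, not the actual pfifo_fast/qdisc code; the `running` flag stands in for the qdisc's run-owner state:

```python
from collections import deque

class ToyNolockQdisc:
    """Toy model of the suspected NOLOCK race; 'running' marks a CPU
    currently draining the queue (roughly qdisc_run_begin/end)."""
    def __init__(self):
        self.queue = deque()
        self.running = False

    def enqueue_and_run(self, pkt, sent):
        self.queue.append(pkt)
        if self.running:
            # Another CPU owns the qdisc; trust it to flush our packet.
            return
        self.running = True
        while self.queue:
            sent.append(self.queue.popleft())
        self.running = False

sent = []
q = ToyNolockQdisc()

# Interleaving from the report: CPU A has drained the queue and is between
# its final empty check and clearing 'running' when CPU B enqueues.
q.running = True                 # CPU A: mid-dequeue, queue already empty
q.enqueue_and_run("pkt", sent)   # CPU B: sees running set, skips xmit
q.running = False                # CPU A: finishes without re-checking

print(sent)           # [] -- nothing was transmitted
print(list(q.queue))  # ['pkt'] -- stuck until the next enqueue pushes it out
```

The packet stays queued indefinitely unless a later enqueue restarts the dequeue loop, matching the observed behavior.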
[PATCH 5/9] x86/bpf: Support SIB byte generation
In preparation for using R12 indexing instructions in BPF JIT code, add support for generating the x86 SIB byte.

Signed-off-by: Josh Poimboeuf
---
 arch/x86/net/bpf_jit_comp.c | 69 ++++++++++++++++++++++++++++++-----------
 1 file changed, 54 insertions(+), 15 deletions(-)

diff --git a/arch/x86/net/bpf_jit_comp.c b/arch/x86/net/bpf_jit_comp.c
index 485692d4b163..e649f977f8e1 100644
--- a/arch/x86/net/bpf_jit_comp.c
+++ b/arch/x86/net/bpf_jit_comp.c
@@ -143,6 +143,12 @@ static bool is_axreg(u32 reg)
 	return reg == BPF_REG_0;
 }
 
+static bool is_sib_reg(u32 reg)
+{
+	/* R12 isn't used yet */
+	return false;
+}
+
 /* Add modifiers if 'reg' maps to x86-64 registers R8..R15 */
 static u8 add_1mod(u8 byte, u32 reg)
 {
@@ -779,10 +785,19 @@ static int do_jit(struct bpf_prog *bpf_prog, int *addrs, u8 *image,
 		case BPF_ST | BPF_MEM | BPF_DW:
 			EMIT2(add_1mod(0x48, dst_reg), 0xC7);
 
-st:			if (is_imm8(insn->off))
-				EMIT2(add_1reg(0x40, dst_reg), insn->off);
+st:
+			if (is_imm8(insn->off))
+				EMIT1(add_1reg(0x40, dst_reg));
+			else
+				EMIT1(add_1reg(0x80, dst_reg));
+
+			if (is_sib_reg(dst_reg))
+				EMIT1(add_1reg(0x20, dst_reg));
+
+			if (is_imm8(insn->off))
+				EMIT1(insn->off);
 			else
-				EMIT1_off32(add_1reg(0x80, dst_reg), insn->off);
+				EMIT(insn->off, 4);
 
 			EMIT(imm32, bpf_size_to_x86_bytes(BPF_SIZE(insn->code)));
 			break;
@@ -811,11 +826,19 @@ st:	if (is_imm8(insn->off))
 			goto stx;
 		case BPF_STX | BPF_MEM | BPF_DW:
 			EMIT2(add_2mod(0x48, dst_reg, src_reg), 0x89);
-stx:		if (is_imm8(insn->off))
-				EMIT2(add_2reg(0x40, dst_reg, src_reg), insn->off);
+stx:
+			if (is_imm8(insn->off))
+				EMIT1(add_2reg(0x40, dst_reg, src_reg));
 			else
-				EMIT1_off32(add_2reg(0x80, dst_reg, src_reg),
-					    insn->off);
+				EMIT1(add_2reg(0x80, dst_reg, src_reg));
+
+			if (is_sib_reg(dst_reg))
+				EMIT1(add_1reg(0x20, dst_reg));
+
+			if (is_imm8(insn->off))
+				EMIT1(insn->off);
+			else
+				EMIT(insn->off, 4);
 			break;
 
 			/* LDX: dst_reg = *(u8*)(src_reg + off) */
@@ -837,16 +860,24 @@ stx:	if (is_imm8(insn->off))
 		case BPF_LDX | BPF_MEM | BPF_DW:
 			/* Emit 'mov rax, qword ptr [rax+0x14]' */
 			EMIT2(add_2mod(0x48, src_reg, dst_reg), 0x8B);
-ldx:		/*
+ldx:
+			/*
 			 * If insn->off == 0 we can save one extra byte, but
 			 * special case of x86 R13 which always needs an offset
 			 * is not worth the hassle
 			 */
 			if (is_imm8(insn->off))
-				EMIT2(add_2reg(0x40, src_reg, dst_reg), insn->off);
+				EMIT1(add_2reg(0x40, src_reg, dst_reg));
+			else
+				EMIT1(add_2reg(0x80, src_reg, dst_reg));
+
+			if (is_sib_reg(src_reg))
+				EMIT1(add_1reg(0x20, src_reg));
+
+			if (is_imm8(insn->off))
+				EMIT1(insn->off);
 			else
-				EMIT1_off32(add_2reg(0x80, src_reg, dst_reg),
-					    insn->off);
+				EMIT(insn->off, 4);
 			break;
 
 			/* STX XADD: lock *(u32*)(dst_reg + off) += src_reg */
@@ -859,11 +890,19 @@ stx:	if (is_imm8(insn->off))
 			goto xadd;
 		case BPF_STX | BPF_XADD | BPF_DW:
 			EMIT3(0xF0, add_2mod(0x48, dst_reg, src_reg), 0x01);
-xadd:		if (is_imm8(insn->off))
-				EMIT2(add_2reg(0x40, dst_reg, src_reg), insn->off);
+xadd:
+			if (is_imm8(insn->off))
+				EMIT1(add_2reg(0x40, dst_reg, src_reg));
+			else
+				EMIT1(add_2reg(0x80, dst_reg, src_reg));
+
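The encoding scheme the patch implements can be sketched outside the JIT. This is a toy model for illustration only (REX prefixes are ignored): with mod=01 (8-bit displacement), a base register whose low three bits are 100b (RSP/R12) forces a SIB byte, which is exactly what the patch's `EMIT1(add_1reg(0x20, reg))` produces (0x20 | (reg & 7) = 0x24 for R12 with no index):

```python
def modrm(mod, reg, rm):
    # ModRM byte: mod (2 bits) | reg (3 bits) | rm (3 bits)
    return (mod << 6) | ((reg & 7) << 3) | (rm & 7)

def encode_mem_operand(reg, base, off):
    """ModRM/SIB/disp8 bytes for a [base + off8] memory operand.

    Hypothetical helper mirroring the patch's emit order: ModRM first,
    then a SIB byte only when the base's low bits are 100b (RSP/R12).
    """
    out = [modrm(0b01, reg, base)]             # mod=01: disp8 follows
    if base & 7 == 0b100:                      # RSP/R12 require a SIB byte
        out.append((0b100 << 3) | (base & 7))  # scale=0, index=100 (none)
    out.append(off & 0xff)
    return out

RAX, R12 = 0, 12

# [rax + 0x14]: no SIB byte needed
print([hex(b) for b in encode_mem_operand(RAX, RAX, 0x14)])
# ['0x40', '0x14']

# [r12 + 0x14]: r12's low bits are 100b, so SIB byte 0x24 is emitted
print([hex(b) for b in encode_mem_operand(RAX, R12, 0x14)])
# ['0x44', '0x24', '0x14']
```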
[PATCH RESEND 4.9.y] net: check before dereferencing netdev_ops during busy poll
init_dummy_netdev() leaves its netdev_ops pointer zeroed. This leads to a NULL pointer dereference when sk_busy_loop fires against an iwlwifi wireless adapter and checks napi->dev->netdev_ops->ndo_busy_poll. Avoid this by ensuring napi->dev->netdev_ops is valid before following the pointer, avoiding the following panic when busy polling on a dummy netdev: BUG: unable to handle kernel NULL pointer dereference at 00c8 IP: [] sk_busy_loop+0x92/0x2f0 Call Trace: [] ? uart_write_room+0x74/0xf0 [] sock_poll+0x99/0xa0 [] do_sys_poll+0x2e2/0x520 [] ? get_page_from_freelist+0x3bc/0xa30 [] ? update_curr+0x62/0x140 [] ? __slab_free+0xa1/0x2a0 [] ? __slab_free+0xa1/0x2a0 [] ? skb_free_head+0x21/0x30 [] ? poll_initwait+0x50/0x50 [] ? kmem_cache_free+0x1c6/0x1e0 [] ? uart_write+0x124/0x1d0 [] ? remove_wait_queue+0x4d/0x60 [] ? __wake_up+0x44/0x50 [] ? tty_write_unlock+0x31/0x40 [] ? tty_ldisc_deref+0x16/0x20 [] ? tty_write+0x1e0/0x2f0 [] ? process_echoes+0x80/0x80 [] ? __vfs_write+0x2b/0x130 [] ? vfs_write+0x15a/0x1a0 [] SyS_poll+0x75/0x100 [] entry_SYSCALL_64_fastpath+0x24/0xcf Commit 79e7fff47b7b ("net: remove support for per driver ndo_busy_poll()") indirectly fixed this upstream in linux-4.11 by removing the offending pointer usage. No other users of napi->dev touch its netdev_ops. Fixes: ce6aea93f751 ("net: network drivers no longer need to implement ndo_busy_poll()") # 4.9.y Signed-off-by: Josh Elsasser Reviewed-by: Eric Dumazet Tested-by: Matteo Croce --- No changes since V2[1], resent as per discussiond on -stable[2]. I hope this is the correct way to send net fixes for older LTS releases, I'm going off of the latest netdev FAQ: For earlier stable releases, each stable branch maintainer is supposed to take care of them. If you find any patch is missing from an earlier stable branch, please notify sta...@vger.kernel.org with either a commit ID or a formal patch backported, and CC Dave and other relevant networking developers. 
[1]: https://patchwork.ozlabs.org/patch/884986/
[2]: https://lore.kernel.org/stable/CAGnkfhx3ykbEsW+=FtpMFWU=_vnie7rppywpwqa1s1hpmxj...@mail.gmail.com/

 net/core/dev.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 4e10bae5e3da..f693afe608d7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5083,7 +5083,10 @@ bool sk_busy_loop(struct sock *sk, int nonblock)
 		goto out;
 
 	/* Note: ndo_busy_poll method is optional in linux-4.5 */
-	busy_poll = napi->dev->netdev_ops->ndo_busy_poll;
+	if (napi->dev->netdev_ops)
+		busy_poll = napi->dev->netdev_ops->ndo_busy_poll;
+	else
+		busy_poll = NULL;
 
 	do {
 		rc = 0;
-- 
2.20.1
[PATCH 4.4.y] net: check before dereferencing netdev_ops during busy poll
init_dummy_netdev() leaves its netdev_ops pointer zeroed. This leads to a NULL pointer dereference when sk_busy_loop fires against an iwlwifi wireless adapter and checks napi->dev->netdev_ops->ndo_busy_poll. Avoid this by ensuring napi->dev->netdev_ops is valid before following the pointer, avoiding the following panic when busy polling on a dummy netdev: BUG: unable to handle kernel NULL pointer dereference at 00c8 IP: [] sk_busy_loop+0x92/0x2f0 Call Trace: [] ? uart_write_room+0x74/0xf0 [] sock_poll+0x99/0xa0 [] do_sys_poll+0x2e2/0x520 [] ? get_page_from_freelist+0x3bc/0xa30 [] ? update_curr+0x62/0x140 [] ? __slab_free+0xa1/0x2a0 [] ? __slab_free+0xa1/0x2a0 [] ? skb_free_head+0x21/0x30 [] ? poll_initwait+0x50/0x50 [] ? kmem_cache_free+0x1c6/0x1e0 [] ? uart_write+0x124/0x1d0 [] ? remove_wait_queue+0x4d/0x60 [] ? __wake_up+0x44/0x50 [] ? tty_write_unlock+0x31/0x40 [] ? tty_ldisc_deref+0x16/0x20 [] ? tty_write+0x1e0/0x2f0 [] ? process_echoes+0x80/0x80 [] ? __vfs_write+0x2b/0x130 [] ? vfs_write+0x15a/0x1a0 [] SyS_poll+0x75/0x100 [] entry_SYSCALL_64_fastpath+0x24/0xcf Commit 79e7fff47b7b ("net: remove support for per driver ndo_busy_poll()") indirectly fixed this upstream in linux-4.11 by removing the offending pointer usage. No other users of napi->dev touch its netdev_ops. Fixes: 8b80cda536ea ("net: rename include/net/ll_poll.h to include/net/busy_poll.h") # 4.4.y Signed-off-by: Josh Elsasser --- This is a straightforward backport of the 4.9.y fix[1] for this crash, which doesn't apply to the older LTS releases. Only build-tested on 4.4.y, as I don't have access to wireless hardware and firmware that runs on older LTS kernels. 
[1]: https://lore.kernel.org/stable/20190701234143.72631-1-jelsas...@appneta.com/T/#u

 include/net/busy_poll.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/busy_poll.h b/include/net/busy_poll.h
index 1d67fb6b23a0..6d238506d49b 100644
--- a/include/net/busy_poll.h
+++ b/include/net/busy_poll.h
@@ -93,7 +93,7 @@ static inline bool sk_busy_loop(struct sock *sk, int nonblock)
 		goto out;

 	ops = napi->dev->netdev_ops;
-	if (!ops->ndo_busy_poll)
+	if (!ops || !ops->ndo_busy_poll)
 		goto out;

 	do {
-- 
2.20.1
Re: [PATCH RESEND 4.9.y] net: check before dereferencing netdev_ops during busy poll
On Jul 1, 2019, at 6:51 PM, David Miller wrote:

> I just tried to apply this with "git am" to the current v4.19 -stable
> branch and it failed.

This is only needed for the v4.9 stable kernel; ndo_busy_poll (and this NPE) went away in kernel 4.11. Sorry, I probably should have called that out more explicitly.
Re: [PATCH] rtl_nic: add firmware rtl8125a-3
On Mon, Aug 26, 2019 at 6:23 PM Heiner Kallweit wrote: > > This adds firmware rtl8125a-3 for Realtek's 2.5Gbps chip RTL8125. > > Signed-off-by: Heiner Kallweit > --- > Firmware file was provided by Realtek and they asked me to submit it. Can we get a Signed-off-by from someone at Realtek then? josh > The related extension to r8169 driver will be submitted in the next days. > --- > WHENCE| 3 +++ > rtl_nic/rtl8125a-3.fw | Bin 0 -> 3456 bytes > 2 files changed, 3 insertions(+) > create mode 100644 rtl_nic/rtl8125a-3.fw > > diff --git a/WHENCE b/WHENCE > index fb12924..dbec18a 100644 > --- a/WHENCE > +++ b/WHENCE > @@ -2906,6 +2906,9 @@ Version: 0.0.2 > File: rtl_nic/rtl8107e-2.fw > Version: 0.0.2 > > +File: rtl_nic/rtl8125a-3.fw > +Version: 0.0.1 > + > Licence: > * Copyright © 2011-2013, Realtek Semiconductor Corporation > * > diff --git a/rtl_nic/rtl8125a-3.fw b/rtl_nic/rtl8125a-3.fw > new file mode 100644 > index > ..fac635263f92e8d9734456b75932b2088edd5ef9 > GIT binary patch > literal 3456 > zcmb7G4@{Kj8Gr9M&hwp*MPE#webM6qsqQiuC(hXPi?+wDL#V(Z*4#K%FD{ > zd#PH+tTJbqZG?a`)*5G*m3F2zmN`7hx;2*IKiVu;;~Hwa_E@^5+Z^ooav$qyWa|!| > z{J!V^^FHtMe%~vE5S!~a<M6|pO=$6Z+-!F`d2|JjuT=?tX!6t > zo1eG1ysol-V@-KgU0(T?#@f6Efr9d!!2E*zz=G_mCu@CyK;goi!iD+b1yk6B2%b&6 > z7edS;xkzqO0?9-2l9EW0ltM}+rIFG}86+PmljJ95fhB~6Z~~0y3JYX}?Z^&0uqy1t > zny?FN!)}}nC*Ym12kkF<@t&E4_=v=ynF9AnDz2DmaE+uRv6r!*@!^m!6NmhMOzq8r > zcySgS;n{HZ&ViVjO+Em7C z8F2PIf~lcpz}RxchgOhZNnAy~1V! 
z;l`&BHaB5(q!u4F*5UicX4Dw$I > zZZG}Y@U*d?z6a>rPW?gZU!wjH^_^&crVIA-hauiRf}*mc@O*L%b>cXFD^8&Io|9br > zFS+(#A z<^?z-1MK-A9Fa>HXt;u_<`7nle1Pws`y;xZ4b$%$RvXt*Vtj-(#xP2a8yGS^#rwu* > z7&RlXNB)8`;|q+Lf8+C)SZDkL{T(+MYZPk@p$0naYDvhUdK;W-&~&K<5x2TvCa4ES > zJSq_Os=`o`>Ti(hqM4#RkyLfbOj8MwbamOxQ0|CNT`@D2E8 z9LCG0O+ej0lA~sw%T-;^=BaOn@)@g8tu_{^65~O&#t5oXW3d`Ciq!i?u^KfEWsf|f > z%8X@d%v{dr6>6QaQuTMNV*C=d)+lAYWhy1Kp7A%(ze4qPRH@`pHTfshkXfVRB2TgY > zO=`+Wt(q39Q=W61)m5XOc8&DkO5CPgp(bTNzg>y9p~8jDs$zJj5M4( > zG`vTNq`hitUz@`B_Nx)&fWp3Z<>))8?4g&GICDrH_jW38Z z6ix)=-SB%nIOXwTePI%gc_p@upI;Gdo~F;Tmw)5`y%^_39nX1Ki-J6~L1FKz7NSjv > z_`nkPz3?L$w%rohr-(vgBF1 z-4-4vre2o#VTHs{P{PCaz|dle45F5=t7KfRt3*ceh=r$#TAy7mkxtwHW#j)EHgmp) > zO)hO_GH>cb5{`=!(@#j)h_-jxgCneiD9Hcd;ix`Q>rX~yy2c{beS3_|m>9m87;BeK > z9^<~->L`kd5sZmZuw?QWNw>vliHU)j7&EQ4-f1m1#+c~6-j5U9k9~bn**#pV?$N|- > zq$Nr8T#$I{J?c3t2VJ-F9=Aj_^{tkE`!wz?XF~Vaag5Xw_4`&lQTP1kQQmM$_|B0( > zOkxdl@0Tb}k#O zV)t~$Ac`@GiSytUlb?qfh~`cEKhXXQ`gkQS+oQgJX8gniiK%+szlqBBQMQ+LjIoYA > z7Pea0V&QHJcUss?e1U!-enM;`#@W7FhmW$!&b34|Z>oiU3%_IGY6~}8_<=_875$5K > z-p<>elW^0nO-Xb$v)?-<5?*G%4-kJO5#)Zu+T&X8&rjZA4DRX1phN*@#Ln3Z5`&x> > zM^Cgz;x5{-ci}0VyQ8FT#^zi&IL{9H?xHU!(>!!e${wT4$GqBa3H>xiG*Va3d7hBl > z$-lAV)_T3WsTa}QwwT;{)^yFKtnb^bPxFJExzl=W&oejIewS6dj^AhH>tVh&*5~e` > zja9dg7?{KsZ_$^!dgjpQL9gf03g&nvnzMv6vp=S9DYVsnn`y<1qkR|cAF7hLwo&4_ > z$0b%#ug_6NWfb*$Il}q#a(!Ob6=ePMXrpB-v<#HJEP3C!v!5@ zAjU7*@~eyS)8C3a`2}PA^u1Eoi5NfK?_u^^&&MtM&ozI+{=2wFEidM}?R=ite~tg7 > zpYGYyoP*`0xugV=o@1RyFwcMDQ>Obe;jicCb=vBh2MhU4 z8^49!$+$sGAily`auZvb-$fkYEIElgD0dJaC)$bQXUsw`@urBL? z@#-RprR2w-mqwqPKV$B{bA8MiN1HEyEpap~@gZZ_*el!TTw^JF*(K5a0Q2#@rtsUg > zt6SovowDhaJ zME%~Yvas61t;E=S%Z@SO6V|;&h-h!3cbdD!=(z6g@jH#a_vpS&+;={={DQpi2 DeEY6F > > literal 0 > HcmV?d1 > > -- > 2.23.0 >
[PATCH iproute2-next] ss: add option to print socket information on one line
Multi-line output in ss makes it difficult to search for things with grep. This new option will make it easier to find sockets matching certain criteria with simple grep commands. Example without option: $ ss -emoitn State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 127.0.0.1:13265 127.0.0.1:36743 uid:1974 ino:48271 sk:1 <-> skmem:(r0,rb2227595,t0,tb2626560,f0,w0,o0,bl0,d0) ts sack reno wscale:7,7 rto:211 rtt:10.245/16.616 ato:40 mss:65483 cwnd:10 bytes_acked:41865496 bytes_received:21580440 segs_out:242496 segs_in:351446 data_segs_out:242495 data_segs_in:242495 send 511.3Mbps lastsnd:2383 lastrcv:2383 lastack:2342 pacing_rate 1022.6Mbps rcv_rtt:92427.6 rcv_space:43725 minrtt:0.007 Example with new option: $ ss -emoitnO StateRecv-Q Send-Q Local Address:PortPeer Address:Port ESTAB0 0 127.0.0.1:13265 127.0.0.1:36743 uid:1974 ino:48271 sk:1 <-> skmem:(r0,rb2227595,t0,tb2626560,f0,w0,o0,bl0,d0) ts sack reno wscale:7,7 rto:211 rtt:10.067/16.429 ato:40 mss:65483 pmtu:65535 rcvmss:536 advmss:65483 cwnd:10 bytes_sent:41868244 bytes_acked:41868244 bytes_received:21581866 segs_out:242512 segs_in:351469 data_segs_out:242511 data_segs_in:242511 send 520.4Mbps lastsnd:14355 lastrcv:14355 lastack:14314 pacing_rate 1040.7Mbps delivery_rate 74837.7Mbps delivered:242512 app_limited busy:1861946ms rcv_rtt:92427.6 rcv_space:43725 rcv_ssthresh:43690 minrtt:0.007 Signed-off-by: Josh Hunt --- man/man8/ss.8 | 3 +++ misc/ss.c | 51 +-- 2 files changed, 44 insertions(+), 10 deletions(-) diff --git a/man/man8/ss.8 b/man/man8/ss.8 index 03a3dcc6c9c3..dd5ebc96ec03 100644 --- a/man/man8/ss.8 +++ b/man/man8/ss.8 @@ -24,6 +24,9 @@ Output version information. .B \-H, \-\-no-header Suppress header line. .TP +.B \-O, \-\-one-line +Print each socket's data on a single line. +.TP .B \-n, \-\-numeric Do not try to resolve service names. 
.TP diff --git a/misc/ss.c b/misc/ss.c index 9cb3ee19e542..3a193db7d17f 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -121,6 +121,7 @@ static int follow_events; static int sctp_ino; static int show_tipcinfo; static int show_tos; +int oneline = 0; enum col_id { COL_NETID, @@ -3053,7 +3054,8 @@ static int inet_show_sock(struct nlmsghdr *nlh, } if (show_mem || (show_tcpinfo && s->type != IPPROTO_UDP)) { - out("\n\t"); + if (!oneline) + out("\n\t"); if (s->type == IPPROTO_SCTP) sctp_show_info(nlh, r, tb); else @@ -3973,7 +3975,10 @@ static int packet_show_sock(struct nlmsghdr *nlh, void *arg) if (show_details) { if (pinfo) { - out("\n\tver:%d", pinfo->pdi_version); + if (oneline) + out(" ver:%d", pinfo->pdi_version); + else + out("\n\tver:%d", pinfo->pdi_version); out(" cpy_thresh:%d", pinfo->pdi_copy_thresh); out(" flags( "); if (pinfo->pdi_flags & PDI_RUNNING) @@ -3991,19 +3996,28 @@ static int packet_show_sock(struct nlmsghdr *nlh, void *arg) out(" )"); } if (ring_rx) { - out("\n\tring_rx("); + if (oneline) + out(" ring_rx("); + else + out("\n\tring_rx("); packet_show_ring(ring_rx); out(")"); } if (ring_tx) { - out("\n\tring_tx("); + if (oneline) + out(" ring_tx("); + else + out("\n\tring_tx("); packet_show_ring(ring_tx); out(")"); } if (has_fanout) { uint16_t type = (fanout >> 16) & 0x; - out("\n\tfanout("); + if (oneline) + out(" fanout("); + else + out("\n\tfanout("); out("id:%d,", fanout & 0x); out("type:"); @@ -4032,7 +4046,10 @@ static int packet_show_sock(struct nlmsghdr *nlh, void *arg) int num = RTA_PAYLOAD(tb[PACKET_DIAG_FILTER]) /
Re: [PATCH iproute2-next] ss: add option to print socket information on one line
On 4/25/19 3:59 PM, Stephen Hemminger wrote: On Thu, 25 Apr 2019 17:21:48 -0400 Josh Hunt wrote: Multi-line output in ss makes it difficult to search for things with grep. This new option will make it easier to find sockets matching certain criteria with simple grep commands. Example without option: $ ss -emoitn State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 127.0.0.1:13265 127.0.0.1:36743 uid:1974 ino:48271 sk:1 <-> skmem:(r0,rb2227595,t0,tb2626560,f0,w0,o0,bl0,d0) ts sack reno wscale:7,7 rto:211 rtt:10.245/16.616 ato:40 mss:65483 cwnd:10 bytes_acked:41865496 bytes_received:21580440 segs_out:242496 segs_in:351446 data_segs_out:242495 data_segs_in:242495 send 511.3Mbps lastsnd:2383 lastrcv:2383 lastack:2342 pacing_rate 1022.6Mbps rcv_rtt:92427.6 rcv_space:43725 minrtt:0.007 Example with new option: $ ss -emoitnO StateRecv-Q Send-Q Local Address:PortPeer Address:Port ESTAB0 0 127.0.0.1:13265 127.0.0.1:36743 uid:1974 ino:48271 sk:1 <-> skmem:(r0,rb2227595,t0,tb2626560,f0,w0,o0,bl0,d0) ts sack reno wscale:7,7 rto:211 rtt:10.067/16.429 ato:40 mss:65483 pmtu:65535 rcvmss:536 advmss:65483 cwnd:10 bytes_sent:41868244 bytes_acked:41868244 bytes_received:21581866 segs_out:242512 segs_in:351469 data_segs_out:242511 data_segs_in:242511 send 520.4Mbps lastsnd:14355 lastrcv:14355 lastack:14314 pacing_rate 1040.7Mbps delivery_rate 74837.7Mbps delivered:242512 app_limited busy:1861946ms rcv_rtt:92427.6 rcv_space:43725 rcv_ssthresh:43690 minrtt:0.007 Signed-off-by: Josh Hunt I agree with this, but probably time to give ss a json output as well. OK sure. Do you want to take both in a series or will you take this now with a separate json patch later? Josh
Re: [net-next 01/12] i40e: replace switch-statement to speed-up retpoline-enabled builds
On Apr 29, 2019, at 12:16 PM, Jeff Kirsher wrote:

> From: Björn Töpel
>
> GCC will generate jump tables for switch-statements with more than 5
> case statements. An entry into the jump table is an indirect call,
> which means that for CONFIG_RETPOLINE builds, this is rather
> expensive.
>
> This commit replaces the switch-statement that acts on the XDP program
> result with an if-clause.

Apologies for the noise, but is this patch still required after the recent threshold bump[0] and later removal[1] of switch-case jump table generation when building with CONFIG_RETPOLINE?

[0]: https://lore.kernel.org/patchwork/patch/1044863/
[1]: https://lore.kernel.org/patchwork/patch/1054472/

If nothing else the commit message no longer seems accurate.

Regards,

--
Josh
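For context, the trade-off under discussion can be sketched in plain C. This is an illustration only, not the i40e code: GCC may lower a switch with enough cases to a jump table (an indirect branch, which retpolines make expensive), while the equivalent if-chain compiles to direct, predictable branches. Both functions compute the same result; only the generated control flow differs.

```c
#include <assert.h>

/* Hypothetical action codes, standing in for XDP program results. */
enum act { ACT_PASS, ACT_TX, ACT_REDIRECT, ACT_DROP, ACT_ABORTED, ACT_UNKNOWN };

/* Six cases: enough that GCC may emit a jump table for this switch. */
int handle_switch(enum act a)
{
	switch (a) {
	case ACT_PASS:		return 0;
	case ACT_TX:		return 1;
	case ACT_REDIRECT:	return 2;
	case ACT_DROP:		return 3;
	case ACT_ABORTED:	return 4;
	default:		return -1;
	}
}

/* Same semantics as an if-chain: only direct conditional branches. */
int handle_if(enum act a)
{
	if (a == ACT_PASS)
		return 0;
	if (a == ACT_TX)
		return 1;
	if (a == ACT_REDIRECT)
		return 2;
	if (a == ACT_DROP)
		return 3;
	if (a == ACT_ABORTED)
		return 4;
	return -1;
}
```

Whether the switch actually becomes a jump table depends on the compiler version and flags, which is exactly why the question above about newer GCC behavior matters.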
Re: [PATCH iproute2-next] ss: add option to print socket information on one line
On 4/30/19 11:30 AM, David Ahern wrote: On 4/25/19 3:21 PM, Josh Hunt wrote: @@ -4877,6 +4903,7 @@ static void _usage(FILE *dest) "\n" " -K, --kill forcibly close sockets, display what was closed\n" " -H, --no-header Suppress header line\n" +" -O, --one-line socket's data printed on a single line\n" "\n" " -A, --query=QUERY, --socket=QUERY\n" " QUERY := {all|inet|tcp|udp|raw|unix|unix_dgram|unix_stream|unix_seqpacket|packet|netlink|vsock_stream|vsock_dgram|tipc}[,QUERY]\n" @@ -5003,6 +5030,7 @@ static const struct option long_opts[] = { { "kill", 0, 0, 'K' }, { "no-header", 0, 0, 'H' }, { "xdp", 0, 0, OPT_XDPSOCK}, + { "one-line", 0, 0, 'O' }, shame 'o' can not be used for consistency with ip, but we can have both use 'oneline' as the long option without the '-'. Yeah, thanks David. I saw that the other tools use --oneline, so I'll send an update with that changed. Josh
Re: [PATCH iproute2-next] ss: add option to print socket information on one line
On 4/30/19 11:31 AM, Josh Hunt wrote: On 4/30/19 11:30 AM, David Ahern wrote: On 4/25/19 3:21 PM, Josh Hunt wrote: @@ -4877,6 +4903,7 @@ static void _usage(FILE *dest) "\n" " -K, --kill forcibly close sockets, display what was closed\n" " -H, --no-header Suppress header line\n" +" -O, --one-line socket's data printed on a single line\n" "\n" " -A, --query=QUERY, --socket=QUERY\n" " QUERY := {all|inet|tcp|udp|raw|unix|unix_dgram|unix_stream|unix_seqpacket|packet|netlink|vsock_stream|vsock_dgram|tipc}[,QUERY]\n" @@ -5003,6 +5030,7 @@ static const struct option long_opts[] = { { "kill", 0, 0, 'K' }, { "no-header", 0, 0, 'H' }, { "xdp", 0, 0, OPT_XDPSOCK}, + { "one-line", 0, 0, 'O' }, shame 'o' can not be used for consistency with ip, but we can have both use 'oneline' as the long option without the '-'. Yeah, thanks David. I saw that the other tools use --oneline, so I'll send an update with that changed. Actually, David can you clarify what you meant by "use 'oneline' as the long option without the '-'."? Thanks Josh
Re: [PATCH iproute2-next] ss: add option to print socket information on one line
On 4/30/19 12:41 PM, David Ahern wrote:
> On 4/30/19 12:55 PM, Josh Hunt wrote:
>> Actually, David can you clarify what you meant by "use 'oneline' as the
>> long option without the '-'."?
>
> for your patch: 1,$s/one-line/oneline/
>
> ip has -oneline which is most likely used as 'ip -o'. having ss with
> --one-line vs --oneline is at least consistent to the level it can be.

OK, that's what I thought you meant, but wanted to confirm. Thanks!

Josh
[PATCH v2] ss: add option to print socket information on one line
Multi-line output in ss makes it difficult to search for things with grep. This new option will make it easier to find sockets matching certain criteria with simple grep commands. Example without option: $ ss -emoitn State Recv-Q Send-Q Local Address:Port Peer Address:Port ESTAB 0 0 127.0.0.1:13265 127.0.0.1:36743 uid:1974 ino:48271 sk:1 <-> skmem:(r0,rb2227595,t0,tb2626560,f0,w0,o0,bl0,d0) ts sack reno wscale:7,7 rto:211 rtt:10.245/16.616 ato:40 mss:65483 cwnd:10 bytes_acked:41865496 bytes_received:21580440 segs_out:242496 segs_in:351446 data_segs_out:242495 data_segs_in:242495 send 511.3Mbps lastsnd:2383 lastrcv:2383 lastack:2342 pacing_rate 1022.6Mbps rcv_rtt:92427.6 rcv_space:43725 minrtt:0.007 Example with new option: $ ss -emoitnO StateRecv-Q Send-Q Local Address:PortPeer Address:Port ESTAB0 0 127.0.0.1:13265 127.0.0.1:36743 uid:1974 ino:48271 sk:1 <-> skmem:(r0,rb2227595,t0,tb2626560,f0,w0,o0,bl0,d0) ts sack reno wscale:7,7 rto:211 rtt:10.067/16.429 ato:40 mss:65483 pmtu:65535 rcvmss:536 advmss:65483 cwnd:10 bytes_sent:41868244 bytes_acked:41868244 bytes_received:21581866 segs_out:242512 segs_in:351469 data_segs_out:242511 data_segs_in:242511 send 520.4Mbps lastsnd:14355 lastrcv:14355 lastack:14314 pacing_rate 1040.7Mbps delivery_rate 74837.7Mbps delivered:242512 app_limited busy:1861946ms rcv_rtt:92427.6 rcv_space:43725 rcv_ssthresh:43690 minrtt:0.007 Signed-off-by: Josh Hunt --- v1 -> v2 * Update long option to --oneline to match other ip tools as per David. man/man8/ss.8 | 3 +++ misc/ss.c | 51 +-- 2 files changed, 44 insertions(+), 10 deletions(-) diff --git a/man/man8/ss.8 b/man/man8/ss.8 index 03a3dcc6c9c3..9054fab9be69 100644 --- a/man/man8/ss.8 +++ b/man/man8/ss.8 @@ -24,6 +24,9 @@ Output version information. .B \-H, \-\-no-header Suppress header line. .TP +.B \-O, \-\-oneline +Print each socket's data on a single line. +.TP .B \-n, \-\-numeric Do not try to resolve service names. 
.TP diff --git a/misc/ss.c b/misc/ss.c index 9cb3ee19e542..e8e7b62eb4a5 100644 --- a/misc/ss.c +++ b/misc/ss.c @@ -121,6 +121,7 @@ static int follow_events; static int sctp_ino; static int show_tipcinfo; static int show_tos; +int oneline = 0; enum col_id { COL_NETID, @@ -3053,7 +3054,8 @@ static int inet_show_sock(struct nlmsghdr *nlh, } if (show_mem || (show_tcpinfo && s->type != IPPROTO_UDP)) { - out("\n\t"); + if (!oneline) + out("\n\t"); if (s->type == IPPROTO_SCTP) sctp_show_info(nlh, r, tb); else @@ -3973,7 +3975,10 @@ static int packet_show_sock(struct nlmsghdr *nlh, void *arg) if (show_details) { if (pinfo) { - out("\n\tver:%d", pinfo->pdi_version); + if (oneline) + out(" ver:%d", pinfo->pdi_version); + else + out("\n\tver:%d", pinfo->pdi_version); out(" cpy_thresh:%d", pinfo->pdi_copy_thresh); out(" flags( "); if (pinfo->pdi_flags & PDI_RUNNING) @@ -3991,19 +3996,28 @@ static int packet_show_sock(struct nlmsghdr *nlh, void *arg) out(" )"); } if (ring_rx) { - out("\n\tring_rx("); + if (oneline) + out(" ring_rx("); + else + out("\n\tring_rx("); packet_show_ring(ring_rx); out(")"); } if (ring_tx) { - out("\n\tring_tx("); + if (oneline) + out(" ring_tx("); + else + out("\n\tring_tx("); packet_show_ring(ring_tx); out(")"); } if (has_fanout) { uint16_t type = (fanout >> 16) & 0x; - out("\n\tfanout("); + if (oneline) + out(" fanout("); + else + out("\n\tfanout("); out("id:%d,", fanout & 0x); out("type:"); @@ -4032,7 +4046,10 @@ static int packet_show_sock(struct nlmsghdr *nlh, void *arg)
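As a side note on the diff above: the oneline decision is open-coded as an if/else at every output site. A small helper could make that choice in one place. The sketch below is a hypothetical userspace reduction (out_block() is not part of the actual patch, and the real ss uses a more elaborate column-buffer out()):

```c
#include <assert.h>
#include <string.h>

/* Minimal stand-ins for ss's global flag and output buffer. */
static int oneline;
static char outbuf[256];

static void out(const char *s)
{
	/* Append to the buffer, leaving room for the terminating NUL. */
	strncat(outbuf, s, sizeof(outbuf) - strlen(outbuf) - 1);
}

/* One place that decides between " " (single-line mode) and "\n\t"
 * (the usual indented continuation), instead of an if/else per site. */
static void out_block(const char *label)
{
	out(oneline ? " " : "\n\t");
	out(label);
}
```

Each `if (oneline) out(" ...") else out("\n\t...")` site in the diff would collapse to a single `out_block("ring_rx(")`-style call.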
Re: [PATCH] can: check for null sk before deferencing it via the call to sock_net
On Fri, Sep 8, 2017 at 1:46 PM, Oliver Hartkopp wrote: > > > On 09/08/2017 05:02 PM, Colin King wrote: >> >> From: Colin Ian King >> >> The assignment of net via call sock_net will dereference sk. This >> is performed before a sanity null check on sk, so there could be >> a potential null dereference on the sock_net call if sk is null. >> Fix this by assigning net after the sk null check. Also replace >> the sk == NULL with the more usual !sk idiom. >> >> Detected by CoverityScan CID#1431862 ("Dereference before null check") >> >> Fixes: 384317ef4187 ("can: network namespace support for CAN_BCM >> protocol") >> Signed-off-by: Colin Ian King > > > Acked-by: Oliver Hartkopp I don't see this one queued up in the net or net-next trees. Did it fall through the cracks or did it get queued up elsewhere? Seems like it's a good candidate to get into 4.14? josh > > > Thanks Collin! > > >> --- >> net/can/bcm.c | 5 +++-- >> 1 file changed, 3 insertions(+), 2 deletions(-) >> >> diff --git a/net/can/bcm.c b/net/can/bcm.c >> index 47a8748d953a..a3791674b8ce 100644 >> --- a/net/can/bcm.c >> +++ b/net/can/bcm.c >> @@ -1493,13 +1493,14 @@ static int bcm_init(struct sock *sk) >> static int bcm_release(struct socket *sock) >> { >> struct sock *sk = sock->sk; >> - struct net *net = sock_net(sk); >> + struct net *net; >> struct bcm_sock *bo; >> struct bcm_op *op, *next; >> - if (sk == NULL) >> + if (!sk) >> return 0; >> + net = sock_net(sk); >> bo = bcm_sk(sk); >> /* remove bcm_ops, timer, rx_unregister(), etc. */ >> >
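The fix above boils down to an ordering rule: nothing that dereferences sk may run before the NULL check. A minimal userspace sketch of the same shape (the mock_* names are hypothetical stand-ins, not kernel API):

```c
#include <assert.h>
#include <stddef.h>

struct mock_net { int id; };
struct mock_sock { struct mock_net *net; };

/* Unconditionally dereferences its argument, like sock_net(sk). */
static struct mock_net *mock_net_of(struct mock_sock *sk)
{
	return sk->net;
}

/* Mirrors the fixed bcm_release(): check sk first, only then call the
 * helper that dereferences it. Initializing net at declaration time,
 * as the buggy version did, would dereference a NULL sk. */
int mock_release(struct mock_sock *sk)
{
	struct mock_net *net;

	if (!sk)
		return 0;

	net = mock_net_of(sk);
	return net->id;
}
```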
Re: [PATCH 0/3] Remove accidental VLA usage
On Wed, Mar 07, 2018 at 07:30:44PM -0800, Kees Cook wrote: > This series adds SIMPLE_MAX() to be used in places where a stack array > is actually fixed, but the compiler still warns about VLA usage due to > confusion caused by the safety checks in the max() macro. > > I'm sending these via -mm since that's where I've introduced SIMPLE_MAX(), > and they should all have no operational differences. What if we instead simplify the max() macro's type checking so that GCC can more easily fold the array size constants? The below patch seems to work: diff --git a/include/linux/kernel.h b/include/linux/kernel.h index 3fd291503576..ec863726da29 100644 --- a/include/linux/kernel.h +++ b/include/linux/kernel.h @@ -782,42 +782,32 @@ ftrace_vprintk(const char *fmt, va_list ap) static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { } #endif /* CONFIG_TRACING */ -/* - * min()/max()/clamp() macros that also do - * strict type-checking.. See the - * "unnecessary" pointer comparison. - */ -#define __min(t1, t2, min1, min2, x, y) ({ \ - t1 min1 = (x); \ - t2 min2 = (y); \ - (void) (&min1 == &min2);\ - min1 < min2 ? min1 : min2; }) +extern long __error_incompatible_types_in_min_macro; +extern long __error_incompatible_types_in_max_macro; + +#define __min(t1, t2, x, y)\ + __builtin_choose_expr(__builtin_types_compatible_p(t1, t2), \ + (t1)(x) < (t2)(y) ? (t1)(x) : (t2)(y),\ + (t1)__error_incompatible_types_in_min_macro) /** * min - return minimum of two values of the same or compatible types * @x: first value * @y: second value */ -#define min(x, y) \ - __min(typeof(x), typeof(y), \ - __UNIQUE_ID(min1_), __UNIQUE_ID(min2_), \ - x, y) +#define min(x, y) __min(typeof(x), typeof(y), x, y)\ -#define __max(t1, t2, max1, max2, x, y) ({ \ - t1 max1 = (x); \ - t2 max2 = (y); \ - (void) (&max1 == &max2);\ - max1 > max2 ? max1 : max2; }) +#define __max(t1, t2, x, y)\ + __builtin_choose_expr(__builtin_types_compatible_p(t1, t2), \ + (t1)(x) > (t2)(y) ? 
(t1)(x) : (t2)(y),\ + (t1)__error_incompatible_types_in_max_macro) /** * max - return maximum of two values of the same or compatible types * @x: first value * @y: second value */ -#define max(x, y) \ - __max(typeof(x), typeof(y), \ - __UNIQUE_ID(max1_), __UNIQUE_ID(max2_), \ - x, y) +#define max(x, y) __max(typeof(x), typeof(y), x, y) /** * min3 - return minimum of three values @@ -869,10 +859,7 @@ static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { } * @x: first value * @y: second value */ -#define min_t(type, x, y) \ - __min(type, type, \ - __UNIQUE_ID(min1_), __UNIQUE_ID(min2_), \ - x, y) +#define min_t(type, x, y) __min(type, type, x, y) /** * max_t - return maximum of two values, using the specified type @@ -880,10 +867,7 @@ static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { } * @x: first value * @y: second value */ -#define max_t(type, x, y) \ - __max(type, type, \ - __UNIQUE_ID(min1_), __UNIQUE_ID(min2_), \ - x, y) +#define max_t(type, x, y) __max(type, type, x, y) \ /** * clamp_t - return a value clamped to a given range using a given type
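To see why this removes the false-positive VLA warnings, here is the same idea reduced to a hedged userspace demo (simple_max/MAX_CONST are illustrative names, not the proposed kernel macros): because __builtin_choose_expr() selects its branch at parse time and the chosen branch contains no statement expression, the result stays an integer constant expression and can size a true fixed-size array.

```c
#include <assert.h>

/* Referencing this symbol in the rejected branch turns a type mismatch
 * into a link error, loosely mimicking the patch's error externs. */
extern long __simple_max_type_mismatch;

#define simple_max(t1, t2, x, y)					\
	__builtin_choose_expr(__builtin_types_compatible_p(t1, t2),	\
		(t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y),			\
		__simple_max_type_mismatch)

#define MAX_CONST(x, y) simple_max(__typeof__(x), __typeof__(y), x, y)

int buf_size(void)
{
	/* Constant-folded by the compiler: a real fixed-size array,
	 * not a VLA, so -Wvla stays quiet. */
	char buf[MAX_CONST(16, 32)];

	return (int)sizeof(buf);
}
```

The old ({ t1 min1 = (x); ... }) form forced a statement expression, which can never be a constant expression, hence the spurious VLA diagnostics.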
Re: [PATCH 0/3] Remove accidental VLA usage
On Thu, Mar 08, 2018 at 10:02:01AM -0800, Kees Cook wrote: > On Thu, Mar 8, 2018 at 7:02 AM, Josh Poimboeuf wrote: > > On Wed, Mar 07, 2018 at 07:30:44PM -0800, Kees Cook wrote: > >> This series adds SIMPLE_MAX() to be used in places where a stack array > >> is actually fixed, but the compiler still warns about VLA usage due to > >> confusion caused by the safety checks in the max() macro. > >> > >> I'm sending these via -mm since that's where I've introduced SIMPLE_MAX(), > >> and they should all have no operational differences. > > > > What if we instead simplify the max() macro's type checking so that GCC > > can more easily fold the array size constants? The below patch seems to > > work: > > Oh magic! Very nice. I couldn't figure out how to do this when I > stared at it. Yes, let me respin. (I assume I can add your S-o-b?) I'm going to be traveling for a few days, so I bequeath the patch to you. You can add my SOB. I agree with Steve's suggestion to run it through 0-day. > > -Kees > > > > > diff --git a/include/linux/kernel.h b/include/linux/kernel.h > > index 3fd291503576..ec863726da29 100644 > > --- a/include/linux/kernel.h > > +++ b/include/linux/kernel.h > > @@ -782,42 +782,32 @@ ftrace_vprintk(const char *fmt, va_list ap) > > static inline void ftrace_dump(enum ftrace_dump_mode oops_dump_mode) { } > > #endif /* CONFIG_TRACING */ > > > > -/* > > - * min()/max()/clamp() macros that also do > > - * strict type-checking.. See the > > - * "unnecessary" pointer comparison. > > - */ > > -#define __min(t1, t2, min1, min2, x, y) ({ \ > > - t1 min1 = (x); \ > > - t2 min2 = (y); \ > > - (void) (&min1 == &min2);\ > > - min1 < min2 ? min1 : min2; }) > > +extern long __error_incompatible_types_in_min_macro; > > +extern long __error_incompatible_types_in_max_macro; > > + > > +#define __min(t1, t2, x, y)\ > > + __builtin_choose_expr(__builtin_types_compatible_p(t1, t2), \ > > + (t1)(x) < (t2)(y) ? 
(t1)(x) : (t2)(y),\ > > + (t1)__error_incompatible_types_in_min_macro) > > > > /** > > * min - return minimum of two values of the same or compatible types > > * @x: first value > > * @y: second value > > */ > > -#define min(x, y) \ > > - __min(typeof(x), typeof(y), \ > > - __UNIQUE_ID(min1_), __UNIQUE_ID(min2_), \ > > - x, y) > > +#define min(x, y) __min(typeof(x), typeof(y), x, y)\ > > > > -#define __max(t1, t2, max1, max2, x, y) ({ \ > > - t1 max1 = (x); \ > > - t2 max2 = (y); \ > > - (void) (&max1 == &max2);\ > > - max1 > max2 ? max1 : max2; }) > > +#define __max(t1, t2, x, y)\ > > + __builtin_choose_expr(__builtin_types_compatible_p(t1, t2), \ > > + (t1)(x) > (t2)(y) ? (t1)(x) : (t2)(y),\ > > + (t1)__error_incompatible_types_in_max_macro) > > > > /** > > * max - return maximum of two values of the same or compatible types > > * @x: first value > > * @y: second value > > */ > > -#define max(x, y) \ > > - __max(typeof(x), typeof(y), \ > > - __UNIQUE_ID(max1_), __UNIQUE_ID(max2_), \ > > - x, y) > > +#define max(x, y) __max(typeof(x), typeof(y), x, y) > > > > /** > > * min3 - return minimum of three values > > @@ -869,10 +859,7 @@ static inline void ftrace_dump(enum ftrace_dump_mode > > oops_dump_mode) { } > > * @x: first value > > * @y: second value > > */ > > -#define min_t(type, x, y) \ > > - __min(type, type, \ > > - __UNIQUE_ID(min1_), __UNIQUE_ID(min2_), \ > > - x, y) > > +#define min_t(type, x, y) __min(type, type, x, y) > > > > /** > > * max_t - return maximum of two values, using the specified type > > @@ -880,10 +867,7 @@ static inline void ftrace_dump(enum ftrace_dump_mode > > oops_dump_mode) { } > > * @x: first value > > * @y: second value > > */ > > -#define max_t(type, x, y) \ > > - __max(type, type, \ > > - __UNIQUE_ID(min1_), __UNIQUE_ID(min2_), \ > > - x, y) > > +#define max_t(type, x, y) __max(type, type, x, y) > > \ > > > > /** > > * clamp_t - return a value clamped to a given range using a given type > > > > -- > Kees Cook > Pixel Security -- 
Josh
[PATCH 1/1] net: check dev->reg_state before deref of napi netdev_ops
init_dummy_netdev() leaves its netdev_ops pointer zeroed. This leads to a NULL pointer dereference when sk_busy_loop fires against an iwlwifi wireless adapter and checks napi->dev->netdev_ops->ndo_busy_poll. Avoid this by ensuring that napi->dev is not a dummy device before dereferencing napi dev's netdev_ops, preventing the following panic: BUG: unable to handle kernel NULL pointer dereference at 00c8 IP: [] sk_busy_loop+0x92/0x2f0 Call Trace: [] ? uart_write_room+0x74/0xf0 [] sock_poll+0x99/0xa0 [] do_sys_poll+0x2e2/0x520 [] ? get_page_from_freelist+0x3bc/0xa30 [] ? update_curr+0x62/0x140 [] ? __slab_free+0xa1/0x2a0 [] ? __slab_free+0xa1/0x2a0 [] ? skb_free_head+0x21/0x30 [] ? poll_initwait+0x50/0x50 [] ? kmem_cache_free+0x1c6/0x1e0 [] ? uart_write+0x124/0x1d0 [] ? remove_wait_queue+0x4d/0x60 [] ? __wake_up+0x44/0x50 [] ? tty_write_unlock+0x31/0x40 [] ? tty_ldisc_deref+0x16/0x20 [] ? tty_write+0x1e0/0x2f0 [] ? process_echoes+0x80/0x80 [] ? __vfs_write+0x2b/0x130 [] ? vfs_write+0x15a/0x1a0 [] SyS_poll+0x75/0x100 [] entry_SYSCALL_64_fastpath+0x24/0xcf Commit 79e7fff47b7b ("net: remove support for per driver ndo_busy_poll()") indirectly fixed this upstream in linux-4.11 by removing the offending pointer usage. No other users of napi->dev touch its netdev_ops. Fixes: 060212928670 ("net: add low latency socket poll") Fixes: ce6aea93f751 ("net: network drivers no longer need to implement ndo_busy_poll()") - 4.9.y Signed-off-by: Josh Elsasser --- net/core/dev.c | 5 - 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/net/core/dev.c b/net/core/dev.c index 8898618bf341..d0f67d544587 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -5042,7 +5042,10 @@ bool sk_busy_loop(struct sock *sk, int nonblock) goto out; /* Note: ndo_busy_poll method is optional in linux-4.5 */ - busy_poll = napi->dev->netdev_ops->ndo_busy_poll; + if (napi->dev->reg_state != NETREG_DUMMY) + busy_poll = napi->dev->netdev_ops->ndo_busy_poll; + else + busy_poll = NULL; do { rc = 0; -- 2.11.0
[PATCH 0/1] net: avoid a kernel panic during sk_busy_loop
Hi Dave,

I stumbled across a reproducible kernel panic while playing around with busy_poll on a Linux 4.9.86 kernel. There's an unfortunate interaction between init_dummy_netdev, which doesn't bother to fill in netdev_ops, and sk_busy_loop, which assumes netdev_ops is a valid pointer.

To reproduce on the device under test (DUT), I did:

  $ ip addr show dev wlan0
  8: wlan0: mtu 1500 qdisc mq [...]
      inet 172.16.122.6/23 brd 172.16.123.255 scope global wlan0
  $ sysctl -w net.core.busy_read=50
  $ nc -l 172.16.122.6 5001

Then transmitted some data to this socket from a second host:

  $ echo "foo" | nc 172.16.122.6 5001

The DUT immediately hits a kernel panic.

I've attached a patch that applies cleanly to the 4.9.87 stable release. This fix isn't necessary for net/net-next (ndo_busy_poll was removed in linux-4.11), but a further backport of this commit is likely required for any stable releases older than linux-4.5.

I hope this is the right way to raise something like this. I couldn't find a clear answer from the -stable and netdev howtos for bugs against features that no longer exist in mainline.

Thanks,
Josh
Re: [PATCH 1/1] qed: Add firmware 8.33.11.0
On Feb 28, Rahul Verma wrote:
> This patch adds a new qed firmware with fixes and added
> support for new features.
> - Support VLAN remove action in steering flow.
> - Optimized the FW flow and several bug fixes.
> - Allow VXLAN steering.
> - Supports T10DIF and SRQ.
> - Support for port redirection for RDMA bonding
>
> Signed-off-by: Rahul Verma
> ---
>  WHENCE | 1 +
>  qed/qed_init_values_zipped-8.33.11.0.bin | Bin 0 -> 852456 bytes
>  2 files changed, 1 insertion(+)
>  create mode 100755 qed/qed_init_values_zipped-8.33.11.0.bin

I had to fix up a small conflict in WHENCE, but no big deal. Applied and pushed out.

josh
Re: [PATCH linux-firmware] Mellanox: Add new mlxsw_spectrum firmware 13.1620.192
On Tue, Feb 27, 2018 at 3:51 AM, Tal Bar wrote:
> This new firmware contains:
> - Support for auto-neg disable mode
>
> Signed-off-by: Tal Bar
> ---
>  WHENCE | 1 +
>  mellanox/mlxsw_spectrum-13.1620.192.mfa2 | Bin 0 -> 1091848 bytes
>  2 files changed, 1 insertion(+)
>  create mode 100644 mellanox/mlxsw_spectrum-13.1620.192.mfa2

Applied and pushed out. Thanks.

josh
Re: [PATCH 1/1] net: check dev->reg_state before deref of napi netdev_ops
On Mon, Mar 12, 2018 at 4:17 PM, Cong Wang wrote:
> On Sun, Mar 11, 2018 at 12:22 PM, Josh Elsasser wrote:
>> init_dummy_netdev() leaves its netdev_ops pointer zeroed. This leads
>> to a NULL pointer dereference when sk_busy_loop fires against an iwlwifi
>> wireless adapter and checks napi->dev->netdev_ops->ndo_busy_poll.
>>
>> Avoid this by ensuring that napi->dev is not a dummy device before
>> dereferencing napi dev's netdev_ops, preventing the following panic:
>
> Hmm, how about just checking ->netdev_ops? Checking reg_state looks
> odd, although works.

Fair point. I was trying to differentiate between an unexpected NULL pointer and a dummy netdev, but I guess it was clever at the expense of readability. I'll push up a v2 that just does the obvious.
[PATCH 0/1] net: avoid a kernel panic during sk_busy_loop
V2: just check napi->dev->netdev_ops instead of getting clever with the netdev registration state.

Original cover letter:

Hi Dave,

I stumbled across a reproducible kernel panic while playing around with busy_poll on a Linux 4.9.86 kernel. There's an unfortunate interaction between init_dummy_netdev, which doesn't bother to fill in netdev_ops, and sk_busy_loop, which assumes netdev_ops is a valid pointer.

To reproduce on the device under test (DUT), I did:

  $ ip addr show dev wlan0
  8: wlan0: mtu 1500 qdisc mq [...]
      inet 172.16.122.6/23 brd 172.16.123.255 scope global wlan0
  $ sysctl -w net.core.busy_read=50
  $ nc -l 172.16.122.6 5001

Then transmitted some data to this socket from a second host:

  $ echo "foo" | nc 172.16.122.6 5001

The DUT immediately hits a kernel panic.

I've attached a patch that applies cleanly to the 4.9.87 stable release. This fix isn't necessary for net/net-next (ndo_busy_poll was removed in linux-4.11), but a further backport of this commit is likely required for any stable releases older than linux-4.5.

I hope this is the right way to raise something like this. I couldn't find a clear answer from the -stable and netdev howtos on how to handle bugs in features that no longer exist in mainline.

Thanks,
Josh
[PATCH v2 1/1] net: check before dereferencing netdev_ops during busy poll
init_dummy_netdev() leaves its netdev_ops pointer zeroed. This leads
to a NULL pointer dereference when sk_busy_loop fires against an iwlwifi
wireless adapter and checks napi->dev->netdev_ops->ndo_busy_poll.

Avoid this by ensuring napi->dev->netdev_ops is valid before following
the pointer, avoiding the following panic when busy polling on a dummy
netdev:

  BUG: unable to handle kernel NULL pointer dereference at 00c8
  IP: [] sk_busy_loop+0x92/0x2f0
  Call Trace:
   [] ? uart_write_room+0x74/0xf0
   [] sock_poll+0x99/0xa0
   [] do_sys_poll+0x2e2/0x520
   [] ? get_page_from_freelist+0x3bc/0xa30
   [] ? update_curr+0x62/0x140
   [] ? __slab_free+0xa1/0x2a0
   [] ? __slab_free+0xa1/0x2a0
   [] ? skb_free_head+0x21/0x30
   [] ? poll_initwait+0x50/0x50
   [] ? kmem_cache_free+0x1c6/0x1e0
   [] ? uart_write+0x124/0x1d0
   [] ? remove_wait_queue+0x4d/0x60
   [] ? __wake_up+0x44/0x50
   [] ? tty_write_unlock+0x31/0x40
   [] ? tty_ldisc_deref+0x16/0x20
   [] ? tty_write+0x1e0/0x2f0
   [] ? process_echoes+0x80/0x80
   [] ? __vfs_write+0x2b/0x130
   [] ? vfs_write+0x15a/0x1a0
   [] SyS_poll+0x75/0x100
   [] entry_SYSCALL_64_fastpath+0x24/0xcf

Commit 79e7fff47b7b ("net: remove support for per driver ndo_busy_poll()")
indirectly fixed this upstream in linux-4.11 by removing the offending
pointer usage. No other users of napi->dev touch its netdev_ops.

Fixes: 060212928670 ("net: add low latency socket poll")
Fixes: ce6aea93f751 ("net: network drivers no longer need to implement ndo_busy_poll()") - 4.9.y
Signed-off-by: Josh Elsasser
---
 net/core/dev.c | 5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/net/core/dev.c b/net/core/dev.c
index 8898618bf341..1f50c131ed15 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5042,7 +5042,10 @@ bool sk_busy_loop(struct sock *sk, int nonblock)
 		goto out;

 	/* Note: ndo_busy_poll method is optional in linux-4.5 */
-	busy_poll = napi->dev->netdev_ops->ndo_busy_poll;
+	if (napi->dev->netdev_ops)
+		busy_poll = napi->dev->netdev_ops->ndo_busy_poll;
+	else
+		busy_poll = NULL;

 	do {
 		rc = 0;
-- 
2.11.0
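The shape of the fix can be mocked up in userspace. In the sketch below, struct ops and struct dev are illustrative stand-ins for net_device_ops and net_device (not the kernel structures): a dummy device leaves its ops table NULL, so the poll path must test the table pointer before chasing it for the optional busy-poll hook.

```c
#include <assert.h>
#include <stddef.h>

struct ops { int (*busy_poll)(void); };
struct dev { const struct ops *netdev_ops; };

/* A sample driver hook, standing in for a real ndo_busy_poll. */
static int sample_busy_poll(void)
{
	return 7;
}

static const struct ops sample_ops = { .busy_poll = sample_busy_poll };

/* Mirrors the patched sk_busy_loop(): check the ops table before
 * dereferencing it; a NULL hook means "fall back to the generic path",
 * represented here by returning -1. */
int poll_once(const struct dev *d)
{
	int (*busy_poll)(void);

	if (d->netdev_ops)
		busy_poll = d->netdev_ops->busy_poll;
	else
		busy_poll = NULL;	/* dummy device: no ops table at all */

	return busy_poll ? busy_poll() : -1;
}
```

Without the `if (d->netdev_ops)` guard, the dummy-device case is exactly the NULL dereference in the trace above.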
Re: [PATCH v5 0/2] Remove false-positive VLAs when using max()
On Sat, Mar 17, 2018 at 01:07:32PM -0700, Kees Cook wrote: > On Sat, Mar 17, 2018 at 11:52 AM, Linus Torvalds > wrote: > > So the above is completely insane, bit there is actually a chance that > > using that completely crazy "x -> sizeof(char[x])" conversion actually > > helps, because it really does have a (very odd) evaluation-time > > change. sizeof() has to be evaluated as part of the constant > > expression evaluation, in ways that "__builtin_constant_p()" isn't > > specified to be done. > > > > But it is also definitely me grasping at straws. If that doesn't work > > for 4.4, there's nothing else I can possibly see. > > No luck! :( gcc 4.4 refuses to play along. And, hilariously, not only > does it not change the complaint about __builtin_choose_expr(), it > also thinks that's a VLA now. > > ./include/linux/mm.h: In function ‘get_mm_hiwater_rss’: > ./include/linux/mm.h:1567: warning: variable length array is used > ./include/linux/mm.h:1567: error: first argument to > ‘__builtin_choose_expr’ not a constant > > 6.8 is happy with it (of course). > > I do think the earlier version (without the > sizeof-hiding-builting_constant_p) provides a template for a > const_max() that both you and Rasmus would be happy with, though! I thought we were dropping support for 4.4 (for other reasons). Isn't it 4.6 we should be looking at? -- Josh
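For context, the property being fought over is that an array dimension must be an integer constant expression or the compiler emits a VLA. A heavily simplified stand-in for the const_max() under discussion is a plain ternary; this sketch deliberately leaves out the thread's real difficulty (rejecting non-constant arguments and the `__builtin_constant_p()`/`sizeof(char[x])` tricks), and all names here are invented:

```c
#include <assert.h>

/*
 * Simplified const-max: when both arguments are integer constant
 * expressions, the ternary result is too, so an array sized by it is
 * not a VLA. Note the classic double-evaluation hazard for arguments
 * with side effects -- part of why the kernel wants a fancier macro.
 */
#define CONST_MAX(a, b) ((a) > (b) ? (a) : (b))

static char buf[CONST_MAX(16, 32)]; /* constant size, not a VLA */

static int max_int(int a, int b)
{
    return CONST_MAX(a, b); /* also usable with runtime values */
}
```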
Re: WARNING: kernel stack frame pointer at ffff880156a5fea0 in bash:2103 has bad value 00007ffec7d87e50
On Tue, Sep 26, 2017 at 11:51:31PM +0200, Richard Weinberger wrote: > Alexei, > > CC'ing Josh and Ingo. > > Am Dienstag, 26. September 2017, 06:09:02 CEST schrieb Alexei Starovoitov: > > On Mon, Sep 25, 2017 at 11:23:31PM +0200, Richard Weinberger wrote: > > > Hi! > > > > > > While playing with bcc's opensnoop tool on Linux 4.14-rc2 I managed to > > > trigger this splat: > > > > > > [ 297.629773] WARNING: kernel stack frame pointer at 880156a5fea0 in > > > bash:2103 has bad value 7ffec7d87e50 > > > [ 297.629777] unwind stack type:0 next_sp: (null) mask:0x6 > > > graph_idx:0 > > > [ 297.629783] 88015b207ae0: 88015b207b68 (0x88015b207b68) > > > [ 297.629790] 88015b207ae8: b163c00e > > > (__save_stack_trace+0x6e/ > > > 0xd0) > > > [ 297.629792] 88015b207af0: ... > > > [ 297.629795] 88015b207af8: 880156a58000 (0x880156a58000) > > > [ 297.629799] 88015b207b00: 880156a6 (0x880156a6) > > > [ 297.629800] 88015b207b08: ... > > > [ 297.629803] 88015b207b10: 0006 (0x6) > > > [ 297.629806] 88015b207b18: 880151b02700 (0x880151b02700) > > > [ 297.629809] 88015b207b20: 0101 (0x101) > > > [ 297.629812] 88015b207b28: 880156a5fea0 (0x880156a5fea0) > > > [ 297.629815] 88015b207b30: 88015b207ae0 (0x88015b207ae0) > > > [ 297.629818] 88015b207b38: c0050282 (0xc0050282) > > > [ 297.629819] 88015b207b40: ... 
> > > [ 297.629822] 88015b207b48: 0100 (0x100) > > > [ 297.629825] 88015b207b50: 880157b98280 (0x880157b98280) > > > [ 297.629828] 88015b207b58: 880157b98380 (0x880157b98380) > > > [ 297.629831] 88015b207b60: 88015ad2b500 (0x88015ad2b500) > > > [ 297.629834] 88015b207b68: 88015b207b78 (0x88015b207b78) > > > [ 297.629838] 88015b207b70: b163c086 > > > (save_stack_trace+0x16/0x20) [ 297.629841] 88015b207b78: > > > 88015b207da8 (0x88015b207da8) [ 297.629847] 88015b207b80: > > > b18a8ed6 (save_stack+0x46/0xd0) [ 297.629850] 88015b207b88: > > > 004c (0x4c) > > > [ 297.629852] 88015b207b90: 88015b207ba0 (0x88015b207ba0) > > > [ 297.629855] 88015b207b98: 8801 (0x8801) > > > [ 297.629859] 88015b207ba0: b163c086 > > > (save_stack_trace+0x16/0x20) [ 297.629864] 88015b207ba8: > > > b18a8ed6 (save_stack+0x46/0xd0) [ 297.629868] 88015b207bb0: > > > b18a9752 (kasan_slab_free+0x72/0xc0) > > Thanks for the report! > > I'm not sure I understand what's going on here. > > It seems you have kasan enabled and it's trying to do save_stack() > > and something crashing? > > I don't see any bpf related helpers in the stack trace. > > Which architecture is this? and .config ? > > Is bpf jit enabled? If so, make sure that net.core.bpf_jit_kallsyms=1 > > I found some time to dig a little further. > It seems to happen only when CONFIG_DEBUG_SPINLOCK is enabled, please see the > attached .config. The JIT is off. > KAsan is also not involved at all, the regular stack saving machinery from > the > trace framework initiates the stack unwinder. > > The issue arises as soon as in pre_handler_kretprobe() > raw_spin_lock_irqsave() > is being called. > It happens on all releases that have commit c32c47c68a0a ("x86/unwind: Warn > on > bad frame pointer"). > Interestingly it does not happen when I run > samples/kprobes/kretprobe_example.ko. So, BPF must be involved somehow. > > Here is another variant of the warning, it matches the attached .config: I can take a look at it. 
Unfortunately, for these types of issues I often need the vmlinux file to be able to make sense of the unwinder dump. So if you happen to have somewhere to copy the vmlinux to, that would be helpful. Or if you give me your GCC version I can try to rebuild it locally. -- Josh
Re: WARNING: kernel stack frame pointer at ffff880156a5fea0 in bash:2103 has bad value 00007ffec7d87e50
On Wed, Sep 27, 2017 at 08:51:22AM +0200, Richard Weinberger wrote:
> Am Mittwoch, 27. September 2017, 00:42:46 CEST schrieb Josh Poimboeuf:
> > > Here is another variant of the warning, it matches the attached .config:
> > I can take a look at it. Unfortunately, for these types of issues I
> > often need the vmlinux file to be able to make sense of the unwinder
> > dump. So if you happen to have somewhere to copy the vmlinux to, that
> > would be helpful. Or if you give me your GCC version I can try to
> > rebuild it locally.
>
> There you go:
> http://git.infradead.org/~rw/bpf_splat/vmlinux.xz

Thanks. Can you test this fix?

diff --git a/arch/x86/kernel/kprobes/common.h b/arch/x86/kernel/kprobes/common.h
index db2182d63ed0..3fc0f9a794cb 100644
--- a/arch/x86/kernel/kprobes/common.h
+++ b/arch/x86/kernel/kprobes/common.h
@@ -3,6 +3,15 @@
 
 /* Kprobes and Optprobes common header */
 
+#include
+
+#ifdef CONFIG_FRAME_POINTER
+# define SAVE_RBP_STRING " push %" _ASM_BP "\n" \
+			 " mov %" _ASM_SP ", %" _ASM_BP "\n"
+#else
+# define SAVE_RBP_STRING " push %" _ASM_BP "\n"
+#endif
+
 #ifdef CONFIG_X86_64
 #define SAVE_REGS_STRING \
 	/* Skip cs, ip, orig_ax. */ \
@@ -17,7 +26,7 @@
 	" pushq %r10\n" \
 	" pushq %r11\n" \
 	" pushq %rbx\n" \
-	" pushq %rbp\n" \
+	SAVE_RBP_STRING \
 	" pushq %r12\n" \
 	" pushq %r13\n" \
 	" pushq %r14\n" \
@@ -48,7 +57,7 @@
 	" pushl %es\n" \
 	" pushl %ds\n" \
 	" pushl %eax\n" \
-	" pushl %ebp\n" \
+	SAVE_RBP_STRING \
 	" pushl %edi\n" \
 	" pushl %esi\n" \
 	" pushl %edx\n" \
Re: [PATCH] rtlwifi: rtl8723de: Add firmware for new driver/device
On Tue, Jan 2, 2018 at 7:44 PM, Larry Finger wrote: > A driver for the RTL8723DE is nearing submission to staging. This commit > supplies > the firmware for it. > > Signed-off-by: Larry Finger > Cc: Ping-Ke Shih > --- > WHENCE | 9 + > rtlwifi/rtl8723defw.bin | Bin 0 -> 27726 bytes > 2 files changed, 9 insertions(+) > create mode 100644 rtlwifi/rtl8723defw.bin Applied and pushed out. Thanks. josh
Re: [PATCH 00/18] prevent bounds-check bypass via speculative execution
On Tue, Jan 09, 2018 at 11:44:05AM -0800, Dan Williams wrote: > On Tue, Jan 9, 2018 at 11:34 AM, Jiri Kosina wrote: > > On Fri, 5 Jan 2018, Dan Williams wrote: > > > > [ ... snip ... ] > >> Andi Kleen (1): > >> x86, barrier: stop speculation for failed access_ok > >> > >> Dan Williams (13): > >> x86: implement nospec_barrier() > >> [media] uvcvideo: prevent bounds-check bypass via speculative > >> execution > >> carl9170: prevent bounds-check bypass via speculative execution > >> p54: prevent bounds-check bypass via speculative execution > >> qla2xxx: prevent bounds-check bypass via speculative execution > >> cw1200: prevent bounds-check bypass via speculative execution > >> Thermal/int340x: prevent bounds-check bypass via speculative > >> execution > >> ipv6: prevent bounds-check bypass via speculative execution > >> ipv4: prevent bounds-check bypass via speculative execution > >> vfs, fdtable: prevent bounds-check bypass via speculative execution > >> net: mpls: prevent bounds-check bypass via speculative execution > >> udf: prevent bounds-check bypass via speculative execution > >> userns: prevent bounds-check bypass via speculative execution > >> > >> Mark Rutland (4): > >> asm-generic/barrier: add generic nospec helpers > >> Documentation: document nospec helpers > >> arm64: implement nospec_ptr() > >> arm: implement nospec_ptr() > > > > So considering the recent publication of [1], how come we all of a sudden > > don't need the barriers in ___bpf_prog_run(), namely for LD_IMM_DW and > > LDX_MEM_##SIZEOP, and something comparable for eBPF JIT? > > > > Is this going to be handled in eBPF in some other way? > > > > Without that in place, and considering Jann Horn's paper, it would seem > > like PTI doesn't really lock it down fully, right? > > Here is the latest (v3) bpf fix: > > https://patchwork.ozlabs.org/patch/856645/ > > I currently have v2 on my 'nospec' branch and will move that to v3 for > the next update, unless it goes upstream before then. 
That patch seems specific to CONFIG_BPF_SYSCALL. Is the bpf() syscall the only attack vector? Or are there other ways to run bpf programs that we should be worried about? -- Josh
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Fri, Jan 05, 2018 at 06:52:07PM -0800, Linus Torvalds wrote: > On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams wrote: > > From: Andi Kleen > > > > When access_ok fails we should always stop speculating. > > Add the required barriers to the x86 access_ok macro. > > Honestly, this seems completely bogus. > > The description is pure garbage afaik. > > The fact is, we have to stop speculating when access_ok() does *not* > fail - because that's when we'll actually do the access. And it's that > access that needs to be non-speculative. > > That actually seems to be what the code does (it stops speculation > when __range_not_ok() returns false, but access_ok() is > !__range_not_ok()). But the explanation is crap, and dangerous. The description also seems to be missing the "why", as it's not self-evident (to me, at least). Isn't this (access_ok/uaccess_begin/ASM_STAC) too early for the lfence? i.e., wouldn't the pattern be: get_user(uval, uptr); if (uval < array_size) { lfence(); foo = a2[a1[uval] * 256]; } Shouldn't the lfence come much later, *after* reading the variable and comparing it and branching accordingly? -- Josh
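The clamp-after-compare pattern Josh sketches here (barrier between the bounds check and the dependent load) is what the kernel later shipped as array_index_nospec(), which masks the index instead of relying only on lfence. A user-space sketch of the masking variant — illustrative only, not the kernel implementation:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Sketch in the spirit of the kernel's later array_index_nospec():
 * after the architectural bounds check, force the index to 0 on any
 * (speculatively) out-of-bounds path via a branch-free mask.
 */
static size_t index_nospec(size_t index, size_t size)
{
    /* All-ones mask when index < size, all-zeroes otherwise. */
    size_t mask = (size_t)0 - (index < size);

    return index & mask;
}

static const int table[4] = { 10, 20, 30, 40 };

static int lookup(const int *array, size_t size, size_t index)
{
    if (index >= size)
        return -1; /* architectural bounds check */

    /* Even if this load were speculated past the check above, the
     * masked index cannot reach attacker-controlled offsets. */
    return array[index_nospec(index, size)];
}
```

The mask is computed from the same comparison the branch uses, so a mispredicted branch still sees a clamped index rather than the raw one.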
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Tue, Jan 09, 2018 at 01:47:09PM -0800, Dan Williams wrote: > On Tue, Jan 9, 2018 at 1:41 PM, Josh Poimboeuf wrote: > > On Fri, Jan 05, 2018 at 06:52:07PM -0800, Linus Torvalds wrote: > >> On Fri, Jan 5, 2018 at 5:10 PM, Dan Williams > >> wrote: > >> > From: Andi Kleen > >> > > >> > When access_ok fails we should always stop speculating. > >> > Add the required barriers to the x86 access_ok macro. > >> > >> Honestly, this seems completely bogus. > >> > >> The description is pure garbage afaik. > >> > >> The fact is, we have to stop speculating when access_ok() does *not* > >> fail - because that's when we'll actually do the access. And it's that > >> access that needs to be non-speculative. > >> > >> That actually seems to be what the code does (it stops speculation > >> when __range_not_ok() returns false, but access_ok() is > >> !__range_not_ok()). But the explanation is crap, and dangerous. > > > > The description also seems to be missing the "why", as it's not > > self-evident (to me, at least). > > > > Isn't this (access_ok/uaccess_begin/ASM_STAC) too early for the lfence? > > > > i.e., wouldn't the pattern be: > > > > get_user(uval, uptr); > > if (uval < array_size) { > > lfence(); > > foo = a2[a1[uval] * 256]; > > } > > > > Shouldn't the lfence come much later, *after* reading the variable and > > comparing it and branching accordingly? > > The goal of putting the lfence in uaccess_begin() is to prevent > speculation past access_ok(). Right, but what's the purpose of preventing speculation past access_ok()? -- Josh
Re: [PATCH 06/18] x86, barrier: stop speculation for failed access_ok
On Tue, Jan 09, 2018 at 01:59:04PM -0800, Dan Williams wrote: > > Right, but what's the purpose of preventing speculation past > > access_ok()? > > Caution. It's the same rationale for the nospec_array_ptr() patches. > If we, kernel community, can identify any possible speculation past a > bounds check we should inject a speculation mitigation. Unless there's > a way to be 100% certain that the first unwanted speculation can be > turned into a gadget later on in the instruction stream, err on the > side of shutting it down early. I'm all for being cautious. The nospec_array_ptr() patches are fine, and they make sense in light of the variant 1 CVE. But that still doesn't answer my question. I haven't seen *any* rationale for this patch. It would be helpful to at least describe what's being protected against, even if it's hypothetical. How can we review it if the commit log doesn't describe its purpose? -- Josh
Re: [PATCH v2 1/1] qed: Add firmware 8.37.2.0
On Mon, May 21, 2018 at 10:25 AM Rahul Verma wrote: > This patch add a new qed firmware with fixes and added > support for new features. > -Fix aRFS for tunneled traffic without inner IP. > -Fix chip configuration which may fail under heavy traffic conditions. > -Fix iSCSI recovery flow. > -Support receiving any-VNI in VXLAN and GENEVE RX classification. > -Support additional RoCE statistics for error cases. > -Support RoCE T10DIF. > v1->v2: > Added signoff for Ariel Elior. > Signed-off-by: Rahul Verma > Signed-off-by: Ariel Elior > --- > WHENCE | 1 + > qed/qed_init_values_zipped-8.37.2.0.bin | Bin 0 -> 867472 bytes > 2 files changed, 1 insertion(+) > create mode 100755 qed/qed_init_values_zipped-8.37.2.0.bin Applied and pushed out. josh
[PATCH] net: qmi_wwan: Add Netgear Aircard 779S
Add support for Netgear Aircard 779S

Signed-off-by: Josh Hill
---
 drivers/net/usb/qmi_wwan.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/drivers/net/usb/qmi_wwan.c b/drivers/net/usb/qmi_wwan.c
index 42565dd..0946808 100644
--- a/drivers/net/usb/qmi_wwan.c
+++ b/drivers/net/usb/qmi_wwan.c
@@ -1103,6 +1103,7 @@ static const struct usb_device_id products[] = {
 	{QMI_FIXED_INTF(0x05c6, 0x920d, 5)},
 	{QMI_QUIRK_SET_DTR(0x05c6, 0x9625, 4)},	/* YUGA CLM920-NC5 */
 	{QMI_FIXED_INTF(0x0846, 0x68a2, 8)},
+	{QMI_FIXED_INTF(0x0846, 0x68d3, 8)},	/* Netgear Aircard 779S */
 	{QMI_FIXED_INTF(0x12d1, 0x140c, 1)},	/* Huawei E173 */
 	{QMI_FIXED_INTF(0x12d1, 0x14ac, 1)},	/* Huawei E1820 */
 	{QMI_FIXED_INTF(0x1435, 0xd181, 3)},	/* Wistron NeWeb D18Q1 */
-- 
2.7.4
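One-line ID additions like this work because the driver probes by walking a (vendor, product) table whose entry also records the QMI interface number. A toy version of that lookup — the structures are simplified stand-ins, not the real usb_device_id types:

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in for a usb_device_id table entry. */
struct id_entry {
    unsigned short vendor;
    unsigned short product;
    int iface; /* QMI interface number, as in QMI_FIXED_INTF() */
};

static const struct id_entry products[] = {
    { 0x0846, 0x68a2, 8 },
    { 0x0846, 0x68d3, 8 }, /* Netgear Aircard 779S (the new entry) */
    { 0x12d1, 0x140c, 1 }, /* Huawei E173 */
};

/* Return the interface number for a device, or -1 if unsupported. */
static int match_iface(unsigned short vendor, unsigned short product)
{
    for (size_t i = 0; i < sizeof(products) / sizeof(products[0]); i++)
        if (products[i].vendor == vendor && products[i].product == product)
            return products[i].iface;
    return -1;
}
```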
Re: pull request Cavium liquidio vswitch firmware v1.7.2
On Wed, Jun 6, 2018 at 1:13 PM Felix Manlunas wrote: > > On Mon, May 21, 2018 at 10:39:20AM -0700, Felix Manlunas wrote: > > The following changes since commit 2a9b2cf50fb32e36e4fc1586c2f6f1421913b553: > > > > Merge branch 'for-upstreaming-v1.7.2' of > > https://github.com/felix-cavium/linux-firmware (2018-05-18 08:35:22 -0400) > > > > are available in the git repository at: > > > > https://github.com/felix-cavium/linux-firmware.git > > for-upstreaming-v1.7.2-vsw > > > > for you to fetch changes up to 0e193ca65d8b064502d61163597bf14eef81710f: > > > > linux-firmware: liquidio: update vswitch firmware to v1.7.2 (2018-05-19 > > 23:29:03 -0700) > > > > Signed-off-by: Manish Awasthi > > Signed-off-by: Felix Manlunas > > > > Felix Manlunas (1): > > linux-firmware: liquidio: update vswitch firmware to v1.7.2 > > > > WHENCE| 2 +- > > liquidio/lio_23xx_vsw.bin | Bin 19922416 -> 20434408 bytes > > 2 files changed, 1 insertion(+), 1 deletion(-) > > Hello Maintainers of linux-firmware.git, > > Any feedback about this submission? We sent it two weeks ago, but we > haven't heard anything. Thanks for the ping. I missed this one and then confused it with an older pull request with a similar subject line. I've pulled and pushed out now. josh
Re: [linux-firmware][pull request] ice: Add package file for Intel E800 series driver
On Mon, Sep 23, 2019 at 5:56 PM Jeff Kirsher wrote: > > From: Tony Nguyen > > The ice driver must load a package file to the firmware to utilize full > functionality; add the package file to /lib/firmware/intel/ice/ddp. Also > add a symlink, ice.pkg, so the driver can refer to the package by a > consistent name. > > Signed-off-by: Tony Nguyen > Tested-by: Andrew Bowers > Signed-off-by: Jeff Kirsher > --- > The following are changes since commit > 417a9c6e197a8d3eec792494efc87a2b42f76324: > amdgpu: add initial navi10 firmware > and are available in the git repository at: > git://git.kernel.org/pub/scm/linux/kernel/git/jkirsher/firmware dev-queue > > LICENSE.ice | 39 ++ > WHENCE| 9 > intel/ice/ddp/ice-1.3.4.0.pkg | Bin 0 -> 577796 bytes > intel/ice/ddp/ice.pkg | 1 + > 4 files changed, 49 insertions(+) > create mode 100644 LICENSE.ice > create mode 100644 intel/ice/ddp/ice-1.3.4.0.pkg > create mode 12 intel/ice/ddp/ice.pkg Thanks. I added another commit on top to remove the symlink and fix the WHENCE file because we merged a commit that changes how symlinks are handled. All applied and pushed out. josh
[PATCH 3/3] i40e: Add UDP segmentation offload support
Based on a series from Alexander Duyck this change adds UDP segmentation
offload support to the i40e driver.

CC: Alexander Duyck
CC: Willem de Bruijn
Signed-off-by: Josh Hunt
---
 drivers/net/ethernet/intel/i40e/i40e_main.c |  1 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 12 +++++++++---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 6031223eafab..56f8c52cbba1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -12911,6 +12911,7 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
 			  NETIF_F_GSO_IPXIP6		|
 			  NETIF_F_GSO_UDP_TUNNEL	|
 			  NETIF_F_GSO_UDP_TUNNEL_CSUM	|
+			  NETIF_F_GSO_UDP_L4		|
 			  NETIF_F_SCTP_CRC		|
 			  NETIF_F_RXHASH		|
 			  NETIF_F_RXCSUM		|
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e3f29dc8b290..0b32f04a6255 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2960,10 +2960,16 @@ static int i40e_tso(struct i40e_tx_buffer *first, u8 *hdr_len,
 	/* remove payload length from inner checksum */
 	paylen = skb->len - l4_offset;
-	csum_replace_by_diff(&l4.tcp->check, (__force __wsum)htonl(paylen));
-	/* compute length of segmentation header */
-	*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+	if (skb->csum_offset == offsetof(struct tcphdr, check)) {
+		csum_replace_by_diff(&l4.tcp->check, (__force __wsum)htonl(paylen));
+		/* compute length of segmentation header */
+		*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+	} else {
+		csum_replace_by_diff(&l4.udp->check, (__force __wsum)htonl(paylen));
+		/* compute length of segmentation header */
+		*hdr_len = sizeof(*l4.udp) + l4_offset;
+	}
 
 	/* pull values out of skb_shinfo */
 	gso_size = skb_shinfo(skb)->gso_size;
-- 
2.7.4
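The dispatch in the hunk keys off skb->csum_offset: the checksum field sits at byte offset 16 in a TCP header but offset 6 in a UDP header, and the segmentation-header length differs accordingly (TCP's is variable via doff, UDP's is a fixed 8 bytes). A user-space sketch of that distinction — the struct layouts mimic the RFC header formats, not the kernel's tcphdr/udphdr:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Minimal RFC-style header layouts, stand-ins for tcphdr/udphdr. */
struct tcp_hdr {
    uint16_t source, dest;
    uint32_t seq, ack_seq;
    uint16_t flags_doff; /* data offset lives in the high bits */
    uint16_t window;
    uint16_t check;      /* byte offset 16 */
    uint16_t urg_ptr;
};

struct udp_hdr {
    uint16_t source, dest;
    uint16_t len;
    uint16_t check;      /* byte offset 6 */
};

/*
 * Segmentation-header length as computed in the patch: TCP uses
 * doff * 4 (options can extend the header), UDP is always
 * sizeof(struct udp_hdr) == 8 bytes, each added to the L4 offset.
 */
static size_t seg_hdr_len(size_t csum_offset, unsigned doff, size_t l4_offset)
{
    if (csum_offset == offsetof(struct tcp_hdr, check))
        return doff * 4 + l4_offset;
    return sizeof(struct udp_hdr) + l4_offset;
}
```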
[PATCH 1/3] igb: Add UDP segmentation offload support
Based on a series from Alexander Duyck this change adds UDP segmentation
offload support to the igb driver.

CC: Alexander Duyck
CC: Willem de Bruijn
Signed-off-by: Josh Hunt
---
 drivers/net/ethernet/intel/igb/e1000_82575.h |  1 +
 drivers/net/ethernet/intel/igb/igb_main.c    | 23 +++++++++++++++++------
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_82575.h b/drivers/net/ethernet/intel/igb/e1000_82575.h
index 6ad775b1a4c5..63ec253ac788 100644
--- a/drivers/net/ethernet/intel/igb/e1000_82575.h
+++ b/drivers/net/ethernet/intel/igb/e1000_82575.h
@@ -127,6 +127,7 @@ struct e1000_adv_tx_context_desc {
 };
 
 #define E1000_ADVTXD_MACLEN_SHIFT	9  /* Adv ctxt desc mac len shift */
+#define E1000_ADVTXD_TUCMD_L4T_UDP	0x /* L4 Packet TYPE of UDP */
 #define E1000_ADVTXD_TUCMD_IPV4	0x0400 /* IP Packet Type: 1=IPv4 */
 #define E1000_ADVTXD_TUCMD_L4T_TCP	0x0800 /* L4 Packet TYPE of TCP */
 #define E1000_ADVTXD_TUCMD_L4T_SCTP	0x1000 /* L4 packet TYPE of SCTP */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 105b0624081a..5eabfac5a18d 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2516,6 +2516,7 @@ igb_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(mac_hdr_len > IGB_MAX_MAC_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_HW_VLAN_CTAG_TX |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -2524,6 +2525,7 @@ igb_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(network_hdr_len > IGB_MAX_NETWORK_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -3120,7 +3122,7 @@ static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		NETIF_F_HW_CSUM;
 
 	if (hw->mac.type >= e1000_82576)
-		netdev->features |= NETIF_F_SCTP_CRC;
+		netdev->features |= NETIF_F_SCTP_CRC | NETIF_F_GSO_UDP_L4;
 
 	if (hw->mac.type >= e1000_i350)
 		netdev->features |= NETIF_F_HW_TC;
@@ -5694,6 +5696,7 @@ static int igb_tso(struct igb_ring *tx_ring,
 	} ip;
 	union {
 		struct tcphdr *tcp;
+		struct udphdr *udp;
 		unsigned char *hdr;
 	} l4;
 	u32 paylen, l4_offset;
@@ -5713,7 +5716,8 @@ static int igb_tso(struct igb_ring *tx_ring,
 	l4.hdr = skb_checksum_start(skb);
 
 	/* ADV DTYP TUCMD MKRLOC/ISCSIHEDLEN */
-	type_tucmd = E1000_ADVTXD_TUCMD_L4T_TCP;
+	type_tucmd = (skb->csum_offset == offsetof(struct tcphdr, check)) ?
+		     E1000_ADVTXD_TUCMD_L4T_TCP : E1000_ADVTXD_TUCMD_L4T_UDP;
 
 	/* initialize outer IP header fields */
 	if (ip.v4->version == 4) {
@@ -5741,12 +5745,19 @@ static int igb_tso(struct igb_ring *tx_ring,
 	/* determine offset of inner transport header */
 	l4_offset = l4.hdr - skb->data;
 
-	/* compute length of segmentation header */
-	*hdr_len = (l4.tcp->doff * 4) + l4_offset;
-
 	/* remove payload length from inner checksum */
 	paylen = skb->len - l4_offset;
-	csum_replace_by_diff(&l4.tcp->check, htonl(paylen));
+	if (type_tucmd & E1000_ADVTXD_TUCMD_L4T_TCP) {
+		/* compute length of segmentation header */
+		*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+		csum_replace_by_diff(&l4.tcp->check,
+				     (__force __wsum)htonl(paylen));
+	} else {
+		/* compute length of segmentation header */
+		*hdr_len = sizeof(*l4.udp) + l4_offset;
+		csum_replace_by_diff(&l4.udp->check,
+				     (__force __wsum)htonl(paylen));
+	}
 
 	/* update gso size and bytecount with header size */
 	first->gso_segs = skb_shinfo(skb)->gso_segs;
-- 
2.7.4
[PATCH 0/3] igb, ixgbe, i40e UDP segmentation offload support
Alexander Duyck posted a series in 2018 proposing adding UDP segmentation
offload support to ixgbe and ixgbevf, but those patches were never accepted:

https://lore.kernel.org/netdev/20180504003556.4769.11407.stgit@localhost.localdomain/

This series is a repost of his ixgbe patch along with a similar change to
the igb and i40e drivers. Testing using the udpgso_bench_tx benchmark shows
a noticeable performance improvement with these changes applied. All #s
below were run with:

udpgso_bench_tx -C 1 -4 -D 172.25.43.133 -z -l 30 -u -S 0 -s $pkt_size

igb::

SW GSO (ethtool -K eth0 tx-udp-segmentation off):
$pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
1472       120143.64   113   81263    81263   83.55  1.35
2944       120160.09   114   40638    40638   62.88  1.81
5888       120160.64   114   20319    20319   43.59  2.61
11776      120160.76   114   10160    10160   37.52  3.03
23552      120159.25   114   5080     5080    34.75  3.28
47104      120160.55   114   2540     2540    32.83  3.47
61824      120160.56   114   1935     1935    32.09  3.55

HW GSO offload (ethtool -K eth0 tx-udp-segmentation on):
$pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
1472       120144.65   113   81264    81264   83.03  1.36
2944       120161.56   114   40638    40638   41     2.78
5888       120160.23   114   20319    20319   23.76  4.79
11776      120161.16   114   10160    10160   15.82  7.20
23552      120156.45   114   5079     5079    12.8   8.90
47104      120159.33   114   2540     2540    8.82   12.92
61824      120158.43   114   1935     1935    8.24   13.83

ixgbe::

SW GSO:
$pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
1472       1070565.90  1015  724112   724112  100    10.15
2944       1201579.19  1140  406342   406342  95.69  11.91
5888       1201217.55  1140  203185   203185  55.38  20.58
11776      1201613.49  1140  101588   101588  42.15  27.04
23552      1201631.32  1140  50795    50795   35.97  31.69
47104      1201626.38  1140  25397    25397   33.51  34.01
61824      1201625.52  1140  19350    19350   32.83  34.72

HW GSO Offload:
$pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
1472       1058681.25  1004  715954   715954  100    10.04
2944       1201730.86  1134  404254   404254  61.28  18.50
5888       1201776.61  1131  201608   201608  30.25  37.38
11776      1201795.90  1130  100676   100676  16.63  67.94
23552      1201807.90  1129  50304    50304   10.07  112.11
47104      1201748.35  1128  25143    25143   6.8    165.88
61824      1200770.45  1128  19140    19140   5.38   209.66

i40e::

SW GSO:
$pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
1472       650122.83   616   439362   439362  100    6.16
2944       943993.53   895   319042   319042  100    8.95
5888       1199751.90  1138  202857   202857  82.51  13.79
11776      1200288.08  1139  101477   101477  64.34  17.70
23552      1201596.56  1140  50793    50793   59.74  19.08
47104      1201597.98  1140  25396    25396   56.31  20.24
61824      1201610.43  1140  19350    19350   55.48  20.54

HW GSO offload:
$pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
1472       657424.83   623   444653   444653  100    6.23
2944       1201242.87  1139  406226   406226  91.45  12.45
5888       1201739.95  1140  203199   203199  57.46  19.83
11776      1201557.36  1140  101584   101584  36.83  30.95
23552      1201525.17  1140  50790    50790   23.86  47.77
47104      1201514.54  1140  25394    25394   17.45  65.32
61824      1201478.91  1140  19348    19348   14.79  77.07

I was not sure how to properly attribute Alexander on the ixgbe patch so
please adjust this as necessary. Thanks!

Josh Hunt (3):
  igb: Add UDP segmentation offload support
  ixgbe: Add UDP segmentation offload support
  i40e: Add UDP segmentation offload support

 drivers/net/ethernet/intel/i40e/i40e_main.c  |  1 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c  | 12 +++++++++---
 drivers/net/ethernet/intel/igb/e1000_82575.h |  1 +
 drivers/net/ethernet/intel
[PATCH 2/3] ixgbe: Add UDP segmentation offload support
Repost from a series by Alexander Duyck to add UDP segmentation offload
support to the ixgbe driver:

https://lore.kernel.org/netdev/20180504003916.4769.66271.stgit@localhost.localdomain/

CC: Alexander Duyck
CC: Willem de Bruijn
Signed-off-by: Josh Hunt
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 24 +++++++++++++++++++-----
 1 file changed, 19 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 1ce2397306b9..2b01d264e5ce 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7946,6 +7946,7 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 	} ip;
 	union {
 		struct tcphdr *tcp;
+		struct udphdr *udp;
 		unsigned char *hdr;
 	} l4;
 	u32 paylen, l4_offset;
@@ -7969,6 +7970,9 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 	l4.hdr = skb_checksum_start(skb);
 
 	/* ADV DTYP TUCMD MKRLOC/ISCSIHEDLEN */
+	type_tucmd = (skb->csum_offset == offsetof(struct tcphdr, check)) ?
+		     IXGBE_ADVTXD_TUCMD_L4T_TCP : IXGBE_ADVTXD_TUCMD_L4T_UDP;
+
 	type_tucmd = IXGBE_ADVTXD_TUCMD_L4T_TCP;
 
 	/* initialize outer IP header fields */
@@ -7999,12 +8003,20 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 	/* determine offset of inner transport header */
 	l4_offset = l4.hdr - skb->data;
 
-	/* compute length of segmentation header */
-	*hdr_len = (l4.tcp->doff * 4) + l4_offset;
-
 	/* remove payload length from inner checksum */
 	paylen = skb->len - l4_offset;
-	csum_replace_by_diff(&l4.tcp->check, (__force __wsum)htonl(paylen));
+
+	if (type_tucmd & IXGBE_ADVTXD_TUCMD_L4T_TCP) {
+		/* compute length of segmentation header */
+		*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+		csum_replace_by_diff(&l4.tcp->check,
+				     (__force __wsum)htonl(paylen));
+	} else {
+		/* compute length of segmentation header */
+		*hdr_len = sizeof(*l4.udp) + l4_offset;
+		csum_replace_by_diff(&l4.udp->check,
+				     (__force __wsum)htonl(paylen));
+	}
 
 	/* update gso size and bytecount with header size */
 	first->gso_segs = skb_shinfo(skb)->gso_segs;
@@ -10190,6 +10202,7 @@ ixgbe_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(mac_hdr_len > IXGBE_MAX_MAC_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_HW_VLAN_CTAG_TX |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -10198,6 +10211,7 @@ ixgbe_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(network_hdr_len > IXGBE_MAX_NETWORK_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -10907,7 +10921,7 @@ static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 				 IXGBE_GSO_PARTIAL_FEATURES;
 
 	if (hw->mac.type >= ixgbe_mac_82599EB)
-		netdev->features |= NETIF_F_SCTP_CRC;
+		netdev->features |= NETIF_F_SCTP_CRC | NETIF_F_GSO_UDP_L4;
 
 #ifdef CONFIG_IXGBE_IPSEC
 #define IXGBE_ESP_FEATURES	(NETIF_F_HW_ESP | \
-- 
2.7.4
Re: [PATCH 3/3] i40e: Add UDP segmentation offload support
On 10/9/19 5:39 PM, Samudrala, Sridhar wrote: On 10/9/2019 3:06 PM, Josh Hunt wrote: Based on a series from Alexander Duyck this change adds UDP segmentation offload support to the i40e driver. CC: Alexander Duyck CC: Willem de Bruijn Signed-off-by: Josh Hunt --- drivers/net/ethernet/intel/i40e/i40e_main.c | 1 + drivers/net/ethernet/intel/i40e/i40e_txrx.c | 12 +--- 2 files changed, 10 insertions(+), 3 deletions(-) diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c index 6031223eafab..56f8c52cbba1 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_main.c +++ b/drivers/net/ethernet/intel/i40e/i40e_main.c @@ -12911,6 +12911,7 @@ static int i40e_config_netdev(struct i40e_vsi *vsi) NETIF_F_GSO_IPXIP6 | NETIF_F_GSO_UDP_TUNNEL | NETIF_F_GSO_UDP_TUNNEL_CSUM | + NETIF_F_GSO_UDP_L4 | NETIF_F_SCTP_CRC | NETIF_F_RXHASH | NETIF_F_RXCSUM | diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c index e3f29dc8b290..0b32f04a6255 100644 --- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c +++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c @@ -2960,10 +2960,16 @@ static int i40e_tso(struct i40e_tx_buffer *first, u8 *hdr_len, /* remove payload length from inner checksum */ paylen = skb->len - l4_offset; - csum_replace_by_diff(&l4.tcp->check, (__force __wsum)htonl(paylen)); - /* compute length of segmentation header */ - *hdr_len = (l4.tcp->doff * 4) + l4_offset; + if (skb->csum_offset == offsetof(struct tcphdr, check)) { Isn't it more relevant to check for gso_type rather than base this on the csum_offset? Thanks Sridhar for the review. Yeah I think you're right. I will change this on all 3 patches. Josh
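Sridhar's suggestion — dispatch on the segmentation type the stack actually requested rather than inferring it from csum_offset — would look roughly like this. The flag values and names below are simplified stand-ins, not the kernel's SKB_GSO_* definitions:

```c
#include <assert.h>

/* Illustrative GSO type flags, stand-ins for the kernel's SKB_GSO_*. */
enum {
    GSO_TCPV4  = 1 << 0,
    GSO_TCPV6  = 1 << 4,
    GSO_UDP_L4 = 1 << 16,
};

/*
 * Keying off the gso_type bitmask states the intent directly: "this
 * skb is UDP GSO", instead of the indirect "its checksum field is at
 * the UDP offset".
 */
static int is_udp_seg(unsigned int gso_type)
{
    return (gso_type & GSO_UDP_L4) != 0;
}
```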
Re: [PATCH 2/3] ixgbe: Add UDP segmentation offload support
On 10/9/19 3:06 PM, Josh Hunt wrote: Repost from a series by Alexander Duyck to add UDP segmentation offload support to the igb driver: https://lore.kernel.org/netdev/20180504003916.4769.66271.stgit@localhost.localdomain/ CC: Alexander Duyck CC: Willem de Bruijn Signed-off-by: Josh Hunt --- drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 24 +++- 1 file changed, 19 insertions(+), 5 deletions(-) diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c index 1ce2397306b9..2b01d264e5ce 100644 --- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c +++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c @@ -7946,6 +7946,7 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring, } ip; union { struct tcphdr *tcp; + struct udphdr *udp; unsigned char *hdr; } l4; u32 paylen, l4_offset; @@ -7969,6 +7970,9 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring, l4.hdr = skb_checksum_start(skb); /* ADV DTYP TUCMD MKRLOC/ISCSIHEDLEN */ + type_tucmd = (skb->csum_offset == offsetof(struct tcphdr, check)) ? + IXGBE_ADVTXD_TUCMD_L4T_TCP : IXGBE_ADVTXD_TUCMD_L4T_UDP; + type_tucmd = IXGBE_ADVTXD_TUCMD_L4T_TCP; Copy/paste bug ^^. Will fix in v2. Josh
Re: [PATCH 0/3] igb, ixgbe, i40e UDP segmentation offload support
On 10/9/19 3:44 PM, Alexander Duyck wrote:
> On Wed, Oct 9, 2019 at 3:08 PM Josh Hunt wrote:
>> Alexander Duyck posted a series in 2018 proposing adding UDP segmentation
>> offload support to ixgbe and ixgbevf, but those patches were never
>> accepted:
>> https://lore.kernel.org/netdev/20180504003556.4769.11407.stgit@localhost.localdomain/
>>
>> This series is a repost of his ixgbe patch along with a similar change to
>> the igb and i40e drivers. Testing using the udpgso_bench_tx benchmark
>> shows a noticeable performance improvement with these changes applied.
>>
>> All #s below were run with:
>> udpgso_bench_tx -C 1 -4 -D 172.25.43.133 -z -l 30 -u -S 0 -s $pkt_size
>>
>> igb::
>> SW GSO (ethtool -K eth0 tx-udp-segmentation off):
>> $pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
>> 1472       120143.64   113   81263    81263   83.55  1.35
>> 2944       120160.09   114   40638    40638   62.88  1.81
>> 5888       120160.64   114   20319    20319   43.59  2.61
>> 11776      120160.76   114   10160    10160   37.52  3.03
>> 23552      120159.25   114   5080     5080    34.75  3.28
>> 47104      120160.55   114   2540     2540    32.83  3.47
>> 61824      120160.56   114   1935     1935    32.09  3.55
>>
>> HW GSO offload (ethtool -K eth0 tx-udp-segmentation on):
>> $pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
>> 1472       120144.65   113   81264    81264   83.03  1.36
>> 2944       120161.56   114   40638    40638   41     2.78
>> 5888       120160.23   114   20319    20319   23.76  4.79
>> 11776      120161.16   114   10160    10160   15.82  7.20
>> 23552      120156.45   114   5079     5079    12.8   8.90
>> 47104      120159.33   114   2540     2540    8.82   12.92
>> 61824      120158.43   114   1935     1935    8.24   13.83
>>
>> ixgbe::
>> SW GSO:
>> $pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
>> 1472       1070565.90  1015  724112   724112  100    10.15
>> 2944       1201579.19  1140  406342   406342  95.69  11.91
>> 5888       1201217.55  1140  203185   203185  55.38  20.58
>> 11776      1201613.49  1140  101588   101588  42.15  27.04
>> 23552      1201631.32  1140  50795    50795   35.97  31.69
>> 47104      1201626.38  1140  25397    25397   33.51  34.01
>> 61824      1201625.52  1140  19350    19350   32.83  34.72
>>
>> HW GSO Offload:
>> $pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
>> 1472       1058681.25  1004  715954   715954  100    10.04
>> 2944       1201730.86  1134  404254   404254  61.28  18.50
>> 5888       1201776.61  1131  201608   201608  30.25  37.38
>> 11776      1201795.90  1130  100676   100676  16.63  67.94
>> 23552      1201807.90  1129  50304    50304   10.07  112.11
>> 47104      1201748.35  1128  25143    25143   6.8    165.88
>> 61824      1200770.45  1128  19140    19140   5.38   209.66
>>
>> i40e::
>> SW GSO:
>> $pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
>> 1472       650122.83   616   439362   439362  100    6.16
>> 2944       943993.53   895   319042   319042  100    8.95
>> 5888       1199751.90  1138  202857   202857  82.51  13.79
>> 11776      1200288.08  1139  101477   101477  64.34  17.70
>> 23552      1201596.56  1140  50793    50793   59.74  19.08
>> 47104      1201597.98  1140  25396    25396   56.31  20.24
>> 61824      1201610.43  1140  19350    19350   55.48  20.54
>>
>> HW GSO offload:
>> $pkt_size  kB/s(sar)   MB/s  Calls/s  Msg/s   CPU    MB2CPU
>> 1472       657424.83   623   444653   444653  100    6.23
>> 2944       1201242.87  1139  406226   406226  91.45  12.45
>> 5888       1201739.95  1140  203199   203199  57.46  19.83
>> 11776      1201557.36  1140  101584   101584  36.83  30.95
>> 23552      1201525.17  1140  50790    50790   23.86  47.77
>> 47104      1201514.54  1140  25394    25394   17.45  65.32
>> 61824      1201478.91  1140  19348    19348   14.79  77.07
>>
>> I was not sure how to properly attribute Alexander on the ixgbe patch so
>> please adjust this as necessary.
>
> For the ixgbe patch I would be good with:
> Suggested-by: Alexander Duyck
>
> The big hurdle for this will be validation. I know that there are some
> parts such as the 82598 in the case of the ixgbe driver or 82575 in the
> case of igb that didn't support the feature
Re: [PATCH 0/3] igb, ixgbe, i40e UDP segmentation offload support
On 10/10/19 2:32 PM, Alexander Duyck wrote:
> On Thu, Oct 10, 2019 at 2:17 PM Josh Hunt wrote:
>> On 10/9/19 3:44 PM, Alexander Duyck wrote:
>>> On Wed, Oct 9, 2019 at 3:08 PM Josh Hunt wrote:
>>>> [cover letter and udpgso_bench_tx benchmark numbers snipped; quoted in
>>>> full in the earlier reply above]
>>>
>>> For the ixgbe patch I would be good with:
>>> Suggested-by: Alexander Duyck
>>>
>>> The big hurdle for this will be validation. I know that there are some
>>> parts such as the 82598
[PATCH v2 1/3] igb: Add UDP segmentation offload support
Based on a series from Alexander Duyck this change adds UDP segmentation
offload support to the igb driver.

CC: Alexander Duyck
CC: Willem de Bruijn
Signed-off-by: Josh Hunt
---
 drivers/net/ethernet/intel/igb/e1000_82575.h |  1 +
 drivers/net/ethernet/intel/igb/igb_main.c    | 23 +--
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_82575.h b/drivers/net/ethernet/intel/igb/e1000_82575.h
index 6ad775b1a4c5..63ec253ac788 100644
--- a/drivers/net/ethernet/intel/igb/e1000_82575.h
+++ b/drivers/net/ethernet/intel/igb/e1000_82575.h
@@ -127,6 +127,7 @@ struct e1000_adv_tx_context_desc {
 };

 #define E1000_ADVTXD_MACLEN_SHIFT    9      /* Adv ctxt desc mac len shift */
+#define E1000_ADVTXD_TUCMD_L4T_UDP   0x     /* L4 Packet TYPE of UDP */
 #define E1000_ADVTXD_TUCMD_IPV4      0x0400 /* IP Packet Type: 1=IPv4 */
 #define E1000_ADVTXD_TUCMD_L4T_TCP   0x0800 /* L4 Packet TYPE of TCP */
 #define E1000_ADVTXD_TUCMD_L4T_SCTP  0x1000 /* L4 packet TYPE of SCTP */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 105b0624081a..4131bc8b079e 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2516,6 +2516,7 @@ igb_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(mac_hdr_len > IGB_MAX_MAC_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_HW_VLAN_CTAG_TX |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -2524,6 +2525,7 @@ igb_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(network_hdr_len > IGB_MAX_NETWORK_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -3120,7 +3122,7 @@ static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		NETIF_F_HW_CSUM;

 	if (hw->mac.type >= e1000_82576)
-		netdev->features |= NETIF_F_SCTP_CRC;
+		netdev->features |= NETIF_F_SCTP_CRC | NETIF_F_GSO_UDP_L4;

 	if (hw->mac.type >= e1000_i350)
 		netdev->features |= NETIF_F_HW_TC;
@@ -5694,6 +5696,7 @@ static int igb_tso(struct igb_ring *tx_ring,
 	} ip;
 	union {
 		struct tcphdr *tcp;
+		struct udphdr *udp;
 		unsigned char *hdr;
 	} l4;
 	u32 paylen, l4_offset;
@@ -5713,7 +5716,8 @@ static int igb_tso(struct igb_ring *tx_ring,
 	l4.hdr = skb_checksum_start(skb);

 	/* ADV DTYP TUCMD MKRLOC/ISCSIHEDLEN */
-	type_tucmd = E1000_ADVTXD_TUCMD_L4T_TCP;
+	type_tucmd = (!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)) ?
+		     E1000_ADVTXD_TUCMD_L4T_TCP : E1000_ADVTXD_TUCMD_L4T_UDP;

 	/* initialize outer IP header fields */
 	if (ip.v4->version == 4) {
@@ -5741,12 +5745,19 @@ static int igb_tso(struct igb_ring *tx_ring,
 	/* determine offset of inner transport header */
 	l4_offset = l4.hdr - skb->data;

-	/* compute length of segmentation header */
-	*hdr_len = (l4.tcp->doff * 4) + l4_offset;
-
 	/* remove payload length from inner checksum */
 	paylen = skb->len - l4_offset;
-	csum_replace_by_diff(&l4.tcp->check, htonl(paylen));
+	if (type_tucmd & E1000_ADVTXD_TUCMD_L4T_TCP) {
+		/* compute length of segmentation header */
+		*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+		csum_replace_by_diff(&l4.tcp->check,
+				     (__force __wsum)htonl(paylen));
+	} else {
+		/* compute length of segmentation header */
+		*hdr_len = sizeof(*l4.udp) + l4_offset;
+		csum_replace_by_diff(&l4.udp->check,
+				     (__force __wsum)htonl(paylen));
+	}

 	/* update gso size and bytecount with header size */
 	first->gso_segs = skb_shinfo(skb)->gso_segs;
--
2.7.4
[PATCH v2 2/3] ixgbe: Add UDP segmentation offload support
Repost from a series by Alexander Duyck to add UDP segmentation offload
support to the igb driver:
https://lore.kernel.org/netdev/20180504003916.4769.66271.stgit@localhost.localdomain/

CC: Alexander Duyck
CC: Willem de Bruijn
Suggested-by: Alexander Duyck
Signed-off-by: Josh Hunt
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 1ce2397306b9..7d50c1a4a3be 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7946,6 +7946,7 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 	} ip;
 	union {
 		struct tcphdr *tcp;
+		struct udphdr *udp;
 		unsigned char *hdr;
 	} l4;
 	u32 paylen, l4_offset;
@@ -7969,7 +7970,8 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 	l4.hdr = skb_checksum_start(skb);

 	/* ADV DTYP TUCMD MKRLOC/ISCSIHEDLEN */
-	type_tucmd = IXGBE_ADVTXD_TUCMD_L4T_TCP;
+	type_tucmd = (!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)) ?
+		     IXGBE_ADVTXD_TUCMD_L4T_TCP : IXGBE_ADVTXD_TUCMD_L4T_UDP;

 	/* initialize outer IP header fields */
 	if (ip.v4->version == 4) {
@@ -7999,12 +8001,20 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 	/* determine offset of inner transport header */
 	l4_offset = l4.hdr - skb->data;

-	/* compute length of segmentation header */
-	*hdr_len = (l4.tcp->doff * 4) + l4_offset;
-
 	/* remove payload length from inner checksum */
 	paylen = skb->len - l4_offset;
+
+	if (type_tucmd & IXGBE_ADVTXD_TUCMD_L4T_TCP) {
+		/* compute length of segmentation header */
+		*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+		csum_replace_by_diff(&l4.tcp->check,
+				     (__force __wsum)htonl(paylen));
+	} else {
+		/* compute length of segmentation header */
+		*hdr_len = sizeof(*l4.udp) + l4_offset;
+		csum_replace_by_diff(&l4.udp->check,
+				     (__force __wsum)htonl(paylen));
+	}
-	csum_replace_by_diff(&l4.tcp->check, (__force __wsum)htonl(paylen));

 	/* update gso size and bytecount with header size */
 	first->gso_segs = skb_shinfo(skb)->gso_segs;
@@ -10190,6 +10200,7 @@ ixgbe_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(mac_hdr_len > IXGBE_MAX_MAC_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_HW_VLAN_CTAG_TX |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -10198,6 +10209,7 @@ ixgbe_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(network_hdr_len > IXGBE_MAX_NETWORK_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -10907,7 +10919,7 @@ static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		IXGBE_GSO_PARTIAL_FEATURES;

 	if (hw->mac.type >= ixgbe_mac_82599EB)
-		netdev->features |= NETIF_F_SCTP_CRC;
+		netdev->features |= NETIF_F_SCTP_CRC | NETIF_F_GSO_UDP_L4;

 #ifdef CONFIG_IXGBE_IPSEC
 #define IXGBE_ESP_FEATURES	(NETIF_F_HW_ESP | \
--
2.7.4
[PATCH v2 0/3] igb, ixgbe, i40e UDP segmentation offload support
[...] 91.67      1139  25394  25394  17.14  66.45
61824  1201598.88  1140  19350  19350  15.11  75.44

Josh Hunt (3):
  igb: Add UDP segmentation offload support
  ixgbe: Add UDP segmentation offload support
  i40e: Add UDP segmentation offload support

 drivers/net/ethernet/intel/i40e/i40e_main.c   |  1 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 12 +---
 drivers/net/ethernet/intel/igb/e1000_82575.h  |  1 +
 drivers/net/ethernet/intel/igb/igb_main.c     | 23 +--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 24 ++--
 5 files changed, 46 insertions(+), 15 deletions(-)

--
2.7.4
[PATCH v2 3/3] i40e: Add UDP segmentation offload support
Based on a series from Alexander Duyck this change adds UDP segmentation
offload support to the i40e driver.

CC: Alexander Duyck
CC: Willem de Bruijn
Signed-off-by: Josh Hunt
---
 drivers/net/ethernet/intel/i40e/i40e_main.c |  1 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 12 +---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 6031223eafab..56f8c52cbba1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -12911,6 +12911,7 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
 			  NETIF_F_GSO_IPXIP6		|
 			  NETIF_F_GSO_UDP_TUNNEL	|
 			  NETIF_F_GSO_UDP_TUNNEL_CSUM	|
+			  NETIF_F_GSO_UDP_L4		|
 			  NETIF_F_SCTP_CRC		|
 			  NETIF_F_RXHASH		|
 			  NETIF_F_RXCSUM		|
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e3f29dc8b290..2250284e27fd 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2960,10 +2960,16 @@ static int i40e_tso(struct i40e_tx_buffer *first, u8 *hdr_len,
 	/* remove payload length from inner checksum */
 	paylen = skb->len - l4_offset;
-	csum_replace_by_diff(&l4.tcp->check, (__force __wsum)htonl(paylen));
-	/* compute length of segmentation header */
-	*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+	if (!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)) {
+		csum_replace_by_diff(&l4.tcp->check, (__force __wsum)htonl(paylen));
+		/* compute length of segmentation header */
+		*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+	} else {
+		csum_replace_by_diff(&l4.udp->check, (__force __wsum)htonl(paylen));
+		/* compute length of segmentation header */
+		*hdr_len = sizeof(*l4.udp) + l4_offset;
+	}

 	/* pull values out of skb_shinfo */
 	gso_size = skb_shinfo(skb)->gso_size;
--
2.7.4
Re: [PATCH 0/3] igb, ixgbe, i40e UDP segmentation offload support
On 10/10/19 5:21 PM, Brown, Aaron F wrote:
> Adding Andrew as he is most likely going to be testing this patch.
> Unfortunately my mail server flags attached scripts as potential threats
> and strips them out. Can you resend it as a tar file? I don't believe
> it's smart enough to open up tar and flag it as a script.

Hi Aaron

It looks like the netdev archive has the file. Can you try grabbing it
from here?
https://lore.kernel.org/netdev/0e0e706c-4ce9-c27a-af55-339b4eb6d...@akamai.com/2-udpgso_bench.sh

If that doesn't work I can try your tar workaround.

Thanks
Josh
Re: [PATCH v2 1/3] igb: Add UDP segmentation offload support
On 10/11/19 8:29 AM, Alexander Duyck wrote:
> On Thu, 2019-10-10 at 20:25 -0400, Josh Hunt wrote:
>> Based on a series from Alexander Duyck this change adds UDP segmentation
>> offload support to the igb driver.
>>
>> [quoted patch trimmed to the hunk under discussion; the full v2 igb
>> patch appears earlier in this thread]
>>
>> @@ -5713,7 +5716,8 @@ static int igb_tso(struct igb_ring *tx_ring,
>>  	l4.hdr = skb_checksum_start(skb);
>>
>>  	/* ADV DTYP TUCMD MKRLOC/ISCSIHEDLEN */
>> -	type_tucmd = E1000_ADVTXD_TUCMD_L4T_TCP;
>> +	type_tucmd = (!(skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4)) ?
>> +		     E1000_ADVTXD_TUCMD_L4T_TCP : E1000_ADVTXD_TUCMD_L4T_UDP;
>
> The logic here seems a bit convoluted. Instead of testing for
> '!SKB_GSO_UDP_L4' why not just make L4T_UDP the first option and drop
> the '!'? That will make the TCP offload the default case rather than the
> UDP offload. The same applies to the other 2 patches.

OK sure. I think I was trying to keep the TCP case first for some reason,
but agree that I should probably just reverse the order. Will send a v3.

Josh
[PATCH v3 3/3] i40e: Add UDP segmentation offload support
Based on a series from Alexander Duyck this change adds UDP segmentation
offload support to the i40e driver.

CC: Alexander Duyck
CC: Willem de Bruijn
Signed-off-by: Josh Hunt
---
 drivers/net/ethernet/intel/i40e/i40e_main.c |  1 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c | 12 +---
 2 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 6031223eafab..56f8c52cbba1 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -12911,6 +12911,7 @@ static int i40e_config_netdev(struct i40e_vsi *vsi)
 			  NETIF_F_GSO_IPXIP6		|
 			  NETIF_F_GSO_UDP_TUNNEL	|
 			  NETIF_F_GSO_UDP_TUNNEL_CSUM	|
+			  NETIF_F_GSO_UDP_L4		|
 			  NETIF_F_SCTP_CRC		|
 			  NETIF_F_RXHASH		|
 			  NETIF_F_RXCSUM		|
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index e3f29dc8b290..b8496037ef7f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2960,10 +2960,16 @@ static int i40e_tso(struct i40e_tx_buffer *first, u8 *hdr_len,
 	/* remove payload length from inner checksum */
 	paylen = skb->len - l4_offset;
-	csum_replace_by_diff(&l4.tcp->check, (__force __wsum)htonl(paylen));
-	/* compute length of segmentation header */
-	*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+	if (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) {
+		csum_replace_by_diff(&l4.udp->check, (__force __wsum)htonl(paylen));
+		/* compute length of segmentation header */
+		*hdr_len = sizeof(*l4.udp) + l4_offset;
+	} else {
+		csum_replace_by_diff(&l4.tcp->check, (__force __wsum)htonl(paylen));
+		/* compute length of segmentation header */
+		*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+	}

 	/* pull values out of skb_shinfo */
 	gso_size = skb_shinfo(skb)->gso_size;
--
2.7.4
[PATCH v3 2/3] ixgbe: Add UDP segmentation offload support
Repost from a series by Alexander Duyck to add UDP segmentation offload
support to the igb driver:
https://lore.kernel.org/netdev/20180504003916.4769.66271.stgit@localhost.localdomain/

CC: Alexander Duyck
CC: Willem de Bruijn
Suggested-by: Alexander Duyck
Signed-off-by: Josh Hunt
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 24 ++--
 1 file changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 1ce2397306b9..6c9edd272c7a 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -7946,6 +7946,7 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 	} ip;
 	union {
 		struct tcphdr *tcp;
+		struct udphdr *udp;
 		unsigned char *hdr;
 	} l4;
 	u32 paylen, l4_offset;
@@ -7969,7 +7970,8 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 	l4.hdr = skb_checksum_start(skb);

 	/* ADV DTYP TUCMD MKRLOC/ISCSIHEDLEN */
-	type_tucmd = IXGBE_ADVTXD_TUCMD_L4T_TCP;
+	type_tucmd = (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) ?
+		     IXGBE_ADVTXD_TUCMD_L4T_UDP : IXGBE_ADVTXD_TUCMD_L4T_TCP;

 	/* initialize outer IP header fields */
 	if (ip.v4->version == 4) {
@@ -7999,12 +8001,20 @@ static int ixgbe_tso(struct ixgbe_ring *tx_ring,
 	/* determine offset of inner transport header */
 	l4_offset = l4.hdr - skb->data;

-	/* compute length of segmentation header */
-	*hdr_len = (l4.tcp->doff * 4) + l4_offset;
-
 	/* remove payload length from inner checksum */
 	paylen = skb->len - l4_offset;
-	csum_replace_by_diff(&l4.tcp->check, (__force __wsum)htonl(paylen));
+
+	if (type_tucmd & IXGBE_ADVTXD_TUCMD_L4T_TCP) {
+		/* compute length of segmentation header */
+		*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+		csum_replace_by_diff(&l4.tcp->check,
+				     (__force __wsum)htonl(paylen));
+	} else {
+		/* compute length of segmentation header */
+		*hdr_len = sizeof(*l4.udp) + l4_offset;
+		csum_replace_by_diff(&l4.udp->check,
+				     (__force __wsum)htonl(paylen));
+	}

 	/* update gso size and bytecount with header size */
 	first->gso_segs = skb_shinfo(skb)->gso_segs;
@@ -10190,6 +10200,7 @@ ixgbe_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(mac_hdr_len > IXGBE_MAX_MAC_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_HW_VLAN_CTAG_TX |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -10198,6 +10209,7 @@ ixgbe_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(network_hdr_len > IXGBE_MAX_NETWORK_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -10907,7 +10919,7 @@ static int ixgbe_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		IXGBE_GSO_PARTIAL_FEATURES;

 	if (hw->mac.type >= ixgbe_mac_82599EB)
-		netdev->features |= NETIF_F_SCTP_CRC;
+		netdev->features |= NETIF_F_SCTP_CRC | NETIF_F_GSO_UDP_L4;

 #ifdef CONFIG_IXGBE_IPSEC
 #define IXGBE_ESP_FEATURES	(NETIF_F_HW_ESP | \
--
2.7.4
[PATCH v3 0/3] igb, ixgbe, i40e UDP segmentation offload support
[...] 1140  50794  50794  24.23  47.04
47104  1201491.67  1139  25394  25394  17.14  66.45
61824  1201598.88  1140  19350  19350  15.11  75.44

Josh Hunt (3):
  igb: Add UDP segmentation offload support
  ixgbe: Add UDP segmentation offload support
  i40e: Add UDP segmentation offload support

 drivers/net/ethernet/intel/i40e/i40e_main.c   |  1 +
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   | 12 +---
 drivers/net/ethernet/intel/igb/e1000_82575.h  |  1 +
 drivers/net/ethernet/intel/igb/igb_main.c     | 23 +--
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 24 ++--
 5 files changed, 46 insertions(+), 15 deletions(-)

--
2.7.4
[PATCH v3 1/3] igb: Add UDP segmentation offload support
Based on a series from Alexander Duyck this change adds UDP segmentation
offload support to the igb driver.

CC: Alexander Duyck
CC: Willem de Bruijn
Signed-off-by: Josh Hunt
---
 drivers/net/ethernet/intel/igb/e1000_82575.h |  1 +
 drivers/net/ethernet/intel/igb/igb_main.c    | 23 +--
 2 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ethernet/intel/igb/e1000_82575.h b/drivers/net/ethernet/intel/igb/e1000_82575.h
index 6ad775b1a4c5..63ec253ac788 100644
--- a/drivers/net/ethernet/intel/igb/e1000_82575.h
+++ b/drivers/net/ethernet/intel/igb/e1000_82575.h
@@ -127,6 +127,7 @@ struct e1000_adv_tx_context_desc {
 };

 #define E1000_ADVTXD_MACLEN_SHIFT    9      /* Adv ctxt desc mac len shift */
+#define E1000_ADVTXD_TUCMD_L4T_UDP   0x     /* L4 Packet TYPE of UDP */
 #define E1000_ADVTXD_TUCMD_IPV4      0x0400 /* IP Packet Type: 1=IPv4 */
 #define E1000_ADVTXD_TUCMD_L4T_TCP   0x0800 /* L4 Packet TYPE of TCP */
 #define E1000_ADVTXD_TUCMD_L4T_SCTP  0x1000 /* L4 packet TYPE of SCTP */
diff --git a/drivers/net/ethernet/intel/igb/igb_main.c b/drivers/net/ethernet/intel/igb/igb_main.c
index 105b0624081a..b65e8a063aef 100644
--- a/drivers/net/ethernet/intel/igb/igb_main.c
+++ b/drivers/net/ethernet/intel/igb/igb_main.c
@@ -2516,6 +2516,7 @@ igb_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(mac_hdr_len > IGB_MAX_MAC_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_HW_VLAN_CTAG_TX |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -2524,6 +2525,7 @@ igb_features_check(struct sk_buff *skb, struct net_device *dev,
 	if (unlikely(network_hdr_len > IGB_MAX_NETWORK_HDR_LEN))
 		return features & ~(NETIF_F_HW_CSUM |
 				    NETIF_F_SCTP_CRC |
+				    NETIF_F_GSO_UDP_L4 |
 				    NETIF_F_TSO |
 				    NETIF_F_TSO6);
@@ -3120,7 +3122,7 @@ static int igb_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
 		NETIF_F_HW_CSUM;

 	if (hw->mac.type >= e1000_82576)
-		netdev->features |= NETIF_F_SCTP_CRC;
+		netdev->features |= NETIF_F_SCTP_CRC | NETIF_F_GSO_UDP_L4;

 	if (hw->mac.type >= e1000_i350)
 		netdev->features |= NETIF_F_HW_TC;
@@ -5694,6 +5696,7 @@ static int igb_tso(struct igb_ring *tx_ring,
 	} ip;
 	union {
 		struct tcphdr *tcp;
+		struct udphdr *udp;
 		unsigned char *hdr;
 	} l4;
 	u32 paylen, l4_offset;
@@ -5713,7 +5716,8 @@ static int igb_tso(struct igb_ring *tx_ring,
 	l4.hdr = skb_checksum_start(skb);

 	/* ADV DTYP TUCMD MKRLOC/ISCSIHEDLEN */
-	type_tucmd = E1000_ADVTXD_TUCMD_L4T_TCP;
+	type_tucmd = (skb_shinfo(skb)->gso_type & SKB_GSO_UDP_L4) ?
+		     E1000_ADVTXD_TUCMD_L4T_UDP : E1000_ADVTXD_TUCMD_L4T_TCP;

 	/* initialize outer IP header fields */
 	if (ip.v4->version == 4) {
@@ -5741,12 +5745,19 @@ static int igb_tso(struct igb_ring *tx_ring,
 	/* determine offset of inner transport header */
 	l4_offset = l4.hdr - skb->data;

-	/* compute length of segmentation header */
-	*hdr_len = (l4.tcp->doff * 4) + l4_offset;
-
 	/* remove payload length from inner checksum */
 	paylen = skb->len - l4_offset;
-	csum_replace_by_diff(&l4.tcp->check, htonl(paylen));
+	if (type_tucmd & E1000_ADVTXD_TUCMD_L4T_TCP) {
+		/* compute length of segmentation header */
+		*hdr_len = (l4.tcp->doff * 4) + l4_offset;
+		csum_replace_by_diff(&l4.tcp->check,
+				     (__force __wsum)htonl(paylen));
+	} else {
+		/* compute length of segmentation header */
+		*hdr_len = sizeof(*l4.udp) + l4_offset;
+		csum_replace_by_diff(&l4.udp->check,
+				     (__force __wsum)htonl(paylen));
+	}

 	/* update gso size and bytecount with header size */
 	first->gso_segs = skb_shinfo(skb)->gso_segs;
--
2.7.4
Re: [Intel-wired-lan] [PATCH v3 2/3] ixgbe: Add UDP segmentation offload support
On Thu, Oct 17, 2019 at 5:33 AM Bowers, AndrewX wrote:
> > -----Original Message-----
> > From: Intel-wired-lan [mailto:intel-wired-lan-boun...@osuosl.org] On
> > Behalf Of Josh Hunt
> > Sent: Friday, October 11, 2019 9:54 AM
> > To: netdev@vger.kernel.org; intel-wired-...@lists.osuosl.org; Kirsher, Jeffrey T
> > Cc: Duyck, Alexander H ; will...@google.com; Josh Hunt ; alexander.h.du...@linux.intel.com
> > Subject: [Intel-wired-lan] [PATCH v3 2/3] ixgbe: Add UDP segmentation offload support
> >
> > Repost from a series by Alexander Duyck to add UDP segmentation offload
> > support to the igb driver:
> > https://lore.kernel.org/netdev/20180504003916.4769.66271.stgit@localhost.localdomain/
> >
> > CC: Alexander Duyck
> > CC: Willem de Bruijn
> > Suggested-by: Alexander Duyck
> > Signed-off-by: Josh Hunt
> > ---
> >  drivers/net/ethernet/intel/ixgbe/ixgbe_main.c | 24 ++--
> >  1 file changed, 18 insertions(+), 6 deletions(-)
>
> Tested-by: Andrew Bowers

Andrew, thanks for testing. Were you able to validate the igb driver?

Thanks
--
Josh
Re: Crash when receiving FIN-ACK in TCP_FIN_WAIT1 state
On Mon, Oct 21, 2019 at 10:42 AM Subash Abhinov Kasiviswanathan wrote:
> > Please give us a pointer to the exact git tree and sha1.
> >
> > I do not analyze TCP stack problems without an exact starting point,
> > or at least a crystal ball, which I do not have.
>
> Hi Eric
>
> We are at this commit - Merge 4.19.75 into android-4.19-q.
> https://android.googlesource.com/kernel/common/+/cfe571e421ab873403a8f75413e04ecf5bf7e393
>
> --
> Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum,
> a Linux Foundation Collaborative Project

We were investigating a problem in this area a few months back (it turned
out to be something else), but were wondering if retrans_out, lost_out,
and sacked_out should be cleared in tcp_rtx_queue_purge()? I meant to get
back to it, but it got buried and this thread just jogged my memory...

--
Josh
Re: [PATCH firmware] rtl_nic: add firmware files for RTL8153
On Tue, Oct 22, 2019 at 11:40 PM Hayes Wang wrote:
> This adds the firmware for Realtek RTL8153 based USB Ethernet adapters.
>
> 1. Fix compatible issue for Asmedia hub.
> 2. Fix compatible issue for Compal platform.
> 3. Fix sometimes the device is lost after rebooting.
> 4. Improve the compatibility for EEE.
>
> Signed-off-by: Hayes Wang
> ---
>  WHENCE                | 11 +++
>  rtl_nic/rtl8153a-2.fw | Bin 0 -> 1768 bytes
>  rtl_nic/rtl8153a-3.fw | Bin 0 -> 1440 bytes
>  rtl_nic/rtl8153a-4.fw | Bin 0 -> 712 bytes
>  rtl_nic/rtl8153b-2.fw | Bin 0 -> 1088 bytes
>  5 files changed, 11 insertions(+)
>  create mode 100644 rtl_nic/rtl8153a-2.fw
>  create mode 100644 rtl_nic/rtl8153a-3.fw
>  create mode 100644 rtl_nic/rtl8153a-4.fw
>  create mode 100644 rtl_nic/rtl8153b-2.fw

Applied and pushed out.

josh
Re: [PATCH net] ixgbe: recognize 1000BaseLX SFP modules as 1Gbps
Bjørn Mork wrote:
> Not that it matters much I guess, but I think LX SFPs were unsupported
> at that time. The LX support appears to have been added under the radar
> while refactoring ixgbe_setup_sfp_modules_X550em in commit e23f33367882
> ("ixgbe: Fix 1G and 10G link stability for X550EM_x SFP+")

Looks like you're right. Want me to respin with an additional "Fixes" tag?

- Josh
[PATCH net] ixgbe: recognize 1000BaseLX SFP modules as 1Gbps
Add the two 1000BaseLX enum values to the X550's check for 1Gbps modules, allowing the core driver code to establish a link over this SFP type. This is done by the out-of-tree driver but the fix wasn't in mainline.

Fixes: 6a14ee0cfb19 ("ixgbe: Add X550 support function pointers")
Signed-off-by: Josh Elsasser
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
index 10dbaf4f6e80..9c42f741ed5e 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_x550.c
@@ -2262,7 +2262,9 @@ static s32 ixgbe_get_link_capabilities_X550em(struct ixgbe_hw *hw,
 	*autoneg = false;
 
 	if (hw->phy.sfp_type == ixgbe_sfp_type_1g_sx_core0 ||
-	    hw->phy.sfp_type == ixgbe_sfp_type_1g_sx_core1) {
+	    hw->phy.sfp_type == ixgbe_sfp_type_1g_sx_core1 ||
+	    hw->phy.sfp_type == ixgbe_sfp_type_1g_lx_core0 ||
+	    hw->phy.sfp_type == ixgbe_sfp_type_1g_lx_core1) {
 		*speed = IXGBE_LINK_SPEED_1GB_FULL;
 		return 0;
 	}
-- 
2.19.1
[PATCH net] rhashtable: avoid reschedule loop after rapid growth and shrink
When running workloads with large bursts of fragmented packets, we've seen a few machines stuck returning -EEXIST from rht_shrink() and endlessly rescheduling their hash table's deferred work, pegging a CPU core.

Root cause is commit da20420f83ea ("rhashtable: Add nested tables"), which stops ignoring the return code of rhashtable_shrink() and the reallocs used to grow the hashtable. This uncovers a bug in the shrink logic, where the "needs to shrink" check runs against the last table but the actual shrink operation runs on the first bucket_table in the hashtable (see below):

  +------+      +--------------+        +--------------+
  |  ht  |      | "first" tbl  |        |  "last" tbl  |
  | -tbl +----->| -future_tbl  +------->| -future_tbl  +---> NULL
  +------+      +--------------+        +--------------+
                       ^^^                     ^^^
          used by rhashtable_shrink()   used by rht_shrink_below_30()

A rehash then stalls out when the last table needs to shrink, the first table has more elements than the target size, and rht_shrink() hits a non-NULL future_tbl and returns -EEXIST. This skips the item rehashing and kicks off a reschedule loop, as no forward progress can be made while the rhashtable needs to shrink.

Extend rhashtable_shrink() with a "tbl" param to avoid endless exit-and-reschedules after hitting the EEXIST, allowing it to check a future_tbl pointer that can actually be non-NULL and make forward progress when the hashtable needs to shrink.

Fixes: da20420f83ea ("rhashtable: Add nested tables")
Signed-off-by: Josh Elsasser
---
 lib/rhashtable.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/rhashtable.c b/lib/rhashtable.c
index 852ffa5160f1..98e91f9544fa 100644
--- a/lib/rhashtable.c
+++ b/lib/rhashtable.c
@@ -377,9 +377,9 @@ static int rhashtable_rehash_alloc(struct rhashtable *ht,
  * It is valid to have concurrent insertions and deletions protected by per
  * bucket locks or concurrent RCU protected lookups and traversals.
  */
-static int rhashtable_shrink(struct rhashtable *ht)
+static int rhashtable_shrink(struct rhashtable *ht,
+			     struct bucket_table *old_tbl)
 {
-	struct bucket_table *old_tbl = rht_dereference(ht->tbl, ht);
 	unsigned int nelems = atomic_read(&ht->nelems);
 	unsigned int size = 0;
 
@@ -412,7 +412,7 @@ static void rht_deferred_worker(struct work_struct *work)
 	if (rht_grow_above_75(ht, tbl))
 		err = rhashtable_rehash_alloc(ht, tbl, tbl->size * 2);
 	else if (ht->p.automatic_shrinking && rht_shrink_below_30(ht, tbl))
-		err = rhashtable_shrink(ht);
+		err = rhashtable_shrink(ht, tbl);
 	else if (tbl->nest)
 		err = rhashtable_rehash_alloc(ht, tbl, tbl->size);
-- 
2.19.1
Re: [v2 PATCH] rhashtable: Still do rehash when we get EEXIST
On Jan 23, 2019, at 7:08 PM, Herbert Xu wrote: > Thanks for catching this! > > Although I think we should fix this in a different way. The problem > here is that the shrink cannot proceed because there was a previous > rehash that is still incomplete. We should wait for its completion > and then reattempt a shrinnk should it still be necessary. > > So something like this: SGTM. I can't test this right now because our VM server's down after a power outage this evening, but I tried a similar patch that swallowed the -EEXIST err and even with that oversight the hashtable dodged the reschedule loop. - Josh
Re: Bug in MACSec - stops passing traffic after approx 5TB
I've got wpa_supplicant working with macsec on Fedora, my test bed has shuffled 16 billion packets so far without interruption. I am a bit concerned that I've just pushed the resource exhaustion issue down the road though, looking at the output of ip macsec show I see four SAs for TX and RX, it appears to negotiate a new pair every 3 to 3.5 billion packets. It doesn't appear to be ripping down old SAs. What happens when available SA slots run out? Joshua Coombs GWI office 207-494-2140 www.gwi.net On Mon, Oct 15, 2018 at 11:45 AM Josh Coombs wrote: > > And confirmed, starting with a high packet number results in a very > short testbed run, 296 packets and then nothing, just as you surmised. > Sorry for raising the alarm falsely. Looks like I need to roll my own > build of wpa_supplicant as the ubuntu builds don't include the macsec > driver, haven't tested Gentoo's ebuilds yet to see if they do. > > Josh Coombs > > On Sun, Oct 14, 2018 at 4:52 PM Josh Coombs wrote: > > > > On Sun, Oct 14, 2018 at 4:24 PM Sabrina Dubroca > > wrote: > > > > > > 2018-10-14, 10:59:31 -0400, Josh Coombs wrote: > > > > I initially mistook this for a traffic control issue, but after > > > > stripping the test beds down to just the MACSec component, I can still > > > > replicate the issue. After approximately 5TB of transfer / 4 billion > > > > packets over a MACSec link it stops passing traffic. > > > > > > I think you're just hitting packet number exhaustion. After 2^32 > > > packets, the packet number would wrap to 0 and start being reused, > > > which breaks the crypto used by macsec. Before this point, you have to > > > add a new SA, and tell the macsec device to switch to it. > > > > I had not considered that, I naively thought as long as I didn't > > specify a replay window, it'd roll the PN over on it's own and life > > would be good. I'll test that theory tomorrow, should be easy to > > prove out. > > > > > That's why you should be using wpa_supplicant. 
It will monitor the > > > growth of the packet number, and handle the rekey for you. > > > > Thank you for the heads up, I'll read up on this as well. > > > > Josh C
Re: Bug in MACSec - stops passing traffic after approx 5TB
I see it reusing SAs, so I'm good. Joshua Coombs On Wed, Oct 17, 2018 at 9:45 AM Josh Coombs wrote: > > I've got wpa_supplicant working with macsec on Fedora, my test bed has > shuffled 16 billion packets so far without interruption. I am a bit > concerned that I've just pushed the resource exhaustion issue down the > road though, looking at the output of ip macsec show I see four SAs > for TX and RX, it appears to negotiate a new pair every 3 to 3.5 > billion packets. It doesn't appear to be ripping down old SAs. What > happens when available SA slots run out? > > Joshua Coombs > GWI > > office 207-494-2140 > www.gwi.net > > On Mon, Oct 15, 2018 at 11:45 AM Josh Coombs wrote: > > > > And confirmed, starting with a high packet number results in a very > > short testbed run, 296 packets and then nothing, just as you surmised. > > Sorry for raising the alarm falsely. Looks like I need to roll my own > > build of wpa_supplicant as the ubuntu builds don't include the macsec > > driver, haven't tested Gentoo's ebuilds yet to see if they do. > > > > Josh Coombs > > > > On Sun, Oct 14, 2018 at 4:52 PM Josh Coombs wrote: > > > > > > On Sun, Oct 14, 2018 at 4:24 PM Sabrina Dubroca > > > wrote: > > > > > > > > 2018-10-14, 10:59:31 -0400, Josh Coombs wrote: > > > > > I initially mistook this for a traffic control issue, but after > > > > > stripping the test beds down to just the MACSec component, I can still > > > > > replicate the issue. After approximately 5TB of transfer / 4 billion > > > > > packets over a MACSec link it stops passing traffic. > > > > > > > > I think you're just hitting packet number exhaustion. After 2^32 > > > > packets, the packet number would wrap to 0 and start being reused, > > > > which breaks the crypto used by macsec. Before this point, you have to > > > > add a new SA, and tell the macsec device to switch to it. 
> > > > > > I had not considered that, I naively thought as long as I didn't > > > specify a replay window, it'd roll the PN over on it's own and life > > > would be good. I'll test that theory tomorrow, should be easy to > > > prove out. > > > > > > > That's why you should be using wpa_supplicant. It will monitor the > > > > growth of the packet number, and handle the rekey for you. > > > > > > Thank you for the heads up, I'll read up on this as well. > > > > > > Josh C
Re: [PATCH linux-firmware] linux-firmware: liquidio: fix GPL compliance issue
On Wed, Oct 17, 2018 at 5:09 PM John W. Linville wrote: > > On Wed, Oct 17, 2018 at 07:34:42PM +, Manlunas, Felix wrote: > > On Fri, Sep 28, 2018 at 04:50:51PM -0700, Felix Manlunas wrote: > > > Part of the code inside the lio_vsw_23xx.bin firmware image is under GPL, > > > but the LICENCE.cavium file neglects to indicate that. However, > > > LICENCE.cavium does correctly specify the license that covers the other > > > Cavium firmware images that do not contain any GPL code. > > > > > > Fix the GPL compliance issue by adding a new file, > > > LICENCE.cavium_liquidio, > > > which correctly shows the GPL boilerplate. This new file specifies the > > > licenses for all liquidio firmware, including the ones that do not have > > > GPL code. > > > > > > Change the liquidio section of WHENCE to point to LICENCE.cavium_liquidio. > > > > > > Reported-by: Florian Weimer > > > Signed-off-by: Manish Awasthi > > > Signed-off-by: Manoj Panicker > > > Signed-off-by: Faisal Masood > > > Signed-off-by: Felix Manlunas > > > --- > > > LICENCE.cavium_liquidio | 429 > > > > > > WHENCE | 2 +- > > > 2 files changed, 430 insertions(+), 1 deletion(-) > > > create mode 100644 LICENCE.cavium_liquidio > > > > Hello Maintainers of linux-firmware.git, > > > > Any feedback about this patch? > > I would prefer to see an offer that included a defined URL for anyone > to download the source for the kernel in question without having to > announce themselves. The "send an email to i...@cavium.com" offer may > (or may not) be sufficient for the letter of the law. But it seems > both fragile and prone to subjective frustrations and delays for > users to obtain the sources at some future date. I agree with John here, but I also believe the patch is better than the current text in the upstream repo. I've committed it and pushed it out. If there are improvements to be made on source availability, we can take those in a different patch. Thank you for taking this seriously and responding quickly. josh
Re: [v2 PATCH] rhashtable: Still do rehash when we get EEXIST
On Jan 23, 2019, at 7:40 PM, Josh Elsasser wrote: > On Jan 23, 2019, at 7:08 PM, Herbert Xu wrote: > >> Thanks for catching this! >> >> Although I think we should fix this in a different way. The problem >> here is that the shrink cannot proceed because there was a previous >> rehash that is still incomplete. We should wait for its completion >> and then reattempt a shrinnk should it still be necessary. > > I can't test this right now because our VM server's down Got one of the poor little reproducer VMs back up and running and loaded up this patch. Works like a charm. For the v2 patch, you can add my: Tested-by: Josh Elsasser
[PATCH net] net: set default network namespace in init_dummy_netdev()
Assign a default net namespace to netdevs created by init_dummy_netdev(). Fixes a NULL pointer dereference caused by busy-polling a socket bound to an iwlwifi wireless device, which bumps the per-net BUSYPOLLRXPACKETS stat if napi_poll() received packets:

  BUG: unable to handle kernel NULL pointer dereference at 0190
  IP: napi_busy_loop+0xd6/0x200
  Call Trace:
   sock_poll+0x5e/0x80
   do_sys_poll+0x324/0x5a0
   SyS_poll+0x6c/0xf0
   do_syscall_64+0x6b/0x1f0
   entry_SYSCALL_64_after_hwframe+0x3d/0xa2

Fixes: 7db6b048da3b ("net: Commonize busy polling code to focus on napi_id instead of socket")
Signed-off-by: Josh Elsasser
---
 net/core/dev.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/core/dev.c b/net/core/dev.c
index 82f20022259d..d1043d49979c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -8712,7 +8712,9 @@ int init_dummy_netdev(struct net_device *dev)
 	set_bit(__LINK_STATE_PRESENT, &dev->state);
 	set_bit(__LINK_STATE_START, &dev->state);
 
+	/* napi_busy_loop stats accounting wants this */
+	dev_net_set(dev, &init_net);
+
 	/* Note : We dont allocate pcpu_refcnt for dummy devices,
 	 * because users of this 'device' dont need to change
 	 * its refcount.
-- 
2.19.1
Re: [Intel-wired-lan] [PATCH net] ixgbe: recognize 1000BaseLX SFP modules as 1Gbps
Jeff Kirsher wrote: > No need to re-spin the patch, just reply with the additional "Fixes" > tag and if patchwork does not pick it up, I will add it to the patch I > have in my tree for validation and review. Thanks, let's try that. Fixes: e23f33367882 ("ixgbe: Fix 1G and 10G link stability for X550EM_x SFP+")
Re: [PATCH linux-firmware] Mellanox: Add new mlxsw_spectrum firmware 13.1910.622
On Thu, Dec 13, 2018 at 3:45 PM Shalom Toledo wrote:
>
> This new firmware contains:
>  * New packet traps for discarded packets
>  * Secure firmware flash bug fix
>  * Fence mechanism bug fix
>  * TCAM RMA bug fix
>
> Signed-off-by: Shalom Toledo

Applied and pushed out.

josh
Possible bug in traffic control?
multicast off
sysctl -w net.ipv6.conf."$dif".autoconf=0
sysctl -w net.ipv6.conf."$dif".accept_ra=0
sysctl -w net.ipv4.conf."$dif".arp_ignore=8
sysctl -w net.ipv4.conf."$dif".rp_filter=0
ip link set "$eif" down promisc on arp off multicast off
sysctl -w net.ipv6.conf."$eif".autoconf=0
sysctl -w net.ipv6.conf."$eif".accept_ra=0
sysctl -w net.ipv4.conf."$eif".arp_ignore=8
sysctl -w net.ipv4.conf."$eif".rp_filter=0

# Set up traffic mirroring:
echo "* Start Port Mirror"
# sif to eif
tc qdisc add dev "$sif" ingress
tc filter add dev "$sif" parent : \
    protocol all \
    u32 match u8 0 0 \
    action mirred egress mirror dev "$eif"
# eif to sif
tc qdisc add dev "$eif" ingress
tc filter add dev "$eif" parent : \
    protocol all \
    u32 match u8 0 0 \
    action mirred egress mirror dev "$sif"

# Bring up the interfaces:
echo "* Light tunnel NICS"
ip link set "$sif" up
ip link set "$dif" up
ip link set "$eif" up
echo " --=[ MACSec Up ]=--"

---
Josh Coombs
Re: Possible bug in traffic control?
2.3 billion 1 byte packets failed to re-create the bug. To try and simplify the setup I removed macsec from the equation, using a single host in the middle as the bridge. Interestingly, rather than 1.3Gbits a second in both directions, it ran around 8Mbits a second. Switching the filter from u32 to matchall didn't change the performance. Going back to the four machine test bed, again removing macsec and just bridging through radically decreased the throughput to around 8Mbits. Flip on macsec for the bridge and 1.3Gbits? On Tue, Oct 9, 2018 at 11:58 AM Josh Coombs wrote: > > Hello all, I'm looking for some guidance in chasing what I believe to > be a bug in kernel traffic control filters. If I'm pinging the wrong > list let me know. > > I have a homebrew MACSec bridge setup using two pairs of PCs. I > establish a MACSec link between them, and then use TC to bridge a > second ethernet interface over the MACSec link. The second interface > is connected to a Juniper switch at each end, and I'm using LACP over > the links to bond them up for redundancy. It turns out I need that > redundancy as after awhile one pair of bridges will stop flowing > packets in one direction. I've since replicated this failure with a > group of VMs as well. > > My test setup to replicate the failure inside ESXi: > - Two MACSec bridge VMs, A and Z > - Two IPerf VMs, A and Z > My VMs are currently built using Ubuntu Server 18.04 to be quick, no > additional packages are required outside of iperf3. Kernel ver as > shipped currently is 4.15.0-36. I highly advise using a CPU with AES > instruction support as MACSec eats CPU without it and will take longer > to reproduce the symptoms. 
> > - A 'MACSec Bridge' network > - A 'A Side link' network > - A 'Z Side link' network > In ESXi I used a dedicated vSwitch, 9000 MTU (to allow full 1500 eth > packets + MACSec to pass on the bridge) and the security policy is > full open (allow promiscuous, allow forged, allow mac changes) as > we're abusing the networks as direct point to point links. If using > physical machines, just cable up, my example script bumps the MTU as > required. > > The MACSec boxes have two ethernet interfaces each. One pair is on > the MACSec Bridge network. The other interfaces go to the A and Z > IPerf boxes respectively via their dedicated networks. A and Z need > their interfaces configured with IPs in a common subnet, such as > 192.168.0.1/30 and 192.168.0.2/30. > > My script sets up MACSec, tweaks MTUs, and touches a few sysctls to > turn the involved interfaces into silent actors. It then uses TC to > start the actual bridging. From there I've been firing up iperf 3 > sessions in both directions between A and Z to hammer the bridge until > it fails. When it does, I can see packets stop being bridged in one > direction on one MACSec host, but not the other. The second host > continues to flow packets in both directions. Nothing is logged to > dmesg when this fault occurs. The fault seems to occur at roughly the > same packet / traffic amount each time. On my main application it's > after approximately 2.5TB of traffic (random mix of sizes) and with my > test bed it was after 5.5TB of 1500 byte packets. > > On the impacted MACSec node, watching interface packet counters via > ifconfig and actual traffic with tcpdump I can see packets coming in > MACSec and going out the host interface, the host reply coming in but > not showing up on the MACSec interface to cross the bridge. Clearing > out the tc filter and qdisc and re-adding does not restore traffic > flow. 
> > There is a PPA with 4.18 available for Ubuntu that I'm going to test > with next to see if that makes a difference in behavior. In the mean > time I'd appreciate any suggestions on how to diagnose this. > > My MACSec bridge setup script, update sif, dif, the keys and rxmac to > match your setup. The rxmac is the mac addy of the remote bridge > interface. Keys need to be flipped between systems. > --- > #!/bin/bash > > # Interfaces: > # sif = Ingress physical interface (Source) > # dif = Egress physical interface (Dest) > # eif = Encrypted interface > sif=eno2 > dif=enp1s0f0 > eif=macsec0 > > # MACSec Keys: > # txkey = Transmit (Local) key > # rxkey = Receive (Remote) key > # rxmac = Receive (Remote) MAC addy > txkey= > rxkey= > rxmac=00:11:22:33:44:55 > > # Use jumbo frames for macsec to allow full 1500 MTU passthrough: > echo "* MTU update" > ip link set "$sif" mtu 9000 > ip link set "$dif" mtu 9000 > > # Bring up macsec:
Re: Possible bug in traffic control?
I'm actually leaning towards macsec now. I'm at 6TB transferred in a double hop, no macsec over the bridge setup without triggering the fault. I'm going to let it continue to churn and set up a second testbed that JUST uses macsec without traffic control bridging to see if I can trip the issue there. That should determine if it's macsec itself, or an interaction between macsec and traffic control. Joshua Coombs GWI office 207-494-2140 www.gwi.net On Wed, Oct 10, 2018 at 12:39 PM Cong Wang wrote: > > On Wed, Oct 10, 2018 at 8:54 AM Josh Coombs wrote: > > > > 2.3 billion 1 byte packets failed to re-create the bug. To try and > > simplify the setup I removed macsec from the equation, using a single > > host in the middle as the bridge. Interestingly, rather than 1.3Gbits > > a second in both directions, it ran around 8Mbits a second. Switching > > the filter from u32 to matchall didn't change the performance. Going > > back to the four machine test bed, again removing macsec and just > > bridging through radically decreased the throughput to around 8Mbits. > > Flip on macsec for the bridge and 1.3Gbits? > > This is a great narrow down! We can rule out macsec for guilty. > > Can you share a minimum reproducer for this problem? If so I can take > a look.
Re: Possible bug in traffic control?
I've been able to narrow the scope down, the issue is with macsec itself. I setup two hosts with a macsec link between them, and let a couple iperf3 sessions blast traffic across. At approximately 4.2 billion packets / 6TB data transferred one end stopped transmitting packets. Doing a tcpdump on the impacted node's macsec0 interface shows packets coming in from the remote node, in this case arp requests, and arp replies from the local host, but watching the interface counters for macsec0 no packets are being recorded as transmitting. Again, nothing in dmesg implying an error. Deleting the macsec interface via ip link delete macsec0 and re-creating it gets traffic flowing again without a reboot. Meanwhile my traffic control bridge without macsec has shuffled 19TB via 22 billion packets and not skipped a beat, so it appears my initial assumption of it being the culprit was wrong. To replicate, setup two hosts with a direct ethernet link between each other. - Bring up macsec between the two hosts, setup a dedicated /30 on the link. - Use iperf3 or another traffic generating tool over the /30, one session for each direction. - Wait for traffic to stop. My test bed is on Ubuntu Server 18.04 currently, kernel 4.15.0-36. I'm going to spin up a vanilla kernel on 4.15 and then -current to see if this is an Ubuntu-ism from their patches, specific to 4.15, or a general issue with macsec. 
The script I used on each host (keys, rxmacs and IPs updated as appropriate):

#!/bin/bash

# Interfaces:
# dif = Egress physical interface (Dest)
# eif = Encrypted interface
dif=ens224
eif=macsec0

# MACSec Keys:
# txkey = Transmit (Local) key
# rxkey = Receive (Remote) key
# rxmac = Receive (Remote) MAC addy
txkey=60995924232808431491190820961556
rxkey=87345530111733181210202106249824
rxmac=00:0c:29:c5:95:df

# Clear any existing IP config
ifconfig $dif 0.0.0.0

# Bring up macsec:
echo "* Enable MACSec"
modprobe macsec
ip link add link "$dif" "$eif" type macsec
ip macsec add "$eif" tx sa 0 pn 1 on key 02 "$txkey"
ip macsec add "$eif" rx address "$rxmac" port 1
ip macsec add "$eif" rx address "$rxmac" port 1 sa 0 pn 1 on key 01 "$rxkey"
ip link set "$eif" type macsec encrypt on

# Bring up the interfaces:
echo "* Light tunnel NICS"
ip link set "$dif" up
ip link set "$eif" up

# Set IP
ifconfig $eif 192.168.211.1/30

echo " --=[ MACSec Up ]=--"

On Thu, Oct 11, 2018 at 10:05 AM Josh Coombs wrote: > > I'm actually leaning towards macsec now. I'm at 6TB transferred in a > double hop, no macsec over the bridge setup without triggering the > fault. I'm going to let it continue to churn and setup a second > testbed that JUST uses macsec without traffic control bridging to see > if I can trip the issue there.That should determine if it's macsec > itself, or an interaction between macsec and traffic control. > > Joshua Coombs > GWI > > office 207-494-2140 > www.gwi.net > > On Wed, Oct 10, 2018 at 12:39 PM Cong Wang wrote: > > > > On Wed, Oct 10, 2018 at 8:54 AM Josh Coombs wrote: > > > > > > 2.3 billion 1 byte packets failed to re-create the bug. To try and > > > simplify the setup I removed macsec from the equation, using a single > > > host in the middle as the bridge. Interestingly, rather than 1.3Gbits > > > a second in both directions, it ran around 8Mbits a second. Switching > > > the filter from u32 to matchall didn't change the performance.
Going > > > back to the four machine test bed, again removing macsec and just > > > bridging through radically decreased the throughput to around 8Mbits. > > > Flip on macsec for the bridge and 1.3Gbits? > > > > This is a great narrow down! We can rule out macsec for guilty. > > > > Can you share a minimum reproducer for this problem? If so I can take > > a look.
Bug in MACSec - stops passing traffic after approx 5TB
I initially mistook this for a traffic control issue, but after stripping the test beds down to just the MACSec component, I can still replicate the issue. After approximately 5TB of transfer / 4 billion packets over a MACSec link it stops passing traffic. I have replicated this now on both Ubuntu Server 18.04 with their patched 4.15 kernel and Gentoo with a vanilla 4.18.13 kernel. As noted before, my test setup consists of two machines with a direct ethernet connection to each other. MACSec is set up on the link, along with a /30 network. I then hammer the link with iperf3 in both directions. On a gigabit link it takes me one to two days to trip the bug. Nothing is logged to dmesg. If I remove and re-add the MACSec link, or reboot the machines, traffic flow resumes. I can replicate on physical hardware or in VMs. How should I proceed to help diag and correct this? To replicate, set up two machines with a direct ethernet connection. If simulating in ESXi, set up a dedicated vSwitch; allow promiscuous, forged and MAC changes should be enabled, MTU increased to 9000; then set up a dedicated port group and VLAN to simulate the direct connection.
I use the following script to set up the MACSec connection, adjusting keys, rxmac, interface and IPs as appropriate:

---
The script I used on each host (keys, rxmacs and IPs updated as appropriate):

#!/bin/bash

# Interfaces:
# dif = Egress physical interface (Dest)
# eif = Encrypted interface
dif=ens224
eif=macsec0

# MACSec Keys:
# txkey = Transmit (Local) key
# rxkey = Receive (Remote) key
# rxmac = Receive (Remote) MAC addy
txkey=60995924232808431491190820961556
rxkey=87345530111733181210202106249824
rxmac=00:0c:29:c5:95:df

# Clear any existing IP config
ifconfig $dif 0.0.0.0

# Bring up macsec:
echo "* Enable MACSec"
modprobe macsec
ip link add link "$dif" "$eif" type macsec
ip macsec add "$eif" tx sa 0 pn 1 on key 02 "$txkey"
ip macsec add "$eif" rx address "$rxmac" port 1
ip macsec add "$eif" rx address "$rxmac" port 1 sa 0 pn 1 on key 01 "$rxkey"
ip link set "$eif" type macsec encrypt on

# Bring up the interfaces:
echo "* Light tunnel NICS"
ip link set "$dif" up
ip link set "$eif" up

# Set IP
ifconfig $eif 192.168.211.1/30

Once you can ping across the link, use iperf3 or a similar network stress tool to flood the link with traffic in both directions and wait for the bug to trigger.

Josh Coombs
Re: Bug in MACSec - stops passing traffic after approx 5TB
On Sun, Oct 14, 2018 at 4:24 PM Sabrina Dubroca wrote: > > 2018-10-14, 10:59:31 -0400, Josh Coombs wrote: > > I initially mistook this for a traffic control issue, but after > > stripping the test beds down to just the MACSec component, I can still > > replicate the issue. After approximately 5TB of transfer / 4 billion > > packets over a MACSec link it stops passing traffic. > > I think you're just hitting packet number exhaustion. After 2^32 > packets, the packet number would wrap to 0 and start being reused, > which breaks the crypto used by macsec. Before this point, you have to > add a new SA, and tell the macsec device to switch to it. I had not considered that, I naively thought as long as I didn't specify a replay window, it'd roll the PN over on it's own and life would be good. I'll test that theory tomorrow, should be easy to prove out. > That's why you should be using wpa_supplicant. It will monitor the > growth of the packet number, and handle the rekey for you. Thank you for the heads up, I'll read up on this as well. Josh C
Re: Bug in MACSec - stops passing traffic after approx 5TB
And confirmed, starting with a high packet number results in a very short testbed run, 296 packets and then nothing, just as you surmised. Sorry for raising the alarm falsely. Looks like I need to roll my own build of wpa_supplicant as the ubuntu builds don't include the macsec driver, haven't tested Gentoo's ebuilds yet to see if they do. Josh Coombs On Sun, Oct 14, 2018 at 4:52 PM Josh Coombs wrote: > > On Sun, Oct 14, 2018 at 4:24 PM Sabrina Dubroca wrote: > > > > 2018-10-14, 10:59:31 -0400, Josh Coombs wrote: > > > I initially mistook this for a traffic control issue, but after > > > stripping the test beds down to just the MACSec component, I can still > > > replicate the issue. After approximately 5TB of transfer / 4 billion > > > packets over a MACSec link it stops passing traffic. > > > > I think you're just hitting packet number exhaustion. After 2^32 > > packets, the packet number would wrap to 0 and start being reused, > > which breaks the crypto used by macsec. Before this point, you have to > > add a new SA, and tell the macsec device to switch to it. > > I had not considered that, I naively thought as long as I didn't > specify a replay window, it'd roll the PN over on it's own and life > would be good. I'll test that theory tomorrow, should be easy to > prove out. > > > That's why you should be using wpa_supplicant. It will monitor the > > growth of the packet number, and handle the rekey for you. > > Thank you for the heads up, I'll read up on this as well. > > Josh C
Re: [PATCH 1/1] bnx2x: Add FW 7.13.11.0.
On Sat, Feb 9, 2019 at 11:06 PM Rahul Verma wrote: > > From: Rahul Verma > > This patch adds new FW for bnx2x, which adds the following: > - TX VLAN filtering support. > - Enable TPA only for packets without VLAN. > > It also addresses few critical issues, > - Fairness algorithm misbehaviour when minimum bandwidth configured >for all PFs. > - Error recovery issue on TAPE devices. > - FW not discarding FIP frames that are not designated to PF. > - Kernel driver initialization failure after preboot driver. > - VxLAN stops working after sending inner IP fragmented traffic. > - Issues in the following FW flows: > SD VLAN update, TX packet drop, packet padding flow, vlan add/remove. > > Signed-off-by: Sudarsana Reddy Kalluru > Signed-off-by: Ariel Elior > Signed-off-by: Rahul Verma Applied and pushed out. josh
Re: pull-request Cavium liquidio firmware v1.7.2
On Wed, May 9, 2018 at 12:54 AM Felix Manlunas wrote:
> The following changes since commit 8fc2d4e55685bf73b6f7752383da9067404a74bb:
>
>   Merge git://git.marvell.com/mwifiex-firmware (2018-05-07 09:09:40 -0400)
>
> are available in the git repository at:
>
>   https://github.com/felix-cavium/linux-firmware.git for-upstreaming-v1.7.2
>
> for you to fetch changes up to d3b6941e1a85cbff895a92aa9e36b50deaeac970:
>
>   linux-firmware: liquidio: update firmware to v1.7.2 (2018-05-08 19:02:41 -0700)
>
> Signed-off-by: Felix Manlunas
>
> Felix Manlunas (1):
>   linux-firmware: liquidio: update firmware to v1.7.2
>
>  WHENCE                     | 8
>  liquidio/lio_210nv_nic.bin | Bin 1265368 -> 1281464 bytes
>  liquidio/lio_210sv_nic.bin | Bin 1163128 -> 1179352 bytes
>  liquidio/lio_23xx_nic.bin  | Bin 1271456 -> 1287264 bytes
>  liquidio/lio_410nv_nic.bin | Bin 1265368 -> 1281464 bytes
>  5 files changed, 4 insertions(+), 4 deletions(-)

Pulled and pushed out. Thanks.

josh
Re: [PATCH] bpf: Tweak BPF jump table optimizations for objtool compatibility
On Wed, May 06, 2020 at 05:03:57PM -0700, Alexei Starovoitov wrote: > > > > > diff --git a/include/linux/compiler-gcc.h > > > > > b/include/linux/compiler-gcc.h > > > > > index d7ee4c6bad48..05104c3cc033 100644 > > > > > --- a/include/linux/compiler-gcc.h > > > > > +++ b/include/linux/compiler-gcc.h > > > > > @@ -171,4 +171,4 @@ > > > > > #define __diag_GCC_8(s) > > > > > #endif > > > > > > > > > > -#define __no_fgcse __attribute__((optimize("-fno-gcse"))) > > > > > +#define __no_fgcse > > > > > __attribute__((optimize("-fno-gcse,-fno-omit-frame-pointer"))) > > > > > -- > > > > > 2.23.0 > > > > > > > > > > I've tested it with gcc 8,9,10 and clang 11 with FP=y and with ORC=y. > > > > > All works. > > > > > I think it's safer to go with frame pointers even for ORC=y > > > > > considering > > > > > all the pain this issue had caused. Even if objtool gets confused > > > > > again > > > > > in the future __bpf_prog_run() will have frame pointers and kernel > > > > > stack > > > > > unwinding can fall back from ORC to FP for that frame. > > > > > wdyt? > > > > > > > > It seems dangerous to me. The GCC manual recommends against it. > > > > > > The manual can says that it's broken. That won't stop the world from > > > using it. > > > Just google projects that are using it. For example: qt, lz4, unreal > > > engine, etc > > > Telling compiler to disable gcse via flag is a guaranteed way to avoid > > > that optimization that breaks objtool whereas messing with C code is > > > nothing > > > but guess work. gcc can still do gcse. > > > > But the manual's right, it is broken. How do you know other important > > flags won't also be stripped? > > What flags are you worried about? > I've checked that important things like -mno-red-zone, -fsanitize are > preserved. It's not any specific flags I'm worried about, it's all of them. There are a lot of possibilities, with all the different configs, and arches. 
Flags are usually added for a good reason, so one randomly missing flag
could have unforeseen results. And I don't have any visibility into how
GCC decides which flags to drop, and when. But the docs aren't comforting.

Even if things seem to work now, that could (silently) change at any point
in time. This time objtool warned about the missing frame pointer, but
that's not necessarily going to happen for other flags.

If we go this route, I would much rather do -fno-gcse on a file-wide basis.

-- 
Josh
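[Editorial note: for readers unfamiliar with the attribute being debated, a
minimal sketch in plain C follows. This is illustrative code, not from the
kernel; the function and macro names are invented. Per the thread, GCC's
"optimize" attribute replaces the function's optimization state rather than
appending to the command-line flags, which is the hazard Josh describes.]

```c
#include <assert.h>

/* The per-function form being debated: GCC's "optimize" attribute.
 * Per the GCC docs it starts from a blank slate rather than appending
 * to the command-line options, so other flags may be silently dropped.
 * Clang parses but ignores this attribute, hence the guard. */
#if defined(__GNUC__) && !defined(__clang__)
#define NO_GCSE __attribute__((optimize("-fno-gcse")))
#else
#define NO_GCSE
#endif

/* Illustrative function, not kernel code. */
static long NO_GCSE sum(const long *v, int n)
{
	long total = 0;
	for (int i = 0; i < n; i++)
		total += v[i];
	return total;
}
```

The file-wide alternative Josh prefers would drop the attribute entirely and
pass the flag for the whole translation unit instead, e.g. `gcc -fno-gcse
core.c`, or in kbuild terms something along the lines of
`CFLAGS_core.o += -fno-gcse` in the Makefile (a sketch of the idea, not a
line from this patch).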
[PATCH] bpf: Tweak BPF jump table optimizations for objtool compatibility
Objtool decodes instructions and follows all potential code branches
within a function.  But it's not an emulator, so it doesn't track
register values.  For that reason, it usually can't follow
intra-function indirect branches, unless they're using a jump table
which follows a certain format (e.g., GCC switch statement jump tables).

In most cases, the generated code for the BPF jump table looks a lot
like a GCC jump table, so objtool can follow it.  However, with
RETPOLINE=n, GCC keeps the jump table address in a register, and then
does 160+ indirect jumps with it.  When objtool encounters the indirect
jumps, it can't tell which jump table is being used (or even whether
they might be sibling calls instead).

This was fixed before by disabling an optimization in ___bpf_prog_run(),
using the "optimize" function attribute.  However, that attribute is bad
news.  It doesn't append options to the command-line arguments.  Instead
it starts from a blank slate.  And according to recent GCC documentation
it's not recommended for production use.  So revert the previous fix:

  3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")

With that reverted, solve the original problem in a different way by
getting rid of the "goto select_insn" indirection, and instead just goto
the jump table directly.  This simplifies the code a bit and helps GCC
generate saner code for the jump table branches, at least in the
RETPOLINE=n case.

But, in the RETPOLINE=y case, this simpler code actually causes GCC to
generate far worse code, ballooning the function text size by +40%.  So
leave that code the way it was.  In fact Alexei prefers to leave *all*
the code the way it was, except where needed by objtool.  So even
non-x86 RETPOLINE=n code will continue to have "goto select_insn".

This stuff is crazy voodoo, and far from ideal.  But it works for now.
Eventually, there's a plan to create a compiler plugin for annotating
jump tables.  That will make this a lot less fragile.
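[Editorial note: the "direct" jump-table dispatch described above can be
sketched in plain C with GCC's labels-as-values extension. This is a toy
interpreter, not the kernel's ___bpf_prog_run(); the opcodes, struct, and
function names are invented. It shows the RETPOLINE=n form, where every
CONT jumps through the table itself rather than through a shared label.]

```c
#include <assert.h>

/* Toy opcodes and instruction format, invented for illustration. */
enum { OP_ADD, OP_MUL, OP_EXIT };
struct insn { int code; long imm; };

static long run(const struct insn *insn)
{
	/* Jump table of label addresses (GCC/Clang "labels as values"). */
	static const void *jumptable[] = {
		[OP_ADD]  = &&ADD,
		[OP_MUL]  = &&MUL,
		[OP_EXIT] = &&EXIT,
	};
	long acc = 0;

/* "Direct" dispatch: each CONT is its own indirect jump through the
 * table, analogous to the RETPOLINE=n branch added by the patch. */
#define CONT ({ insn++; goto *jumptable[insn->code]; })

	goto *jumptable[insn->code];

ADD:	acc += insn->imm; CONT;
MUL:	acc *= insn->imm; CONT;
EXIT:	return acc;
#undef CONT
}
```

With many handlers (the real interpreter has 160+), this form gives GCC one
indirect jump per CONT site, which is what objtool sees and, when the table
address stays in a register, cannot resolve.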
Fixes: 3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
Reported-by: Randy Dunlap
Reported-by: Arnd Bergmann
Signed-off-by: Josh Poimboeuf
---
 include/linux/compiler-gcc.h   |  2 --
 include/linux/compiler_types.h |  4 ----
 kernel/bpf/core.c              | 10 +++++++---
 3 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/include/linux/compiler-gcc.h b/include/linux/compiler-gcc.h
index cf294faec2f8..2c8583eb5de8 100644
--- a/include/linux/compiler-gcc.h
+++ b/include/linux/compiler-gcc.h
@@ -176,5 +176,3 @@
 #else
 #define __diag_GCC_8(s)
 #endif
-
-#define __no_fgcse __attribute__((optimize("-fno-gcse")))
diff --git a/include/linux/compiler_types.h b/include/linux/compiler_types.h
index e970f97a7fcb..58105f1deb79 100644
--- a/include/linux/compiler_types.h
+++ b/include/linux/compiler_types.h
@@ -203,10 +203,6 @@ struct ftrace_likely_data {
 #define asm_inline asm
 #endif

-#ifndef __no_fgcse
-# define __no_fgcse
-#endif
-
 /* Are two types/vars the same type (ignoring qualifiers)? */
 #define __same_type(a, b) __builtin_types_compatible_p(typeof(a), typeof(b))
diff --git a/kernel/bpf/core.c b/kernel/bpf/core.c
index 916f5132a984..eec470c598ad 100644
--- a/kernel/bpf/core.c
+++ b/kernel/bpf/core.c
@@ -1364,7 +1364,7 @@ u64 __weak bpf_probe_read_kernel(void *dst, u32 size, const void *unsafe_ptr)
  *
  * Decode and execute eBPF instructions.
  */
-static u64 __no_fgcse ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
+static u64 ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
 {
 #define BPF_INSN_2_LBL(x, y)	[BPF_##x | BPF_##y] = &&x##_##y
 #define BPF_INSN_3_LBL(x, y, z)	[BPF_##x | BPF_##y | BPF_##z] = &&x##_##y##_##z
@@ -1384,11 +1384,15 @@ static u64 __no_fgcse ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
 #undef BPF_INSN_2_LBL
 	u32 tail_call_cnt = 0;

+#if defined(CONFIG_X86_64) && !defined(CONFIG_RETPOLINE)
+#define CONT	 ({ insn++; goto *jumptable[insn->code]; })
+#define CONT_JMP ({ insn++; goto *jumptable[insn->code]; })
+#else
 #define CONT	 ({ insn++; goto select_insn; })
 #define CONT_JMP ({ insn++; goto select_insn; })
-
 select_insn:
 	goto *jumptable[insn->code];
+#endif

 	/* ALU */
 #define ALU(OPCODE, OP)		\
@@ -1547,7 +1551,7 @@ static u64 __no_fgcse ___bpf_prog_run(u64 *regs, const struct bpf_insn *insn, u64 *stack)
 	 * where arg1_type is ARG_PTR_TO_CTX.
 	 */
 	insn = prog->insnsi;
-	goto select_insn;
+	CONT;
 out:
 	CONT;
 }
-- 
2.21.1
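[Editorial note: for contrast with the direct form, the "select_insn"
indirection that the patch keeps for the RETPOLINE=y and non-x86 cases
funnels every CONT through a single dispatch point. Again a toy sketch with
invented opcodes and names, not kernel code.]

```c
#include <assert.h>

enum { OP_ADD, OP_MUL, OP_EXIT };	/* invented opcodes */
struct insn { int code; long imm; };

static long run_indirect(const struct insn *insn)
{
	static const void *jumptable[] = {
		[OP_ADD]  = &&ADD,
		[OP_MUL]  = &&MUL,
		[OP_EXIT] = &&EXIT,
	};
	long acc = 0;

/* All CONTs funnel through one label, so the function contains exactly
 * one indirect jump -- the form kept when RETPOLINE=y, where each
 * indirect jump would otherwise become a costly retpoline thunk. */
#define CONT ({ insn++; goto select_insn; })

select_insn:
	goto *jumptable[insn->code];

ADD:	acc += insn->imm; CONT;
MUL:	acc *= insn->imm; CONT;
EXIT:	return acc;
#undef CONT
}
```

Both forms compute the same result; they differ only in how many indirect
jump sites the compiler emits, which is what objtool and retpolines care
about.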
Re: [PATCH] bpf: Tweak BPF jump table optimizations for objtool compatibility
On Fri, May 01, 2020 at 12:09:30PM -0700, Alexei Starovoitov wrote:
> On Thu, Apr 30, 2020 at 02:07:43PM -0500, Josh Poimboeuf wrote:
> > Objtool decodes instructions and follows all potential code branches
> > within a function.  But it's not an emulator, so it doesn't track
> > register values.  For that reason, it usually can't follow
> > intra-function indirect branches, unless they're using a jump table
> > which follows a certain format (e.g., GCC switch statement jump tables).
> >
> > In most cases, the generated code for the BPF jump table looks a lot
> > like a GCC jump table, so objtool can follow it.  However, with
> > RETPOLINE=n, GCC keeps the jump table address in a register, and then
> > does 160+ indirect jumps with it.  When objtool encounters the indirect
> > jumps, it can't tell which jump table is being used (or even whether
> > they might be sibling calls instead).
> >
> > This was fixed before by disabling an optimization in ___bpf_prog_run(),
> > using the "optimize" function attribute.  However, that attribute is bad
> > news.  It doesn't append options to the command-line arguments.  Instead
> > it starts from a blank slate.  And according to recent GCC documentation
> > it's not recommended for production use.  So revert the previous fix:
> >
> >   3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
> >
> > With that reverted, solve the original problem in a different way by
> > getting rid of the "goto select_insn" indirection, and instead just goto
> > the jump table directly.  This simplifies the code a bit and helps GCC
> > generate saner code for the jump table branches, at least in the
> > RETPOLINE=n case.
> >
> > But, in the RETPOLINE=y case, this simpler code actually causes GCC to
> > generate far worse code, ballooning the function text size by +40%.  So
> > leave that code the way it was.  In fact Alexei prefers to leave *all*
> > the code the way it was, except where needed by objtool.  So even
> > non-x86 RETPOLINE=n code will continue to have "goto select_insn".
> >
> > This stuff is crazy voodoo, and far from ideal.  But it works for now.
> > Eventually, there's a plan to create a compiler plugin for annotating
> > jump tables.  That will make this a lot less fragile.
>
> I don't like this commit log.
> Here you're saying that the code recognized by objtool is sane and good
> whereas well optimized gcc code is somehow voodoo and bad.
> That is just wrong.

I have no idea what you're talking about.  Are you saying that ballooning
the function text size by 40% is well optimized GCC code?  It seems like a
bug to me.  That's the only place I said anything bad about GCC code.

When I said "this stuff is crazy voodoo" I was referring to the patch
itself.  I agree it's horrible; it's just the best approach we're able to
come up with at the moment.

If any of that isn't clear then can you provide specifics about what seems
wrong?

-- 
Josh