On Thu, 30 Apr 2020 12:45:49 -0700
Jakub Kicinski <k...@kernel.org> wrote:

> On Thu, 30 Apr 2020 13:42:22 +0200 Jesper Dangaard Brouer wrote:
> > Currently if the default qdisc setup/init fails, the device ends up with
> > qdisc "noop", which causes all TX packets to get dropped.
> > 
> > With the introduction of sysctl net/core/default_qdisc it is possible
> > to change the default qdisc to be more advanced, which opens for the
> > possibility that Qdisc_ops->init() can fail.
> > 
> > This patch detect these kind of failures, and choose to fallback to
> > qdisc "noqueue", which is so simple that its init call will not fail.
> > This allows the interface to continue functioning.
> > 
> > V2:
> > As this also captures memory failures, which are transient, the
> > device is not kept in IFF_NO_QUEUE state.  This allows the net_device
> > to retry to default qdisc assignment.
> > 
> > Signed-off-by: Jesper Dangaard Brouer <bro...@redhat.com>  
> 
> I have mixed feelings about this one, I wonder if I'm the only one.
> Seems like failure to allocate the default qdisc is pretty critical,
> the log message may be missed, especially in the boot time noise.
> 
> I think a WARN_ON() is in order here, I'd personally just replace the
> netdev_info with a WARN_ON, without the fallback.

It is good that we agree that failing to set up the default qdisc is
pretty critical.  I guess we disagree on whether we should (1) keep the
network functioning in a degraded state, or (2) drop all packets on the
net_device so that people notice.

This change proposes (1), keeping the box functioning.  For me it was a
pretty bad experience when I pushed a new kernel over the network to my
embedded box and then lost all network connectivity.  Fortunately I had
serial console access (as this was not an OpenWRT box but a full devel
board), so I could debug, but I could no longer upgrade the kernel.  I
clearly noticed, as the box was not operational, but I guess most
people would just give up at this point.  (Imagine a small OpenWRT box
whose config sets default_qdisc to fq_codel, which bricks the box
because it cannot allocate memory.)

I hope that people will notice this degraded state when they start to
transfer data to the device, because running 'noqueue' on a physical
device results in net_crit_ratelimited() messages like these:

 [86971.609318] Virtual device eth0 asks to queue packet!
 [86971.622183] Virtual device eth0 asks to queue packet!
 [86971.627510] Virtual device eth0 asks to queue packet!
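
To make the two options concrete, here is a rough userspace model of the
policy being discussed.  It is only a sketch and not the kernel code:
QdiscInitError, attach_default_qdisc() and the log text are made-up
names for illustration, and init_default stands in for whatever
Qdisc_ops->init() the default_qdisc sysctl selects.

```python
class QdiscInitError(Exception):
    """Hypothetical stand-in for a failing Qdisc_ops->init(), e.g. -ENOMEM."""


def attach_default_qdisc(init_default, log, fallback=True):
    """Return the name of the qdisc the device ends up with.

    init_default: callable returning a qdisc name, or raising QdiscInitError.
    log:          callable taking one message string.
    fallback:     True models option (1), False models option (2).
    """
    try:
        return init_default()
    except QdiscInitError as err:
        if not fallback:
            # Option (2): device is left with "noop", all TX packets drop.
            log(f"default qdisc init failed ({err})")
            return "noop"
        # Option (1): "noqueue" is simple enough that its init cannot
        # fail, so the interface keeps working in a degraded state.
        log(f"default qdisc init failed ({err}), fallback to noqueue")
        return "noqueue"
```

Nothing is remembered between calls, which mirrors the V2 note in the
patch description: a transient failure such as an allocation error does
not pin the device to the fallback, so a later attempt can still end up
with the configured default qdisc.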

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer
