On Wed, Apr 29, 2015 at 4:07 PM, Anuradha Karuppiah
<anurad...@cumulusnetworks.com> wrote:
> On Wed, Apr 29, 2015 at 3:13 PM, Stephen Hemminger
> <step...@networkplumber.org> wrote:
>> On Mon, 27 Apr 2015 10:38:21 -0700
>> anurad...@cumulusnetworks.com wrote:
>>
>>> From: Anuradha Karuppiah <anurad...@cumulusnetworks.com>
>>>
>>> This patch introduces an IFF_PROTO_DOWN flag that can be used by
>>> user space applications to notify drivers that errors have been
>>> detected on the device.
>>>
>>> Signed-off-by: Anuradha Karuppiah <anurad...@cumulusnetworks.com>
>>> Signed-off-by: Andy Gospodarek <go...@cumulusnetworks.com>
>>> Signed-off-by: Roopa Prabhu <ro...@cumulusnetworks.com>
>>> Signed-off-by: Wilson Kok <w...@cumulusnetworks.com>
>>
>> I worry that adding another bit to an already complex state API
>> will break userspace.
>>
>> There are lots of things besides iproute2 which look at those
>> flags including routing daemons (quagga), network manager, netplugd,
>> and switch controllers.
>
> Yes, I understand your concerns here. And tried to work around introducing
> a separate error flag by clearing IFF_UP on proto_down/detecting errors (as
> Scott also brought up earlier).
>
> That implementation failed because of the following reasons -
> 1. There is no way to disambiguate between admin_down (!IFF_UP) and an
>    APP/driver enforced error_down (IFF_PROTO_DOWN). Administrator or
>    automation-scripts that monitor the config assumed that switch-port
>    configuration had somehow fallen out of sync (and attempted to reinstate
>    the admin_up repeatedly).
>
> 2. Automatic error recovery was not possible; consider the following scenario
>    for e.g.
>    a. The MLAG peer-link is down so the MLAG app on the secondary switch has
>       proto_down’ed all the MLAG ports (including switch-port swp1) by
>       clearing IFF_UP.
>    b. At the same time the administrator is in the process of making some
>       changes on the network connected to swp1. To avoid doing it live he
>       would admin_disable swp1 (!IFF_UP) by doing an "ip link set swp1 down"
>       (this is a no-op as event #a has already cleared IFF_UP on swp1).
>    c. If the MLAG peer-link recovers at this point the MLAG app on the
>       secondary switch would try to automatically recover the MLAG ports
>       by clearing proto_down (i.e. setting IFF_UP); including on swp1. Doing
>       that overrides the administrator’s directive to keep swp1 admin_down.
>       Overriding an admin-down in a live network can be very dangerous so it
>       is not possible to do auto-error-recovery unless we have a way to
>       disambiguate between the admin and error states.
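
To make scenario 2 concrete, here is a minimal sketch that replays events
a, b and c against a toy port model. It is only an illustration, not kernel
code: struct port, its two fields and all three functions are hypothetical
names made up for this example, and admin_up merely stands in for IFF_UP.

```c
#include <stdbool.h>

/* Hypothetical port model for illustration only; this is not the
 * kernel's struct net_device and the field names are assumptions. */
struct port {
    bool admin_up;    /* owned by the administrator (stands in for IFF_UP) */
    bool proto_down;  /* owned by the application, e.g. the MLAG app */
};

/* A port carries traffic only if the admin enabled it and no
 * application is holding it down. */
static bool port_is_operational(const struct port *p)
{
    return p->admin_up && !p->proto_down;
}

/* Single-flag scheme: the app reports errors by clearing admin_up. */
static bool replay_single_flag(void)
{
    struct port swp1 = { .admin_up = true, .proto_down = false };

    swp1.admin_up = false;  /* a. peer-link down: app clears the flag */
    swp1.admin_up = false;  /* b. admin sets swp1 down: a no-op */
    swp1.admin_up = true;   /* c. peer-link recovers: app sets the flag */

    return port_is_operational(&swp1);
}

/* Two-flag scheme: each party only ever touches its own bit. */
static bool replay_separate_flags(void)
{
    struct port swp1 = { .admin_up = true, .proto_down = false };

    swp1.proto_down = true;   /* a. peer-link down: app sets proto_down */
    swp1.admin_up = false;    /* b. admin sets swp1 down: now recorded */
    swp1.proto_down = false;  /* c. peer-link recovers: app clears its bit */

    return port_is_operational(&swp1);
}
```

replay_single_flag() returns true, i.e. swp1 comes back up against the
administrator's wishes, while replay_separate_flags() returns false because
the admin-down from step b survives the app's recovery in step c.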

I need to disambiguate the error state, but it doesn't have to be an
IFF_X attribute.

Stephen, do you think it would be more easily consumable if it were a
new/separate net_device attribute instead of being a new bit in
"&struct net_device flags"?