roopa <ro...@cumulusnetworks.com> writes: > On 11/3/15, 7:08 AM, Robert Shearman wrote: >> On 03/11/15 05:13, Roopa Prabhu wrote: >>> From: Roopa Prabhu <ro...@cumulusnetworks.com> >>> >>> Adds support for RTNH_F_DEAD and RTNH_F_LINKDOWN flags on mpls >>> routes due to link events. Also adds code to ignore dead >>> routes during route selection >>> >>> Signed-off-by: Roopa Prabhu <ro...@cumulusnetworks.com> >>> --- >>> RFC to v1: >>> Addressed a few comments from Eric and Robert: >>> - remove support for weighted nexthops >>> - Use rt_nhn_alive in the rt structure to keep count of alive >>> routes. >>> What i have not done is: sort nexthops on link events. >>> I am not comfortable recreating or sorting nexthops on >>> every carrier change. This leaves scope for optimizing in the >>> future >>> >>> v1 to v2: >>> Fix dead nexthop checks as suggested by dave >>> >>> net/mpls/af_mpls.c | 191 >>> ++++++++++++++++++++++++++++++++++++++++++++-------- >>> net/mpls/internal.h | 3 + >>> 2 files changed, 166 insertions(+), 28 deletions(-) >>> >>> diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c >>> index c70d750..5e88118 100644 >>> --- a/net/mpls/af_mpls.c >>> +++ b/net/mpls/af_mpls.c >>> @@ -96,22 +96,15 @@ bool mpls_pkt_too_big(const struct sk_buff *skb, >>> unsigned int mtu) >>> } >>> EXPORT_SYMBOL_GPL(mpls_pkt_too_big); >>> >>> -static struct mpls_nh *mpls_select_multipath(struct mpls_route *rt, >>> - struct sk_buff *skb, bool bos) >>> +static u32 mpls_multipath_hash(struct mpls_route *rt, >>> + struct sk_buff *skb, bool bos) >>> { >>> struct mpls_entry_decoded dec; >>> struct mpls_shim_hdr *hdr; >>> bool eli_seen = false; >>> int label_index; >>> - int nh_index = 0; >>> u32 hash = 0; >>> >>> - /* No need to look further into packet if there's only >>> - * one path >>> - */ >>> - if (rt->rt_nhn == 1) >>> - goto out; >>> - >>> for (label_index = 0; label_index < MAX_MP_SELECT_LABELS && !bos; >>> label_index++) { >>> if (!pskb_may_pull(skb, sizeof(*hdr) * label_index)) >>> @@ -165,9 +158,37 @@ static struct mpls_nh *mpls_select_multipath(struct >>> mpls_route *rt, >>> } >>> } >>> >>> - nh_index = hash % rt->rt_nhn; >>> + return hash; >>> +} >>> + >>> +static struct mpls_nh *mpls_select_multipath(struct mpls_route *rt, >>> + struct sk_buff *skb, bool bos) >>> +{ >>> + u32 hash = 0; >>> + int nh_index; >>> + int n = 0; >>> + >>> + /* No need to look further into packet if there's only >>> + * one path >>> + */ >>> + if (rt->rt_nhn == 1) >>> + goto out; >>> + >>> + if (rt->rt_nhn_alive <= 0) >>> + return NULL; >>> + >>> + hash = mpls_multipath_hash(rt, skb, bos); >>> + nh_index = hash % rt->rt_nhn_alive; >>> + for_nexthops(rt) { >> >> This should be optimised in the common rt->rt_nhn_alive == rt->rt_nhn case >> by doing the direct array index and avoid the nh walk. > > sure, that is an optimization. i will add that. >> >>> + if (nh->nh_flags & (RTNH_F_DEAD | RTNH_F_LINKDOWN)) >>> + continue; >>> + if (n == nh_index) >>> + return nh; >>> + n++; >>> + } endfor_nexthops(rt); >>> + >>> out: >>> - return &rt->rt_nh[nh_index]; >>> + return &rt->rt_nh[0]; >>> } >>> >>> static bool mpls_egress(struct mpls_route *rt, struct sk_buff *skb, >>> @@ -365,6 +386,7 @@ static struct mpls_route *mpls_rt_alloc(int num_nh, u8 >>> max_alen) >>> GFP_KERNEL); >>> if (rt) { >>> rt->rt_nhn = num_nh; >>> + rt->rt_nhn_alive = num_nh; >>> rt->rt_max_alen = max_alen_aligned; >>> } >>> >>> @@ -536,6 +558,15 @@ static int mpls_nh_assign_dev(struct net *net, struct >>> mpls_route *rt, >>> >>> RCU_INIT_POINTER(nh->nh_dev, dev); >>> >>> + if (!netif_carrier_ok(dev)) >>> + nh->nh_flags |= RTNH_F_LINKDOWN; >>> + >>> + if (!(dev->flags & IFF_UP)) >>> + nh->nh_flags |= RTNH_F_DEAD; >> >> Below you're testing IFF_RUNNING | IFF_LOWER_UP. Is the difference intended >> here? > Not really. I can change it. This is during adding the route. I tried to keep > the checks consistent with ipv4. > >>> + >>> + if (nh->nh_flags & (RTNH_F_DEAD | RTNH_F_LINKDOWN)) >>> + rt->rt_nhn_alive--; >> >> You don't update your new rt->rt_flags field here. This highlights the issue >> with duplicating data, which you're doing with the rt_flags field. > I am not sure I understand completely. Would you prefer i loop and derive the > rt_flags from nh_flags during dumps than storing them in rt_flags ?. Sure I > can do that. I did not think it was such a big deal because, all routes (ipv4 > and others) have rt_flags and all routes today carry RTNH_ flags and you have > to send them to userspace in rtm_flags anyways. > I am just trying to keep the use and semantics of RTNH_F flags for mpls > routes consistent with the other family routes. > >> >>> + >>> return 0; >>> >>> errout: >>> @@ -577,7 +608,7 @@ errout: >>> } >>> >>> static int mpls_nh_build(struct net *net, struct mpls_route *rt, >>> - struct mpls_nh *nh, int oif, >>> + struct mpls_nh *nh, int oif, int hops, >> >> This change isn't required. > ack, this is a leftover from my attempt to add weights. will remove it. >> >>> struct nlattr *via, struct nlattr *newdst) >>> { >>> int err = -ENOMEM; >>> @@ -666,7 +697,7 @@ static int mpls_nh_build_multi(struct mpls_route_config >>> *cfg, >>> /* neither weighted multipath nor any flags >>> * are supported >>> */ >>> - if (rtnh->rtnh_hops || rtnh->rtnh_flags) >>> + if (rtnh->rtnh_flags || rtnh->rtnh_flags) >> >> As the build bot has pointed out, this change is not entirely correct. > yes, this was fixed later. >> >>> goto errout; >>> >>> attrlen = rtnh_attrlen(rtnh); >>> @@ -681,8 +712,8 @@ static int mpls_nh_build_multi(struct mpls_route_config >>> *cfg, >>> goto errout; >>> >>> err = mpls_nh_build(cfg->rc_nlinfo.nl_net, rt, nh, >>> - rtnh->rtnh_ifindex, nla_via, >>> - nla_newdst); >>> + rtnh->rtnh_ifindex, rtnh->rtnh_hops, >>> + nla_via, nla_newdst); >> >> This change isn't required. > yep, ack. same here left over from weights. will remove it. > >> >>> if (err) >>> goto errout; >>> >>> @@ -875,34 +906,100 @@ free: >>> return ERR_PTR(err); >>> } >>> >>> -static void mpls_ifdown(struct net_device *dev) >>> +static void mpls_ifdown(struct net_device *dev, int event) >>> { >>> struct mpls_route __rcu **platform_label; >>> struct net *net = dev_net(dev); >>> - struct mpls_dev *mdev; >>> unsigned index; >>> + int dead; >>> >>> platform_label = rtnl_dereference(net->mpls.platform_label); >>> for (index = 0; index < net->mpls.platform_labels; index++) { >>> struct mpls_route *rt = rtnl_dereference(platform_label[index]); >>> + >>> if (!rt) >>> continue; >>> + dead = 0; >>> for_nexthops(rt) { >>> + if ((event == NETDEV_DOWN && >>> + (nh->nh_flags & RTNH_F_DEAD)) || >>> + (event == NETDEV_CHANGE && >>> + (nh->nh_flags & RTNH_F_LINKDOWN))) { >>> + dead++; >>> + continue; >>> + } >>> + >>> if (rtnl_dereference(nh->nh_dev) != dev) >>> continue; >>> - nh->nh_dev = NULL; >>> + switch (event) { >>> + case NETDEV_DOWN: >>> + case NETDEV_UNREGISTER: >>> + nh->nh_flags |= RTNH_F_DEAD; >>> + /* fall through */ >>> + case NETDEV_CHANGE: >>> + nh->nh_flags |= RTNH_F_LINKDOWN; >>> + rt->rt_nhn_alive--; >> >> Are these operations atomic on all architectures? > > I think so.
Ugh. I don't see how. A) You are doing read-modify-write. B) You are modifying multiple fields. So while the individual field writes may be atomic the changes certainly are not atomic. >> Even if they are, I'm a bit worried about the RCU-correctness of this. I >> think the fallthrough case in mpls_select_multipath saves you, causing >> traffic to get via nh0 instead of the one selected by the hash (which is >> fine transiently). >> I share your concern. In-place modification sounds good in principle for handling RCU, but there is a reason why the original version of RCU was called Read-Copy-Update. Given that there are multiple fields that seem to depend upon each other I am not certain we can perform rcu safe modifications without creating a fresh copy of the route. Eric -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html