Hi,

On Tue, Nov 29, 2016, at 18:15, David Lebrun wrote:
> When multiple nexthops are available for a given route, the routing
> engine chooses a nexthop by computing the flow hash through
> get_hash_from_flowi6 and by taking that value modulo the number of
> nexthops. The resulting value indexes the nexthop to select. This
> method causes issues when a new nexthop is added or one is removed
> (e.g. link failure). In that case, the number of nexthops changes and
> potentially all the flows get re-routed to another nexthop.
>
> This patch implements a consistent hash method to select the nexthop
> in case of ECMP. The idea is to generate K slices (or intervals) for
> each route with multiple nexthops. The nexthops are randomly assigned
> to those slices, in a uniform manner. The number K is configurable
> through a sysctl net.ipv6.route.ecmp_slices and is always a power of
> 2. To select the nexthop, the algorithm takes the flow hash and
> computes an index which is the flow hash modulo K. As K = 2^x, the
> modulo can be computed using a simple binary AND operation
> (idx = hash & (K - 1)). The resulting index references the selected
> nexthop. The lookup time complexity is thus O(1).
>
> When a nexthop is added, it steals K/N slices from the other nexthops,
> where N is the new number of nexthops. The slices are stolen randomly
> and uniformly from the other nexthops. When a nexthop is removed, the
> orphan slices are randomly reassigned to the other nexthops.
>
> The number of slices for a route also fixes the maximum number of
> nexthops possible for that route.
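To make sure I read the mechanism right, here is a rough userspace
sketch of what you describe (all names, the slice element size and the
fixed SLICE_SHIFT below are mine for illustration, not taken from the
patch): one flat array of K = 2^x slices per multipath route, indexed
with hash & (K - 1); a new nexthop steals K/N slices at random, and a
removed nexthop's orphan slices are handed to random survivors.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define SLICE_SHIFT	6			/* x: K = 2^x slices */
#define NSLICES		(1u << SLICE_SHIFT)	/* K */
#define MAX_NH		32

static uint8_t slices[NSLICES];		/* slice index -> nexthop id */
static uint8_t live[MAX_NH];		/* ids of currently live nexthops */
static unsigned int nr_nh;

/* O(1) lookup: hash modulo K via a binary AND */
static unsigned int select_nexthop(uint32_t flow_hash)
{
	return slices[flow_hash & (NSLICES - 1)];
}

/* a new nexthop steals K/N slices, chosen at random, from the others */
static void add_nexthop(uint8_t id)
{
	unsigned int want, stolen = 0;

	live[nr_nh++] = id;
	if (nr_nh == 1) {			/* first nexthop owns every slice */
		memset(slices, id, sizeof(slices));
		return;
	}
	want = NSLICES / nr_nh;
	while (stolen < want) {
		unsigned int i = rand() & (NSLICES - 1);

		if (slices[i] != id) {
			slices[i] = id;
			stolen++;
		}
	}
}

/*
 * Orphaned slices are randomly reassigned to the surviving nexthops.
 * (Sketch only: assumes id is live and at least one nexthop remains.)
 */
static void del_nexthop(uint8_t id)
{
	unsigned int i, j;

	for (i = 0; i < nr_nh; i++)
		if (live[i] == id)
			break;
	live[i] = live[--nr_nh];

	for (j = 0; j < NSLICES; j++)
		if (slices[j] == id)
			slices[j] = live[rand() % nr_nh];
}

int main(void)
{
	add_nexthop(1);
	add_nexthop(2);
	add_nexthop(3);
	printf("hash 0x12345678 -> nexthop %u\n", select_nexthop(0x12345678));
	del_nexthop(2);
	printf("hash 0x12345678 -> nexthop %u\n", select_nexthop(0x12345678));
	return 0;
}

The point being: every route with ECMP nexthops carries such an array,
and its size is purely a function of the sysctl.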
In the worst case this causes 2GB (order 19) allocations (x == 31) to
happen in GFP_ATOMIC context (due to the write lock) and could cause
routing table updates to fail because of memory fragmentation. Are you
sure the upper limit of 31 is reasonable? I would very much prefer an
upper limit of 25 or below for x, to stay within the bounds of the slab
allocators (which is still a lot and probably causes allocation
errors!). Unfortunately, because of the nature of the sysctl, you can't
really create a dedicated cache for it. :/

Also, by design, one day this should all be RCU, and having that much
data outstanding during routing table mutation worries me a bit.

I am a fan of consistent hashing, but I am not so sure it belongs in a
generic ECMP implementation rather than in its own ipvs or netfilter
module, where you specifically know how much memory to burn for it.

Also, please convert the sysctl to a netlink attribute if you pursue
this: if I change the sysctl while my quagga is hammering the routing
table, I would like to know which nodes allocate what amount of memory.

Bye,
Hannes
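P.S.: To spell out the arithmetic behind the 2GB figure (assuming one
byte per slice, which is what the number implies): x == 31 gives
K = 2^31 slices = 2^31 bytes = 2GB, i.e. 2^19 pages of 4KB, hence an
order-19 allocation.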