This series attempts to improve xfrm policy lookup performance when many
(several hundred or even thousands of) inexact policies exist on a system.
On insert, a policy is either placed in the hash table (all direct
policies: /32 for ipv4, /128 for ipv6, plus all policies matching a
user-configured prefix-length threshold).  All other policies get inserted
into the inexact list in priority order.  Lookup then scans the inexact
list for the first matching entry.

This series instead makes each inexact policy go into exactly one of four
different search list classes:

1. The "any:any" list, containing policies where both saddr and daddr are
   wildcards or have very coarse prefixes, e.g. 10.0.0.0/8 and the like.

2. The "saddr:any" list, containing policies with a fixed saddr/prefixlen,
   but without destination restrictions.  These lists are stored in rbtree
   nodes; each node contains those policies matching saddr/prefixlen.

3. The "any:daddr" list.  Similar to 2), except for policies where only
   the destinations are specified.

4. The "saddr:daddr" lists, containing policies that match the given
   source/destination network.  The root of the saddr/daddr tree is
   stored in the nodes of the 'daddr' tree.

This diagram illustrates the list classes, and their placement in the
lookup hierarchy:

   xfrm_pol_inexact_bin = hash(dir,type,family,if_id);
    |
    +---- root_d: sorted by daddr:prefix
    |      |
    |      xfrm_pol_inexact_node
    |           |
    |           +- root: sorted by saddr/prefix
    |           |    |
    |           |    xfrm_pol_inexact_node
    |           |         |
    |           |         + root: unused
    |           |         |
    |           |         + hhead: saddr:daddr policies
    |           |
    |           +- coarse policies and all any:daddr policies
    |
    +---- root_s: sorted by saddr:prefix
    |      |
    |      xfrm_pol_inexact_node
    |           |
    |           + root: unused
    |           |
    |           + hhead: saddr:any policies
    |
    +---- coarse policies and all any:any policies

First we obtain the inexact bin by hashing direction, family and the
other 'must match' criteria.  Next, we search root_d using the packet's
destination address.  If we find a matching node, we search that node's
source address tree using the packet's source address.  We also search
the bin's root_s tree using the packet's source address.
This initial lookup creates a 'candidate set' consisting of up to four
hlist heads, one per search class.  Each list gets scanned for the first
matching entry.  Out of those, the one with the highest priority is
chosen.  In case several policies have the same priority, the most
recent one is used, to match old behaviour.

Locking rules:
Insert requires the net->xfrm.xfrm_policy_lock spinlock held with BH off
(no change from the current kernel).  Insert/removal of a tree node
needs to increment its sequence count.
Lookup requires the rcu read lock; lookups in a tree additionally need
to sample the sequence count of the bin the tree is located in, and
retry in case it has changed.

Results on my kvm test setup, using 4 netns:

# 1.2   1.1    3.1   3.10      2.1   2.2
# eth1  eth1   veth0 veth0     eth1  eth1
# ns1 ---- ns3 ----- ns4 ---- ns2

ns3 and ns4 are connected via an ipsec tunnel.  'Dummy policies' below
means that I created 1000 non-matching policies in both ns3 and ns4.

20 parallel netperf (mbits, unidirectional TCP_STREAM):
normal:  993.518 (no dummy policies)  796.320 (with dummy policies)
patched: 991.335 (no dummy policies)  989.684 (with dummy policies)

20 parallel TCP_RR, transactions per second:
normal:  32413.93 (no dummy policies)  17725.97 (with dummy policies)
patched: 32314.51 (no dummy policies)  32326.19 (with dummy policies)

Caveats:

1. This won't help unless a large part of the policies have no common
   saddr/daddr pairs.  Extreme example: adding 10k
   0.0.0.0/0 -> 0.0.0.0/0 policies may require a search of all
   policies, just like the current kernel.

2. Consider the policy add sequence
      a  daddr 10.0.0.0/26
      b  daddr 10.0.0.64/26
      c  daddr 10.0.0.0/24
   When c gets added, the search nodes for a and b have to be merged
   into the new node, as c's subnet contains both.  This also means
   that a single policy with a small subnet value can cause severe tree
   degradation and thus longer search lists.

3. Causes a (small) performance degradation when the setup has only a
   few policies.

4. Policies need an additional hlist, because MIGRATE doesn't consider
   if_id, while the patchset makes if_id part of the initial hash
   lookup.  This could be avoided by not considering if_id in the
   initial bin lookup, but that increases list lengths a lot when there
   are many xfrm interfaces.

5. In order to solve the 'two matching candidates have the same
   priority, which takes precedence?' problem, I added a number to
   struct xfrm_policy that maps to its index in the complete inexact
   list.  This is necessary to make sure the same sequence of
   policy-add commands keeps returning the same lookup result as older
   kernels.

This work doesn't prevent us from re-adding a flow caching scheme, but
so far I have found no good solution (all have various DoS or
degradation issues).

Comments or questions welcome.

Florian Westphal (11):
  selftests: add xfrm policy test script
  xfrm: security: iterate all, not inexact lists
  xfrm: policy: split list insertion into a helper
  xfrm: policy: return NULL when inexact search needed
  xfrm: policy: store inexact policies in an rhashtable
  xfrm: policy: consider if_id when hashing inexact policy
  xfrm: policy: add inexact policy search tree infrastructure
  xfrm: policy: store inexact policies in a tree ordered by destination address
  xfrm: policy: check reinserted policies match their node
  xfrm: policy: store inexact policies in a tree ordered by source address
  xfrm: policy: add 2nd-level saddr trees for inexact policies

 include/net/netns/xfrm.h                   |    2 
 include/net/xfrm.h                         |    3 
 net/xfrm/xfrm_policy.c                     | 1234 ++++++++++++++++++++++++++---
 tools/testing/selftests/net/Makefile       |    3 
 tools/testing/selftests/net/xfrm_policy.sh |  276 ++++++
 5 files changed, 1391 insertions(+), 127 deletions(-)