On 01/13/2017 07:01 PM, Alexei Starovoitov wrote:
> On Thu, Jan 12, 2017 at 06:29:21PM +0100, Daniel Mack wrote:
>> This trie implements a longest prefix match algorithm that can be used
>> to match IP addresses to a stored set of ranges.
>>
>> Internally, data is stored in an unbalanced trie of nodes that has a
>> maximum height of n, where n is the prefixlen the trie was created
>> with.
>>
>> Tries may be created with prefix lengths that are multiples of 8, in
>> the range from 8 to 2048. The key used for lookup and update operations
>> is a struct bpf_lpm_trie_key, and the value is a uint64_t.
>>
>> The code carries more information about the internal implementation.
>>
>> Signed-off-by: Daniel Mack <dan...@zonque.org>
>> Reviewed-by: David Herrmann <dh.herrm...@gmail.com>
>> ---
>>  include/uapi/linux/bpf.h |   7 +
>>  kernel/bpf/Makefile      |   2 +-
>>  kernel/bpf/lpm_trie.c    | 499 +++++++++++++++++++++++++++++++++++++++++++++++
>>  3 files changed, 507 insertions(+), 1 deletion(-)
>>  create mode 100644 kernel/bpf/lpm_trie.c
... Thanks for spotting my typos! :)

>> +static struct lpm_trie_node *lpm_trie_node_alloc(const struct lpm_trie *trie,
>> +						 const void *value)
>> +{
>> +	struct lpm_trie_node *node;
>> +	gfp_t gfp = GFP_ATOMIC | __GFP_NOWARN;
>> +
>> +	node = kmalloc(sizeof(struct lpm_trie_node) + trie->data_size, gfp);
>> +	if (!node)
>> +		return ERR_PTR(-ENOMEM);
>> +
>> +	if (value) {
>> +		node->value = kmemdup(value, trie->map.value_size, gfp);
>
> can you make value to be part of the node? similar to how hash map is done?
> that will help avoid the 2nd allocation, will speed up insertion and will
> help converting this code to user pre-allocated elements.
> I suspect the concern was that for many inner nodes that value is null?
> But in your use case the value_size will be == 0 eventually,
> so by embedding it we can save memory too, since the 'value' pointer will
> be replaced with a boolean present flag?
> So potentially less memory and fewer cache misses?

Yes, that's a good idea. Implemented that now.

> Overall the algorithm is indeed straightforward and simple, which is great,
> but I would still like to see some performance numbers.

I'm not sure yet how to implement such a test in a meaningful way, to be
honest. Given that the lookups have to be done one by one, I expect the
syscall overhead to be quite significant.

> Looks like in the best case, for a single 32-bit element, it needs 4 xors
> and compares, which is fine. For a mostly populated trie it's 4 xors *
> 32 depth, which is pretty good too, but cache misses on pointer walks may
> kill performance unless we're hitting the same path all the time.
> I think it's all acceptable due to the simplicity of the implementation,
> which we may improve later if it turns out to be a bottleneck for
> some use cases. We just need a baseline to have realistic expectations.

Yes, the maximum height of the trie is the number of bits in the prefix,
so for n bits, the iteration would at most take n steps to finish.
For each step, an xor and a compare over n/8 bytes are needed.

As you say, the implementation could be improved under the hood if
someone spots a bottleneck somewhere.

I'll post a v3 with your comments addressed for further discussion.


Thanks,
Daniel