On Wed, Jan 18, 2017 at 03:30:14PM +0100, David Herrmann wrote: > Hi > > On Sat, Jan 14, 2017 at 5:55 PM, Alexei Starovoitov <a...@fb.com> wrote: > > Another alternative is to extend samples/bpf/map_perf_test > > It has perf tests for most map types today (including lru) > > and trie would be natural addition there. > > I would prefer this latter option. > > I hooked into gettid() and installed a simple kprobe bpf program that > searches for an entry in an lpm trie. The bpf program does either 0, > 1, 8, or 32 lookups in a row (always the same element). The trie has > size 0, 1, or 8192. The data is below. The results vary by roughly 5% > on every run. > > A single gettid() syscall with an empty bpf program takes roughly > 6.5us on my system. Lookups in empty tries take ~1.8us on first try, > ~0.9us on retries. Lookups in tries with 8192 entries take ~7.1us (on > the first _and_ any subsequent try).
thanks. please add them to commit log. > https://gist.github.com/dvdhrm/4c90e61a1c39746d5c55ab9e0e29315e looks good. please add it as an extra patch to this set. and add #pragma clang loop unroll(full) before + for (i = 0; i < 8; ++i) + bpf_map_lookup_elem(&lpm_trie_map_alloc, &key); to make sure it's unrolled regardless of the version of llvm. > Trie-size: 0 > #Lookups: 0 > 0:lpm_perf kmalloc 9,230,321 events per sec > -> 6.5us / syscall > > Trie-size: 1 > #Lookups: 1 > 0:lpm_perf kmalloc 7,224,508 events per sec > -> 8.3us / syscall > > Trie-size: 1 > #Lookups: 8 > 0:lpm_perf kmalloc 4,152,740 events per sec > -> 14.4us / syscall > > Trie-size: 1 > #Lookups: 32 > 0:lpm_perf kmalloc 1,713,415 events per sec > -> 35.0us / syscall > > Trie-size: 8192 > #Lookups: 1 > 0:lpm_perf kmalloc 4,369,138 events per sec > -> 13.7us / syscall > > Trie-size: 8192 > #Lookups: 8 > 0:lpm_perf kmalloc 943,849 events per sec > -> 63.6us / syscall > > Trie-size: 8192 > #Lookups: 32 > 0:lpm_perf kmalloc 271,737 events per sec > -> 220.8us / syscall so it's the same lookup done 32-times per each syscall. Since the latency of single lookup in 8 vs 32 experiments are pretty close, it means that the 32 lookup numbers amortize the cost of syscall pretty well and no need to go higher. So I would use 32 as a loop count in stress_lpm_trie_map_alloc() Thanks!