On Fri, Nov 11, 2016 at 10:55:06AM -0800, Martin KaFai Lau wrote:
> Introduce bpf_lru_list which will provide the LRU capability to
> the bpf_htab in a later patch.
>
> * General Thoughts:
> 1. Target use case: reads are more frequent than updates
>    (i.e. bpf_lookup_elem() is called more often than bpf_update_elem()).
>    If a bpf_prog does a bpf_lookup_elem() first and then an in-place
>    update, it still counts as a read as far as the LRU list is
>    concerned.
> 2. It may be useful to think of it as an LRU cache.
> 3. Optimize the read case:
>    3.1 No lock in the read case.
>    3.2 The LRU maintenance is only done during bpf_update_elem().
> 4. A percpu LRU list loses the system-wide LRU property. A
>    completely isolated percpu LRU list has the best performance
>    but the memory utilization is not ideal, considering the
>    workload may be imbalanced.
> 5. Hence, this patch starts the LRU implementation with a global
>    LRU list, with operations batched per cpu before accessing the
>    global LRU list. As an LRU cache where #read >> #update/#insert
>    operations, it will work well.
> 6. There is a local list (for each cpu) which is named
>    'struct bpf_lru_locallist'. This local list is not used to sort
>    the LRU property. Instead, the local list is used to batch enough
>    operations before acquiring the lock of the global LRU list. More
>    details on this later.
> 7. A later patch allows a percpu LRU list, selected by a
>    map attribute, for scalability reasons and for use cases that
>    need to prepare for the worst (and pathological) case like a DoS
>    attack. The percpu LRU lists are completely isolated from each
>    other and LRU nodes (including free nodes) cannot move across
>    lists. The following description is for the global LRU list but
>    is mostly applicable to the percpu LRU list as well.
>
> * Global LRU List:
> 1. It has three sub-lists: active-list, inactive-list and free-list.
> 2. The two-list idea, active and inactive, is borrowed from the
>    page cache.
> 3. All nodes are pre-allocated and all sit on the free-list (of the
>    global LRU list) at the beginning. The reasoning for
>    pre-allocation is similar to the existing BPF_MAP_TYPE_HASH.
>    However, opting out of prealloc (BPF_F_NO_PREALLOC) is not
>    supported in the LRU map.
>
> * Active/Inactive List (of the global LRU list):
> 1. The active list, as its name says, maintains the active set of
>    nodes. We can think of it as the working set, or the more
>    frequently accessed nodes. The access frequency is approximated
>    by a ref-bit. The ref-bit is set during bpf_lookup_elem().
> 2. The inactive list, as its name also says, maintains a less
>    active set of nodes. They are the candidates to be removed
>    from the bpf_htab when we are running out of free nodes.
> 3. The ordering of these two lists acts as a rough clock.
>    The tail of the inactive list holds the older nodes, which
>    should be released first if the bpf_htab needs a free element.
>
> * Rotating the Active/Inactive List (of the global LRU list):
> 1. This is the basic operation that maintains the LRU property of
>    the global list.
> 2. The active list is only rotated when the inactive list is running
>    low. This idea is similar to the current page cache.
>    "Inactive running low" is currently defined as
>    "# of inactive < # of active".
> 3. The active list rotation always starts from the tail. It moves
>    nodes without the ref-bit set to the head of the inactive list.
>    It moves nodes with the ref-bit set back to the head of the
>    active list and then clears their ref-bit.
> 4. The inactive rotation is pretty simple.
>    It walks the inactive list and moves nodes with the ref-bit set
>    back to the head of the active list, clearing the ref-bit after
>    the move. Nodes without the ref-bit set are left as they are,
>    because they are already in the inactive list.
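A minimal userspace sketch of the two rotations above might be worth
keeping next to the eventual design doc. Everything here is
illustrative (struct node/struct list, rotate_active(),
rotate_inactive() and the nscan bound are made up for the example; the
kernel code uses struct list_head and keeps more state), but the
ref-bit handling follows the description:

/* Minimal userspace sketch of the active/inactive rotation.
 * Types and helpers are illustrative, not the kernel's list_head code.
 */
#include <stdbool.h>
#include <stdio.h>

struct node {
    struct node *prev, *next;
    bool ref;                       /* ref-bit, set on lookup */
};

struct list {                       /* circular list with a sentinel head */
    struct node head;
    unsigned int cnt;
};

static void list_init(struct list *l)
{
    l->head.prev = l->head.next = &l->head;
    l->cnt = 0;
}

static void list_del_node(struct node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
}

static void list_add_head(struct list *l, struct node *n)
{
    n->next = l->head.next;
    n->prev = &l->head;
    l->head.next->prev = n;
    l->head.next = n;
}

/* Scan up to nscan nodes from the active tail: referenced nodes get a
 * second chance at the active head, the rest are demoted. */
static void rotate_active(struct list *active, struct list *inactive,
                          unsigned int nscan)
{
    while (nscan-- && active->head.prev != &active->head) {
        struct node *n = active->head.prev;     /* tail node */

        list_del_node(n);
        if (n->ref) {
            n->ref = false;
            list_add_head(active, n);           /* second chance */
        } else {
            list_add_head(inactive, n);         /* demote */
            active->cnt--;
            inactive->cnt++;
        }
    }
}

/* Walk the inactive list from the tail: promote nodes whose ref-bit got
 * set since demotion; untouched nodes stay where they are. */
static void rotate_inactive(struct list *active, struct list *inactive,
                            unsigned int nscan)
{
    struct node *n = inactive->head.prev;

    while (nscan-- && n != &inactive->head) {
        struct node *prev = n->prev;            /* save before unlinking */

        if (n->ref) {
            n->ref = false;
            list_del_node(n);
            list_add_head(active, n);           /* promote */
            inactive->cnt--;
            active->cnt++;
        }
        n = prev;
    }
}

int main(void)
{
    static struct node nodes[4];
    struct list active, inactive;

    list_init(&active);
    list_init(&inactive);
    for (int i = 0; i < 4; i++) {
        list_add_head(&active, &nodes[i]);
        active.cnt++;
    }
    nodes[0].ref = true;                /* pretend nodes[0] was looked up */

    if (inactive.cnt < active.cnt)      /* "inactive running low" */
        rotate_active(&active, &inactive, active.cnt);
    printf("after active rotation: active=%u inactive=%u\n",
           active.cnt, inactive.cnt);   /* active=1 inactive=3 */

    nodes[1].ref = true;                /* a demoted node gets looked up */
    rotate_inactive(&active, &inactive, inactive.cnt);
    printf("after inactive rotation: active=%u inactive=%u\n",
           active.cnt, inactive.cnt);   /* active=2 inactive=2 */
    return 0;
}

The second chance in rotate_active() mirrors the page-cache idea: a
referenced node goes back to the active head with its ref-bit cleared,
so it must be referenced again to survive the next pass.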
> * Shrinking the Inactive List (of the global LRU list):
> 1. Shrinking is the operation that gets free nodes when the bpf_htab
>    is full.
> 2. It usually only shrinks the inactive list to get free nodes.
> 3. During shrinking, it walks the inactive list from the tail and
>    deletes the nodes without the ref-bit set from the bpf_htab.
> 4. If no free node is found after step (3), it forcefully gets
>    one node from the tail of the inactive or active list.
>    "Forcefully" in the sense that it ignores the ref-bit.
>
> * Local List:
> 1. Each CPU has a 'struct bpf_lru_locallist'. Its purpose is to
>    batch enough operations before acquiring the lock of the
>    global LRU list.
> 2. A local list has two sub-lists: a free-list and a pending-list.
> 3. During bpf_update_elem(), it first tries to get a node from the
>    free-list of the local list (of the current CPU).
> 4. If the local free-list is empty, it acquires nodes from the
>    global LRU list. The global LRU list can satisfy the request
>    either from its global free-list or by shrinking the global
>    inactive list. Since the global LRU list lock has been acquired
>    anyway, it tries to move up to LOCAL_FREE_TARGET elements
>    to the local free-list.
> 5. When a new element is added to the bpf_htab, it first sits on
>    the pending-list (of the local list). The pending-list is
>    flushed to the global LRU list the next time free nodes need to
>    be acquired from the global list.
>
> * Lock Consideration:
> The LRU list has a lock (lru_lock). Each bucket of the htab has a
> lock (buck_lock). If both locks need to be acquired together,
> the lock order is always lru_lock -> buck_lock, and this only
> happens in the bpf_lru_list.c logic.
>
> In hashtab.c, the two locks are never held together (i.e. one
> lock is always released before acquiring the other).
>
> Signed-off-by: Martin KaFai Lau <ka...@fb.com>
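The local free-list path can be sketched in userspace as well. Again a
toy model only: alloc_for_update(), refill_local(), the pthread
spinlock, and the batch size of 8 are made up here; the forced reclaim
from the active-list tail and the htab-bucket unlinking during shrink
are omitted:

/* Toy userspace model of the per-cpu free-list/pending-list batching.
 * Names and types are illustrative only, not the kernel implementation.
 */
#include <pthread.h>
#include <stdio.h>

#define LOCAL_FREE_TARGET 8             /* illustrative batch size */

struct node { struct node *next; };
struct nlist { struct node *first; };

static struct node *pop(struct nlist *l)
{
    struct node *n = l->first;

    if (n)
        l->first = n->next;
    return n;
}

static void push(struct nlist *l, struct node *n)
{
    n->next = l->first;
    l->first = n;
}

struct lru_global {
    pthread_spinlock_t lock;            /* the "lru_lock" */
    struct nlist free, inactive, active;
};

struct lru_local {                      /* one instance per cpu */
    struct nlist free, pending;
};

static void flush_pending(struct lru_global *g, struct lru_local *loc)
{
    struct node *n;

    /* new elements enter the global list in one batch */
    while ((n = pop(&loc->pending)))
        push(&g->active, n);
}

static void refill_local(struct lru_global *g, struct lru_local *loc)
{
    for (int i = 0; i < LOCAL_FREE_TARGET; i++) {
        struct node *n = pop(&g->free);

        /* "shrink": the kernel would also unlink the element
         * from its htab bucket at this point */
        if (!n)
            n = pop(&g->inactive);
        if (!n)
            break;
        push(&loc->free, n);
    }
}

static struct node *alloc_for_update(struct lru_global *g,
                                     struct lru_local *loc)
{
    struct node *n = pop(&loc->free);   /* common path: no global lock */

    if (!n) {
        /* lru_lock is never taken while a bucket lock is held */
        pthread_spin_lock(&g->lock);
        flush_pending(g, loc);
        refill_local(g, loc);
        pthread_spin_unlock(&g->lock);
        n = pop(&loc->free);
    }
    if (n)
        push(&loc->pending, n);         /* waits on the pending-list */
    return n;                           /* NULL: map is truly full */
}

int main(void)
{
    static struct node pool[4];
    static struct lru_global g;
    static struct lru_local loc;        /* pretend: current cpu's list */

    pthread_spin_init(&g.lock, PTHREAD_PROCESS_PRIVATE);
    for (int i = 0; i < 4; i++)
        push(&g.free, &pool[i]);

    for (int i = 0; i < 6; i++)
        printf("update %d -> %s\n", i,
               alloc_for_update(&g, &loc) ? "got node" : "full");
    return 0;
}

The win is on the common path: lru_lock is taken roughly once per
LOCAL_FREE_TARGET updates per cpu instead of once per update, and it
is never taken while a bucket lock is held.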
Thanks for the detailed commit log. I think it's worth adding it to
bpf_lru_list.c as design documentation.

Acked-by: Alexei Starovoitov <a...@kernel.org>