On 10/12/2018 10:39 AM, Daniel Borkmann wrote: > On 10/12/2018 05:04 AM, Jakub Kicinski wrote: >> On Thu, 11 Oct 2018 16:02:07 +0200, Daniel Borkmann wrote: >>> Clean up and improve bpf_perf_event_read_simple() ring walk a bit >>> to use similar tail update scheme as in perf and bcc allowing the >>> kernel to make forward progress not only after full timely walk. >> >> The extra memory barriers won't impact performance? If I read the code >> correctly we now have: >> >> while (bla) { >> head = HEAD >> rmb() >> >> ... >> >> mb() >> TAIL = tail >> } >> >> Would it make sense to try to piggy back on the mb() for head re-read >> at least? Perhaps that's a non-issue, just wondering. > > From the scheme specified in the comment in prior patch my understanding > would be that they don't pair (see B and C) so there would be no guarantee > that load of head would happen before load of data. Fwiw, I've been using > the exact same semantics as user space perf tool walks the perf mmap'ed > ring buffer (tools/perf/util/mmap.{h,c}) here. Given kernel doesn't stop
On that note, I'll also respin, after some clarification with PeterZ on why perf is using {rmb,mb}() barriers today as opposed to more lightweight smp_{rmb,mb}() ones it turns out there is no real reason other than historic one and perf can be changed and made more efficient as well. ;) > pushing into ring buffer while user space walks it and indicates how far > it has consumed data via tail update, it would allow for making room > successively and not only after full run has complete, so we don't make > any assumptions in the generic libbpf library helper on how slow/quick > the callback would be processing resp. how full ring is, etc, and kernel > pushing new data can be processed in the same run if necessary. One thing > we could consider is to batch tail updates, say, every 8 elements and a > final update once we break out walking the ring; probably okay'ish as a > heuristic.. > >>> Also few other improvements to use realloc() instead of free() and >>> malloc() combination and for the callback use proper perf_event_header >>> instead of void pointer, so that real applications can use container_of() >>> macro with proper type checking. >> >> FWIW the free() + malloc() was to avoid the the needless copy of the >> previous event realloc() may do. It makes sense to use realloc() >> especially if you want to put extra info in front of the buffer, just >> sayin' it wasn't a complete braino ;) > > No strong preference from my side, I'd think that it might be sensible in > any case from applications to call the bpf_perf_event_read_simple() with a > already preallocated buffer, depending on the expected max element size from > BPF could e.g. be a buffer of 1 page or so. Given 512 byte stack space from > the BPF prog and MTU 1500 this would more than suffice to avoid new > allocations altogether. Anyway, given we only grow the new memory area I > did some testing on realloc() with an array of pointers to prior malloc()'ed > buffers, running randomly 10M realloc()s to increase size over them and > saw <1% where area had to be moved, so we're hitting corner case of a corner > case, I'm also ok to leave the combination, though. :) > > Thanks, > Daniel