Ido Schimmel <ido...@idosch.org> writes:
> On Tue, Jul 09, 2019 at 03:34:30PM -0700, Jakub Kicinski wrote:
>> On Tue, 9 Jul 2019 15:38:44 +0300, Ido Schimmel wrote:
>> > On Mon, Jul 08, 2019 at 03:51:58PM -0700, Jakub Kicinski wrote:
>> > > On Mon, 8 Jul 2019 16:19:08 +0300, Ido Schimmel wrote:
>> > > > On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:
>> > > > > From: Ido Schimmel <ido...@idosch.org>
>> > > > > Date: Sun, 7 Jul 2019 10:58:17 +0300
>> > > > >
>> > > > > > Users have several ways to debug the kernel and understand why a packet was dropped. For example, using "drop monitor" and "perf". Both utilities trace kfree_skb(), which is the function called when a packet is freed as part of a failure. The information provided by these tools is invaluable when trying to understand the cause of a packet loss.
>> > > > > >
>> > > > > > In recent years, large portions of the kernel data path were offloaded to capable devices. Today, it is possible to perform L2 and L3 forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN). Different TC classifiers and actions are also offloaded to capable devices, at both ingress and egress.
>> > > > > >
>> > > > > > However, when the data path is offloaded it is not possible to achieve the same level of introspection as tools such "perf" and "drop monitor" become irrelevant.
>> > > > > >
>> > > > > > This patchset aims to solve this by allowing users to monitor packets that the underlying device decided to drop along with relevant metadata such as the drop reason and ingress port.
>> > > > >
>> > > > > We are now going to have 5 or so ways to capture packets passing through the system, this is nonsense.
>> > > > >
>> > > > > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this devlink thing.
>> > > > >
>> > > > > This is insanity, too many ways to do the same thing and therefore the worst possible user experience.
>> > > > >
>> > > > > Pick _ONE_ method to trap packets and forward normal kfree_skb events, XDP perf events, and these taps there too.
>> > > > >
>> > > > > I mean really, think about it from the average user's perspective. To see all drops/pkts I have to attach a kfree_skb tracepoint, and not just listen on devlink but configure a special tap thing beforehand and then if someone is using XDP I gotta setup another perf event buffer capture thing too.
>> > > >
>> > > > Let me try to explain again because I probably wasn't clear enough. The devlink-trap mechanism is not doing the same thing as other solutions.
>> > > >
>> > > > The packets we are capturing in this patchset are packets that the kernel (the CPU) never saw up until now - they were silently dropped by the underlying device performing the packet forwarding instead of the CPU.
>> >
>> > Jakub,
>> >
>> > It seems to me that most of the criticism is about consolidation of interfaces because you believe I'm doing something you can already do today, but this is not the case.
>>
>> To be clear I'm not opposed to the patches, I'm just trying to facilitate a discussion.
>
> Sure, sorry if it came out the wrong way. I appreciate your feedback and the time you have spent on this subject.
>
>>
>> > Switch ASICs have dedicated traps for specific packets. Usually, these packets are control packets (e.g., ARP, BGP) which are required for the correct functioning of the control plane. You can see this in the SAI interface, which is an abstraction layer over vendors' SDKs:
>> >
>> > https://github.com/opencomputeproject/SAI/blob/master/inc/saihostif.h#L157
>> >
>> > We need to be able to configure the hardware policers of these traps and read their statistics to understand how many packets they dropped. We currently do not have a way to do any of that and we rely on hardcoded defaults in the driver which do not fit every use case (from experience):
>> >
>> > https://elixir.bootlin.com/linux/v5.2/source/drivers/net/ethernet/mellanox/mlxsw/spectrum.c#L4103
>> >
>> > We plan to extend devlink-trap mechanism to cover all these use cases. I hope you agree that this functionality belongs in devlink given it is a device-specific configuration and not a netdev-specific one.
>>
>> No disagreement on providing knobs for traps.
>>
>> > That being said, in its current form, this mechanism is focused on traps that correlate to packets the device decided to drop as this is very useful for debugging.
>>
>> That'd be mixing two things - trap configuration and tracing exceptions in one API. That's a little suboptimal but not too terrible, especially if there is a higher level APIs users can default to.
>
> TBH, initially I was only focused on the drops, but then it occurred to me that this is a too narrow scope. These traps are only a subset of the complete list of traps we have and we have similar requirements for both (statistics, setting policers etc.). Therefore, I decided to design this interface in a more generic way, so that it could support the different use cases.
>
>>
>> > Given that the entire configuration is done via devlink and that devlink stores all the information about these traps, it seems logical to also report these packets and their metadata to user space as devlink events.
>> >
>> > If this is not desirable, we can try to call into drop_monitor from devlink and add a new command (e.g., NET_DM_CMD_HW_ALERT), which will encode all the information we currently have in DEVLINK_CMD_TRAP_REPORT.
>> >
>> > IMO, this is less desirable, as instead of having one tool (devlink) to interact with this mechanism we will need two (devlink & dropwatch).
>> >
>> > Below I tried to answer all your questions and refer to all the points you brought up.
>> >
>> > > When you say silently dropped do you mean that mlxsw as of today doesn't have any counters exposed for those events?
>> >
>> > Some of these packets are counted, but not all of them.
>> >
>> > > If we wanted to consolidate this into something existing we can either
>> > > (a) add similar traps in the kernel data path;
>> > > (b) make these traps extension of statistics.
>> > >
>> > > My knee jerk reaction to seeing the patches was that it adds a new place where device statistics are reported.
>> >
>> > Not at all. This would be a step back. We can already count discards due to VLAN membership on ingress on a per-port basis. A software maintained global counter does not buy us anything.
>> >
>> > By also getting the dropped packet - coupled with the drop reason and ingress port - you can understand exactly why and on which VLAN the packet was dropped. I wrote a Wireshark dissector for these netlink packets to make our life easier. You can see the details in my comment to the cover letter:
>> >
>> > https://marc.info/?l=linux-netdev&m=156248736710238&w=2
>> >
>> > In case you do not care about individual packets, but still want more fine-grained statistics for your monitoring application, you can use eBPF. For example, one thing we did is attaching a kprobe to devlink_trap_report() with an eBPF program that dissects the incoming skbs and maintains a counter per-{5 tuple, drop reason}. With ebpf_exporter you can export these statistics to Prometheus on which you can run queries and visualize the results with Grafana. This is especially useful for tail and early drops since it allows you to understand which flows contribute to most of the drops.
>>
>> No question that the mechanism is useful.
>>
>> > > Users who want to know why things are dropped will not get detailed breakdown from ethtool -S which for better or worse is the one stop shop for device stats today.
>> >
>> > I hope I managed to explain why counters are not enough, but I also want to point out that ethtool statistics are not properly documented and this hinders their effectiveness. I did my best to document the exposed traps in order to avoid the same fate:
>> >
>> > https://patchwork.ozlabs.org/patch/1128585/
>> >
>> > In addition, there are selftests to show how each trap can be triggered to reduce the ambiguity even further:
>> >
>> > https://patchwork.ozlabs.org/patch/1128610/
>> >
>> > And a note in the documentation to make sure future functionality is tested as well:
>> >
>> > https://patchwork.ozlabs.org/patch/1128608/
>> >
>> > > Having thought about it some more, however, I think that having a forwarding "exception" object and hanging statistics off of it is a better design, even if we need to deal with some duplication to get there.
>> > >
>> > > IOW having an way to "trap all packets which would increment a statistic" (option (b) above) is probably a bad design.
>> > >
>> > > As for (a) I wonder how many of those events have a corresponding event in the kernel stack?
>> >
>> > Generic packet drops all have a corresponding kfree_skb() calls in the kernel, but that does not mean that every packet dropped by the hardware would also be dropped by the kernel if it were to be injected to its Rx path.
>>
>> The notion that all SW events get captured by kfree_skb() would not be correct.
>
> I meant that the generic drop reasons I'm exposing with this patchset all correspond to reasons for which the kernel would drop packets.
>
>> We have the kfree_skb(), and xdp_exception(), and drivers can drop packets if various allocations fail.. the situation is already not great.
>>
>> I think that having a single useful place where users can look to see all traffic exception events would go a long way.
>
> I believe this was Dave's point as well. We have one tool to monitor kfree_skb() drops and with this patchset we will have another to monitor HW drops. As I mentioned in my previous reply, I will look into sending the events via drop_monitor by calling into it from devlink.
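(As a side note on the kprobe + ebpf_exporter approach described above: a rough BCC-style sketch of what such a program could look like is below. This is hypothetical, not something from the patchset; it assumes the skb is the second argument of devlink_trap_report() as in the proposed API, and it only counts per ingress ifindex - a real version would also parse the skb to key on the 5-tuple and drop reason.)

#include <linux/skbuff.h>

/* Count packets handed to devlink_trap_report(), keyed by the ingress
 * ifindex recorded in the skb. */
BPF_HASH(trap_drops, u32, u64);

int kprobe__devlink_trap_report(struct pt_regs *ctx, void *devlink,
                                struct sk_buff *skb)
{
        /* Assumption: the skb is the traced function's second argument. */
        u32 iif = skb->skb_iif;

        trap_drops.increment(iif);
        return 0;
}

Loaded with BCC's Python bindings (or an equivalent ebpf_exporter config), periodically dumping the map gives per-port drop counts without copying any packets to user space.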
>
> I'm not involved with XDP (as you might have noticed), but I assume drop_monitor could be extended for this use case as well by doing register_trace_xdp_exception(). Then you could monitor SW, HW and XDP events using a single netlink channel, potentially split into different multicast groups to allow user space programs to receive only the events they care about.
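For what it's worth, the kernel side of that hook would presumably look something like the sketch below. This is hypothetical: the probe prototype just mirrors the existing xdp_exception tracepoint, the net_dm_* names are made up, and how the event is then encoded on drop_monitor's netlink channel is exactly the part that still needs designing.

#include <linux/netdevice.h>
#include <linux/bpf.h>
#include <trace/events/xdp.h>

/* Hypothetical probe that drop_monitor could register to forward XDP
 * exceptions; the tracepoint reports the netdev, the attached program
 * and the XDP action that caused the exception. */
static void net_dm_xdp_exception_probe(void *ignore,
                                       const struct net_device *dev,
                                       const struct bpf_prog *xdp, u32 act)
{
        /* Queue an event (e.g. ifindex + act) on the drop_monitor
         * netlink channel, alongside the existing kfree_skb events. */
}

static int net_dm_xdp_monitor_enable(void)
{
        return register_trace_xdp_exception(net_dm_xdp_exception_probe, NULL);
}

static void net_dm_xdp_monitor_disable(void)
{
        unregister_trace_xdp_exception(net_dm_xdp_exception_probe, NULL);
}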
Yes, having XDP hook into this would be good! This ties in nicely with the need for better statistics for XDP (see https://github.com/xdp-project/xdp-project/blob/master/xdp-project.org#consistency-for-statistics-with-xdp).

Unfortunately, it's not enough to just hook into the xdp_exception tracepoint, as that is generally only triggered on XDP_ABORTED, not when the XDP program simply decides to drop the packet with XDP_DROP. So we may need another mechanism to hook this if it's going to be comprehensive; and we'd probably want a way to do this that doesn't impose a performance penalty when the drop monitor is not enabled...
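To make that distinction concrete, here is a minimal XDP sketch (hypothetical, not from any patchset): every packet it drops uses the normal XDP_DROP verdict, so those drops never reach the xdp_exception tracepoint; only XDP_ABORTED (or an unknown return code) makes drivers call trace_xdp_exception().

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Hypothetical example: drop every packet with XDP_DROP. This is a
 * normal verdict, so it is invisible to the xdp_exception tracepoint.
 * Returning XDP_ABORTED instead would hit the tracepoint in most
 * drivers, but that code is meant for program errors, not policy drops. */
SEC("xdp")
int xdp_drop_all(struct xdp_md *ctx)
{
        return XDP_DROP;
}

char _license[] SEC("license") = "GPL";

-Toke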