On Tue, Jul 09, 2019 at 03:34:30PM -0700, Jakub Kicinski wrote: > On Tue, 9 Jul 2019 15:38:44 +0300, Ido Schimmel wrote: > > On Mon, Jul 08, 2019 at 03:51:58PM -0700, Jakub Kicinski wrote: > > > On Mon, 8 Jul 2019 16:19:08 +0300, Ido Schimmel wrote: > > > > On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote: > > > > > From: Ido Schimmel <ido...@idosch.org> > > > > > Date: Sun, 7 Jul 2019 10:58:17 +0300 > > > > > > > > > > > Users have several ways to debug the kernel and understand why a > > > > > > packet > > > > > > was dropped. For example, using "drop monitor" and "perf". Both > > > > > > utilities trace kfree_skb(), which is the function called when a > > > > > > packet > > > > > > is freed as part of a failure. The information provided by these > > > > > > tools > > > > > > is invaluable when trying to understand the cause of a packet loss. > > > > > > > > > > > > In recent years, large portions of the kernel data path were > > > > > > offloaded > > > > > > to capable devices. Today, it is possible to perform L2 and L3 > > > > > > forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN). > > > > > > Different TC classifiers and actions are also offloaded to capable > > > > > > devices, at both ingress and egress. > > > > > > > > > > > > However, when the data path is offloaded it is not possible to > > > > > > achieve > > > > > > the same level of introspection as tools such "perf" and "drop > > > > > > monitor" > > > > > > become irrelevant. > > > > > > > > > > > > This patchset aims to solve this by allowing users to monitor > > > > > > packets > > > > > > that the underlying device decided to drop along with relevant > > > > > > metadata > > > > > > such as the drop reason and ingress port. > > > > > > > > > > We are now going to have 5 or so ways to capture packets passing > > > > > through > > > > > the system, this is nonsense. > > > > > > > > > > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this > > > > > devlink thing. > > > > > > > > > > This is insanity, too many ways to do the same thing and therefore the > > > > > worst possible user experience. > > > > > > > > > > Pick _ONE_ method to trap packets and forward normal kfree_skb events, > > > > > XDP perf events, and these taps there too. > > > > > > > > > > I mean really, think about it from the average user's perspective. To > > > > > see all drops/pkts I have to attach a kfree_skb tracepoint, and not > > > > > just > > > > > listen on devlink but configure a special tap thing beforehand and > > > > > then > > > > > if someone is using XDP I gotta setup another perf event buffer > > > > > capture > > > > > thing too. > > > > > > > > Let me try to explain again because I probably wasn't clear enough. The > > > > devlink-trap mechanism is not doing the same thing as other solutions. > > > > > > > > The packets we are capturing in this patchset are packets that the > > > > kernel (the CPU) never saw up until now - they were silently dropped by > > > > the underlying device performing the packet forwarding instead of the > > > > CPU. > > > > Jakub, > > > > It seems to me that most of the criticism is about consolidation of > > interfaces because you believe I'm doing something you can already do > > today, but this is not the case. > > To be clear I'm not opposed to the patches, I'm just trying to > facilitate a discussion.
Sure, sorry if it came out the wrong way. I appreciate your feedback and the time you have spent on this subject. > > > Switch ASICs have dedicated traps for specific packets. Usually, these > > packets are control packets (e.g., ARP, BGP) which are required for the > > correct functioning of the control plane. You can see this in the SAI > > interface, which is an abstraction layer over vendors' SDKs: > > > > https://github.com/opencomputeproject/SAI/blob/master/inc/saihostif.h#L157 > > > > We need to be able to configure the hardware policers of these traps and > > read their statistics to understand how many packets they dropped. We > > currently do not have a way to do any of that and we rely on hardcoded > > defaults in the driver which do not fit every use case (from > > experience): > > > > https://elixir.bootlin.com/linux/v5.2/source/drivers/net/ethernet/mellanox/mlxsw/spectrum.c#L4103 > > > > We plan to extend devlink-trap mechanism to cover all these use cases. I > > hope you agree that this functionality belongs in devlink given it is a > > device-specific configuration and not a netdev-specific one. > > No disagreement on providing knobs for traps. > > > That being said, in its current form, this mechanism is focused on traps > > that correlate to packets the device decided to drop as this is very > > useful for debugging. > > That'd be mixing two things - trap configuration and tracing exceptions > in one API. That's a little suboptimal but not too terrible, especially > if there is a higher level APIs users can default to. TBH, initially I was only focused on the drops, but then it occurred to me that this is a too narrow scope. These traps are only a subset of the complete list of traps we have and we have similar requirements for both (statistics, setting policers etc.). Therefore, I decided to design this interface in a more generic way, so that it could support the different use cases. > > > Given that the entire configuration is done via devlink and that devlink > > stores all the information about these traps, it seems logical to also > > report these packets and their metadata to user space as devlink events. > > > > If this is not desirable, we can try to call into drop_monitor from > > devlink and add a new command (e.g., NET_DM_CMD_HW_ALERT), which will > > encode all the information we currently have in DEVLINK_CMD_TRAP_REPORT. > > > > IMO, this is less desirable, as instead of having one tool (devlink) to > > interact with this mechanism we will need two (devlink & dropwatch). > > > > Below I tried to answer all your questions and refer to all the points > > you brought up. > > > > > When you say silently dropped do you mean that mlxsw as of today > > > doesn't have any counters exposed for those events? > > > > Some of these packets are counted, but not all of them. > > > > > If we wanted to consolidate this into something existing we can either > > > (a) add similar traps in the kernel data path; > > > (b) make these traps extension of statistics. > > > > > > My knee jerk reaction to seeing the patches was that it adds a new > > > place where device statistics are reported. > > > > Not at all. This would be a step back. We can already count discards due > > to VLAN membership on ingress on a per-port basis. A software maintained > > global counter does not buy us anything. > > > > By also getting the dropped packet - coupled with the drop reason and > > ingress port - you can understand exactly why and on which VLAN the > > packet was dropped. I wrote a Wireshark dissector for these netlink > > packets to make our life easier. You can see the details in my comment > > to the cover letter: > > > > https://marc.info/?l=linux-netdev&m=156248736710238&w=2 > > > > In case you do not care about individual packets, but still want more > > fine-grained statistics for your monitoring application, you can use > > eBPF. For example, one thing we did is attaching a kprobe to > > devlink_trap_report() with an eBPF program that dissects the incoming > > skbs and maintains a counter per-{5 tuple, drop reason}. With > > ebpf_exporter you can export these statistics to Prometheus on which you > > can run queries and visualize the results with Grafana. This is > > especially useful for tail and early drops since it allows you to > > understand which flows contribute to most of the drops. > > No question that the mechanism is useful. > > > > Users who want to know why things are dropped will not get detailed > > > breakdown from ethtool -S which for better or worse is the one stop > > > shop for device stats today. > > > > I hope I managed to explain why counters are not enough, but I also want > > to point out that ethtool statistics are not properly documented and > > this hinders their effectiveness. I did my best to document the exposed > > traps in order to avoid the same fate: > > > > https://patchwork.ozlabs.org/patch/1128585/ > > > > In addition, there are selftests to show how each trap can be triggered > > to reduce the ambiguity even further: > > > > https://patchwork.ozlabs.org/patch/1128610/ > > > > And a note in the documentation to make sure future functionality is > > tested as well: > > > > https://patchwork.ozlabs.org/patch/1128608/ > > > > > Having thought about it some more, however, I think that having a > > > forwarding "exception" object and hanging statistics off of it is a > > > better design, even if we need to deal with some duplication to get > > > there. > > > > > > IOW having an way to "trap all packets which would increment a > > > statistic" (option (b) above) is probably a bad design. > > > > > > As for (a) I wonder how many of those events have a corresponding event > > > in the kernel stack? > > > > Generic packet drops all have a corresponding kfree_skb() calls in the > > kernel, but that does not mean that every packet dropped by the hardware > > would also be dropped by the kernel if it were to be injected to its Rx > > path. > > The notion that all SW events get captured by kfree_skb() would not be > correct. I meant that the generic drop reasons I'm exposing with this patchset all correspond to reasons for which the kernel would drop packets. > We have the kfree_skb(), and xdp_exception(), and drivers can > drop packets if various allocations fail.. the situation is already not > great. > > I think that having a single useful place where users can look to see > all traffic exception events would go a long way. I believe this was Dave's point as well. We have one tool to monitor kfree_skb() drops and with this patchset we will have another to monitor HW drops. As I mentioned in my previous reply, I will look into sending the events via drop_monitor by calling into it from devlink. I'm not involved with XDP (as you might have noticed), but I assume drop_monitor could be extended for this use case as well by doing register_trace_xdp_exception(). Then you could monitor SW, HW and XDP events using a single netlink channel, potentially split into different multicast groups to allow user space programs to receive only the events they care about. > Software side as I mentioned is pretty brutal, IDK how many users are > actually willing to decode stack traces to figure out why their system > is dropping packets :/ > > > In my reply to Dave I gave buffer drops as an example. > > The example of buffer drops is also probably the case where having the > packet is least useful, but yes, I definitely agree devices need a way > of reporting events that can't happen in SW. > > > There are also situations in which packets can be dropped due to > > device-specific exceptions and these do not have a corresponding drop > > reason in the kernel. See example here: > > > > https://patchwork.ozlabs.org/patch/1128587/ > > > > > If we could add corresponding trace points and just feed those from > > > the device driver, that'd obviously be a holy grail. > > > > Unlike tracepoints, netlink gives you a structured and extensible > > interface. For example, in Spectrum-1 we cannot provide the Tx port for > > early/tail drops, whereas for Spectrum-2 and later we can. With netlink, > > we can just omit the DEVLINK_ATTR_TRAP_OUT_PORT attribute for > > Spectrum-1. You also get a programmatic interface that you can query for > > this information: > > > > # devlink -v trap show netdevsim/netdevsim10 trap ingress_vlan_filter > > netdevsim/netdevsim10: > > name ingress_vlan_filter type drop generic true report false action drop > > group l2_drops > > metadata: > > input_port > > Right, you can set or not set skb fields to some extent but its > definitely not as flexible as netlink.