On Tue, 9 Jul 2019 15:38:44 +0300, Ido Schimmel wrote:
> On Mon, Jul 08, 2019 at 03:51:58PM -0700, Jakub Kicinski wrote:
> > On Mon, 8 Jul 2019 16:19:08 +0300, Ido Schimmel wrote:
> > > On Sun, Jul 07, 2019 at 12:45:41PM -0700, David Miller wrote:
> > > > From: Ido Schimmel <ido...@idosch.org>
> > > > Date: Sun, 7 Jul 2019 10:58:17 +0300
> > > >
> > > > > Users have several ways to debug the kernel and understand why a packet was dropped. For example, using "drop monitor" and "perf". Both utilities trace kfree_skb(), which is the function called when a packet is freed as part of a failure. The information provided by these tools is invaluable when trying to understand the cause of a packet loss.
> > > > >
> > > > > In recent years, large portions of the kernel data path were offloaded to capable devices. Today, it is possible to perform L2 and L3 forwarding in hardware, as well as tunneling (IP-in-IP and VXLAN). Different TC classifiers and actions are also offloaded to capable devices, at both ingress and egress.
> > > > >
> > > > > However, when the data path is offloaded it is not possible to achieve the same level of introspection, as tools such as "perf" and "drop monitor" become irrelevant.
> > > > >
> > > > > This patchset aims to solve this by allowing users to monitor packets that the underlying device decided to drop, along with relevant metadata such as the drop reason and ingress port.
> > > >
> > > > We are now going to have 5 or so ways to capture packets passing through the system. This is nonsense.
> > > >
> > > > AF_PACKET, kfree_skb drop monitor, perf, XDP perf events, and now this devlink thing.
> > > >
> > > > This is insanity, too many ways to do the same thing and therefore the worst possible user experience.
> > > >
> > > > Pick _ONE_ method to trap packets and forward normal kfree_skb events, XDP perf events, and these taps there too.
> > > >
> > > > I mean really, think about it from the average user's perspective. To see all drops/pkts I have to attach a kfree_skb tracepoint, and not just listen on devlink but configure a special tap thing beforehand, and then if someone is using XDP I gotta set up another perf event buffer capture thing too.
> > >
> > > Let me try to explain again because I probably wasn't clear enough. The devlink-trap mechanism is not doing the same thing as other solutions.
> > >
> > > The packets we are capturing in this patchset are packets that the kernel (the CPU) never saw up until now - they were silently dropped by the underlying device performing the packet forwarding instead of the CPU.
>
> Jakub,
>
> It seems to me that most of the criticism is about consolidation of interfaces, because you believe I'm doing something you can already do today, but this is not the case.
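(For reference, the software-side tooling discussed above - "drop monitor" and "perf" - ultimately hangs off the skb:kfree_skb tracepoint. A minimal sketch of such a consumer, assuming a libbpf toolchain; the argument layout mirrors the tracepoint's format file and should be verified against the running kernel:)

/* SPDX-License-Identifier: GPL-2.0 */
#include <linux/bpf.h>
#include <linux/types.h>
#include <bpf/bpf_helpers.h>

/* Argument layout of the skb:kfree_skb tracepoint; check
 * /sys/kernel/debug/tracing/events/skb/kfree_skb/format. */
struct kfree_skb_args {
	__u64 pad;		/* common tracepoint fields */
	void *skbaddr;
	void *location;		/* call site that dropped the packet */
	unsigned short protocol;
};

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u64);	/* drop location (kernel text address) */
	__type(value, __u64);	/* number of drops */
} sw_drops SEC(".maps");

SEC("tracepoint/skb/kfree_skb")
int count_sw_drops(struct kfree_skb_args *ctx)
{
	__u64 loc = (__u64)ctx->location;
	__u64 *val, one = 1;

	val = bpf_map_lookup_elem(&sw_drops, &loc);
	if (val)
		__sync_fetch_and_add(val, 1);
	else
		bpf_map_update_elem(&sw_drops, &loc, &one, BPF_ANY);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";

(Loaded with a standard libbpf loader, the map ends up holding per-call-site drop counts that a user-space tool can resolve to function names via kallsyms - roughly what drop monitor reports for the software data path.)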
To be clear, I'm not opposed to the patches. I'm just trying to facilitate a discussion.

> Switch ASICs have dedicated traps for specific packets. Usually, these packets are control packets (e.g., ARP, BGP) which are required for the correct functioning of the control plane. You can see this in the SAI interface, which is an abstraction layer over vendors' SDKs:
>
> https://github.com/opencomputeproject/SAI/blob/master/inc/saihostif.h#L157
>
> We need to be able to configure the hardware policers of these traps and read their statistics to understand how many packets they dropped. We currently do not have a way to do any of that and we rely on hardcoded defaults in the driver, which do not fit every use case (from experience):
>
> https://elixir.bootlin.com/linux/v5.2/source/drivers/net/ethernet/mellanox/mlxsw/spectrum.c#L4103
>
> We plan to extend the devlink-trap mechanism to cover all these use cases. I hope you agree that this functionality belongs in devlink, given it is a device-specific configuration and not a netdev-specific one.

No disagreement on providing knobs for traps.

> That being said, in its current form, this mechanism is focused on traps that correlate to packets the device decided to drop, as this is very useful for debugging.

That'd be mixing two things - trap configuration and tracing exceptions - in one API. That's a little suboptimal, but not too terrible, especially if there is a higher level API users can default to.

> Given that the entire configuration is done via devlink and that devlink stores all the information about these traps, it seems logical to also report these packets and their metadata to user space as devlink events.
>
> If this is not desirable, we can try to call into drop_monitor from devlink and add a new command (e.g., NET_DM_CMD_HW_ALERT), which will encode all the information we currently have in DEVLINK_CMD_TRAP_REPORT.
>
> IMO, this is less desirable, as instead of having one tool (devlink) to interact with this mechanism we will need two (devlink & dropwatch).
>
> Below I tried to answer all your questions and refer to all the points you brought up.
>
> > When you say silently dropped, do you mean that mlxsw as of today doesn't have any counters exposed for those events?
>
> Some of these packets are counted, but not all of them.
>
> > If we wanted to consolidate this into something existing we can either
> > (a) add similar traps in the kernel data path; or
> > (b) make these traps an extension of statistics.
> >
> > My knee-jerk reaction to seeing the patches was that it adds a new place where device statistics are reported.
>
> Not at all. This would be a step back. We can already count discards due to VLAN membership on ingress on a per-port basis. A software-maintained global counter does not buy us anything.
>
> By also getting the dropped packet - coupled with the drop reason and ingress port - you can understand exactly why and on which VLAN the packet was dropped. I wrote a Wireshark dissector for these netlink packets to make our lives easier. You can see the details in my comment to the cover letter:
>
> https://marc.info/?l=linux-netdev&m=156248736710238&w=2
>
> In case you do not care about individual packets, but still want more fine-grained statistics for your monitoring application, you can use eBPF. For example, one thing we did is attaching a kprobe to devlink_trap_report() with an eBPF program that dissects the incoming skbs and maintains a counter per-{5-tuple, drop reason}. With ebpf_exporter you can export these statistics to Prometheus, on which you can run queries and visualize the results with Grafana. This is especially useful for tail and early drops since it allows you to understand which flows contribute to most of the drops.
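(A minimal sketch of that kind of program, assuming a libbpf toolchain. The argument position of the trap context and its use as an opaque map key are assumptions, and the 5-tuple dissection of the skb - which needs bpf_probe_read_kernel() on the packet data - is elided:)

/* SPDX-License-Identifier: GPL-2.0 */
#include <linux/bpf.h>
#include <linux/ptrace.h>
#include <linux/types.h>
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

struct {
	__uint(type, BPF_MAP_TYPE_HASH);
	__uint(max_entries, 1024);
	__type(key, __u64);	/* trap ctx pointer as an opaque id */
	__type(value, __u64);	/* drop count */
} hw_drops SEC(".maps");

SEC("kprobe/devlink_trap_report")
int count_hw_drops(struct pt_regs *ctx)
{
	/* Assumed argument order: (devlink, skb, trap_ctx, in_port);
	 * verify against the kernel you run. */
	__u64 key = PT_REGS_PARM3(ctx);
	__u64 *val, one = 1;

	val = bpf_map_lookup_elem(&hw_drops, &key);
	if (val)
		__sync_fetch_and_add(val, 1);
	else
		bpf_map_update_elem(&hw_drops, &key, &one, BPF_ANY);
	return 0;
}

char LICENSE[] SEC("license") = "GPL";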
No question that the mechanism is useful.

> > Users who want to know why things are dropped will not get a detailed breakdown from ethtool -S, which for better or worse is the one-stop shop for device stats today.
>
> I hope I managed to explain why counters are not enough, but I also want to point out that ethtool statistics are not properly documented and this hinders their effectiveness. I did my best to document the exposed traps in order to avoid the same fate:
>
> https://patchwork.ozlabs.org/patch/1128585/
>
> In addition, there are selftests to show how each trap can be triggered, to reduce the ambiguity even further:
>
> https://patchwork.ozlabs.org/patch/1128610/
>
> And a note in the documentation to make sure future functionality is tested as well:
>
> https://patchwork.ozlabs.org/patch/1128608/
>
> > Having thought about it some more, however, I think that having a forwarding "exception" object and hanging statistics off of it is a better design, even if we need to deal with some duplication to get there.
> >
> > IOW having a way to "trap all packets which would increment a statistic" (option (b) above) is probably a bad design.
> >
> > As for (a), I wonder how many of those events have a corresponding event in the kernel stack?
>
> Generic packet drops all have corresponding kfree_skb() calls in the kernel, but that does not mean that every packet dropped by the hardware would also be dropped by the kernel if it were to be injected into its Rx path.

The notion that all SW events get captured by kfree_skb() would not be correct. We have kfree_skb(), and xdp_exception(), and drivers can drop packets if various allocations fail... the situation is already not great.

I think that having a single useful place where users can look to see all traffic exception events would go a long way. The software side, as I mentioned, is pretty brutal; IDK how many users are actually willing to decode stack traces to figure out why their system is dropping packets :/

> In my reply to Dave I gave buffer drops as an example.

The example of buffer drops is also probably the case where having the packet is least useful, but yes, I definitely agree devices need a way of reporting events that can't happen in SW.

> There are also situations in which packets can be dropped due to device-specific exceptions and these do not have a corresponding drop reason in the kernel. See an example here:
>
> https://patchwork.ozlabs.org/patch/1128587/
>
> > If we could add corresponding trace points and just feed those from the device driver, that'd obviously be a holy grail.
>
> Unlike tracepoints, netlink gives you a structured and extensible interface. For example, in Spectrum-1 we cannot provide the Tx port for early/tail drops, whereas for Spectrum-2 and later we can. With netlink, we can just omit the DEVLINK_ATTR_TRAP_OUT_PORT attribute for Spectrum-1.
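(A sketch of how that looks on the kernel side. The helper, the metadata struct, and the attribute IDs are illustrative rather than the exact devlink code - the point is that optional metadata is expressed by attribute presence:)

#include <linux/errno.h>
#include <linux/skbuff.h>
#include <net/netlink.h>

/* Illustrative attribute IDs and metadata struct - not the real
 * devlink uAPI definitions. */
enum {
	DEVLINK_ATTR_TRAP_IN_PORT = 1,
	DEVLINK_ATTR_TRAP_OUT_PORT = 2,
};

struct trap_metadata {
	u32 in_port;
	u32 out_port;
	bool out_port_valid;	/* e.g. false on Spectrum-1 for early/tail drops */
};

static int trap_report_put_metadata(struct sk_buff *msg,
				    const struct trap_metadata *md)
{
	if (nla_put_u32(msg, DEVLINK_ATTR_TRAP_IN_PORT, md->in_port))
		return -EMSGSIZE;
	/* The ASIC may not know the Tx port; the attribute is then
	 * simply omitted and user space keys off its presence instead
	 * of a sentinel value. */
	if (md->out_port_valid &&
	    nla_put_u32(msg, DEVLINK_ATTR_TRAP_OUT_PORT, md->out_port))
		return -EMSGSIZE;
	return 0;
}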
> You also get a programmatic interface that you can query for this information:
>
> # devlink -v trap show netdevsim/netdevsim10 trap ingress_vlan_filter
> netdevsim/netdevsim10:
>   name ingress_vlan_filter type drop generic true report false action drop group l2_drops
>     metadata:
>        input_port

Right, you can set or not set skb fields to some extent, but it's definitely not as flexible as netlink.
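(And a hypothetical user-space consumer, using libmnl, to show the other half of the presence-based metadata. The attribute IDs are made up, and a real tool would first subscribe to the devlink generic netlink multicast group:)

#include <stdio.h>
#include <libmnl/libmnl.h>
#include <linux/genetlink.h>

/* Illustrative attribute IDs; the real ones live in the devlink uAPI. */
enum {
	TRAP_ATTR_UNSPEC,
	TRAP_ATTR_IN_PORT,	/* u32, always present */
	TRAP_ATTR_OUT_PORT,	/* u32, omitted on Spectrum-1 */
	__TRAP_ATTR_MAX,
};

static int attr_cb(const struct nlattr *attr, void *data)
{
	const struct nlattr **tb = data;
	int type = mnl_attr_get_type(attr);

	if (type < __TRAP_ATTR_MAX)
		tb[type] = attr;
	return MNL_CB_OK;
}

static void handle_trap_report(const struct nlmsghdr *nlh)
{
	const struct nlattr *tb[__TRAP_ATTR_MAX] = {};

	mnl_attr_parse(nlh, sizeof(struct genlmsghdr), attr_cb, tb);
	if (tb[TRAP_ATTR_IN_PORT])
		printf("in port: %u\n",
		       mnl_attr_get_u32(tb[TRAP_ATTR_IN_PORT]));
	/* Presence of the attribute tells us whether the ASIC could
	 * report the Tx port - no sentinel values needed. */
	if (tb[TRAP_ATTR_OUT_PORT])
		printf("out port: %u\n",
		       mnl_attr_get_u32(tb[TRAP_ATTR_OUT_PORT]));
}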