On Tue, Dec 08, 2020 at 04:17:37PM -0800, Jakub Kicinski wrote:
> On Wed, 9 Dec 2020 00:03:56 +0000 Vladimir Oltean wrote:
> > On Tue, Dec 08, 2020 at 03:57:44PM -0800, Jakub Kicinski wrote:
> > > On Mon, 7 Dec 2020 01:00:40 +0000 Vladimir Oltean wrote:
> > > > - ensuring through convention that user space always takes
> > > > net->netdev_lists_lock when calling dev_get_stats, and documenting
> > > > that, and therefore making it unnecessary to lock in bonding.
> > >
> > > This seems like the better option to me. Makes the locking rules pretty
> > > clear.
> >
> > It is non-obvious to me that top-level callers of dev_get_stats should
> > hold a lock as specific as the one protecting the lists of network
> > interfaces. In the vast majority of implementations of dev_get_stats,
> > that lock would not actually protect anything, which would lead to
> > just one more lock that is used for more than it should be. In my tree I
> > had actually already switched over to mutex_lock_nested. Nonetheless, I
> > am still open if you want to make the case that simplicity should prevail
> > over specificity.
>
> What are the locking rules you have in mind then? Caller may hold RTNL
> or ifc mutex?
Well, it's clear that calling mutex_lock_nested would only silence lockdep; there would still be a non-reentrant mutex held from a potentially recursive code path. So it's a non-solution, and even worse than using plain mutex_lock, because at least that is detectable when it locks up the system.

net_failover and bonding are the only drivers that create this recursion requirement in dev_get_stats. Other one-over-many stackable interfaces, like the bridge, just use dev_get_tstats64. I'm almost thinking that it would be cleaner to convert these two to dev_get_tstats64 too; that would simplify things enormously. Even team uses something based on software counters, reminiscent of dev_get_tstats64, definitely not counters retrieved from the underlying device.

Of course, the disadvantage of doing that is that virtual interfaces could no longer retrieve statistics recursively from their lower interface. I'm trying to think how much of a real disadvantage that would be. For offloaded interfaces the counters will be completely off, that's for sure, and this is one of the reasons that mandated the DSA rework to expose MAC-based counters in dev_get_stats in the first place. But think of the larger picture: an offloading interface that supports IP forwarding, with 50 VLAN uppers. How could the statistics counters of those VLAN uppers ever be correct? It's not realistic to expect the underlying hardware to keep separate statistics for each upper, which the upper could then just go ahead and query. Sure, in the cases where a lower can have only one upper at a time (bridge, bonding), what I said is not applicable, but the general goal of having accurate counters for offloading interfaces everywhere does not seem really achievable.
In case this doesn't work out, I guess I'll have to document that dev_get_stats is recursive, and that all mutual-exclusion-based locks need to be taken upfront, which is the only strategy I know of that works with recursion.

> > But in that case, maybe we should just keep on using the RTNL mutex.
>
> That's a wasted opportunity, RTNL lock is pretty busy.

It is certainly easy to use and easy to enforce, though. It also seems to protect everything, which is generally the reason why you tend not to think a lot when using it.

To be clear, I am not removing the RTNL mutex from any place where it is held today. Letting me do that would be like letting a bull into a china shop. I am only creating a sub-lock of it, which protects a subset of what the RTNL mutex protects. By "sub-lock" I mean that code paths which currently hold the RTNL mutex, like list_netdevice(), will also take net->netdev_lists_lock - never the other way around, as that would create lock inversion.

The plan is then to require new code that iterates through the network interface lists to use the new locking scheme and not the RTNL mutex, and in parallel to migrate, on a best-effort basis, code that runs under ASSERT_RTNL + net->netdev_lists_lock to take only net->netdev_lists_lock and be RTNL-free.

But is that any better at runtime than just taking the RTNL mutex - will it make it less contended? Nope, due to the indirect serialization effect created by the fact that net->netdev_lists_lock is a sub-lock of the RTNL mutex. Say net/core/net-procfs.c holds net->netdev_lists_lock while dumping statistics from firmware, sleeping while doing so. All is rosy in its RTNL-free code path. Then along comes another thread which calls for_each_netdev, something that today requires the RTNL mutex but could be reworked to use net->netdev_lists_lock.
Let's assume we are at a stage where the gazillion places that call for_each_netdev() have been converted to also take net->netdev_lists_lock. But not all of them could be converted to also _not_ take the RTNL mutex. This means that there will be code paths that take the RTNL mutex and end up waiting for our net/core/net-procfs.c thread to finish dumping the firmware statistics.