On 12/12/18 12:08 PM, Igor Russkikh wrote:
>
>>
>> The idea of having the PHY/network device as a cooling agent is
>> something valuable, but as Andrew pointed out, you need to expose this
>> as a standard HWMON device, and you need to let user-space implement the
>> appropriate thermal policy, not do that in the network driver underneath
>> the user's feet with no feedback other than link dropped, got
>> re-negotiated at a different speed. How would one be able to
>> differentiate those events from a faulty link partner for instance?
>
>>
>> None of what you are doing here is specific to your device driver and
>> the policy of downgrading the link speed to lower the thermal budget is
>> something that is nearly universally applicable to all network
>> equipments because higher speeds just require higher power.
>>
>
> Hi Florian,
> Partially agreed with you, but as far as I know there is no much of
> ready to use infrastructure for this to use right now?

If you use programs like thermald, I am quite positive you could script
an action which involves re-negotiation of the link at a lower speed, and
that would be something applicable to a variety of network devices.

>
> IMHO that could be a both-way solution, where short term driver patch
> will secure against hardware burn out right now, and long term hwmon
> based infrastructure could handle that on userspace level.

The short term and most effective solution would be to have the firmware
running on the device do the thermal throttling; that way, if the host
CPU has crashed or is unresponsive, you can still take corrective action.
Your response to Andrew seems to suggest this is not possible, so if we
are reaching the critical junction temperature of your chip and that, in
turn, causes the enclosure to melt down, then clearly the runaway
solution is not good.

>
> A whole separate concern is how much userspace should be involved here.
> It could be a very device specific (and therefore driver specific) logic
> on how to do device's thermal control.

My problem with your approach is people doing the same thing in each and
every one of their drivers and building policy, as opposed to mechanisms,
in the kernel. If the argument is "user space may not be running a
thermal solution", then clearly you need a hardware driven (or firmware
driven) approach which works across all possible use cases, including
those where appropriate SW is not there. If you look at how your desktop
PC likely manages the fans in the chassis, they can be SW controlled, or
ACPI controlled, for the same reasons.
--
Florian
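
For illustration only, here is a minimal sketch of the kind of user-space
policy discussed above: read a temperature from an hwmon sysfs node and
re-negotiate the link at a lower speed once a threshold is crossed. It
is not code from this thread. The thresholds, the function names, the
hwmon directory and the interface name are all hypothetical; the only
things assumed from existing interfaces are that hwmon exposes
temp1_input in millidegrees Celsius and that ethtool -s accepts speed,
duplex and autoneg settings.

```python
import subprocess
from pathlib import Path

# Hypothetical thresholds in millidegrees Celsius; real values are device specific.
THROTTLE_MC = 95_000   # at or above this, drop to the reduced speed
RESTORE_MC = 80_000    # at or below this, return to full speed


def read_temp_mc(hwmon_dir):
    """Read temp1_input, which the hwmon sysfs ABI reports in millidegrees C."""
    return int(Path(hwmon_dir, "temp1_input").read_text().strip())


def next_speed(temp_mc, current, full, reduced):
    """Choose a link speed (Mb/s) with hysteresis to avoid flapping the link."""
    if temp_mc >= THROTTLE_MC:
        return reduced
    if temp_mc <= RESTORE_MC:
        return full
    return current  # between thresholds: keep whatever is set now


def ethtool_cmd(iface, speed):
    """Build an ethtool command that limits negotiation to the given speed."""
    return ["ethtool", "-s", iface, "speed", str(speed),
            "duplex", "full", "autoneg", "on"]


def apply_policy(hwmon_dir, iface, current, full=10000, reduced=1000,
                 run=subprocess.run):
    """Run one policy step; returns the speed that should now be in effect."""
    target = next_speed(read_temp_mc(hwmon_dir), current, full, reduced)
    if target != current:
        # Only touch the link when the decision actually changes.
        run(ethtool_cmd(iface, target), check=True)
    return target
```

A daemon would call apply_policy() periodically, for example with a
hypothetical hwmon path such as /sys/class/hwmon/hwmon0 and interface
eth0, and log each speed change so that a thermally driven
re-negotiation can be told apart from a faulty link partner.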