On Fri, 13 Nov 2020 14:51:36 -0800 Jacob Keller wrote: > On 11/13/2020 2:32 PM, Jacob Keller wrote: > > > > > > On 11/13/2020 1:34 PM, Jacob Keller wrote: > >> Well, at least with ice, the experience is pretty bad. I tried out with > >> a garbage file name on one of my test systems. This was on a slightly > >> older kernel without this patch applied, and the device had a pending > >> update that had not yet been finalized with a reset: > >> > >> $ devlink dev flash pci/0000:af:00.0 file garbage_file_does_not_exist > >> Canceling previous pending update > >> > >> > >> The update looks like it got stuck, but actually it failed. Somehow the > >> extack error over the socket didn't get handled by the application very > >> well. Something buggy in the forked process probably. > >> > >> I do get this in the dmesg though: > >> > >> Nov 13 13:12:57 jekeller-stp-glorfindel kernel: ice 0000:af:00.0: Direct > >> firmware load for garbage_file_does_not_exist failed with error -2 > >> > > > > I think I figured out what is going on here, but I'm not sure what the > > best solution is. > > > > in userspace devlink.c:3410, the condition for exiting the while loop > > that monitors the flash update process is: > > > > (!ctx.flash_done || (ctx.not_first && !ctx.received_end)) > > > > FWIW changing this to > > (!ctx.flash_done && !ctx.received_end) > > works for this problem, but I suspect that the original condition > intended to try and catch the case where flash update has exited but we > haven't yet processed all of the status messages?
Yeah... I've only looked at this for 5 minutes, but it seems that ice should not send notifications outside of begin / end (in fact it could be nice to add an appropriate WARN_ON() in notify())... > I mean in some sense we could just wait for !ctx.flash_done only. Then > we'd always loop until the initial request exits. > > There's a slight issue with the netlink extack message not being > displayed on its own line, but I think that just needs us to add a > pr_out("\n") somewhere to fix it. > > > > This condition means keep looping until flash is done *OR* we've > > received a message but have not yet received the end. > > > > In the ice driver implementation, we perform a check for a pending flash > > update first, which could trigger a cancellation that causes us to send > > back a "cancelling previous pending flash update" status message, which > > was sent *before* the devlink_flash_update_begin_notify(). Then, after > > this we request the firmware object, which fails, and results in an > > error code being reported back.. > > > > However, we will never send either the begin or end notification at this > > point. Thus, the devlink userspace application will never quit, and > > won't display the extack message. > > > > This occurs because we sent a status notify message before we actually > > sent a "begin notify". I think the ordering was done because of trying > > to avoid having a complicated cleanup logic. > > > > In some sense, this is a bug in the ice driver.. but in another sense > > this is the devlink application being too strict about the requirements > > on ordering of these messages.. > > > > I guess one method if we really want to remain strict is moving the > > "begin" and "end" notifications outside of the driver into core so that > > it always sends a begin before calling the .flash_update handler, and > > always sends an end before exiting. > > > > I guess we could simply relax the restriction on "not first" so that > > we'll always exit in the case where the forked process has quit on us, > > even if we haven't received a proper flash end notification... > > > > Thoughts? > >