Control: tags -1 + unreproducible moreinfo

On Wed, Jan 22, 2025 at 12:29:12AM +0100, Francesco Poli (wintermute) wrote:
> Package: nfs-kernel-server
> Version: 1:2.8.2-1+b1
> Severity: grave
> Justification: causes non-serious data loss
> X-Debbugs-Cc: invernom...@paranoici.org
>
>
> Dear maintainers,
> I encountered a big issue while upgrading package 'nfs-kernel-server'
> on the box where the NFS server runs (the clients run on the compute
> nodes of an HPC cluster).
>
> The upgrade:
>
>   [UPGRADE] nfs-kernel-server:amd64 1:2.8.2-1 -> 1:2.8.2-1+b1
>
> got stuck at
>
>   [...]
>   Setting up nfs-kernel-server (1:2.8.2-1+b1) ...
>
>
>
> It looks like it was stuck at the restart of the systemd service:
>
>   # systemctl status nfs-kernel-server.service
>   ● nfs-server.service - NFS server and services
>        Loaded: loaded (/usr/lib/systemd/system/nfs-server.service; enabled; prese>
>       Drop-In: /run/systemd/generator/nfs-server.service.d
>                └─order-with-mounts.conf
>        Active: activating (start-pre) since Tue 2025-01-21 12:40:52 CET; 10min ago
>           Job: 97667
>    Invocation: ced460d410fe4059b9e8781b35340d70
>          Docs: man:rpc.nfsd(8)
>                man:exportfs(8)
>    Cntrl PID: 249039 (exportfs)
>         Tasks: 3 (limit: 154102)
>        Memory: 680K (peak: 2.5M)
>           CPU: 10ms
>        CGroup: /system.slice/nfs-server.service
>                ├─239857 /usr/sbin/nfsdctl threads 0
>                ├─239918 /usr/sbin/exportfs -au
>                └─249039 /usr/sbin/exportfs -r
>
> There was a 'nfsdctl' process in uninterruptible sleep (D):
>
>   $ ps -eldaf | grep nf[s]
>   4 D root      239857       1  0  80   0 -   847 -      12:07 ?        00:00:00 /usr/sbin/nfsdctl threads 0
>   5 S root      247511       1  0  80   0 -  1375 -      12:35 ?        00:00:00 /usr/sbin/nfsdcld
>
> After about 30 min, since trying to kill PID 239857 obviously had no effect,
> and I could not find any other strategy to restart nfs-kernel-server.service,
> I had to reboot the box, thus causing many problems for all the NFS clients.
>
> After reboot, I could issue:
>
>   # aptitude --purge-unused safe-upgrade
>
> which finally completed the upgrade (fixing the nfs-kernel-server package,
> which was left in a partially configured state).
>
>
> I have never seen anything like this before, and I have upgraded
> nfs-kernel-server and related packages on Debian machines for quite
> a long time.
> Anyway, this should *not* happen during a system upgrade with
> aptitude or apt!
>
> I don't know whether bug [#992661] is related or not.
>
> [#992661]: <https://bugs.debian.org/992661>
>
> By looking at /var/log/kern.log, I see that a kernel BUG was traced
> at the time when the 'nfsdctl' process got stuck in D state.
> See the attached kern.log snippet.
>
> Please investigate and fix the issue as soon as possible.
> I really hope we can prevent this from happening again!
>
> Thanks for your time and dedication.

So I'm not able to reproduce this on a current Debian unstable system
mimicking the upgrade. *But* it is possible that we have some races
somewhere, as recently discussed at our regular kernel team meeting.
In any case, we first need to find a way to trigger the issue.

Regards,
Salvatore