Control: tag -1 + moreinfo
Control: forwarded -1 https://github.com/linux-rdma/opensm/issues/37

Hello,

On Thu, Oct 31, 2024 at 07:53:52PM +0100, Francesco Poli (wintermute) wrote:
> Package: src:linux
> Version: 6.11.2-1
> Severity: important
> X-Debbugs-Cc: invernom...@paranoici.org
>
> Hello,
> I encountered a major issue on an HPC cluster head node, as soon as
> I upgraded the Linux kernel from version 6.10.11-1 to version 6.11.2-1 .
>
> The issue is that the head node runs OpenSM (InfiniBand subnet manager),
> which is needed for the Infiniband network to work.
> As soon as I reboot the head node with kernel 6.11.2-1 (or 6.11.4-1),
> OpenSM fails to start.
> If I reboot with the previous kernel version 6.10.11-1, everything
> works fine.
>
> The symptoms are described in bug [#1085300], filed against package
> opensm.
>
> [#1085300]: <https://bugs.debian.org/1085300>
>
> Now I am not sure what's going on.
>
> Is there any important change in the Linux kernel that OpenSM needs
> to adapt for?
> Or is this a bug in the newer Linux kernel version (that needs to
> be fixed there)?
>
> I filed this bug report against the Debian Linux kernel, in order
> to warn other users about this issue, and in order to ask the Debian
> Kernel Team to investigate the issue and/or to forward the bug report
> to the relevant upstream Linux kernel maintainers.
>
> Please do not reassign to package opensm with the intention of
> merging with bug [#1085300], unless you know for sure that the
> issue is in opensm and you know how to fix it.

Please do not report multiple bugs for the same issue. The right(er)
thing to do is to make use of "affects". Now there are three bug reports
(2 for Debian and one upstream) and someone being aware of only one (or
two) of them, might miss some action which results in duplicate work.

> Please help, I would very much like to run the head node with
> an up-to-date kernel!

This is hard to act on without further input. Some questions to debug
this:

I guess the kernel provides a directory "/sys/class/infiniband_mad".
Do its contents look different on 6.10.x and 6.11.x?

Can you please bisect the problem? There are a few kernel versions that
were packaged for Debian (i.e. 6.11-1~exp1, 6.11~rc5-1~exp1,
6.11~rc4-1~exp1, 6.10.12-1). I would expect that 6.11~rc4-1~exp1 is the
oldest failing one.

It would be great if you could bisect this further. Something like the
following on the working kernel:

	git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
	cd linux
	git checkout v6.10
	cp /boot/config-$(uname -r) .config
	make localmodconfig
	cp .config arch/x86/configs/my_defconfig
	make bindeb-pkg

This creates a Debian kernel package that you can test. I would hope
this one to be "good". (The following steps don't need to be done on the
working kernel; running the working kernel is only critical for the
localmodconfig step above.)

Then test 6.11:

	git checkout v6.11
	make my_defconfig bindeb-pkg

I would expect this one to produce a broken kernel package. If you can
confirm that (i.e. vanilla 6.10 works and 6.11 doesn't), do the actual
bisection:

	git bisect start v6.11 v6.10

and in each to-be-tested version do:

	make my_defconfig bindeb-pkg

and test the resulting kernel package. Depending on whether that is good
or bad, do:

	git bisect good

or

	git bisect bad

Note that you don't have to test exactly the versions that git-bisect
suggests. To speed things up, it might be beneficial to test
v6.11-rc1~117 and v6.11-rc1~116 first. To do so, don't test the version
that git-bisect proposes but do:

	git checkout v6.11-rc1~117
	make my_defconfig bindeb-pkg
	... test ...
	git bisect ...

and then the same for v6.11-rc1~116.

Then report back the first bad commit that git-bisect finds.

If you have difficulties following these instructions, feel free to
contact me, e.g. in the #debian-kernel IRC channel.

Best regards
Uwe
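
PS: For the /sys/class/infiniband_mad comparison asked about above, something along the following lines might help. It is only a sketch: the function name snapshot_tree and the output file names are made up for illustration, and it assumes that looking two levels deep into the directory is enough. It prints every entry together with the contents of regular files, so the output from the two kernels can be compared with diff.

```shell
#!/bin/sh
# Hypothetical helper (not part of any package): list the entries one and
# two levels below a directory and show the contents of regular files.
# Symlinked directories (as found in /sys/class/*) are followed by the glob,
# but nothing deeper than two levels is visited.
snapshot_tree() {
    dir=$1
    for path in "$dir"/* "$dir"/*/*; do
        [ -e "$path" ] || continue        # skip patterns that matched nothing
        printf '%s\n' "${path#"$dir"/}"   # path relative to the start directory
        if [ -f "$path" ]; then
            # indent file contents; ignore files that cannot be read
            sed 's/^/    /' "$path" 2>/dev/null
        fi
    done
}

# Possible usage, once per kernel (file names are only an example):
#   snapshot_tree /sys/class/infiniband_mad > "mad-$(uname -r).txt"
# and afterwards:
#   diff -u mad-<good version>.txt mad-<bad version>.txt
```

If the directory does not exist at all on one of the kernels, the output is simply empty, which would already be a useful data point.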