Control: tag -1 + moreinfo
Control: forwarded -1 https://github.com/linux-rdma/opensm/issues/37

Hello,

On Thu, Oct 31, 2024 at 07:53:52PM +0100, Francesco Poli (wintermute) wrote:
> Package: src:linux
> Version: 6.11.2-1
> Severity: important
> X-Debbugs-Cc: invernom...@paranoici.org
> 
> Hello,
> I encountered a major issue on an HPC cluster head node, as soon as
> I upgraded the Linux kernel from version 6.10.11-1 to version 6.11.2-1 .
> 
> The issue is that the head node runs OpenSM (InfiniBand subnet manager),
> which is needed for the Infiniband network to work.
> As soon as I reboot the head node with kernel 6.11.2-1 (or 6.11.4-1),
> OpenSM fails to start.
> If I reboot with the previous kernel version 6.10.11-1, everything
> works fine.
> 
> The symptoms are described in bug [#1085300], filed against package
> opensm.
> 
> [#1085300]: <https://bugs.debian.org/1085300>
> 
> Now I am not sure what's going on.
> 
> Is there any important change in the Linux kernel that OpenSM needs
> to adapt for?
> Or is this a bug in the newer Linux kernel version (that needs to
> be fixed there)?
> 
> I filed this bug report against the Debian Linux kernel, in order
> to warn other users about this issue, and in order to ask the Debian
> Kernel Team to investigate the issue and/or to forward the bug report
> to the relevant upstream Linux kernel maintainers.
> 
> Please do not reassign to package opensm with the intention of
> merging with bug [#1085300], unless you know for sure that the
> issue is in opensm and you know how to fix it.

Please do not report multiple bugs for the same issue. The right(er)
thing to do is to make use of "affects". Now there are three bug reports
(two for Debian and one upstream), and someone who is aware of only one
(or two) of them might miss some action, resulting in duplicate work.
 
> Please help, I would very much like to run the head node with
> an up-to-date kernel!

This is hard to act on without further input. Some questions to debug
this:

I guess the kernel provides a directory "/sys/class/infiniband_mad". Do
its contents look different on 6.10.x and 6.11.x?
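
One way to capture that for comparison is sketched below. The function
name list_mad_entries and the output file names are made up for
illustration; it only assumes that the directory exists and holds plain
files and symlinks.

```shell
# Hypothetical helper: print each entry of a directory together with its
# symlink target (empty for regular files), one entry per line.
list_mad_entries() {
    dir=${1:-/sys/class/infiniband_mad}
    for entry in "$dir"/*; do
        # Skip the unexpanded glob if the directory is empty or missing.
        [ -e "$entry" ] || [ -L "$entry" ] || continue
        printf '%s -> %s\n' "${entry##*/}" "$(readlink "$entry" 2>/dev/null)"
    done
}
```

Run `list_mad_entries > mad-$(uname -r).txt` once under each kernel and
then diff the two files.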

Can you please bisect the problem? There are a few kernel versions that
were packaged for Debian (i.e. 6.11-1~exp1, 6.11~rc5-1~exp1,
6.11~rc4-1~exp1, 6.10.12-1). I would expect that 6.11~rc4-1~exp1 is the
oldest failing one. It would be great if you could bisect this further.
Something like the following on the working kernel:

        git clone https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git
        cd linux
        git checkout v6.10
        cp /boot/config-$(uname -r) .config
        make localmodconfig
        cp .config arch/x86/configs/my_defconfig
        make bindeb-pkg

This creates a Debian kernel package that you can test. I would expect
this one to be "good".
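
With bindeb-pkg the resulting .deb files end up in the parent directory
of the source tree. A small sketch for picking the newest image package
(the helper name newest_kernel_deb is made up; it assumes the usual
linux-image-*.deb naming and skips the debug symbols package):

```shell
# Hypothetical helper: print the most recently built linux-image package
# found in the given directory (default: parent of the source tree).
newest_kernel_deb() {
    # ls -t sorts newest first; the -dbg package is not needed for testing.
    ls -t "${1:-..}"/linux-image-*.deb 2>/dev/null | grep -v -- '-dbg_' | head -n 1
}
```

Then install it with `sudo dpkg -i "$(newest_kernel_deb)"` and reboot
into the new kernel.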

(The following steps don't need to be done on the working kernel; that
only matters for the localmodconfig step above.)

Then test 6.11:

        git checkout v6.11
        make my_defconfig bindeb-pkg

I would expect this one to produce a broken kernel package.

If you can confirm that (i.e. vanilla 6.10 works and 6.11 doesn't), do
the actual bisection:

        git bisect start v6.11 v6.10

and in each to-be-tested version do:

        make my_defconfig bindeb-pkg

and test the resulting kernel package. Depending on whether it is good
or bad, do:

        git bisect good

or

        git bisect bad

Note that you don't have to test exactly the versions that git-bisect
suggests. To speed things up, it might be beneficial to test
v6.11-rc1~117 and v6.11-rc1~116 first. To do so, don't test the version
that git-bisect proposes but do:

        git checkout v6.11-rc1~117
        make my_defconfig bindeb-pkg
        ... test ...
        git bisect ...

and then the same for v6.11-rc1~116.

Then report back the first bad commit you found. If you have
difficulties following these instructions, feel free to contact me, e.g.
in the #debian-kernel IRC channel.

Best regards
Uwe
