Your message dated Thu, 5 Dec 2024 10:10:01 +0100
with message-id <20241205101001.ad56e378a54a55b2ddefc...@paranoici.org>
and subject line Re: Bug#1085300: opensm: fails to start after Linux kernel upgrade to 6.11.2
has caused the Debian Bug report #1085300,
regarding opensm: fails to start after Linux kernel upgrade to 6.11.2
to be marked as done.

This means that you claim that the problem has been dealt with.
If this is not the case it is now your responsibility to reopen the
Bug report if necessary, and/or fix the problem forthwith.

(NB: If you are a system administrator and have no idea what this
message is talking about, this may indicate a serious mail system
misconfiguration somewhere. Please contact ow...@bugs.debian.org
immediately.)


-- 
1085300: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1085300
Debian Bug Tracking System
Contact ow...@bugs.debian.org with problems
--- Begin Message ---
Package: opensm
Version: 3.3.23-3
Severity: grave
Justification: renders package unusable
X-Debbugs-Cc: invernom...@paranoici.org

Hello, thanks for maintaining this package.

I encountered a major issue as soon as I upgraded the Linux kernel
on the box where OpenSM runs (it manages the InfiniBand network of
an HPC cluster).

Before the upgrade:

  # uname -v
  #1 SMP PREEMPT_DYNAMIC Debian 6.10.11-1 (2024-09-22)

Snippet from /var/log/opensm.0x9c63c00300033240.log

  976187 [2F772740] 0x03 -> OpenSM 3.3.23
  976444 [2F772740] 0x80 -> OpenSM 3.3.23
  983202 [2F772740] 0x02 -> osm_vendor_init: 1000 pending umads specified
  984338 [2F772740] 0x80 -> Entering DISCOVERING state
  984431 [2F772740] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x9c63c00300033240
  008073 [2F772740] 0x02 -> osm_vendor_bind: Mgmt class 0x03 binding to port GUID 0x9c63c00300033240
  008097 [2F772740] 0x02 -> osm_vendor_bind: Mgmt class 0x04 binding to port GUID 0x9c63c00300033240
  008118 [2F772740] 0x02 -> osm_vendor_bind: Mgmt class 0x21 binding to port GUID 0x9c63c00300033240
  008159 [2F772740] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x9c63c00300033240
  009661 [2E2006C0] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
  009728 [238006C0] 0x80 -> SM port is down
  985713 [2BA006C0] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
  985857 [29C006C0] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
  987086 [238006C0] 0x80 -> SM port is up
  992267 [238006C0] 0x80 -> Entering MASTER state
  995175 [238006C0] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
  997623 [238006C0] 0x02 -> SUBNET UP
  998308 [24C006C0] 0x02 -> log_notice: Reporting Informational Notice "New mcast group created", MGID:ff12:401b:ffff::ffff:ffff
  178273 [24C006C0] 0x02 -> log_notice: Reporting Informational Notice "New mcast group created", MGID:ff12:601b:ffff::1
  178300 [24C006C0] 0x02 -> log_notice: Reporting Informational Notice "New mcast group created", MGID:ff12:401b:ffff::1
  179342 [24C006C0] 0x02 -> log_notice: Reporting Informational Notice "New mcast group created", MGID:ff12:601b:ffff::1:ff03:3240
  183044 [24C006C0] 0x02 -> log_notice: Reporting Informational Notice "New mcast group created", MGID:ff15:4001:ffff:3:400::
  189187 [24C006C0] 0x02 -> log_notice: Reporting Informational Notice "New mcast group created", MGID:ff12:601b:ffff::16
  306099 [24C006C0] 0x02 -> log_notice: Reporting Informational Notice "New mcast group created", MGID:ff12:601b:ffff::1:ff6f:b638
  521640 [2D8006C0] 0x01 -> log_trap_info: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:2 TID:0x0000014000000080

  # ps aux | grep opens[m]
  root        1163  0.0  0.0 1560768 3636 ?        Ssl  15:47   0:00 /usr/sbin/opensm --guid 0x9c63c00300033240 --log_file /var/log/opensm.0x9c63c00300033240.log

  # ibnodes
  Ca     : 0x9c63c00300033240 ports 1 "$HOST ibp129s0f0"
  Switch : 0xa088c203006fb638 ports 81 "MF0;switch-5f8718:MQM8700/U1" enhanced port 0 lid 2 lmc 0
  [...]

Everything looks OK.

After the upgrade:

  # reboot

  # uname -v
  #1 SMP PREEMPT_DYNAMIC Debian 6.11.2-1 (2024-10-05)

Snippet from /var/log/opensm.0x9c63c00300033240.log

  138670 [F0D43740] 0x03 -> OpenSM 3.3.23
  138934 [F0D43740] 0x80 -> OpenSM 3.3.23
  140409 [F0D43740] 0x02 -> osm_vendor_init: 1000 pending umads specified
  140628 [F0D43740] 0x80 -> Entering DISCOVERING state
  140711 [F0D43740] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x9c63c00300033240
  165028 [F0D43740] 0x01 -> osm_vendor_bind: ERR 5426: Unable to register class 129 version 1
  165157 [F0D43740] 0x01 -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind failed
  165162 [F0D43740] 0x01 -> osm_sm_bind: ERR 2E10: SM MAD Controller bind failed (IB_ERROR)
  165173 [F0D43740] 0x01 -> perfmgr_mad_unbind: ERR 5405: No previous bind
  165176 [F0D43740] 0x01 -> osm_congestion_control_shutdown: ERR C108: No previous bind
  165217 [F0D43740] 0x01 -> osm_sa_mad_ctrl_unbind: ERR 1A11: No previous bind
  165630 [F0D43740] 0x80 -> Exiting SM

  # ps aux | grep opens[m]

  # ibnodes
  ibwarn: [1795] mad_rpc_open_port: client_register for mgmt 1 failed
  ./libibnetdisc/ibnetdisc.c:798; can't open MAD port ((null):0)
  /usr/sbin/ibnetdiscover: iberror: failed: discover failed
  ibwarn: [1800] mad_rpc_open_port: client_register for mgmt 1 failed
  ./libibnetdisc/ibnetdisc.c:798; can't open MAD port ((null):0)
  /usr/sbin/ibnetdiscover: iberror: failed: discover failed

The InfiniBand network does not work.

If I reboot with the previous Linux kernel version, everything
works again.

I cannot understand what's going on.

Is there any important change in the Linux kernel that OpenSM needs
to adapt to?
Or is this a bug in the newer Linux kernel version (that needs to
be fixed there)?

Please note that the other cluster nodes can run the newest
version (6.11.2-1) of the Linux kernel and connect to the InfiniBand
network, as long as the node which runs OpenSM is using the previous
Linux kernel version. Hence, it does not seem that Linux kernel
version 6.11.2-1 broke its support for (mlx5) InfiniBand networks:
it's just that opensm/3.3.23-3 and linux-image-6.11.2-amd64/6.11.2-1
don't seem to work together...
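
For anyone who wants to compare the two logs mechanically, here is a
minimal Python sketch that pulls the ERR entries out of log lines shaped
like the snippets above. It is not part of OpenSM: the names ERR_RE,
extract_errors and bind_failed are hypothetical, and the regular
expression only assumes the line layout quoted in this report, with each
entry on a single unwrapped line.

```python
import re

# Assumed layout, taken from the snippets quoted in this report:
#   165028 [F0D43740] 0x01 -> osm_vendor_bind: ERR 5426: Unable to register ...
ERR_RE = re.compile(
    r"\[(?P<thread>[0-9A-Fa-f]+)\]\s+0x(?P<level>[0-9A-Fa-f]{2})\s+->\s+"
    r"(?P<func>\w+): ERR (?P<code>[0-9A-Fa-f]{4}): (?P<msg>.*)"
)


def extract_errors(lines):
    """Return a (function, code, message) tuple for every ERR entry."""
    errors = []
    for line in lines:
        match = ERR_RE.search(line)
        if match:
            errors.append(
                (match.group("func"), match.group("code"), match.group("msg").strip())
            )
    return errors


def bind_failed(lines):
    """True if osm_vendor_bind itself logged an error.

    Filtering on the function name matters: the working log also contains
    ERR 0F04 entries from pi_rcv_check_and_fix_lid, which were not fatal.
    """
    return any(func == "osm_vendor_bind" for func, _, _ in extract_errors(lines))


# Three lines copied from the failing log above.
SAMPLE = [
    "  140711 [F0D43740] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x9c63c00300033240",
    "  165028 [F0D43740] 0x01 -> osm_vendor_bind: ERR 5426: Unable to register class 129 version 1",
    "  165157 [F0D43740] 0x01 -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind failed",
]

if __name__ == "__main__":
    for func, code, msg in extract_errors(SAMPLE):
        print(f"{func}: ERR {code}: {msg}")
```

Run against the real file, the lines would come from something like
open("/var/log/opensm.0x9c63c00300033240.log"), which is the log path
used on this box.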

Please investigate this bug and fix it and/or forward my bug report
to OpenSM upstream developers, as appropriate.
Or, if the bug is in the Linux kernel, please forward the bug report
to the Linux kernel upstream developers.

Thanks for your time and dedication!


-- System Information:
Debian Release: trixie/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 6.10.11-amd64 (SMP w/16 CPU threads; PREEMPT)
Locale: LANG=C, LC_CTYPE=en_US.utf8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages opensm depends on:
ii  infiniband-diags     52.0-2
ii  init-system-helpers  1.67
ii  libc6                2.40-3
ii  libopensm9           3.3.23-3
ii  libosmcomp5          3.3.23-3
ii  libosmvendor5        3.3.23-3
ii  libwrap0             7.6.q-33

opensm recommends no packages.

opensm suggests no packages.

-- Configuration Files:
/etc/default/opensm changed:
PORTS="0x9c63c00300033240"


-- no debconf information

--- End Message ---
--- Begin Message ---
On Fri, 8 Nov 2024 23:56:06 +0100 Francesco Poli wrote:

> Control: forwarded -1 https://github.com/linux-rdma/opensm/issues/37
> 
> 
> On Fri, 18 Oct 2024 00:25:27 +0200 Francesco Poli (wintermute) wrote:
> 
> [...]
> > Please [...] forward my bug report
> > to OpenSM upstream developers, as appropriate.
> [...]
> 
> I have forwarded the bug report upstream by myself.

Hello,
I have also [reported] the bug against the Debian Linux kernel package,
since I suspected the issue was in the kernel.

[reported]: <https://bugs.debian.org/1086520>

Long story short, after quite some investigation, it turned out to be
an InfiniBand NIC firmware bug.
Upgrading the firmware solved the issue.

I am closing this bug report right now, as the issue is not in package
opensm.


-- 
 http://www.inventati.org/frx/
 There's not a second to spare! To the laboratory!
..................................................... Francesco Poli .
 GnuPG key fpr == CA01 1147 9CD2 EFDF FB82  3925 3E1C 27E1 1F69 BFFE


--- End Message ---
