Package: opensm
Version: 3.3.23-3
Severity: grave
Justification: renders package unusable
X-Debbugs-Cc: invernom...@paranoici.org

Hello,
thanks for maintaining this package.

I encountered a major issue, as soon as I upgraded the Linux kernel of the box where OpenSM runs (in order to manage the InfiniBand network of an HPC cluster).

Before the upgrade:

# uname -v
#1 SMP PREEMPT_DYNAMIC Debian 6.10.11-1 (2024-09-22)

Snippet from /var/log/opensm.0x9c63c00300033240.log

976187 [2F772740] 0x03 -> OpenSM 3.3.23
976444 [2F772740] 0x80 -> OpenSM 3.3.23
983202 [2F772740] 0x02 -> osm_vendor_init: 1000 pending umads specified
984338 [2F772740] 0x80 -> Entering DISCOVERING state
984431 [2F772740] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x9c63c00300033240
008073 [2F772740] 0x02 -> osm_vendor_bind: Mgmt class 0x03 binding to port GUID 0x9c63c00300033240
008097 [2F772740] 0x02 -> osm_vendor_bind: Mgmt class 0x04 binding to port GUID 0x9c63c00300033240
008118 [2F772740] 0x02 -> osm_vendor_bind: Mgmt class 0x21 binding to port GUID 0x9c63c00300033240
008159 [2F772740] 0x02 -> osm_opensm_bind: Setting IS_SM on port 0x9c63c00300033240
009661 [2E2006C0] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
009728 [238006C0] 0x80 -> SM port is down
985713 [2BA006C0] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
985857 [29C006C0] 0x01 -> pi_rcv_check_and_fix_lid: ERR 0F04: Got invalid base LID 65535 from the network. Corrected to 0
987086 [238006C0] 0x80 -> SM port is up
992267 [238006C0] 0x80 -> Entering MASTER state
995175 [238006C0] 0x02 -> osm_ucast_mgr_process: minhop tables configured on all switches
997623 [238006C0] 0x02 -> SUBNET UP
998308 [24C006C0] 0x02 -> log_notice: Reporting Informational Notice "New mcast group created", MGID:ff12:401b:ffff::ffff:ffff
178273 [24C006C0] 0x02 -> log_notice: Reporting Informational Notice "New mcast group created", MGID:ff12:601b:ffff::1
178300 [24C006C0] 0x02 -> log_notice: Reporting Informational Notice "New mcast group created", MGID:ff12:401b:ffff::1
179342 [24C006C0] 0x02 -> log_notice: Reporting Informational Notice "New mcast group created", MGID:ff12:601b:ffff::1:ff03:3240
183044 [24C006C0] 0x02 -> log_notice: Reporting Informational Notice "New mcast group created", MGID:ff15:4001:ffff:3:400::
189187 [24C006C0] 0x02 -> log_notice: Reporting Informational Notice "New mcast group created", MGID:ff12:601b:ffff::16
306099 [24C006C0] 0x02 -> log_notice: Reporting Informational Notice "New mcast group created", MGID:ff12:601b:ffff::1:ff6f:b638
521640 [2D8006C0] 0x01 -> log_trap_info: Received Generic Notice type:1 num:128 (Link state change) Producer:2 (Switch) from LID:2 TID:0x0000014000000080

# ps aux | grep opens[m]
root 1163 0.0 0.0 1560768 3636 ? Ssl 15:47 0:00 /usr/sbin/opensm --guid 0x9c63c00300033240 --log_file /var/log/opensm.0x9c63c00300033240.log

# ibnodes
Ca : 0x9c63c00300033240 ports 1 "$HOST ibp129s0f0"
Switch : 0xa088c203006fb638 ports 81 "MF0;switch-5f8718:MQM8700/U1" enhanced port 0 lid 2 lmc 0
[...]

Everything looks OK.

After the upgrade:

# reboot
# uname -v
#1 SMP PREEMPT_DYNAMIC Debian 6.11.2-1 (2024-10-05)

Snippet from /var/log/opensm.0x9c63c00300033240.log

138670 [F0D43740] 0x03 -> OpenSM 3.3.23
138934 [F0D43740] 0x80 -> OpenSM 3.3.23
140409 [F0D43740] 0x02 -> osm_vendor_init: 1000 pending umads specified
140628 [F0D43740] 0x80 -> Entering DISCOVERING state
140711 [F0D43740] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x9c63c00300033240
165028 [F0D43740] 0x01 -> osm_vendor_bind: ERR 5426: Unable to register class 129 version 1
165157 [F0D43740] 0x01 -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind failed
165162 [F0D43740] 0x01 -> osm_sm_bind: ERR 2E10: SM MAD Controller bind failed (IB_ERROR)
165173 [F0D43740] 0x01 -> perfmgr_mad_unbind: ERR 5405: No previous bind
165176 [F0D43740] 0x01 -> osm_congestion_control_shutdown: ERR C108: No previous bind
165217 [F0D43740] 0x01 -> osm_sa_mad_ctrl_unbind: ERR 1A11: No previous bind
165630 [F0D43740] 0x80 -> Exiting SM

# ps aux | grep opens[m]
# ibnodes
ibwarn: [1795] mad_rpc_open_port: client_register for mgmt 1 failed
./libibnetdisc/ibnetdisc.c:798; can't open MAD port ((null):0)
/usr/sbin/ibnetdiscover: iberror: failed: discover failed
ibwarn: [1800] mad_rpc_open_port: client_register for mgmt 1 failed
./libibnetdisc/ibnetdisc.c:798; can't open MAD port ((null):0)
/usr/sbin/ibnetdiscover: iberror: failed: discover failed

The InfiniBand network does not work: OpenSM exits right after failing to register management class 0x81 (129), and ibnodes cannot open the MAD port either.

If I reboot with the previous Linux kernel version, everything works again.

I cannot understand what's going on. Is there any important change in the Linux kernel that OpenSM needs to adapt to? Or is this a bug in the newer Linux kernel version (that needs to be fixed there)?

Please note that the other cluster nodes can run the newest version (6.11.2-1) of the Linux kernel and connect to the InfiniBand network, as long as the node which runs OpenSM is using the previous Linux kernel version.
Hence, it does not seem that Linux kernel version 6.11.2-1 broke its support for (mlx5) InfiniBand networks: it's just that opensm/3.3.23-3 and linux-image-6.11.2-amd64/6.11.2-1 don't seem to work together...

Please investigate this bug and fix it and/or forward my bug report to the OpenSM upstream developers, as appropriate. Or, if the bug is in the Linux kernel, please forward the bug report to the Linux kernel upstream developers.

Thanks for your time and dedication!


-- System Information:
Debian Release: trixie/sid
  APT prefers testing
  APT policy: (500, 'testing')
Architecture: amd64 (x86_64)

Kernel: Linux 6.10.11-amd64 (SMP w/16 CPU threads; PREEMPT)
Locale: LANG=C, LC_CTYPE=en_US.utf8 (charmap=UTF-8), LANGUAGE not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
LSM: AppArmor: enabled

Versions of packages opensm depends on:
ii  infiniband-diags     52.0-2
ii  init-system-helpers  1.67
ii  libc6                2.40-3
ii  libopensm9           3.3.23-3
ii  libosmcomp5          3.3.23-3
ii  libosmvendor5        3.3.23-3
ii  libwrap0             7.6.q-33

opensm recommends no packages.

opensm suggests no packages.

-- Configuration Files:
/etc/default/opensm changed:
PORTS="0x9c63c00300033240"


-- no debconf information
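
In case it helps with triage, the ERR entries of the two log snippets above can be listed side by side with a short script. This is only a minimal sketch: it assumes the line layout seen in the snippets (log level 0x01, then the function name, then "ERR <4 hex digits>:"), and the names ERR_RE, extract_errors and SAMPLE are made up for illustration, they are not part of OpenSM.

```python
import re

# Assumed layout, taken from the snippets in this report:
#   <counter> [<thread id>] 0x01 -> <function>: ERR <code>: <message>
ERR_RE = re.compile(r"0x01 -> (\w+): ERR ([0-9A-F]{4}): (.*)")


def extract_errors(log_text):
    """Return a list of (function, code, message) for every ERR entry."""
    errors = []
    for line in log_text.splitlines():
        match = ERR_RE.search(line)
        if match:
            errors.append((match.group(1), match.group(2), match.group(3).strip()))
    return errors


# A few lines copied from the log written under kernel 6.11.2-1.
SAMPLE = """\
140711 [F0D43740] 0x02 -> osm_vendor_bind: Mgmt class 0x81 binding to port GUID 0x9c63c00300033240
165028 [F0D43740] 0x01 -> osm_vendor_bind: ERR 5426: Unable to register class 129 version 1
165157 [F0D43740] 0x01 -> osm_sm_mad_ctrl_bind: ERR 3118: Vendor specific bind failed
165630 [F0D43740] 0x80 -> Exiting SM
"""

if __name__ == "__main__":
    for func, code, message in extract_errors(SAMPLE):
        print(f"{code} {func}: {message}")
```

Running the same function over the full log from each kernel would show that ERR 0F04 appears only in the working boot, while ERR 5426 is the first error in the failing one.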