I have a DL380 G7 which I got for free a few months ago and set up in my home lab. I installed Debian 12.5 and got lots of errors in the logs and random freezes a few times a day. While investigating, I came across posts saying that the problem was hpwdt and that it should be blacklisted. Since I did this, the server has been an absolute beauty with no issues at all.
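For anyone wanting to confirm the same state on their own box, here is a minimal sketch in Python (not part of any Debian or HPE tooling) that checks whether hpwdt is still loaded after blacklisting. It assumes the /proc/modules layout, where the module name is the first whitespace-separated field on each line; the helper names loaded_modules and is_loaded are made up for illustration.

```python
from pathlib import Path


def loaded_modules(text: str) -> set[str]:
    """Return module names found in /proc/modules- or lsmod-style text."""
    names = set()
    for line in text.splitlines():
        fields = line.split()
        # Skip blank lines and the "Module Size Used by" header that lsmod prints.
        if not fields or fields[0] == "Module":
            continue
        names.add(fields[0])
    return names


def is_loaded(module: str, text: str) -> bool:
    """True if the named module appears in the given module listing."""
    return module in loaded_modules(text)


if __name__ == "__main__":
    proc_modules = Path("/proc/modules")
    if proc_modules.exists():
        listing = proc_modules.read_text()
        for name in ("hpwdt", "iTCO_wdt", "watchdog"):
            state = "loaded" if is_loaded(name, listing) else "not loaded"
            print(f"{name}: {state}")
```

Run against the lsmod listing in this message, a check like this would report iTCO_wdt and watchdog as loaded and hpwdt as not loaded.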
Happy to run tests for you on weekends, but although I am not a noob on USING Debian (sys admin here), I have no idea of kernel and modules programming, so you may need to tell me exactly what to do to collect data for you.

Cheers
Marcos

inxi

CPU: 2x 6-core Intel Xeon X5680 (-MT MCP SMP-) speed/min/max: 2487/1596/3326 MHz
Kernel: 6.10.6+bpo-amd64 x86_64 Up: 5d 16h 55m Mem: 35.15/188.88 GiB (18.6%)
Storage: 34.83 TiB (3.5% used) Procs: 422 Shell: Bash inxi: 3.3.36

uname -a

Linux Earth2 6.10.6+bpo-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.10.6-1~bpo12+1 (2024-08-26) x86_64 GNU/Linux

lsmod

Module                  Size  Used by
cpuid                  12288  0
vhost_net              36864  5
vhost                  65536  1 vhost_net
vhost_iotlb            16384  1 vhost
tap                    32768  1 vhost_net
tun                    69632  13 vhost_net
bridge                389120  0
stp                    12288  1 bridge
llc                    16384  2 bridge,stp
rfkill                 40960  1
qrtr                   53248  2
cpufreq_powersave      16384  0
amdgpu              12939264  0
amdxcp                 12288  1 amdgpu
drm_exec               12288  1 amdgpu
binfmt_misc            28672  1
gpu_sched              65536  1 amdgpu
drm_buddy              20480  1 amdgpu
ipmi_ssif              45056  0
radeon               1888256  1
intel_powerclamp       16384  0
kvm_intel             413696  32
drm_suballoc_helper    12288  2 amdgpu,radeon
drm_display_helper    266240  2 amdgpu,radeon
kvm                  1343488  21 kvm_intel
cec                    69632  1 drm_display_helper
rc_core                73728  1 cec
drm_ttm_helper         12288  2 amdgpu,radeon
ttm                   102400  3 amdgpu,radeon,drm_ttm_helper
ghash_clmulni_intel    16384  0
drm_kms_helper        253952  3 drm_display_helper,amdgpu,radeon
sha512_ssse3           45056  0
sha256_ssse3           32768  0
sha1_ssse3             32768  0
i2c_algo_bit           12288  2 amdgpu,radeon
video                  77824  2 amdgpu,radeon
wmi                    28672  1 video
aesni_intel           364544  0
crypto_simd            16384  1 aesni_intel
cryptd                 28672  2 crypto_simd,ghash_clmulni_intel
sg                     45056  0
hpilo                  20480  0
joydev                 24576  0
intel_cstate           24576  0
serio_raw              16384  0
evdev                  28672  7
pcspkr                 12288  0
ipmi_si                86016  1
intel_uncore          258048  0
iTCO_wdt               12288  0
intel_pmc_bxt          16384  1 iTCO_wdt
i7core_edac            32768  0
iTCO_vendor_support    12288  1 iTCO_wdt
watchdog               49152  1 iTCO_wdt
acpi_power_meter       24576  0
acpi_cpufreq           32768  0
acpi_ipmi              20480  1 acpi_power_meter
ipmi_devintf           16384  0
ipmi_msghandler        86016  4 ipmi_devintf,ipmi_si,acpi_ipmi,ipmi_ssif
button                 24576  0
scsi_dh_alua           24576  1
dm_service_time        12288  0
dm_multipath           45056  1 dm_service_time
coretemp               16384  0
drm                   749568  12 gpu_sched,drm_kms_helper,drm_exec,drm_suballoc_helper,drm_display_helper,drm_buddy,amdgpu,radeon,drm_ttm_helper,ttm,amdxcp
msr                    12288  0
efi_pstore             12288  0
loop                   40960  0
configfs               69632  1
ip_tables              28672  0
x_tables               53248  1 ip_tables
autofs4                57344  2
ext4                 1130496  7
crc16                  12288  1 ext4
mbcache                16384  1 ext4
jbd2                  196608  1 ext4
efivarfs               28672  0
raid10                 73728  0
raid456               196608  0
async_raid6_recov      20480  1 raid456
async_memcpy           16384  2 raid456,async_raid6_recov
async_pq               16384  2 raid456,async_raid6_recov
async_xor              16384  3 async_pq,raid456,async_raid6_recov
async_tx               16384  5 async_pq,async_memcpy,async_xor,raid456,async_raid6_recov
xor                    20480  1 async_xor
raid6_pq              122880  3 async_pq,raid456,async_raid6_recov
libcrc32c              12288  1 raid456
crc32c_generic         12288  0
raid1                  61440  0
raid0                  24576  0
md_mod                225280  4 raid1,raid10,raid0,raid456
dm_mod                208896  25 dm_multipath
hid_generic            12288  0
usbhid                 77824  0
hid                   253952  2 usbhid,hid_generic
qla2xxx              1171456  2
sd_mod                 81920  8
nvme_fc                53248  1 qla2xxx
nvme_fabrics           32768  1 nvme_fc
nvme_core             192512  2 nvme_fc,nvme_fabrics
t10_pi                 20480  2 sd_mod,nvme_core
uhci_hcd               61440  0
crc64_rocksoft         16384  1 t10_pi
ehci_pci               16384  0
crc64                  16384  1 crc64_rocksoft
hpsa                  122880  6
ehci_hcd              110592  1 ehci_pci
crc_t10dif             16384  1 t10_pi
crct10dif_generic      12288  0
scsi_transport_fc     102400  1 qla2xxx
scsi_transport_sas     57344  1 hpsa
usbcore               401408  4 ehci_pci,usbhid,ehci_hcd,uhci_hcd
psmouse               208896  0
scsi_mod              319488  8 scsi_transport_sas,sd_mod,dm_multipath,qla2xxx,scsi_dh_alua,scsi_transport_fc,hpsa,sg
crct10dif_pclmul       12288  1
crc32_pclmul           12288  0
crc32c_intel           16384  14
bnx2                  118784  0
lpc_ich                28672  0
usb_common             16384  3 usbcore,ehci_hcd,uhci_hcd
crct10dif_common       12288  3 crct10dif_generic,crc_t10dif,crct10dif_pclmul
scsi_common            16384  5 scsi_mod,sd_mod,qla2xxx,hpsa,sg

lsblk

NAME                     MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                        8:0    0 273.4G  0 disk
├─sda1                     8:1    0   487M  0 part /boot
├─sda2                     8:2    0     1K  0 part
└─sda5                     8:5    0 272.9G  0 part
  ├─Earth2--vg-root      254:0    0  43.3G  0 lvm  /
  ├─Earth2--vg-var       254:1    0   9.3G  0 lvm  /var
  ├─Earth2--vg-swap_1    254:2    0   976M  0 lvm  [SWAP]
  ├─Earth2--vg-tmp       254:3    0   1.9G  0 lvm  /tmp
  └─Earth2--vg-home      254:4    0 169.5G  0 lvm  /home
sdb                        8:16   0  17.3T  0 disk
└─sdb1                     8:17   0  17.3T  0 part
sdc                        8:32   0  17.3T  0 disk
└─sdc1                     8:33   0  17.3T  0 part
  ├─Oort-VMDisks         254:5    0     7T  0 lvm  /Oort/VMDisks
  └─Oort-NextcloudDisk   254:6    0     5T  0 lvm  /Oort/NextcloudDisk

On Thu, 10 Oct 2024 at 12:44, Jerry Hoemann <jerry.hoem...@hpe.com> wrote:

> On Wed, Oct 09, 2024 at 09:00:00PM +0200, Ben Hutchings wrote:
> > Hi Jerry,
> >
> > The Debian kernel team received a number of reports over the past few
> > years of instability of the Proliant DL380 G7 and DL380p G8, seemingly
> > related to the hpwdt driver (in that this goes away if it is not
> > loaded). These reports can be seen at
> > <https://bugs.debian.org/898336>.
> >
> > The instability has been seen with kernel versions ranging from 4.16 to
> > 6.1.y, including after the backport of commit dced0b3e51dd
> > "watchdog/hpwdt: Only claim UNKNOWN NMI if from iLO").
> >
> > I can see that hpwdt seems to be used for error reporting so it's not
> > clear to me whether these are problems caused by the driver, or the
> > driver is only reporting that something bad happened.
> >
> > Do you have any ideas about what's going wrong here? Is there
> > something odd about these models that needs to be handled in hpwdt, or
> > are they just popular models?
>
> Hi Ben,
>
> There are a couple things that come to mind.
>
> As you mentioned, hpwdt is used for error containment on ProLiants.
> (Especially on the older generations) Errors would be raised as
> NMI and the expectation was that hpwdt would handle the NMI and
> initiate a kdump. I have seen cases where shutting down file
> systems can raise PCIe errors which would be transmitted to the
> SUT as NMI and handled by hpwdt.
>
> The second issue is that systemd enables WDT (not just hpwdt) during
> shutdown. This is to handle the case where shutdown hangs. The WDT
> is supposed to break the system out of such situations. The default
> timeout is 10 minutes:
>     /etc/systemd/system.conf:
>         #RebootWatchdogSec=10min
> (note, I'm not a Debian user, but i believe the systemd behavior is the
> same on Debian as it is on rhel/sles.)
>
> While a ten minute delay to shutdown would be fairly obvious if you're
> doing interactive testing, it might not be noticed if the testing is
> automated.
>
> To determine if either of the above is happening, you can:
>
> o) do the testing interactively and time the test. Does the NMI come in
>    roughly 10 minutes after the shutdown?
>
> o) Check the IEL and IML on the iLO web interface. Do you see any
>    errors reported during the shutdown?
>
>
> Questions:
> 1) The Debian bug above mentions only Gen 7 and 8 systems.
>    Are you seeing this issue on other ProLiant systems?
>
> 2) You mentioned back-porting commit dced0b3e51dd. Does your
>    drivers/watchdog/hpwdt.c source match upstream Linux? Or
>    do you cherry pick patches? (sorry, not knowing Debian,
>    I don't know how find/navigate your kernel source.)
>
> Please let me know what you find.
>
>
> Jerry
>
>
> --
>
>
> -----------------------------------------------------------------------------
> Jerry Hoemann Software Engineer Hewlett Packard Enterprise
>
> -----------------------------------------------------------------------------
>

--
Marcos R Carot