I have a DL380 G7 which I got for free a few months ago and set up in my
home lab. I installed Debian 12.5 and got lots of errors in the logs and
random freezes a few times a day.
While investigating, I came across posts saying that the problem was hpwdt
and that the fix was to blacklist it. Since I did that, this server has been
an absolute beauty with no issues at all.
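
A minimal sketch of how such a blacklist can be verified. It assumes the usual modprobe convention of a "blacklist hpwdt" line in a .conf file under /etc/modprobe.d (the posts only say "blacklist it", so the exact mechanism is an assumption). The helper names is_blacklisted and is_loaded are hypothetical, not part of any tool.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical; paths default to the
# conventional locations but can be overridden for testing.

# Succeeds if a "blacklist <module>" line exists in any .conf file in
# the given directory (default: /etc/modprobe.d). Commented-out lines
# do not match because the pattern is anchored at the start of the line.
is_blacklisted() {
    module=$1
    dir=${2:-/etc/modprobe.d}
    grep -Eqs "^[[:space:]]*blacklist[[:space:]]+${module}[[:space:]]*$" "$dir"/*.conf
}

# Succeeds if the module appears in a /proc/modules-style listing,
# where each line starts with the module name followed by a space.
is_loaded() {
    module=$1
    modules_file=${2:-/proc/modules}
    grep -qs "^${module} " "$modules_file"
}
```

With these defined, `is_blacklisted hpwdt && ! is_loaded hpwdt && echo "hpwdt is out of the picture"` would confirm both that the blacklist entry exists and that the module is not currently loaded. This agrees with the lsmod output below, where hpwdt does not appear.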

Happy to run tests for you on weekends. Although I am not a noob at
USING Debian (sysadmin here), I have no experience with kernel or module
programming, so you may need to tell me exactly what to do to collect data
for you.

Cheers
Marcos

inxi
CPU: 2x 6-core Intel Xeon X5680 (-MT MCP SMP-) speed/min/max: 2487/1596/3326 MHz
Kernel: 6.10.6+bpo-amd64 x86_64 Up: 5d 16h 55m Mem: 35.15/188.88 GiB (18.6%)
Storage: 34.83 TiB (3.5% used) Procs: 422 Shell: Bash inxi: 3.3.36

uname -a
Linux Earth2 6.10.6+bpo-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.10.6-1~bpo12+1 (2024-08-26) x86_64 GNU/Linux

lsmod
Module                  Size  Used by
cpuid                  12288  0
vhost_net              36864  5
vhost                  65536  1 vhost_net
vhost_iotlb            16384  1 vhost
tap                    32768  1 vhost_net
tun                    69632  13 vhost_net
bridge                389120  0
stp                    12288  1 bridge
llc                    16384  2 bridge,stp
rfkill                 40960  1
qrtr                   53248  2
cpufreq_powersave      16384  0
amdgpu              12939264  0
amdxcp                 12288  1 amdgpu
drm_exec               12288  1 amdgpu
binfmt_misc            28672  1
gpu_sched              65536  1 amdgpu
drm_buddy              20480  1 amdgpu
ipmi_ssif              45056  0
radeon               1888256  1
intel_powerclamp       16384  0
kvm_intel             413696  32
drm_suballoc_helper    12288  2 amdgpu,radeon
drm_display_helper    266240  2 amdgpu,radeon
kvm                  1343488  21 kvm_intel
cec                    69632  1 drm_display_helper
rc_core                73728  1 cec
drm_ttm_helper         12288  2 amdgpu,radeon
ttm                   102400  3 amdgpu,radeon,drm_ttm_helper
ghash_clmulni_intel    16384  0
drm_kms_helper        253952  3 drm_display_helper,amdgpu,radeon
sha512_ssse3           45056  0
sha256_ssse3           32768  0
sha1_ssse3             32768  0
i2c_algo_bit           12288  2 amdgpu,radeon
video                  77824  2 amdgpu,radeon
wmi                    28672  1 video
aesni_intel           364544  0
crypto_simd            16384  1 aesni_intel
cryptd                 28672  2 crypto_simd,ghash_clmulni_intel
sg                     45056  0
hpilo                  20480  0
joydev                 24576  0
intel_cstate           24576  0
serio_raw              16384  0
evdev                  28672  7
pcspkr                 12288  0
ipmi_si                86016  1
intel_uncore          258048  0
iTCO_wdt               12288  0
intel_pmc_bxt          16384  1 iTCO_wdt
i7core_edac            32768  0
iTCO_vendor_support    12288  1 iTCO_wdt
watchdog               49152  1 iTCO_wdt
acpi_power_meter       24576  0
acpi_cpufreq           32768  0
acpi_ipmi              20480  1 acpi_power_meter
ipmi_devintf           16384  0
ipmi_msghandler        86016  4 ipmi_devintf,ipmi_si,acpi_ipmi,ipmi_ssif
button                 24576  0
scsi_dh_alua           24576  1
dm_service_time        12288  0
dm_multipath           45056  1 dm_service_time
coretemp               16384  0
drm                   749568  12 gpu_sched,drm_kms_helper,drm_exec,drm_suballoc_helper,drm_display_helper,drm_buddy,amdgpu,radeon,drm_ttm_helper,ttm,amdxcp
msr                    12288  0
efi_pstore             12288  0
loop                   40960  0
configfs               69632  1
ip_tables              28672  0
x_tables               53248  1 ip_tables
autofs4                57344  2
ext4                 1130496  7
crc16                  12288  1 ext4
mbcache                16384  1 ext4
jbd2                  196608  1 ext4
efivarfs               28672  0
raid10                 73728  0
raid456               196608  0
async_raid6_recov      20480  1 raid456
async_memcpy           16384  2 raid456,async_raid6_recov
async_pq               16384  2 raid456,async_raid6_recov
async_xor              16384  3 async_pq,raid456,async_raid6_recov
async_tx               16384  5 async_pq,async_memcpy,async_xor,raid456,async_raid6_recov
xor                    20480  1 async_xor
raid6_pq              122880  3 async_pq,raid456,async_raid6_recov
libcrc32c              12288  1 raid456
crc32c_generic         12288  0
raid1                  61440  0
raid0                  24576  0
md_mod                225280  4 raid1,raid10,raid0,raid456
dm_mod                208896  25 dm_multipath
hid_generic            12288  0
usbhid                 77824  0
hid                   253952  2 usbhid,hid_generic
qla2xxx              1171456  2
sd_mod                 81920  8
nvme_fc                53248  1 qla2xxx
nvme_fabrics           32768  1 nvme_fc
nvme_core             192512  2 nvme_fc,nvme_fabrics
t10_pi                 20480  2 sd_mod,nvme_core
uhci_hcd               61440  0
crc64_rocksoft         16384  1 t10_pi
ehci_pci               16384  0
crc64                  16384  1 crc64_rocksoft
hpsa                  122880  6
ehci_hcd              110592  1 ehci_pci
crc_t10dif             16384  1 t10_pi
crct10dif_generic      12288  0
scsi_transport_fc     102400  1 qla2xxx
scsi_transport_sas     57344  1 hpsa
usbcore               401408  4 ehci_pci,usbhid,ehci_hcd,uhci_hcd
psmouse               208896  0
scsi_mod              319488  8 scsi_transport_sas,sd_mod,dm_multipath,qla2xxx,scsi_dh_alua,scsi_transport_fc,hpsa,sg
crct10dif_pclmul       12288  1
crc32_pclmul           12288  0
crc32c_intel           16384  14
bnx2                  118784  0
lpc_ich                28672  0
usb_common             16384  3 usbcore,ehci_hcd,uhci_hcd
crct10dif_common       12288  3 crct10dif_generic,crc_t10dif,crct10dif_pclmul
scsi_common            16384  5 scsi_mod,sd_mod,qla2xxx,hpsa,sg


lsblk
NAME                   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda                      8:0    0 273.4G  0 disk
├─sda1                   8:1    0   487M  0 part /boot
├─sda2                   8:2    0     1K  0 part
└─sda5                   8:5    0 272.9G  0 part
  ├─Earth2--vg-root    254:0    0  43.3G  0 lvm  /
  ├─Earth2--vg-var     254:1    0   9.3G  0 lvm  /var
  ├─Earth2--vg-swap_1  254:2    0   976M  0 lvm  [SWAP]
  ├─Earth2--vg-tmp     254:3    0   1.9G  0 lvm  /tmp
  └─Earth2--vg-home    254:4    0 169.5G  0 lvm  /home
sdb                      8:16   0  17.3T  0 disk
└─sdb1                   8:17   0  17.3T  0 part
sdc                      8:32   0  17.3T  0 disk
└─sdc1                   8:33   0  17.3T  0 part
  ├─Oort-VMDisks       254:5    0     7T  0 lvm  /Oort/VMDisks
  └─Oort-NextcloudDisk 254:6    0     5T  0 lvm  /Oort/NextcloudDisk


On Thu, 10 Oct 2024 at 12:44, Jerry Hoemann <jerry.hoem...@hpe.com> wrote:

> On Wed, Oct 09, 2024 at 09:00:00PM +0200, Ben Hutchings wrote:
> > Hi Jerry,
> >
> > The Debian kernel team received a number of reports over the past few
> > years of instability of the Proliant DL380 G7 and DL380p G8, seemingly
> > related to the hpwdt driver (in that this goes away if it is not
> > loaded).  These reports can be seen at
> > <https://bugs.debian.org/898336>.
> >
> > The instability has been seen with kernel versions ranging from 4.16 to
> > 6.1.y, including after the backport of commit dced0b3e51dd
> > ("watchdog/hpwdt: Only claim UNKNOWN NMI if from iLO").
> >
> > I can see that hpwdt seems to be used for error reporting so it's not
> > clear to me whether these are problems caused by the driver, or the
> > driver is only reporting that something bad happened.
> >
> > Do you have any ideas about what's going wrong here?  Is there
> > something odd about these models that needs to be handled in hpwdt, or
> > are they just popular models?
>
> Hi Ben,
>
> There are a couple things that come to mind.
>
> As you mentioned, hpwdt is used for error containment on ProLiants.
> Especially on the older generations, errors would be raised as
> NMI and the expectation was that hpwdt would handle the NMI and
> initiate a kdump.  I have seen cases where shutting down file
> systems can raise PCIe errors which would be transmitted to the
> SUT as NMI and handled by hpwdt.
>
> The second issue is that systemd enables WDT (not just hpwdt) during
> shutdown.  This is to handle the case where shutdown hangs.  The WDT
> is supposed to break the system out of such situations.  The default
> timeout is 10 minutes:
>         /etc/systemd/system.conf:
>         #RebootWatchdogSec=10min
> (Note: I'm not a Debian user, but I believe the systemd behavior is the
> same on Debian as it is on RHEL/SLES.)
>
> While a ten minute delay to shutdown would be fairly obvious if you're
> doing interactive testing, it might not be noticed if the testing is
> automated.
>
> To determine if either of the above is happening, you can:
>
> o) do the testing interactively and time the test.  Does the NMI come in
> roughly 10 minutes after the shutdown?
>
> o) Check the IEL and IML on the iLO web interface.  Do you see any
> errors reported during the shutdown?
>
>
> Questions:
> 1) The Debian bug above mentions only Gen 7 and 8 systems.
>    Are you seeing this issue on other ProLiant systems?
>
> 2) You mentioned back-porting commit dced0b3e51dd.  Does your
>    drivers/watchdog/hpwdt.c source match upstream Linux? Or
>    do you cherry-pick patches?  (Sorry, not knowing Debian,
>    I don't know how to find/navigate your kernel source.)
>
> Please let me know what you find.
>
>
> Jerry
>
>
> --
>
>
> -----------------------------------------------------------------------------
> Jerry Hoemann                  Software Engineer   Hewlett Packard Enterprise
>
> -----------------------------------------------------------------------------
>
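
The systemd point quoted above can be checked with a small sketch like the following. It assumes the setting lives in /etc/systemd/system.conf, as in Jerry's example, and it deliberately ignores drop-in directories. The function name reboot_watchdog_setting is hypothetical. It prints the explicitly configured value, or "unset" when only the commented-out default line is present.

```shell
#!/bin/sh
# Sketch only: reads a single system.conf-style file and does not
# merge drop-ins; the function name is hypothetical.
reboot_watchdog_setting() {
    conf=${1:-/etc/systemd/system.conf}
    # Keep only uncommented assignments; if there are several, the
    # last one is reported.
    value=$(sed -n 's/^[[:space:]]*RebootWatchdogSec=//p' "$conf" 2>/dev/null | tail -n 1)
    if [ -n "$value" ]; then
        printf '%s\n' "$value"
    else
        printf 'unset\n'
    fi
}
```

If this prints "unset", the 10 minute default mentioned above would presumably apply, which is the interval to compare against when timing a shutdown. To see the merged configuration including drop-ins, `systemd-analyze cat-config systemd/system.conf` should be more complete than this sketch.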


-- 
Marcos R Carot
