The latest firmware for the HGST HUH721010AL5200 is 'LS21' [1], released in 
2021. Your drives have likely been running this firmware for years, so the 
firmware is probably not the source of the issue.

Frédéric.

[1] 
https://www.dell.com/support/home/en-us/drivers/driversdetails?driverId=MGW91&lwp=rt

----- On 12 May 25, at 13:48, Janek Bevendorff [email protected] wrote:

> I tried to install both firmware packages on one of the nodes, but they're
> not compatible. Most of our disks are HGST HUH721010AL5200 10TB SAS disks.
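As an aside, one way to confirm a SAS drive's model and firmware revision from the OS is with smartmontools (a sketch; the device glob and installed tooling are assumptions, and SAS drives report these as "Product:" and "Revision:" in `smartctl -i` output):

```shell
# List vendor, model, and firmware revision for each SCSI/SAS disk.
# (Sketch: assumes smartmontools is installed and run with root privileges.)
for dev in /dev/sd?; do
  echo "== $dev =="
  smartctl -i "$dev" | awk -F': *' '/^(Vendor|Product|Revision)/ {print $1 ": " $2}'
done
```

This avoids rebooting into the iDRAC inventory just to read a firmware string.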
> 
> 
> On 12/05/2025 10:51, Frédéric Nass wrote:
>> Hi Janek,
>>
>> I just checked, and we upgraded both the HDD and SSD firmware to the
>> versions released last month.
>>
>> HDD firmware (DELL/Seagate 'ST16000NM006J'):
>> https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=xf65r&lwp=rt
>> NVMe firmware (DELL/Kioxia 'Dell Ent NVMe CM7 U.2 RI 1.92TB'):
>> https://www.dell.com/support/home/en-us/drivers/DriversDetails?driverID=VH1YP&lwp=rt
>>
>> What models are your drives?
>>
>> Regards,
>> Frédéric.
>>
>> ----- On 12 May 25, at 9:37, Janek Bevendorff [email protected] wrote:
>>
>>> Hi all,
>>>
>>> Kernel is 6.8.0 (Ubuntu). The thermal settings in iDRAC are already
>>> quite high and we have an overall good cooling system, so that shouldn't
>>> cause any issues. Our cold aisle is around 20°C.
>>>
>>> @Frédéric Do you have a link to the HDD firmware? I installed everything
>>> that's available in the Dell catalogue. Also, I don't know whether CPU
>>> usage is high during these lockups, since I cannot observe the host
>>> state when it happens. It's like the entire node goes down until either
>>> it recovers on its own or I do a hard reset.
>>>
>>> Janek
>>>
>>>
>>> On 09/05/2025 15:08, Anthony D'Atri wrote:
>>>> There are separate thermal, overall performance, and fan states in iDRAC.
>>>> I’ve found that I often have to bump up the default “fan offset” for more
>>>> cooling.
>>>>
>>>>> On May 9, 2025, at 8:51 AM, Frédéric Nass <[email protected]>
>>>>> wrote:
>>>>>
>>>>> Hi Janek,
>>>>>
>>>>> We just had a very similar issue with recent hardware (DELL R760xd) going
>>>>> nuts (100% CPU load) for 10 to 20 minutes and OSDs being reported 'down'
>>>>> as not responding in time.
>>>>>
>>>>> Switching the CPU profile to HPC (High Performance Computing) and the
>>>>> thermal settings to Maximum Performance (or is it Optimized?) in the
>>>>> BIOS, and upgrading the HDD firmware to the latest one that was only
>>>>> available from DELL's website (not yet in the OpenManage catalog) fixed
>>>>> it.
>>>>>
>>>>> Maybe you can give it a try.
>>>>>
>>>>> Regards,
>>>>> Frédéric.
>>>>>
>>>>> ----- Le 9 Mai 25, à 9:02, Janek Bevendorff 
>>>>> [email protected] a
>>>>> écrit :
>>>>>
>>>>>> Hi, it's happening again. I haven't fully upgraded the firmware on all
>>>>>> hosts yet, but I have on all the MDS hosts. I managed to finish the
>>>>>> Ceph upgrade, but now I'm randomly getting the soft lockups again
>>>>>> (mostly, but not only) on the MDS hosts.
>>>>>>
>>>>>> Anything else I could check for?
>>>>>>
>>>>>> Janek
>>>>>>
>>>>>>
>>>>>> On 16/04/2025 17:38, Janek Bevendorff wrote:
>>>>>>> Yes, we have a mirror of the Dell Firmware catalogue, so the servers
>>>>>>> can check what they need. There are three updates in total: BIOS, NIC,
>>>>>>> and Lifecycle Controller.
>>>>>>>
>>>>>>> I hope the BIOS update fixes this.
>>>>>>>
>>>>>>>
>>>>>>> On 16/04/2025 17:16, Anthony D'Atri wrote:
>>>>>>>> Ack, I know the R730xd very well, mostly running Trusty and Luminous
>>>>>>>> at the time.  BIOS updates inherently require a reboot.  Check for
>>>>>>>> CPLD/SPLD as well; that changes very rarely, but ISTR that this model
>>>>>>>> had at least one update after FCS.
>>>>>>>>
>>>>>>>>
>>>>>>>>> The servers our Ceph runs on are all R730xd machines.
>>>>>>>>>
>>>>>>>>> I checked the Dell repository manager and it looks like there is at
>>>>>>>>> least one BIOS update that's newer than what we've already
>>>>>>>>> installed, so I've updated our Firmware repository and will schedule
>>>>>>>>> the updates now. That's going to take a long while.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 16/04/2025 16:16, Anthony D'Atri wrote:
>>>>>>>>>> For whatever reason, in recent years I’ve seen these more often
>>>>>>>>>> with Dells than other systems. My first thought was that maybe you
>>>>>>>>>> were running an ancient kernel, but then I saw that you aren’t.  Is
>>>>>>>>>> the kernel you’re running the stock one that comes with your
>>>>>>>>>> distribution?  I’ve seen CPU reset events on R750s running an
>>>>>>>>>> elrepo kernel.
>>>>>>>>>>
>>>>>>>>>> I suspect that some code change may have tickled a latent issue
>>>>>>>>>> that perhaps you were fortunate to have not previously run into,
>>>>>>>>>> but this is entirely speculation.
>>>>>>>>>>
>>>>>>>>>>> On Apr 16, 2025, at 9:39 AM, Janek Bevendorff
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Yes, they are older Dell PowerEdges. I have to check whether
>>>>>>>>>>> there's newer firmware, but we've been running Ceph for years
>>>>>>>>>>> without these problems.
>>>>>>>>>>>
>>>>>>>>>>> I checked the logs on the host on which I had a lockup just an
>>>>>>>>>>> hour ago, but there's nothing besides the expected hard-reset
>>>>>>>>>>> messages. There are two older watchdog messages, but they are from
>>>>>>>>>>> March:
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> SeqNumber       = 2089
>>>>>>>>>>> Message ID      = ASR0000
>>>>>>>>>>> Category        = System
>>>>>>>>>>> AgentID         = SEL
>>>>>>>>>>> Severity        = Critical
>>>>>>>>>>> Timestamp       = 2025-03-27 07:16:03
>>>>>>>>>>> Message         = The watchdog timer expired.
>>>>>>>>>>> RawEventData    =
>>>>>>>>>>> 0x03,0x00,0x02,0x33,0xFB,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
>>>>>>>>>>>
>>>>>>>>>>> FQDD            = WatchdogTimer.iDRAC.1
>>>>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> SeqNumber       = 2088
>>>>>>>>>>> Message ID      = ASR0000
>>>>>>>>>>> Category        = System
>>>>>>>>>>> AgentID         = SEL
>>>>>>>>>>> Severity        = Critical
>>>>>>>>>>> Timestamp       = 2025-03-27 07:06:41
>>>>>>>>>>> Message         = The watchdog timer expired.
>>>>>>>>>>> RawEventData    =
>>>>>>>>>>> 0x02,0x00,0x02,0x01,0xF9,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
>>>>>>>>>>>
>>>>>>>>>>> FQDD            = WatchdogTimer.iDRAC.1
>>>>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I grepped the logs of another host where it happened, but couldn't
>>>>>>>>>>> find any watchdog messages there. I believe it's also unlikely
>>>>>>>>>>> that suddenly all MDS hosts (we have five active, five hot
>>>>>>>>>>> standbys, and one cold standby) start having hardware issues. I
>>>>>>>>>>> also ran a memtest on one of the hosts last week and couldn't find
>>>>>>>>>>> anything there either.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On 16/04/2025 15:14, Anthony D'Atri wrote:
>>>>>>>>>>>> Curious, are your systems Dells? If so you might see some
>>>>>>>>>>>> improvement from running DSU to update all the firmware.  It
>>>>>>>>>>>> might also be illuminating to run `racadm lclog view`.
>>>>>>>>>>>>
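A minimal sketch of that Lifecycle-log check, assuming the Dell `racadm` CLI is installed on the host; `racadm lclog view` is the subcommand named above, while the grep filter for the ASR0000 watchdog records (the message ID seen elsewhere in this thread) is my own addition:

```shell
# Dump the iDRAC Lifecycle log and pull out watchdog-expiration records
# with a little surrounding context. (Sketch: ASR0000 is the message ID
# used for "The watchdog timer expired" events.)
racadm lclog view | grep -B 2 -A 3 'ASR0000'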
>>>>>>>>>>>>> On Apr 16, 2025, at 8:32 AM, Janek Bevendorff
>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Since the latest Reef update I have the problem that some of my
>>>>>>>>>>>>> hosts suddenly go into a state where all CPUs are stuck in
>>>>>>>>>>>>> kernel mode causing all daemons on that host to become
>>>>>>>>>>>>> unresponsive. When I connect to the IPMI console, I see a lot of
>>>>>>>>>>>>> messages like:
>>>>>>>>>>>>>
>>>>>>>>>>>>> watchdog: BUG: soft lockup - CPU#8 stuck for 47s! [cron:868840]
>>>>>>>>>>>>>
>>>>>>>>>>>>> (it's basically a list of all processes running on the machine).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Usually, this resolves itself after several minutes, but
>>>>>>>>>>>>> sometimes I have to hard-reset the host. When this happens, all
>>>>>>>>>>>>> daemons are marked as down and I cannot interact with the host
>>>>>>>>>>>>> at all. I don't know what causes this, but I think it happens
>>>>>>>>>>>>> primarily on the hosts where my MDS run, and it seems to be
>>>>>>>>>>>>> triggered by events such as cluster rebalances, MDS restarts, or
>>>>>>>>>>>>> just randomly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I found a few reports about similar issues on the bug tracker
>>>>>>>>>>>>> and mailing list, but they are all very unspecific, unanswered,
>>>>>>>>>>>>> or more than 6 years old.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there any way I can debug this? I upgraded to Squid already,
>>>>>>>>>>>>> but that didn't solve the problem. I also had massive issues
>>>>>>>>>>>>> with this during the upgrade. Particularly at the end when the
>>>>>>>>>>>>> MDS were upgraded, I had constant struggles with it. I had to
>>>>>>>>>>>>> set the noout flag and then literally sit next to it to resume
>>>>>>>>>>>>> the upgrade every few minutes until it finally went through,
>>>>>>>>>>>>> because random MDS hosts went intermittently dark all the time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> All hosts run Ubuntu 22.04 with kernel 6.8.0.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any ideas? Thanks!
>>>>>>>>>>>>> Janek
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> ceph-users mailing list -- [email protected]
>>>>>>>>>>>>> To unsubscribe send an email to [email protected]
>>>>>>>>> --
>>>>>>>>> Bauhaus-Universität Weimar
>>>>>>>>> Bauhausstr. 9a, R308
>>>>>>>>> 99423 Weimar, Germany
>>>>>>>>>
>>>>>>>>> Phone: +49 3643 58 3577
>>>>>>>>> www.webis.de
>>>>>>>>>
>>>>>>>>>