[ceph-users] Re: "BUG: soft lockup" with MDS

Anthony D'Atri Fri, 09 May 2025 06:09:44 -0700

There are separate thermal, overall performance, and fan states in iDRAC.  I’ve 
found that I often have to bump up the default “fan offset” for more cooling.
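
To line up the kernel soft lockups with iDRAC events, the `racadm lclog view` records and the console message quoted further down in this thread can be parsed and compared by timestamp. The following is only a minimal sketch, assuming the `Key = Value` record layout and the dashed separators shown in the quoted log excerpt, and the "BUG: soft lockup" line format shown in the original report. The function names (`parse_lclog`, `watchdog_times`, `parse_soft_lockup`) are hypothetical helpers, not part of racadm or Ceph.

```python
import re
from datetime import datetime

# Matches console lines such as the one quoted in this thread:
#   watchdog: BUG: soft lockup - CPU#8 stuck for 47s! [cron:868840]
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]+):(?P<pid>\d+)\]"
)


def parse_soft_lockup(line):
    """Return CPU, stall duration, process name and PID, or None if no match."""
    m = SOFT_LOCKUP.search(line)
    if m is None:
        return None
    return {
        "cpu": int(m["cpu"]),
        "seconds": int(m["secs"]),
        "comm": m["comm"],
        "pid": int(m["pid"]),
    }


def parse_lclog(text):
    """Split lclog-style text into a list of dicts, one per record.

    Assumes records are separated by lines made only of dashes and that
    a line without '=' continues the previous value (as with the wrapped
    RawEventData field in the quoted excerpt).
    """
    records, current, last_key = [], {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if set(line) == {"-"}:
            # Separator line: close the record collected so far.
            if current:
                records.append(current)
            current, last_key = {}, None
        elif "=" in line:
            key, _, value = line.partition("=")
            last_key = key.strip()
            current[last_key] = value.strip()
        elif last_key is not None:
            # Wrapped continuation of the previous field.
            current[last_key] += line
    if current:
        records.append(current)
    return records


def watchdog_times(records, message_id="ASR0000"):
    """Return the timestamps of all records carrying the given Message ID."""
    return [
        datetime.strptime(r["Timestamp"], "%Y-%m-%d %H:%M:%S")
        for r in records
        if r.get("Message ID") == message_id and "Timestamp" in r
    ]
```

Feeding the saved lclog output through `parse_lclog` and `watchdog_times`, and the console or kernel log lines through `parse_soft_lockup`, gives two lists that can be compared by time to see whether watchdog expirations coincide with the lockups on a given host.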


> On May 9, 2025, at 8:51 AM, Frédéric Nass <[email protected]> 
> wrote:
> 
> Hi Janek,
> 
> We just had a very similar issue with recent hardware (DELL R760xd) going 
> nuts (100% CPU load) for like 10 to 20 minutes and OSDs being reported 'down' 
> as not responding in time.
> 
> Switching the CPU profile to HPC (High Performance Computing) and the thermal 
> settings to Maximum Performance (or is it Optimized?) in BIOS, and upgrading 
> HDD firmware to the latest one that was only available from DELL's website 
> (not yet in OpenManage catalog) fixed it.
> 
> Maybe you can give it a try.
> 
> Regards,
> Frédéric.
> 
> ----- On 9 May 25, at 9:02, Janek Bevendorff [email protected] wrote:
> 
>> Hi, it's happening again. I haven't fully upgraded the firmware on all
>> hosts yet, but at least on all MDS. I managed to finish the Ceph
>> upgrade, but now I'm randomly getting the soft lockups again (mostly,
>> but not only) on the MDS hosts.
>> 
>> Anything else I could check for?
>> 
>> Janek
>> 
>> 
>> On 16/04/2025 17:38, Janek Bevendorff wrote:
>>> Yes, we have a mirror of the Dell Firmware catalogue, so the servers
>>> can check what they need. There are three updates in total: BIOS, NIC,
>>> and Lifecycle Controller.
>>> 
>>> I hope the BIOS update fixes this.
>>> 
>>> 
>>> On 16/04/2025 17:16, Anthony D'Atri wrote:
>>>> Ack, I know the R730xd very well, mostly running Trusty and Luminous
>>>> at the time.  BIOS updates inherently require a reboot.  Check for
>>>> CPLD/SPLD as well, that changes very rarely but ISTR that this model
>>>> had at least one update after FCS.
>>>> 
>>>> 
>>>>> The servers our Ceph runs on are all R730xd machines.
>>>>> 
>>>>> I checked the Dell repository manager and it looks like there is at
>>>>> least one BIOS update that's newer than what we've already
>>>>> installed, so I've updated our Firmware repository and will schedule
>>>>> the updates now. That's going to take a long while.
>>>>> 
>>>>> 
>>>>> On 16/04/2025 16:16, Anthony D'Atri wrote:
>>>>>> For whatever reason, in recent years I’ve seen these more often
>>>>>> with Dells than other systems. My first thought was that maybe you
>>>>>> were running an ancient kernel, but then I saw that you aren’t.  Is
>>>>>> the kernel you’re running the stock one that comes with your
>>>>>> distribution?  I’ve seen CPU reset events on R750s running an
>>>>>> elrepo kernel.
>>>>>> 
>>>>>> I suspect that some code change may have tickled a latent issue
>>>>>> that perhaps you were fortunate to have not previously run into,
>>>>>> but this is entirely speculation.
>>>>>> 
>>>>>>> On Apr 16, 2025, at 9:39 AM, Janek Bevendorff
>>>>>>> <[email protected]> wrote:
>>>>>>> 
>>>>>>> Yes, they are older Dell PowerEdges. I have to check whether
>>>>>>> there's newer firmware, but we've been running Ceph for years
>>>>>>> without these problems.
>>>>>>> 
>>>>>>> I checked the logs on the host on which I had a lockup just an
>>>>>>> hour ago, but there's nothing besides the expected hard-reset
>>>>>>> messages. There are two older watchdog messages, but they are from
>>>>>>> March:
>>>>>>> 
>>>>>>> --------------------------------------------------------------------------------
>>>>>>> 
>>>>>>> SeqNumber       = 2089
>>>>>>> Message ID      = ASR0000
>>>>>>> Category        = System
>>>>>>> AgentID         = SEL
>>>>>>> Severity        = Critical
>>>>>>> Timestamp       = 2025-03-27 07:16:03
>>>>>>> Message         = The watchdog timer expired.
>>>>>>> RawEventData    =
>>>>>>> 0x03,0x00,0x02,0x33,0xFB,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
>>>>>>> 
>>>>>>> FQDD            = WatchdogTimer.iDRAC.1
>>>>>>> --------------------------------------------------------------------------------
>>>>>>> 
>>>>>>> SeqNumber       = 2088
>>>>>>> Message ID      = ASR0000
>>>>>>> Category        = System
>>>>>>> AgentID         = SEL
>>>>>>> Severity        = Critical
>>>>>>> Timestamp       = 2025-03-27 07:06:41
>>>>>>> Message         = The watchdog timer expired.
>>>>>>> RawEventData    =
>>>>>>> 0x02,0x00,0x02,0x01,0xF9,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
>>>>>>> 
>>>>>>> FQDD            = WatchdogTimer.iDRAC.1
>>>>>>> --------------------------------------------------------------------------------
>>>>>>> 
>>>>>>> 
>>>>>>> I grepped the logs of another host where it happened, but couldn't
>>>>>>> find any watchdog messages there. I believe it's also unlikely
>>>>>>> that suddenly all MDS hosts (we have five active, five hot
>>>>>>> standbys, and one cold standby) start having hardware issues. I
>>>>>>> also ran a memtest on one of the hosts last week and couldn't find
>>>>>>> anything there either.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On 16/04/2025 15:14, Anthony D'Atri wrote:
>>>>>>>> Curious, are your systems Dells? If so you might see some
>>>>>>>> improvement from running DSU to update all the firmware.  It
>>>>>>>> might also be illuminating to run `racadm lclog view`
>>>>>>>> 
>>>>>>>>> On Apr 16, 2025, at 8:32 AM, Janek Bevendorff
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>> 
>>>>>>>>> Hi,
>>>>>>>>> 
>>>>>>>>> Since the latest Reef update I have the problem that some of my
>>>>>>>>> hosts suddenly go into a state where all CPUs are stuck in
>>>>>>>>> kernel mode causing all daemons on that host to become
>>>>>>>>> unresponsive. When I connect to the IPMI console, I see a lot of
>>>>>>>>> messages like:
>>>>>>>>> 
>>>>>>>>> watchdog: BUG: soft lockup - CPU#8 stuck for 47s! [cron:868840]
>>>>>>>>> 
>>>>>>>>> (it's basically a list of all processes running on the machine).
>>>>>>>>> 
>>>>>>>>> Usually, this resolves itself after several minutes, but
>>>>>>>>> sometimes I have to hard-reset the host. When this happens, all
>>>>>>>>> daemons are marked as down and I cannot interact with the host
>>>>>>>>> at all. I don't know what causes this, but I think it happens
>>>>>>>>> primarily on the hosts where my MDS run and it seems to be
>>>>>>>>> triggered by events such as cluster rebalances, MDS restarts, or
>>>>>>>>> just randomly.
>>>>>>>>> 
>>>>>>>>> I found a few reports about similar issues on the bug tracker
>>>>>>>>> and mailing list, but they are all very unspecific, unanswered,
>>>>>>>>> or more than 6 years old.
>>>>>>>>> 
>>>>>>>>> Is there any way I can debug this? I upgraded to Squid already,
>>>>>>>>> but that didn't solve the problem. I also had massive issues
>>>>>>>>> with this during the upgrade. Particularly at the end when the
>>>>>>>>> MDS were upgraded, I had constant struggles with it. I had to
>>>>>>>>> set the noout flag and then literally sit next to it to resume
>>>>>>>>> the upgrade every few minutes until it finally went through,
>>>>>>>>> because random MDS hosts went intermittently dark all the time.
>>>>>>>>> 
>>>>>>>>> All hosts run Ubuntu 22.04 with kernel 6.8.0.
>>>>>>>>> 
>>>>>>>>> Any ideas? Thanks!
>>>>>>>>> Janek
>>>>>>>>> 
>>>>>>>>> _______________________________________________
>>>>>>>>> ceph-users mailing list -- [email protected]
>>>>>>>>> To unsubscribe send an email to [email protected]
>>>>> --
>>>>> Bauhaus-Universität Weimar
>>>>> Bauhausstr. 9a, R308
>>>>> 99423 Weimar, Germany
>>>>> 
>>>>> Phone: +49 3643 58 3577
>>>>> www.webis.de
>>>>> 
>>>>> 
>> 
>> --
>> Bauhaus-Universität Weimar
>> Bauhausstr. 9a, R308
>> 99423 Weimar, Germany
>> 
>> Phone: +49 3643 58 3577
>> www.webis.de
>> 
>> 
