[ceph-users] Re: "BUG: soft lockup" with MDS

Anthony D'Atri Fri, 09 May 2025 05:42:40 -0700

Which kernel are you running?
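
A minimal sketch for collecting that information in one go on each affected host, together with the soft-lockup watchdog settings. It assumes the standard Linux sysctl files under /proc/sys/kernel; the function name and the returned dict layout are made up for illustration. Missing files are reported as None rather than treated as an error.

```python
import platform
from pathlib import Path

# Soft-lockup related sysctls documented for the Linux kernel.
# The soft-lockup threshold is 2 * watchdog_thresh seconds.
SYSCTLS = (
    "watchdog_thresh",
    "soft_watchdog",
    "softlockup_panic",
    "softlockup_all_cpu_backtrace",
)


def collect_watchdog_info(proc_sys_kernel="/proc/sys/kernel"):
    """Return the kernel release plus the raw values of the watchdog sysctls.

    `proc_sys_kernel` is a parameter only so the function can be pointed
    at a different directory (for example in a test).
    """
    info = {"kernel_release": platform.release()}
    base = Path(proc_sys_kernel)
    for name in SYSCTLS:
        try:
            info[name] = (base / name).read_text().strip()
        except OSError:
            # File absent or unreadable, e.g. option not built into this kernel.
            info[name] = None
    return info


if __name__ == "__main__":
    for key, value in collect_watchdog_info().items():
        print(f"{key} = {value}")
```

Posting this output alongside the lockup messages shows both the exact kernel build and whether the watchdog is configured to only log or to panic.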


> On May 9, 2025, at 3:02 AM, Janek Bevendorff <[email protected]> 
> wrote:
> 
> Hi, it's happening again. I haven't upgraded the firmware on all hosts yet, 
> but I have on all MDS hosts. I managed to finish the Ceph upgrade, but now 
> I'm randomly getting the soft lockups again, mostly (but not only) on the MDS 
> hosts.
> 
> Anything else I could check for?
> 
> Janek
> 
> 
> On 16/04/2025 17:38, Janek Bevendorff wrote:
>> Yes, we have a mirror of the Dell Firmware catalogue, so the servers can 
>> check what they need. There are three updates in total: BIOS, NIC, and 
>> Lifecycle Controller.
>> 
>> I hope the BIOS update fixes this.
>> 
>> 
>> On 16/04/2025 17:16, Anthony D'Atri wrote:
>>> Ack, I know the R730xd very well, mostly running Trusty and Luminous at the 
>>> time.  BIOS updates inherently require a reboot.  Check for CPLD/SPLD as 
>>> well, that changes very rarely but ISTR that this model had at least one 
>>> update after FCS.
>>> 
>>> 
>>>> The servers our Ceph runs on are all R730xd machines.
>>>> 
>>>> I checked the Dell repository manager and it looks like there is at least 
>>>> one BIOS update that's newer than what we've already installed, so I've 
>>>> updated our Firmware repository and will schedule the updates now. That's 
>>>> going to take a long while.
>>>> 
>>>> 
>>>> On 16/04/2025 16:16, Anthony D'Atri wrote:
>>>>> For whatever reason, in recent years I’ve seen these more often with 
>>>>> Dells than other systems. My first thought was that maybe you were 
>>>>> running an ancient kernel, but then I saw that you aren’t.  Is the kernel 
>>>>> you’re running the stock one that comes with your distribution?  I’ve 
>>>>> seen CPU reset events on R750s running an elrepo kernel.
>>>>> 
>>>>> I suspect that some code change may have tickled a latent issue that 
>>>>> perhaps you were fortunate to have not previously run into, but this is 
>>>>> entirely speculation.
>>>>> 
>>>>>> On Apr 16, 2025, at 9:39 AM, Janek Bevendorff 
>>>>>> <[email protected]> wrote:
>>>>>> 
>>>>>> Yes, they are older Dell PowerEdges. I have to check whether there's 
>>>>>> newer firmware, but we've been running Ceph for years without these 
>>>>>> problems.
>>>>>> 
>>>>>> I checked the logs on the host on which I had a lockup just an hour ago, 
>>>>>> but there's nothing besides the expected hard-reset messages. There are 
>>>>>> two older watchdog messages, but they are from March:
>>>>>> 
>>>>>> --------------------------------------------------------------------------------
>>>>>>  
>>>>>> SeqNumber       = 2089
>>>>>> Message ID      = ASR0000
>>>>>> Category        = System
>>>>>> AgentID         = SEL
>>>>>> Severity        = Critical
>>>>>> Timestamp       = 2025-03-27 07:16:03
>>>>>> Message         = The watchdog timer expired.
>>>>>> RawEventData    = 0x03,0x00,0x02,0x33,0xFB,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
>>>>>> FQDD            = WatchdogTimer.iDRAC.1
>>>>>> --------------------------------------------------------------------------------
>>>>>>  
>>>>>> SeqNumber       = 2088
>>>>>> Message ID      = ASR0000
>>>>>> Category        = System
>>>>>> AgentID         = SEL
>>>>>> Severity        = Critical
>>>>>> Timestamp       = 2025-03-27 07:06:41
>>>>>> Message         = The watchdog timer expired.
>>>>>> RawEventData    = 0x02,0x00,0x02,0x01,0xF9,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
>>>>>> FQDD            = WatchdogTimer.iDRAC.1
>>>>>> --------------------------------------------------------------------------------
>>>>>>  
>>>>>> 
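
The two log entries above can be processed mechanically. The following is a rough sketch, not a Dell tool: `parse_lclog` splits text in the `key = value` layout shown above into records, and `decode_sel` unpacks `RawEventData` on the assumption that it is a standard 16-byte IPMI SEL system event record (record ID, record type, 4-byte little-endian timestamp, generator ID, EvM revision, sensor type, sensor number, event type, three event data bytes). Both function names are made up for illustration, and reading the timestamp as UTC is also an assumption. Under those assumptions the first entry decodes to the same time that the log prints, and sensor type 0x23 corresponds to the IPMI "Watchdog 2" sensor class.

```python
from datetime import datetime, timezone


def parse_lclog(text):
    """Split `racadm lclog view`-style text into a list of {field: value} dicts."""
    records, current, last_key = [], {}, None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if set(line) == {"-"}:
            # A line made only of dashes separates records.
            if current:
                records.append(current)
            current, last_key = {}, None
        elif "=" in line:
            key, _, value = line.partition("=")
            last_key = key.strip()
            current[last_key] = value.strip()
        elif last_key is not None:
            # Wrapped value: append as-is (no space), which suits RawEventData.
            current[last_key] += line
    if current:
        records.append(current)
    return records


def decode_sel(raw_event_data):
    """Decode '0x03,0x00,...' as a 16-byte IPMI SEL system event record."""
    b = bytes(int(x, 16) for x in raw_event_data.split(","))
    if len(b) != 16:
        raise ValueError(f"expected 16 bytes, got {len(b)}")
    seconds = int.from_bytes(b[3:7], "little")
    return {
        "record_id": int.from_bytes(b[0:2], "little"),
        "record_type": b[2],
        # Assumption: the SEL timestamp is seconds since the epoch in UTC.
        "timestamp": datetime.fromtimestamp(seconds, tz=timezone.utc),
        "generator_id": int.from_bytes(b[7:9], "little"),
        "sensor_type": b[10],
        "sensor_number": b[11],
        "event_type": b[12],
        "event_data": tuple(b[13:16]),
    }


def watchdog_events(text):
    """Return decoded SEL data for every ASR0000 (watchdog expired) record."""
    return [
        decode_sel(r["RawEventData"])
        for r in parse_lclog(text)
        if r.get("Message ID") == "ASR0000" and r.get("RawEventData")
    ]
```

Run against the saved output from each MDS host, this makes it easy to compare watchdog expiry times with the times of the soft lockups.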
>>>>>> I grepped the logs of another host where it happened, but couldn't find 
>>>>>> any watchdog messages there. I also think it's unlikely that all MDS 
>>>>>> hosts (we have five active, five hot standbys, and one cold standby) 
>>>>>> would suddenly start having hardware issues. I also ran a memtest on one 
>>>>>> of the hosts last week and couldn't find anything there either.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> On 16/04/2025 15:14, Anthony D'Atri wrote:
>>>>>>> Curious, are your systems Dells? If so you might see some improvement 
>>>>>>> from running DSU to update all the firmware.  It might also be 
>>>>>>> illuminating to run `racadm lclog view`
>>>>>>> 
>>>>>>>> On Apr 16, 2025, at 8:32 AM, Janek Bevendorff 
>>>>>>>> <[email protected]> wrote:
>>>>>>>> 
>>>>>>>> Hi,
>>>>>>>> 
>>>>>>>> Since the latest Reef update I have the problem that some of my hosts 
>>>>>>>> suddenly go into a state where all CPUs are stuck in kernel mode 
>>>>>>>> causing all daemons on that host to become unresponsive. When I 
>>>>>>>> connect to the IPMI console, I see a lot of messages like:
>>>>>>>> 
>>>>>>>> watchdog: BUG: soft lockup - CPU#8 stuck for 47s! [cron:868840]
>>>>>>>> 
>>>>>>>> (it's basically a list of all processes running on the machine).
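
Since the console fills with one such line per stuck task, a small parser can help show which CPUs and processes are affected and how long the longest stall was. This is a minimal sketch based only on the message format quoted above; the function names and the summary layout are made up for illustration, and the input is assumed to be lines saved from the console or kernel log.

```python
import re
from collections import Counter

# Matches lines such as:
#   watchdog: BUG: soft lockup - CPU#8 stuck for 47s! [cron:868840]
SOFT_LOCKUP = re.compile(
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<comm>[^\]]*):(?P<pid>\d+)\]"
)


def parse_soft_lockups(lines):
    """Yield one dict per soft-lockup line; all other lines are skipped."""
    for line in lines:
        m = SOFT_LOCKUP.search(line)
        if m:
            yield {
                "cpu": int(m["cpu"]),
                "seconds": int(m["secs"]),
                "comm": m["comm"],
                "pid": int(m["pid"]),
            }


def summarize(lines):
    """Aggregate soft-lockup events by CPU and by process name."""
    events = list(parse_soft_lockups(lines))
    return {
        "events": len(events),
        "cpus": sorted({e["cpu"] for e in events}),
        "longest_stall_s": max((e["seconds"] for e in events), default=0),
        "by_comm": Counter(e["comm"] for e in events),
    }


if __name__ == "__main__":
    import sys

    # Usage: python3 lockups.py < saved_console_output.txt
    print(summarize(sys.stdin))
```

If the same few process names or CPUs dominate the summary across hosts, that narrows the search; if every CPU and process appears, the stall is more likely system-wide.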
>>>>>>>> 
>>>>>>>> Usually, this resolves itself after several minutes, but sometimes I 
>>>>>>>> have to hard-reset the host. When this happens, all daemons are marked 
>>>>>>>> as down and I cannot interact with the host at all. I don't know what 
>>>>>>>> causes this, but I think it happens primarily on the hosts where my 
>>>>>>>> MDS daemons run, and it seems to be triggered by events such as cluster 
>>>>>>>> rebalances and MDS restarts, or to occur just randomly.
>>>>>>>> 
>>>>>>>> I found a few reports about similar issues on the bug tracker and 
>>>>>>>> mailing list, but they are all very unspecific, unanswered, or more 
>>>>>>>> than 6 years old.
>>>>>>>> 
>>>>>>>> Is there any way I can debug this? I upgraded to Squid already, but 
>>>>>>>> that didn't solve the problem. I also had massive issues with this 
>>>>>>>> during the upgrade. Particularly at the end when the MDS were 
>>>>>>>> upgraded, I had constant struggles with it. I had to set the noout 
>>>>>>>> flag and then literally sit next to it to resume the upgrade every few 
>>>>>>>> minutes until it finally went through, because random MDS hosts went 
>>>>>>>> intermittently dark all the time.
>>>>>>>> 
>>>>>>>> All hosts run Ubuntu 22.04 with kernel 6.8.0.
>>>>>>>> 
>>>>>>>> Any ideas? Thanks!
>>>>>>>> Janek
>>>>>>>> 
>>>>>>>> _______________________________________________
>>>>>>>> ceph-users mailing list -- [email protected]
>>>>>>>> To unsubscribe send an email to [email protected]
>>>> -- 
>>>> Bauhaus-Universität Weimar
>>>> Bauhausstr. 9a, R308
>>>> 99423 Weimar, Germany
>>>> 
>>>> Phone: +49 3643 58 3577
>>>> www.webis.de
>>>> 
>>>> 
