Which kernel are you running?
> On May 9, 2025, at 3:02 AM, Janek Bevendorff <[email protected]> wrote:
>
> Hi, it's happening again. I haven't fully upgraded the firmware on all hosts
> yet, but at least on all MDS. I managed to finish the Ceph upgrade, but now
> I'm randomly getting the soft lockups again (mostly, but not only) on the MDS
> hosts.
>
> Anything else I could check for?
>
> Janek
>
>
> On 16/04/2025 17:38, Janek Bevendorff wrote:
>> Yes, we have a mirror of the Dell Firmware catalogue, so the servers can
>> check what they need. There are three updates in total: BIOS, NIC, and
>> Lifecycle Controller.
>>
>> I hope the BIOS update fixes this.
>>
>>
>> On 16/04/2025 17:16, Anthony D'Atri wrote:
>>> Ack, I know the R730xd very well, mostly running Trusty and Luminous at the
>>> time. BIOS updates inherently require a reboot. Check for CPLD/SPLD as
>>> well, that changes very rarely but ISTR that this model had at least one
>>> update after FCS.
>>>
>>>
>>>> The servers our Ceph runs on are all R730xd machines.
>>>>
>>>> I checked the Dell repository manager and it looks like there is at least
>>>> one BIOS update that's newer than what we've already installed, so I've
>>>> updated our Firmware repository and will schedule the updates now. That's
>>>> going to take a long while.
>>>>
>>>>
>>>> On 16/04/2025 16:16, Anthony D'Atri wrote:
>>>>> For whatever reason, in recent years I’ve seen these more often with
>>>>> Dells than other systems. My first thought was that maybe you were
>>>>> running an ancient kernel, but then I saw that you aren’t. Is the kernel
>>>>> you’re running the stock one that comes with your distribution? I’ve
>>>>> seen CPU reset events on R750s running an elrepo kernel.
>>>>>
>>>>> I suspect that some code change may have tickled a latent issue that
>>>>> perhaps you were fortunate to have not previously run into, but this is
>>>>> entirely speculation.
>>>>>
>>>>>> On Apr 16, 2025, at 9:39 AM, Janek Bevendorff
>>>>>> <[email protected]> wrote:
>>>>>>
>>>>>> Yes, they are older Dell PowerEdges. I have to check whether there's
>>>>>> newer firmware, but we've been running Ceph for years without these
>>>>>> problems.
>>>>>>
>>>>>> I checked the logs on the host on which I had a lockup just an hour ago,
>>>>>> but there's nothing besides the expected hard-reset messages. There are
>>>>>> two older watchdog messages, but they are from March:
>>>>>>
>>>>>> --------------------------------------------------------------------------------
>>>>>>
>>>>>> SeqNumber = 2089
>>>>>> Message ID = ASR0000
>>>>>> Category = System
>>>>>> AgentID = SEL
>>>>>> Severity = Critical
>>>>>> Timestamp = 2025-03-27 07:16:03
>>>>>> Message = The watchdog timer expired.
>>>>>> RawEventData = 0x03,0x00,0x02,0x33,0xFB,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
>>>>>> FQDD = WatchdogTimer.iDRAC.1
>>>>>> --------------------------------------------------------------------------------
>>>>>>
>>>>>> SeqNumber = 2088
>>>>>> Message ID = ASR0000
>>>>>> Category = System
>>>>>> AgentID = SEL
>>>>>> Severity = Critical
>>>>>> Timestamp = 2025-03-27 07:06:41
>>>>>> Message = The watchdog timer expired.
>>>>>> RawEventData = 0x02,0x00,0x02,0x01,0xF9,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
>>>>>> FQDD = WatchdogTimer.iDRAC.1
>>>>>> --------------------------------------------------------------------------------
>>>>>>
>>>>>>
>>>>>> I grepped the logs of another host where it happened, but couldn't find
>>>>>> any watchdog messages there. I believe it's also unlikely that suddenly
>>>>>> all MDS hosts (we have five active, five hot standbys, and one cold
>>>>>> standby) start having hardware issues. I also ran a memtest on one of
>>>>>> the hosts last week and couldn't find anything there either.
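If it helps to compare hosts, the lifecycle log records quoted above could be filtered for watchdog events with something like the following. This is only a rough sketch: it assumes the `racadm lclog view` output was saved as text and keeps the `Key = Value` layout shown in the excerpt, and the names `parse_lclog` and `watchdog_events` are made up for illustration, not part of any Dell tooling.

```python
import re
from typing import Dict, List

# Records in the quoted excerpt are delimited by long rows of dashes.
SEPARATOR = re.compile(r"^-{10,}\s*$")

# Sample taken from the excerpt in this thread (assumed layout).
SAMPLE = """\
--------------------------------------------------------------------------------
SeqNumber = 2089
Message ID = ASR0000
Category = System
AgentID = SEL
Severity = Critical
Timestamp = 2025-03-27 07:16:03
Message = The watchdog timer expired.
RawEventData =
0x03,0x00,0x02,0x33,0xFB,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
FQDD = WatchdogTimer.iDRAC.1
--------------------------------------------------------------------------------
SeqNumber = 2088
Message ID = ASR0000
Category = System
AgentID = SEL
Severity = Critical
Timestamp = 2025-03-27 07:06:41
Message = The watchdog timer expired.
RawEventData =
0x02,0x00,0x02,0x01,0xF9,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
FQDD = WatchdogTimer.iDRAC.1
--------------------------------------------------------------------------------
"""


def parse_lclog(text: str) -> List[Dict[str, str]]:
    """Split saved log text into records mapping field name to value."""
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if SEPARATOR.match(line):
            # A dashed row closes the record collected so far, if any.
            if current:
                records.append(current)
            current, last_key = {}, None
        elif "=" in line:
            key, _, value = line.partition("=")
            last_key = key.strip()
            current[last_key] = value.strip()
        elif line and last_key is not None:
            # Continuation line, e.g. RawEventData wrapped onto the next line.
            current[last_key] += line
    if current:
        records.append(current)
    return records


def watchdog_events(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only records whose Message field mentions the watchdog."""
    return [r for r in records if "watchdog" in r.get("Message", "").lower()]


if __name__ == "__main__":
    for event in watchdog_events(parse_lclog(SAMPLE)):
        print(event["Timestamp"], event["Severity"], event["Message"])
```

Running this against a saved log from each MDS host and lining the timestamps up against the soft-lockup times from the kernel log would at least show whether the iDRAC watchdog entries correlate with the lockups or, as with the March entries, predate them.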
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 16/04/2025 15:14, Anthony D'Atri wrote:
>>>>>>> Curious, are your systems Dells? If so you might see some improvement
>>>>>>> from running DSU to update all the firmware. It might also be
>>>>>>> illuminating to run `racadm lclog view`
>>>>>>>
>>>>>>>> On Apr 16, 2025, at 8:32 AM, Janek Bevendorff
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Since the latest Reef update I have the problem that some of my hosts
>>>>>>>> suddenly go into a state where all CPUs are stuck in kernel mode,
>>>>>>>> causing all daemons on that host to become unresponsive. When I
>>>>>>>> connect to the IPMI console, I see a lot of messages like:
>>>>>>>>
>>>>>>>> watchdog: BUG: soft lockup - CPU#8 stuck for 47s! [cron:868840]
>>>>>>>>
>>>>>>>> (it's basically a list of all processes running on the machine).
>>>>>>>>
>>>>>>>> Usually, this resolves itself after several minutes, but sometimes I
>>>>>>>> have to hard-reset the host. When this happens, all daemons are marked
>>>>>>>> as down and I cannot interact with the host at all. I don't know what
>>>>>>>> causes this, but I think it happens primarily on the hosts where my
>>>>>>>> MDS run and it seems to be triggered by events such as cluster
>>>>>>>> rebalances, MDS restarts, or just randomly.
>>>>>>>>
>>>>>>>> I found a few reports about similar issues on the bug tracker and
>>>>>>>> mailing list, but they are all very unspecific, unanswered, or more
>>>>>>>> than 6 years old.
>>>>>>>>
>>>>>>>> Is there any way I can debug this? I upgraded to Squid already, but
>>>>>>>> that didn't solve the problem. I also had massive issues with this
>>>>>>>> during the upgrade. Particularly at the end when the MDS were
>>>>>>>> upgraded, I had constant struggles with it.
>>>>>>>> I had to set the noout flag and then literally sit next to it to resume
>>>>>>>> the upgrade every few minutes until it finally went through, because
>>>>>>>> random MDS hosts went intermittently dark all the time.
>>>>>>>>
>>>>>>>> All hosts run Ubuntu 22.04 with kernel 6.8.0.
>>>>>>>>
>>>>>>>> Any ideas? Thanks!
>>>>>>>> Janek
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> ceph-users mailing list -- [email protected]
>>>>>>>> To unsubscribe send an email to [email protected]
>>>> --
>>>> Bauhaus-Universität Weimar
>>>> Bauhausstr. 9a, R308
>>>> 99423 Weimar, Germany
>>>>
>>>> Phone: +49 3643 58 3577
>>>> www.webis.de
