There are separate thermal, overall performance, and fan states in iDRAC. I’ve found that I often have to bump up the default “fan offset” for more cooling.
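
In case it helps with sifting `racadm lclog view` output from many hosts, here is a minimal Python sketch that pulls out the watchdog records. It assumes the output was saved as text and looks like the two records quoted further down (`key = value` lines between dashed separators, with `RawEventData` wrapped onto the next line). `parse_lclog` and `watchdog_events` are names I made up for illustration, not part of any Dell tool.

```python
import re
from typing import Dict, List

# Sample copied from the two records quoted later in this thread.
SAMPLE = """\
--------------------------------------------------------------------------------

SeqNumber = 2089
Message ID = ASR0000
Category = System
AgentID = SEL
Severity = Critical
Timestamp = 2025-03-27 07:16:03
Message = The watchdog timer expired.
RawEventData =
0x03,0x00,0x02,0x33,0xFB,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF

FQDD = WatchdogTimer.iDRAC.1
--------------------------------------------------------------------------------

SeqNumber = 2088
Message ID = ASR0000
Category = System
AgentID = SEL
Severity = Critical
Timestamp = 2025-03-27 07:06:41
Message = The watchdog timer expired.
RawEventData =
0x02,0x00,0x02,0x01,0xF9,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF

FQDD = WatchdogTimer.iDRAC.1
--------------------------------------------------------------------------------
"""


def parse_lclog(text: str) -> List[Dict[str, str]]:
    """Split lclog-style text into one dict per record (assumed format)."""
    records: List[Dict[str, str]] = []
    current: Dict[str, str] = {}
    last_key = None
    for raw in text.splitlines():
        line = raw.strip()
        if re.fullmatch(r"-{10,}", line):
            # A dashed line closes the record collected so far, if any.
            if current:
                records.append(current)
            current, last_key = {}, None
        elif "=" in line:
            key, _, value = line.partition("=")
            last_key = key.strip()
            current[last_key] = value.strip()
        elif line and last_key:
            # Continuation line, e.g. the wrapped RawEventData bytes.
            current[last_key] += line
    if current:
        records.append(current)
    return records


def watchdog_events(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only records with the ASR0000 message ID seen in the quoted log."""
    return [r for r in records if r.get("Message ID") == "ASR0000"]


if __name__ == "__main__":
    for event in watchdog_events(parse_lclog(SAMPLE)):
        print(event["Timestamp"], event["Message"])
```

Comparing the timestamps this prints against the times of the soft lockups would show fairly quickly whether the watchdog expiries line up with them or are unrelated.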
> On May 9, 2025, at 8:51 AM, Frédéric Nass <[email protected]> wrote:
>
> Hi Janek,
>
> We just had a very similar issue with recent hardware (DELL R760xd) going
> nuts (100% CPU load) for like 10 to 20 minutes and OSDs being reported 'down'
> as not responding in time.
>
> Switching the CPU profile to HPC (High Performance Computing) and the thermal
> settings to Maximum Performance (or is it Optimized?) in BIOS, and upgrading
> HDD firmware to the latest one that was only available from DELL's website
> (not yet in OpenManage catalog) fixed it.
>
> Maybe you can give it a try.
>
> Regards,
> Frédéric.
>
> ----- On 9 May 25, at 9:02, Janek Bevendorff [email protected] wrote:
>
>> Hi, it's happening again. I haven't fully upgraded the firmware on all
>> hosts yet, but at least on all MDS. I managed to finish the Ceph
>> upgrade, but now I'm randomly getting the soft lockups again (mostly,
>> but not only) on the MDS hosts.
>>
>> Anything else I could check for?
>>
>> Janek
>>
>>
>> On 16/04/2025 17:38, Janek Bevendorff wrote:
>>> Yes, we have a mirror of the Dell Firmware catalogue, so the servers
>>> can check what they need. There are three updates in total: BIOS, NIC,
>>> and Lifecycle Controller.
>>>
>>> I hope the BIOS update fixes this.
>>>
>>>
>>> On 16/04/2025 17:16, Anthony D'Atri wrote:
>>>> Ack, I know the R730xd very well, mostly running Trusty and Luminous
>>>> at the time. BIOS updates inherently require a reboot. Check for
>>>> CPLD/SPLD as well, that changes very rarely but ISTR that this model
>>>> had at least one update after FCS.
>>>>
>>>>
>>>>> The servers our Ceph runs on are all R730xd machines.
>>>>>
>>>>> I checked the Dell repository manager and it looks like there is at
>>>>> least one BIOS update that's newer than what we've already
>>>>> installed, so I've updated our Firmware repository and will schedule
>>>>> the updates now. That's going to take a long while.
>>>>>
>>>>>
>>>>> On 16/04/2025 16:16, Anthony D'Atri wrote:
>>>>>> For whatever reason, in recent years I’ve seen these more often
>>>>>> with Dells than other systems. My first thought was that maybe you
>>>>>> were running an ancient kernel, but then I saw that you aren’t. Is
>>>>>> the kernel you’re running the stock one that comes with your
>>>>>> distribution? I’ve seen CPU reset events on R750s running an
>>>>>> elrepo kernel.
>>>>>>
>>>>>> I suspect that some code change may have tickled a latent issue
>>>>>> that perhaps you were fortunate to have not previously run into,
>>>>>> but this is entirely speculation.
>>>>>>
>>>>>>> On Apr 16, 2025, at 9:39 AM, Janek Bevendorff
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Yes, they are older Dell PowerEdges. I have to check whether
>>>>>>> there's newer firmware, but we've been running Ceph for years
>>>>>>> without these problems.
>>>>>>>
>>>>>>> I checked the logs on the host on which I had a lockup just an
>>>>>>> hour ago, but there's nothing besides the expected hardreset
>>>>>>> messages. There are two older watchdog messages, but they are from
>>>>>>> March:
>>>>>>>
>>>>>>> --------------------------------------------------------------------------------
>>>>>>>
>>>>>>> SeqNumber = 2089
>>>>>>> Message ID = ASR0000
>>>>>>> Category = System
>>>>>>> AgentID = SEL
>>>>>>> Severity = Critical
>>>>>>> Timestamp = 2025-03-27 07:16:03
>>>>>>> Message = The watchdog timer expired.
>>>>>>> RawEventData =
>>>>>>> 0x03,0x00,0x02,0x33,0xFB,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
>>>>>>>
>>>>>>> FQDD = WatchdogTimer.iDRAC.1
>>>>>>> --------------------------------------------------------------------------------
>>>>>>>
>>>>>>> SeqNumber = 2088
>>>>>>> Message ID = ASR0000
>>>>>>> Category = System
>>>>>>> AgentID = SEL
>>>>>>> Severity = Critical
>>>>>>> Timestamp = 2025-03-27 07:06:41
>>>>>>> Message = The watchdog timer expired.
>>>>>>> RawEventData =
>>>>>>> 0x02,0x00,0x02,0x01,0xF9,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
>>>>>>>
>>>>>>> FQDD = WatchdogTimer.iDRAC.1
>>>>>>> --------------------------------------------------------------------------------
>>>>>>>
>>>>>>>
>>>>>>> I grepped the logs of another host where it happened, but couldn't
>>>>>>> find any watchdog messages there. I believe it's also unlikely
>>>>>>> that suddenly all MDS hosts (we have five active, five hot
>>>>>>> standbys, and one cold standby) start having hardware issues. I
>>>>>>> also ran a memtest on one of the hosts last week and couldn't find
>>>>>>> anything there either.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 16/04/2025 15:14, Anthony D'Atri wrote:
>>>>>>>> Curious, are your systems Dells? If so you might see some
>>>>>>>> improvement from running DSU to update all the firmware. It
>>>>>>>> might also be illuminating to run `racadm lclog view`
>>>>>>>>
>>>>>>>>> On Apr 16, 2025, at 8:32 AM, Janek Bevendorff
>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> Since the latest Reef update I have the problem that some of my
>>>>>>>>> hosts suddenly go into a state where all CPUs are stuck in
>>>>>>>>> kernel mode causing all daemons on that host to become
>>>>>>>>> unresponsive. When I connect to the IPMI console, I see a lot of
>>>>>>>>> messages like:
>>>>>>>>>
>>>>>>>>> watchdog: BUG: soft lockup - CPU#8 stuck for 47s! [cron:868840]
>>>>>>>>>
>>>>>>>>> (it's basically a list of all processes running on the machine).
>>>>>>>>>
>>>>>>>>> Usually, this resolves itself after several minutes, but
>>>>>>>>> sometimes I have to hardreset the host. When this happens, all
>>>>>>>>> daemons are marked as down and I cannot interact with the host
>>>>>>>>> at all.
>>>>>>>>> I don't know what causes this, but I think it happens
>>>>>>>>> primarily on the hosts where my MDS run and it seems to be
>>>>>>>>> triggered by events such as cluster rebalances, MDS restarts, or
>>>>>>>>> just randomly.
>>>>>>>>>
>>>>>>>>> I found a few reports about similar issues on the bug tracker
>>>>>>>>> and mailing list, but they are all very unspecific, unanswered,
>>>>>>>>> or more than 6 years old.
>>>>>>>>>
>>>>>>>>> Is there any way I can debug this? I upgraded to Squid already,
>>>>>>>>> but that didn't solve the problem. I also had massive issues
>>>>>>>>> with this during the upgrade. Particularly at the end when the
>>>>>>>>> MDS were upgraded, I had constant struggles with it. I had to
>>>>>>>>> set the noout flag and then literally sit next to it to resume
>>>>>>>>> the upgrade every few minutes until it finally went through,
>>>>>>>>> because random MDS hosts went intermittently dark all the time.
>>>>>>>>>
>>>>>>>>> All hosts run Ubuntu 22.04 with kernel 6.8.0.
>>>>>>>>>
>>>>>>>>> Any ideas? Thanks!
>>>>>>>>> Janek
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> ceph-users mailing list -- [email protected]
>>>>>>>>> To unsubscribe send an email to [email protected]
>>>>> --
>>>>> Bauhaus-Universität Weimar
>>>>> Bauhausstr. 9a, R308
>>>>> 99423 Weimar, Germany
>>>>>
>>>>> Phone: +49 3643 58 3577
>>>>> www.webis.de
