The last firmware release for the HGST HUH721010AL5200 is 'LS21' [1], from 2021. Your drives have most likely been running this firmware for years, so the firmware is probably not the source of the issue.
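If you want to double-check what the drives are actually running, smartctl
reports the firmware revision. A minimal sketch, assuming smartmontools is
installed and /dev/sdb is one of the HGST SAS disks (adjust the device
names to your setup):

  # Print the drive identity; SAS disks report the firmware as 'Revision'
  smartctl -i /dev/sdb | grep -iE 'revision|firmware'

  # Or loop over all sd devices at once
  for d in /dev/sd?; do
    echo "$d: $(smartctl -i "$d" | grep -iE 'revision|firmware')"
  done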
Frédéric.

[1] https://www.dell.com/support/home/en-us/drivers/driversdetails?driverId=MGW91&lwp=rt

----- On 12 May 25, at 13:48, Janek Bevendorff [email protected] wrote:

> I tried to install both firmwares on one of the nodes, but they're not
> compatible. Most of our disks are HGST HUH721010AL5200 10TB SAS disks.
>
> On 12/05/2025 10:51, Frédéric Nass wrote:
>> Hi Janek,
>>
>> I just checked, and we upgraded both the HDD and SSD firmware to the
>> versions released last month.
>>
>> HDD firmware (DELL/Seagate 'ST16000NM006J'):
>> https://www.dell.com/support/home/en-us/drivers/driversdetails?driverid=xf65r&lwp=rt
>> NVMe firmware (DELL/Kioxia 'Dell Ent NVMe CM7 U.2 RI 1.92TB'):
>> https://www.dell.com/support/home/en-us/drivers/DriversDetails?driverID=VH1YP&lwp=rt
>>
>> What models are your drives?
>>
>> Regards,
>> Frédéric.
>>
>> ----- On 12 May 25, at 9:37, Janek Bevendorff [email protected] wrote:
>>
>>> Hi all,
>>>
>>> The kernel is 6.8.0 (Ubuntu). The thermal settings in iDRAC are already
>>> quite high and we have a good cooling system overall, so that shouldn't
>>> cause any issues. Our cold aisle is around 20°C.
>>>
>>> @Frédéric Do you have a link to the HDD firmware? I installed everything
>>> that's available in the Dell catalogue. Also, I don't know whether CPU
>>> usage is high during these lockups, since I cannot observe the host
>>> state when it happens. It's as if the entire node goes down until it
>>> either recovers on its own or I do a hard reset.
>>>
>>> Janek
>>>
>>> On 09/05/2025 15:08, Anthony D'Atri wrote:
>>>> There are separate thermal, overall performance, and fan states in
>>>> iDRAC. I've found that I often have to bump up the default "fan
>>>> offset" for more cooling.
>>>>
>>>>> On May 9, 2025, at 8:51 AM, Frédéric Nass <[email protected]> wrote:
>>>>>
>>>>> Hi Janek,
>>>>>
>>>>> We just had a very similar issue with recent hardware (DELL R760xd)
>>>>> going nuts (100% CPU load) for 10 to 20 minutes, with OSDs being
>>>>> reported 'down' for not responding in time.
>>>>>
>>>>> Switching the CPU profile to HPC (High Performance Computing) and the
>>>>> thermal settings to Maximum Performance (or is it Optimized?) in the
>>>>> BIOS, and upgrading the HDD firmware to the latest one, which was only
>>>>> available from DELL's website (not yet in the OpenManage catalog),
>>>>> fixed it.
>>>>>
>>>>> Maybe you can give it a try.
>>>>>
>>>>> Regards,
>>>>> Frédéric.
>>>>>
>>>>> ----- On 9 May 25, at 9:02, Janek Bevendorff [email protected] wrote:
>>>>>
>>>>>> Hi, it's happening again. I haven't fully upgraded the firmware on
>>>>>> all hosts yet, but I have on all the MDS hosts. I managed to finish
>>>>>> the Ceph upgrade, but now I'm randomly getting the soft lockups
>>>>>> again, mostly (but not only) on the MDS hosts.
>>>>>>
>>>>>> Anything else I could check for?
>>>>>>
>>>>>> Janek
>>>>>>
>>>>>> On 16/04/2025 17:38, Janek Bevendorff wrote:
>>>>>>> Yes, we have a mirror of the Dell firmware catalogue, so the
>>>>>>> servers can check what they need. There are three updates in total:
>>>>>>> BIOS, NIC, and Lifecycle Controller.
>>>>>>>
>>>>>>> I hope the BIOS update fixes this.
>>>>>>>
>>>>>>> On 16/04/2025 17:16, Anthony D'Atri wrote:
>>>>>>>> Ack, I know the R730xd very well, mostly running Trusty and
>>>>>>>> Luminous at the time. BIOS updates inherently require a reboot.
>>>>>>>> Check for CPLD/SPLD updates as well; those change very rarely, but
>>>>>>>> ISTR that this model had at least one update after FCS.
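Side note for anyone following along: the currently installed BIOS level
can be read from the OS before scheduling updates. A minimal sketch,
assuming dmidecode is available (the same information is shown in the
iDRAC web UI):

  # Report the installed BIOS version and its release date
  sudo dmidecode -s bios-version
  sudo dmidecode -s bios-release-date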
>>>>>>>>
>>>>>>>>> The servers our Ceph runs on are all R730xd machines.
>>>>>>>>>
>>>>>>>>> I checked the Dell Repository Manager and it looks like there is
>>>>>>>>> at least one BIOS update that's newer than what we've already
>>>>>>>>> installed, so I've updated our firmware repository and will
>>>>>>>>> schedule the updates now. That's going to take a long while.
>>>>>>>>>
>>>>>>>>> On 16/04/2025 16:16, Anthony D'Atri wrote:
>>>>>>>>>> For whatever reason, in recent years I've seen these more often
>>>>>>>>>> with Dells than with other systems. My first thought was that
>>>>>>>>>> maybe you were running an ancient kernel, but then I saw that
>>>>>>>>>> you aren't. Is the kernel you're running the stock one that
>>>>>>>>>> comes with your distribution? I've seen CPU reset events on
>>>>>>>>>> R750s running an elrepo kernel.
>>>>>>>>>>
>>>>>>>>>> I suspect that some code change may have tickled a latent issue
>>>>>>>>>> that you were perhaps fortunate not to have run into before,
>>>>>>>>>> but this is entirely speculation.
>>>>>>>>>>
>>>>>>>>>>> On Apr 16, 2025, at 9:39 AM, Janek Bevendorff
>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Yes, they are older Dell PowerEdges. I have to check whether
>>>>>>>>>>> there's newer firmware, but we've been running Ceph for years
>>>>>>>>>>> without these problems.
>>>>>>>>>>>
>>>>>>>>>>> I checked the logs on the host where I had a lockup just an
>>>>>>>>>>> hour ago, but there's nothing besides the expected hard-reset
>>>>>>>>>>> messages. There are two older watchdog messages, but they are
>>>>>>>>>>> from March:
>>>>>>>>>>>
>>>>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>>>>> SeqNumber = 2089
>>>>>>>>>>> Message ID = ASR0000
>>>>>>>>>>> Category = System
>>>>>>>>>>> AgentID = SEL
>>>>>>>>>>> Severity = Critical
>>>>>>>>>>> Timestamp = 2025-03-27 07:16:03
>>>>>>>>>>> Message = The watchdog timer expired.
>>>>>>>>>>> RawEventData = 0x03,0x00,0x02,0x33,0xFB,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
>>>>>>>>>>> FQDD = WatchdogTimer.iDRAC.1
>>>>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>>>>> SeqNumber = 2088
>>>>>>>>>>> Message ID = ASR0000
>>>>>>>>>>> Category = System
>>>>>>>>>>> AgentID = SEL
>>>>>>>>>>> Severity = Critical
>>>>>>>>>>> Timestamp = 2025-03-27 07:06:41
>>>>>>>>>>> Message = The watchdog timer expired.
>>>>>>>>>>> RawEventData = 0x02,0x00,0x02,0x01,0xF9,0xE4,0x67,0x20,0x00,0x04,0x23,0x71,0x6F,0xC0,0x04,0xFF
>>>>>>>>>>> FQDD = WatchdogTimer.iDRAC.1
>>>>>>>>>>> --------------------------------------------------------------------------------
>>>>>>>>>>>
>>>>>>>>>>> I grepped the logs of another host where it happened, but
>>>>>>>>>>> couldn't find any watchdog messages there. I also believe it's
>>>>>>>>>>> unlikely that all of our MDS hosts (we have five active, five
>>>>>>>>>>> hot standbys, and one cold standby) suddenly start having
>>>>>>>>>>> hardware issues. I also ran a memtest on one of the hosts last
>>>>>>>>>>> week and couldn't find anything there either.
>>>>>>>>>>>
>>>>>>>>>>> On 16/04/2025 15:14, Anthony D'Atri wrote:
>>>>>>>>>>>> Curious, are your systems Dells? If so, you might see some
>>>>>>>>>>>> improvement from running DSU to update all the firmware. It
>>>>>>>>>>>> might also be illuminating to run `racadm lclog view`.
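A minimal sketch of that, assuming racadm is installed on the host: the
Lifecycle log records quoted above can be filtered for critical events
with plain grep, e.g.:

  # Dump the Lifecycle Controller log and keep critical records with context
  racadm lclog view | grep -B 4 -A 7 'Severity = Critical'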
>>>>>>>>>>>>
>>>>>>>>>>>>> On Apr 16, 2025, at 8:32 AM, Janek Bevendorff
>>>>>>>>>>>>> <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Since the latest Reef update, I have the problem that some of
>>>>>>>>>>>>> my hosts suddenly go into a state where all CPUs are stuck in
>>>>>>>>>>>>> kernel mode, causing all daemons on that host to become
>>>>>>>>>>>>> unresponsive. When I connect to the IPMI console, I see a lot
>>>>>>>>>>>>> of messages like:
>>>>>>>>>>>>>
>>>>>>>>>>>>> watchdog: BUG: soft lockup - CPU#8 stuck for 47s! [cron:868840]
>>>>>>>>>>>>>
>>>>>>>>>>>>> (it's basically a list of all the processes running on the
>>>>>>>>>>>>> machine).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Usually this resolves itself after several minutes, but
>>>>>>>>>>>>> sometimes I have to hard-reset the host. When this happens,
>>>>>>>>>>>>> all daemons are marked as down and I cannot interact with the
>>>>>>>>>>>>> host at all. I don't know what causes it, but I think it
>>>>>>>>>>>>> happens primarily on the hosts where my MDS run, and it seems
>>>>>>>>>>>>> to be triggered by events such as cluster rebalances and MDS
>>>>>>>>>>>>> restarts, or sometimes just randomly.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I found a few reports about similar issues on the bug tracker
>>>>>>>>>>>>> and mailing list, but they are all very unspecific,
>>>>>>>>>>>>> unanswered, or more than 6 years old.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there any way I can debug this? I upgraded to Squid
>>>>>>>>>>>>> already, but that didn't solve the problem. I also had
>>>>>>>>>>>>> massive issues with this during the upgrade. Particularly at
>>>>>>>>>>>>> the end, when the MDS were upgraded, I had constant struggles
>>>>>>>>>>>>> with it. I had to set the noout flag and then literally sit
>>>>>>>>>>>>> next to it, resuming the upgrade every few minutes until it
>>>>>>>>>>>>> finally went through, because random MDS hosts went
>>>>>>>>>>>>> intermittently dark all the time.
>>>>>>>>>>>>>
>>>>>>>>>>>>> All hosts run Ubuntu 22.04 with kernel 6.8.0.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Any ideas? Thanks!
>>>>>>>>>>>>> Janek
>>>>>>>>> --
>>>>>>>>> Bauhaus-Universität Weimar
>>>>>>>>> Bauhausstr. 9a, R308
>>>>>>>>> 99423 Weimar, Germany
>>>>>>>>>
>>>>>>>>> Phone: +49 3643 58 3577
>>>>>>>>> www.webis.de

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
