Suggestions:

1. Figure out which OSDs are unsafe to stop.
2. Slowly restart every other OSD.
3. Figure out which PGs are degraded.
4. Use the "ceph osd pg-upmap-items" command to redirect their recovery to already-restarted OSDs.
5. At this point, the set of OSDs that are unsafe to restart should contain only already-restarted OSDs.
6. Restart the remaining OSDs.
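A rough sketch of how steps 1 and 4 could be scripted, in case it helps. The helper names (ceph_ok_to_stop, split_osds, upmap_items_command) are made up for illustration. It assumes the "ceph" CLI is on PATH and that "ceph osd ok-to-stop" exits with status 0 when stopping the OSD is safe. The upmap helper only builds the command line and does not run it, so you can review each mapping before applying it.

```python
import subprocess
from typing import Callable, Iterable, List, Sequence, Tuple


def ceph_ok_to_stop(osd_id: int) -> bool:
    """Ask the cluster whether this OSD can be stopped safely.

    Assumption: `ceph osd ok-to-stop <id>` exits 0 when it is safe
    and non-zero otherwise.
    """
    result = subprocess.run(
        ["ceph", "osd", "ok-to-stop", str(osd_id)],
        capture_output=True,
    )
    return result.returncode == 0


def split_osds(
    osd_ids: Iterable[int],
    ok_to_stop: Callable[[int], bool] = ceph_ok_to_stop,
) -> Tuple[List[int], List[int]]:
    """Step 1: partition OSD ids into (safe to stop, unsafe to stop)."""
    safe: List[int] = []
    unsafe: List[int] = []
    for osd_id in osd_ids:
        (safe if ok_to_stop(osd_id) else unsafe).append(osd_id)
    return safe, unsafe


def upmap_items_command(
    pgid: str, pairs: Sequence[Tuple[int, int]]
) -> List[str]:
    """Step 4: build `ceph osd pg-upmap-items <pgid> <from> <to> ...`.

    Each pair is (from_osd, to_osd); to_osd should be an OSD that
    has already been restarted.
    """
    argv = ["ceph", "osd", "pg-upmap-items", pgid]
    for from_osd, to_osd in pairs:
        argv += [str(from_osd), str(to_osd)]
    return argv


if __name__ == "__main__":
    # Dry run with a fake predicate instead of a live cluster:
    # pretend that only even-numbered OSDs are safe to stop.
    safe, unsafe = split_osds(range(4), ok_to_stop=lambda i: i % 2 == 0)
    print("safe:", safe, "unsafe:", unsafe)
    # PG id and OSD ids below are placeholders, not real cluster values.
    print(" ".join(upmap_items_command("1.2f", [(4, 7)])))
```

Re-run the safety check right before each restart, since the answer changes as recovery progresses.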
P.S. Not tested.

On Tue, Aug 19, 2025 at 5:31 PM Curt <[email protected]> wrote:
> Hello all,
>
> I'm sure this has been discussed before, but I can't seem to find it. I
> know on older versions of Ceph there was an issue with mclock having no
> recovery and switching to wpq fixed it. Is this still an issue with
> 19.2.1?
>
> I recently ran into this bug <https://tracker.ceph.com/issues/70390> and
> various issues with it. In order to help recovery I set norebalance flag,
> so it would focus solely on undersized PGs. The issue I'm seeing though is
> sometimes recovering will show nothing despite having
> X active+undersized+remapped+backfilling. Sometimes restarting a few OSD's
> will fix the issue and it will start again.
>
> I'm tempted to switch to wpq, but that would mean having to slowly restart
> each OSD, which with undersized would cause IO to stop while some OSD's are
> restarted. Wanted to get others' thoughts before making the change.
>
> Thanks,
> Curt
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
>

--
Alexander Patrakov
