I would: stop the service, mark the OSD down and out, `osd rm`, `auth del`, `crush remove`, disable the service, remove the fstab entry, umount.
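The order above could be sketched roughly as follows. This is only a sketch of the sequence described in this thread, not an authoritative procedure; the value of `OSD_ID` and the host bucket name `DEAD_HOST` are placeholder examples, and you would normally wait for any rebalancing to settle between steps.

```shell
#!/bin/bash
# Sketch of the removal order described above (placeholders, not a
# tested procedure). OSD_ID is the numeric OSD id; DEAD_HOST is the
# hypothetical crush bucket name of the host being retired.
OSD_ID=21
DEAD_HOST=deadhost

systemctl stop ceph-osd@${OSD_ID}.service   # stop the daemon first
ceph osd down ${OSD_ID}                     # mark it down
ceph osd out ${OSD_ID}                      # mark it out
ceph osd rm ${OSD_ID}                       # remove from the osd map
ceph auth del osd.${OSD_ID}                 # drop its cephx key
ceph osd crush remove osd.${OSD_ID}         # remove from the crush map
systemctl disable ceph-osd@${OSD_ID}.service
sed -i "/ceph-${OSD_ID}/d" /etc/fstab       # drop the mount entry
umount /var/lib/ceph/osd/ceph-${OSD_ID}

# Once the host bucket is empty, remove it too, as advised downthread:
ceph osd crush remove ${DEAD_HOST}
```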
So you did remove it from your crush map, then? Could you post your `ceph osd tree`?

On Wed, Jun 28, 2017, 10:12 AM Mazzystr <[email protected]> wrote:

> I've been using this procedure to remove OSDs...
>
> OSD_ID=
> ceph auth del osd.${OSD_ID}
> ceph osd down ${OSD_ID}
> ceph osd out ${OSD_ID}
> ceph osd rm ${OSD_ID}
> ceph osd crush remove osd.${OSD_ID}
> systemctl disable ceph-osd@${OSD_ID}.service
> systemctl stop ceph-osd@${OSD_ID}.service
> sed -i "/ceph-$OSD_ID/d" /etc/fstab
> umount /var/lib/ceph/osd/ceph-${OSD_ID}
>
> Would you say this is the correct order of events?
>
> Thanks!
>
> On Wed, Jun 28, 2017 at 9:34 AM, David Turner <[email protected]> wrote:
>
>> A couple things. You didn't `ceph osd crush remove osd.21` after doing
>> the other bits. Also you will want to remove the bucket (re: host) from
>> the crush map, as it will now be empty. Right now you have a host in the
>> crush map with a weight, but no OSDs to put that data on. It has a weight
>> because of the 2 OSDs that are still in it that were removed from the
>> cluster but not from the crush map. It's confusing to your cluster.
>>
>> If you had removed the OSDs from the crush map when you ran the other
>> commands, then the dead host would have still been in the crush map but
>> with a weight of 0 and wouldn't cause any problems.
>>
>> On Wed, Jun 28, 2017 at 4:15 AM Jan Kasprzak <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> TL;DR: what to do when my cluster reports stuck unclean PGs?
>>>
>>> Detailed description:
>>>
>>> One of the nodes in my cluster died. Ceph correctly rebalanced itself
>>> and reached the HEALTH_OK state. I have looked at the failed server,
>>> and decided to take it out of the cluster permanently, because the
>>> hardware is indeed faulty. It used to host two OSDs, which were marked
>>> down and out in "ceph osd dump".
>>>
>>> So from HEALTH_OK I ran the following commands:
>>>
>>> # ceph auth del osd.20
>>> # ceph auth del osd.21
>>> # ceph osd rm osd.20
>>> # ceph osd rm osd.21
>>>
>>> After that, Ceph started to rebalance itself, but now it reports some
>>> PGs as "stuck unclean", and there is no "recovery I/O" visible in "ceph -s":
>>>
>>> # ceph -s
>>>     cluster 3065224c-ea2e-4558-8a81-8f935dde56e5
>>>      health HEALTH_WARN
>>>             350 pgs stuck unclean
>>>             recovery 26/1596390 objects degraded (0.002%)
>>>             recovery 58772/1596390 objects misplaced (3.682%)
>>>      monmap e16: 3 mons at {...}
>>>             election epoch 584, quorum 0,1,2 ...
>>>      osdmap e61435: 58 osds: 58 up, 58 in; 350 remapped pgs
>>>             flags require_jewel_osds
>>>       pgmap v35959908: 3776 pgs, 6 pools, 2051 GB data, 519 kobjects
>>>             6244 GB used, 40569 GB / 46814 GB avail
>>>             26/1596390 objects degraded (0.002%)
>>>             58772/1596390 objects misplaced (3.682%)
>>>                 3426 active+clean
>>>                  349 active+remapped
>>>                    1 active
>>>   client io 5818 B/s rd, 8457 kB/s wr, 0 op/s rd, 71 op/s wr
>>>
>>> # ceph health detail
>>> HEALTH_WARN 350 pgs stuck unclean; recovery 26/1596390 objects degraded
>>> (0.002%); recovery 58772/1596390 objects misplaced (3.682%)
>>> pg 28.fa is stuck unclean for 14408925.966824, current state
>>> active+remapped, last acting [38,52,4]
>>> pg 28.e7 is stuck unclean for 14408925.966886, current state
>>> active+remapped, last acting [29,42,22]
>>> pg 23.dc is stuck unclean for 61698.641750, current state
>>> active+remapped, last acting [50,33,23]
>>> pg 23.d9 is stuck unclean for 61223.093284, current state
>>> active+remapped, last acting [54,31,23]
>>> pg 28.df is stuck unclean for 14408925.967120, current state
>>> active+remapped, last acting [33,7,15]
>>> pg 34.38 is stuck unclean for 60904.322881, current state
>>> active+remapped, last acting [18,41,9]
>>> pg 34.fe is stuck unclean for 60904.241762, current state
>>> active+remapped, last acting [58,1,44]
>>> [...]
>>> pg 28.8f is stuck unclean for 66102.059671, current state active,
>>> last acting [8,40,5]
>>> [...]
>>> recovery 26/1596390 objects degraded (0.002%)
>>> recovery 58772/1596390 objects misplaced (3.682%)
>>>
>>> Apart from that, the data stored in Ceph pools seems to be reachable
>>> and usable as before.
>>>
>>> The nodes run CentOS 7 and Ceph 10.2.5 (RPMs downloaded from the Ceph
>>> repository).
>>>
>>> What other debugging info should I provide, or what should I do in
>>> order to unstick the stuck PGs? Thanks!
>>>
>>> -Yenya
>>>
>>> --
>>> | Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
>>> | http://www.fi.muni.cz/~kas/                     GPG: 4096R/A45477D5 |
>>> > That's why this kind of vulnerability is a concern: deploying stuff is <
>>> > often about collecting an obscene number of .jar files and pushing them <
>>> > up to the application server.                     --pboddie at LWN <
>>> _______________________________________________
>>> ceph-users mailing list
>>> [email protected]
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
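Following David's diagnosis upthread (osd.20 and osd.21 were removed from the osd map but never from the crush map), the cleanup for the situation Jan describes would presumably look something like the following. This is a sketch only; the host bucket name `deadhost` is a placeholder for whatever `ceph osd tree` actually shows for the dead node.

```shell
# Sketch of a fix for the stuck-unclean state described above: remove
# the leftover crush entries for the already-deleted OSDs, then the
# now-empty host bucket. "deadhost" is a hypothetical bucket name --
# substitute the one shown by `ceph osd tree`.
ceph osd crush remove osd.20
ceph osd crush remove osd.21
ceph osd crush remove deadhost   # the empty host bucket

ceph -s                          # watch the remapped pgs become clean
```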
