I would: stop the service, mark the OSD down and out, `osd rm`, `auth del`, `crush remove`, disable the service, remove the fstab entry, umount.
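The order above could be sketched roughly as follows. This is only a sketch of the sequence described in this thread, not an authoritative procedure; the value of `OSD_ID` and the host bucket name `DEAD_HOST` are placeholder examples, and you would normally wait for any rebalancing to settle between steps.

```shell
#!/bin/bash
# Sketch of the removal order described above (placeholders, not a
# tested procedure). OSD_ID is the numeric OSD id; DEAD_HOST is the
# hypothetical crush bucket name of the host being retired.
OSD_ID=21
DEAD_HOST=deadhost

systemctl stop ceph-osd@${OSD_ID}.service   # stop the daemon first
ceph osd down ${OSD_ID}                     # mark it down
ceph osd out ${OSD_ID}                      # mark it out
ceph osd rm ${OSD_ID}                       # remove from the osd map
ceph auth del osd.${OSD_ID}                 # drop its cephx key
ceph osd crush remove osd.${OSD_ID}         # remove from the crush map
systemctl disable ceph-osd@${OSD_ID}.service
sed -i "/ceph-${OSD_ID}/d" /etc/fstab       # drop the mount entry
umount /var/lib/ceph/osd/ceph-${OSD_ID}

# Once the host bucket is empty, remove it too, as advised downthread:
ceph osd crush remove ${DEAD_HOST}
```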
So you did remove it from your crush map, then? Could you post your `ceph osd tree`?

On Wed, Jun 28, 2017, 10:12 AM Mazzystr <[email protected]> wrote:

> I've been using this procedure to remove OSDs...
>
> OSD_ID=
> ceph auth del osd.${OSD_ID}
> ceph osd down ${OSD_ID}
> ceph osd out ${OSD_ID}
> ceph osd rm ${OSD_ID}
> ceph osd crush remove osd.${OSD_ID}
> systemctl disable ceph-osd@${OSD_ID}.service
> systemctl stop ceph-osd@${OSD_ID}.service
> sed -i "/ceph-$OSD_ID/d" /etc/fstab
> umount /var/lib/ceph/osd/ceph-${OSD_ID}
>
> Would you say this is the correct order of events?
>
> Thanks!
>
> On Wed, Jun 28, 2017 at 9:34 AM, David Turner <[email protected]> wrote:
>
>> A couple things. You didn't `ceph osd crush remove osd.21` after doing
>> the other bits. Also you will want to remove the bucket (re: host) from
>> the crush map, as it will now be empty. Right now you have a host in the
>> crush map with a weight, but no OSDs to put that data on. It has a weight
>> because of the 2 OSDs that are still in it that were removed from the
>> cluster but not from the crush map. It's confusing to your cluster.
>>
>> If you had removed the OSDs from the crush map when you ran the other
>> commands, then the dead host would have still been in the crush map but
>> with a weight of 0 and wouldn't cause any problems.
>>
>> On Wed, Jun 28, 2017 at 4:15 AM Jan Kasprzak <[email protected]> wrote:
>>
>>> Hello,
>>>
>>> TL;DR: what to do when my cluster reports stuck unclean PGs?
>>>
>>> Detailed description:
>>>
>>> One of the nodes in my cluster died. Ceph correctly rebalanced itself
>>> and reached the HEALTH_OK state. I have looked at the failed server,
>>> and decided to take it out of the cluster permanently, because the
>>> hardware is indeed faulty. It used to host two OSDs, which were marked
>>> down and out in "ceph osd dump".
>>>
>>> So from HEALTH_OK I ran the following commands:
>>>
>>> # ceph auth del osd.20
>>> # ceph auth del osd.21
>>> # ceph osd rm osd.20
>>> # ceph osd rm osd.21
>>>
>>> After that, Ceph started to rebalance itself, but now it reports some
>>> PGs as "stuck unclean", and there is no "recovery I/O" visible in "ceph -s":
>>>
>>> # ceph -s
>>>     cluster 3065224c-ea2e-4558-8a81-8f935dde56e5
>>>      health HEALTH_WARN
>>>             350 pgs stuck unclean
>>>             recovery 26/1596390 objects degraded (0.002%)
>>>             recovery 58772/1596390 objects misplaced (3.682%)
>>>      monmap e16: 3 mons at {...}
>>>             election epoch 584, quorum 0,1,2 ...
>>>      osdmap e61435: 58 osds: 58 up, 58 in; 350 remapped pgs
>>>             flags require_jewel_osds
>>>       pgmap v35959908: 3776 pgs, 6 pools, 2051 GB data, 519 kobjects
>>>             6244 GB used, 40569 GB / 46814 GB avail
>>>             26/1596390 objects degraded (0.002%)
>>>             58772/1596390 objects misplaced (3.682%)
>>>                 3426 active+clean
>>>                  349 active+remapped
>>>                    1 active
>>>   client io 5818 B/s rd, 8457 kB/s wr, 0 op/s rd, 71 op/s wr
>>>
>>> # ceph health detail
>>> HEALTH_WARN 350 pgs stuck unclean; recovery 26/1596390 objects degraded
>>> (0.002%); recovery 58772/1596390 objects misplaced (3.682%)
>>> pg 28.fa is stuck unclean for 14408925.966824, current state
>>> active+remapped, last acting [38,52,4]
>>> pg 28.e7 is stuck unclean for 14408925.966886, current state
>>> active+remapped, last acting [29,42,22]
>>> pg 23.dc is stuck unclean for 61698.641750, current state
>>> active+remapped, last acting [50,33,23]
>>> pg 23.d9 is stuck unclean for 61223.093284, current state
>>> active+remapped, last acting [54,31,23]
>>> pg 28.df is stuck unclean for 14408925.967120, current state
>>> active+remapped, last acting [33,7,15]
>>> pg 34.38 is stuck unclean for 60904.322881, current state
>>> active+remapped, last acting [18,41,9]
>>> pg 34.fe is stuck unclean for 60904.241762, current state
>>> active+remapped, last acting [58,1,44]
>>> [...]
>>> pg 28.8f is stuck unclean for 66102.059671, current state active,
>>> last acting [8,40,5]
>>> [...]
>>> recovery 26/1596390 objects degraded (0.002%)
>>> recovery 58772/1596390 objects misplaced (3.682%)
>>>
>>> Apart from that, the data stored in Ceph pools seems to be reachable
>>> and usable as before.
>>>
>>> The nodes run CentOS 7 and Ceph 10.2.5 (RPMs downloaded from the Ceph
>>> repository).
>>>
>>> What other debugging info should I provide, or what should I do in
>>> order to unstick the stuck PGs? Thanks!
>>>
>>> -Yenya
>>>
>>> --
>>> | Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
>>> | http://www.fi.muni.cz/~kas/                     GPG: 4096R/A45477D5 |
>>> > That's why this kind of vulnerability is a concern: deploying stuff is <
>>> > often about collecting an obscene number of .jar files and pushing them <
>>> > up to the application server.                     --pboddie at LWN <
>>> _______________________________________________
>>> ceph-users mailing list
>>> [email protected]
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
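Following David's diagnosis upthread (osd.20 and osd.21 were removed from the osd map but never from the crush map), the cleanup for the situation Jan describes would presumably look something like the following. This is a sketch only; the host bucket name `deadhost` is a placeholder for whatever `ceph osd tree` actually shows for the dead node.

```shell
# Sketch of a fix for the stuck-unclean state described above: remove
# the leftover crush entries for the already-deleted OSDs, then the
# now-empty host bucket. "deadhost" is a hypothetical bucket name --
# substitute the one shown by `ceph osd tree`.
ceph osd crush remove osd.20
ceph osd crush remove osd.21
ceph osd crush remove deadhost   # the empty host bucket

ceph -s                          # watch the remapped pgs become clean
```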
