A couple of things. You didn't run `ceph osd crush remove osd.20` and
`ceph osd crush remove osd.21` after doing the other bits. You will also
want to remove the bucket (i.e., the host) from the crush map, since it
will now be empty. Right now you have a host in the crush map with a
nonzero weight but no OSDs to put that data on. It has a weight because of
the 2 OSDs that are still in it: they were removed from the cluster but
not from the crush map. That's confusing to your cluster.

If you had removed the OSDs from the crush map when you ran the other
commands, the dead host would still have been in the crush map, but with a
weight of 0, and it wouldn't have caused any problems.
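Concretely, the cleanup would look roughly like this. Note that "deadhost"
is a placeholder for your actual host bucket name, which you can get from
`ceph osd tree`:

```shell
# Remove the two OSDs from the crush map
# (the auth keys and osd entries are already gone)
ceph osd crush remove osd.20
ceph osd crush remove osd.21

# Remove the now-empty host bucket from the crush map.
# "deadhost" is a placeholder; use the bucket name shown by "ceph osd tree".
ceph osd crush remove deadhost

# Verify: the dead host and its OSDs should no longer appear in the tree
ceph osd tree
```

Once the empty bucket is gone, CRUSH should stop trying to account for the
dead host and the remapped PGs should settle back to active+clean.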

On Wed, Jun 28, 2017 at 4:15 AM Jan Kasprzak <[email protected]> wrote:

>         Hello,
>
> TL;DR: what to do when my cluster reports stuck unclean pgs?
>
> Detailed description:
>
> One of the nodes in my cluster died. CEPH correctly rebalanced itself,
> and reached the HEALTH_OK state. I have looked at the failed server,
> and decided to take it out of the cluster permanently, because the hardware
> is indeed faulty. It used to host two OSDs, which were marked down and out
> in "ceph osd dump".
>
> So from the HEALTH_OK I ran the following commands:
>
> # ceph auth del osd.20
> # ceph auth del osd.21
> # ceph osd rm osd.20
> # ceph osd rm osd.21
>
> After that, CEPH started to rebalance itself, but now it reports some PGs
> as "stuck unclean", and there is no "recovery I/O" visible in "ceph -s":
>
> # ceph -s
>     cluster 3065224c-ea2e-4558-8a81-8f935dde56e5
>      health HEALTH_WARN
>             350 pgs stuck unclean
>             recovery 26/1596390 objects degraded (0.002%)
>             recovery 58772/1596390 objects misplaced (3.682%)
>      monmap e16: 3 mons at {...}
>             election epoch 584, quorum 0,1,2 ...
>      osdmap e61435: 58 osds: 58 up, 58 in; 350 remapped pgs
>             flags require_jewel_osds
>       pgmap v35959908: 3776 pgs, 6 pools, 2051 GB data, 519 kobjects
>             6244 GB used, 40569 GB / 46814 GB avail
>             26/1596390 objects degraded (0.002%)
>             58772/1596390 objects misplaced (3.682%)
>                 3426 active+clean
>                  349 active+remapped
>                    1 active
>   client io 5818 B/s rd, 8457 kB/s wr, 0 op/s rd, 71 op/s wr
>
> # ceph health detail
> HEALTH_WARN 350 pgs stuck unclean; recovery 26/1596390 objects degraded
> (0.002%); recovery 58772/1596390 objects misplaced (3.682%)
> pg 28.fa is stuck unclean for 14408925.966824, current state
> active+remapped, last acting [38,52,4]
> pg 28.e7 is stuck unclean for 14408925.966886, current state
> active+remapped, last acting [29,42,22]
> pg 23.dc is stuck unclean for 61698.641750, current state active+remapped,
> last acting [50,33,23]
> pg 23.d9 is stuck unclean for 61223.093284, current state active+remapped,
> last acting [54,31,23]
> pg 28.df is stuck unclean for 14408925.967120, current state
> active+remapped, last acting [33,7,15]
> pg 34.38 is stuck unclean for 60904.322881, current state active+remapped,
> last acting [18,41,9]
> pg 34.fe is stuck unclean for 60904.241762, current state active+remapped,
> last acting [58,1,44]
> [...]
> pg 28.8f is stuck unclean for 66102.059671, current state active, last
> acting [8,40,5]
> [...]
> recovery 26/1596390 objects degraded (0.002%)
> recovery 58772/1596390 objects misplaced (3.682%)
>
> Apart from that, the data stored in CEPH pools seems to be reachable
> and usable as before.
>
> The nodes run CentOS 7 and ceph 10.2.5 (RPMS downloaded from CEPH
> repository).
>
> What other debugging info should I provide, or what to do in order
> to unstuck the stuck pgs? Thanks!
>
> -Yenya
>
> --
> | Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}>
> |
> | http://www.fi.muni.cz/~kas/                         GPG: 4096R/A45477D5
> |
> > That's why this kind of vulnerability is a concern: deploying stuff is  <
> > often about collecting an obscene number of .jar files and pushing them <
> > up to the application server.                          --pboddie at LWN <
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>