Hi Andras,
To me it looks like osd.0 is not peering when it starts with crush weight 0.
I would try forcing the re-peering with `ceph osd down osd.0` while the
PGs are unexpectedly degraded. (E.g. start the OSD while its crush weight
is 0, observe that the PGs are still degraded, then force the re-peering
-- does it help?)
Otherwise I agree, to me this is unexpected behaviour -- maybe open a ticket?
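Roughly, the check I have in mind could be scripted like this. It is only a sketch: `repeer_if_degraded` is a made-up helper name, and it simply greps the `ceph -s` output for "degraded", as it appears in your status dumps, rather than inspecting individual PGs.

```shell
#!/bin/sh
# Sketch only: if the cluster still reports degraded PGs after the OSD has
# been started, mark that OSD down in the osdmap to force it to re-peer.
# repeer_if_degraded is a hypothetical helper, not a Ceph command.
repeer_if_degraded() {
    osd="$1"
    # Assumption: any "degraded" in `ceph -s` means the PGs did not recover.
    if ceph -s | grep -q 'degraded'; then
        echo "PGs still degraded; marking $osd down to force re-peering"
        ceph osd down "$osd"
    else
        echo "no degraded PGs reported; nothing to do"
    fi
}
```

Run it as `repeer_if_degraded osd.0` a little while after `systemctl start ceph-osd@0`, then check `ceph -s` again. The OSD daemon keeps running; marking it down should only make it re-assert itself and go through peering again.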
Cheers, Dan
P.S. For some reason all of your mails are repeatedly landing in my
spam folder. I think this is the reason:
ARC-Authentication-Results: i=1; mx.google.com;
    dkim=neutral (body hash did not verify) [email protected] header.s=google header.b=NvX+wag9;
    spf=fail (google.com: domain of [email protected] does not designate 217.70.178.232 as permitted sender) [email protected];
    dmarc=fail (p=REJECT sp=REJECT dis=QUARANTINE) header.from=flatironinstitute.org

On Mon, May 18, 2020 at 10:26 PM Andras Pataki
<[email protected]> wrote:
>
> In a recent cluster reorganization, we ended up with a lot of
> undersized/degraded PGs and a day of recovery from them, when all we
> expected was moving some data around. After retracing my steps, I found
> something odd. If I crush reweight an OSD to 0 while it is down - it
> results in the PGs of that OSD ending up degraded even after the OSD is
> restarted. If I do the same reweighting while the OSD is up - data gets
> moved without any degraded/undersized states. I would not expect this -
> so I wonder if this is a bug or is somehow intended. This is on ceph
> Nautilus 14.2.8. Below are the details.
>
> Andras
>
>
> First the case that works as I would expect:
>
> # Healthy cluster ...
> [root@xorphosd00 ~]# ceph -s
> cluster:
> id: 86d8a1b9-761b-4099-a960-6a303b951236
> health: HEALTH_WARN
> noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>
> services:
> mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
> mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
> mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
> osd: 270 osds: 270 up (since 2m), 270 in (since 4h)
> flags noout,nobackfill,noscrub,nodeep-scrub
>
> data:
> pools: 4 pools, 5312 pgs
> objects: 75.87M objects, 287 TiB
> usage: 864 TiB used, 1.1 PiB / 1.9 PiB avail
> pgs: 5312 active+clean
>
> # Reweight an OSD to 0
> [root@xorphosd00 ~]# ceph osd crush reweight osd.0 0.0
> reweighted item id 0 name 'osd.0' to 0 in crush map
>
> # Crush map changes - data movement is set up, no degraded PGs:
> [root@xorphosd00 ~]# ceph -s
> cluster:
> id: 86d8a1b9-761b-4099-a960-6a303b951236
> health: HEALTH_WARN
> noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>
> services:
> mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
> mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
> mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
> osd: 270 osds: 270 up (since 10m), 270 in (since 5h); 175 remapped pgs
> flags noout,nobackfill,noscrub,nodeep-scrub
>
> data:
> pools: 4 pools, 5312 pgs
> objects: 75.87M objects, 287 TiB
> usage: 864 TiB used, 1.1 PiB / 1.9 PiB avail
> pgs: 2562045/232996662 objects misplaced (1.100%)
> 5137 active+clean
> 172 active+remapped+backfilling
> 3 active+remapped+backfill_wait
>
> # Reweight it back to the original weight
> [root@xorphosd00 ~]# ceph osd crush reweight osd.0 8.0
> reweighted item id 0 name 'osd.0' to 8 in crush map
>
> # Cluster goes back to clean
> [root@xorphosd00 ~]# ceph -s
> cluster:
> id: 86d8a1b9-761b-4099-a960-6a303b951236
> health: HEALTH_WARN
> noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>
> services:
> mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
> mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
> mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
> osd: 270 osds: 270 up (since 11m), 270 in (since 5h)
> flags noout,nobackfill,noscrub,nodeep-scrub
>
> data:
> pools: 4 pools, 5312 pgs
> objects: 75.87M objects, 287 TiB
> usage: 864 TiB used, 1.1 PiB / 1.9 PiB avail
> pgs: 5312 active+clean
>
>
>
>
> #
> # Now the problematic case
> #
>
> # Stop an OSD
> [root@xorphosd00 ~]# systemctl stop ceph-osd@0
>
> # We get degraded PGs - as expected
> [root@xorphosd00 ~]# ceph -s
> cluster:
> id: 86d8a1b9-761b-4099-a960-6a303b951236
> health: HEALTH_WARN
> noout,nobackfill,noscrub,nodeep-scrub flag(s) set
> 1 osds down
> Degraded data redundancy: 873964/232996662 objects degraded (0.375%), 82 pgs degraded
>
> services:
> mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
> mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
> mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
> osd: 270 osds: 269 up (since 16s), 270 in (since 5h)
> flags noout,nobackfill,noscrub,nodeep-scrub
>
> data:
> pools: 4 pools, 5312 pgs
> objects: 75.87M objects, 287 TiB
> usage: 864 TiB used, 1.1 PiB / 1.9 PiB avail
> pgs: 873964/232996662 objects degraded (0.375%)
> 5230 active+clean
> 82 active+undersized+degraded
>
> # Reweight the OSD to 0:
> [root@xorphosd00 ~]# ceph osd crush reweight osd.0 0.0
> reweighted item id 0 name 'osd.0' to 0 in crush map
>
> # Still degraded - as expected
> [root@xorphosd00 ~]# ceph -s
> cluster:
> id: 86d8a1b9-761b-4099-a960-6a303b951236
> health: HEALTH_WARN
> noout,nobackfill,noscrub,nodeep-scrub flag(s) set
> 1 osds down
> Degraded data redundancy: 873964/232996662 objects degraded (0.375%), 82 pgs degraded
>
> services:
> mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
> mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
> mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
> osd: 270 osds: 269 up (since 59s), 270 in (since 5h); 175 remapped pgs
> flags noout,nobackfill,noscrub,nodeep-scrub
>
> data:
> pools: 4 pools, 5312 pgs
> objects: 75.87M objects, 287 TiB
> usage: 864 TiB used, 1.1 PiB / 1.9 PiB avail
> pgs: 873964/232996662 objects degraded (0.375%)
> 1688081/232996662 objects misplaced (0.725%)
> 5137 active+clean
> 93 active+remapped+backfilling
> 82 active+undersized+degraded+remapped+backfilling
>
> # Restarting the OSD
> [root@xorphosd00 ~]# systemctl start ceph-osd@0
>
> # And the PGs still stay degraded - THIS IS UNEXPECTED!!!
> [root@xorphosd00 ~]# ceph -s
> cluster:
> id: 86d8a1b9-761b-4099-a960-6a303b951236
> health: HEALTH_WARN
> noout,nobackfill,noscrub,nodeep-scrub flag(s) set
> Degraded data redundancy: 873964/232996662 objects degraded (0.375%), 82 pgs degraded
>
> services:
> mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
> mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
> mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
> osd: 270 osds: 270 up (since 14s), 270 in (since 5h); 175 remapped pgs
> flags noout,nobackfill,noscrub,nodeep-scrub
>
> data:
> pools: 4 pools, 5312 pgs
> objects: 75.87M objects, 287 TiB
> usage: 864 TiB used, 1.1 PiB / 1.9 PiB avail
> pgs: 873964/232996662 objects degraded (0.375%)
> 1688081/232996662 objects misplaced (0.725%)
> 5137 active+clean
> 93 active+remapped+backfilling
> 82 active+undersized+degraded+remapped+backfilling
>
> # Now for something even more odd - reweight the OSD back to its original weight
> # and all the data gets magically FOUND again on that OSD!!!
> [root@xorphosd00 ~]# ceph osd crush reweight osd.0 8.0
> reweighted item id 0 name 'osd.0' to 8 in crush map
> [root@xorphosd00 ~]# ceph -s
> cluster:
> id: 86d8a1b9-761b-4099-a960-6a303b951236
> health: HEALTH_WARN
> noout,nobackfill,noscrub,nodeep-scrub flag(s) set
>
> services:
> mon: 3 daemons, quorum xorphmon00,xorphmon01,xorphmon02 (age 11d)
> mgr: xorphmon01(active, since 6w), standbys: xorphmon02, xorphmon00
> mds: cephfs:1 {0=xorphmon02=up:active} 1 up:standby
> osd: 270 osds: 270 up (since 51s), 270 in (since 5h)
> flags noout,nobackfill,noscrub,nodeep-scrub
>
> data:
> pools: 4 pools, 5312 pgs
> objects: 75.87M objects, 287 TiB
> usage: 864 TiB used, 1.1 PiB / 1.9 PiB avail
> pgs: 5312 active+clean
>
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]