> 'ceph osd ok-to-stop' is a safety check, nothing more.
Oh OK, I thought you could pass 'true' to ignore PGs becoming inactive.
This is the output for one OSD; all of my OSDs are unsafe to stop (see the loop sketched after the error below).
root@ceph-monitor-1:/# ceph osd ok-to-stop 1
{"ok_to_stop":false,"osds":[1],"num_ok_pgs":30,"num_not_ok_pgs":170,"bad_become_inactive":["2.0","2.2","2.3","2.4","2.7","2.f","2.11","2.14","2.17","2.18","2.19","2.1a","2.1d","3.2","3.5","3.a","3.d","7.0","7.6","7.7","7.8","7.b","7.d","7.11","7.16","7.17","7.19","7.1f","10.2","10.9","10.b","10.c","10.d","10.e","10.10","10.11","10.14","10.1a","10.1b","10.1d","10.1f","11.1","11.4","11.5","11.7","11.9","11.a","11.c","11.d","11.e","11.11","11.13","11.14","11.15","11.16","11.1a","11.1b","11.1e","15.1","15.2","15.9","15.a","15.b","15.d","15.f","15.10","15.11","15.13","15.14","15.16","15.17","15.19","15.1c","15.1d","15.1f","16.2","16.3","16.5","16.6","16.9","16.c","16.d","16.f","16.18","16.19","16.1b","16.1c","16.1d","16.1e","17.6","17.a","17.b","17.c","17.d","17.10","17.11","17.18","17.1f","25.2","25.3","25.5","25.7","25.8","25.b","25.10","25.11","25.13","25.14","25.17","25.19","25.1a","25.1b","25.21","25.22","25.23","25.25","25.27","25.2a","25.2b","25.2e","25.31","25.35","25.37","25.3d","26.1","26.8","26.9","26.a","26.f","26.10","26.14","26.16","26.1a","26.1e","27.0","27.1","27.2","27.4","27.5","27.6","27.c","27.d","27.f","28.2","28.3","28.6","28.7","28.a","28.10","28.14","28.18","28.1a","28.1d","28.1e","28.1f","28.20","28.21","28.25","28.26","28.2a","28.2b","28.2c","28.2d","28.2f","28.33","28.34","28.37","28.3a","28.3d","28.3f"],"ok_become_degraded":["4.0","4.3","4.6","4.8","4.a","4.b","4.d","4.e","4.11","4.13","4.14","4.15","4.17","4.18","4.1b","4.1d","4.1f","18.0","18.2","18.6","18.c","18.d","18.e","18.14","18.16","18.17","18.19","18.1a","18.1d","18.1e"]}
Error EBUSY: unsafe to stop osd(s) at this time (170 PGs are or would become offline)
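To confirm it is really every OSD, a quick loop over all IDs returns the same EBUSY (a minimal sketch; 'ceph osd ls' enumerates the OSD IDs):

root@ceph-monitor-1:/# for id in $(ceph osd ls); do ceph osd ok-to-stop $id; done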
root@ceph-monitor-1:/# ceph osd pool ls detail
pool 1 '.mgr' replicated size 3 min_size 3 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 886201 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr read_balance_score 6.98
pool 2 'rbd' replicated size 3 min_size 3 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 670923 lfor 0/1645/1643 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 2.19
pool 3 'cephfs.toto-fs.meta' replicated size 3 min_size 3 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 886203 lfor 0/0/64 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs read_balance_score 2.19
pool 4 'cephfs.toto-fs.data' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 3008 lfor 0/3008/3006 flags hashpspool,bulk max_bytes 32212254720 stripe_width 0 application cephfs read_balance_score 1.53
pool 7 '.nfs' replicated size 3 min_size 3 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 886199 lfor 0/0/140 flags hashpspool stripe_width 0 application nfs read_balance_score 1.53
pool 10 'prbd' replicated size 3 min_size 3 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 670919 lfor 0/0/833 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 1.53
pool 11 '.rgw.root' replicated size 3 min_size 3 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 886205 lfor 0/0/833 flags hashpspool stripe_width 0 application rgw read_balance_score 1.53
pool 15 'testzone.rgw.log' replicated size 3 min_size 3 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 886225 lfor 0/0/1153 flags hashpspool stripe_width 0 application rgw read_balance_score 1.97
pool 16 'testzone.rgw.control' replicated size 3 min_size 3 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 886213 lfor 0/0/1153 flags hashpspool stripe_width 0 application rgw read_balance_score 1.53
pool 17 'testzone.rgw.meta' replicated size 3 min_size 3 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 886215 lfor 0/0/1875 flags hashpspool stripe_width 0 pg_autoscale_bias 4 application rgw read_balance_score 1.53
pool 18 'testzone.rgw.buckets.index' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 1965 lfor 0/0/1877 flags hashpspool stripe_width 0 pg_autoscale_bias 4 application rgw read_balance_score 1.31
pool 25 'pool_VM' replicated size 3 min_size 3 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 886217 lfor 0/0/885938 flags hashpspool,selfmanaged_snaps max_bytes 107374182400 stripe_width 0 target_size_bytes 536870912000 application rbd read_balance_score 1.64
pool 26 'k8s' replicated size 3 min_size 3 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 886233 lfor 0/0/886231 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd read_balance_score 1.53
pool 27 'cephfs.testfs.meta' replicated size 3 min_size 3 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 886221 lfor 0/0/885939 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs read_balance_score 1.75
pool 28 'cephfs.testfs.data' replicated size 3 min_size 3 crush_rule 0 object_hash rjenkins pg_num 64 pgp_num 64 autoscale_mode on last_change 886242 lfor 0/0/886240 flags hashpspool,bulk stripe_width 0 application cephfs read_balance_score 1.31
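Side note after rereading the two outputs together: every pool listed under bad_become_inactive above has min_size 3 together with size 3, so losing a single replica already drops those PGs below min_size and makes them inactive; the two pools with min_size 2 (4 and 18) only appear under ok_become_degraded. A minimal sketch of how one could list the affected pools and, if min_size 2 is acceptable here, relax it for the duration of the upgrade (jq assumed available; 'rbd' is just an example pool):

root@ceph-monitor-1:/# ceph osd pool ls detail -f json | jq -r '.[] | select(.min_size == .size) | .pool_name'
root@ceph-monitor-1:/# ceph osd pool set rbd min_size 2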
root@ceph-monitor-1:/# ceph osd df tree
ID   CLASS  WEIGHT   REWEIGHT  SIZE     RAW USE  DATA    OMAP     META     AVAIL    %USE  VAR   PGS  STATUS  TYPE NAME
 -1         6.63879         -  6.4 TiB  104 GiB  95 GiB  164 KiB  9.6 GiB  6.3 TiB  1.60  1.00    -          root inist
-15         0.90970         -  932 GiB   12 GiB  11 GiB   35 KiB  889 MiB  920 GiB  1.29  0.80    -              host ceph-monitor-1
  6    hdd  0.90970   1.00000  932 GiB   12 GiB  11 GiB   35 KiB  889 MiB  920 GiB  1.29  0.80  197      up          osd.6
-21         5.72910         -  5.5 TiB   92 GiB  84 GiB  129 KiB  8.7 GiB  5.4 TiB  1.65  1.03    -              datacenter dcml
-20         2.72910         -  2.7 TiB   46 GiB  42 GiB   62 KiB  4.0 GiB  2.7 TiB  1.64  1.02    -                  room it02
-19         2.72910         -  2.7 TiB   46 GiB  42 GiB   62 KiB  4.0 GiB  2.7 TiB  1.64  1.02    -                      row left
-18         2.72910         -  2.7 TiB   46 GiB  42 GiB   62 KiB  4.0 GiB  2.7 TiB  1.64  1.02    -                          rack 10
 -3         0.90970         -  932 GiB   14 GiB  13 GiB   17 KiB  1.2 GiB  918 GiB  1.50  0.94    -                              host ceph-node-1
  2    hdd  0.90970   1.00000  932 GiB   14 GiB  13 GiB   17 KiB  1.2 GiB  918 GiB  1.50  0.94  201      up                          osd.2
 -5         0.90970         -  932 GiB   16 GiB  14 GiB   28 KiB  1.6 GiB  916 GiB  1.70  1.06    -                              host ceph-node-2
  1    hdd  0.90970   1.00000  932 GiB   16 GiB  14 GiB   28 KiB  1.6 GiB  916 GiB  1.70  1.06  200      up                          osd.1
 -9         0.90970         -  932 GiB   16 GiB  15 GiB   17 KiB  1.2 GiB  915 GiB  1.72  1.07    -                              host ceph-node-3
  5    hdd  0.90970   1.00000  932 GiB   16 GiB  15 GiB   17 KiB  1.2 GiB  915 GiB  1.72  1.07  200      up                          osd.5
-36         3.00000         -  2.7 TiB   47 GiB  42 GiB   67 KiB  4.8 GiB  2.7 TiB  1.67  1.04    -                  room it06
-35         3.00000         -  2.7 TiB   47 GiB  42 GiB   67 KiB  4.8 GiB  2.7 TiB  1.67  1.04    -                      row left06
-34         3.00000         -  2.7 TiB   47 GiB  42 GiB   67 KiB  4.8 GiB  2.7 TiB  1.67  1.04    -                          rack 08
 -7         1.00000         -  932 GiB   15 GiB  13 GiB   21 KiB  1.6 GiB  917 GiB  1.57  0.98    -                              host ceph-node-4
  0    hdd  1.00000   1.00000  932 GiB   15 GiB  13 GiB   21 KiB  1.6 GiB  917 GiB  1.57  0.98  207      up                          osd.0
-13         1.00000         -  932 GiB   15 GiB  14 GiB   31 KiB  1.6 GiB  916 GiB  1.63  1.02    -                              host ceph-node-5
  3    hdd  1.00000   1.00000  932 GiB   15 GiB  14 GiB   31 KiB  1.6 GiB  916 GiB  1.63  1.02  217      up                          osd.3
-11         1.00000         -  932 GiB   17 GiB  15 GiB   15 KiB  1.5 GiB  915 GiB  1.80  1.12    -                              host ceph-node-6
  4    hdd  1.00000   1.00000  932 GiB   17 GiB  15 GiB   15 KiB  1.5 GiB  915 GiB  1.80  1.12  221      up                          osd.4
                     TOTAL     6.4 TiB  104 GiB  95 GiB  168 KiB  9.6 GiB  6.3 TiB  1.60
MIN/MAX VAR: 0.80/1.12  STDDEV: 0.16
Vivien
________________________________
From: Eugen Block <[email protected]>
Sent: Monday, August 18, 2025 12:37:14
To: [email protected]
Subject: [ceph-users] Re: Ceph upgrade OSD unsafe to stop
Hi,
'ceph osd ok-to-stop' is a safety check, nothing more. It basically
checks whether PGs would become inactive if you stopped said OSD, or
whether those PGs would only become degraded. Which OSD reports that
it's unsafe to stop? Can you paste the output of 'ceph osd ok-to-stop
<OSD_ID>'? And with that also 'ceph osd pool ls detail', to see which
pool(s) is/are affected. 'ceph osd df tree' can also be useful here.
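If I remember correctly, you can also pass several OSD IDs in one
call, to check whether a set of OSDs could be stopped together, e.g.:

ceph osd ok-to-stop 1 2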
Regards,
Eugen
Quoting "GLE, Vivien" <[email protected]>:
> Hi,
>
>
> I'm trying to update my cluster (19.2.2 -> 19.2.3), mon and mgr
> upgrade goes well but I had some issue with OSD :
>
>
> Upgrade: unsafe to stop osd(s) at this time (165 PGs are or would become offline)
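>
> Side note: assuming this is a cephadm orchestrator upgrade (the 'Upgrade:'
> health message above suggests so), it can be inspected and paused while
> investigating, using the standard orchestrator commands:
>
> ceph orch upgrade status
> ceph orch upgrade pause
> ceph orch upgrade resume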
>
>
> The cluster is in HEALTH_OK.
>
> All pools are replica 3 and all PGs are active+clean.
>
> The autoscaler is off, following the Ceph docs.
>
> Does 'ceph osd ok-to-stop' lead to data loss?
>
> The only rule used in the cluster is replicated_rule:
>
> root@ceph-monitor-1:/# ceph osd crush rule dump replicated_rule
>
> {
>     "rule_id": 0,
>     "rule_name": "replicated_rule",
>     "type": 1,
>     "steps": [
>         {
>             "op": "take",
>             "item": -1,
>             "item_name": "inist"
>         },
>         {
>             "op": "chooseleaf_firstn",
>             "num": 0,
>             "type": "host"
>         },
>         {
>             "op": "emit"
>         }
>     ]
> }
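>
> In case it helps, placements produced by this rule can be simulated with
> crushtool (a sketch; /tmp/crushmap.bin is just a scratch file):
>
> ceph osd getcrushmap -o /tmp/crushmap.bin
> crushtool --test -i /tmp/crushmap.bin --rule 0 --num-rep 3 --show-mappings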
>
> root@ceph-monitor-1:/# ceph osd tree
> ID   CLASS  WEIGHT   TYPE NAME                        STATUS  REWEIGHT  PRI-AFF
>  -1         6.63879  root inist
> -15         0.90970      host ceph-monitor-1
>   6    hdd  0.90970          osd.6                    up       1.00000  1.00000
> -21         5.72910      datacenter bat1
> -20         2.72910          room room01
> -19         2.72910              row left
> -18         2.72910                  rack 10
>  -3         0.90970                      host ceph-node-1
>   2    hdd  0.90970                          osd.2    up       1.00000  1.00000
>  -5         0.90970                      host ceph-node-2
>   1    hdd  0.90970                          osd.1    up       1.00000  1.00000
>  -9         0.90970                      host ceph-node-3
>   5    hdd  0.90970                          osd.5    up       1.00000  1.00000
> -36         3.00000          room room03
> -35         3.00000              row left06
> -34         3.00000                  rack 08
>  -7         1.00000                      host ceph-node-4
>   0    hdd  1.00000                          osd.0    up       1.00000  1.00000
> -13         1.00000                      host ceph-node-5
>   3    hdd  1.00000                          osd.3    up       1.00000  1.00000
> -11         1.00000                      host ceph-node-6
>   4    hdd  1.00000                          osd.4    up       1.00000  1.00000
>
> Thanks !
>
> Vivien
>
>
>
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]