Hi,
After running the cluster for years, this is the first time I have a problem
that seems to require some expert knowledge ;-)
I had some network problems, which I suspect led to the damaged MDS
service. Just as a side note: I upgraded to 17.2.2, but it had been running for
about an hour before the network outage, so I don't think the upgrade was related.
However, now I wonder how to go on from here.
```
# ceph -s
  cluster:
    health: HEALTH_ERR
            1 filesystem is degraded
            1 filesystem is offline
            1 mds daemon damaged

  services:
    mon: 5 daemons, quorum ceph3,ceph4,ceph1,ceph5,ceph2 (age 46m)
    mgr: ceph5.zmvagf(active, since 5h), standbys: ceph2.defhpj
    mds: 0/1 daemons up, 2 standby
    osd: 10 osds: 10 up (since 46m), 10 in (since 2h)

  data:
    volumes: 0/1 healthy, 1 recovering; 1 damaged
    pools:   4 pools, 193 pgs
    objects: 5.42M objects, 13 TiB
    usage:   26 TiB used, 11 TiB / 37 TiB avail
    pgs:     193 active+clean

  io:
    client: 200 KiB/s rd, 84 KiB/s wr, 199 op/s rd, 169 op/s wr
```
```
# ceph health detail
HEALTH_ERR 1 filesystem is degraded; 1 filesystem is offline; 1 mds daemon damaged
[WRN] FS_DEGRADED: 1 filesystem is degraded
    fs cephfs is degraded
[ERR] MDS_ALL_DOWN: 1 filesystem is offline
    fs cephfs is offline because no MDS is active for it.
[ERR] MDS_DAMAGE: 1 mds daemon damaged
    fs cephfs mds.0 is damaged
```
```
# ceph fs status
cephfs - 0 clients
======
RANK  STATE   MDS  ACTIVITY  DNS  INOS  DIRS  CAPS
 0    failed
        POOL            TYPE     USED   AVAIL
cephfs.cephfs.meta  metadata  1293M  1954G
cephfs.cephfs.data    data    2797G  1954G
      ecpool          data    23.2T  2931G
STANDBY MDS
cephfs.ceph1.yzqmuo
cephfs.ceph3.vmieie
MDS version: ceph version 19.2.2 (0eceb0defba60152a8182f7bd87d164b639885b8) squid (stable)
```
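For completeness: which rank is marked damaged can also be seen in the FSMap
(my own check; the health detail above already names mds.0):
```sh
ceph fs dump | grep -i damaged
```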
Based on
* https://docs.ceph.com/en/latest/cephfs/disaster-recovery-experts/
I came up with the following procedure.
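Before running any of the steps below, I would first check how badly the
journal is actually damaged (this check is my own addition, using the
inspection mode of the same tool):
```sh
cephfs-journal-tool --rank=cephfs:0 journal inspect
```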
Deny reconnects from existing clients:
```sh
ceph config set mds mds_deny_all_reconnect true
```
Deny new client sessions:
```sh
ceph fs set cephfs refuse_client_session true
```
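Since two standby daemons are still around, I assume I should also keep them
from trying to take over rank 0 while the journal tools run (my assumption;
the docs only say the tools must not run against an active MDS):
```sh
ceph fs set cephfs joinable false
```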
Back up the current journal:
```sh
mkdir /root/mds-damaged
cephfs-journal-tool --rank=cephfs:0 journal export /root/mds-damaged/backup-rank0.bin
```
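In case the export fails because the journal is too damaged to walk, my
fallback idea would be to copy the raw journal objects out of the metadata
pool instead. As far as I understand, the rank 0 journal lives in objects
prefixed 200. (inode 0x200 + rank), but please correct me if that is wrong:
```sh
# Untested fallback: copy the raw rank-0 journal objects to local files.
for obj in $(rados -p cephfs.cephfs.meta ls | grep '^200\.'); do
    rados -p cephfs.cephfs.meta get "$obj" "/root/mds-damaged/raw-$obj"
done
```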
Recover file metadata from the journal, discarding damaged entries:
```sh
cephfs-journal-tool --rank=cephfs:0 event recover_dentries summary
```
Truncate any journal that is corrupt or that an MDS cannot replay:
```sh
cephfs-journal-tool --rank=cephfs:0 journal reset --yes-i-really-really-mean-it
```
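As a sanity check (again my own addition), I would expect the journal to be
readable again after the reset:
```sh
cephfs-journal-tool --rank=cephfs:0 journal inspect
```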
Reset the SessionMap:
```sh
cephfs-table-tool all reset session
```
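If I read the docs right, after these steps I would still have to mark the
rank as repaired so a standby can claim it, scrub for leftover damage once an
MDS is active, and finally undo the client lockout. Roughly like this (the
scrub syntax is taken from the scrub docs, untested by me):
```sh
# clear the damaged flag so a standby can claim rank 0
ceph mds repaired cephfs:0
ceph fs set cephfs joinable true

# once an MDS is active again, look for (and repair) leftover damage
ceph tell mds.cephfs:0 scrub start / recursive,repair

# let clients back in
ceph config set mds mds_deny_all_reconnect false
ceph fs set cephfs refuse_client_session false
```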
Does this make sense?
Yours,
bbk