I'm running a CephFS with an 8+2 EC data pool. The disks are spread over 10 hosts
and the failure domain is host. The version is Mimic 13.2.2. Today I added a few OSDs to
one of the hosts and observed that a lot of PGs became inactive even though 9
out of 10 hosts were up all the time. After getting the 10th host and all disks
up, I still ended up with a large amount of undersized PGs and degraded
objects, which I don't understand as no OSD was removed.
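For reference, the pool was set up with a profile along these lines (a sketch from memory; the profile and pool names below are placeholders, not the actual ones):

```shell
# Hypothetical reconstruction of the EC setup; "ec-8-2" and "cephfs_data"
# are placeholder names. 8 data + 2 coding chunks, at most one chunk per host.
ceph osd erasure-code-profile set ec-8-2 k=8 m=2 crush-failure-domain=host
ceph osd pool create cephfs_data 1024 1024 erasure ec-8-2
ceph osd pool set cephfs_data allow_ec_overwrites true
ceph osd pool set cephfs_data min_size 9
```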
Here are some details about the steps taken on the host with the new disks; the
main questions are summarized at the end:
- shut down OSDs (systemctl stop docker)
- reboot host (this is necessary due to OS deployment via warewulf)
Devices got renamed and not all disks came back up (4 OSDs remained down). This
is expected; I need to re-deploy the containers to adjust for the device name
changes. Around this point PGs started peering and some got stuck waiting for one of
the down OSDs. I don't understand why they didn't simply remain active with 9 out
of 10 shards. Up to the moment when some of the OSDs came back up, all PGs were active.
With min_size=9 I would expect all PGs to remain active as long as 9 out of the
10 hosts are untouched.
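Just to spell out the arithmetic behind that expectation (a minimal sketch, nothing Ceph-specific — only the rule that a PG serves I/O while at least min_size of its shards are up):

```shell
# Availability arithmetic for an 8+2 EC pool with failure domain = host:
# each PG has size = k + m shards, one per host, and stays active
# as long as the number of up shards is >= min_size.
k=8; m=2
size=$((k + m))      # 10 shards
min_size=9

up=$((size - 1))     # one host down -> 9 shards left
if [ "$up" -ge "$min_size" ]; then
    echo "active with $up/$size shards"
fi

up=$((size - 2))     # two hosts down -> 8 shards left
if [ "$up" -lt "$min_size" ]; then
    echo "inactive with $up/$size shards"
fi
```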
- redeploy docker containers
- all disks/OSDs come up, including the 4 OSDs from above
- inactive PGs complete peering and become active
- now I have a lot of degraded objects and undersized PGs, even though not a
single OSD was removed
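In case it matters: I did not set any special cluster flags around the reboot. A sketch of the usual maintenance pattern would have been (standard flags, shown here only for context):

```shell
# Typical planned-maintenance pattern (sketch): keep the down OSDs from
# being marked "out" and pause rebalancing while the host reboots.
ceph osd set noout
ceph osd set norebalance
# ... stop OSDs, reboot host, redeploy containers, wait for OSDs to rejoin ...
ceph osd unset norebalance
ceph osd unset noout
```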
I don't understand why I have degraded objects. I should just have misplaced
objects:
HEALTH_ERR
    22995992/145698909 objects misplaced (15.783%)
    Degraded data redundancy: 5213734/145698909 objects degraded (3.578%),
        208 pgs degraded, 208 pgs undersized
    Degraded data redundancy (low space): 169 pgs backfill_toofull
Note: backfill_toofull at such low utilization (usage: 38 TiB used, 1.5 PiB /
1.5 PiB avail) is a known issue in Ceph (https://tracker.ceph.com/issues/39555).
Also, I should be able to do whatever I want with 1 out of 10 hosts without
losing data access. What could be the problem here?
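In case the exact PG states help with diagnosing this, this is roughly how I'm inspecting things (the PG id below is a placeholder, not a real one from my cluster):

```shell
# Commands for inspecting the degraded/undersized PGs (sketch).
ceph health detail
ceph pg dump_stuck undersized
# Detailed peering and acting-set information for one affected PG
# ("1.2f" is a placeholder id):
ceph pg 1.2f query
```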
Questions summary:
Why does peering not succeed to keep all PGs active with 9 out of 10 OSDs up
and in?
Why do undersized PGs arise even though all OSDs are up?
Why do degraded objects arise even though no OSD was removed?
Thanks!
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]