> If anything strictly below 4GB is completely unsupported and expected to
> go into a thrashing tailspin, perhaps that doc should be updated to
> state that.

> > Angrily writing that a complex, mature, FREE system is “broken” because it 
> > doesn’t perform miracles when abused is folly, like expecting coffee to not 
> > be hot.
>
> Is there any reason to believe it's not broken besides "I set
> osd_memory_target to 2.4G"?
>
> If this repros with the 4GB setting on a node with enough RAM (one of
> the three does have as much), then is it broken?

We did not get a clear description of your system, but to draw a
parallel with fsck on Unix systems: a common recommendation there is to
have 1 GB of RAM per TB of disk (or 1 MB of RAM per GB of disk in the
old days) so that fsck can build its in-memory lists of files and
directories and whatever else it needs to validate a possibly broken
file system. If, in such a case, you have a RAM-limited system and a
huge drive, you can expect very poor progress: fsck either has to swap
heavily or may not finish at all. There is basically no guarantee that
fsck completes if you give it too few resources. That is very
unfortunate when you need it in order to boot, but sometimes the only
solution is to have enough RAM for the size of the disk (or for the
number of files on it).
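The rule of thumb above can be sketched in a few lines (a minimal
illustration; the 1 GB-per-TB figure is the common recommendation
mentioned, not an exact requirement, and the function name is my own):

```python
def fsck_ram_rule_of_thumb(disk_bytes: int) -> int:
    """Rough fsck RAM estimate: ~1 GB of RAM per TB of disk.

    Note this is the same 1:1000 ratio as the old "1 MB per GB" rule,
    just scaled up with disk sizes.
    """
    return disk_bytes // 1000

# e.g. a 16 TB disk suggests on the order of 16 GB of RAM for fsck
print(fsck_ram_rule_of_thumb(16 * 10**12))  # → 16000000000
```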

So while I'm not certain your tests ended up in the same place, with a
disk too large for the RAM you gave the process, it sounds possible.
Though one can run Ceph on a single RPi with a USB disk, it's still
not recommended, and if developer time is tight I would (as Eugen may
have been trying to say) rather not have them spend it fixing
RPi-sized cluster issues at the expense of other bugs.

These might well be actual, real bugs, but I'm happier if the
developers work on 65+ PB cluster bugs than on 1-2 GB RAM OSDs (to
take a recent example from the other end of the size spectrum),
because we run several multi-PB clusters and zero Raspberry Pi
clusters.

Over the years, people have tried to run "test" clusters on 1 GB
disks, 5 GB disks and so on, and they run into problems and bugs that
literally no one with a "real" cluster ever sees. The same applies
here: there are real bugs in the code stemming from the OSD assuming
it can preallocate several gigabytes for RocksDB and the like, and
that assumption is totally fine for everyone who has walked into a
store and bought a disk in the last 10+ years; all those disks have "a
few GB to spare". The machines that do not are basically not fit for
running clusters, being either RPis trying to run off SD cards or VMs
given too few resources. There are other options for serving files
from such small machines, but being part of a Ceph cluster might not
be on that list.
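For anyone following along, the knob from the quoted exchange is
osd_memory_target, whose default is 4 GiB. A minimal ceph.conf
fragment pinning it at that default might look like this (a sketch;
on managed clusters you would normally use `ceph config set osd
osd_memory_target ...` instead of editing the file):

```ini
[osd]
# 4 GiB, the shipped default; values well below this are the
# territory the thread is warning about
osd_memory_target = 4294967296
```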

-- 
May the most significant bit of your life be positive.
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]