Congrats on figuring this out, Hector! This is a huge find! Comments below.
On 8/11/25 4:31 AM, Hector Martin wrote:
For those who have been following along, I figured it out. I left all the details with Mark on Slack, but TL;DR: The fix is *either one* (or both works too) of these:

  ceph config set osd rocksdb_cache_index_and_filter_blocks false
      (Ceph default: true, RocksDB default: false)
  ceph config set osd rocksdb_cache_shard_bits 0   (or 1)
      (Ceph default: 4, RocksDB default: variable but probably 6)

(Restart all OSDs after changing these settings for them to take effect, I don't think they apply live.)

The side effect of the first option is that it might increase unaccounted heap OSD memory usage (not managed by the cache autosize code), as filter blocks are preloaded and cached outside the block cache. The side effect of the second option is that it reduces the parallelism/core scalability of the block cache. I suspect the impact of either will be generally small for most typical deployments, and the benefit of not having horrible snaptrim thrashing far outweighs it for those affected.

Both options were changed from their RocksDB defaults in Ceph commits long ago without any explanation / relevant commit messages, so I suspect this is a case of "some people flipped some knobs that seemed harmless and nobody really tested their impact". Other things have changed since then too, so I can't say when the impact likely began, as there are a lot of other factors involved.
I'll take ownership of that one since I was the one that flipped those knobs. ;)
FWIW, a lot of the discussion around these issues happened back in the bluestore standup in those days. I recall that the idea behind these changes was that we didn't want index and filter blocks to be able to consume memory outside the context of the cache. I.e., at the time we were concerned with runaway OSD memory usage, so having index and filter blocks cached with high (pinned!) priority in the block cache was preferable to having memory allocated for them without bound and with no oversight inside RocksDB. The testing we did at the time didn't involve scenarios where the RocksDB cache size was this small. We didn't have the osd_memory_target autotuning back then; the cache size was a static ratio of the overall bluestore cache size, and back in those days we regularly did see OSD memory usage swing significantly depending on the workload.
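As an aside, for anyone who wants to see what these knobs correspond to at the RocksDB level: the sketch below is illustrative only (not the actual BlueStore wiring, and whether Ceph sets the high-priority/pinning options exactly like this is from memory rather than checked against the code). It just shows the two dials in play here: whether index/filter blocks are charged to the block cache, and how many shards the cache capacity is split across.

    // Illustrative sketch only, not Ceph's actual code. Shows the RocksDB
    // options that correspond to rocksdb_cache_index_and_filter_blocks and
    // rocksdb_cache_shard_bits.
    #include <rocksdb/cache.h>
    #include <rocksdb/options.h>
    #include <rocksdb/table.h>

    int main() {
      rocksdb::BlockBasedTableOptions table_opts;

      // What we flipped on: index/filter blocks are charged to the block
      // cache (so their memory is visible/accountable) instead of being
      // held outside it without bound...
      table_opts.cache_index_and_filter_blocks = true;
      // ...and kept at high priority / pinned so they aren't the first
      // things evicted (assumption: using RocksDB's stock options for that).
      table_opts.cache_index_and_filter_blocks_with_high_priority = true;
      table_opts.pin_l0_filter_and_index_blocks_in_cache = true;

      // The cache capacity is split across 2^num_shard_bits LRU shards.
      // rocksdb_cache_shard_bits=4 means 16 shards; 64MB is the floor the
      // autotuner can squeeze the cache down to under memory pressure.
      const size_t capacity = 64ull << 20;
      const int num_shard_bits = 4;
      table_opts.block_cache = rocksdb::NewLRUCache(capacity, num_shard_bits);

      rocksdb::Options db_opts;
      db_opts.table_factory.reset(
          rocksdb::NewBlockBasedTableFactory(table_opts));
      return 0;
    }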
It seems to me the bad behavior really comes from a combination of things: high memory pressure forcing the RocksDB block cache to shrink to the 64MB minimum, having 16 cache shards, and forcing index/filter blocks into the block cache.
The root cause is that the sharding defaults to 16 shards, and that filter blocks are cached in the block cache instead of separately (not the RocksDB default). Filter blocks can be up to 17MB or thereabouts, which means that with 16 shards the minimum viable KV block cache size is ~272MB to ensure non-pathological behavior (if a filter block does not fit into its cache shard, it is not cached at all, which causes the problem). Since snaptrim uses a lot of memory, it squeezes out the OSD caches, and as soon as the RocksDB block cache drops below roughly that size (depending on how large the filter blocks on a particular OSD ended up), you get pathological thrashing, with potentially several gigabytes per second of kernel->user memory copies repeatedly rereading filter blocks from SSTs.
My thought is that unless testing tells us otherwise, we should probably drop the shard count down to around 4 and bump up the minimum cache allocation a bit. We could instead disable storing index/filter blocks in the block cache, and the osd_memory_target code (which we didn't have before) would now attempt to compensate for it, but that feels a bit like going backwards (ideally we would be accounting for all significant memory usage in the OSD).
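Just to put rough numbers on that (back-of-the-envelope only, assuming Hector's ~17MB worst-case filter block and that a block must fit within a single shard to stay cached):

    #include <cstdio>

    int main() {
      // Rough sizing: each shard gets capacity / num_shards, and a filter
      // block only stays cached if it fits within one shard.
      const double largest_filter_mb = 17.0;  // from Hector's observation above
      for (int shard_bits : {4, 2, 1, 0}) {
        const int shards = 1 << shard_bits;
        const double min_cache_mb = largest_filter_mb * shards;
        std::printf("shard_bits=%d (%2d shards): min viable block cache ~%3.0f MB\n",
                    shard_bits, shards, min_cache_mb);
      }
      return 0;
    }

That lines up with Hector's ~272MB figure for the current 16 shards, and suggests that with 4 shards the minimum block cache allocation only needs to be nudged a bit above the current 64MB floor.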
Fixing this *might* mean that bluefs_buffered_io can be flipped to false too.
This is one of the most exciting aspects of this discovery imho!
As for who is likely affected, it depends on snaptrim memory usage vs. osd_memory_target. If you can guarantee roughly 1GB for managed caches, then it might never affect you, though it still depends on what PriCache decides to do, and it might still squeeze the KV cache under some other set of conditions. With the default of 4G for osd_memory_target, my guess is things are just about on the edge of being safe enough most of the time, which is probably why most people don't see things go horribly wrong. If you increase osd_memory_target above the default, you're probably safe; if you decrease it, you're in danger.
Hence the bluefs_buffered_io crutch. Even if you hit this, if you have enough page cache to keep the SST files cached, it doesn't really affect you either.
Either of the above settings makes it impossible for this pathological situation to occur. rocksdb_cache_index_and_filter_blocks=false is safest; rocksdb_cache_shard_bits=0 could still allow some thrashing if multiple SSTs are involved, and rocksdb_cache_shard_bits=1 is borderline, but in any case neither should allow the extreme pathological behavior where a single thread thrashes reads repeatedly at extreme speeds (which is what I experienced).
I'd slightly disagree with the conclusion that rocksdb_cache_index_and_filter_blocks=false is safest, especially when running in extremely memory-constrained scenarios. It's safest from the angle that this data will now be forced into memory, but it's not safe from the viewpoint that it will contribute to osd_memory_target overage without oversight. It takes control out of the priority cache's hands and forces the memory to be used. That's why I'm leaning toward reducing the number of cache shards and ensuring the minimum memory allocation for the block cache is large enough that we can fit the index/filter blocks into the cache. Having said that, there is also a reasonable view that if you are *extremely* memory constrained, loading the index/filter blocks from disk is the correct behavior even if it's horribly slow.
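FWIW, here's a toy model of the failure mode Hector describes (a deliberately simplified stand-in, not RocksDB's actual LRUCache admission logic): a block bigger than its shard never gets cached, so every Get re-reads it from the SST, and the difference between the 64MB floor and the ~272MB threshold is the difference between constant re-reads and a single read.

    #include <cstdio>
    #include <map>
    #include <string>

    // Simplified model of one block-cache shard: an entry larger than the
    // shard's capacity is never admitted, so every lookup misses and the
    // filter block gets re-read from the SST file each time.
    struct Shard {
      size_t capacity;
      std::map<std::string, size_t> entries;
      bool insert(const std::string& key, size_t size) {
        if (size > capacity) return false;  // oversized: not cached at all
        entries[key] = size;                // (LRU eviction omitted for brevity)
        return true;
      }
      bool contains(const std::string& key) const {
        return entries.count(key) != 0;
      }
    };

    int main() {
      const size_t MB = 1ull << 20;
      const size_t filter_block = 17 * MB;  // roughly the size Hector reported

      for (size_t cache_mb : {64, 272}) {
        Shard shard{cache_mb * MB / 16};    // 16 shards -> per-shard capacity
        size_t sst_reads = 0;
        for (int get = 0; get < 1000; ++get) {  // 1000 Gets hitting the same SST
          if (!shard.contains("filter")) {
            ++sst_reads;                    // filter block re-read from disk/page cache
            shard.insert("filter", filter_block);
          }
        }
        std::printf("block cache %3zu MB, 16 shards: %4zu filter block reads for 1000 Gets\n",
                    cache_mb, sst_reads);
      }
      return 0;
    }

Obviously the real cache has a lot more going on (other residents, eviction, strict capacity handling), but it captures why the thrashing switches on and off so abruptly as osd_memory_target changes.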
Mark
- Hector

On 2025/06/22 21:51, Hector Martin wrote:
Hi all,

I have a small 3-node cluster (4 HDD + 1 SSD OSD per node, ceph version 19.2.2) and noticed that during snaptrim ops (?), the OSDs max out CPU usage and cluster performance plummets. Most of the CPU usage was accounted as "system", while actual disk I/O usage was low, so that didn't sound right.

I perf traced the system, and found that most of the usage is in __arch_copy_to_user in the kernel, in read() syscalls. That sounds like the OSD is thrashing buffered block device reads which the kernel satisfies from the page cache (hence no real I/O load, most of the CPU usage is from data copies), so it's repeatedly reading the same disk blocks, which doesn't sound right.

I increased osd_memory_target to 2.4G (from 1.2G) live with `ceph config`, and CPU usage immediately dropped to near zero. However, waiting a bit eventually the CPU thrashing returns after memory usage increases. Restarting an OSD has a similar effect.

I believe that something is wrong with the OSD bluestore cache allocation/flush policy, and when the cache becomes full it starts thrashing reads instead of evicting colder cached data (or perhaps some cache bucket is starving another cache bucket of space). I would appreciate some hints on how to debug this. Are there any cache stats I should be looking at, or info on how the cache is partitioned?

Here is a perf call trace of the thrashing:

- OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
- 95.04% ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
- OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)
- 95.03% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)
- PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
- ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
- 88.86% ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, ECListener&)
- 88.78% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)
- BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusiv
- 88.75% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)
- 88.71% BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)
- BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)
- 88.69% BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, std
- 88.68% BlueStore::_wctx_finish(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, BlueStore::Writ
- 88.65% BlueStore::Collection::load_shared_blob(boost::intrusive_ptr<BlueStore::SharedBlob>)
- 88.64% RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::ch
- 88.61% rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*)
- rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&)
- 88.43% rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::PinnableWideColumns*, std
- 88.41% rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::Sl
- rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform const*, bo
- 88.32% rocksdb::BlockBasedTable::FullFilterKeyMayMatch(rocksdb::FilterBlockReader*, rocksdb::Slice const&, bool, rocksdb::SliceTransfor
- rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocks
- 88.29% rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool, rocksdb::GetContext*, rocksdb
- rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable const*, rocksdb::Fi
- rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocks
- rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePref
- 88.25% rocksdb::BlockFetcher::ReadBlockContents()
[ note: split, the rest is + 19.42% rocksdb::VerifyBlockChecksum ]
- 68.80% rocksdb::RandomAccessFileReader::Read(rocksdb::IOOptions const&, unsigned long, unsigned long, rocksdb::Sli
[ note: at this level the call trace splits into 4, but it all leads to the same place ]
- 35.99% 0xaaaac0992e40
  BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const
  BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)
- KernelDevice::read_random(unsigned long, unsigned long, char*, bool)
- 35.98% __libc_pread
- el0_svc
- invoke_syscall
- 35.97% __arm64_sys_pread64
- 35.96% vfs_read
- blkdev_read_iter
- 35.93% filemap_read
- 35.28% copy_page_to_iter
  35.01% __arch_copy_to_user

- Hector
--
Best Regards,
Mark Nelson
Head of Research and Development
Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: [email protected]

We are hiring: https://www.clyso.com/jobs/

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
