Congrats on figuring this out, Hector! This is a huge find! Comments below.
On 8/11/25 4:31 AM, Hector Martin wrote:
For those who have been following along, I figured it out. I left all the details with Mark on Slack, but TL;DR: The fix is *either one* (or both works too) of these:

  ceph config set osd rocksdb_cache_index_and_filter_blocks false
      (Ceph default: true, RocksDB default: false)
  ceph config set osd rocksdb_cache_shard_bits 0   (or 1)
      (Ceph default: 4, RocksDB default: variable but probably 6)

(Restart all OSDs after changing these settings for them to take effect, I don't think they apply live.)

The side effect of the first option is that it might increase unaccounted heap OSD memory usage (not managed by the cache autosize code), as filter blocks are preloaded and cached outside the block cache. The side effect of the second option is that it reduces the parallelism/core scalability of the block cache. I suspect the impact of either will be generally small for most typical deployments, and the benefit of not having horrible snaptrim thrashing far outweighs it for those affected.

Both options were changed from their RocksDB defaults in Ceph commits long ago without any explanation / relevant commit messages, so I suspect this is a case of "some people flipped some knobs that seemed harmless and nobody really tested their impact". Other things have changed since then too, so I can't say when the impact likely began, as there are a lot of other factors involved.
I'll take ownership of that one since I was the one that flipped those knobs. ;)
FWIW, a lot of the discussion around these issues happened back in the bluestore standup in those days. I recall that the idea behind these changes was that we didn't want index and filter blocks to be able to consume memory outside the context of the cache. I.e., at the time we were concerned with runaway OSD memory usage, so having index and filter blocks cached with high (pinned!) priority in the block cache was preferable to having memory allocated for them without bound and with no oversight inside RocksDB. The testing we did at the time didn't involve scenarios where the RocksDB cache size was this small. We didn't have the osd_memory_target autotuning back then; the cache size was a static ratio of the overall bluestore cache size, and back in those days we regularly did see OSD memory usage swing significantly depending on the workload.
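As an aside, for anyone who wants to see what these knobs correspond to at the RocksDB level: the sketch below is illustrative only (not the actual BlueStore wiring, and whether Ceph sets the high-priority/pinning options exactly like this is from memory rather than checked against the code). It just shows the two dials in play here: whether index/filter blocks are charged to the block cache, and how many shards the cache capacity is split across.

    // Illustrative sketch only, not Ceph's actual code. Shows the RocksDB
    // options that correspond to rocksdb_cache_index_and_filter_blocks and
    // rocksdb_cache_shard_bits.
    #include <rocksdb/cache.h>
    #include <rocksdb/options.h>
    #include <rocksdb/table.h>

    int main() {
      rocksdb::BlockBasedTableOptions table_opts;

      // What we flipped on: index/filter blocks are charged to the block
      // cache (so their memory is visible/accountable) instead of being
      // held outside it without bound...
      table_opts.cache_index_and_filter_blocks = true;
      // ...and kept at high priority / pinned so they aren't the first
      // things evicted (assumption: using RocksDB's stock options for that).
      table_opts.cache_index_and_filter_blocks_with_high_priority = true;
      table_opts.pin_l0_filter_and_index_blocks_in_cache = true;

      // The cache capacity is split across 2^num_shard_bits LRU shards.
      // rocksdb_cache_shard_bits=4 means 16 shards; 64MB is the floor the
      // autotuner can squeeze the cache down to under memory pressure.
      const size_t capacity = 64ull << 20;
      const int num_shard_bits = 4;
      table_opts.block_cache = rocksdb::NewLRUCache(capacity, num_shard_bits);

      rocksdb::Options db_opts;
      db_opts.table_factory.reset(
          rocksdb::NewBlockBasedTableFactory(table_opts));
      return 0;
    }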
It seems to me the bad behavior really comes from a combination of things: high memory pressure forcing the RocksDB block cache to shrink to the 64MB minimum, having 16 cache shards, and forcing index/filter blocks into the block cache.
The root cause is that the sharding defaults to 16 shards, and that filter blocks are cached in the block cache instead of separately (not the RocksDB default). Filter blocks can be up to 17MB or thereabouts, which means that with 16 shards the minimum viable KV block cache size is ~272MB to ensure non-pathological behavior (if a filter block does not fit into its cache shard, it is not cached at all, which causes the problem). Since snaptrim uses a lot of memory, it squeezes out the OSD caches, and as soon as the RocksDB block cache drops below roughly that size (depending on how large the filter blocks on a particular OSD ended up), you get pathological thrashing, with potentially several gigabytes per second of kernel->user memory copies repeatedly rereading filter blocks from SSTs.
My thought is that unless testing tells us otherwise, we should probably drop the shard count down to around 4 and bump up the minimum cache allocation a bit. We could instead disable storing index/filter blocks in the block cache, and the osd_memory_target code (which we didn't have before) would now attempt to compensate for it, but that feels a bit like going backwards (ideally we would be accounting for all significant memory usage in the OSD).
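Just to put rough numbers on that (back-of-the-envelope only, assuming Hector's ~17MB worst-case filter block and that a block must fit within a single shard to stay cached):

    #include <cstdio>

    int main() {
      // Rough sizing: each shard gets capacity / num_shards, and a filter
      // block only stays cached if it fits within one shard.
      const double largest_filter_mb = 17.0;  // from Hector's observation above
      for (int shard_bits : {4, 2, 1, 0}) {
        const int shards = 1 << shard_bits;
        const double min_cache_mb = largest_filter_mb * shards;
        std::printf("shard_bits=%d (%2d shards): min viable block cache ~%3.0f MB\n",
                    shard_bits, shards, min_cache_mb);
      }
      return 0;
    }

That lines up with Hector's ~272MB figure for the current 16 shards, and suggests that with 4 shards the minimum block cache allocation only needs to be nudged a bit above the current 64MB floor.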
Fixing this *might* mean that bluefs_buffered_io can be flipped to false too.
This is one of the most exciting aspects of this discovery imho!
As for who is likely affected, it depends on snaptrim memory usage vs. osd_memory_target. If you can guarantee roughly 1GB for managed caches, then it might never affect you, though it still depends on what PriCache decides to do, and it might still squeeze the KV cache under some other set of conditions. With the default of 4G for osd_memory_target, my guess is things are just about on the edge of being safe enough most of the time, which is probably why most people don't see things go horribly wrong. If you increase osd_memory_target above the default, you're probably safe; if you decrease it, you're in danger.
Hence the bluefs_buffered_io crutch. Even if you hit this, if you have enough page cache to keep the SST files cached, it doesn't really affect you either.
Either of the above settings makes it impossible for this pathological situation to occur. rocksdb_cache_index_and_filter_blocks=false is safest; rocksdb_cache_shard_bits=0 could still allow some thrashing if multiple SSTs are involved, and rocksdb_cache_shard_bits=1 is borderline, but in any case neither should allow the extreme pathological behavior where a single thread thrashes reads repeatedly at extreme speeds (which is what I experienced).
I'd slightly disagree with the conclusion that rocksdb_cache_index_and_filter_blocks=false is safest, especially when running in extremely memory-constrained scenarios. It's safest from the angle that this data will now be forced into memory, but it's not safe from the viewpoint that it will contribute to osd_memory_target overage without oversight. It takes control out of the priority cache's hands and forces the memory to be used. That's why I'm leaning toward reducing the number of cache shards and ensuring the minimum memory allocation for the block cache is large enough that we can fit the index/filter blocks into the cache. Having said that, there is also a reasonable view that if you are *extremely* memory constrained, loading the index/filter blocks from disk is the correct behavior even if it's horribly slow.
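FWIW, here's a toy model of the failure mode Hector describes (a deliberately simplified stand-in, not RocksDB's actual LRUCache admission logic): a block bigger than its shard never gets cached, so every Get re-reads it from the SST, and the difference between the 64MB floor and the ~272MB threshold is the difference between constant re-reads and a single read.

    #include <cstdio>
    #include <map>
    #include <string>

    // Simplified model of one block-cache shard: an entry larger than the
    // shard's capacity is never admitted, so every lookup misses and the
    // filter block gets re-read from the SST file each time.
    struct Shard {
      size_t capacity;
      std::map<std::string, size_t> entries;
      bool insert(const std::string& key, size_t size) {
        if (size > capacity) return false;  // oversized: not cached at all
        entries[key] = size;                // (LRU eviction omitted for brevity)
        return true;
      }
      bool contains(const std::string& key) const {
        return entries.count(key) != 0;
      }
    };

    int main() {
      const size_t MB = 1ull << 20;
      const size_t filter_block = 17 * MB;  // roughly the size Hector reported

      for (size_t cache_mb : {64, 272}) {
        Shard shard{cache_mb * MB / 16};    // 16 shards -> per-shard capacity
        size_t sst_reads = 0;
        for (int get = 0; get < 1000; ++get) {  // 1000 Gets hitting the same SST
          if (!shard.contains("filter")) {
            ++sst_reads;                    // filter block re-read from disk/page cache
            shard.insert("filter", filter_block);
          }
        }
        std::printf("block cache %3zu MB, 16 shards: %4zu filter block reads for 1000 Gets\n",
                    cache_mb, sst_reads);
      }
      return 0;
    }

Obviously the real cache has a lot more going on (other residents, eviction, strict capacity handling), but it captures why the thrashing switches on and off so abruptly as osd_memory_target changes.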
Mark
- Hector

On 2025/06/22 21:51, Hector Martin wrote:
Hi all,

I have a small 3-node cluster (4 HDD + 1 SSD OSD per node, ceph version 19.2.2) and noticed that during snaptrim ops (?), the OSDs max out CPU usage and cluster performance plummets. Most of the CPU usage was accounted as "system", while actual disk I/O usage was low, so that didn't sound right.

I perf traced the system, and found that most of the usage is in __arch_copy_to_user in the kernel, in read() syscalls. That sounds like the OSD is thrashing buffered block device reads which the kernel satisfies from the page cache (hence no real I/O load, most of the CPU usage is from data copies), so it's repeatedly reading the same disk blocks, which doesn't sound right.

I increased osd_memory_target to 2.4G (from 1.2G) live with `ceph config`, and CPU usage immediately dropped to near zero. However, waiting a bit eventually the CPU thrashing returns after memory usage increases. Restarting an OSD has a similar effect.

I believe that something is wrong with the OSD bluestore cache allocation/flush policy, and when the cache becomes full it starts thrashing reads instead of evicting colder cached data (or perhaps some cache bucket is starving another cache bucket of space). I would appreciate some hints on how to debug this. Are there any cache stats I should be looking at, or info on how the cache is partitioned?

Here is a perf call trace of the thrashing:

- OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)
- 95.04% ceph::osd::scheduler::PGOpItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)
- OSD::dequeue_op(boost::intrusive_ptr<PG>, boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)
- 95.03% PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&, ThreadPool::TPHandle&)
- PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)
- ECBackend::_handle_message(boost::intrusive_ptr<OpRequest>)
- 88.86% ECBackend::handle_sub_write(pg_shard_t, boost::intrusive_ptr<OpRequest>, ECSubWrite&, ZTracer::Trace const&, ECListener&)
- 88.78% non-virtual thunk to PrimaryLogPG::queue_transactions(std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusive_ptr<OpRequest>)
- BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ceph::os::Transaction, std::allocator<ceph::os::Transaction> >&, boost::intrusiv
- 88.75% BlueStore::_txc_add_transaction(BlueStore::TransContext*, ceph::os::Transaction*)
- 88.71% BlueStore::_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)
- BlueStore::_do_remove(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&)
- 88.69% BlueStore::_do_truncate(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, unsigned long, std
- 88.68% BlueStore::_wctx_finish(BlueStore::TransContext*, boost::intrusive_ptr<BlueStore::Collection>&, boost::intrusive_ptr<BlueStore::Onode>&, BlueStore::Writ
- 88.65% BlueStore::Collection::load_shared_blob(boost::intrusive_ptr<BlueStore::SharedBlob>)
- 88.64% RocksDBStore::get(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::ch
- 88.61% rocksdb::DBImpl::Get(rocksdb::ReadOptions const&, rocksdb::ColumnFamilyHandle*, rocksdb::Slice const&, rocksdb::PinnableSlice*)
- rocksdb::DBImpl::GetImpl(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::DBImpl::GetImplOptions&)
- 88.43% rocksdb::Version::Get(rocksdb::ReadOptions const&, rocksdb::LookupKey const&, rocksdb::PinnableSlice*, rocksdb::PinnableWideColumns*, std
- 88.41% rocksdb::TableCache::Get(rocksdb::ReadOptions const&, rocksdb::InternalKeyComparator const&, rocksdb::FileMetaData const&, rocksdb::Sl
- rocksdb::BlockBasedTable::Get(rocksdb::ReadOptions const&, rocksdb::Slice const&, rocksdb::GetContext*, rocksdb::SliceTransform const*, bo
- 88.32% rocksdb::BlockBasedTable::FullFilterKeyMayMatch(rocksdb::FilterBlockReader*, rocksdb::Slice const&, bool, rocksdb::SliceTransfor
- rocksdb::FullFilterBlockReader::MayMatch(rocksdb::Slice const&, bool, rocksdb::GetContext*, rocksdb::BlockCacheLookupContext*, rocks
- 88.29% rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::GetOrReadFilterBlock(bool, rocksdb::GetContext*, rocksdb
- rocksdb::FilterBlockReaderCommon<rocksdb::ParsedFullFilterBlock>::ReadFilterBlock(rocksdb::BlockBasedTable const*, rocksdb::Fi
- rocksdb::Status rocksdb::BlockBasedTable::RetrieveBlock<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePrefetchBuffer*, rocks
- rocksdb::Status rocksdb::BlockBasedTable::MaybeReadBlockAndLoadToCache<rocksdb::ParsedFullFilterBlock>(rocksdb::FilePref
- 88.25% rocksdb::BlockFetcher::ReadBlockContents()
[ note: split, the rest is + 19.42% rocksdb::VerifyBlockChecksum ]
- 68.80% rocksdb::RandomAccessFileReader::Read(rocksdb::IOOptions const&, unsigned long, unsigned long, rocksdb::Sli
[ note: at this level the call trace splits into 4, but it all leads to the same place ]
- 35.99% 0xaaaac0992e40
  BlueRocksRandomAccessFile::Read(unsigned long, unsigned long, rocksdb::Slice*, char*) const
  BlueFS::_read_random(BlueFS::FileReader*, unsigned long, unsigned long, char*)
- KernelDevice::read_random(unsigned long, unsigned long, char*, bool)
- 35.98% __libc_pread
- el0_svc
- invoke_syscall
- 35.97% __arm64_sys_pread64
- 35.96% vfs_read
- blkdev_read_iter
- 35.93% filemap_read
- 35.28% copy_page_to_iter
  35.01% __arch_copy_to_user

- Hector
--
Best Regards,
Mark Nelson
Head of Research and Development
Clyso GmbH
p: +49 89 21552391 12 | a: Minnesota, USA
w: https://clyso.com | e: [email protected]

We are hiring: https://www.clyso.com/jobs/

_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
