Hi Kasper,

Thanks for sharing.

I don't see anything wrong with this specific OSD when it comes to 
bluestore_rocksdb_*. Its RocksDB database is using column families, and this 
OSD was resharded properly (if not created or recreated in Pacific). What the 
perf dump shows is that slow_used_bytes is non-zero, i.e. some metadata has 
spilled over to the slow device. If this cluster makes heavy use of metadata 
(RGW workloads, for example), then 90 GB of DB device for 10 TB drives is less 
than 1%, which is not enough. The general recommendation for RGW workloads is 
to use a DB device of at least 4% of the data device's size [1].
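To put numbers on that, here is a quick back-of-the-envelope check using the 
figures from your perf dump (assumption: db_total_bytes reflects the DB 
partition size and slow_total_bytes the data device size):

```python
# Rough DB sizing check with the values from the quoted perf dump.
db_bytes = 88_906_653_696        # db_total_bytes (~83 GiB DB partition)
data_bytes = 9_796_816_207_872   # slow_total_bytes (~8.9 TiB data device)

ratio = db_bytes / data_bytes
print(f"DB device is {ratio:.2%} of the data device")  # ~0.91%, under 1%

# 4% rule of thumb for metadata-heavy (e.g. RGW) workloads
recommended_gib = 0.04 * data_bytes / 2**30
print(f"Suggested DB size at 4%: {recommended_gib:.0f} GiB")  # ~365 GiB
```

So these OSDs would want a DB partition several times larger than the current 
83 GiB if the metadata load stays this heavy.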

Now, your best move is probably to enable RocksDB compression (`ceph config set 
osd bluestore_rocksdb_options_annex 'compression=kLZ4Compression'`), then 
restart and compact these OSDs to update the bluefs stats, and consider giving 
these OSDs larger RocksDB partitions in the future.
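As a sketch, the sequence could look like this (osd.110 is used purely as an 
example ID; the restart command assumes a cephadm-managed cluster, so adapt it 
to your deployment):

```shell
# Enable LZ4 compression for RocksDB via the annex option (applies to all OSDs)
ceph config set osd bluestore_rocksdb_options_annex 'compression=kLZ4Compression'

# Restart the affected OSD so the new option takes effect
ceph orch daemon restart osd.110

# Compact so the SSTs get rewritten compressed and bluefs stats are refreshed
ceph tell osd.110 compact

# Verify: the SLOW column should drop back towards 0 B once spillover clears
ceph tell osd.110 bluefs stats
```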

Regards,
Frédéric.

[1] 
https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing

----- On 15 May 25, at 7:44, Kasper Rasmussen [email protected] 
wrote:

> perf dump:
> "bluefs": {
> "db_total_bytes": 88906653696,
> "db_used_bytes": 11631853568,
> "wal_total_bytes": 0,
> "wal_used_bytes": 0,
> "slow_total_bytes": 9796816207872,
> "slow_used_bytes": 1881341952,
> "num_files": 229,
> "log_bytes": 11927552,
> "log_compactions": 78,
> "log_write_count": 281792,
> "logged_bytes": 1154220032,
> "files_written_wal": 179,
> "files_written_sst": 311,
> "write_count_wal": 280405,
> "write_count_sst": 29432,
> "bytes_written_wal": 4015595520,
> "bytes_written_sst": 15728308224,
> "bytes_written_slow": 2691231744,
> "max_bytes_wal": 0,
> "max_bytes_db": 13012828160,
> "max_bytes_slow": 3146252288,
> "alloc_unit_slow": 65536,
> "alloc_unit_db": 1048576,
> "alloc_unit_wal": 0,
> "read_random_count": 1871590,
> "read_random_bytes": 18959576586,
> "read_random_disk_count": 563421,
> "read_random_disk_bytes": 17110012647,
> "read_random_disk_bytes_wal": 0,
> "read_random_disk_bytes_db": 11373755941,
> "read_random_disk_bytes_slow": 5736256706,
> "read_random_buffer_count": 1313456,
> "read_random_buffer_bytes": 1849563939,
> "read_count": 275731,
> "read_bytes": 4825912551,
> "read_disk_count": 225997,
> "read_disk_bytes": 4016943104,
> "read_disk_bytes_wal": 0,
> "read_disk_bytes_db": 3909947392,
> "read_disk_bytes_slow": 106999808,
> "read_prefetch_count": 274534,
> "read_prefetch_bytes": 4785141168,
> "write_count": 591760,
> "write_disk_count": 591838,
> "write_bytes": 21062987776,
> "compact_lat": {
> "avgcount": 78,
> "sum": 0.572247346,
> "avgtime": 0.007336504
> },
> "compact_lock_lat": {
> "avgcount": 78,
> "sum": 0.182746199,
> "avgtime": 0.002342899
> },
> "alloc_slow_fallback": 0,
> "alloc_slow_size_fallback": 0,
> "read_zeros_candidate": 0,
> "read_zeros_errors": 0,
> "wal_alloc_lat": {
> "avgcount": 0,
> "sum": 0.000000000,
> "avgtime": 0.000000000
> },
> "db_alloc_lat": {
> "avgcount": 969,
> "sum": 0.006368060,
> "avgtime": 0.000006571
> },
> "slow_alloc_lat": {
> "avgcount": 39,
> "sum": 0.004502210,
> "avgtime": 0.000115441
> },
> "alloc_wal_max_lat": 0.000000000,
> "alloc_db_max_lat": 0.000113831,
> "alloc_slow_max_lat": 0.000301347
> },
> 
> 
> config show:
> "bluestore_rocksdb_cf": "true",
> "bluestore_rocksdb_cfs": "m(3) p(3,0-12) 
> O(3,0-13)=block_cache={type=binned_lru}
> L=min_write_buffer_number_to_merge=32 P=min_write_buffer_number_to_merge=32",
> "bluestore_rocksdb_options":
> "compression=kLZ4Compression,max_write_buffer_number=64,min_write_buffer_number_to_merge=6,compaction_style=kCompactionStyleLevel,write_buffer_size=16777216,max_background_jobs=4,level0_file_num_compaction_trigger=8,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_readahead_size=2MB,max_total_wal_size=1073741824,writable_file_max_buffer_size=0",
> "bluestore_rocksdb_options_annex": "",
> 
> 
> Don't know if it is of any help, but I've compared the config with an OSD not
> reporting any issues, and there is no difference.
> 
> 
> ________________________________
> From: Enrico Bocchi <[email protected]>
> Sent: Wednesday, May 14, 2025 22:47
> To: Kasper Rasmussen <[email protected]>; ceph-users
> <[email protected]>
> Subject: Re: BLUEFS_SPILLOVER after Reef upgrade
> 
> Hi Kasper,
> 
> Would you mind sharing the output of `perf dump` and `config show` from the
> daemon socket of one of the OSDs reporting bluefs spillover? I am interested in
> the bluefs part of the former and in the bluestore_rocksdb options of the
> latter.
> 
> The warning about slow ops in bluestore is a different story. There have been
> several messages on this mailing list recently with suggestions on how to tune
> the alert threshold. From my experience, they very likely relate to some
> problem with the underlying storage device, so I'd recommend investigating the
> root cause rather than simply silencing the warning.
> 
> Cheers,
> Enrico
> 
> 
> ________________________________
> From: Kasper Rasmussen <[email protected]>
> Sent: Wednesday, May 14, 2025 8:22:46 PM
> To: ceph-users <[email protected]>
> Subject: [ceph-users] BLUEFS_SPILLOVER after Reef upgrade
> 
> I've just upgraded our ceph cluster from pacific 16.2.15 -> Reef 18.2.7
> 
> After that I see the warnings:
> 
> [WRN] BLUEFS_SPILLOVER: 5 OSD(s) experiencing BlueFS spillover
>     osd.110 spilled over 4.5 GiB metadata from 'db' device (8.0 GiB used of 83 GiB) to slow device
>     osd.455 spilled over 1.1 GiB metadata from 'db' device (11 GiB used of 83 GiB) to slow device
>     osd.533 spilled over 426 MiB metadata from 'db' device (10 GiB used of 83 GiB) to slow device
>     osd.560 spilled over 389 MiB metadata from 'db' device (9.8 GiB used of 83 GiB) to slow device
>     osd.597 spilled over 8.6 GiB metadata from 'db' device (7.7 GiB used of 83 GiB) to slow device
> [WRN] BLUESTORE_SLOW_OP_ALERT: 4 OSD(s) experiencing slow operations in
> BlueStore
>     osd.410 observed slow operation indications in BlueStore
>     osd.443 observed slow operation indications in BlueStore
>     osd.508 observed slow operation indications in BlueStore
>     osd.593 observed slow operation indications in BlueStore
> 
> I've tried to run `ceph tell osd.XXX compact` with no result.
> 
> Bluefs stats:
> 
> ceph tell osd.110 bluefs stats
> 1 : device size 0x14b33fe000 : using 0x202c00000(8.0 GiB)
> 2 : device size 0x8e8ffc00000 : using 0x5d31d150000(5.8 TiB)
> RocksDBBlueFSVolumeSelector
> >>Settings<< extra=0 B, l0_size=1 GiB, l_base=1 GiB, l_multi=8 B
> DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
> LOG         0 B         16 MiB      0 B         0 B         0 B         15 MiB      1
> WAL         0 B         18 MiB      0 B         0 B         0 B         6.3 MiB     1
> DB          0 B         8.0 GiB     0 B         0 B         0 B         8.0 GiB     140
> SLOW        0 B         0 B         4.5 GiB     0 B         0 B         4.5 GiB     78
> TOTAL       0 B         8.0 GiB     4.5 GiB     0 B         0 B         0 B         220
> MAXIMUMS:
> LOG         0 B         25 MiB      0 B         0 B         0 B         21 MiB
> WAL         0 B         118 MiB     0 B         0 B         0 B         93 MiB
> DB          0 B         8.2 GiB     0 B         0 B         0 B         8.2 GiB
> SLOW        0 B         0 B         14 GiB      0 B         0 B         14 GiB
> TOTAL       0 B         8.2 GiB     14 GiB      0 B         0 B         0 B
> >> SIZE <<  0 B         79 GiB      8.5 TiB
> 
> Help with what to do next will be much appreciated.
> 
> 
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]