Hi Kasper,

sorry, I'm a bit late to the party, but w.r.t. the DB data migration it would be interesting to know whether it fixes the spillover permanently or whether it re-occurs after a while.

Forcing a DB data migration could have only a temporary effect if BlueStore/RocksDB keeps thinking spillover is needed.

Additionally, could you please run 'ceph-kvstore-tool bluestore-kv <path-to-osd> stats' for a few OSDs that experience(d) spillover and share the output?
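
In case it helps, a rough sketch of how that could be run on a cephadm-managed cluster (osd.110 used purely as an example; the OSD has to be stopped while the tool opens its store):

ceph orch daemon stop osd.110
cephadm shell --fsid $(ceph fsid) --name osd.110 -- ceph-kvstore-tool bluestore-kv /var/lib/ceph/osd/ceph-110 stats
ceph orch daemon start osd.110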


Thanks,

Igor


On 16/05/2025 09:49, Kasper Rasmussen wrote:
FIXED -

So here is what I have tried.


   1. Stopped osd.110
   2. Enabled sharding with command: ceph-bluestore-tool --path ./osd.110 --sharding="m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P" reshard
   3. Started osd.110

Result: osd.110 still had spilled-over data, but now only 64 KiB (down from more than 2 GiB).

Tried compacting + restarting osd.110

Result: No change.

Tried stopping osd.110 > compacting while offline > starting osd.110

Result: No change.
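
(For reference, offline compaction of a stopped OSD can be done with something along the lines of "ceph-kvstore-tool bluestore-kv ./osd.110 compact", using the same ./osd.110 path as the resharding command above - I'm not claiming this is the exact invocation I used, just the general idea.)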

Finally -


   1. Stopped osd.110
   2. ceph-bluestore-tool bluefs-bdev-migrate --path "./osd.110" --devs-source "./osd.110/block" --dev-target "./osd.110/block.db"
   3. Started osd.110

Result: SUCCESS - BLUEFS_SPILLOVER warning gone on osd.110

To be honest, I'm not 100% sure whether there are any caveats when migrating the data 
with the command ceph-bluestore-tool bluefs-bdev-migrate, so any input on that 
will be much appreciated.



________________________________
From: Frédéric Nass <[email protected]>
Sent: Thursday, May 15, 2025 17:30
To: Kasper Rasmussen <[email protected]>
Cc: Enrico Bocchi <[email protected]>; ceph-users <[email protected]>
Subject: Re: [ceph-users] Re: BLUEFS_SPILLOVER after Reef upgrade


----- On 15 May 2025, at 14:47, Kasper Rasmussen <[email protected]> wrote:
Hi Both

Let me add some findings.

The cluster started on an older-than-Pacific version - I don't know which - and was 
at some point migrated to BlueStore.

When running ceph osd metadata <osd.id>, roughly 50% of the OSDs have no value for 
"ceph_version_when_created"; the rest show ceph_version_when_created: ceph version 
16.2.....

So I probably have approx. 50% created-pre-Pacific OSDs.

Please note that only OSDs created after Pacific v16.2.11 will have the 
"ceph_version_when_created" and "created_at" metadata populated. So 
technically, you could have OSDs created in Pacific, say v16.2.9, that already use sharded RocksDB.

Also, I wrongly assumed from the config show output that your OSD RocksDBs were 
sharded/resharded, but that is just configuration that could have been set 
after the OSDs were created.
The only way to make sure whether the RocksDBs are using column families is the way 
you did it, with the ceph-bluestore-tool show-sharding command. I guess you'll 
have to check them all and reshard them where needed (see the sketch below).
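
Something like this untested sketch could help going through them (it assumes a classic /var/lib/ceph/osd/ceph-N layout, and each OSD must be stopped while the tool opens its store):

for path in /var/lib/ceph/osd/ceph-*; do
  echo "== $path"
  ceph-bluestore-tool --path "$path" show-sharding
done

Pre-Pacific OSDs should report "failed to retrieve sharding def", as you already observed.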

Please be extremely cautious with the case of the letters (m p O L P) in the resharding 
command, as using the wrong letters can wreck your OSDs.

Let us know how it goes.

Frédéric.

On a few of the created-pre-pacific OSDs I've executed -
ceph-bluestore-tool --path ./osd.XX show-sharding
With result:
failed to retrieve sharding def

On a few of the others I get results like:
m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru} L P

Also, until now it seems that all the OSDs with warnings -
BLUEFS_SPILLOVER
BLUESTORE_SLOW_OP_ALERT (none of these has I/O errors)

are created-pre-Pacific OSDs.
Although only a very small percentage of the total number of OSDs has warnings, 
it might still be a clue.

I will return to the issue tomorrow



________________________________
From: Frédéric Nass <[email protected]>
Sent: Thursday, May 15, 2025 13:43
To: Kasper Rasmussen <[email protected]>; Enrico Bocchi 
<[email protected]>
Cc: ceph-users <[email protected]>
Subject: Re: [ceph-users] Re: BLUEFS_SPILLOVER after Reef upgrade


Hi Kasper, Hi Enrico,


You're right, Enrico! I misread the numbers. The DB is now 11GB in size, not even 
close to 88GB. What probably happened at some point is that RocksDB had to 
allocate more than 88GB, probably during compaction, and spilled 1.9GB over 
to the slow device.

If compacting these OSDs twice in a row doesn't help with getting the 1.9GB back 
to the fast device, then the commands below should:


1/ ceph orch daemon stop osd.${osd}
2/ cephadm shell --fsid $(ceph fsid) --name osd.${osd} -- ceph-bluestore-tool \
     bluefs-bdev-migrate --path /var/lib/ceph/osd/ceph-${osd} \
     --devs-source /var/lib/ceph/osd/ceph-${osd}/block \
     --dev-target /var/lib/ceph/osd/ceph-${osd}/block.db
3/ ceph orch daemon start osd.${osd}
4/ ceph tell osd.${osd} compact


Step 4/ should update bluefs stats figures.


Regarding RocksDB, resharding should not be required as the new layout with 
column families is already being used.


Kasper, can you check the value of bluestore_volume_selection_policy for the 
overspilled OSDs? It should default to 'use_some_extra', which allows RocksDB to 
allocate space between 30GB and 88GB rather than spilling over to the slow device 
after allocating ~30GB.
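
For example (osd.110 taken as one of the affected OSDs):

ceph config get osd.110 bluestore_volume_selection_policy
# or, on the OSD host, via the admin socket of the running daemon:
ceph daemon osd.110 config get bluestore_volume_selection_policy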


Also, don't forget to enable RocksDB compression.

Regards,
Frédéric.

----- On 15 May 2025, at 13:04, Kasper Rasmussen <[email protected]> wrote:
Thanks to you both

I was just about to address the used vs total byte thing.

I will look into your pointers Enrico, and return with comments/findings.

________________________________
From: Enrico Bocchi <[email protected]>
Sent: Thursday, May 15, 2025 13:00
To: Kasper Rasmussen <[email protected]>; Frédéric Nass 
<[email protected]>
Cc: ceph-users <[email protected]>
Subject: Re: [ceph-users] Re: BLUEFS_SPILLOVER after Reef upgrade

Hi Kasper,

As Frédéric pointed out, you should consider resharding the RocksDB
database to use column families (if the OSD was created pre-Pacific):
https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#rocksdb-sharding
There's additional documentation available with some preliminary steps,
including making sure your RocksDB does not already use column families.

RocksDB options have changed in recent Reef releases and seem to be
quite different w.r.t. Pacific/Quincy. You may want to check whether any of
the configuration options that have been modified are relevant for your
setup.
Here is an excellent deep-dive blog post by the unequaled Mark Nelson:
https://ceph.io/en/news/blog/2022/rocksdb-tuning-deep-dive/

For the used vs total bytes I have to disagree with Frédéric (sorry):
11631853568 / 88906653696 gives 13% utilization, so the OSD should not
overspill to slow storage.
I have seen this in the past, and rolling back to the previous
bluestore_rocksdb options helped. However, I have not resharded RocksDB
to column families yet. Would you please keep us posted on whether resharding
to column families fixes the overspill?

Cheers,
Enrico


On 5/15/25 12:35, Frédéric Nass wrote:
Hi Kasper,

Thanks for sharing.

I don't see anything wrong with this specific OSD when it comes to 
bluestore_rocksdb_*. Its RocksDB database is using column families and this 
OSD was resharded properly (if not created or recreated in Pacific). What the 
perf dump shows is that db_used_bytes is above db_total_bytes. If this 
cluster makes heavy use of metadata (RGW workloads for example), then 90GB of DB 
device for 10TB drives is less than 1%, which is not enough. The general 
recommendation for RGW workloads is to use a DB device of at least 4% of the 
data device's size [1].
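
As a rough back-of-the-envelope with the figures above: 4% of a 10TB data device is 
about 400GB of DB space per OSD, whereas the current ~90GB partition is below 1%.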

Now, your best move is probably to enable RocksDB compression (ceph config set 
osd bluestore_rocksdb_options_annex 'compression=kLZ4Compression'), restart and 
compact these OSDs to update bluefs stats, and consider giving those OSDs 
larger RocksDB partitions in the future.
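
A minimal sketch of that sequence for a single OSD (osd.110 used as an example, 
assuming an orchestrator-managed cluster):

ceph config set osd bluestore_rocksdb_options_annex 'compression=kLZ4Compression'
ceph orch daemon restart osd.110   # pick up the new RocksDB option
ceph tell osd.110 compact          # rewrite the SSTs and refresh the bluefs stats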

Regards,
Frédéric.

[1] 
https://docs.ceph.com/en/latest/rados/configuration/bluestore-config-ref/#sizing

----- On 15 May 2025, at 7:44, Kasper Rasmussen [email protected] wrote:

perf dump:
"bluefs": {
"db_total_bytes": 88906653696,
"db_used_bytes": 11631853568,
"wal_total_bytes": 0,
"wal_used_bytes": 0,
"slow_total_bytes": 9796816207872,
"slow_used_bytes": 1881341952,
"num_files": 229,
"log_bytes": 11927552,
"log_compactions": 78,
"log_write_count": 281792,
"logged_bytes": 1154220032,
"files_written_wal": 179,
"files_written_sst": 311,
"write_count_wal": 280405,
"write_count_sst": 29432,
"bytes_written_wal": 4015595520,
"bytes_written_sst": 15728308224,
"bytes_written_slow": 2691231744,
"max_bytes_wal": 0,
"max_bytes_db": 13012828160,
"max_bytes_slow": 3146252288,
"alloc_unit_slow": 65536,
"alloc_unit_db": 1048576,
"alloc_unit_wal": 0,
"read_random_count": 1871590,
"read_random_bytes": 18959576586,
"read_random_disk_count": 563421,
"read_random_disk_bytes": 17110012647,
"read_random_disk_bytes_wal": 0,
"read_random_disk_bytes_db": 11373755941,
"read_random_disk_bytes_slow": 5736256706,
"read_random_buffer_count": 1313456,
"read_random_buffer_bytes": 1849563939,
"read_count": 275731,
"read_bytes": 4825912551,
"read_disk_count": 225997,
"read_disk_bytes": 4016943104,
"read_disk_bytes_wal": 0,
"read_disk_bytes_db": 3909947392,
"read_disk_bytes_slow": 106999808,
"read_prefetch_count": 274534,
"read_prefetch_bytes": 4785141168,
"write_count": 591760,
"write_disk_count": 591838,
"write_bytes": 21062987776,
"compact_lat": {
"avgcount": 78,
"sum": 0.572247346,
"avgtime": 0.007336504
},
"compact_lock_lat": {
"avgcount": 78,
"sum": 0.182746199,
"avgtime": 0.002342899
},
"alloc_slow_fallback": 0,
"alloc_slow_size_fallback": 0,
"read_zeros_candidate": 0,
"read_zeros_errors": 0,
"wal_alloc_lat": {
"avgcount": 0,
"sum": 0.000000000,
"avgtime": 0.000000000
},
"db_alloc_lat": {
"avgcount": 969,
"sum": 0.006368060,
"avgtime": 0.000006571
},
"slow_alloc_lat": {
"avgcount": 39,
"sum": 0.004502210,
"avgtime": 0.000115441
},
"alloc_wal_max_lat": 0.000000000,
"alloc_db_max_lat": 0.000113831,
"alloc_slow_max_lat": 0.000301347
},


config show:
"bluestore_rocksdb_cf": "true",
"bluestore_rocksdb_cfs": "m(3) p(3,0-12) O(3,0-13)=block_cache={type=binned_lru}
L=min_write_buffer_number_to_merge=32 P=min_write_buffer_number_to_merge=32",
"bluestore_rocksdb_options":
"compression=kLZ4Compression,max_write_buffer_number=64,min_write_buffer_number_to_merge=6,compaction_style=kCompactionStyleLevel,write_buffer_size=16777216,max_background_jobs=4,level0_file_num_compaction_trigger=8,max_bytes_for_level_base=1073741824,max_bytes_for_level_multiplier=8,compaction_readahead_size=2MB,max_total_wal_size=1073741824,writable_file_max_buffer_size=0",
"bluestore_rocksdb_options_annex": "",


Don't know if it is of any help, but I've compared the config with an OSD not
reporting any issues, and there is no difference.


________________________________
From: Enrico Bocchi <[email protected]>
Sent: Wednesday, May 14, 2025 22:47
To: Kasper Rasmussen <[email protected]>; ceph-users
<[email protected]>
Subject: Re: BLUEFS_SPILLOVER after Reef upgrade

Hi Kasper,

Would you mind sharing the output of `perf dump` and `config show` from the
daemon socket of one of the OSDs reporting bluefs spillover? I am interested in
the bluefs part of the former and in the bluestore_rocksdb options of the
latter.
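
For example, on the host running one of the affected OSDs (osd.110 taken as an example):

ceph daemon osd.110 perf dump
ceph daemon osd.110 config show | grep bluestore_rocksdb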

The warning about slow ops in bluestore is a different story. There have been
several messages on this mailing list recently with suggestions on how to tune
the alert threshold. From my experience, they very likely relate to some
problem with the underlying storage device, so I'd recommend investigating the
root cause rather than simply silencing the warning.

Cheers,
Enrico


________________________________
From: Kasper Rasmussen <[email protected]>
Sent: Wednesday, May 14, 2025 8:22:46 PM
To: ceph-users <[email protected]>
Subject: [ceph-users] BLUEFS_SPILLOVER after Reef upgrade

I've just upgraded our Ceph cluster from Pacific 16.2.15 to Reef 18.2.7.

After that I see the warnings:

[WRN] BLUEFS_SPILLOVER: 5 OSD(s) experiencing BlueFS spillover
      osd.110 spilled over 4.5 GiB metadata from 'db' device (8.0 GiB used of 83 GiB) to slow device
      osd.455 spilled over 1.1 GiB metadata from 'db' device (11 GiB used of 83 GiB) to slow device
      osd.533 spilled over 426 MiB metadata from 'db' device (10 GiB used of 83 GiB) to slow device
      osd.560 spilled over 389 MiB metadata from 'db' device (9.8 GiB used of 83 GiB) to slow device
      osd.597 spilled over 8.6 GiB metadata from 'db' device (7.7 GiB used of 83 GiB) to slow device
[WRN] BLUESTORE_SLOW_OP_ALERT: 4 OSD(s) experiencing slow operations in BlueStore
      osd.410 observed slow operation indications in BlueStore
      osd.443 observed slow operation indications in BlueStore
      osd.508 observed slow operation indications in BlueStore
      osd.593 observed slow operation indications in BlueStore

I've tried to run ceph tell osd.XXX compact with no result.

Bluefs stats:

ceph tell osd.110 bluefs stats
1 : device size 0x14b33fe000 : using 0x202c00000(8.0 GiB)
2 : device size 0x8e8ffc00000 : using 0x5d31d150000(5.8 TiB)
RocksDBBlueFSVolumeSelector
Settings<< extra=0 B, l0_size=1 GiB, l_base=1 GiB, l_multi=8 B
DEV/LEV     WAL         DB          SLOW        *           *           REAL        FILES
LOG         0 B         16 MiB      0 B         0 B         0 B         15 MiB      1
WAL         0 B         18 MiB      0 B         0 B         0 B         6.3 MiB     1
DB          0 B         8.0 GiB     0 B         0 B         0 B         8.0 GiB     140
SLOW        0 B         0 B         4.5 GiB     0 B         0 B         4.5 GiB     78
TOTAL       0 B         8.0 GiB     4.5 GiB     0 B         0 B         0 B         220
MAXIMUMS:
LOG         0 B         25 MiB      0 B         0 B         0 B         21 MiB
WAL         0 B         118 MiB     0 B         0 B         0 B         93 MiB
DB          0 B         8.2 GiB     0 B         0 B         0 B         8.2 GiB
SLOW        0 B         0 B         14 GiB      0 B         0 B         14 GiB
TOTAL       0 B         8.2 GiB     14 GiB      0 B         0 B         0 B
SIZE <<     0 B         79 GiB      8.5 TiB

Help with what to do next will be much appreciated.


--
Enrico Bocchi
CERN European Laboratory for Particle Physics
IT - Storage & Data Management  - General Storage Services
Mailbox: G20500 - Office: 31-2-010
1211 Genève 23
Switzerland


