Dear Ceph Community,
I hope this message finds you well. I am writing to seek assistance with a
stability issue we have encountered after upgrading our Ceph cluster from
Pacific 16.2.3 to Quincy 17.2.8.
Following the upgrade, several of our Object Storage Daemons (OSDs) have been
behaving erratically. They "flap": they unexpectedly go down and then come
back up, over and over. The problem predominantly affects the OSDs that were
recently upgraded.
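For reference, this is roughly how we have been observing the flapping (a
sketch using the standard Ceph CLI from an admin node; adjust to your
deployment):

```shell
# Stream cluster events and filter for OSD state changes
ceph -w | grep -i osd

# List OSDs currently marked down
ceph osd tree down

# Cluster health summary, including any flapping-related warnings
ceph health detail
```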
Upon reviewing the logs from the affected OSDs, we encountered the following
messages:
2025-02-03T08:34:09.769+0000 7f0f11390780 -1 bluestore::NCB::__restore_allocator::Failed open_for_read with error-code -2
2025-02-03T08:38:22.920+0000 7feb9dd44780 -1 bluestore::NCB::__restore_allocator::No Valid allocation info on disk (empty file)
In an attempt to resolve the issue, we ran the ceph-bluestore-tool fsck and
repair commands against the affected OSDs. Both commands completed
successfully, but the flapping persisted.
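For reference, the fsck/repair invocations looked roughly like this (a sketch;
osd.211 is one of the affected daemons, and the data path assumes a default
package-based deployment):

```shell
# Stop the OSD before touching its store
systemctl stop ceph-osd@211

# Run a consistency check on the BlueStore store of osd.211
ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-211

# Attempt a repair, then bring the OSD back up
ceph-bluestore-tool repair --path /var/lib/ceph/osd/ceph-211
systemctl start ceph-osd@211
```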
Additionally, we have captured the following crash information from the ceph
logs:
ceph crash info 2025-02-03T09:19:08.749233Z_9e2800fb-77f6-46cb-8087-203ea15a2039
{
"assert_condition": "log.t.seq == log.seq_live",
"assert_file": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/os/bluestore/BlueFS.cc",
"assert_func": "uint64_t BlueFS::_log_advance_seq()",
"assert_line": 3029,
"assert_msg": "/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/os/bluestore/BlueFS.cc: In function 'uint64_t BlueFS::_log_advance_seq()' thread 7ff983564640 time 2025-02-03T09:19:08.738781+0000\n/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos9/DIST/centos9/MACHINE_SIZE/gigantic/release/17.2.8/rpm/el9/BUILD/ceph-17.2.8/src/os/bluestore/BlueFS.cc: 3029: FAILED ceph_assert(log.t.seq == log.seq_live)\n",
"assert_thread_name": "bstore_kv_sync",
"backtrace": [
"/lib64/libc.so.6(+0x3e730) [0x7ff9930f5730]",
"/lib64/libc.so.6(+0x8bbdc) [0x7ff993142bdc]",
"raise()",
"abort()",
"(ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x179) [0x55882dfb7fdd]",
"/usr/bin/ceph-osd(+0x36b13e) [0x55882dfb813e]",
"/usr/bin/ceph-osd(+0x9cff3b) [0x55882e61cf3b]",
"(BlueFS::_flush_and_sync_log_jump_D(unsigned long)+0x4e) [0x55882e6291ee]",
"(BlueFS::_compact_log_async_LD_LNF_D()+0x59b) [0x55882e62e8fb]",
"/usr/bin/ceph-osd(+0x9f2b15) [0x55882e63fb15]",
"(BlueFS::fsync(BlueFS::FileWriter*)+0x1b9) [0x55882e631989]",
"/usr/bin/ceph-osd(+0x9f4889) [0x55882e641889]",
"/usr/bin/ceph-osd(+0xd74cd5) [0x55882e9c1cd5]",
"(rocksdb::WritableFileWriter::SyncInternal(bool)+0x483) [0x55882eade393]",
"(rocksdb::WritableFileWriter::Sync(bool)+0x120) [0x55882eae0b60]",
"(rocksdb::DBImpl::WriteToWAL(rocksdb::WriteThread::WriteGroup const&, rocksdb::log::Writer*, unsigned long*, bool, bool, unsigned long)+0x337) [0x55882ea00ab7]",
"(rocksdb::DBImpl::WriteImpl(rocksdb::WriteOptions const&, rocksdb::WriteBatch*, rocksdb::WriteCallback*, unsigned long*, unsigned long, bool, unsigned long*, unsigned long, rocksdb::PreReleaseCallback*)+0x1935) [0x55882ea07675]",
"(rocksdb::DBImpl::Write(rocksdb::WriteOptions const&, rocksdb::WriteBatch*)+0x35) [0x55882ea077c5]",
"(RocksDBStore::submit_common(rocksdb::WriteOptions&, std::shared_ptr<KeyValueDB::TransactionImpl>)+0x83) [0x55882e992593]",
"(RocksDBStore::submit_transaction_sync(std::shared_ptr<KeyValueDB::TransactionImpl>)+0x99) [0x55882e992ee9]",
"(BlueStore::_kv_sync_thread()+0xf64) [0x55882e578e24]",
"/usr/bin/ceph-osd(+0x8afb81) [0x55882e4fcb81]",
"/lib64/libc.so.6(+0x89e92) [0x7ff993140e92]",
"/lib64/libc.so.6(+0x10ef20) [0x7ff9931c5f20]"
],
"ceph_version": "17.2.8",
"crash_id": "2025-02-03T09:19:08.749233Z_9e2800fb-77f6-46cb-8087-203ea15a2039",
"entity_name": "osd.211",
"os_id": "centos",
"os_name": "CentOS Stream",
"os_version": "9",
"os_version_id": "9",
"process_name": "ceph-osd",
"stack_sig": "ba90de24e2beba9c6a75249a4cce7c533987ca5127cfba5b835a3456174d6080",
"timestamp": "2025-02-03T09:19:08.749233Z",
"utsname_hostname": "afra-osd18",
"utsname_machine": "x86_64",
"utsname_release": "5.15.0-119-generic",
"utsname_sysname": "Linux",
"utsname_version": "#129-Ubuntu SMP Fri Aug 2 19:25:20 UTC 2024"
}
The crash report above shows an assertion failure in the BlueFS component: the
check log.t.seq == log.seq_live in BlueFS::_log_advance_seq() fails on the
bstore_kv_sync thread, reached via BlueFS::_compact_log_async_LD_LNF_D()
during an fsync issued by RocksDB. Despite our efforts to analyze and resolve
the issue, we have reached an impasse.
For completeness, we have checked all of the underlying disks with smartctl,
and every drive reports healthy.
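The SMART checks were along these lines (device names are illustrative;
substitute the disks actually backing the affected OSDs):

```shell
# Quick SMART health verdict per backing device
for dev in /dev/sda /dev/sdb /dev/sdc; do
    echo "== $dev =="
    smartctl -H "$dev"
done

# Full attribute dump for a closer look at a single drive
smartctl -a /dev/sda
```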
We kindly request guidance from the community on how to address this issue or
any recommended steps for deeper diagnostics. We appreciate your support and
expertise during this troubleshooting process.
OSD logs: https://paste.mozilla.org/6STm6eum
Thank you for your attention and assistance.
Best regards,
Nima
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]