Public bug reported: Upon upgrading a Ceph node with do-release-upgrade from eoan to focal, the OSD doesn't connect after the upgrade. I rolled back the change (VBox snapshot) and tried again, same result. I also tried to hold back the Ceph packages and upgrade after the fact, but again same result.
Epected behavior: OSD connects to cluster after upgrade. Actual behavior: OSD log shows endlessly repeated 'tick_without_osd_lock' messages. OSD will stay down from perspective of the cluster. Extract from debug log of OSD: 2020-04-24T16:25:35.811-0700 7fd70e83d700 5 osd.0 16499 heartbeat osd_stat(store_statfs(0x44990000/0x40000000/0x240000000, data 0x14bb97877/0x1bb660000, compress 0x0/0x0/0x0, omap 0x2bbf, meta 0x3fffd441), peers [] op hist []) 2020-04-24T16:25:35.811-0700 7fd70e83d700 20 osd.0 16499 check_full_status cur ratio 0.769796, physical ratio 0.769796, new state none 2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 tick 2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- start 2020-04-24T16:25:36.043-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- finish 2020-04-24T16:25:36.043-0700 7fd7272ea700 20 osd.0 16499 tick last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next 2020-04-25T15:54:43.601161-0700 2020-04-24T16:25:36.631-0700 7fd72606c700 10 osd.0 16499 tick_without_osd_lock 2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 tick 2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- start 2020-04-24T16:25:37.055-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- finish 2020-04-24T16:25:37.055-0700 7fd7272ea700 20 osd.0 16499 tick last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next 2020-04-25T15:54:43.601161-0700 2020-04-24T16:25:37.595-0700 7fd72606c700 10 osd.0 16499 tick_without_osd_lock 2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 tick 2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- start 2020-04-24T16:25:38.071-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- finish 2020-04-24T16:25:38.071-0700 7fd7272ea700 20 osd.0 16499 tick last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next 2020-04-25T15:54:43.601161-0700 2020-04-24T16:25:38.243-0700 7fd71cc0d700 20 osd.0 16499 reports for 0 queries 2020-04-24T16:25:38.583-0700 7fd72606c700 10 osd.0 16499 tick_without_osd_lock 2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 tick 2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- start 2020-04-24T16:25:39.103-0700 7fd7272ea700 10 osd.0 16499 do_waiters -- finish 2020-04-24T16:25:39.103-0700 7fd7272ea700 20 osd.0 16499 tick last_purged_snaps_scrub 2020-04-24T15:54:43.601161-0700 next 2020-04-25T15:54:43.601161-0700 This repeats over and over again. strace of the process yields lots of unfinished futex access attempts: [pid 2130] futex(0x55b17b8e216c, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=937726129}, FUTEX_BITSET_MATCH_ANY <unfinished ...> [pid 2100] write(12, "2020-04-24T16:47:33.915-0700 7fd"..., 79) = 79 [pid 2100] futex(0x55b17b7108e4, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> [pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772053, tv_nsec=969572004}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out) [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=20189832}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out) [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=70811223}, FUTEX_BITSET_MATCH_ANY <unfinished ...> [pid 2163] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0 [pid 2134] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0 [pid 2163] io_getevents(0x7fd7272eb000, 1, 16, <unfinished ...> [pid 2134] io_getevents(0x7fd7272fc000, 1, 16, <unfinished ...> [pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=121288477}, FUTEX_BITSET_MATCH_ANY <unfinished ...> [pid 2200] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 2200] futex(0x55b17b7108e4, FUTEX_WAKE_PRIVATE, 2147483647) = 1 [pid 2200] futex(0x7ffc4aa8b708, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 2200] futex(0x7ffc4aa8b770, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772055, tv_nsec=102644954}, FUTEX_BITSET_MATCH_ANY <unfinished ...> [pid 2100] <... futex resumed> ) = 0 [pid 2100] futex(0x55b17b710838, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 2100] futex(0x55b17b7108e0, FUTEX_WAIT_PRIVATE, 0, NULL <unfinished ...> [pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=171906673}, FUTEX_BITSET_MATCH_ANY <unfinished ...> [pid 2127] <... clock_nanosleep resumed> NULL) = 0 [pid 2127] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=256000000}, <unfinished ...> [pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=222271211}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out) [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=273226419}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out) [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=323615391}, FUTEX_BITSET_MATCH_ANY <unfinished ...> [pid 2163] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0 [pid 2163] io_getevents(0x7fd7272eb000, 1, 16, <unfinished ...> [pid 2134] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0 [pid 2134] io_getevents(0x7fd7272fc000, 1, 16, <unfinished ...> [pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=373946132}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out) [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=424283527}, FUTEX_BITSET_MATCH_ANY <unfinished ...> [pid 2127] <... clock_nanosleep resumed> NULL) = 0 [pid 2127] clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=256000000}, <unfinished ...> [pid 2190] <... futex resumed> ) = -1 ETIMEDOUT (Connection timed out) [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=474599677}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out) [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=525368586}, FUTEX_BITSET_MATCH_ANY) = -1 ETIMEDOUT (Connection timed out) [pid 2190] futex(0x55b17b775ac0, FUTEX_WAKE_PRIVATE, 1) = 0 [pid 2190] futex(0x55b17b775ab8, FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 0, {tv_sec=1587772054, tv_nsec=575839547}, FUTEX_BITSET_MATCH_ANY <unfinished ...> [pid 2163] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0 [pid 2134] <... io_getevents resumed> [], {tv_sec=0, tv_nsec=250000000}) = 0 [pid 2163] io_getevents(0x7fd7272eb000, 1, 16, <unfinished ...> [pid 2134] io_getevents(0x7fd7272fc000, 1, 16, ^Cstrace: Process 2093 detached Suspected cause: OSD can not connect to monitor. Repeatability: On 5 attempts (3 separate nodes and 2 repetitions) the result was the same. Research done: I checked Launchpad and the Ceph bug tracker, couldn't find something similar. Tried restart of process, reboot of node, revert change and re-upgrade, hold Ceph packages and manually upgrade after do- release-upgrade, strace of the process. Impact: Right now upgrade of Ceph to 20.04 LTS appears to be broken ProblemType: Bug DistroRelease: Ubuntu 20.04 Package: ceph-osd 15.2.1-0ubuntu1 ProcVersionSignature: Ubuntu 5.4.0-26.30-generic 5.4.30 Uname: Linux 5.4.0-26-generic x86_64 ApportVersion: 2.20.11-0ubuntu27 Architecture: amd64 CasperMD5CheckResult: skip Date: Fri Apr 24 16:30:53 2020 ProcEnviron: SHELL=/bin/bash LANG=en_US.UTF-8 TERM=xterm-256color PATH=(custom, no user) SourcePackage: ceph UpgradeStatus: Upgraded to focal on 2020-04-24 (0 days ago) ** Affects: ceph (Ubuntu) Importance: Undecided Status: New ** Tags: amd64 apport-bug focal uec-images -- You received this bug notification because you are a member of Ubuntu Bugs, which is subscribed to Ubuntu. https://bugs.launchpad.net/bugs/1874939 Title: ceph-osd can't connect after upgrade to focal To manage notifications about this bug go to: https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1874939/+subscriptions -- ubuntu-bugs mailing list ubuntu-bugs@lists.ubuntu.com https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs