Hi Thomas,
On Tue, Jan 17, 2023 at 5:34 PM Thomas Widhalm
<[email protected]> wrote:
>
> Another new thing that just happened:
>
> One of the MDS just crashed out of nowhere.
>
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc:
> In function 'void EMetaBlob::replay(MDSRank*, LogSegment*,
> MDPeerUpdate*)' thread 7fccc7153700 time 2023-01-17T10:05:15.420191+0000
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/17.2.5/rpm/el8/BUILD/ceph-17.2.5/src/mds/journal.cc:
> 1625: FAILED ceph_assert(g_conf()->mds_wipe_sessions)
>
> ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy
> (stable)
> 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x135) [0x7fccd759943f]
> 2: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605]
> 3: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c)
> [0x55fb2b98e89c]
> 4: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0]
> 5: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443]
> 6: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31]
> 7: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca]
> 8: clone()
To work around this (for now) until the bug is fixed, set
mds_wipe_sessions = true
in ceph.conf and allow the MDS to transition to the `active` state. Once
it is active, flush the journal:
ceph tell mds.<> flush journal
and then you can safely remove the config.
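
If editing ceph.conf on a containerized deployment is awkward, the same
steps can be done through the config database. A rough sketch, not a
tested procedure; the daemon name mds01.ceph05.pqxmvt is taken from your
status output below, so substitute whichever MDS is crashing:

  # apply the workaround for all MDS daemons
  ceph config set mds mds_wipe_sessions true

  # restart the crashed MDS and watch it until it reaches up:active
  ceph orch daemon restart mds.mds01.ceph05.pqxmvt
  ceph tell mds.mds01.ceph05.pqxmvt status

  # once active, flush the journal and drop the workaround again
  ceph tell mds.mds01.ceph05.pqxmvt flush journal
  ceph config rm mds mds_wipe_sessions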
>
>
> and
>
>
>
> *** Caught signal (Aborted) **
> in thread 7fccc7153700 thread_name:md_log_replay
>
> ceph version 17.2.5 (98318ae89f1a893a6ded3a640405cdbb33e08757) quincy
> (stable)
> 1: /lib64/libpthread.so.0(+0x12cf0) [0x7fccd6593cf0]
> 2: gsignal()
> 3: abort()
> 4: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x18f) [0x7fccd7599499]
> 5: /usr/lib64/ceph/libceph-common.so.2(+0x269605) [0x7fccd7599605]
> 6: (EMetaBlob::replay(MDSRank*, LogSegment*, MDPeerUpdate*)+0x5e5c)
> [0x55fb2b98e89c]
> 7: (EUpdate::replay(MDSRank*)+0x40) [0x55fb2b98f5a0]
> 8: (MDLog::_replay_thread()+0x9b3) [0x55fb2b915443]
> 9: (MDLog::ReplayThread::entry()+0x11) [0x55fb2b5d1e31]
> 10: /lib64/libpthread.so.0(+0x81ca) [0x7fccd65891ca]
> 11: clone()
> NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
>
> This is what I found in the logs. Since it's referring to log replay,
> could this be related to my issue?
>
> On 17.01.23 10:54, Thomas Widhalm wrote:
> > Hi again,
> >
> > Another thing I found: out of pure desperation, I started MDS on all
> > nodes. I had them configured in the past, so I was hoping they could
> > help with bringing in missing data even though they had been down for
> > quite a while now. I didn't see any changes in the logs, but the CPU
> > on the hosts that usually don't run MDS spiked so high that I had to
> > kill the MDS again, because otherwise they kept killing OSD
> > containers. So I don't really have any new information, but maybe
> > that could be a hint of some kind?
> >
> > Cheers,
> > Thomas
> >
> > On 17.01.23 10:13, Thomas Widhalm wrote:
> >> Hi,
> >>
> >> Thanks again. :-)
> >>
> >> Ok, that seems like an error to me. I never configured an extra rank for
> >> MDS. Maybe that's where my knowledge failed me, but I guess MDS is
> >> waiting for something that was never there.
> >>
> >> Yes, there are two filesystems. Due to "budget restrictions" (it's my
> >> personal system at home), I configured a second CephFS with only one
> >> replica for data that could be easily restored.
> >>
> >> Here's what I got when turning up the debug level:
> >>
> >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
> >> interval 1.000000000s
> >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
> >> Sending beacon up:replay seq 11107
> >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
> >> sender thread waiting interval 4s
> >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
> >> received beacon reply up:replay seq 11107 rtt 0.00200002
> >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
> >> Jan 17 10:08:17 ceph05 ceph-mds[1209]: mds.0.158167
> >> schedule_update_timer_task
> >> Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total
> >> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps,
> >> 0 caps, 0 caps per inode
> >> Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
> >> trimming
> >> Jan 17 10:08:18 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
> >> interval 1.000000000s
> >> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total
> >> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps,
> >> 0 caps, 0 caps per inode
> >> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
> >> trimming
> >> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
> >> interval 1.000000000s
> >> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
> >> Jan 17 10:08:19 ceph05 ceph-mds[1209]: mds.0.158167
> >> schedule_update_timer_task
> >> Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total
> >> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps,
> >> 0 caps, 0 caps per inode
> >> Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
> >> trimming
> >> Jan 17 10:08:20 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
> >> interval 1.000000000s
> >> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total
> >> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps,
> >> 0 caps, 0 caps per inode
> >> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
> >> trimming
> >> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
> >> interval 1.000000000s
> >> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
> >> Sending beacon up:replay seq 11108
> >> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
> >> sender thread waiting interval 4s
> >> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
> >> received beacon reply up:replay seq 11108 rtt 0.00200002
> >> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
> >> Jan 17 10:08:21 ceph05 ceph-mds[1209]: mds.0.158167
> >> schedule_update_timer_task
> >> Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total
> >> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps,
> >> 0 caps, 0 caps per inode
> >> Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
> >> trimming
> >> Jan 17 10:08:22 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
> >> interval 1.000000000s
> >> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total
> >> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps,
> >> 0 caps, 0 caps per inode
> >> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
> >> trimming
> >> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
> >> interval 1.000000000s
> >> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
> >> Jan 17 10:08:23 ceph05 ceph-mds[1209]: mds.0.158167
> >> schedule_update_timer_task
> >> Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total
> >> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps,
> >> 0 caps, 0 caps per inode
> >> Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
> >> trimming
> >> Jan 17 10:08:24 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
> >> interval 1.000000000s
> >> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total
> >> 372640, rss 57628, heap 207124, baseline 182548, 0 / 3 inodes have caps,
> >> 0 caps, 0 caps per inode
> >> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
> >> trimming
> >> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
> >> interval 1.000000000s
> >> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
> >> Sending beacon up:replay seq 11109
> >> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
> >> sender thread waiting interval 4s
> >> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.beacon.mds01.ceph05.pqxmvt
> >> received beacon reply up:replay seq 11109 rtt 0.00600006
> >> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
> >> Jan 17 10:08:25 ceph05 ceph-mds[1209]: mds.0.158167
> >> schedule_update_timer_task
> >> Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total
> >> 372640, rss 57344, heap 207124, baseline 182548, 0 / 3 inodes have caps,
> >> 0 caps, 0 caps per inode
> >> Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
> >> trimming
> >> Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache releasing free memory
> >> Jan 17 10:08:26 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
> >> interval 1.000000000s
> >> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total
> >> 372640, rss 57272, heap 207124, baseline 182548, 0 / 3 inodes have caps,
> >> 0 caps, 0 caps per inode
> >> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
> >> trimming
> >> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
> >> interval 1.000000000s
> >> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.158167 get_task_status
> >> Jan 17 10:08:27 ceph05 ceph-mds[1209]: mds.0.158167
> >> schedule_update_timer_task
> >> Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache Memory usage: total
> >> 372640, rss 57040, heap 207124, baseline 182548, 0 / 3 inodes have caps,
> >> 0 caps, 0 caps per inode
> >> Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache cache not ready for
> >> trimming
> >> Jan 17 10:08:28 ceph05 ceph-mds[1209]: mds.0.cache upkeep thread waiting
> >> interval 1.000000000s
> >>
> >>
> >> The only thing that gives me hope here is that the line
> >> mds.beacon.mds01.ceph05.pqxmvt Sending beacon up:replay seq 11109 is
> >> changing its sequence number.
> >>
> >> Anything else I can provide?
> >>
> >> Cheers,
> >> Thomas
> >>
> >> On 17.01.23 06:27, Kotresh Hiremath Ravishankar wrote:
> >>> Hi Thomas,
> >>>
> >>> Sorry, I misread the mds state as stuck in 'up:resolve'. The
> >>> mds is actually stuck in 'up:replay', which means the MDS is taking
> >>> over a failed rank.
> >>> In this state the MDS is recovering its journal and other metadata.
> >>>
> >>> I notice that there are two filesystems, 'cephfs' and 'cephfs_insecure',
> >>> and the active mds for both filesystems is stuck in 'up:replay'. The
> >>> mds logs shared are not providing enough information to infer anything.
> >>>
> >>> Could you please enable the debug logs and pass on the mds logs?
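
For reference, a minimal sketch of how that debug logging could be raised
cluster-wide via the config database (assuming verbose logging on all MDS
daemons is acceptable while the logs are collected):

  # raise MDS debug verbosity (20 is very chatty)
  ceph config set mds debug_mds 20
  ceph config set mds debug_ms 1

  # ... let the MDS sit in up:replay for a while, then collect its logs ...

  # afterwards, drop the verbosity back to the defaults
  ceph config rm mds debug_mds
  ceph config rm mds debug_ms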
> >>>
> >>> Thanks,
> >>> Kotresh H R
> >>>
> >>> On Mon, Jan 16, 2023 at 2:38 PM Thomas Widhalm
> >>> <[email protected] <mailto:[email protected]>> wrote:
> >>>
> >>> Hi Kotresh,
> >>>
> >>> Thanks for your reply!
> >>>
> >>> I only have one rank. Here's the output of all MDS I have:
> >>>
> >>> ###################
> >>>
> >>> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph05.pqxmvt status
> >>> 2023-01-16T08:55:26.055+0000 7f3412ffd700 0 client.61249926
> >>> ms_handle_reset on v2:192.168.23.65:6800/2680651694
> >>> 2023-01-16T08:55:26.084+0000 7f3412ffd700 0 client.61299199
> >>> ms_handle_reset on v2:192.168.23.65:6800/2680651694
> >>> {
> >>> "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4",
> >>> "whoami": 0,
> >>> "id": 60984167,
> >>> "want_state": "up:replay",
> >>> "state": "up:replay",
> >>> "fs_name": "cephfs",
> >>> "replay_status": {
> >>> "journal_read_pos": 0,
> >>> "journal_write_pos": 0,
> >>> "journal_expire_pos": 0,
> >>> "num_events": 0,
> >>> "num_segments": 0
> >>> },
> >>> "rank_uptime": 150224.982558844,
> >>> "mdsmap_epoch": 143757,
> >>> "osdmap_epoch": 12395,
> >>> "osdmap_epoch_barrier": 0,
> >>> "uptime": 150225.39968057699
> >>> }
> >>>
> >>> ########################
> >>>
> >>> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph04.cvdhsx status
> >>> 2023-01-16T08:59:05.434+0000 7fdb82ff5700 0 client.61299598
> >>> ms_handle_reset on v2:192.168.23.64:6800/3930607515
> >>> 2023-01-16T08:59:05.466+0000 7fdb82ff5700 0 client.61299604
> >>> ms_handle_reset on v2:192.168.23.64:6800/3930607515
> >>> {
> >>> "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4",
> >>> "whoami": 0,
> >>> "id": 60984134,
> >>> "want_state": "up:replay",
> >>> "state": "up:replay",
> >>> "fs_name": "cephfs_insecure",
> >>> "replay_status": {
> >>> "journal_read_pos": 0,
> >>> "journal_write_pos": 0,
> >>> "journal_expire_pos": 0,
> >>> "num_events": 0,
> >>> "num_segments": 0
> >>> },
> >>> "rank_uptime": 150450.96934037199,
> >>> "mdsmap_epoch": 143815,
> >>> "osdmap_epoch": 12395,
> >>> "osdmap_epoch_barrier": 0,
> >>> "uptime": 150451.93533502301
> >>> }
> >>>
> >>> ###########################
> >>>
> >>> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph06.wcfdom status
> >>> 2023-01-16T08:59:28.572+0000 7f16538c0b80 -1 client.61250376
> >>> resolve_mds: no MDS daemons found by name `mds01.ceph06.wcfdom'
> >>> 2023-01-16T08:59:28.583+0000 7f16538c0b80 -1 client.61250376 FSMap:
> >>> cephfs:1/1 cephfs_insecure:1/1
> >>>
> >>> {cephfs:0=mds01.ceph05.pqxmvt=up:replay,cephfs_insecure:0=mds01.ceph04.cvdhsx=up:replay}
> >>> 2 up:standby
> >>> Error ENOENT: problem getting command descriptions from
> >>> mds.mds01.ceph06.wcfdom
> >>>
> >>> ############################
> >>>
> >>> [ceph: root@ceph06 /]# ceph tell mds.mds01.ceph07.omdisd status
> >>> 2023-01-16T09:00:02.802+0000 7fb7affff700 0 client.61250454
> >>> ms_handle_reset on v2:192.168.23.67:6800/942898192
> >>> 2023-01-16T09:00:02.831+0000 7fb7affff700 0 client.61299751
> >>> ms_handle_reset on v2:192.168.23.67:6800/942898192
> >>> {
> >>> "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4",
> >>> "whoami": -1,
> >>> "id": 60984161,
> >>> "want_state": "up:standby",
> >>> "state": "up:standby",
> >>> "mdsmap_epoch": 97687,
> >>> "osdmap_epoch": 0,
> >>> "osdmap_epoch_barrier": 0,
> >>> "uptime": 150508.29091721401
> >>> }
> >>>
> >>>     The error message from ceph06 is new to me. That didn't happen the
> >>>     previous times.
> >>>
> >>> [ceph: root@ceph06 /]# ceph fs dump
> >>> e143850
> >>> enable_multiple, ever_enabled_multiple: 1,1
> >>> default compat: compat={},rocompat={},incompat={1=base
> >>> v0.20,2=client
> >>> writeable ranges,3=default file layouts on dirs,4=dir inode in
> >>> separate
> >>> object,5=mds uses versioned encoding,6=dirfrag is stored in
> >>> omap,8=no
> >>> anchor table,9=file layout v2,10=snaprealm v2}
> >>> legacy client fscid: 2
> >>>
> >>> Filesystem 'cephfs' (2)
> >>> fs_name cephfs
> >>> epoch 143850
> >>> flags 12 joinable allow_snaps allow_multimds_snaps
> >>> created 2023-01-14T14:30:05.723421+0000
> >>> modified 2023-01-16T09:00:53.663007+0000
> >>> tableserver 0
> >>> root 0
> >>> session_timeout 60
> >>> session_autoclose 300
> >>> max_file_size 1099511627776
> >>> required_client_features {}
> >>> last_failure 0
> >>> last_failure_osd_epoch 12321
> >>> compat compat={},rocompat={},incompat={1=base v0.20,2=client
> >>> writeable
> >>> ranges,3=default file layouts on dirs,4=dir inode in separate
> >>> object,5=mds uses versioned encoding,6=dirfrag is stored in
> >>> omap,7=mds
> >>> uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> >>> max_mds 1
> >>> in 0
> >>> up {0=60984167}
> >>> failed
> >>> damaged
> >>> stopped
> >>> data_pools [4]
> >>> metadata_pool 5
> >>> inline_data disabled
> >>> balancer
> >>> standby_count_wanted 1
> >>> [mds.mds01.ceph05.pqxmvt{0:60984167} state up:replay seq 37637 addr
> >>>     [v2:192.168.23.65:6800/2680651694,v1:192.168.23.65:6801/2680651694]
> >>>     compat {c=[1],r=[1],i=[7ff]}]
> >>>
> >>>
> >>> Filesystem 'cephfs_insecure' (3)
> >>> fs_name cephfs_insecure
> >>> epoch 143849
> >>> flags 12 joinable allow_snaps allow_multimds_snaps
> >>> created 2023-01-14T14:22:46.360062+0000
> >>> modified 2023-01-16T09:00:52.632163+0000
> >>> tableserver 0
> >>> root 0
> >>> session_timeout 60
> >>> session_autoclose 300
> >>> max_file_size 1099511627776
> >>> required_client_features {}
> >>> last_failure 0
> >>> last_failure_osd_epoch 12319
> >>> compat compat={},rocompat={},incompat={1=base v0.20,2=client
> >>> writeable
> >>> ranges,3=default file layouts on dirs,4=dir inode in separate
> >>> object,5=mds uses versioned encoding,6=dirfrag is stored in
> >>> omap,7=mds
> >>> uses inline data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> >>> max_mds 1
> >>> in 0
> >>> up {0=60984134}
> >>> failed
> >>> damaged
> >>> stopped
> >>> data_pools [7]
> >>> metadata_pool 6
> >>> inline_data disabled
> >>> balancer
> >>> standby_count_wanted 1
> >>> [mds.mds01.ceph04.cvdhsx{0:60984134} state up:replay seq 37639 addr
> >>>     [v2:192.168.23.64:6800/3930607515,v1:192.168.23.64:6801/3930607515]
> >>>     compat {c=[1],r=[1],i=[7ff]}]
> >>>
> >>>
> >>> Standby daemons:
> >>>
> >>> [mds.mds01.ceph07.omdisd{-1:60984161} state up:standby seq 2 addr
> >>>     [v2:192.168.23.67:6800/942898192,v1:192.168.23.67:6800/942898192]
> >>>     compat {c=[1],r=[1],i=[7ff]}]
> >>> [mds.mds01.ceph06.hsuhqd{-1:60984828} state up:standby seq 1 addr
> >>>     [v2:192.168.23.66:6800/4259514518,v1:192.168.23.66:6801/4259514518]
> >>> compat {c=[1],r=[1],i=[7ff]}]
> >>> dumped fsmap epoch 143850
> >>>
> >>> #############################
> >>>
> >>> [ceph: root@ceph06 /]# ceph fs status
> >>>
> >>> (doesn't come back)
> >>>
> >>> #############################
> >>>
> >>>     All MDS show log lines similar to these:
> >>>
> >>> Jan 16 10:05:00 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx
> >>> Updating
> >>> MDS map to version 143927 from mon.1
> >>> Jan 16 10:05:05 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx
> >>> Updating
> >>> MDS map to version 143929 from mon.1
> >>> Jan 16 10:05:09 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx
> >>> Updating
> >>> MDS map to version 143930 from mon.1
> >>> Jan 16 10:05:13 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx
> >>> Updating
> >>> MDS map to version 143931 from mon.1
> >>> Jan 16 10:05:20 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx
> >>> Updating
> >>> MDS map to version 143933 from mon.1
> >>> Jan 16 10:05:24 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx
> >>> Updating
> >>> MDS map to version 143935 from mon.1
> >>> Jan 16 10:05:29 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx
> >>> Updating
> >>> MDS map to version 143936 from mon.1
> >>> Jan 16 10:05:33 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx
> >>> Updating
> >>> MDS map to version 143937 from mon.1
> >>> Jan 16 10:05:40 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx
> >>> Updating
> >>> MDS map to version 143939 from mon.1
> >>> Jan 16 10:05:44 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx
> >>> Updating
> >>> MDS map to version 143941 from mon.1
> >>> Jan 16 10:05:49 ceph04 ceph-mds[1311]: mds.mds01.ceph04.cvdhsx
> >>> Updating
> >>> MDS map to version 143942 from mon.1
> >>>
> >>>     Anything else I can provide?
> >>>
> >>> Cheers and thanks again!
> >>> Thomas
> >>>
> >>> On 16.01.23 06:01, Kotresh Hiremath Ravishankar wrote:
> >>> > Hi Thomas,
> >>> >
> >>>      > As the documentation says, the MDS enters up:resolve from
> >>>      > 'up:replay' if the Ceph file system has multiple ranks
> >>>      > (including this one), i.e. it's not a single active MDS cluster.
> >>>      > The MDS is resolving any uncommitted inter-MDS operations. All
> >>>      > ranks in the file system must be in this state or later for
> >>>      > progress to be made, i.e. no rank can be failed/damaged or
> >>>      > 'up:replay'.
> >>>      >
> >>>      > So please check whether the other active mds has failed.
> >>>      >
> >>>      > Also please share the mds logs and the output of 'ceph fs dump'
> >>>      > and 'ceph fs status'.
> >>> >
> >>> > Thanks,
> >>> > Kotresh H R
> >>> >
> >>> > On Sat, Jan 14, 2023 at 9:07 PM Thomas Widhalm
> >>> > <[email protected] <mailto:[email protected]>
> >>> <mailto:[email protected]
> >>> <mailto:[email protected]>>> wrote:
> >>> >
> >>> > Hi,
> >>> >
> >>>      > I'm really lost with my Ceph system. I built a small cluster
> >>>      > for home usage which serves two purposes for me: I want to
> >>>      > replace an old NAS, and I want to learn about Ceph so that I
> >>>      > have hands-on experience. We're using it in our company, but I
> >>>      > need some real-life experience without risking any company or
> >>>      > customer data. That's my preferred way of learning.
> >>> >
> >>>      > The cluster consists of 3 Raspberry Pis plus a few VMs running
> >>>      > on Proxmox. I'm not using Proxmox's built-in Ceph because I
> >>>      > want to focus on Ceph and not just use it as a preconfigured
> >>>      > tool.
> >>> >
> >>>      > All hosts are running Fedora (x86_64 and arm64), and during an
> >>>      > upgrade from F36 to F37 my cluster suddenly showed all PGs as
> >>>      > unavailable. I worked nearly a week to get it back online, and
> >>>      > I learned a lot about Ceph management and recovery. The cluster
> >>>      > is back, but I still can't access my data. Maybe you can help
> >>>      > me?
> >>> >
> >>> > Here are my versions:
> >>> >
> >>> > [ceph: root@ceph04 /]# ceph versions
> >>> > {
> >>> > "mon": {
> >>> > "ceph version 17.2.5
> >>> > (98318ae89f1a893a6ded3a640405cdbb33e08757)
> >>> > quincy (stable)": 3
> >>> > },
> >>> > "mgr": {
> >>> > "ceph version 17.2.5
> >>> > (98318ae89f1a893a6ded3a640405cdbb33e08757)
> >>> > quincy (stable)": 3
> >>> > },
> >>> > "osd": {
> >>> > "ceph version 17.2.5
> >>> > (98318ae89f1a893a6ded3a640405cdbb33e08757)
> >>> > quincy (stable)": 5
> >>> > },
> >>> > "mds": {
> >>> > "ceph version 17.2.5
> >>> > (98318ae89f1a893a6ded3a640405cdbb33e08757)
> >>> > quincy (stable)": 4
> >>> > },
> >>> > "overall": {
> >>> > "ceph version 17.2.5
> >>> > (98318ae89f1a893a6ded3a640405cdbb33e08757)
> >>> > quincy (stable)": 15
> >>> > }
> >>> > }
> >>> >
> >>> >
> >>>      > Here's the MDS status output of one MDS:
> >>>      > [ceph: root@ceph04 /]# ceph tell mds.mds01.ceph05.pqxmvt status
> >>> > 2023-01-14T15:30:28.607+0000 7fb9e17fa700 0 client.60986454
> >>> > ms_handle_reset on v2:192.168.23.65:6800/2680651694
> >>> > 2023-01-14T15:30:28.640+0000 7fb9e17fa700 0 client.60986460
> >>> > ms_handle_reset on v2:192.168.23.65:6800/2680651694
> >>> > {
> >>> > "cluster_fsid": "ff6e50de-ed72-11ec-881c-dca6325c2cc4",
> >>> > "whoami": 0,
> >>> > "id": 60984167,
> >>> > "want_state": "up:replay",
> >>> > "state": "up:replay",
> >>> > "fs_name": "cephfs",
> >>> > "replay_status": {
> >>> > "journal_read_pos": 0,
> >>> > "journal_write_pos": 0,
> >>> > "journal_expire_pos": 0,
> >>> > "num_events": 0,
> >>> > "num_segments": 0
> >>> > },
> >>> > "rank_uptime": 1127.54018615,
> >>> > "mdsmap_epoch": 98056,
> >>> > "osdmap_epoch": 12362,
> >>> > "osdmap_epoch_barrier": 0,
> >>> > "uptime": 1127.957307273
> >>> > }
> >>> >
> >>>      > It's staying like that for days now. If there was a counter
> >>>      > moving, I would just wait, but it doesn't change anything, and
> >>>      > all the stats say the MDS aren't working at all.
> >>> >
> >>>      > The symptom I have is that the Dashboard and all other tools I
> >>>      > use say it's more or less ok (some old messages about failed
> >>>      > daemons and scrubbing aside). But I can't mount anything. When
> >>>      > I try to start a VM that's on RBD, I just get a timeout. And
> >>>      > when I try to mount a CephFS, mount just hangs forever.
> >>> >
> >>>      > Whatever command I give the MDS or the journal, it just hangs.
> >>>      > The only thing I could do was take all CephFS offline, kill
> >>>      > the MDSs and do a "ceph fs reset <fs name>
> >>>      > --yes-i-really-mean-it". After that I rebooted all nodes, just
> >>>      > to be sure, but I still have no access to data.
> >>> >
> >>> > Could you please help me? I'm kinda desperate. If you need
> >>> any more
> >>> > information, just let me know.
> >>> >
> >>> > Cheers,
> >>> > Thomas
> >>> >
>
> --
> Thomas Widhalm
> Lead Systems Engineer
>
> NETWAYS Professional Services GmbH | Deutschherrnstr. 15-19 | D-90429
> Nuernberg
> Tel: +49 911 92885-0 | Fax: +49 911 92885-77
> CEO: Julian Hein, Bernd Erk | AG Nuernberg HRB34510
> https://www.netways.de | [email protected]
>
> ** stackconf 2023 - September - https://stackconf.eu **
> ** OSMC 2023 - November - https://osmc.de **
> ** New at NWS: Managed Database - https://nws.netways.de/managed-database **
> ** NETWAYS Web Services - https://nws.netways.de **
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
--
Cheers,
Venky
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]