I have run the status and stat commands; the output is below.
ceph -s
  cluster:
    id:     7b3a4952-2131-11ec-94ce-0cc47a5ec98a
    health: HEALTH_WARN
            2 failed cephadm daemon(s)
            1 filesystem is degraded
            insufficient standby MDS daemons available
            7 daemons have recently crashed

  services:
    mon: 3 daemons, quorum strg-node1,strg-node2,strg-node3 (age 20h)
    mgr: strg-node2.unyimy(active, since 20h), standbys: strg-node1.ivkfid
    mds: 1/1 daemons up
    osd: 32 osds: 32 up (since 20h), 32 in (since 10w)

  data:
    volumes: 0/1 healthy, 1 recovering
    pools:   3 pools, 321 pgs
    objects: 15.49M objects, 54 TiB
    usage:   109 TiB used, 66 TiB / 175 TiB avail
    pgs:     317 active+clean
             4   active+clean+scrubbing+deep
ceph mds stat
mumstrg:1/1 {0=mumstrg.strg-node1.gchapr=up:replay(laggy or crashed)}
ceph osd lspools
1 device_health_metrics
2 cephfs.mumstrg.meta
3 cephfs.mumstrg.data
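
Pool 3 (cephfs.mumstrg.data) is listed by lspools, so I am not sure why the
MDS log says "data pool 3 not found in OSDMap". If it helps, these are the
read-only commands I can run next to gather more detail on the degraded
filesystem and the failed daemons (standard ceph CLI, nothing that changes
cluster state):

ceph health detail
ceph fs status mumstrg
ceph orch ps --daemon-type mds
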
On Thu, Apr 17, 2025 at 10:33 AM Eugen Block <[email protected]> wrote:
> What’s your overall Ceph status? It says data pool 3 not found.
>
> Quoting Amudhan P <[email protected]>:
>
> > There are a few more log lines from the MDS. I have highlighted the
> > lines that I am not sure about.
> >
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -79>
> > 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 asok(0x560a2c44e000)
> > register_command dump inode hook 0x560a2c354580
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -78>
> > 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 asok(0x560a2c44e000)
> > register_command exit hook 0x560a2c354580
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -77>
> > 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 asok(0x560a2c44e000)
> > register_command respawn hook 0x560a2c354580
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -76>
> > 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 asok(0x560a2c44e000)
> > register_command heap hook 0x560a2c354580
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -75>
> > 2025-04-16T14:43:59.170+0000 7f74b5030700 1 mds.mumstrg.strg-node3.xhxbwx
> > Updating MDS map to version 127517 from mon.2
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -74>
> > 2025-04-16T14:43:59.170+0000 7f74c6ecd780 5 asok(0x560a2c44e000)
> > register_command cpu_profiler hook 0x560a2c35458
> >
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -73>
> > 2025-04-16T14:43:59.170+0000 7f74b302c700 5
> > mds.beacon.mumstrg.strg-node3.xhxbwx Sending beacon up:boot seq 1
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -72>
> > 2025-04-16T14:43:59.170+0000 7f74b302c700 10 monclient: _send_mon_message
> > to mon.strg-node3 at v2:10.0.103.3:3300/
> >
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -71>
> > 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.mumstrg.strg-node3.xhxbwx
> > Updating MDS map to version 127518 from mon.2
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -70>
> > 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _renew_subs
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -69>
> > 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _send_mon_message
> > to mon.strg-node3 at v2:10.0.103.3:3300/
> >
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -68>
> > 2025-04-16T14:43:59.254+0000 7f74b5030700 4 mds.0.purge_queue operator():
> > data pool 3 not found in OSDMap
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -67>
> > 2025-04-16T14:43:59.254+0000 7f74b5030700 5 asok(0x560a2c44e000)
> > register_command objecter_requests hook 0x560a2c3544c0
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -66>
> > 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _renew_subs
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -65>
> > 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _send_mon_message
> > to mon.strg-node3 at v2:10.0.103.3:3300/
> >
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -64>
> > 2025-04-16T14:43:59.254+0000 7f74b5030700 10 log_channel(cluster)
> > update_config to_monitors: true to_syslog: false
> > syslog_facility: daemon prio: info to_graylog: false graylog_host:
> > 127.0.0.1 graylog_port: 12201)
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -63>
> > 2025-04-16T14:43:59.254+0000 7f74b5030700 4 mds.0.purge_queue operator():
> > data pool 3 not found in OSDMap
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -62>
> > 2025-04-16T14:43:59.254+0000 7f74b5030700 4 mds.0.0 handle_osd_map epoch
> > 0, 0 new blocklist entries
> > Apr 16 20:13:59 strg-node3 bash[59114]: debug -61>
> > 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.0.127518 handle_mds_map
> > i am now mds.0.127518
> >
> >> Apr 16 20:13:59 strg-node3 bash[59114]: debug -60>
> >> 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.0.127518 handle_mds_map
> >> state change up:boot --> up:replay
> >> Apr 16 20:13:59 strg-node3 bash[59114]: debug -59>
> >> 2025-04-16T14:43:59.254+0000 7f74b5030700 5
> >> mds.beacon.mummasstrg.strg-node3.xhxbwx set_want_state: up:boot ->
> >> up:replay
> >> Apr 16 20:13:59 strg-node3 bash[59114]: debug -58>
> >> 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.0.127518 replay_start
> >> *Apr 16 20:13:59 strg-node3 bash[59114]: debug -57>
> >> 2025-04-16T14:43:59.254+0000 7f74b5030700 1 mds.0.127518 waiting for
> >> osdmap 45749 (which blocklists prior instance)*
> >> Apr 16 20:13:59 strg-node3 bash[59114]: debug -56>
> >> 2025-04-16T14:43:59.254+0000 7f74b5030700 10 monclient: _send_mon_message
> >> to mon.strg-node3 at v2:10.0.103.3:3300/0
> >> *Apr 16 20:13:59 strg-node3 bash[59114]: debug -55>
> >> 2025-04-16T14:43:59.254+0000 7f74b5030700 4 mds.0.purge_queue operator():
> >> data pool 3 not found in OSDMap*
> >>
> >
> >
> > On Thu, Apr 17, 2025 at 7:06 AM Amudhan P <[email protected]> wrote:
> >
> >> Eugen,
> >>
> >> This is the output for the commands:
> >>
> >> cephfs-journal-tool --rank=mumstrg:all --journal=purge_queue journal inspect
> >> Overall journal integrity: OK
> >>
> >> cephfs-journal-tool --rank=mumstrg:all --journal=mdlog journal inspect
> >> Overall journal integrity: OK
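> >>
> >> In case it is useful, I can also dump the journal headers (same tool,
> >> read-only; a minimal sketch assuming rank 0):
> >>
> >> cephfs-journal-tool --rank=mumstrg:0 header get
> >> cephfs-journal-tool --rank=mumstrg:0 --journal=purge_queue header get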
> >>
> >> On Thu, Apr 17, 2025 at 2:59 AM Eugen Block <[email protected]> wrote:
> >>
> >>> I think either your mdlog or the purge_queue journal is corrupted:
> >>>
> >>> 2025-04-16T09:59:30.146+0000 7f43cf872700 2 mds.0.127506 Booting: 2:
> >>> waiting for purge queue recovered
> >>> Apr 16 15:29:30 strg-node4 bash[7566]: debug -1>
> >>> 2025-04-16T09:59:30.146+0000 7f43d9085700 10 monclient: get_auth_request
> >>> con 0x562856a25400 auth_method 0
> >>> Apr 16 15:29:30 strg-node4 bash[7566]: debug 0>
> >>> 2025-04-16T09:59:30.230+0000 7f43ce06f700 -1 *** Caught signal
> >>> (Segmentation fault) **
> >>> Apr 16 15:29:30 strg-node4 bash[7566]: in thread 7f43ce06f700
> >>> thread_name:md_log_replay
> >>>
> >>> Can you paste the output of these commands?
> >>>
> >>> cephfs-journal-tool --rank={YOUR_CEPH_FS}:all --journal=purge_queue
> >>> journal inspect
> >>> cephfs-journal-tool --rank={YOUR_CEPH_FS}:all --journal=mdlog journal
> >>> inspect
> >>>
> >>> I expect one or more damaged entries. Check this thread for more details:
> >>>
> >>> https://www.spinics.net/lists/ceph-users/msg80124.html
> >>>
> >>> You should try to back up the journal first; in my case that wasn't
> >>> possible, so I had no choice but to reset it.
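> >>>
> >>> If it comes to that, roughly these steps (a sketch adapted from the
> >>> CephFS disaster-recovery docs, not a drop-in recipe; rank 0 and the
> >>> filesystem name are assumptions for your setup, and the MDS daemons
> >>> must be stopped first):
> >>>
> >>> # back up both journals before touching anything
> >>> cephfs-journal-tool --rank=mumstrg:0 journal export mdlog-backup.bin
> >>> cephfs-journal-tool --rank=mumstrg:0 --journal=purge_queue journal export pq-backup.bin
> >>> # salvage what the mdlog still holds, then reset it
> >>> cephfs-journal-tool --rank=mumstrg:0 event recover_dentries summary
> >>> cephfs-journal-tool --rank=mumstrg:0 journal reset
> >>> # only if the purge_queue turns out to be the damaged one
> >>> cephfs-journal-tool --rank=mumstrg:0 --journal=purge_queue journal reset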
> >>>
> >>> Quoting Amudhan P <[email protected]>:
> >>>
> >>> > Hi,
> >>> >
> >>> > I am having 2 problems with my Ceph cluster, version 16.2.6
> >>> > (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable), deployed
> >>> > through cephadm.
> >>> >
> >>> > First issue:
> >>> > 1 out of 3 mon services went out of quorum.
> >>> > When I restart the service it comes back to normal, but after a few
> >>> > minutes the ceph watch log reports slow ops and the mon goes out of
> >>> > quorum again.
> >>> > The node where this mon service failed showed one weird thing: about
> >>> > 40% wait in the top command. But I don't see any error in dmesg or
> >>> > anything related to drive I/O errors.
> >>> > Below are the log lines printed by the ceph watch command.
> >>> >
> >>> > 2025-04-16T09:30:00.000393+0530 mon.strg-node2 [WRN] [WRN] MON_DOWN:
> >>> > 1/3 mons down, quorum strg-node2,strg-node3
> >>> > 2025-04-16T09:30:00.000416+0530 mon.strg-node2 [WRN] mon.strg-node1
> >>> > (rank 0) addr [v2:10.0.103.1:3300/0,v1:10.0.103.1:6789/0] is down
> >>> > (out of quorum)
> >>> >
> >>> > For now this issue has not reappeared.
> >>> >
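> >>> > If it does reappear, these are the checks I would run next on the
> >>> > affected node (standard tools; /dev/sdX is a placeholder for the
> >>> > mon's disk):
> >>> >
> >>> > ceph quorum_status --format json-pretty
> >>> > iostat -x 1 /dev/sdX   # confirm whether the ~40% wait is disk latency
> >>> > dmesg -T | tail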
> >>> >
> >>> > Second issue, CephFS degraded:
> >>> > I have 2 MDS services on 2 different nodes. Both are in a stopped
> >>> > state.
> >>> > Output of the ceph -s command:
> >>> >
> >>> >   cluster:
> >>> >     id:     7b3a4952-2131-11ec-94ce-0cc47a5ec98a
> >>> >     health: HEALTH_WARN
> >>> >             2 failed cephadm daemon(s)
> >>> >             1 filesystem is degraded
> >>> >             insufficient standby MDS daemons available
> >>> >
> >>> >   services:
> >>> >     mon: 3 daemons, quorum strg-node1,strg-node2,strg-node3 (age 4h)
> >>> >     mgr: strg-node2.unyimy(active, since 4h), standbys: strg-node1.ivkfid
> >>> >     mds: 1/1 daemons up
> >>> >     osd: 32 osds: 32 up (since 4h), 32 in (since 10w)
> >>> >
> >>> >   data:
> >>> >     volumes: 0/1 healthy, 1 recovering
> >>> >     pools:   3 pools, 321 pgs
> >>> >     objects: 15.49M objects, 54 TiB
> >>> >     usage:   109 TiB used, 66 TiB / 175 TiB avail
> >>> >     pgs:     321 active+clean
> >>> >
> >>> > The volume shows recovering, but there has been no progress so far,
> >>> > and manually starting the mds service fails again. Under services,
> >>> > the ceph -s output shows the mds as up, but no mds service is
> >>> > actually running.
> >>> >
> >>> > Below is a log snippet from one of the mds services.
> >>> >
> >>> >
> >>> > -25> 2025-04-16T09:59:29.954+0000 7f43d0874700 1
> >>> > mds.0.journaler.pq(ro) _finish_read_head loghead(trim 13967032320, ex>
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -24>
> >>> > 2025-04-16T09:59:29.954+0000 7f43d0874700 1 mds.0.journaler.pq(ro)
> >>> > probing for end of the log
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -23>
> >>> > 2025-04-16T09:59:29.954+0000 7f43d9085700 10 monclient: get_auth_request
> >>> > con 0x562856a17400 auth_method 0
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -22>
> >>> > 2025-04-16T09:59:29.954+0000 7f43d8884700 10 monclient: get_auth_request
> >>> > con 0x562856a17c00 auth_method 0
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -21>
> >>> > 2025-04-16T09:59:29.974+0000 7f43cf071700 1 mds.0.journaler.mdlog(ro)
> >>> > recover start
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -20>
> >>> > 2025-04-16T09:59:29.974+0000 7f43cf071700 1 mds.0.journaler.mdlog(ro)
> >>> > read_head
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -19>
> >>> > 2025-04-16T09:59:29.974+0000 7f43cf071700 4 mds.0.log Waiting for
> >>> > journal 0x200 to recover...
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -18>
> >>> > 2025-04-16T09:59:29.974+0000 7f43d8083700 10 monclient: get_auth_request
> >>> > con 0x562856a25000 auth_method 0
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -17>
> >>> > 2025-04-16T09:59:29.998+0000 7f43d0874700 1 mds.0.journaler.pq(ro)
> >>> > _finish_probe_end write_pos = 13968309289 (hea>
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -16>
> >>> > 2025-04-16T09:59:29.998+0000 7f43d0874700 4 mds.0.purge_queue operator():
> >>> > open complete
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -15>
> >>> > 2025-04-16T09:59:29.998+0000 7f43d0874700 1 mds.0.journaler.pq(ro)
> >>> > set_writeable
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -14>
> >>> > 2025-04-16T09:59:29.998+0000 7f43cf872700 1 mds.0.journaler.mdlog(ro)
> >>> > _finish_read_head loghead(trim 189741504921>
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -13>
> >>> > 2025-04-16T09:59:29.998+0000 7f43cf872700 1 mds.0.journaler.mdlog(ro)
> >>> > probing for end of the log
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -12>
> >>> > 2025-04-16T09:59:30.002+0000 7f43d9085700 10 monclient: get_auth_request
> >>> > con 0x562856a25c00 auth_method 0
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -11>
> >>> > 2025-04-16T09:59:30.098+0000 7f43cf872700 1 mds.0.journaler.mdlog(ro)
> >>> > _finish_probe_end write_pos = 1897428915052>
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -10>
> >>> > 2025-04-16T09:59:30.098+0000 7f43cf071700 4 mds.0.log Journal 0x200
> >>> > recovered.
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -9>
> >>> > 2025-04-16T09:59:30.098+0000 7f43cf071700 4 mds.0.log Recovered journal
> >>> > 0x200 in format 1
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -8>
> >>> > 2025-04-16T09:59:30.098+0000 7f43cf071700 2 mds.0.127506 Booting: 1:
> >>> > loading/discovering base inodes
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -7>
> >>> > 2025-04-16T09:59:30.098+0000 7f43cf071700 0 mds.0.cache creating system
> >>> > inode with ino:0x100
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -6>
> >>> > 2025-04-16T09:59:30.098+0000 7f43cf071700 0 mds.0.cache creating system
> >>> > inode with ino:0x1
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -5>
> >>> > 2025-04-16T09:59:30.098+0000 7f43d8884700 10 monclient: get_auth_request
> >>> > con 0x562856a25800 auth_method 0
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -4>
> >>> > 2025-04-16T09:59:30.098+0000 7f43d8083700 10 monclient: get_auth_request
> >>> > con 0x562856a5dc00 auth_method 0
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -3>
> >>> > 2025-04-16T09:59:30.146+0000 7f43cf872700 2 mds.0.127506 Booting: 2:
> >>> > replaying mds log
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -2>
> >>> > 2025-04-16T09:59:30.146+0000 7f43cf872700 2 mds.0.127506 Booting: 2:
> >>> > waiting for purge queue recovered
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug -1>
> >>> > 2025-04-16T09:59:30.146+0000 7f43d9085700 10 monclient: get_auth_request
> >>> > con 0x562856a25400 auth_method 0
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: debug 0>
> >>> > 2025-04-16T09:59:30.230+0000 7f43ce06f700 -1 *** Caught signal
> >>> > (Segmentation fault) **
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: in thread 7f43ce06f700
> >>> > thread_name:md_log_replay
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: ceph version 16.2.6
> >>> > (ee28fb57e47e9f88813e24bbf4c14496ca299d31) pacific (stable)
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: 1: /lib64/libpthread.so.0(+0x12b20)
> >>> > [0x7f43dd293b20]
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: 2:
> >>> > /usr/lib64/ceph/libceph-common.so.2(+0x8ec7a00) [0x7f43e6ef3a00]
> >>> > Apr 16 15:29:30 strg-node4 bash[7566]: NOTE: a copy of the executable, or
> >>> > `objdump -rdS <executable>` is needed to interpret this.
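> >>> >
> >>> > Since ceph -s also reports daemons that recently crashed, the crash
> >>> > module should have captured this backtrace; it can be pulled out like
> >>> > this (the crash ID argument is a placeholder taken from the list):
> >>> >
> >>> > ceph crash ls
> >>> > ceph crash info <crash-id>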
> >>> >
> >>> >
> >>> > I am not sure what caused the issue, and I couldn't find any
> >>> > resources on how to fix it.
> >>> > I need help from someone to bring the ceph cluster back online.
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]