There isn't one specific thing I can point my finger at and say "_this_ is 
where all the pain comes from". Some of these issues are also of our own 
making. We have been getting too comfortable seeing the cluster in HEALTH_WARN 
with "1 clients failing to respond to capability release" and the like. Some 
of these are client bugs, some are MDS issues, and others are issues with HPC 
cluster job workflows. Getting comfortable driving with the check engine light 
on is how you end up with a freshly ventilated engine block.
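For anyone chasing the same warning, the usual starting point is to find the 
offending session on the active MDS. A rough sketch (the rank name and session 
id below are placeholders, not from a real incident):

```shell
# See exactly which clients the health warning refers to
ceph health detail

# List sessions on the active rank and look at the cap counts
# ("cephfs:0" is a placeholder rank for this example)
ceph tell mds.cephfs:0 session ls

# Last resort: evict a stuck client by session id
# (this blocklists the client by default -- use with care)
ceph tell mds.cephfs:0 client evict id=12345
```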

What I can talk about is that when we get into a state with a huge journal / 
lots of log segments and then experience a failure, the recovery is painful. 
Similar to what Lars ran into, there is a missing heartbeat in the recovery 
path. I can't remember off the top of my head which state it is in 
(replay, reconnect, rejoin, etc.). This alone has stopped our cluster from 
coming back automatically after an off-hours failure.
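The standard way to keep the monitors from failing an MDS that is grinding 
through a long recovery is to loosen the grace periods and "wedge it in a 
corner" so there is nowhere to fail over. A sketch only -- the values are 
illustrative, the daemon name is taken from this thread as an example, and 
the exact option names vary by release (see the CephFS troubleshooting docs):

```shell
# Give the MDS more time before its internal heartbeat is declared dead
ceph config set mds mds_heartbeat_grace 3600

# Give the monitors more time before they replace a silent MDS
ceph config set mon mds_beacon_grace 3600

# Stop the standby daemons so the rank cannot migrate mid-recovery
ceph orch daemon stop mds.cephfs.storage02.zopcif   # cephadm deployments
# systemctl stop ceph-mds@<name>                    # package deployments
```

Both grace settings should be reverted once the daemon reaches up:active.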

In recovery, memory usage is easily 3x what was being used at the time of the 
crash. Lots of swap is slow, but it at least gets you out of jail on this 
one. What can't be maneuvered around is how much of the recovery process is 
single threaded: one thread goes into overdrive while memory slowly goes over 
the edge. This being recovery, however, I can understand erring on the side 
of caution.
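For anyone budgeting for that 3x blow-up, the relevant knob is the cache 
target. A sketch (the size and rank name are illustrative):

```shell
# Cap the MDS cache so steady-state usage leaves headroom for recovery
ceph config set mds mds_cache_memory_limit 17179869184   # 16 GiB

# Compare actual vs. target cache usage on the active daemon
ceph tell mds.cephfs:0 cache status
```

Note that mds_cache_memory_limit is a target, not a hard limit -- the process 
can and will exceed it, which is why swap still matters here.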

In normal operation, we've been bitten by the finisher thread numerous times, 
specifically when removing a large number of empty directories while 
snapshots exist. That is what was going on when we hit our last outage, which 
left us with the huge journal and thus the painful recovery.
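Watching journal growth before a crash is the cheap insurance here. A sketch 
of the checks (rank name is a placeholder; exact warning text varies by 
release):

```shell
# Check journal integrity and extents for rank 0
cephfs-journal-tool --rank=cephfs:0 journal inspect

# See how far the expire position trails the write position
cephfs-journal-tool --rank=cephfs:0 header get

# A persistently growing gap eventually surfaces as a trim warning
ceph health detail | grep -i "behind on trimming" || true
```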


--

Paul Mezzanini
Platform Engineer III
Research Computing

Rochester Institute of Technology

 “End users is a description, not a goal.”





________________________________________
From: Milind Changire <[email protected]>
Sent: Sunday, January 7, 2024 10:54 PM
To: Paul Mezzanini
Cc: Lars Köppel; Patrick Donnelly; [email protected]
Subject: Re: [ceph-users] Re: mds crashes after up:replay state

Hi Paul,
Could you create a ceph tracker (tracker.ceph.com) and list out things
that are suboptimal according to your investigation?
We'd like to hear more on this.

Alternatively, you could list the issues with mds here.

Thanks,
Milind

On Sun, Jan 7, 2024 at 4:37 PM Paul Mezzanini <[email protected]> wrote:
>
> We've seen it use as much as 1.6 TB of RAM/swap. Swap makes it slow, but a 
> slow recovery is better than no recovery. My coworker looked into it at the 
> source code level, and while it is doing some things suboptimally, that's how 
> it's currently written.
>
> The MDS code needs some real love if ceph is going to offer file services 
> that can match what the back end storage can actually provide.
>
> --
>
> Paul Mezzanini
> Platform Engineer III
> Research Computing
>
> Rochester Institute of Technology
>
>  Sent from my phone, please excuse typos and brevity
> ________________________________
> From: Lars Köppel <[email protected]>
> Sent: Sunday, January 7, 2024 4:20:05 AM
> To: Paul Mezzanini <[email protected]>
> Cc: Patrick Donnelly <[email protected]>; [email protected] 
> <[email protected]>
> Subject: Re: [ceph-users] Re: mds crashes after up:replay state
>
> Hi Paul,
>
> your suggestion was correct. The mds went through the replay state and was 
> in the active state for a few minutes. But then it got killed because of 
> excessive memory consumption.
> @mds.cephfs.storage01.pgperp.service: Main process exited, code=exited, 
> status=137/n/a
> How can I raise the memory limit for the mds?
>
> From the looks of it in htop, there seems to be a memory leak: the process 
> consumed over 200 GB of memory while reporting that it actually used 20 - 30 
> GB.
> Is this possible?
>
> Best regards
> Lars
>
>
> [ariadne.ai Logo]       Lars Köppel
> Developer
> Email:  [email protected]
> Phone:  +49 6221 5993580
> ariadne.ai (Germany) GmbH
> Häusserstraße 3, 69115 Heidelberg
> Amtsgericht Mannheim, HRB 744040
> Geschäftsführer: Dr. Fabian Svara
> https://ariadne.ai
>
>
> On Sat, Jan 6, 2024 at 3:33 PM Paul Mezzanini <[email protected]> wrote:
> I'm replying from my phone, so hopefully this works well.  This sounds 
> suspiciously similar to an issue we have run into where there is an internal 
> loop in the MDS that doesn't have a heartbeat in it. If that loop runs for too 
> long, the daemon is marked as failed and the process jumps to another server 
> and starts again.
>
> We get around it by "wedging it in a corner" and removing the ability to 
> migrate. This is as simple as stopping all standby MDS services and just 
> waiting for the MDS to complete.
>
>
>
> --
>
> Paul Mezzanini
> Platform Engineer III
> Research Computing
>
> Rochester Institute of Technology
>
>  Sent from my phone, please excuse typos and brevity
> ________________________________
> From: Lars Köppel <[email protected]>
> Sent: Saturday, January 6, 2024 7:22:14 AM
> To: Patrick Donnelly <[email protected]>
> Cc: [email protected] <[email protected]>
> Subject: [ceph-users] Re: mds crashes after up:replay state
>
> Hi Patrick,
>
> thank you for your response.
> I already changed the mentioned settings, but I had no luck with this.
>
> The journal inspection I had running yesterday finished with: 'Overall
> journal integrity: OK'.
> So you are probably right that the mds is crashing shortly after the replay
> finished.
>
> I checked the logs, and every few seconds there is a new FSMap epoch without
> any visible changes. One of the current epochs is at the end. Is there
> anything useful in it?
>
> When the replay is finished, the running mds goes to the state
> 'up:reconnect' and, after a second, to the state 'up:rejoin'. After this
> there is no new fsmap for ~20 min, until this message pops up:
>
> > Jan 06 12:38:23 storage01 ceph-mds[223997]:
> > mds.beacon.cephfs.storage01.pgperp Skipping beacon heartbeat to monitors
> > (last acked 4.00012s ago); MDS internal heartbeat is not healthy!
> >
> A few seconds later (the heartbeat message is still there) a new fsmap is
> created with a new mds, now in replay state.
> The last of the heartbeat messages comes after 1446 seconds. Then it is gone,
> and no more warnings or errors are displayed at this point. One minute
> after the last message, the mds is back as a standby mds.
>
> > Jan 06 13:02:26 storage01 ceph-mds[223997]:
> > mds.beacon.cephfs.storage01.pgperp Skipping beacon heartbeat to monitors
> > (last acked 1446.6s ago); MDS internal heartbeat is not healthy!
> >
>
> Also, I cannot find any warning in the logs when the mds crashes. What
> could I do to find the cause of the crash?
>
> Best regards
> Lars
>
> e205510
> > enable_multiple, ever_enabled_multiple: 1,1
> > default compat: compat={},rocompat={},incompat={1=base v0.20,2=client
> > writeable ranges,3=default file layouts on dirs,4=dir inode in separate
> > object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
> > anchor table,9=file layout v2,10=snaprealm v2}
> > legacy client fscid: 3
> >
> > Filesystem 'cephfs' (3)
> > fs_name cephfs
> > epoch   205510
> > flags   32 joinable allow_snaps allow_multimds_snaps allow_standby_replay
> > created 2023-06-06T11:44:03.651905+0000
> > modified        2024-01-06T10:28:14.676738+0000
> > tableserver     0
> > root    0
> > session_timeout 60
> > session_autoclose       300
> > max_file_size   8796093022208
> > required_client_features        {}
> > last_failure    0
> > last_failure_osd_epoch  42962
> > compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> > ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
> > uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline
> > data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> > max_mds 1
> > in      0
> > up      {0=2178448}
> > failed
> > damaged
> > stopped
> > data_pools      [11,12]
> > metadata_pool   10
> > inline_data     disabled
> > balancer
> > standby_count_wanted    1
> > [mds.cephfs.storage01.pgperp{0:2178448} state up:replay seq 4484
> > join_fscid=3 addr [v2:
> > 192.168.0.101:6800/855849996,v1:192.168.0.101:6801/855849996]
> >  compat
> > {c=[1],r=[1],i=[7ff]}]
> >
> >
> > Filesystem 'cephfs_recovery' (4)
> > fs_name cephfs_recovery
> > epoch   193460
> > flags   13 allow_snaps allow_multimds_snaps
> > created 2024-01-05T10:47:32.224388+0000
> > modified        2024-01-05T16:43:37.677241+0000
> > tableserver     0
> > root    0
> > session_timeout 60
> > session_autoclose       300
> > max_file_size   1099511627776
> > required_client_features        {}
> > last_failure    0
> > last_failure_osd_epoch  42904
> > compat  compat={},rocompat={},incompat={1=base v0.20,2=client writeable
> > ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds
> > uses versioned encoding,6=dirfrag is stored in omap,7=mds uses inline
> > data,8=no anchor table,9=file layout v2,10=snaprealm v2}
> > max_mds 1
> > in      0
> > up      {}
> > failed
> > damaged 0
> > stopped
> > data_pools      [11,12]
> > metadata_pool   13
> > inline_data     disabled
> > balancer
> > standby_count_wanted    1
> >
> >
> > Standby daemons:
> >
> > [mds.cephfs.storage02.zopcif{-1:2356728} state up:standby seq 1
> > join_fscid=3 addr [v2:
> > 192.168.0.102:6800/3567764205,v1:192.168.0.102:6801/3567764205]
> >  compat
> > {c=[1],r=[1],i=[7ff]}]
> > dumped fsmap epoch 205510
> >
>
>
> [image: ariadne.ai Logo] Lars Köppel
> Developer
> Email: [email protected]
> Phone: +49 6221 5993580
> ariadne.ai (Germany) GmbH
> Häusserstraße 3, 69115 Heidelberg
> Amtsgericht Mannheim, HRB 744040
> Geschäftsführer: Dr. Fabian Svara
> https://ariadne.ai
>
>
> On Fri, Jan 5, 2024 at 7:52 PM Patrick Donnelly <[email protected]> wrote:
>
> > Hi Lars,
> >
> > On Fri, Jan 5, 2024 at 9:53 AM Lars Köppel <[email protected]> wrote:
> > >
> > > Hello everyone,
> > >
> > > we are running a small cluster with 3 nodes, 25 OSDs per node, and Ceph
> > > version 17.2.6.
> > > Recently the active mds crashed, and since then every newly started mds
> > > has been stuck in the up:replay state. In the output of the command
> > > 'ceph tell mds.cephfs:0 status' you can see that the journal is read in
> > > completely. As soon as it's finished, the mds crashes and the next one
> > > starts reading the journal.
> > >
> > > At the moment I have the journal inspection running ('cephfs-journal-tool
> > > --rank=cephfs:0 journal inspect').
> > >
> > > Does anyone have any further suggestions on how I can get the cluster
> > > running again as quickly as possible?
> >
> > Please review:
> >
> > https://docs.ceph.com/en/reef/cephfs/troubleshooting/#stuck-during-recovery
> >
> > Note: your MDS is probably not failing in up:replay but shortly after
> > reaching one of the later states. Check the mon logs to see what the
> > FSMap changes were.
> >
> >
> > Patrick Donnelly, Ph.D.
> > He / Him / His
> > Red Hat Partner Engineer
> > IBM, Inc.
> > GPG: 19F28A586F808C2402351B93C3301A3E258DD79D
> >
> >
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]



--
Milind
