Hi Eugen,
thanks for your input. I can't query the hung MDS, but the others report the following:
ceph tell mds.ceph-14 perf dump throttle-write_buf_throttle
{
    "throttle-write_buf_throttle": {
        "val": 0,
        "max": 3758096384,
        "get_started": 0,
        "get": 5199,
        "get_sum": 566691,
        "get_or_fail_fail": 0,
        "get_or_fail_success": 5199,
        "take": 0,
        "take_sum": 0,
        "put": 719,
        "put_sum": 566691,
        "wait": {
            "avgcount": 0,
            "sum": 0.000000000,
            "avgtime": 0.000000000
        }
    }
}
You might be on to something; we are also trying to find out where this limit
comes from.
Please keep us posted.
Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14
________________________________________
From: Eugen Block <[email protected]>
Sent: Monday, January 20, 2025 11:12 AM
To: [email protected]
Subject: [ceph-users] Re: MDS hung in purge_stale_snap_data after populating
cache
Hi Frank,
are you able to query the daemon while it's trying to purge the snaps?
pacific:~ # ceph tell mds.{your_daemon} perf dump throttle-write_buf_throttle
...
"max": 3758096384,
I don't know yet where that "max" setting comes from, but I'll keep looking.
Zitat von Frank Schilder <[email protected]>:
> Hi all,
>
> we tracked the deadlock down to line
> https://github.com/ceph/ceph/blob/pacific/src/osdc/Journaler.cc#L583
> in Journaler::append_entry(bufferlist& bl):
>
>     // append
>     size_t delta = bl.length() + journal_stream.get_envelope_size();
>     // write_buf space is nearly full
>     if (!write_buf_throttle.get_or_fail(delta)) {
>       l.unlock();
>       ldout(cct, 10) << "write_buf_throttle wait, delta " << delta << dendl;
>       write_buf_throttle.get(delta); // <<< the MDS is stuck here <<<
>       l.lock();
>     }
>     ldout(cct, 20) << "write_buf_throttle get, delta " << delta << dendl;
>
> This is indicated by the last message in the log before the lock up,
> which reads
>
> mds.2.journaler.pq(rw) write_buf_throttle wait, delta 101
>
> and is generated by the line just above the call to
> write_buf_throttle.get(delta). All log messages before it
> start with "write_buf_throttle get, delta", which means those
> calls did not enter the if-statement.
>
> The obvious question is: which parameter controls the maximum size of
> the variable Journaler::write_buffer
> (https://github.com/ceph/ceph/blob/pacific/src/osdc/Journaler.h#L306) in the
> definition of class Journaler? Increasing this limit should
> get us past the deadlock.
>
> Thanks for your help and best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Frank Schilder <[email protected]>
> Sent: Friday, January 17, 2025 3:02 PM
> To: Bailey Allison; [email protected]
> Subject: [ceph-users] Re: MDS hung in purge_stale_snap_data after
> populating cache
>
> Hi Bailey.
>
> ceph-14 (rank=0): num_stray=205532
> ceph-13 (rank=1): num_stray=4446
> ceph-21-mds (rank=2): num_stray=99446249
> ceph-23 (rank=3): num_stray=3412
> ceph-08 (rank=4): num_stray=1238
> ceph-15 (rank=5): num_stray=1486
> ceph-16 (rank=6): num_stray=5545
> ceph-11 (rank=7): num_stray=2995
>
> The stats for rank 2 are almost certainly out of date, though. The
> config dump is large, but since you asked: it's only 3 settings that
> are present for maintenance and workaround reasons:
> mds_beacon_grace, auth_service_ticket_ttl and
> mon_osd_report_timeout. The last one is for a different issue, though.
>
> WHO     MASK             LEVEL     OPTION                                          VALUE           RO
> global                   advanced  auth_service_ticket_ttl                         129600.000000
> global                   advanced  mds_beacon_grace                                1209600.000000
> global                   advanced  mon_pool_quota_crit_threshold                   90
> global                   advanced  mon_pool_quota_warn_threshold                   70
> global                   dev       mon_warn_on_pool_pg_num_not_power_of_two        false
> global                   advanced  osd_map_message_max_bytes                       16384
> global                   advanced  osd_op_queue                                    wpq             *
> global                   advanced  osd_op_queue_cut_off                            high            *
> global                   advanced  osd_pool_default_pg_autoscale_mode              off
> mon                      advanced  mon_allow_pool_delete                           false
> mon                      advanced  mon_osd_down_out_subtree_limit                  host
> mon                      advanced  mon_osd_min_down_reporters                      3
> mon                      advanced  mon_osd_report_timeout                          86400
> mon                      advanced  mon_osd_reporter_subtree_level                  host
> mon                      advanced  mon_pool_quota_warn_threshold                   70
> mon                      advanced  mon_sync_max_payload_size                       4096
> mon                      advanced  mon_warn_on_insecure_global_id_reclaim          false
> mon                      advanced  mon_warn_on_insecure_global_id_reclaim_allowed  false
> mgr                      advanced  mgr/balancer/active                             false
> mgr                      advanced  mgr/dashboard/ceph-01/server_addr               10.40.88.65     *
> mgr                      advanced  mgr/dashboard/ceph-02/server_addr               10.40.88.66     *
> mgr                      advanced  mgr/dashboard/ceph-03/server_addr               10.40.88.67     *
> mgr                      advanced  mgr/dashboard/server_port                       8443            *
> mgr                      advanced  mon_pg_warn_max_object_skew                     10.000000
> mgr                      basic     target_max_misplaced_ratio                      1.000000
> osd                      advanced  bluefs_buffered_io                              true
> osd                      advanced  bluestore_compression_min_blob_size_hdd         262144
> osd                      advanced  bluestore_compression_min_blob_size_ssd         65536
> osd                      advanced  bluestore_compression_mode                      aggressive
> osd     class:rbd_perf   advanced  bluestore_compression_mode                      none
> osd                      dev       bluestore_fsck_quick_fix_on_mount               false
> osd                      advanced  osd_deep_scrub_randomize_ratio                  0.000000
> osd     class:hdd        advanced  osd_delete_sleep                                300.000000
> osd                      advanced  osd_fast_shutdown                               false
> osd     class:fs_meta    advanced  osd_max_backfills                               12
> osd     class:hdd        advanced  osd_max_backfills                               3
> osd     class:rbd_data   advanced  osd_max_backfills                               6
> osd     class:rbd_meta   advanced  osd_max_backfills                               12
> osd     class:rbd_perf   advanced  osd_max_backfills                               12
> osd     class:ssd        advanced  osd_max_backfills                               12
> osd                      advanced  osd_max_backfills                               3
> osd     class:fs_meta    dev       osd_memory_cache_min                            2147483648
> osd     class:hdd        dev       osd_memory_cache_min                            1073741824
> osd     class:rbd_data   dev       osd_memory_cache_min                            2147483648
> osd     class:rbd_meta   dev       osd_memory_cache_min                            1073741824
> osd     class:rbd_perf   dev       osd_memory_cache_min                            2147483648
> osd     class:ssd        dev       osd_memory_cache_min                            2147483648
> osd                      dev       osd_memory_cache_min                            805306368
> osd     class:fs_meta    basic     osd_memory_target                               6442450944
> osd     class:hdd        basic     osd_memory_target                               3221225472
> osd     class:rbd_data   basic     osd_memory_target                               4294967296
> osd     class:rbd_meta   basic     osd_memory_target                               2147483648
> osd     class:rbd_perf   basic     osd_memory_target                               6442450944
> osd     class:ssd        basic     osd_memory_target                               4294967296
> osd                      basic     osd_memory_target                               2147483648
> osd     class:rbd_perf   advanced  osd_op_num_threads_per_shard                    4               *
> osd     class:hdd        advanced  osd_recovery_delay_start                        600.000000
> osd     class:rbd_data   advanced  osd_recovery_delay_start                        300.000000
> osd     class:rbd_perf   advanced  osd_recovery_delay_start                        300.000000
> osd     class:fs_meta    advanced  osd_recovery_max_active                         32
> osd     class:hdd        advanced  osd_recovery_max_active                         8
> osd     class:rbd_data   advanced  osd_recovery_max_active                         16
> osd     class:rbd_meta   advanced  osd_recovery_max_active                         32
> osd     class:rbd_perf   advanced  osd_recovery_max_active                         16
> osd     class:ssd        advanced  osd_recovery_max_active                         32
> osd                      advanced  osd_recovery_max_active                         8
> osd     class:fs_meta    advanced  osd_recovery_sleep                              0.002500
> osd     class:hdd        advanced  osd_recovery_sleep                              0.050000
> osd     class:rbd_data   advanced  osd_recovery_sleep                              0.025000
> osd     class:rbd_meta   advanced  osd_recovery_sleep                              0.002500
> osd     class:rbd_perf   advanced  osd_recovery_sleep                              0.010000
> osd     class:ssd        advanced  osd_recovery_sleep                              0.002500
> osd                      advanced  osd_recovery_sleep                              0.050000
> osd     class:hdd        dev       osd_scrub_backoff_ratio                         0.330000
> osd     class:hdd        advanced  osd_scrub_during_recovery                       true
> osd                      advanced  osd_scrub_load_threshold                        0.750000
> osd     class:fs_meta    advanced  osd_snap_trim_sleep                             0.050000
> osd     class:hdd        advanced  osd_snap_trim_sleep                             2.000000
> osd     class:rbd_data   advanced  osd_snap_trim_sleep                             0.100000
> mds                      basic     client_cache_size                               8192
> mds                      advanced  defer_client_eviction_on_laggy_osds             false
> mds                      advanced  mds_bal_fragment_size_max                       100000
> mds                      basic     mds_cache_memory_limit                          25769803776
> mds                      advanced  mds_cache_reservation                           0.500000
> mds                      advanced  mds_max_caps_per_client                         65536
> mds                      advanced  mds_min_caps_per_client                         4096
> mds                      advanced  mds_recall_max_caps                             32768
> mds                      advanced  mds_session_blocklist_on_timeout                false
>
> Best regards,
> =================
> Frank Schilder
> AIT Risø Campus
> Bygning 109, rum S14
>
> ________________________________________
> From: Bailey Allison <[email protected]>
> Sent: Thursday, January 16, 2025 10:08 PM
> To: [email protected]
> Subject: [ceph-users] Re: MDS hung in purge_stale_snap_data after
> populating cache
>
> Frank,
>
> Are you able to share an up-to-date ceph config dump and a ceph daemon
> mds.X perf dump | grep strays from the cluster?
>
> We're just getting through our comically long ceph outage, so I'd like
> to be able to share the love here hahahaha
>
> Regards,
>
> Bailey Allison
> Service Team Lead
> 45Drives, Ltd.
> 866-594-7199 x868
> _______________________________________________
> ceph-users mailing list -- [email protected]
> To unsubscribe send an email to [email protected]
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]