** Summary changed:

- mgr can be very slow in a large ceph cluster
+ [SRU] mgr can be very slow in a large ceph cluster

** Description changed:

- upstream implemented a new feature [1] that will check/report those long
- network ping times between osds, but it introduced an issue that ceph-
- mgr might be very slow because it needs to dump all the new osd network
- ping stats [2] for some tasks, this can be bad especially when the
- cluster has large number of osds.
+ [Impact]
+ Ceph upstream implemented a new feature [1] that checks/reports long
+ network ping times between OSDs, but it introduced an issue: ceph-mgr
+ can become very slow because it needs to dump all the new OSD network
+ ping stats [2] for some tasks. This can be especially bad when the
+ cluster has a large number of OSDs.
  
- Since these kind osd network ping stats doesn't need to be exposed to
- the python mgr module. so, it only makes the mgr doing more work than
- it needs to, it could cause the mgr slow or even hang and could cause
- the cpu usage of mgr process constantly high. the fix is to disable
- the ping time dump for those mgr python modules.
+ Since this kind of OSD network ping stats doesn't need to be exposed
+ to the Python mgr modules, dumping it only makes the mgr do more work
+ than it needs to; it can make the mgr slow or even hang, and can keep
+ the CPU usage of the mgr process constantly high. The fix is to
+ disable the ping time dump for those mgr Python modules.
+ 
+ This resulted in ceph-mgr not responding to commands and/or hanging
+ (and having to be restarted) in clusters with a large number of OSDs.
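+ 
+ To illustrate the shape of the fix (a minimal self-contained sketch
+ with stub types; the real Ceph classes and function names differ), the
+ expensive ping-time section of the dump is gated behind a flag that
+ the Python mgr module path turns off:
+ 
+ // Sketch only: Formatter and PGMap below are stand-ins for the real
+ // ceph::Formatter and PGMap types in the Ceph tree.
+ #include <cstdio>
+ 
+ struct Formatter {
+   void dump(const char *s) { std::printf("%s\n", s); }
+ };
+ 
+ struct PGMap {
+   // 'with_net' lets callers skip the per-OSD ping-time section,
+   // whose size grows with the number of OSD heartbeat pairs.
+   void dump(Formatter *f, bool with_net) const {
+     f->dump("pg/osd stats");               // always needed
+     if (with_net)
+       f->dump("osd network ping times");   // huge on large clusters
+   }
+ };
+ 
+ int main() {
+   PGMap pg_map;
+   Formatter f;
+   pg_map.dump(&f, false);  // mgr Python-module path: skip ping times
+   pg_map.dump(&f, true);   // full dump elsewhere keeps the data
+ }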
+ 
+ [0] is the upstream bug. It was backported to Nautilus but rejected
+ for Luminous and Mimic because they have reached EOL upstream.
+ However, I want to backport it to these two releases in Ubuntu/UCA.
  
  The major fix from upstream is here [3], and I also found an
  improvement commit [4] that was submitted later in another PR.
  
- We need to backport them to bionic Luminous and Mimic(Stein), Nautilus
- and Octopus have the fix
+ [Test Case]
+ Deploy a Ceph cluster (Luminous 12.2.x or Mimic 13.2.9) with a large
+ number of Ceph OSDs (600+). During normal operation of the cluster, as
+ ceph-mgr regularly dumps the network ping stats, the problem will
+ manifest. It is relatively hard to reproduce, as ceph-mgr may not
+ always get overloaded and thus may not hang.
  
+ [Regression Potential]
+ The fix has been accepted upstream (the changes here are in "sync"
+ with upstream to the extent that these old releases match the latest
+ source code) and has been confirmed to work, so the risk is minimal.
+ 
+ At worst, this could affect modules that consume these stats from
+ ceph-mgr (such as prometheus or other monitoring scripts/tools), which
+ would become less useful. But it still shouldn't cause any problems
+ for the operation of the cluster itself.
+ 
+ [Other Info]
+ - In addition to the major fix [3], another commit [4] is also
+ cherry-picked and backported here - this was also accepted upstream.
+ 
+ - Since ceph-mgr hangs when affected, this also impacts sosreport
+ collection - commands time out because the mgr doesn't respond, and
+ thus info gets truncated/not collected in that case. This fix should
+ help avoid that problem in sosreports.
+ 
+ [0] https://tracker.ceph.com/issues/43364
  [1] https://github.com/ceph/ceph/pull/28755
  [2] https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
  [3] https://github.com/ceph/ceph/pull/32406
  [4] https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f
