Public bug reported:

Upstream implemented a new feature [1] that checks for and reports long
network ping times between OSDs. It introduced an issue, however: ceph-mgr
can become very slow because it has to dump all of the new OSD network
ping stats [2] for some tasks. This is especially bad when the cluster
has a large number of OSDs.

These OSD network ping stats do not need to be exposed to the Python mgr
modules, so dumping them only makes the mgr do more work than it needs to.
This can make the mgr slow or even hang, and can keep the CPU usage of the
mgr process constantly high. The fix is to disable the ping time dump for
the mgr Python modules.
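
To illustrate why skipping the ping stats helps, here is a minimal sketch in
Python. It is not the upstream code: the function names, the dictionary
layout, and the include_ping_stats flag are all hypothetical, and it assumes
each OSD carries one ping record per heartbeat peer, so the dumped data
grows with (number of OSDs) x (peers per OSD).

```python
def make_fake_stats(num_osds, peers_per_osd):
    """Build made-up per-OSD stats; every field here is illustrative."""
    stats = []
    for osd_id in range(num_osds):
        peers = [(osd_id + k) % num_osds for k in range(1, peers_per_osd + 1)]
        stats.append({
            "osd": osd_id,
            "kb_used": 1024,
            # One ping record per heartbeat peer (assumed layout).
            "ping_times": [{"osd": p, "last": 0.5} for p in peers],
        })
    return stats


def dump_osd_stats(osd_stats, include_ping_stats=True):
    """Serialize per-OSD stats, optionally skipping the per-peer ping data."""
    dumped = []
    for osd in osd_stats:
        entry = {"osd": osd["osd"], "kb_used": osd["kb_used"]}
        if include_ping_stats:
            # This is the part whose cost scales with the peer count.
            entry["network_ping_times"] = [dict(p) for p in osd["ping_times"]]
        dumped.append(entry)
    return dumped


def count_ping_entries(dumped):
    """Count ping records present in a dump."""
    return sum(len(e.get("network_ping_times", [])) for e in dumped)


if __name__ == "__main__":
    stats = make_fake_stats(num_osds=1000, peers_per_osd=10)
    print(count_ping_entries(dump_osd_stats(stats)))
    print(count_ping_entries(dump_osd_stats(stats, include_ping_stats=False)))
```

With the flag off, the per-OSD entries are still dumped but carry no ping
records, which is the general idea behind the fix described above.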

The main upstream fix is here [3]. I also found an improvement commit [4]
that was submitted later in another PR.

We need to backport them to bionic Luminous and Mimic (Stein); Nautilus
and Octopus already have the fix.

[1] https://github.com/ceph/ceph/pull/28755
[2] https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
[3] https://github.com/ceph/ceph/pull/32406
[4] https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f

** Affects: ceph (Ubuntu)
     Importance: Undecided
         Status: New

** Description changed:

  upstream implemented a new feature [1] that will check/report those long
  network ping times between osds, but it introduced an issue that ceph-
- mgr might be very slow  because it needs to dump all the new osd network
+ mgr might be very slow because it needs to dump all the new osd network
  ping stats [2] for some tasks, this can be bad especially when the
  cluster has large number of osds.
  
  Since these kind osd network ping stats doesn't need to be exposed to the 
python mgr module.
- so, it only makes the mgr doing more extra work than it needs, the fix is to 
disable the ping time dump for those mgr python modules. 
-  
- The major fix from upstream is here [3], and also I found an improvement 
commit [4] that submitted later in another PR. 
+ so, it only makes the mgr doing more work than it needs to, it could cause 
the mgr slow or even hang and could cause the cpu usage of mgr process 
constantly high. the fix is to disable the ping time dump for those mgr python 
modules.
+ 
+ The major fix from upstream is here [3], and also I found an improvement
+ commit [4] that submitted later in another PR.
  
  We need to backport them to bionic Luminous and Mimic(Stein), Nautilus
  and Octopus have the fix
- 
  
  [1] https://github.com/ceph/ceph/pull/28755
  [2] 
https://github.com/ceph/ceph/pull/28755/files#diff-5498d83111f1210998ee186e98d5836d2bce9992be7648addc83f59e798cddd8L430
  [3] https://github.com/ceph/ceph/pull/32406
  [4] 
https://github.com/ceph/ceph/pull/32554/commits/1112584621016c4a8cac1bedb1a1b8b17c394f7f

** Summary changed:

- mgr can be very slow within a large ceph cluster
+ mgr can be very slow in a large ceph cluster

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/1906496

Title:
  mgr can be very slow in a large ceph cluster

To manage notifications about this bug go to:
https://bugs.launchpad.net/ubuntu/+source/ceph/+bug/1906496/+subscriptions

-- 
ubuntu-bugs mailing list
ubuntu-bugs@lists.ubuntu.com
https://lists.ubuntu.com/mailman/listinfo/ubuntu-bugs
