I have a fix in the works for Noble. I'm letting it soak overnight and
I'll propose it shortly.

** Description changed:

+ SRU Template
+ 
+ [Impact]
+ Watcher's decision-engine accumulates idle SQLAlchemy connections over time
+ and eventually exhausts its connection pool (size 2 + 50 overflow), causing
+ the service to report FAILED in `openstack optimize service list`. In a
+ production Sunbeam 2024.1 deployment this typically takes multiple days to
+ manifest. Once the pool is exhausted, all background jobs in the decision
+ engine fail with:
+ 
+   sqlalchemy.exc.TimeoutError: QueuePool limit of size 2 overflow 50
+   reached, connection timed out, timeout 30.00
+ 
+ The only workarounds available to operators today are killing the sleeping
+ MySQL connections out from under watcher and restarting the watcher pods.
+ Until an operator intervenes, Watcher cannot reconcile audits, schedule
+ action plans, or run any continuous audit workload.
+ 
+ The update contains the following package updates:
+ 
+   * watcher 2:12.0.0-0ubuntu1.3 (noble / cloud-archive:caracal)
+ 
+ [Test Case]
+ The following SRU process was followed:
+ 
+ https://documentation.ubuntu.com/sru/en/latest/reference/exception-OpenStack-Updates
+ 
+ In order to avoid regression of existing consumers, the OpenStack team will
+ run their continuous integration test against the packages that are in
+ -proposed. A successful run of all available tests will be required before the
+ proposed packages can be let into -updates.
+ 
+ The OpenStack team will be in charge of attaching the output summary of
+ the executed tests. The OpenStack team members will not mark
+ ‘verification-done’ until this has happened.
+ 
+ ------------------------------------------------------------------------
+ Check 1 -- No QueuePool TimeoutError in decision-engine logs
+ ------------------------------------------------------------------------
+ 
+ Original symptom from the bug report:
+ 
+     [watcher-decision-engine] ERROR apscheduler.executors.default
+     sqlalchemy.exc.TimeoutError: QueuePool limit of size 2 overflow 50
+     reached, connection timed out, timeout 30.00
+ 
+ Run:
+ 
+     sudo k8s kubectl logs -n openstack watcher-0 -c watcher-decision-engine \
+         --since=60m | grep -iEc "queuepool|TimeoutError"
+ 
+ Pass criterion: 0 (zero matches in the last 60 minutes of logs).
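If the log text has been captured some other way (for example saved to a file), the same count can be reproduced without grep. The sketch below mirrors `grep -iEc "queuepool|TimeoutError"`: it counts matching lines rather than individual matches, and ignores case. The function name and the sample log text are illustrative only, not part of watcher.

```python
import re

# Same alternation as the grep pattern above; re.IGNORECASE mirrors -i.
PATTERN = re.compile(r"queuepool|TimeoutError", re.IGNORECASE)


def count_matching_lines(log_text: str) -> int:
    """Count lines containing at least one match, like `grep -c`."""
    return sum(1 for line in log_text.splitlines() if PATTERN.search(line))


# Hypothetical sample input, modelled on the symptom quoted above.
sample = (
    "[watcher-decision-engine] ERROR apscheduler.executors.default\n"
    "sqlalchemy.exc.TimeoutError: QueuePool limit of size 2 overflow 50\n"
    "reached, connection timed out, timeout 30.00\n"
)
print(count_matching_lines(sample))  # 1: only the second line matches
```

A result of 0 on the last 60 minutes of logs corresponds to the pass criterion.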
+ 
+ ------------------------------------------------------------------------
+ Check 2 -- Sleeping MySQL connections are not accumulating
+ ------------------------------------------------------------------------
+ 
+ First discover the watcher DB credentials (auto-generated, unique per
+ deployment):
+ 
+     sudo k8s kubectl exec -n openstack watcher-0 \
+         -c watcher-decision-engine -- \
+         grep ^connection /etc/watcher/watcher.conf
+ 
+ The output line has the form:
+ 
+     connection = mysql+pymysql://<USER>:<PASS>@watcher-mysql-router-service...:6446/watcher_api
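Pulling the credentials out of that line by hand is error-prone, so here is a minimal sketch using only the Python standard library. It assumes the line has the `connection = <url>` shape shown above; the function name and the example credentials and host are hypothetical.

```python
from urllib.parse import unquote, urlsplit


def parse_connection_line(line: str) -> tuple[str, str]:
    """Return (user, password) from a 'connection = <url>' config line."""
    _, _, url = line.partition("=")
    parts = urlsplit(url.strip())
    if parts.username is None or parts.password is None:
        raise ValueError("no credentials found in connection URL")
    # Credentials in URLs may be percent-encoded; decode before use.
    return unquote(parts.username), unquote(parts.password)


# Hypothetical values; real credentials are unique per deployment.
example = (
    "connection = mysql+pymysql://watcher-user:s3cr%40t"
    "@db.example:6446/watcher_api"
)
print(parse_connection_line(example))  # ('watcher-user', 's3cr@t')
```

Decoding matters when the generated password contains characters such as `@` or `/`, which would otherwise be passed to `mysql` in their escaped form.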
+ 
+ Extract <USER> and <PASS> from the URL. Then run the same query as
+ the bug report:
+ 
+     sudo k8s kubectl exec -n openstack watcher-mysql-router-0 \
+         -c mysql-router -- \
+         mysql -u <USER> -p<PASS> \
+         -h watcher-mysql-router-service.openstack.svc.cluster.local \
+         -P 6446 watcher_api \
+         -e "SELECT count(*), state FROM information_schema.processlist
+             GROUP BY state;"
+ 
+ Run the query twice, ideally allowing watcher to run overnight between
+ samples.
+ 
+ Pass criterion: the count for the empty-state row (sleeping connections)
+ stays bounded (under ~15) across both samples and does not trend upward.
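One possible reading of this criterion, sketched in Python. It assumes the query output is in the tab-separated batch format that the `mysql` client emits when stdout is not a terminal, with a header row first. The function names and the `max_growth` parameter are hypothetical; with only two samples, "does not trend upward" is interpreted here as the second count not exceeding the first by more than `max_growth`.

```python
def sleeping_count(batch_output: str) -> int:
    """Return count(*) for the row whose state column is empty."""
    for row in batch_output.splitlines()[1:]:  # skip the header row
        if not row.strip():
            continue  # ignore blank lines
        count, _, state = row.partition("\t")
        if state.strip() == "":
            return int(count)
    return 0  # no empty-state row: no sleeping connections reported


def check2_passes(first: str, second: str,
                  limit: int = 15, max_growth: int = 0) -> bool:
    """Both samples under the limit, and no growth beyond max_growth."""
    a, b = sleeping_count(first), sleeping_count(second)
    return a < limit and b < limit and (b - a) <= max_growth


# Hypothetical samples taken before and after an overnight soak.
evening = "count(*)\tstate\n6\t\n1\texecuting\n"
morning = "count(*)\tstate\n5\t\n1\texecuting\n"
print(check2_passes(evening, morning))  # True
```

The default limit of 15 is taken from the "under ~15" bound above; adjust `max_growth` if small fluctuations between samples are expected.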
+ 
+ On the broken package, this count grows by ~4 per minute
+ and exceeds the pool ceiling of 52 within ~15 minutes after a pod
+ restart, at which point Check 3 fails.
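As a sanity check on those figures, the arithmetic can be written out. This sketch assumes the count starts near zero after a pod restart and grows linearly at the quoted rate; both are simplifications.

```python
import math

pool_size = 2          # base pool size, from the error message
max_overflow = 50      # overflow allowance, from the error message
growth_per_minute = 4  # approximate growth rate quoted above

ceiling = pool_size + max_overflow
# Smallest whole number of minutes after which the count exceeds the ceiling.
minutes_to_exceed = math.floor(ceiling / growth_per_minute) + 1

print(ceiling, minutes_to_exceed)  # 52 14
```

About 14 minutes under these assumptions, which is consistent with the "~15 minutes" observed.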
+ 
+ ------------------------------------------------------------------------
+ Check 3 -- watcher-decision-engine reports ACTIVE
+ ------------------------------------------------------------------------
+ 
+ Run:
+ 
+     openstack optimize service list
+ 
+ Pass criterion: every watcher-decision-engine row shows Status = ACTIVE.
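This check can also be scripted. The sketch below assumes the listing was produced with the OpenStack client's generic `-f json` output option and that the rows carry `Name` and `Status` columns; those column names are an assumption here and should be confirmed against the actual output. The function name and sample data are hypothetical.

```python
import json


def decision_engines_active(service_list_json: str) -> bool:
    """True if every watcher-decision-engine row has Status ACTIVE."""
    rows = json.loads(service_list_json)
    engines = [r for r in rows if r.get("Name") == "watcher-decision-engine"]
    # Require at least one row so an empty listing does not pass by accident.
    return bool(engines) and all(r.get("Status") == "ACTIVE" for r in engines)


# Hypothetical sample output; column names are assumed, not confirmed.
sample_json = json.dumps([
    {"ID": 1, "Name": "watcher-applier", "Host": "watcher-0", "Status": "ACTIVE"},
    {"ID": 2, "Name": "watcher-decision-engine", "Host": "watcher-0", "Status": "ACTIVE"},
])
print(decision_engines_active(sample_json))  # True
```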
+ 
+ 
+ -------------------------------------
+ Original Bug Report Content Below
+ -------------------------------------
+ 
  in the newton release a background job scheduler was added to the
  Decision Engine.
  
  https://github.com/openstack/watcher/commit/06c6c4691b103bf0b3fd3304a1a45fb22aedad50
  
  to facilitate this the apscheduler lib was introduced as a dependency of
  watcher. apscheduler has a lot of capabilities but does not officially
  support eventlet.
  
  since its introduction to watcher it has mostly worked, partly by accident.
  over the years, as oslo, apscheduler and eventlet have evolved and adapted
  to newer python releases, watcher has continued to use apscheduler even
  though that is not technically supported.
  
  with the move to python 3.12 it became apparent that the background jobs
  executed on the apscheduler BackgroundScheduler instances were accessing
  shared global state from a non-monkeypatched native thread.
  
  that results in greenthreads sometimes calling into objects that are
  using un-monkey-patched code.
  
  for example, oslo.db uses time.sleep to yield execution.
  when that oslo.db function is first imported from a non-patched thread,
  if it is invoked after that in the main thread it will block.
  
  this can be seen in this exception, "RuntimeError: do not call blocking
  functions from the mainloop", here:
  https://paste.opendev.org/show/bGPgfURx1cZYOsgmtDyw/
  
  this has been reproduced in ci as part of moving the ci jobs to ubuntu
  24.04 and python 3.12
  
  https://review.opendev.org/c/openstack/watcher/+/932963/comments/f54005d7_b0f831bb
  
  to address this issue we need to ensure that the background thread used
  to schedule background tasks is properly monkey patched.

https://bugs.launchpad.net/bugs/2086710

Title:
  watcher's use of apscheduler is incompatible with python 3.12 and
  eventlet
