** Description changed:

  SRU Template
  
  [Impact]
  Watcher's decision-engine accumulates idle SQLAlchemy connections over time
  and eventually exhausts its connection pool (size 2 + 50 overflow), causing
  the service to report FAILED in `openstack optimize service list`. In a
  production Sunbeam 2024.1 deployment this typically takes multiple days to
  manifest. Once the pool is exhausted, all background jobs in the decision
  engine fail with:
  
    sqlalchemy.exc.TimeoutError: QueuePool limit of size 2 overflow 50
    reached, connection timed out, timeout 30.00
  
  Until an operator intervenes, Watcher cannot reconcile audits, schedule
  action plans, or run any continuous audit workload. The only workarounds
  available to operators today are killing the sleeping MySQL connections
  out from under watcher and restarting the watcher pods.
  
  The update contains the following package updates:
  
    * watcher 2:12.0.0-0ubuntu1.3 (noble / cloud-archive:caracal)
  
  [Test Case]
  The following SRU process was followed:
  
  https://documentation.ubuntu.com/sru/en/latest/reference/exception-OpenStack-Updates
  
  In order to avoid regression of existing consumers, the OpenStack team will
  run their continuous integration test against the packages that are in
  -proposed. A successful run of all available tests will be required before
  the proposed packages can be let into -updates.
  
  The OpenStack team will be in charge of attaching the output summary of
  the executed tests. The OpenStack team members will not mark
  ‘verification-done’ until this has happened.
  
  ------------------------------------------------------------------------
  Check 1 -- No QueuePool TimeoutError in decision-engine logs
  ------------------------------------------------------------------------
  
  Original symptom from the bug report:
  
      [watcher-decision-engine] ERROR apscheduler.executors.default
      sqlalchemy.exc.TimeoutError: QueuePool limit of size 2 overflow 50
      reached, connection timed out, timeout 30.00
  
  Run:
  
      sudo k8s kubectl logs -n openstack watcher-0 -c watcher-decision-engine \
          --since=60m | grep -iEc "queuepool|TimeoutError"
  
  Pass criterion: 0 (zero matches in the last 60 minutes of logs).
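
  For runs where the log has been saved to a file, the same count can be
  reproduced offline. The following is a minimal sketch, assuming the log
  text is already in memory; the function names are illustrative and not
  part of watcher. It mirrors `grep -iEc` by counting matching lines
  rather than individual occurrences.

```python
import re

# Same alternation as the grep -iE expression in the command above.
PATTERN = re.compile(r"queuepool|TimeoutError", re.IGNORECASE)


def count_pool_errors(log_text: str) -> int:
    """Count matching lines, like `grep -c` (lines, not occurrences)."""
    return sum(1 for line in log_text.splitlines() if PATTERN.search(line))


def check_1_passes(log_text: str) -> bool:
    """Check 1 passes only when no line matches."""
    return count_pool_errors(log_text) == 0
```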
  
  ------------------------------------------------------------------------
  Check 2 -- Sleeping MySQL connections are not accumulating
  ------------------------------------------------------------------------
  
  First discover the watcher DB credentials (auto-generated, unique per
  deployment):
  
      sudo k8s kubectl exec -n openstack watcher-0 -c watcher-decision-engine -- \
          grep ^connection /etc/watcher/watcher.conf
  
  The output line has the form:
  
      connection = mysql+pymysql://<USER>:<PASS>@watcher-mysql-router-service...:6446/watcher_api
  
  Extract <USER> and <PASS> from the URL. Then run the same query as
  the bug report:
  
      sudo k8s kubectl exec -n openstack watcher-mysql-router-0 \
          -c mysql-router -- \
          mysql -u <USER> -p<PASS> \
          -h watcher-mysql-router-service.openstack.svc.cluster.local \
          -P 6446 watcher_api \
          -e "SELECT count(*), state FROM information_schema.processlist
              GROUP BY state;"
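
  Extracting <USER> and <PASS> by hand is error-prone when the password
  contains percent-encoded characters. The following is a minimal sketch
  of that step using only the Python standard library; the function name
  is illustrative, and the host in the usage note is a placeholder
  rather than a real service name.

```python
from urllib.parse import unquote, urlsplit


def parse_connection_line(line: str) -> dict:
    """Split a 'connection = <url>' line from watcher.conf into its parts."""
    # Partition on the first '=' only; the URL itself may contain '='.
    _, _, url = line.partition("=")
    parts = urlsplit(url.strip())
    return {
        # Credentials may be percent-encoded inside the URL, so decode them.
        "user": unquote(parts.username or ""),
        "password": unquote(parts.password or ""),
        "host": parts.hostname,
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }
```

  For example, parse_connection_line("connection =
  mysql+pymysql://watcher:s3cr%40t@db.example.invalid:6446/watcher_api")
  yields user "watcher" and password "s3cr@t".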
  
  Run the query twice, ideally allowing watcher to run overnight between
  samples.
  
  Pass criterion: the count for the empty-state row (sleeping connections)
  stays bounded (under ~15) across both samples and does not trend upward.
  
  On the broken package, this count grows by ~4 per minute
  and exceeds the pool ceiling of 52 within ~15 minutes after a pod
  restart, at which point Check 3 fails.
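
  The pass criterion and the growth arithmetic above can be written down
  as a small sketch. The bound of 15 and the ceiling of 52 come from
  the text; the function names and the growth_tolerance parameter (how
  much the second sample may exceed the first before it counts as an
  upward trend) are assumptions for illustration, and its default of 2
  is arbitrary.

```python
import math

SLEEPING_BOUND = 15  # "under ~15" from the pass criterion
POOL_CEILING = 52    # pool size 2 + overflow 50


def sleeping_count(rows):
    """rows: (count, state) pairs from the processlist query.

    Sleeping connections are the row whose state is empty (or NULL).
    """
    return sum(count for count, state in rows if not state)


def check_2_passes(first_rows, second_rows, growth_tolerance=2):
    """Both samples stay under the bound and the second is not trending up."""
    first = sleeping_count(first_rows)
    second = sleeping_count(second_rows)
    bounded = first < SLEEPING_BOUND and second < SLEEPING_BOUND
    return bounded and (second - first) <= growth_tolerance


def minutes_to_exhaust(rate_per_minute, start=0, ceiling=POOL_CEILING):
    """Whole minutes until a steadily growing count reaches the ceiling."""
    return math.ceil((ceiling - start) / rate_per_minute)
```

  At the ~4 connections per minute observed on the broken package,
  minutes_to_exhaust(4) gives 13, consistent with the ~15 minute figure.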
  
  ------------------------------------------------------------------------
  Check 3 -- watcher-decision-engine reports ACTIVE
  ------------------------------------------------------------------------
  
  Run:
  
      openstack optimize service list
  
  Pass criterion: every watcher-decision-engine row shows Status = ACTIVE.
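
  If the check is scripted, the criterion can be evaluated on
  machine-readable output. The following is a hedged sketch: it assumes
  the command is run with the openstack client's JSON formatter
  (`-f json`) and that the rows carry "Name" and "Status" keys, which
  should be confirmed against the actual output of the deployed client.
  The function name is illustrative.

```python
import json


def decision_engines_active(service_list_json: str) -> bool:
    """True when at least one decision-engine row exists and all are ACTIVE."""
    rows = json.loads(service_list_json)
    # "Name" and "Status" are assumed column names; adjust if they differ.
    engines = [r for r in rows if r.get("Name") == "watcher-decision-engine"]
    return bool(engines) and all(r.get("Status") == "ACTIVE" for r in engines)
```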
  
+ [Regression Potential]
+ In order to mitigate the regression potential, the results of the
+ aforementioned OpenStack CI tests are attached to this bug.
+ 
+ The bulk of the change is further-database-refactoring.patch, which
+ rewrites the SQLAlchemy session lifecycle in watcher/db/sqlalchemy/api.py
+ to use the enginefacade reader/writer context managers. It is a direct
+ cherry-pick from upstream stable/2025.1 and has been in upstream watcher
+ since 14.0.0.0rc1 (February 2025) without revert. The prerequisite patch
+ (replace-deprecated-legacy-enginefacade.patch) is 8 lines and only swaps
+ a deprecated facade-construction call site.
+ 
+ Two regression modes are possible, both low risk:
+  * Session lifetime is now scoped to the context manager. Out-of-tree
+    code that uses ORM objects after the helper returns may raise
+    DetachedInstanceError. All in-tree callers were migrated as part
+    of the patch.
+ 
+  * The model_query() helper has been removed. Out-of-tree code that
+    imports it will fail at import time.
+ 
+ [Discussion]
+ The leak originates in watcher/db/sqlalchemy/api.py: model_query()
+ calls get_session() and returns the query without ever closing the
+ session. Every Audit.list() call therefore leaks one connection, and
+ ContinuousAuditHandler.launch_audits_periodically calls it twice per
+ tick. Upstream addressed this by migrating the DB layer to oslo_db's
+ enginefacade reader/writer context-manager API (Change-Id
+ Ib5e9aa288232cc1b766bbf2a8ce2113d5a8e2f7d, upstream LP #2067815),
+ which auto-closes sessions on context exit. That fix shipped in
+ 14.0.0.0rc1 (Epoxy / 2025.1) and is cherry-picked
+ here, together with its prerequisite (Change-Id
+ I5570698262617eae3f48cf29aacf2e23ad541e5f, "Replace deprecated
+ LegacyEngineFacade").
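
  The difference between the two session lifecycles can be illustrated
  with a toy model. This is a sketch, not the watcher or oslo_db code:
  ToyPool, leaky_query, reader and scoped_query are hypothetical names,
  and the toy pool raises immediately where a real pool would first wait
  out its 30 second timeout.

```python
from contextlib import contextmanager


class ToyPool:
    """Toy stand-in for a bounded connection pool (not SQLAlchemy's QueuePool)."""

    def __init__(self, size=2, max_overflow=50):
        self.limit = size + max_overflow
        self.checked_out = 0

    def checkout(self):
        if self.checked_out >= self.limit:
            raise TimeoutError(f"pool limit of {self.limit} reached")
        self.checked_out += 1
        return object()

    def release(self, conn):
        self.checked_out -= 1


def leaky_query(pool):
    """Old pattern: take a connection, return a result, never give it back."""
    pool.checkout()
    return "rows"


@contextmanager
def reader(pool):
    """New pattern: the connection is returned when the context exits."""
    conn = pool.checkout()
    try:
        yield conn
    finally:
        pool.release(conn)


def scoped_query(pool):
    with reader(pool):
        return "rows"
```

  With the defaults, the 53rd call to leaky_query() on one pool raises
  TimeoutError, while scoped_query() can be called any number of times
  and leaves checked_out at 0.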
+ 
+ 
  
  ------------------------------------------------------------------------
  Original Bug Report Content Below
  ------------------------------------------------------------------------
  
  In the Newton release a background job scheduler was added to the
  Decision Engine:
  
  https://github.com/openstack/watcher/commit/06c6c4691b103bf0b3fd3304a1a45fb22aedad50
  
  To facilitate this, the apscheduler lib was introduced as a dependency of
  watcher. apscheduler has a lot of capability but does not officially
  support eventlet.
  
  Since its introduction to watcher it has mostly worked, partly by accident.
  Over the years, as oslo, apscheduler and eventlet have evolved and adapted
  to newer Python releases, watcher has continued to use apscheduler even
  though that is not technically supported.
  
  With the move to Python 3.12 it became apparent that the background jobs
  executed on the apscheduler BackgroundScheduler instances were accessing
  shared global state from a non-monkeypatched native thread.
  
  That results in greenthreads sometimes calling into objects that are
  using un-monkeypatched code.
  
  For example, oslo.db uses time.sleep to yield execution. When that
  oslo.db function is first imported from a non-patched thread and is
  later invoked in the main thread, it will block.
  
  This can be seen in the exception "RuntimeError: do not call blocking
  functions from the mainloop" here:
  https://paste.opendev.org/show/bGPgfURx1cZYOsgmtDyw/
  
  This has been reproduced in CI as part of moving the CI jobs to Ubuntu
  24.04 and Python 3.12:
  
  https://review.opendev.org/c/openstack/watcher/+/932963/comments/f54005d7_b0f831bb
  
  To address this issue we need to ensure that the background thread used
  to schedule background tasks is properly monkey patched.

-- 
You received this bug notification because you are a member of Ubuntu
Bugs, which is subscribed to Ubuntu.
https://bugs.launchpad.net/bugs/2086710

Title:
  watcher's use of apscheduler is incompatible with python 3.12 and
  eventlet
