Package: swift-container
Version: 2.26.0-10+deb11u1+wmf1
Severity: normal
Hi,
We had an outage due to swift simultaneously quarantining all three
copies of a container database (saying they were corrupt) during a
listing operation. Since all three databases were corrupt, I think this
is not a case of a disk/fs fault causing corruption, but rather that
swift processed an operation that wrote to the container DB in a way
that corrupted the sqlite file.
Each container-server had a backtrace like this:
Jan 5 07:20:28 ms-be2058 container-server: ERROR __call__ error with
GET /sdb3/16503/AUTH_mw/wikipedia-commons-local-thumb.f8 :
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/swift/common/db.py", line 475,
in get
yield conn
File "/usr/lib/python3/dist-packages/swift/container/backend.py",
line 1173, in list_objects_iter
return [transform_func(r) for r in curs]
File "/usr/lib/python3/dist-packages/swift/container/backend.py",
line 1173, in <listcomp>
return [transform_func(r) for r in curs]
sqlite3.DatabaseError: database disk image is malformed
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3/dist-packages/swift/container/server.py", line
867, in __call__
res = getattr(self, req.method)(req)
File "/usr/lib/python3/dist-packages/swift/common/utils.py", line
2007, in _timing_stats
resp = func(ctrl, *args, **kwargs)
File "/usr/lib/python3/dist-packages/swift/container/server.py", line
752, in GET
container_list = src_broker.list_objects_iter(
File "/usr/lib/python3/dist-packages/swift/container/backend.py",
line 1223, in list_objects_iter
return results
File "/usr/lib/python3.9/contextlib.py", line 135, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/lib/python3/dist-packages/swift/common/db.py", line 483,
in get
self.possibly_quarantine(*sys.exc_info())
File "/usr/lib/python3/dist-packages/swift/common/db.py", line 436,
in possibly_quarantine
self.quarantine(exc_hint)
File "/usr/lib/python3/dist-packages/swift/common/db.py", line 414,
in quarantine
raise sqlite3.DatabaseError(detail)
sqlite3.DatabaseError: Quarantined
/srv/swift-storage/sdb3/containers/16503/280/4077d9164732d6587761ef101bcbc280
to
/srv/swift-storage/sdb3/quarantined/containers/4077d9164732d6587761ef101bcbc280
due to malformed database (txn: tx4d7ef4ae3a434f458e950-00677a32bc)
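To illustrate why a plain GET can end with the database being moved
aside, here is a simplified Python sketch of the pattern the traceback
suggests: a context manager hands out a connection and, on a "malformed"
error, moves the file into a quarantine directory and raises a new
error. This is not swift's code. The class name SketchBroker, its
arguments and the message check are assumptions made for illustration,
and it moves a single file where the log above shows a whole directory
being moved.

```python
import contextlib
import os
import shutil
import sqlite3


class SketchBroker:
    """Hypothetical stand-in for a DB broker; not swift's implementation."""

    def __init__(self, db_file, quarantine_dir):
        self.db_file = db_file
        self.quarantine_dir = quarantine_dir

    @contextlib.contextmanager
    def get(self):
        """Yield a connection; quarantine the file on a 'malformed' error."""
        conn = sqlite3.connect(self.db_file)
        try:
            yield conn
        except sqlite3.DatabaseError as exc:
            conn.close()
            # Assumed rule: only errors mentioning "malformed" quarantine.
            if "malformed" in str(exc):
                self.quarantine("malformed database")
            raise
        else:
            conn.close()

    def quarantine(self, hint):
        """Move the DB file aside, then raise an error describing the move."""
        os.makedirs(self.quarantine_dir, exist_ok=True)
        dest = os.path.join(self.quarantine_dir,
                            os.path.basename(self.db_file))
        shutil.move(self.db_file, dest)
        raise sqlite3.DatabaseError(
            f"Quarantined {self.db_file} to {dest} due to {hint}")
```

Under these assumptions, any read that hits the error inside the
"with broker.get()" block removes the database from its usual path, so
a listing request alone is enough to take a replica out of service.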
And, indeed, when I ran an integrity check on the quarantined files,
each one showed the same errors:
mvernon@ms-be2073:~$ sqlite3 4077d9164732d6587761ef101bcbc280.db "PRAGMA integrity_check"
row 423322 missing from index ix_object_deleted_name
row 2701219 missing from index ix_object_deleted_name
This is quite surprising, given that rowids are generally not the same
across the 3 databases. One of the reported rows still exists in the
table (and dates from 2016); the other does not.
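For anyone wanting to repeat the comparison across replicas, a minimal
Python sketch follows. It runs the same integrity check on each copy and
looks up the reported rowids. The table name "object" and the columns
name, created_at and deleted are assumptions about the container DB
schema, and the function names are hypothetical.

```python
import sqlite3
from pathlib import Path


def _connect_readonly(db_path):
    """Open the database read-only so the check cannot modify the file."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def integrity_errors(db_path):
    """Return the lines from PRAGMA integrity_check (['ok'] when clean)."""
    conn = _connect_readonly(db_path)
    try:
        return [row[0] for row in conn.execute("PRAGMA integrity_check")]
    finally:
        conn.close()


def fetch_rows(db_path, rowids):
    """Map each rowid found to (name, created_at, deleted).

    The 'object' table and its column names are assumed, not confirmed.
    """
    conn = _connect_readonly(db_path)
    try:
        marks = ",".join("?" * len(rowids))
        cur = conn.execute(
            "SELECT ROWID, name, created_at, deleted FROM object "
            f"WHERE ROWID IN ({marks})", list(rowids))
        return {row[0]: tuple(row[1:]) for row in cur}
    finally:
        conn.close()


def compare_replicas(db_paths, rowids):
    """Collect integrity errors and the requested rows for each replica."""
    return {path: (integrity_errors(path), fetch_rows(path, rowids))
            for path in db_paths}
```

Calling compare_replicas() with the three quarantined files and the
rowids [423322, 2701219] would show, side by side, whether the copies
report identical errors and whether the same rowids map to the same
object names.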
In all 3 cases, the latest object (based on rowid) is one that was
deleted very shortly before the outage started: the delete was at
07:19:50 UTC and the outage began at 07:20:28.
It is perhaps significant that the last listing that succeeded used
that object's name as the prefix in a list request, at the same time as
the object was being deleted:
object with highest rowid:
19933856|f/f8/Gascones,_molino_(1988)_02.jpg/300px-Gascones,_molino_(1988)_02.jpg|1736061590.04401|0|application/deleted|noetag|1|0
final successful listing:
Jan 5 07:19:50 ms-fe2010 proxy-server: 10.194.179.98 10.192.16.76
05/Jan/2025/07/19/50 GET
/v1/AUTH_mw/wikipedia-commons-local-thumb.f8%3Flimit%3D9000%26prefix%3Df%252Ff8%252FGascones%252C_molino_%25281988%2529_02.jpg%252F%26format%3Djson%26states%3Dlisting
HTTP/1.0 200 - wikimedia/multi-http-client%20v1.1 AUTH_tk22395377a... -
511 - txc6028b8aef0d4705aef82-00677a3296 - 0.0301 - -
1736061590.006474018 1736061590.036608696 0
final delete:
ms-fe2011.proxylog.gz:Jan 5 07:19:50 ms-fe2011 proxy-server:
10.194.179.98 10.192.32.36 05/Jan/2025/07/19/50 DELETE
/v1/AUTH_mw/wikipedia-commons-local-thumb.f8/f/f8/Gascones%252C_molino_%25281988%2529_02.jpg/300px-Gascones%252C_molino_%25281988%2529_02.jpg
HTTP/1.0 204 - wikimedia/multi-http-client%20v1.1 AUTH_tk22395377a... -
- - tx8d7b6c325ca54e89a4a08-00677a3296 - 0.1511 - - 1736061590.041912317
1736061590.193058491 0
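The log extracts above are percent-encoded and carry epoch timestamps,
so a small Python sketch for reading them may help. It decodes the
listing's prefix parameter and puts the timestamps on one millisecond
timeline. It assumes the logged path carries one extra layer of
percent-encoding and that the two trailing epoch fields on each proxy
log line are the request start and end times. The function names are
hypothetical.

```python
from decimal import Decimal
from urllib.parse import parse_qs, unquote

# Values copied from the log extracts above.
LISTING_PATH = (
    "/v1/AUTH_mw/wikipedia-commons-local-thumb.f8%3Flimit%3D9000"
    "%26prefix%3Df%252Ff8%252FGascones%252C_molino_%25281988%2529_02.jpg"
    "%252F%26format%3Djson%26states%3Dlisting")
OBJECT_NAME = ("f/f8/Gascones,_molino_(1988)_02.jpg/"
               "300px-Gascones,_molino_(1988)_02.jpg")
LISTING_START = Decimal("1736061590.006474018")
LISTING_END = Decimal("1736061590.036608696")
DELETE_START = Decimal("1736061590.041912317")
DELETE_END = Decimal("1736061590.193058491")
ROW_TIMESTAMP = Decimal("1736061590.04401")  # third field of the row


def listing_prefix(logged_path):
    """Undo the extra encoding layer, then parse out the prefix value."""
    decoded = unquote(logged_path)
    _, _, query = decoded.partition("?")
    # parse_qs decodes the remaining percent-escapes in each value.
    return parse_qs(query)["prefix"][0]


def offsets_ms(reference, **events):
    """Return each event's offset from the reference, in milliseconds."""
    return {name: (ts - reference) * 1000 for name, ts in events.items()}


if __name__ == "__main__":
    prefix = listing_prefix(LISTING_PATH)
    print("prefix:", prefix)
    print("object matches prefix:", OBJECT_NAME.startswith(prefix))
    timeline = offsets_ms(LISTING_START, listing_end=LISTING_END,
                          delete_start=DELETE_START,
                          row_timestamp=ROW_TIMESTAMP,
                          delete_end=DELETE_END)
    for name, offset in timeline.items():
        print(f"{name}: +{offset} ms")
```

Decimal is used rather than float so that the nanosecond digits in the
log survive the subtraction. The printed offsets make it easier to judge
how closely the listing and the delete overlapped.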
There is further investigation of the incident at
https://phabricator.wikimedia.org/T383053. I have lots of logs, of which
these seem the most pertinent parts, but I can provide other log
extracts if you'd like them.
Finally, I should note that this container contains image thumbnails,
generated by a separate service via a 404 handler in swift middleware -
see
https://github.com/wikimedia/operations-puppet/blob/45d5772c846e42269c2f1a19c8784fd9d2deb240/modules/swift/files/python3.9/SwiftMedia/wmf/rewrite.py#L48
Thanks,
Matthew
-- System Information:
Debian Release: 11.11
APT prefers oldstable-updates
APT policy: (500, 'oldstable-updates'), (500, 'oldstable-security'),
(500, 'oldstable-debug'), (500, 'oldstable')
Architecture: amd64 (x86_64)
Kernel: Linux 5.10.0-30-amd64 (SMP w/48 CPU threads)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8), LANGUAGE
not set
Shell: /bin/sh linked to /usr/bin/dash
Init: systemd (via /run/systemd/system)
Versions of packages swift-container depends on:
ii init-system-helpers 1.60
ii lsb-base 11.1.0
ii openstack-pkg-tools 117
ii python3 3.9.2-3
ii python3-pastescript 2.0.2-4
ii python3-swift 2.26.0-10+deb11u1+wmf1
ii rsync 3.2.3-4+deb11u1
ii swift 2.26.0-10+deb11u1+wmf1
ii uwsgi-plugin-python3 2.0.19.1-7.1
Versions of packages swift-container recommends:
ii swift-drive-audit 2.26.0-10+deb11u1+wmf1
swift-container suggests no packages.