#33516: QuerySet update() under postgres with 
select_for_update(skip_locked=True)
results in neither FOR UPDATE nor SKIP LOCKED
-------------------------------------+-------------------------------------
               Reporter:  Grant      |          Owner:  nobody
  Gainey                             |
                   Type:  Bug        |         Status:  new
              Component:  Database   |        Version:  3.2
  layer (models, ORM)                |
               Severity:  Normal     |       Keywords:
           Triage Stage:             |      Has patch:  0
  Unreviewed                         |
    Needs documentation:  0          |    Needs tests:  0
Patch needs improvement:  0          |  Easy pickings:  0
                  UI/UX:  0          |
-------------------------------------+-------------------------------------
 === Versions:
 {{{
 $ cat /etc/fedora-release
 Fedora release 34 (Thirty Four)

 $ pip freeze | grep -i django
 Django==3.2.12
 django-crispy-forms==1.13.0
 django-currentuser==0.5.3
 django-extensions==3.1.5
 django-filter==21.1
 django-guardian==2.4.0
 django-guid==3.2.1
 django-import-export==2.7.1
 django-lifecycle==0.9.3
 django-readonly-field==1.1.1
 django-storages==1.12.3
 djangorestframework==3.13.1
 djangorestframework-queryfields==1.0.0

 $ psql --version
 psql (PostgreSQL) 13.4
 }}}

 === Problem Statement
 pulpcore is a Django app that does A Lot Of Concurrent Stuff. One of those
 Things includes "update the last-touched timestamp of large batches of
 entities at the end of a workflow". We use
 `queryset.update(timestamp_of_interest=now())` for this.

 Batch updates don't guarantee row-update-order in Postgresql. In heavy-
 concurrency contexts, when updating overlapping content in multiple
 threads, this can result in deadlocks as postgres acquires row-locks in
 arbitrary orders.

 To get around this, in raw postgres, one uses `UPDATE...WHERE id IN
 (SELECT  id FROM table WHERE expression ORDER BY id FOR UPDATE)`. The
 SELECT-FOR-UPDATE-ORDER-BY enforces row-lock-ordering prior to the UPDATE,
 and everyone is happily synchronized and all the dining philosophers get
 to eat.

 If (as in our specific use-case) you don't actually care **who** updates
 the row, just that it is being updated "now()-ish", you can regain some
 concurrency by using Postgres' `SKIP LOCKED` functionality.

 So - we attempted to use the following to accomplish that in our update
 workflow:
 {{{
 TestModel.objects.filter(pk__in=content_q).order_by("pk").select_for_update(
         skip_locked=True
     ).update(timestamp_of_interest=now())
 }}}

 By the time we get to Postgres, this **does not emit FOR UPDATE or SKIP
 LOCKED**, and therefore doesn't solve our deadlock problem. Cue sad-
 trombone.

 Specifically, from the driver-example below, this code:
 {{{
 def update_timestamp(index):
     content_q = (
         TestModel.objects.filter()
             .order_by("pk")
             .values_list("pk", flat=True)
     )
 TestModel.objects.filter(pk__in=content_q).order_by("pk").select_for_update(
         skip_locked=True
     ).update(timestamp_of_interest=now())
 }}}

 generates this SQL:
 {{{
 2022-02-15 19:12:47.276 UTC [337361] LOG:  statement: UPDATE
 "core_testmodel" SET "timestamp_of_interest" =
 '2022-02-15T19:12:47.187475+00:00'::timestamptz WHERE
 "core_testmodel"."pulp_id" IN (SELECT U0."pulp_id" FROM "core_testmodel"
 U0)
 }}}

 Running the driver-example will generate deadlocks; the more TestModel
 entries exist, the more often you hit a deadlock. With 5K entries, test
 runs hit deadlocks ~5-10 times. With 20K entries, it was more like ~60-70%
 of the 100 runs.

 === Workaround
 Right now, we implemented a workaround in raw SQL. You can see it here:

 https://github.com/pulp/pulpcore/pull/2159/files#diff-
 81f6a78175bb93934b6beff952646d3ca1ef3731f1ff14492d4ec77bfc3fdf82R132-R141

 We would **much much rather** not have to use raw-sql.

 === Full Example:

 * Here's a really simple model, stored as `pulpcore/apps/models/test.py`
 {{{
 import uuid
 from django.db import models
 from django_lifecycle import LifecycleModel


 class TestModel(LifecycleModel):
     pulp_id = models.UUIDField(primary_key=True, default=uuid.uuid4,
 editable=False)
     timestamp_of_interest = models.DateTimeField(auto_now=True)
 }}}

 * Here's a migration that creates that model:
 {{{
 # Generated by Django 3.2.12 on 2022-02-15 18:53

 from django.db import migrations, models
 import django_lifecycle.mixins
 import uuid


 class Migration(migrations.Migration):

     dependencies = [
     ]

     operations = [
         migrations.CreateModel(
             name='TestModel',
             fields=[
                 ('pulp_id', models.UUIDField(default=uuid.uuid4,
 editable=False, primary_key=True, serialize=False)),
                 ('timestamp_of_interest',
 models.DateTimeField(auto_now=True)),
             ],
             options={
                 'abstract': False,
             },
             bases=(django_lifecycle.mixins.LifecycleModelMixin,
 models.Model),
         ),
     ]
 }}}

 * Here's some python that adds 5000 rows to the resulting table (there is
 probably a better/faster way to do this, but this is obvious):
 {{{
 from pulpcore.app.models import TestModel

 for tm in range(5000):
     TestModel.objects.create()
 }}}

 * Turning on statement-loggin in Postgres and restarting:
 {{{
 $ sudo vim /var/lib/pgsql/data/postgresql.conf
 ...
 log_statement = 'all'                   # none, ddl, mod, all
 $ sudo systemctl restart postgresql
 }}}

 * And here, at last, is driver-code that will generate deadlocks and let
 us inspect the emitted SQL:
 {{{
 from threading import Thread
 from pulpcore.app.models import TestModel
 from django.utils.timezone import now


 durations = []

 def update_timestamp(index):
     content_q = (
         TestModel.objects.filter()
             .order_by("pk")
             .values_list("pk", flat=True)
     )
 TestModel.objects.filter(pk__in=content_q).order_by("pk").select_for_update(
         skip_locked=True
     ).update(timestamp_of_interest=now())
     print(">>>done")

 for r in range(10):
     threads = []
     for i in range(10):
         threads.append(Thread(target=update_timestamp, args=(i,)))
     for t in threads:
         t.start()
     for t in threads:
         t.join()
 }}}

-- 
Ticket URL: <https://code.djangoproject.com/ticket/33516>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

-- 
You received this message because you are subscribed to the Google Groups 
"Django updates" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-updates/050.17c1aab02d0c59a571b1574f850fde86%40djangoproject.com.

Reply via email to