#33706: Implement annotations as (lateral) joins
-------------------------------------+-------------------------------------
               Reporter:  Bálint     |          Owner:  nobody
  Balina                             |
                   Type:             |         Status:  new
  Cleanup/optimization               |
              Component:  Database   |        Version:  4.0
  layer (models, ORM)                |       Keywords:
               Severity:  Normal     |  annotate,performance,subquery,postgresql
           Triage Stage:             |      Has patch:  0
  Unreviewed                         |
    Needs documentation:  0          |    Needs tests:  0
Patch needs improvement:  0          |  Easy pickings:  0
                  UI/UX:  0          |
-------------------------------------+-------------------------------------
 The ORM annotations are a very good and flexible tool for dynamic data.
 However there is a huge performance loss because of the way they are
 implemented. See the following example:


 {{{
 Product.objects.annotate(
     on_stock=Subquery(
         Stock.objects.filter(
             product_id=OuterRef('id'),
         ).values('product_id').annotate(
             on_stock=Sum('quantity'),
         ).values('on_stock').order_by(),
         output_field=DecimalField(),
     ),
     outgoing=Subquery(
         OrderItem.objects.filter(
             product_id=OuterRef('id'),
         ).values('product_id').annotate(
             outgoing=Sum('quantity'),
         ).values('outgoing').order_by(),
         output_field=DecimalField(),
     ),
     available=F('on_stock') - F('outgoing'),
 ).filter(
     available__gt=0,
 ).values('id', 'on_stock', 'outgoing', 'available')
 }}}

 When aggregates are needed from multiple tables annotation can be used as
 subqueries. Annotations can also be filtered against and reused. However
 django implements this query with the following SQL:


 {{{
 SELECT "product"."id",
        (SELECT SUM(u0."quantity") AS "on_stock"
         FROM "stock" u0
         WHERE ( u0."product_id" = ("product"."id"))
         GROUP BY u0."product_id")                                AS
 "on_stock",
        (SELECT SUM(u0."quantity") AS "outgoing"
         FROM "order_item" u0
         WHERE u0."product_id" = ("product"."id")
         GROUP BY u0."product_id")                                AS
 "outgoing",
        ((SELECT SUM(u0."quantity") AS "on_stock"
          FROM "stock" u0
          WHERE (u0."product_id" = ("product"."id"))
          GROUP BY u0."product_id") - (SELECT SUM(u0."quantity") AS
 "outgoing"
                                       FROM "order_item" u0
                                       WHERE u0."product_id" =
 ("product"."id")
                                       GROUP BY u0."product_id")) AS
 "available"
 FROM "product"
 WHERE (((SELECT SUM(u0."quantity") AS "on_stock"
          FROM "stock" u0
          WHERE (u0."product_id" = ("product"."id"))
          GROUP BY u0."product_id") - (SELECT SUM(u0."quantity") AS
 "outgoing"
                                       FROM "order_item" u0
                                       WHERE u0."product_id" =
 ("product"."id")
                                       GROUP BY u0."product_id")) > 0)
 }}}

 Both the on_stock and outgoing annotations are inlined 3 times and
 calculated 3 times by the database, which is 6 subquery instead of 2.

 See the following PostgreSQL query adjusted by hand:

 {{{
 SELECT "product"."id",
        "on_stock",
        "outgoing",
        "on_stock" - "outgoing" as available
 FROM "product"
    , LATERAL (SELECT SUM(u0."quantity") AS "on_stock"
               FROM "stock" u0
               WHERE (u0."product_id" = ("product"."id"))
               GROUP BY u0."product_id") AS "on_stock"
    , LATERAL (SELECT SUM(u0."quantity") AS "outgoing"
               FROM "order_item" u0
               WHERE u0."product_id" = ("product"."id")
               GROUP BY u0."product_id") AS "outgoing"
 WHERE (("on_stock" - "outgoing") > 0)
 }}}

 The exact same subquery can be joined with the lateral keyword, producing
 the exact same results, but reusing calculations. The performance benefit
 is huge, it can be hundreds of milliseconds even for this simple query,
 even more so the more complicated and reused these calculations are.

 I like the way annotate works, there are many small pieces to put together
 to get a complex result. I'm planning to look into implementing this
 behaviour, but wanted to check first about your opinions:

 - do you think it is possible to patch the ORM to generate sql like above?
 - do you see any pitfalls or caveats to this method of performing the
 query?
 - if there are no pitfalls, or they can be resolved do you think that this
 could be the default way of annotating calculations in django? The
 examples are in postgresql, not sure about the other supported database
 engines compatibility.

-- 
Ticket URL: <https://code.djangoproject.com/ticket/33706>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

-- 
You received this message because you are subscribed to the Google Groups 
"Django updates" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to [email protected].
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-updates/01070180bd54e1b4-98d25634-efb6-4478-b098-5e87159e21ab-000000%40eu-central-1.amazonses.com.

Reply via email to