On 12/4/06, Jacob Kaplan-Moss <[EMAIL PROTECTED]> wrote:
>
> I'm taking this to django-dev for more discussion; it'll get seen by more the
> right people there.
>
> Thoughts, anyone?

Ok; here's my thoughts.

1. Introduction
~~~~~~~~~~~~~~~
Consider the following pseudo-model:

class Book(Model):
   name = CharField()
   price = FloatField()
   pub_date = DateField()

class Order(Model):
   description = CharField()
   time = DateTimeField()
   books = ManyToManyField(Book)

Simple aggregation use cases would be questions like:
a) What is the cheapest book?
b) What is the average book price?
c) Find me all the books with a price greater than the average.
d) What is the total cost of order 1234?

These questions require that you query the database to obtain a single
aggregated value, or a object matching that aggregated value.

However, the more general class of problem is to generate a summary
statistic for each object in a query, with the summary stats coming
from the related objects. For example:

e) What is the total cost of _each_ order?

IMHO, what we need is the ability to annotate summary statistics onto
the results returned by a query set.

2. Proposal
~~~~~~~~~~~
Add an 'annotate()' modifier to QuerySets. The arguments of annotate()
describe the annotations that you want to be applied to the objects
that are returned by the Query Set.

annotate() returns a query set, so it can be used multiple times, be
combined with filters, etc. The argument handling strategy employed in
filter() is reused here; kwargs to annotate() can be decomposed on a
__ boundary to describe table joins, with the last part describing the
aggregate operator to be used. The syntax follows something like:

Model.objects.annotate(field1__field2__aggregate='annotation')

field1__field2 describes the path to get to the field that will be
ultimately aggregated; aggregate is the aggregation function (max,min
etc); the value of the argument is a string that is the name of the
annotate argument.

The objects that are returned when the queryset is executed will be
normal objects, but with additional attributes corresponding to the
annotations.

e.g.,
# Get order 1234, and annotate it a few different ways
>>> order = Order.objects.get(id=1234).annotate(
    books__price__sum='total_cost',
    books__count='item_count')
# Inspect the order
>>> order.description
"Dad's birthday"
# Annotated orders have a 'total_cost' attribute...
>>> order.total_cost
47.2
# ... and an 'item_count' attribute
>>> order.item_count
3

3. Just the facts, Ma'am
~~~~~~~~~~~~~~~~~~~~~~~~
Ok; so what if you just want _the minimum_, or _the average_? For
this, I propose an aggregates() queryset modifier.

>>> Book.objects.aggregates(price__min='min_price', pub_date__max='last_update')
{'min_price':0.5, 'last_update': 2006-11-22}

aggregates() would expand queries in the same manner as annotate(),
but would be a terminal clause, just like the values() clause.

This is a more verbose notation than the simple 'max()/min()' . I have
discussed my problems with these operators previously; however, if
there is sufficient demand, I don't see any reason that min('field')
couldn't be included in the API as a shorthand for
Model.objects.aggregates(field__min='min')['min'].

4. Comparisons
~~~~~~~~~~~~~~
There is one additional requirement I can see; to perform queries like
(c), you need to be able to compare annotation attributes to object
attributes.

# Annotate a query set with the average price of books
>>> qs = Book.objects.annotate(price__average='avg_price').
# Filter all books with obj.avg_price < obj.price
>>> expensive_books = qs.filter(avg_price__lt=F('price'))

The F() object is a placeholder to let the query language know that
'price' isn't just a string, it's the name of a field. This follows
the example of Q() objects providing query wrappers.

This capability would also be beneficial to the query language in
general; at present, there is no easy way to pull out all objects
where field1 = field2.

5. Implementation
~~~~~~~~~~~~~~~~~
Now the mechanics: What does annotate() do to the SQL?

If there is an annotate clause in a query set:
- All of the base model attributes (the attributes that would normally
be returned by the queryset) are placed in a GROUP BY clause
- Any query on a base model attribute is placed in a HAVING clause
- Any query on the annotation field is placed in a WHERE clause
- If the annotation clause traverses joins, those tables are joined in
the same manner as they would be for filter().

6. Limitations/Problems
~~~~~~~~~~~~~~~~~~~~~~~
- This approach doesn't handle any particularly creative usage of the
GROUP BY clause. I'm open to suggestions, but I'm also happy to put
this problem subset into the 20% "do it in raw SQL" category.

- I'm not overly enamoured with the name aggregates()  - it isn't a
particularly intuitive name for the functionality provided; summary()
is the only other option I could think of.

- I haven't tried to implement this, so there are probably some sharp
edges in the SQL generation process.

~~~~~~~~~~~~~~~~~~~~~~~

So - that's my two bits. Comments?

Yours,
Russ Magee %-)

--~--~---------~--~----~------------~-------~--~----~
 You received this message because you are subscribed to the Google Groups 
"Django developers" group.
To post to this group, send email to django-developers@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at 
http://groups.google.com/group/django-developers?hl=en
-~----------~----~----~----~------~----~------~--~---

Reply via email to