On 12/4/06, Jacob Kaplan-Moss <[EMAIL PROTECTED]> wrote: > > I'm taking this to django-dev for more discussion; it'll get seen by more the > right people there. > > Thoughts, anyone?
Ok; here's my thoughts. 1. Introduction ~~~~~~~~~~~~~~~ Consider the following pseudo-model: class Book(Model): name = CharField() price = FloatField() pub_date = DateField() class Order(Model): description = CharField() time = DateTimeField() books = ManyToManyField(Book) Simple aggregation use cases would be questions like: a) What is the cheapest book? b) What is the average book price? c) Find me all the books with a price greater than the average. d) What is the total cost of order 1234? These questions require that you query the database to obtain a single aggregated value, or a object matching that aggregated value. However, the more general class of problem is to generate a summary statistic for each object in a query, with the summary stats coming from the related objects. For example: e) What is the total cost of _each_ order? IMHO, what we need is the ability to annotate summary statistics onto the results returned by a query set. 2. Proposal ~~~~~~~~~~~ Add an 'annotate()' modifier to QuerySets. The arguments of annotate() describe the annotations that you want to be applied to the objects that are returned by the Query Set. annotate() returns a query set, so it can be used multiple times, be combined with filters, etc. The argument handling strategy employed in filter() is reused here; kwargs to annotate() can be decomposed on a __ boundary to describe table joins, with the last part describing the aggregate operator to be used. The syntax follows something like: Model.objects.annotate(field1__field2__aggregate='annotation') field1__field2 describes the path to get to the field that will be ultimately aggregated; aggregate is the aggregation function (max,min etc); the value of the argument is a string that is the name of the annotate argument. The objects that are returned when the queryset is executed will be normal objects, but with additional attributes corresponding to the annotations. e.g., # Get order 1234, and annotate it a few different ways >>> order = Order.objects.get(id=1234).annotate( books__price__sum='total_cost', books__count='item_count') # Inspect the order >>> order.description "Dad's birthday" # Annotated orders have a 'total_cost' attribute... >>> order.total_cost 47.2 # ... and an 'item_count' attribute >>> order.item_count 3 3. Just the facts, Ma'am ~~~~~~~~~~~~~~~~~~~~~~~~ Ok; so what if you just want _the minimum_, or _the average_? For this, I propose an aggregates() queryset modifier. >>> Book.objects.aggregates(price__min='min_price', pub_date__max='last_update') {'min_price':0.5, 'last_update': 2006-11-22} aggregates() would expand queries in the same manner as annotate(), but would be a terminal clause, just like the values() clause. This is a more verbose notation than the simple 'max()/min()' . I have discussed my problems with these operators previously; however, if there is sufficient demand, I don't see any reason that min('field') couldn't be included in the API as a shorthand for Model.objects.aggregates(field__min='min')['min']. 4. Comparisons ~~~~~~~~~~~~~~ There is one additional requirement I can see; to perform queries like (c), you need to be able to compare annotation attributes to object attributes. # Annotate a query set with the average price of books >>> qs = Book.objects.annotate(price__average='avg_price'). # Filter all books with obj.avg_price < obj.price >>> expensive_books = qs.filter(avg_price__lt=F('price')) The F() object is a placeholder to let the query language know that 'price' isn't just a string, it's the name of a field. This follows the example of Q() objects providing query wrappers. This capability would also be beneficial to the query language in general; at present, there is no easy way to pull out all objects where field1 = field2. 5. Implementation ~~~~~~~~~~~~~~~~~ Now the mechanics: What does annotate() do to the SQL? If there is an annotate clause in a query set: - All of the base model attributes (the attributes that would normally be returned by the queryset) are placed in a GROUP BY clause - Any query on a base model attribute is placed in a HAVING clause - Any query on the annotation field is placed in a WHERE clause - If the annotation clause traverses joins, those tables are joined in the same manner as they would be for filter(). 6. Limitations/Problems ~~~~~~~~~~~~~~~~~~~~~~~ - This approach doesn't handle any particularly creative usage of the GROUP BY clause. I'm open to suggestions, but I'm also happy to put this problem subset into the 20% "do it in raw SQL" category. - I'm not overly enamoured with the name aggregates() - it isn't a particularly intuitive name for the functionality provided; summary() is the only other option I could think of. - I haven't tried to implement this, so there are probably some sharp edges in the SQL generation process. ~~~~~~~~~~~~~~~~~~~~~~~ So - that's my two bits. Comments? Yours, Russ Magee %-) --~--~---------~--~----~------------~-------~--~----~ You received this message because you are subscribed to the Google Groups "Django developers" group. To post to this group, send email to django-developers@googlegroups.com To unsubscribe from this group, send email to [EMAIL PROTECTED] For more options, visit this group at http://groups.google.com/group/django-developers?hl=en -~----------~----~----~----~------~----~------~--~---