Hi devs,

When prefetching related items for a queryset returning a large amount of 
items, the generated SQL can be quite inefficient. Here's an example:

class Category(models.Model):
    type = models.PositiveIntegerField(db_index=True)


class Item(models.Model):
   category = models.ForeignKey(Category, related_name='%(class)ss')


If I have 50.000 categories of type=5, then 
"Category.objects.filter(type=5).prefetch_related('items')" will generate 
an SQL on the Item table with an IN clause containing the 50.000 Category 
ID's. This is because get_prefetch_queryset() on the Manager classes in 
django.db.models.related all do this:

def get_prefetch_queryset(self, instances, queryset=None):
    [...]
    query = {'%s__in' % self.query_field_name: instances}
    queryset = queryset._next_is_sticky().filter(**query)
    [...]


While this is great when instances is a short list, we can hope to do 
better than that when instances is large. A much more efficient query in 
the above case would be queryset._next_is_sticky().filter(category__type=5). 
We just need to make sure this alternative query will fetch (at least) the 
same items as the query with the IN clause would.

I thought I could use Prefetch('items', 
queryset=Item.objects.filter(category__type=5))to accomplish this, but 
while the custom queryset *is* used, get_prefetch_queryset() still adds the 
IN clause unconditionally. This is A) redundant B) sends a huge SQL string 
over the wire for the database to process, and C) seriously messes up the 
database query planner, often generating an inefficient execution plan.

I would like to fix this problem. The easiest way seems to be to let 
Prefetch() tell get_prefetch_queryset() to skip the IN clause. 
django.db.models.prefetch_one_level() is in charge of passing the Prefetch 
queryset to get_prefetch_queryset(), so one option would be to add a 
skip_in_clause attribute to the Prefetch model which prefetch_one_level() 
would pass to get_prefetch_queryset(). The latter could then do:

def get_prefetch_queryset(self, instances, queryset=None, skip_in_clause=
False):
    [...]
    if not skip_in_clause:
        query = {'%s__in' % self.query_field_name: instances}
        queryset = queryset._next_is_sticky().filter(**query)
    [...]


In my preliminary testing on real data, this:

Category.objects.filter(type=5).prefetch_related(Prefetch('items', queryset=
Item.objects.filter(category__type=5), skip_in_clause=True))

is about 20x faster than the current implementation when the query returns 
50.000 categories. The improvement would be greater on larger datasets.

Any comments? If needed, I can provide a pull request with a suggested 
implementation. Apart from the Prefetch class and prefetch_one_level(), 
there are 4 instances of get_prefetch_queryset() in 
django.db.models.related that would need to be changed.


Thanks,
Erik

-- 
You received this message because you are subscribed to the Google Groups 
"Django developers  (Contributions to Django itself)" group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to django-developers+unsubscr...@googlegroups.com.
To post to this group, send email to django-developers@googlegroups.com.
Visit this group at http://groups.google.com/group/django-developers.
To view this discussion on the web visit 
https://groups.google.com/d/msgid/django-developers/8fe6b600-854c-491e-9acf-e13399d1eafb%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Reply via email to