I thought I'd chime in since I run a Django app with fairly large tables 
(1M, 30M, 120M rows) on a fairly poor VPS (4GB ram, 2 cores, database and 
all services on one machine).

I just don't see a situation where "safety limits on ORM queries" would 
improve things. Changing behaviour based on the size of the queryset seems 
like a really bad idea. Now instead of gradual slowdown as the dataset 
grows, you start getting surprise exceptions, invalid data in results, or 
missing functionality once you hit a magic threshold.

There is no good way to set that threshold automatically, since people have 
very different expectations of what's "slow". For example, loading the 
second-to-last list page in the Django admin for my 1M row table takes a 
perfectly reasonable 2-3 seconds. That's too slow for the front page of a 
popular site, but even 30 seconds would be fine for an admin page that's 
visited maybe once a month.

I'm not saying that performance improvements or optimisation suggestions 
are a bad idea, but where they are possible they should be consistent. If 
using "if queryset:" is a bug, then it should trigger a warning and a 
suggestion to use "if queryset.exists():" for both 100 and 100M rows. If 
there's an option to turn off full pagination in the admin, then it should 
be a ModelAdmin option, not dependent on data size - the developer can 
enable it for big tables manually.
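
To make the first point concrete, the difference is between evaluating 
(and caching) the entire queryset just to test whether it's empty and 
issuing a cheap LIMIT 1 style query - a cost difference that is the same 
in kind at 100 rows and at 100M rows. A minimal sketch, using a 
hypothetical Entry model:

from myapp.models import Entry  # hypothetical model, purely for illustration

qs = Entry.objects.filter(published=True)

# Truth-testing a queryset fetches and caches *all* matching rows.
if qs:
    pass

# .exists() asks the database a cheap "is there at least one row?" question instead.
if qs.exists():
    pass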

Behaviour changes based on queryset size make things more difficult for 
developers, not easier - right now you can delay optimisation work until 
it really becomes too painful for users, whereas strict limits would force 
that work as soon as a limit is reached.

Finally, I agree that it would be nice if PostgreSQL server-side cursors 
were better integrated with the ORM. It's already possible to use them for 
raw queries:

from django.db import connection, transaction

connection.ensure_connection()  # make sure the underlying DB connection exists
with transaction.atomic():
    # a named cursor is a PostgreSQL server-side cursor in psycopg2
    cursor = connection.connection.cursor(name='hello')
    cursor.execute('BIG QUERYSET SQL')
    for row in cursor:  # fetches rows in cursor.itersize chunks
        pass

I use this in a few scripts that iterate over 10GB of table data. But a way 
to map the rows to Django objects would be nice.
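
For what it's worth, here's a rough sketch of one way to map those raw rows 
back to model instances by hand. MyModel and the SQL are hypothetical, and 
the mapping simply zips column names onto keyword arguments (so the selected 
columns need to match the model's field names) - a workaround rather than 
real ORM integration:

from django.db import connection, transaction

from myapp.models import MyModel  # hypothetical model matching the query below


def iter_model_rows(sql, itersize=2000):
    # A named cursor is a PostgreSQL server-side cursor in psycopg2,
    # so rows are streamed in itersize chunks instead of all at once.
    connection.ensure_connection()
    with transaction.atomic():
        cursor = connection.connection.cursor(name='big_query')
        cursor.itersize = itersize
        cursor.execute(sql)
        columns = None
        for row in cursor:
            if columns is None:
                # cursor.description is populated once rows have been fetched;
                # the first element of each entry is the column name.
                columns = [col[0] for col in cursor.description]
            yield MyModel(**dict(zip(columns, row)))


for obj in iter_model_rows('SELECT id, name FROM myapp_mymodel'):
    pass  # obj is a MyModel instance built from the row (Django treats it as unsaved)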


On Thursday, 20 November 2014 13:46:01 UTC+2, Rick van Hattem wrote:
>
> On Thursday, November 20, 2014 8:31:06 AM UTC+1, Christian Schmitt wrote:
>>
> >>> Nope, a large OFFSET of N will read through N rows, regardless of index 
> >>> coverage. See 
> >>> http://www.postgresql.org/docs/9.1/static/queries-limit.html
>>>
>>
> >> That's simply not true.
> >> If you define an ORDER BY with a well-indexed query, the database will 
> >> only do a bitmap scan.
>> This wiki isn't well explained. Take this:
>>
> >> https://wiki.postgresql.org/images/3/35/Pagination_Done_the_PostgreSQL_Way.pdf
>>
>
> Sounds awesome in theory; here's a real-world example:
>
> http://explain.depesz.com/s/WpQU
> # explain analyze select * from entity_entity order by id limit 10 offset 1000000;
>                                  QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------
>  Limit  (cost=44832.84..44833.28 rows=10 width=423) (actual time=8280.248..8280.382 rows=10 loops=1)
>    ->  Index Scan using entity_entity_pkey_two on entity_entity  (cost=0.44..823410.92 rows=18366416 width=423) (actual time=0.232..6861.507 rows=1000010 loops=1)
>  Total runtime: 8280.442 ms
> (3 rows)
>
> http://explain.depesz.com/s/qexc
> # explain analyze select * from entity_entity where id > 1000000 order by id limit 10;
>                                  QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.44..0.91 rows=10 width=423) (actual time=0.044..0.102 rows=10 loops=1)
>    ->  Index Scan using entity_entity_pkey_two on entity_entity  (cost=0.44..823830.94 rows=17405140 width=423) (actual time=0.040..0.071 rows=10 loops=1)
>          Index Cond: (id > 1000000)
>  Total runtime: 0.152 ms
> (4 rows)
>
>
> And that's not even all the way in the table. That's just the beginning.
>
>
> >>> Read the relevant thread I posted before. There are huge memory 
> >>> allocation issues even if you just iterate the first two items of a query 
> >>> that potentially returns a million rows: all the rows will be fetched and 
> >>> stored in memory by psycopg and the Django ORM.
>>>
>>
> >> That's only happening because Django will build something like SELECT 
> >> "bla", "bla", "bla" FROM table LIMIT 10 OFFSET 100;
> >> There is no ORDER BY clause, which is of course really slow, even with 
> >> only 1000 entries inside a table.
>>
>
> Once again, great in theory. Doesn't always mean much in real world 
> scenarios. Compare the following numbers to the ones above:
>
>
> http://explain.depesz.com/s/50Wo
> # explain analyze select * from entity_entity where id > 1000000 limit 10;
>                                  QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------
>  Limit  (cost=0.00..0.67 rows=10 width=423) (actual time=0.012..0.070 rows=10 loops=1)
>    ->  Seq Scan on entity_entity  (cost=0.00..1168998.20 rows=17405140 width=423) (actual time=0.008..0.038 rows=10 loops=1)
>          Filter: (id > 1000000)
>  Total runtime: 0.122 ms
> (4 rows)
>
> http://explain.depesz.com/s/jDJS
> # explain analyze select * from entity_entity limit 10 offset 1000000;
>                                  QUERY PLAN
> -----------------------------------------------------------------------------------------------------------------
>  Limit  (cost=61148.68..61149.30 rows=10 width=423) (actual time=3820.305..3820.347 rows=10 loops=1)
>    ->  Seq Scan on entity_entity  (cost=0.00..1123082.16 rows=18366416 width=423) (actual time=0.007..2410.809 rows=1000010 loops=1)
>  Total runtime: 3820.403 ms
> (3 rows)
>
>
> >> I don't know why we should enforce something just because somebody has a 
> >> huge amount of data and can't deal with it. That wouldn't follow the clean 
> >> design pattern. Most people rely on the "Page to" part of the Paginator; if 
> >> we patch Django to use a seek method as the default, it wouldn't be that 
> >> good for a lot of developers.
> >> Most people that have a performance problem could easily "fix" it 
> >> themselves or throw more hardware at it.
>>
>
> The point of this discussion is to make Django easier and safer to use in 
> all scenarios. Even in scenarios with relatively small amounts of data (sub 
> 100k rows) Django can already kill a small VPS easily. Having some 
> (optional) safeties in place doesn't hurt and can help a lot of people.
>
> It's not that large amounts of data are not something you can deal with, 
> but I've seen many real-life projects and Django apps where people didn't 
> take into consideration that some people have more than 100 rows. With 1M+ 
> rows you can kill a lot, but many things slow down or die with even 10k 
> rows.
>
