Hi Mattias,

Can you comment on the example I've given, where this should not cause any problems and should help pretty much everyone (but just in case, the setting could be *optional*)?
If/when an *unsliced* queryset were to reach a certain limit (say 10,000, but configurable) the system would raise an error. My arguments for why this would be a good idea and not a problem for anyone:

- It protects servers from going down due to memory constraints.
- Why would anyone need to run through a list of *10,000* items without any slicing/pagination/etc.?
- If someone _does_ need to walk through more than 10,000 items, they'll know how to slice a queryset.
- You can write your own apps perfectly, but installing 3rd party apps can still kill the system without fixes like these.

So please, can anyone give a good argument as to why any sane person would have a problem with a huge default limit which will kill the performance of your site anyhow but isn't enough to kill the entire system? (A rough sketch of what I have in mind is at the bottom of this message.)

On Sunday, November 23, 2014 11:49:14 AM UTC+1, Mattias Linnap wrote:
>
> I thought I'd chime in since I run a Django app with fairly large tables
> (1M, 30M, 120M rows) on a fairly poor VPS (4GB RAM, 2 cores, database and
> all services on one machine).
>
> I just don't see a situation where "safety limits on ORM queries" would
> improve things. Changing behaviour based on the size of the queryset seems
> like a really bad idea. Now instead of a gradual slowdown as the dataset
> grows, you start getting surprise exceptions, invalid data in results, or
> missing functionality once you hit a magic threshold.
>
> There is no good way to set that threshold automatically, since people
> have very different expectations of what's "slow". For example, loading the
> second-to-last list page in the Django admin for my 1M row table takes a
> perfectly reasonable 2-3 seconds. That's too slow for the front page of a
> popular site, but even 30 seconds would be fine for an admin page that's
> visited maybe once a month.
>
> I'm not saying that performance improvements or optimisation suggestions
> are a bad idea, but if they are possible, they should stay consistent. If
> using "if queryset:" is a bug, then it should show a warning and a
> suggestion to use "if queryset.exists():" for both 100 and 100M rows. If
> there's an option to turn off full pagination in the admin, then it should
> be a ModelAdmin option, not dependent on data size - the developer can
> enable it for big tables manually.
>
> Behaviour changes based on the queryset size make it more difficult, not
> easier, for developers - right now you can delay optimisation work until it
> really becomes too painful for users, whereas having strict limits would
> require doing that work as soon as a limit is reached.
>
> Finally, I agree that it would be nice if PostgreSQL server-side cursors
> were better integrated with the ORM. It's possible to use them for raw
> queries already:
>
> from django.db import connection, transaction
>
> connection.ensure_connection()
> with transaction.atomic():
>     # A named cursor makes psycopg2 use a PostgreSQL server-side cursor.
>     cursor = connection.connection.cursor(name='hello')
>     cursor.execute('BIG QUERYSET SQL')
>     for row in cursor:  # fetches rows in cursor.itersize chunks
>         pass
>
> I use this in a few scripts that iterate over 10GB of table data. But a
> way to map the rows to Django objects would be nice.
>
> On Thursday, 20 November 2014 13:46:01 UTC+2, Rick van Hattem wrote:
>>
>> On Thursday, November 20, 2014 8:31:06 AM UTC+1, Christian Schmitt wrote:
>>>
>>>> Nope. A large OFFSET of N will read through N rows, regardless of
>>>> index coverage. See
>>>> http://www.postgresql.org/docs/9.1/static/queries-limit.html
>>>
>>> That's simply not true.
>>> If you define an ORDER BY on a well-indexed query, the database will
>>> only do a bitmap scan. That wiki page doesn't explain it well. Take this:
>>>
>>> https://wiki.postgresql.org/images/3/35/Pagination_Done_the_PostgreSQL_Way.pdf
>>
>> Sounds awesome in theory, here's a real world example:
>>
>> http://explain.depesz.com/s/WpQU
>> # explain analyze select * from entity_entity order by id limit 10 offset 1000000;
>>                                    QUERY PLAN
>> -------------------------------------------------------------------------------
>> Limit (cost=44832.84..44833.28 rows=10 width=423) (actual time=8280.248..8280.382 rows=10 loops=1)
>>   -> Index Scan using entity_entity_pkey_two on entity_entity (cost=0.44..823410.92 rows=18366416 width=423) (actual time=0.232..6861.507 rows=1000010 loops=1)
>> Total runtime: 8280.442 ms
>> (3 rows)
>>
>> http://explain.depesz.com/s/qexc
>> # explain analyze select * from entity_entity where id > 1000000 order by id limit 10;
>>                                    QUERY PLAN
>> -------------------------------------------------------------------------------
>> Limit (cost=0.44..0.91 rows=10 width=423) (actual time=0.044..0.102 rows=10 loops=1)
>>   -> Index Scan using entity_entity_pkey_two on entity_entity (cost=0.44..823830.94 rows=17405140 width=423) (actual time=0.040..0.071 rows=10 loops=1)
>>        Index Cond: (id > 1000000)
>> Total runtime: 0.152 ms
>> (4 rows)
>>
>> And that's not even all the way into the table. That's just the beginning.
>>
>>>> Read the relevant thread I posted before. There are huge memory
>>>> allocation issues even if you just iterate the first two items of a
>>>> query that potentially returns a million rows: all the rows will be
>>>> fetched and stored in memory by psycopg and the Django ORM.
>>>
>>> That's only happening because Django will build something like
>>> SELECT "bla", "bla", "bla" FROM table LIMIT 10 OFFSET 100;
>>> There is no ORDER BY clause, which is of course really slow, even with
>>> only 1000 entries inside a table.
>>
>> Once again, great in theory. Doesn't always mean much in real world
>> scenarios. Compare the following numbers to the ones above:
>>
>> http://explain.depesz.com/s/50Wo
>> # explain analyze select * from entity_entity where id > 1000000 limit 10;
>>                                    QUERY PLAN
>> -------------------------------------------------------------------------------
>> Limit (cost=0.00..0.67 rows=10 width=423) (actual time=0.012..0.070 rows=10 loops=1)
>>   -> Seq Scan on entity_entity (cost=0.00..1168998.20 rows=17405140 width=423) (actual time=0.008..0.038 rows=10 loops=1)
>>        Filter: (id > 1000000)
>> Total runtime: 0.122 ms
>> (4 rows)
>>
>> http://explain.depesz.com/s/jDJS
>> # explain analyze select * from entity_entity limit 10 offset 1000000;
>>                                    QUERY PLAN
>> -------------------------------------------------------------------------------
>> Limit (cost=61148.68..61149.30 rows=10 width=423) (actual time=3820.305..3820.347 rows=10 loops=1)
>>   -> Seq Scan on entity_entity (cost=0.00..1123082.16 rows=18366416 width=423) (actual time=0.007..2410.809 rows=1000010 loops=1)
>> Total runtime: 3820.403 ms
>> (3 rows)
>>
>>> I don't know why we should enforce something just because somebody has a
>>> huge amount of data and can't deal with it.
>>> That wouldn't follow the clean design pattern. Most people rely on the
>>> "Page to" part of the Paginator; if we patched Django to use a seek
>>> method by default it wouldn't be that good for a lot of developers.
>>> Most people that have a performance problem could easily "fix" it
>>> themselves or throw more hardware at it.
>>
>> The point of this discussion is to make Django easier and safer to use in
>> all scenarios. Even in scenarios with relatively small amounts of data
>> (sub 100k rows) Django can already kill a small VPS easily. Having some
>> (optional) safeties in place doesn't hurt and can help a lot of people.
>>
>> It's not that large amounts of data are impossible to deal with, but I've
>> seen many real-life projects and Django apps where people didn't take into
>> consideration that some people have more than 100 rows. With 1M+ rows you
>> can kill a lot, but many things slow down or die with even 10k rows.
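
P.S. To make the proposal at the top of this message concrete, here's a rough
sketch of the kind of optional safety net I have in mind. The setting name
(QUERYSET_SAFETY_LIMIT), the SafeQuerySet class and the reliance on queryset
internals (_fetch_all, query.high_mark) are placeholders for illustration
only, not a proposed patch:

    from django.conf import settings
    from django.db import models

    # Hypothetical setting; 10,000 is just the example default from this thread.
    SAFETY_LIMIT = getattr(settings, 'QUERYSET_SAFETY_LIMIT', 10000)

    class SafeQuerySet(models.QuerySet):
        """Raises instead of silently loading an unbounded result set."""

        def _fetch_all(self):
            # query.high_mark is only None when the queryset was never sliced.
            if self._result_cache is None and self.query.high_mark is None:
                # Evaluate a sliced clone that fetches at most SAFETY_LIMIT + 1
                # rows, so we never load the whole table just to check the size.
                rows = list(self[:SAFETY_LIMIT + 1])
                if len(rows) > SAFETY_LIMIT:
                    raise RuntimeError(
                        'Unsliced queryset would return more than %d rows; '
                        'slice or paginate it instead.' % SAFETY_LIMIT)
                self._result_cache = rows
            super(SafeQuerySet, self)._fetch_all()

    class Entity(models.Model):
        name = models.CharField(max_length=100)

        objects = SafeQuerySet.as_manager()

With that in place, list(Entity.objects.all()) on a 1M row table raises after
fetching at most 10,001 rows, while Entity.objects.all()[:100] keeps working
unchanged.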
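
P.P.S. For reference, the "seek" approach from the EXPLAIN output above maps
onto the ORM without any patches. A minimal sketch, assuming an Entity model
backed by the entity_entity table:

    # OFFSET pagination: the database still walks the first 1,000,000 rows
    # before returning the 10 requested ones.
    page = Entity.objects.order_by('id')[1000000:1000010]

    # Keyset ("seek") pagination: remember the last id of the previous page
    # and filter on it, which lets the primary key index do the work.
    last_seen_id = 1000000  # taken from the previous page
    page = Entity.objects.filter(id__gt=last_seen_id).order_by('id')[:10]

The trade-off, as Christian points out, is that you lose "jump to page N";
you can only step forwards or backwards from a known position.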
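
And regarding Mattias's point that a way to map server-side cursor rows to
Django objects would be nice: a minimal sketch of one way to do it by hand,
assuming the same (hypothetical) Entity model and selecting its concrete
columns explicitly so the row order is known:

    from django.db import connection, transaction
    from myapp.models import Entity  # hypothetical app/model

    fields = [f.attname for f in Entity._meta.concrete_fields]
    columns = ', '.join('"%s"' % f.column for f in Entity._meta.concrete_fields)
    sql = 'SELECT %s FROM %s ORDER BY id' % (columns, Entity._meta.db_table)

    connection.ensure_connection()
    with transaction.atomic():
        # The named cursor makes psycopg2 use a PostgreSQL server-side cursor.
        cursor = connection.connection.cursor(name='entity_scan')
        cursor.execute(sql)
        for row in cursor:  # streamed in cursor.itersize chunks
            obj = Entity(**dict(zip(fields, row)))
            # obj is a regular, manually constructed model instance.

It would of course be nicer if the ORM offered this out of the box, but the
mapping itself is straightforward.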