On Thursday, November 20, 2014 8:31:06 AM UTC+1, Christian Schmitt wrote:

>> Nope. A large OFFSET of N will read through N rows, regardless of index
>> coverage. See
>> http://www.postgresql.org/docs/9.1/static/queries-limit.html
>
> That's simply not true.
> If you define an ORDER BY with a well-indexed query, the database will only
> do a bitmap scan.
> The wiki isn't well explained. Take this:
>
> https://wiki.postgresql.org/images/3/35/Pagination_Done_the_PostgreSQL_Way.pdf
Sounds awesome in theory. Here's a real-world example:

http://explain.depesz.com/s/WpQU

# explain analyze select * from entity_entity order by id limit 10 offset 1000000;
                                                                     QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=44832.84..44833.28 rows=10 width=423) (actual time=8280.248..8280.382 rows=10 loops=1)
   ->  Index Scan using entity_entity_pkey_two on entity_entity  (cost=0.44..823410.92 rows=18366416 width=423) (actual time=0.232..6861.507 rows=1000010 loops=1)
 Total runtime: 8280.442 ms
(3 rows)

http://explain.depesz.com/s/qexc

# explain analyze select * from entity_entity where id > 1000000 order by id limit 10;
                                                                 QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.44..0.91 rows=10 width=423) (actual time=0.044..0.102 rows=10 loops=1)
   ->  Index Scan using entity_entity_pkey_two on entity_entity  (cost=0.44..823830.94 rows=17405140 width=423) (actual time=0.040..0.071 rows=10 loops=1)
         Index Cond: (id > 1000000)
 Total runtime: 0.152 ms
(4 rows)

And that's not even all the way into the table. That's just the beginning.

>> Read the relevant thread I posted before. There are huge memory
>> allocation issues even if you just iterate the first two items of a query
>> that potentially returns a million rows: all the rows will be fetched and
>> stored in memory by psycopg and the Django ORM.
>
> That's only happening because Django will build something like SELECT "bla",
> "bla", "bla" FROM table LIMIT 10 OFFSET 100;
> There is no ORDER BY clause, which is of course really slow, even with
> only 1000 entries inside a table.

Once again, great in theory, but that doesn't always mean much in real-world scenarios. Compare the following numbers to the ones above:

http://explain.depesz.com/s/50Wo

# explain analyze select * from entity_entity where id > 1000000 limit 10;
                                                          QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.67 rows=10 width=423) (actual time=0.012..0.070 rows=10 loops=1)
   ->  Seq Scan on entity_entity  (cost=0.00..1168998.20 rows=17405140 width=423) (actual time=0.008..0.038 rows=10 loops=1)
         Filter: (id > 1000000)
 Total runtime: 0.122 ms
(4 rows)

http://explain.depesz.com/s/jDJS

# explain analyze select * from entity_entity limit 10 offset 1000000;
                                                             QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=61148.68..61149.30 rows=10 width=423) (actual time=3820.305..3820.347 rows=10 loops=1)
   ->  Seq Scan on entity_entity  (cost=0.00..1123082.16 rows=18366416 width=423) (actual time=0.007..2410.809 rows=1000010 loops=1)
 Total runtime: 3820.403 ms
(3 rows)

> I don't know why we should enforce something just because somebody has a
> huge amount of data and can't deal with it. That wouldn't follow the clean
> design pattern. Most people rely on the "Page to" part of the Paginator; if
> we patch Django to use a seek method as the default it wouldn't be that good
> for a lot of developers.
> Most people that have a performance problem could easily "fix" it
> themselves or throw more hardware at it.

The point of this discussion is to make Django easier and safer to use in all scenarios.
Even in scenarios with relatively small amounts of data (sub-100k rows), Django can already kill a small VPS easily. Having some (optional) safeties in place doesn't hurt and can help a lot of people.

It's not that large amounts of data can't be dealt with, but I've seen many real-life projects and Django apps where people didn't take into consideration that some users have more than 100 rows. With 1M+ rows you can kill a lot, but many things slow down or die with even 10k rows.
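For concreteness, here is roughly what the seek approach from the query plans above looks like through the ORM. This is only a sketch: the Entity model, its field, and the page size are hypothetical stand-ins for a table like entity_entity, and it uses nothing beyond ordinary queryset filtering and slicing.

    # Hypothetical model standing in for the entity_entity table above;
    # "id" is the indexed integer primary key Django adds by default.
    from django.db import models

    class Entity(models.Model):
        name = models.TextField()

    PAGE_SIZE = 10

    # OFFSET pagination: ORDER BY id LIMIT 10 OFFSET 1000000.
    # The index scan still walks past the first ~1,000,010 rows,
    # which is the ~8 second plan above.
    offset_page = Entity.objects.order_by('id')[1000000:1000000 + PAGE_SIZE]

    # Seek (keyset) pagination: WHERE id > %s ORDER BY id LIMIT 10.
    # The index jumps straight to the first row of the page, which is the
    # ~0.15 ms plan above, no matter how deep you paginate.
    def next_page(last_seen_id):
        return Entity.objects.filter(id__gt=last_seen_id).order_by('id')[:PAGE_SIZE]

The catch, as the quoted reply points out, is that seeking from the last seen id only gives you next/previous navigation; it can't jump to an arbitrary page number the way the Paginator's page links do, which is why it can't simply replace OFFSET as the default.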