2014-11-20 8:30 GMT+01:00 Schmitt, Christian <c.schm...@briefdomain.de>:
>> Nope. A large OFFSET of N will read through N rows, regardless of index
>> coverage. See http://www.postgresql.org/docs/9.1/static/queries-limit.html
>
> That's simply not true. If you define an ORDER BY on a well-indexed
> query, the database will only do a bitmap scan. That wiki page isn't
> well explained. Take this instead:
> https://wiki.postgresql.org/images/3/35/Pagination_Done_the_PostgreSQL_Way.pdf

You are right: it is not a table scan, but the point is that OFFSET has
*linear complexity* even with the ORDER BY and the proper index. The "seek"
approach (with a PK > seek_point, as I was suggesting) is much better.

>> The problem lies in the django paginator, which builds OFFSET queries
>> that are inefficient by design. There are alternatives, such as ordering
>> by primary key and doing SELECT * FROM auth_user WHERE id >
>> latest_id_returned. That "latest_id_returned" has to be passed back and
>> forth to the client to keep the pagination happening. Too bad you can't
>> do "jump to page 15" like the django admin does.
>
> Paginator needs to use OFFSET, which is slower than your seek method, but
> SEEK has certain limitations, as you already said. Index and ORDER BY to
> the rescue, which makes OFFSET as fast as possible. With an index, ORDER
> BY, and maybe some read slaves, or even if you shard your data, the
> result is as good as the SEEK method.

Putting sharding into the mix further complicates this discussion, but even
in a sharded environment, linear complexity within a single shard is to be
avoided. Read the relevant thread I posted before.
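To make the comparison concrete, here is a minimal sketch of the seek method described above. It uses sqlite3 so it is self-contained; the table name mirrors Django's auth_user, but the schema here is illustrative. With the Django ORM the same page query would be roughly User.objects.filter(id__gt=last_id).order_by('id')[:page_size].

```python
import sqlite3

def seek_page(conn, last_id, page_size):
    """Fetch the page after `last_id` using the primary-key index.

    Unlike OFFSET, the database can jump straight to id > last_id via
    the index, so the cost does not grow with how deep the page is.
    """
    cur = conn.execute(
        "SELECT id, username FROM auth_user WHERE id > ? ORDER BY id LIMIT ?",
        (last_id, page_size),
    )
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE auth_user (id INTEGER PRIMARY KEY, username TEXT)")
conn.executemany(
    "INSERT INTO auth_user (username) VALUES (?)",
    [("user%d" % i,) for i in range(100)],
)

# The last id of each page is the "latest_id_returned" token that has to
# be passed back and forth to the client.
page1 = seek_page(conn, last_id=0, page_size=10)
page2 = seek_page(conn, last_id=page1[-1][0], page_size=10)
```

As noted in the thread, the trade-off is that an opaque "after this id" token cannot express "jump to page 15" the way an OFFSET-based paginator can.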
>> There are huge memory allocation issues even if you just iterate over
>> the first two items of a query that potentially returns a million rows:
>> all the rows will be fetched and stored in memory by psycopg and the
>> django ORM.
>
> That's only happening because django will build something like
> SELECT "bla", "bla", "bla" FROM table LIMIT 10 OFFSET 100;
> There is no ORDER BY clause, which is of course really slow, even with
> only 1000 entries inside a table.

No, it has nothing to do with the ORDER BY. It's just how psycopg works:
the first time you fetch from a cursor, say you fetch the first row, it
automatically fetches all the results from the server. The only solution is
using a server-side cursor, which apparently Oracle uses by default.

>> ...except when the cache is bigger than your entire VM. In this case, if
>> your dataset is, say, 10 TB, you will need roughly 10 TB of memory and
>> 10 TB of data transferred between your django app and your postgres
>> server. It is not a matter of optimization: it is mere survival.
>
> Okay, just as I said: if you deal with that amount of data you certainly
> will use read slaves and/or partition your data.

Not always true: a single machine can happily handle a few terabytes. The
point here is that client-side cursors will eat up a lot of memory in the
*application tier* if you happen to do a simple:

    for user in User.objects.all().iterator():
        break

Even if you only have a few thousand users, you will see a big memory and
network spike.

>> Let's first understand what's needed, then we can decide if it has a
>> story inside the django core.
>
> The only thing that could be improved is that we should force an ORDER BY
> id clause here and there. The index on the automatic id is already there.

That only solves the issue in the best-case scenario: when you fetch the
first pages. The last-page scenario, even with ORDER BY and an index, still
has O(N) complexity. That's written clearly in the article you posted.
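For reference, a minimal sketch of the streaming pattern that avoids materializing the whole result set in the application tier. It uses sqlite3 so the example runs anywhere (sqlite's cursors are already incremental); with psycopg2, as discussed above, this only limits client memory when combined with a server-side cursor, which psycopg2 creates when you pass a name, e.g. conn.cursor(name="page_cursor"). The helper name and chunk size are illustrative.

```python
import sqlite3

def stream_rows(conn, query, params=(), chunk_size=1000):
    """Yield rows in fixed-size chunks instead of fetching them all at once.

    Note: with psycopg2's default client-side cursor, the entire result
    set is still pulled from the server on the first fetch; a named
    (server-side) cursor is required for this pattern to save memory.
    """
    cur = conn.execute(query, params)  # sqlite3 shortcut; DB-API uses conn.cursor()
    while True:
        rows = cur.fetchmany(chunk_size)
        if not rows:
            break
        for row in rows:
            yield row

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE auth_user (id INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO auth_user (id) VALUES (?)", [(i,) for i in range(10)])

# Taking only the first row no longer implies buffering the whole table.
first = next(stream_rows(conn, "SELECT id FROM auth_user ORDER BY id", chunk_size=3))
```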
> I don't know why we should enforce something just because somebody has a
> huge amount of data and can't deal with it. That wouldn't follow a clean
> design pattern. Most people rely on the "Page to" part of the Paginator;
> if we patch django to use a seek method as the default, it wouldn't be
> that good for a lot of developers. Most people that have a performance
> problem could easily "fix" it themselves or throw more hardware at it.

I am all against enforcing something: I only think it is nice to provide
the option of not killing your application layer. These things are used
routinely by django developers that deal with medium-size applications.
Throwing iron at the problem is sub-optimal when you can just use 10 lines
of code, as I normally do.

> --
> You received this message because you are subscribed to the Google Groups
> "Django developers (Contributions to Django itself)" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to django-developers+unsubscr...@googlegroups.com.
> To post to this group, send email to django-developers@googlegroups.com.
> Visit this group at http://groups.google.com/group/django-developers.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/django-developers/CAPDLAU5HNtE%2BDfSxR7FcgdtAN6U3bSG7AVtGYhT7LLZUdY0Fjw%40mail.gmail.com
> For more options, visit https://groups.google.com/d/optout.

--
To view this discussion on the web visit
https://groups.google.com/d/msgid/django-developers/CAEbA68mDESqNGQYheshYymE_oKyqbh1hheL47s5u9Okxcqvifg%40mail.gmail.com.