Slow Query

2020-10-14 Thread Parth Shah
Hi all,

We've been struggling with a slow query, and it has been getting
dramatically worse as rows are added to the relevant tables. It seems
like a fairly common workflow, so we suspect we're overlooking a
possible index or a better way to write the query.

I've linked a document compiling the information as per the Postgresql
recommendation for Slow Query Questions. Here's the link:
https://docs.google.com/document/d/10qO5jkQNVtKw2Af1gcKAKiNw7tYFNQruzOQrUYXd4hk/edit?usp=sharing
(we've enabled commenting)

Here's a high-level summary of the issue:
__

We’re trying to show a list of active conversations. Each conversation
(called a spool in the database) has multiple threads, somewhat like
Slack channels, and the messages are stored per thread. We want to
return the 30 most recent conversations, with recency determined by the
most recent message in any thread of the conversation that you are a
participant of (you may not be a participant of every thread in a
conversation, so it’s important those threads don’t leak sensitive
data).

We found that as the number of threads increases, the query slows down
dramatically. We think the issue is that there is no easy way to go
from a thread you are a participant of to its most recent message, but
it is possible the problem lies elsewhere. We’ve provided the full
query and a simplified query isolating where we think the issue is,
along with the EXPLAIN (ANALYZE, BUFFERS) output. We figure this is not
exactly an uncommon use case, so it’s likely that we are overlooking
some missing indexes or a better way to write the query. We appreciate
the help and any advice!

__

We'd really appreciate any help and advice!

Best,
Parth


Re: Slow Query

2020-10-14 Thread Michael Lewis
Is there no index on thread.spool? What about notification.user? How
about message.time (without thread as a leading column)? Those would
all seem very significant. Your row counts are very low for a query to
perform so badly. work_mem could probably be increased above 4MB, but
it isn't hurting this query in particular.
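
If they are missing, roughly the following would be the first thing to
try (names are guessed from your query; adjust to your actual schema):

CREATE INDEX ON thread (spool);
CREATE INDEX ON notification ("user");
CREATE INDEX ON message (time);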

My primary concern is that the query is rather chaotic at a glance. It
would be worth rewriting it to remove the unneeded keywords, double
quotes, pointless parentheses, and so on. Something like the version
below may help you see the crux of the query, what could be done, and
how many rows might be coming out of those subqueries. I re-ordered
some joins and there might be syntax errors, but give it a shot once
you've added the indexes suggested above.

SELECT
    spool.id,
    handle.handle,
    spool.name,
    thread.id,
    "case".closed,
    notification.read,
    notification2.time,
    message.message,
    message.time,
    message.author,
    thread.name,
    location.geo
FROM spool
JOIN handle ON handle.id = spool.id
JOIN thread ON thread.spool = spool.id
JOIN message ON message.thread = thread.id
LEFT JOIN location ON location.id = spool.location
LEFT JOIN "case" ON "case".id = spool.id
LEFT JOIN notification
    ON notification.user = 'b16690e4-a3c5-4868-945e-c2458c27a525'
    AND notification.id = (
        SELECT notification3.id
        FROM notification AS notification3
        JOIN notification_thread ON notification_thread.id = notification3.id
        JOIN thread AS thread2 ON thread2.id = notification_thread.thread
        WHERE thread2.spool = spool.id
          AND notification3.user = 'b16690e4-a3c5-4868-945e-c2458c27a525'
          AND notification3.time <= '2020-09-30 16:32:38.054558'
        ORDER BY notification3.time DESC
        LIMIT 1
    )
LEFT JOIN notification AS notification2
    ON notification2.user = 'b16690e4-a3c5-4868-945e-c2458c27a525'
    AND notification2.id = (
        SELECT notification3.id
        FROM notification AS notification3
        JOIN notification_thread ON notification_thread.id = notification3.id
        JOIN thread AS thread2 ON thread2.id = notification_thread.thread
        WHERE thread2.spool = spool.id
          AND notification3.user = 'b16690e4-a3c5-4868-945e-c2458c27a525'
          AND notification3.time > '2020-09-30 16:32:38.054558'
        ORDER BY notification3.time DESC
        LIMIT 1
    )
WHERE message.time = (
        SELECT MAX(message2.time)
        FROM message AS message2
        JOIN thread AS thread2 ON thread2.id = message2.thread
        JOIN participant ON participant.thread = thread2.id
        JOIN identity ON identity.id = participant.identity
        LEFT JOIN relation
            ON relation.to = identity.id
            AND relation.from = 'b16690e4-a3c5-4868-945e-c2458c27a525'
            AND relation.manages = TRUE
        WHERE NOT message2.draft
          AND (identity.id = 'b16690e4-a3c5-4868-945e-c2458c27a525'
               OR relation.to IS NOT NULL)
          AND thread2.spool = spool.id
        LIMIT 1
    )
    AND notification.id IS NOT NULL
ORDER BY message.time DESC
LIMIT 31;


Re: Slow Query

2020-10-14 Thread Michael Lewis
Based on the execution plan, it looks like the part that takes 13 seconds
of the total 14.4 seconds is just calculating the max time used in the
WHERE clause. Any time I see an OR involved in a plan that has gone off
the rails, I always check whether re-writing the query some other way may
be faster. How does the plan look for something like this?


WHERE message.time = greatest( sub1.time, sub2.time )

/* sub1.time */
(
select
MAX ( message2.time )
FROM
message AS message2
JOIN thread AS thread2 ON thread2.id = message2.thread
JOIN participant ON participant.thread = thread2.id
WHERE
NOT message2.draft
AND participant.identity = 'b16690e4-a3c5-4868-945e-c2458c27a525'
AND thread2.spool = spool.id
)

/* sub2.time */
(
select
MAX ( message2.time )
FROM
message AS message2
JOIN thread AS thread2 ON thread2.id = message2.thread
JOIN participant ON participant.thread = thread2.id
JOIN relation ON relation.to = participant.identity
AND relation.from = 'b16690e4-a3c5-4868-945e-c2458c27a525'
AND relation.manages = TRUE
WHERE
NOT message2.draft
AND thread2.spool = spool.id
)
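
Spelled out, with the two subqueries inlined into greatest() and the rest
of the query left as before (untested), that WHERE clause would read:

WHERE message.time = greatest(
    (
        SELECT MAX(message2.time)
        FROM message AS message2
        JOIN thread AS thread2 ON thread2.id = message2.thread
        JOIN participant ON participant.thread = thread2.id
        WHERE NOT message2.draft
          AND participant.identity = 'b16690e4-a3c5-4868-945e-c2458c27a525'
          AND thread2.spool = spool.id
    ),
    (
        SELECT MAX(message2.time)
        FROM message AS message2
        JOIN thread AS thread2 ON thread2.id = message2.thread
        JOIN participant ON participant.thread = thread2.id
        JOIN relation ON relation.to = participant.identity
            AND relation.from = 'b16690e4-a3c5-4868-945e-c2458c27a525'
            AND relation.manages = TRUE
        WHERE NOT message2.draft
          AND thread2.spool = spool.id
    )
)

greatest() ignores NULL arguments, so it still works when only one of the
two subqueries finds a message.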



Poor Performance running Django unit tests after upgrading from 10.6

2020-10-14 Thread Roger Hunwicks
We have a large Django application running against a Postgresql database.

The test suite for the application runs in GitLab-CI using Docker
Compose to run the unit tests inside the application container against
a Postgresql database running in another container.

When Django runs the unit tests for the application, it starts by
creating a new test database and then it runs the database migrations
to create all the tables and the necessary reference data. Then for
each test it opens a transaction, runs the test, and then rolls
back the transaction. This ensures that the database is "clean" before
each test run and reduces the risk that data created by one test will
cause a different test to fail. Consequently this means that almost
all the tables in the test database have zero or very few rows in
them. It also means that the statistics for the tables in the test
database are pretty meaningless. The statistics, if they exist, will
probably say there are zero rows, and the query will actually be
dealing with 0 - 10 rows that are visible in the transaction, but
which will be rolled back.
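
Schematically, each test amounts to:

BEGIN;
-- the test's own inserts and queries run here, against mostly empty tables
ROLLBACK;  -- everything the test created is thrown away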

When we run the test suite using Postgresql 10.7 in a Docker container
we consistently get:

Ran 1166 tests in 1291.855s

When we first started running the same test suite against Postgresql
12.4 we got:

Ran 1166 tests in 8502.030s

I think that this reduction in performance is caused by the lack of
accurate statistics, because we had a similar problem (a large
reduction in performance) in a load test, which we cured by running
ANALYZE after creating the test data and before running the load test.
The load test uses the same Django application code, but creates a
"large amount" of test data (in the 100s to 1000s of rows per table -
it is looking for N+1 query problems rather than absolute performance).

We have since managed to get the performance of the test run using
12.4 back to approximately the normal range by customizing the
Postgresql parameters. `seq_page_cost=0.1` and `random_page_cost=0.11`
seem to be key, but we are also setting `shared_buffers`, etc. and all
the other typical parameters. With Postgresql 10.7 we weren't setting
anything and performance was fine using just the defaults, given the
tiny data volumes.
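
For reference, the two key settings expressed as SQL (they could equally
be passed as -c flags on the postgres container command):

ALTER SYSTEM SET seq_page_cost = 0.1;
ALTER SYSTEM SET random_page_cost = 0.11;
SELECT pg_reload_conf();  -- both settings take effect without a restart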

However, even though we have similar performance for 12.4 for most
test runs, it remains very variable. About 30% of the time we get
something like:

Ran 1166 tests in 5362.690s.

We also see similar performance reductions and inconsistent results
with 11.9, so whatever change is causing the problem was likely
introduced in 11 rather than in 12.

I think we have narrowed down the problem to a single, very complex,
materialized view using CTEs; the unit tests create the test data and
then refresh the materialized view before executing the actual test
code.

Database logging using auto_explain shows things like:

db_1 | 2020-10-14 10:27:59.692 UTC [255] LOG:  duration:
4134.625 ms  plan:
db_1 | Query Text: REFRESH MATERIALIZED VIEW
price_marketpricefacts_materialized
db_1 | Merge Join
(cost=14141048331504.30..9635143213364288.00 rows=116618175994107184
width=3302) (actual time=4134.245..4134.403 rows=36 loops=1)

For comparison, the equivalent query on 10.7 has:

db_1 | 2020-10-15 03:28:58.382 UTC [163] LOG:  duration:
10.500 ms  plan:
db_1 | Query Text: REFRESH MATERIALIZED VIEW
price_marketpricefacts_materialized
db_1 | Hash Left Join  (cost=467650.55..508612.80
rows=199494 width=3302) (actual time=10.281..10.341 rows=40 loops=1)
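
For context, an auto_explain setup roughly like the following produces
this kind of output (our exact values may differ):

# postgresql.conf, or -c flags on the container command
session_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = 0   # log every statement's plan
auto_explain.log_analyze = on       # needed for the "actual time" figures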

The staggering cost implies that the statistics are badly wrong, but
given how few rows are in the result (36, and it's not an aggregate) I
would expect the query to be fast regardless of what the plan is. In
10.7 the materialized view refreshes in about 150 ms.

I also don't understand why the performance would be so inconsistent
across test runs for 12.4 but not for 10.7. It is as though sometimes
it gets a good plan and sometimes it doesn't.

I can get performance almost identical to 10.7 by altering the unit
tests so that in each test that refreshes the materialized view prior
to executing the query, we execute `ANALYZE;` prior to refreshing the
view.
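
i.e. each affected test effectively runs:

ANALYZE;  -- statistics now reflect the rows created inside the test's transaction
REFRESH MATERIALIZED VIEW price_marketpricefacts_materialized;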

Is it worth us trying to debug the plan for situations with low row
counts and poor statistics? Or is this use case not really covered:
the general advice is obviously to make sure that statistics are up to
date before troubleshooting performance problems. On the other hand,
it is not easy for us to make sure that we run analyze inside the
transaction in each unit test; it also seems a bit wasteful.

Opinions and advice gratefully received.

Roger




Re: Poor Performance running Django unit tests after upgrading from 10.6

2020-10-14 Thread Laurenz Albe
On Thu, 2020-10-15 at 01:21 -0400, Roger Hunwicks wrote:
> We have a large Django application running against a Postgresql database.
> 
> When we run the test suite using Postgresql 10.7 in a Docker container
> we consistently get:
> 
> Ran 1166 tests in 1291.855s
> 
> When we first started running the same test suite against Postgresql
> 12.4 we got:
> 
> Ran 1166 tests in 8502.030s
> 
> I think that this reduction in performance is caused by the lack of
> accurate statistics [...]
> 
> We have since managed to get the performance of the test run using
> 12.4 back to approximately the normal range by customizing the
> Postgresql parameters. `seq_page_cost=0.1` and `random_page_cost=0.11`
> seem to be key, but we are also setting `shared_buffers`, etc. and all
> the other typical parameters. With Postgresql 10.7 we weren't setting
> anything and performance was fine using just the defaults, given the
> tiny data volumes.
> 
> However, even though we have similar performance for 12.4 for most
> test runs, it remains very variable. About 30% of the time we get
> something like:
> 
> I think we have narrowed down the problem to a single, very complex,
> materialized view using CTEs; the unit tests create the test data and
> then refresh the materialized view before executing the actual test
> code.
> 
> Database logging using auto_explain shows things like:
> 
> db_1 | 2020-10-14 10:27:59.692 UTC [255] LOG:  duration:
> 4134.625 ms  plan:
> db_1 | Query Text: REFRESH MATERIALIZED VIEW
> price_marketpricefacts_materialized
> db_1 | Merge Join
> (cost=14141048331504.30..9635143213364288.00 rows=116618175994107184
> width=3302) (actual time=4134.245..4134.403 rows=36 loops=1)
> 
> For comparison, the equivalent query on 10.7 has:
> 
> db_1 | 2020-10-15 03:28:58.382 UTC [163] LOG:  duration:
> 10.500 ms  plan:
> db_1 | Query Text: REFRESH MATERIALIZED VIEW
> price_marketpricefacts_materialized
> db_1 | Hash Left Join  (cost=467650.55..508612.80
> rows=199494 width=3302) (actual time=10.281..10.341 rows=40 loops=1)
> 
> I can get performance almost identical to 10.7 by altering the unit
> tests so that in each test that refreshes the materialized view prior
> to executing the query, we execute `ANALYZE;` prior to refreshing the
> view.
> 
> Is it worth us trying to debug the plan for situations with low row
> counts and poor statistics? Or is this use case not really covered:
> the general advice is obviously to make sure that statistics are up to
> date before troubleshooting performance problems. On the other hand,
> it is not easy for us to make sure that we run analyze inside the
> transaction in each unit test; it also seems a bit wasteful.
> 
> Opinions and advice gratefully received.

Yes, the query plan for the query that defines the materialized view
is the interesting data point.  Run an EXPLAIN (ANALYZE, BUFFERS) on
that query.

If your statistics are off because the data have just been imported a
second ago, run an explicit ANALYZE on the affected tables after import.

If your statistics are off because they are not calculated often enough,
consider lowering "autovacuum_analyze_scale_factor".
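
For example (the table name and the value are only placeholders):

ANALYZE some_source_table;  -- explicitly, right after importing the data
ALTER TABLE some_source_table SET (autovacuum_analyze_scale_factor = 0.02);
-- or globally: ALTER SYSTEM SET autovacuum_analyze_scale_factor = 0.02;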

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com