Slow Query
Hi all,

We've been struggling with a slow query, and it's been getting dramatically worse as rows are added to the relevant tables. It seems like a fairly common workflow, so we suspect we're overlooking a useful index or a better way to write the query.

I've linked a document compiling the information recommended in the PostgreSQL "Slow Query Questions" guidelines (we've enabled commenting): https://docs.google.com/document/d/10qO5jkQNVtKw2Af1gcKAKiNw7tYFNQruzOQrUYXd4hk/edit?usp=sharing

Here's a high-level summary of the issue:

We're trying to show a list of active conversations. Each conversation (called a spool in the database) has multiple threads, somewhat like Slack channels, and the messages are stored in the threads. We want to return the 30 most recent conversations, where recency is determined by the most recent message in any thread of the conversation that you are a participant of (you may not be a participant of every thread in a conversation, so it's important that those threads don't leak sensitive data).

We found that as the number of threads increases, the query slows down dramatically. We think the problem is that there is no cheap way to get from a thread you participate in to its most recent message, but the issue may be elsewhere. The linked document contains the full query, a simplified query isolating where we think the problem is, and the EXPLAIN (ANALYZE, BUFFERS) output. We figure this is not an uncommon use case, so it's likely that we are overlooking some missing indexes or a better way to write the query.

We'd really appreciate any help and advice!

Best,
Parth
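P.S. For quick reference, the core tables involved look roughly like this (heavily simplified; the exact types, full definitions, row counts, and settings are in the linked document):

    CREATE TABLE spool       (id uuid PRIMARY KEY, name text);                    -- a conversation
    CREATE TABLE thread      (id uuid PRIMARY KEY, spool uuid REFERENCES spool (id), name text);
    CREATE TABLE message     (id uuid PRIMARY KEY, thread uuid REFERENCES thread (id),
                              author uuid, message text, "time" timestamp, draft boolean);
    CREATE TABLE participant (thread uuid REFERENCES thread (id), identity uuid); -- who can see a thread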
Re: Slow Query
Is there no index on thread.spool? What about notification.user? How about message.time (without thread as a leading column)? Those would all seem very significant. Your row counts are very low to have a query perform so badly. work_mem could probably be increased above 4MB, but it isn't hurting this query in particular.

My primary concern is that the query is rather chaotic at a glance. It would be great to re-write it and remove the unneeded keywords, double quotes, worthless parentheses, etc. Something like the below may help you see the crux of the query, what could be done, and how many rows might be coming out of those subqueries. I re-ordered some joins and there might be syntax errors (note that "case" has to stay quoted because it is a reserved word), but give it a shot once you've added the indexes suggested above.

SELECT spool.id, handle.handle, spool.name, thread.id, "case".closed,
       notification.read, notification2.time, message.message, message.time,
       message.author, thread.name, location.geo
  FROM spool
  JOIN handle ON handle.id = spool.id
  JOIN thread ON thread.spool = spool.id
  JOIN message ON message.thread = thread.id
  LEFT JOIN location ON location.id = spool.location
  LEFT JOIN "case" ON "case".id = spool.id
  LEFT JOIN notification
         ON notification.user = 'b16690e4-a3c5-4868-945e-c2458c27a525'
        AND notification.id =
            ( SELECT notification3.id
                FROM notification AS notification3
                JOIN notification_thread ON notification_thread.id = notification3.id
                JOIN thread AS thread2 ON thread2.id = notification_thread.thread
               WHERE thread2.spool = spool.id
                 AND notification3.user = 'b16690e4-a3c5-4868-945e-c2458c27a525'
                 AND notification3.time <= '2020-09-30 16:32:38.054558'
               ORDER BY notification3.time DESC
               LIMIT 1 )
  LEFT JOIN notification AS notification2
         ON notification2.user = 'b16690e4-a3c5-4868-945e-c2458c27a525'
        AND notification2.id =
            ( SELECT notification3.id
                FROM notification AS notification3
                JOIN notification_thread ON notification_thread.id = notification3.id
                JOIN thread AS thread2 ON thread2.id = notification_thread.thread
               WHERE thread2.spool = spool.id
                 AND notification3.user = 'b16690e4-a3c5-4868-945e-c2458c27a525'
                 AND notification3.time > '2020-09-30 16:32:38.054558'
               ORDER BY notification3.time DESC
               LIMIT 1 )
 WHERE message.time =
       ( SELECT max(message2.time)
           FROM message AS message2
           JOIN thread AS thread2 ON thread2.id = message2.thread
           JOIN participant ON participant.thread = thread2.id
           JOIN identity ON identity.id = participant.identity
           LEFT JOIN relation ON relation.to = identity.id
                             AND relation.from = 'b16690e4-a3c5-4868-945e-c2458c27a525'
                             AND relation.manages = TRUE
          WHERE NOT message2.draft
            AND ( identity.id = 'b16690e4-a3c5-4868-945e-c2458c27a525'
                  OR relation.to IS NOT NULL )
            AND thread2.spool = spool.id
          LIMIT 1 )
   AND notification.id IS NOT NULL
 ORDER BY message.time DESC
 LIMIT 31;
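For reference, if those indexes really are missing, creating them would look roughly like this (the index names are just placeholders):

    CREATE INDEX thread_spool_idx      ON thread (spool);
    CREATE INDEX notification_user_idx ON notification ("user");  -- "user" is a reserved word, so it needs quoting here
    CREATE INDEX message_time_idx      ON message (time);         -- time alone, without thread as a leading column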
Re: Slow Query
Based on the execution plan, it looks like the part that takes 13 seconds of the total 14.4 seconds is just calculating the max time used in the WHERE clause. Any time I see an OR involved in a plan that has gone off the rails, I always check whether re-writing the query some other way might be faster. How does the plan look for something like this?

WHERE message.time = greatest( sub1.time, sub2.time )

/* sub1.time */
( SELECT max(message2.time)
    FROM message AS message2
    JOIN thread AS thread2 ON thread2.id = message2.thread
    JOIN participant ON participant.thread = thread2.id
   WHERE NOT message2.draft
     AND participant.identity = 'b16690e4-a3c5-4868-945e-c2458c27a525'
     AND thread2.spool = spool.id )

/* sub2.time */
( SELECT max(message2.time)
    FROM message AS message2
    JOIN thread AS thread2 ON thread2.id = message2.thread
    JOIN participant ON participant.thread = thread2.id
    JOIN relation ON relation.to = participant.identity
                 AND relation.from = 'b16690e4-a3c5-4868-945e-c2458c27a525'
                 AND relation.manages = TRUE
   WHERE NOT message2.draft
     AND thread2.spool = spool.id )
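If the planner still has to visit a lot of message rows per spool after that, a partial index along these lines might also be worth a try (just a sketch, assuming draft is a boolean column):

    CREATE INDEX message_thread_time_not_draft_idx
        ON message (thread, time DESC)
     WHERE NOT draft;

That would let each per-thread max(time) over non-draft messages be answered largely from the index.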
Poor Performance running Django unit tests after upgrading from 10.6
We have a large Django application running against a PostgreSQL database. The test suite for the application runs in GitLab CI, using Docker Compose to run the unit tests inside the application container against a PostgreSQL database running in another container.

When Django runs the unit tests for the application, it starts by creating a new test database and then runs the database migrations to create all the tables and the necessary reference data. Then, for each test, it opens a transaction, runs the test, and rolls the transaction back. This ensures that the database is "clean" before each test run and reduces the risk that data created by one test will cause a different test to fail. Consequently, almost all the tables in the test database have zero or very few rows in them. It also means that the statistics for the tables in the test database are pretty meaningless: the statistics, if they exist, will probably say there are zero rows, while the query will actually be dealing with 0-10 rows that are visible in the transaction but will be rolled back.

When we run the test suite using PostgreSQL 10.7 in a Docker container we consistently get:

Ran 1166 tests in 1291.855s

When we first started running the same test suite against PostgreSQL 12.4 we got:

Ran 1166 tests in 8502.030s

I think that this reduction in performance is caused by the lack of accurate statistics, because we had a similar problem (a large reduction in performance) in a load test, which we cured by running ANALYZE after creating the test data and before running the load test. The load test uses the same Django application code, but creates a "large amount" of test data (in the 100s to 1000s of rows per table - it is looking for N+1 query problems rather than absolute performance).

We have since managed to get the performance of the test run using 12.4 back to approximately the normal range by customizing the PostgreSQL parameters (see the sketch below). `seq_page_cost=0.1` and `random_page_cost=0.11` seem to be key, but we are also setting `shared_buffers`, etc. and all the other typical parameters. With PostgreSQL 10.7 we weren't setting anything and performance was fine using just the defaults, given the tiny data volumes.

However, even though we now have similar performance for most 12.4 test runs, it remains very variable. About 30% of the time we get something like:

Ran 1166 tests in 5362.690s

We also see similar performance reductions and inconsistent results with 11.9, so whatever change is causing the problem was likely introduced in 11 rather than in 12.

I think we have narrowed the problem down to a single, very complex materialized view using CTEs; the unit tests create the test data and then refresh the materialized view before executing the actual test code.
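For reference, the parameter overrides mentioned above look roughly like the following (the shared_buffers value here is just illustrative; in practice we set these on the database container):

    ALTER SYSTEM SET seq_page_cost = 0.1;
    ALTER SYSTEM SET random_page_cost = 0.11;
    ALTER SYSTEM SET shared_buffers = '512MB';   -- illustrative value; only takes effect after a restart
    SELECT pg_reload_conf();                     -- picks up the cost settings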
Database logging using auto_explain shows things like:

db_1 | 2020-10-14 10:27:59.692 UTC [255] LOG:  duration: 4134.625 ms  plan:
db_1 |   Query Text: REFRESH MATERIALIZED VIEW price_marketpricefacts_materialized
db_1 |   Merge Join  (cost=14141048331504.30..9635143213364288.00 rows=116618175994107184 width=3302) (actual time=4134.245..4134.403 rows=36 loops=1)

For comparison, the equivalent query on 10.7 has:

db_1 | 2020-10-15 03:28:58.382 UTC [163] LOG:  duration: 10.500 ms  plan:
db_1 |   Query Text: REFRESH MATERIALIZED VIEW price_marketpricefacts_materialized
db_1 |   Hash Left Join  (cost=467650.55..508612.80 rows=199494 width=3302) (actual time=10.281..10.341 rows=40 loops=1)

The staggering cost estimate implies that the statistics are badly wrong, but given how few rows are in the result (36, and it's not an aggregate) I would expect the query to be fast regardless of what the plan is. On 10.7 the materialized view refreshes in 150 ms or so.

I also don't understand why the performance is so inconsistent across test runs for 12.4 but not for 10.7. It is as though sometimes it gets a good plan and sometimes it doesn't.

I can get performance almost identical to 10.7 by altering the unit tests so that each test that refreshes the materialized view before executing the query runs `ANALYZE;` prior to refreshing the view.

Is it worth us trying to debug the plan for situations with low row counts and poor statistics? Or is this use case not really covered? The general advice is obviously to make sure that statistics are up to date before troubleshooting performance problems. On the other hand, it is not easy for us to make sure that we run ANALYZE inside the transaction in each unit test, and it also seems a bit wasteful.

Opinions and advice gratefully received.

Roger
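P.S. For concreteness, with the workaround in place each affected test effectively runs something like this inside its transaction:

    ANALYZE;   -- could probably be limited to the tables feeding the view to keep it cheaper
    REFRESH MATERIALIZED VIEW price_marketpricefacts_materialized;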
Re: Poor Performance running Django unit tests after upgrading from 10.6
On Thu, 2020-10-15 at 01:21 -0400, Roger Hunwicks wrote:
> We have a large Django application running against a Postgresql database.
>
> When we run the test suite using Postgresql 10.7 in a Docker container
> we consistently get:
>
> Ran 1166 tests in 1291.855s
>
> When we first started running the same test suite against Postgresql
> 12.4 we got:
>
> Ran 1166 tests in 8502.030s
>
> I think that this reduction in performance is caused by the lack of
> accurate statistics [...]
>
> We have since managed to get the performance of the test run using
> 12.4 back to approximately the normal range by customizing the
> Postgresql parameters. `seq_page_cost=0.1` and `random_page_cost=0.11`
> seem to be key, but we are also setting `shared_buffers`, etc. and all
> the other typical parameters. With Postgresql 10.7 we weren't setting
> anything and performance was fine using just the defaults, given the
> tiny data volumes.
>
> However, even though we have similar performance for 12.4 for most
> test runs, it remains very variable. About 30% of the time we get
> something like:
>
> I think we have narrowed down the problem to a single, very complex,
> materialized view using CTEs; the unit tests create the test data and
> then refresh the materialized view before executing the actual test
> code.
>
> Database logging using autoexplain shows things like:
>
> db_1 | 2020-10-14 10:27:59.692 UTC [255] LOG:  duration: 4134.625 ms  plan:
> db_1 |   Query Text: REFRESH MATERIALIZED VIEW price_marketpricefacts_materialized
> db_1 |   Merge Join  (cost=14141048331504.30..9635143213364288.00 rows=116618175994107184 width=3302) (actual time=4134.245..4134.403 rows=36 loops=1)
>
> For comparison, the equivalent query on 10.7 has:
>
> db_1 | 2020-10-15 03:28:58.382 UTC [163] LOG:  duration: 10.500 ms  plan:
> db_1 |   Query Text: REFRESH MATERIALIZED VIEW price_marketpricefacts_materialized
> db_1 |   Hash Left Join  (cost=467650.55..508612.80 rows=199494 width=3302) (actual time=10.281..10.341 rows=40 loops=1)
>
> I can get performance almost identical to 10.7 by altering the unit
> tests so that in each test that refreshes the materialized view prior
> to executing the query, we execute `ANALYZE;` prior to refreshing the
> view.
>
> Is it worth us trying to debug the plan for situations with low row
> counts and poor statistics? Or is this use case not really covered:
> the general advice is obviously to make sure that statistics are up to
> date before troubleshooting performance problems. On the other hand,
> it is not easy for us to make sure that we run analyze inside the
> transaction in each unit test; it also seems a bit wasteful.
>
> Opinions and advice gratefully received.

Yes, the query plan for the query that defines the materialized view is the interesting data point. Run an EXPLAIN (ANALYZE, BUFFERS) on that query.

If your statistics are off because the data have just been imported a second ago, run an explicit ANALYZE on the affected tables after the import.

If your statistics are off because they are not calculated often enough, consider lowering "autovacuum_analyze_scale_factor".

Yours,
Laurenz Albe
--
Cybertec | https://www.cybertec-postgresql.com
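For example, the two suggestions might look roughly like this (a sketch only; big_source_table is a placeholder for whichever table feeds the view):

    -- 1. Retrieve the view's defining SELECT, then run EXPLAIN (ANALYZE, BUFFERS) on that SELECT:
    SELECT pg_get_viewdef('price_marketpricefacts_materialized'::regclass, true);

    -- 2. Make autoanalyze kick in sooner on a table whose statistics go stale too quickly:
    ALTER TABLE big_source_table
        SET (autovacuum_analyze_scale_factor = 0.02, autovacuum_analyze_threshold = 50);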
