Bad planner performance for empty tables when using JIT
Hello everyone,

I'm currently in the process of upgrading our PostgreSQL installations from 9.6 to 13, and I am experiencing very slow query performance for empty tables. Our test environments get built from scratch every run, and thus contain a lot of empty tables at first. We hit the issue discussed and resolved in [1]. Problem is, this fix is not included in PG13... but JIT is :)

Some of our queries join a lot of tables, so the planner thinks this will result in very high costs, and the JIT feature kicks in. This gives bad performance. Executing the query is very fast of course; it's just the planning stage that is time consuming. We do not see this behavior in PG9.6, as there is no JIT. EXPLAIN gives the same high costs on PG9.6 and PG13, but on PG9.6 the planning time is low enough.

Now, how can I circumvent this?

- I could wait for PG14. I verified that PG14 solves my case and queries are fast. But this would take at least until the third quarter of 2021, so the website tells me.
- I could ask for a back-port of [2]. The commit is API-breaking; is this even a feasible option?
- I could turn off JIT in the server config, but I'd like to use the JIT feature where it's appropriate!
- I could turn off JIT using a GUC prior to query execution (a concrete sketch is in the P.S. below). But then I would need to detect the cases where I need to do this, i.e. cases where I query a table that has no tuples. It would be very cumbersome for me to write this in my application, and I feel this is more the responsibility of the database than of my program.
- Another way is to fake the number of pages: set relpages to 1 everywhere it is 0. I verified that this produces fast query performance. But fiddling with pg_class does not "feel" right... Is this okay to use in a test setup (and maybe even in a production scenario)? Could I do the following query just after our database is initialized?

update pg_catalog.pg_class c
set relpages = 1
where c.relpages = 0
and c.relnamespace = (select ns.oid from pg_catalog.pg_namespace ns where ns.nspname = current_schema);

Could someone give me advice on what would be the best strategy?

With kind regards,

Jurrie

[1] https://postgr.es/m/f02298e0-6ef4-49a1-bcb6-c484794d9...@thebuild.com
[2] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3d351d916b20534f973eda760cde17d96545d4c4
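P.S. To make the GUC option concrete, this is the kind of thing I have in mind (a minimal sketch; jit and jit_above_cost are the standard JIT-related settings, but the threshold value and the database name below are just placeholders, not tuned numbers):

-- per session, right before the affected queries:
SET jit = off;

-- or keep JIT enabled but raise the cost threshold at which it kicks in:
SET jit_above_cost = 1000000;

-- or switch it off for the freshly built test database only:
ALTER DATABASE my_test_db SET jit = off;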
Re: Query a column with the same name as an operator
On 06-11-2020 10:08, Magnus Hagander wrote:
> On Fri, Nov 6, 2020 at 10:00 AM Java Developer wrote:
>> Hello,
>>
>> I am trying to query a column from a table I migrated from my MYSQL into POSTGRESQL, but I seem to be having issues with a few column names. The column name cast is also an operator; I think zone may also be a problem.
>>
>> MYSQL: OK
>> SELECT id, start_date, local_time, country, city, region, temperature,
>>        cast, humidity, wind, weather, zone
>> FROM w_records
>> WHERE city = 'Edinburgh' AND start_date LIKE '%2020-11-01%'
>> ORDER BY id DESC;
>>
>> I can run SELECT * FROM w_records WHERE city = 'Edinburgh', but the above does not work. Any idea how I can run a query with a column name that is already an operator?
>
> Yes, cast is a keyword in SQL (not an operator). To use it as a column name you have to quote it, like SELECT "cast" FROM test (and the same when you create the table, or indeed any references to the column).
>
> zone is not, and should be fine.

There is a list of keywords available in the documentation here:
https://www.postgresql.org/docs/current/sql-keywords-appendix.html

Maybe it's of help to you.

Jurrie
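P.S. Applied to your full query, the quoting would look something like this (a sketch; I haven't run it against your schema, and depending on how start_date was migrated you may additionally need start_date::text for the LIKE, which is a separate issue from the keyword):

SELECT id, start_date, local_time, country, city, region, temperature,
       "cast", humidity, wind, weather, zone
FROM w_records
WHERE city = 'Edinburgh'
  AND start_date LIKE '%2020-11-01%'
ORDER BY id DESC;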
How to debug a connection that's "active" but hanging?
Hi everyone,

We are in the process of upgrading from PostgreSQL 9.6 to 13. When our database gets created in our regression tests, we run some unit tests first. We see one of those tests hang. It seems the client is waiting on more data to arrive from the PostgreSQL server. A thread dump shows it waiting on the socket.

On the server, I see the connection from the client. Looking at pg_stat_activity, I see it is in state 'active'. I have seen query_start be as old as eight hours ago. The documentation states that 'active' means a query is being executed. The query in question is:

select distinct(constraint_name) CONSTRAINT_NAME
from information_schema.key_column_usage
where constraint_name in (
    select rc.constraint_name
    from information_schema.referential_constraints rc
    inner join information_schema.table_constraints tc on tc.constraint_name = rc.constraint_name
    inner join information_schema.table_constraints tc2 on tc2.constraint_name = rc.unique_constraint_name
    where tc2.constraint_type = 'PRIMARY KEY')
and lower(column_name) like 'xxx_%';

This query should be fast to execute. When I run it myself using 'explain analyze', I get results like "(cost=4892.35..4892.35 rows=1 width=64) (actual time=1669.432..1669.447 rows=0 loops=1)".

Looking at pg_locks, I only see locks with granted = true for the PID.

There is nothing in the logs as far as I can see. Configuration variables log_min_messages and log_min_error_statement are on 'debug1'. This is a snippet of the logs:

2021-07-09 20:35:16.374 CEST [30399] STATEMENT: START TRANSACTION
2021-07-09 20:35:18.703 CEST [30399] WARNING: there is already a transaction in progress
2021-07-09 20:35:18.703 CEST [30399] STATEMENT: START TRANSACTION
2021-07-09 20:35:26.398 CEST [30977] DEBUG: rehashing catalog cache id 7 for pg_attribute; 257 tups, 128 buckets
2021-07-09 20:35:26.398 CEST [30977] STATEMENT: select distinct(constraint_name) CONSTRAINT_NAME from information_schema.key_column_usage where constraint_name in (select rc.constraint_name from information_schema.referential_constraints rc inner join information_schema.table_constraints tc on tc.constraint_name = rc.constraint_name inner join information_schema.table_constraints tc2 on tc2.constraint_name = rc.unique_constraint_name where tc2.constraint_type = 'PRIMARY KEY') and lower(column_name) like 'xxx_%'
2021-07-09 20:35:26.399 CEST [30977] DEBUG: rehashing catalog cache id 7 for pg_attribute; 513 tups, 256 buckets
2021-07-09 20:35:26.399 CEST [30977] STATEMENT: select distinct(constraint_name) CONSTRAINT_NAME from information_schema.key_column_usage where constraint_name in (select rc.constraint_name from information_schema.referential_constraints rc inner join information_schema.table_constraints tc on tc.constraint_name = rc.constraint_name inner join information_schema.table_constraints tc2 on tc2.constraint_name = rc.unique_constraint_name where tc2.constraint_type = 'PRIMARY KEY') and lower(column_name) like 'xxx_%'
2021-07-09 20:35:33.170 CEST [31010] DEBUG: autovacuum: processing database "wildfly"
2021-07-09 20:35:33.520 CEST [31010] DEBUG: rehashing catalog cache id 12 for pg_cast; 513 tups, 256 buckets
2021-07-09 20:35:33.520 CEST [31010] DEBUG: rehashing catalog cache id 14 for pg_opclass; 17 tups, 8 buckets
2021-07-09 20:35:33.648 CEST [31010] DEBUG: rehashing catalog cache id 7 for pg_attribute; 257 tups, 128 buckets
2021-07-09 20:36:03.182 CEST [31199] DEBUG: autovacuum: processing database "postgres"
2021-07-09 20:36:33.200 CEST [31387] DEBUG: autovacuum: processing database "wildfly"

I am a bit out of ideas - does anyone have any tips where I should look to see what is causing the query to hang?

With kind regards,

Jurrie
Re: How to debug a connection that's "active" but hanging?
On 10-07-2021 10:26, Vijaykumar Jain wrote:
> On Sat, 10 Jul 2021 at 00:29, Jurrie Overgoor <postgresql-mailingl...@jurr.org> wrote:
>> Hi everyone,
>>
>> We are in the process of upgrading from PostgreSQL 9.6 to 13. When our database gets created in our regression tests, we run some unit tests first. We see one of those tests hang. It seems the client is waiting on more data to arrive from the PostgreSQL server. A thread dump shows it waiting on the socket. On the server, I see the connection from the client. Looking at pg_stat_activity, I see it is in state 'active'. I have seen query_start be as old as eight hours ago. The documentation states that 'active' means a query is being executed. The query in question is:
>>
>> Looking at pg_locks, I only see locks with granted = true for the PID.
>
> Is this reproducible, I mean this happens multiple times?

Hi Vijaykumar, thanks for replying. Yes, this is reproducible. About 50% of the time, my connection is hanging. It's always on the same query, which I shared in the previous post. These are unit tests that are executed just after the database is created from scratch using Liquibase.

> Can you please run the below query in a separate session and share the result, feel free to anonymize sensitive stuff.
>
> SELECT db.datname, locktype, relation::regclass, mode,
>        transactionid AS tid, virtualtransaction AS vtid, pid, granted
> FROM pg_catalog.pg_locks l
> LEFT JOIN pg_catalog.pg_database db ON db.oid = l.database
> WHERE NOT pid = pg_backend_pid();

Sure! This is all test data; there is not much that needs to be anonymized :) Here it is:

|datname|locktype  |relation                                      |mode           |tid|vtid   |pid  |granted|
|-------|----------|----------------------------------------------|---------------|---|-------|-----|-------|
|wildfly|relation  |pg_constraint_contypid_index                  |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_constraint_conrelid_contypid_conname_index |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_constraint_conname_nsp_index               |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_constraint_conparentid_index               |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_namespace_oid_index                        |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_namespace_nspname_index                    |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_attribute_relid_attnum_index               |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_attribute_relid_attnam_index               |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_depend                                     |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_constraint                                 |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_class                                      |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_namespace                                  |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_attribute                                  |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |information_schema.table_constraints          |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |information_schema.referential_constraints    |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |information_schema.key_column_usage           |AccessShareLock|   |4/46389|22928|true   |
|       |virtualxid|                                              |ExclusiveLock  |   |4/46389|22928|true   |
|wildfly|relation  |pg_depend_depender_index                      |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_class_oid_index                            |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_constraint_oid_index                       |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_class_relname_nsp_index                    |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_depend_reference_index                     |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_class_tblspc_relfilenode_index             |AccessShareLock|   |4/46389|22928|true   |

Indeed, this time it's PID 22928 that's hanging.

>> There is nothing in the logs as far as I can see. Configuration variables log_min_messages and log_min_error_statement are on 'debug1'. This is a snippet of the logs:
>>
>> 2021-07-09 20:35:16.374 CEST [30399] STATEMENT: START TRANSACTION
>> 2021-07-09 20:35:18.703 CEST [30399] WARNING: there is already a transaction in progress
>
> Are you having multiple BEGIN tx not matching commit/rollbacks? Not sure it is related, but fixing this will help reduce noise.

Actually, I'm not quite sure where those messages come from. They of course indicate starting a transaction that's already in place. But I'd have to dig into the framework we use to know where these come from. My gut feeling is that this is not part of the problem, rather just noise.

>> I am a bit out of ideas - does anyone have any tips where I should look to see what is causing the query to hang?
>
> Although I am not an expe
Re: How to debug a connection that's "active" but hanging?
On 12-07-2021 20:56, Vijaykumar Jain wrote:
> On Mon, 12 Jul 2021 at 23:16, Tom Lane <t...@sss.pgh.pa.us> wrote:
>> The backtraces you captured look like the query is not "hung", it's just computing away.
>
> He mentioned earlier that the query was hung as 'active' for 8 hours and on. In case this is due to a bad plan, @Jurrie Overgoor <postgresql-mailingl...@jurr.org> is it also possible for you to manually run `vacuumdb -a -v` from the terminal, each time before you run your test suite, for some runs? Do you still get the same issue? I have a feeling repeated runs may have caused a lot of bloat on some tables which might not have been reclaimed by autovacuum runs.

I configured Jenkins to run that command prior to executing the tests. I got 5 successful runs, no hanging queries. Then I reverted and ran again. The first and second run were ok; the third run hung again. So your hunch might be right.

On 12-07-2021 19:46, Tom Lane wrote:
> You might need to investigate by altering your application to capture "EXPLAIN ..." output just before the troublesome query, so you can see if it gets a different plan in the slow cases.

Then I tried this. The query plans are indeed not consistent. Most of the time, the first line of the query plan is:

Unique (cost=4892.35..4892.35 rows=1 width=64) [1]

I have seen other costs: 5818.30, 6350.85 and 6514.73. They all complete correctly. [2], [3], [4]

The plan that leaves the query hanging in the 'active' state starts with:

Unique (cost=241.81..241.82 rows=1 width=64) [5]

That's clearly much lower than the rest. So I suspect the planner is making a 'wrong' guess there, causing a bad plan and a long execution time. For reference, the executed query is [6].

Now, where to go from here? Is this considered a bug in PostgreSQL, or am I misusing the database engine by doing DROP DATABASE and CREATE DATABASE over and over again? I must say that I never saw this behavior on PostgreSQL 9.6, so in that regard it might be considered a bug!?

What can I do to get to the bottom of this? Should I export the content of some metadata tables prior to executing the hanging query (a sketch of what I mean is in the P.S. below)? Should I `vacuumdb -a -v` prior to logging the EXPLAIN for the hanging query?

With kind regards,

Jurrie

[1] https://jurr.org/PostgreSQL_13_hanging_query/normal.txt
[2] https://jurr.org/PostgreSQL_13_hanging_query/alt1.txt
[3] https://jurr.org/PostgreSQL_13_hanging_query/alt3.txt
[4] https://jurr.org/PostgreSQL_13_hanging_query/alt2.txt
[5] https://jurr.org/PostgreSQL_13_hanging_query/hang.txt
[6] https://jurr.org/PostgreSQL_13_hanging_query/query.txt
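P.S. By "export the content of some metadata tables" I mean something along these lines (a sketch; it captures the planner-relevant page/tuple counts and the vacuum/analyze timestamps for the system catalogs behind the information_schema views that the query uses):

SELECT c.relname, c.relpages, c.reltuples,
       s.n_live_tup, s.n_dead_tup,
       s.last_vacuum, s.last_autovacuum,
       s.last_analyze, s.last_autoanalyze
FROM pg_class c
JOIN pg_stat_all_tables s ON s.relid = c.oid
WHERE c.relname IN ('pg_class', 'pg_attribute', 'pg_constraint',
                    'pg_depend', 'pg_namespace');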
Re: How to debug a connection that's "active" but hanging?
On 12-07-2021 23:21, Tom Lane wrote:
> Jurrie Overgoor writes:
>> Is this considered a bug in PostgreSQL, or am I misusing the database engine by doing DROP DATABASE and CREATE DATABASE over and over again?
>
> It's not a bug. I suppose in a perfect world the stats would automatically be up to date all the time, but in the real world it seems like the cost of that would be exorbitant.

I see - that makes sense. To fix this issue, I made sure a `VACUUM` is done prior to running the unit tests. It's run as the database owner, so not all relations are vacuumed. But it seems to vacuum the necessary relations, so no more hanging queries.

Thanks Tom and Vijaykumar for your support!

With kind regards,

Jurrie
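P.S. For anyone who finds this thread later: the pre-test step is nothing more elaborate than running a plain

VACUUM;

as the database owner, or from the shell something like `vacuumdb --dbname=wildfly --username=<owner role>` (a sketch; the database name is the one from my logs, the role name is a placeholder for your setup). Relations the owner has no permission on are simply not vacuumed, which matches what I wrote above.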