Bad planner performance for tables with empty tuples when using JIT

2020-10-27 Thread Jurrie Overgoor

Hello everyone,

I'm currently in the process of upgrading our PostgreSQL installations 
from 9.6 to 13. I am experiencing very slow query performance for empty 
tables.


Our test environments get build from scratch every run, and thus contain 
a lot of empty tables at first. We hit the issue discussed and resolved 
in [1]. Problem is, this fix is not included in PG13... but JIT is :)


Some of our queries join a lot of tables, so the planner thinks this 
will result in very high costs, and the JIT feature kicks in. This gives 
bad performance. Executing the query is very fast of course; it's just 
the planning stage that is time consuming. We do not see this behavior 
in PG9.6, as there is no JIT. EXPLAIN gives the same high costs on PG9.6 
and PG13, but on PG9.6 the planning time is low enough.


Now, how can I circumvent this?

- I could wait for PG14. I verified that PG14 solves my case and queries 
are fast. But this would take at least until the third quarter of 2021, 
so the website tells me.


- I could ask for a back-port of [2]. The commit is API breaking; is 
this even a feasible option?


- I could turn off JIT in the server config, but I'd like to use the JIT 
feature where it's appropriate!


- I could turn off JIT using a GUC prior to query execution. But then I 
would need to detect the cases where I need to do this. (I.e. cases 
where I query a table that has no tuples.) It would be very cumbersome 
for me to write this in my application, and I feel this is more the 
responsibility of the database than of my program.


- Another way is to fake the number of pages; set it to 1 where it is 0 
everywhere. I verified that this produces fast query performance. But 
fiddling with pg_class does not "feel" right... Is this okay to use in a 
test setup (and maybe even in a production scenario)? Could I do the 
following query just after our database is initialized:


update pg_catalog.pg_class c set relpages = 1 where c.relpages = 0 and 
c.relnamespace = (select ns.oid from pg_catalog.pg_namespace ns where 
ns.nspname = current_schema);


Could someone give me advice on what would be the best strategy?

With kind regads,

Jurrie

[1] https://postgr.es/m/f02298e0-6ef4-49a1-bcb6-c484794d9...@thebuild.com
[2] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3d351d916b20534f973eda760cde17d96545d4c4







Re: Query a column with the same name as an operator

2020-11-06 Thread Jurrie Overgoor

On 06-11-2020 10:08, Magnus Hagander wrote:

On Fri, Nov 6, 2020 at 10:00 AM Java Developer  wrote:

Hello,

I am trying to query a column from a table I migrated from my MYSQL
into POSTGRESQL but I seem to be having issues with a few column names.

the column name cast is also an operator, I think zone may also be a
problem.

MYSQL: OK
SELECT id, start_date, local_time, country, city, region, temperature,
cast, humidity, wind, weather, zone FROM w_records WHERE city =
'Edinburgh' AND start_date LIKE '%2020-11-01%' ORDER BY id DESC;

I can run the SELECT * from FROM w_records WHERE city = 'Edinburgh' but
the above does not work.

Any idea how I can run a query that accept table name that is already a
Operator?

Yes, cast is a keyword in SQL (not an operator). To use it as a column
name you have to quote it, like
SELECT "cast" FROM test
(and the same when you create the table, or indeed any references to the column)

zone is not, and should be fine.



There is a list of keywords available in the documentation here: 
https://www.postgresql.org/docs/current/sql-keywords-appendix.html


Maybe it's of help to you.


Jurrie





How to debug a connection that's "active" but hanging?

2021-07-09 Thread Jurrie Overgoor

Hi everyone,

We are in the process of upgrading from PostgreSQL 9.6 to 13. When our 
database gets created in our regression tests, we run some unit tests 
first. We see one of those tests hang.


It seems the client is waiting on more data to arrive from the 
PostgreSQL server. A thread dump shows it waiting on the socket.


On the server, I see the connection from the client. Looking at 
pg_stat_activity, I see it is in state 'active'. I have seen query_start 
be as old as eight hours ago. The documentation states that 'active' 
means a query is being executed. The query in question is:


select distinct(constraint_name) CONSTRAINT_NAME from 
information_schema.key_column_usage where constraint_name in (select 
rc.constraint_name from information_schema.referential_constraints rc 
inner join information_schema.table_constraints tc on tc.constraint_name 
= rc.constraint_name inner join information_schema.table_constraints tc2 
on tc2.constraint_name = rc.unique_constraint_name where 
tc2.constraint_type = 'PRIMARY KEY') and lower(column_name) like 'xxx_%';


This query should be fast to execute. When I run it myself using 
'explain analyze', I get results like "(cost=4892.35..4892.35 rows=1 
width=64) (actual time=1669.432..1669.447 rows=0 loops=1)".


Looking at pg_locks, I only see locks with granted = true for the PID.

There is nothing in the logs as far as I can see. Configuration 
variables log_min_messages and log_min_error_statement are on 'debug1'. 
This is a snippet of the logs:


2021-07-09 20:35:16.374 CEST [30399] STATEMENT:  START TRANSACTION
2021-07-09 20:35:18.703 CEST [30399] WARNING:  there is already a 
transaction in progress

2021-07-09 20:35:18.703 CEST [30399] STATEMENT:  START TRANSACTION
2021-07-09 20:35:26.398 CEST [30977] DEBUG:  rehashing catalog cache id 
7 for pg_attribute; 257 tups, 128 buckets
2021-07-09 20:35:26.398 CEST [30977] STATEMENT:  select 
distinct(constraint_name) CONSTRAINT_NAME from 
information_schema.key_column_usage where constraint_name in (select 
rc.constraint_name from information_schema.referential_constraints
 rc inner join information_schema.table_constraints tc on 
tc.constraint_name = rc.constraint_name inner join 
information_schema.table_constraints tc2 on tc2.constraint_name = 
rc.unique_constraint_name  where tc2.constraint_type = 'PRIMARY

 KEY') and lower(column_name) like 'xxx_%'
2021-07-09 20:35:26.399 CEST [30977] DEBUG:  rehashing catalog cache id 
7 for pg_attribute; 513 tups, 256 buckets
2021-07-09 20:35:26.399 CEST [30977] STATEMENT:  select 
distinct(constraint_name) CONSTRAINT_NAME from 
information_schema.key_column_usage where constraint_name in (select 
rc.constraint_name from information_schema.referential_constraints
 rc inner join information_schema.table_constraints tc on 
tc.constraint_name = rc.constraint_name inner join 
information_schema.table_constraints tc2 on tc2.constraint_name = 
rc.unique_constraint_name  where tc2.constraint_type = 'PRIMARY

 KEY') and lower(column_name) like 'xxx_%'
2021-07-09 20:35:33.170 CEST [31010] DEBUG:  autovacuum: processing 
database "wildfly"
2021-07-09 20:35:33.520 CEST [31010] DEBUG:  rehashing catalog cache id 
12 for pg_cast; 513 tups, 256 buckets
2021-07-09 20:35:33.520 CEST [31010] DEBUG:  rehashing catalog cache id 
14 for pg_opclass; 17 tups, 8 buckets
2021-07-09 20:35:33.648 CEST [31010] DEBUG:  rehashing catalog cache id 
7 for pg_attribute; 257 tups, 128 buckets
2021-07-09 20:36:03.182 CEST [31199] DEBUG:  autovacuum: processing 
database "postgres"
2021-07-09 20:36:33.200 CEST [31387] DEBUG:  autovacuum: processing 
database "wildfly"


I am a bit out of ideas - does anyone have any tips where I should look 
to see what is causing the query to hang?


With kind regards,

Jurrie






Re: How to debug a connection that's "active" but hanging?

2021-07-12 Thread Jurrie Overgoor

On 10-07-2021 10:26, Vijaykumar Jain wrote:
On Sat, 10 Jul 2021 at 00:29, Jurrie Overgoor 
<mailto:postgresql-mailingl...@jurr.org>> wrote:


Hi everyone,

We are in the process of upgrading from PostgreSQL 9.6 to 13. When
our
database gets created in our regression tests, we run some unit tests
first. We see one of those tests hang.

It seems the client is waiting on more data to arrive from the
PostgreSQL server. A thread dump shows it waiting on the socket.

On the server, I see the connection from the client. Looking at
pg_stat_activity, I see it is in state 'active'. I have seen
query_start
be as old as eight hours ago. The documentation states that 'active'
means a query is being executed. The query in question is:

Looking at pg_locks, I only see locks with granted = true for the PID.


Is this reproducible, I mean this happens multiple times?



Hi Vijaykumar, thanks for replying. Yes, this is reproducible. About 50% 
of the times, my connection is hanging. It's always on the same query, 
which I shared in the previous post. These are unit tests that are 
executed just after the database is created from scratch using Liquibase.



can you please run the below query in a separate session and share the 
result, feel free to anonymize sensitive stuff.


SELECT db.datname, locktype, relation::regclass, mode, transactionid
AS tid, virtualtransaction AS vtid, pid, granted
FROM pg_catalog.pg_locks l
LEFT JOIN pg_catalog.pg_database db ON db.oid = l.database
WHERE NOT pid = pg_backend_pid();



Sure! This is all testdata; there is not much that needs to be 
anonymized :) Here it is:



|datname|locktype |relation |mode 
|tid|vtid   |pid  |granted|

|---|--|-|---|---|---|-|---|
|wildfly|relation  |pg_constraint_contypid_index |AccessShareLock|   
|4/46389|22928|true   |
|wildfly|relation 
|pg_constraint_conrelid_contypid_conname_index|AccessShareLock| 
|4/46389|22928|true   |
|wildfly|relation  |pg_constraint_conname_nsp_index |AccessShareLock|   
|4/46389|22928|true   |
|wildfly|relation  |pg_constraint_conparentid_index |AccessShareLock|   
|4/46389|22928|true   |
|wildfly|relation  |pg_namespace_oid_index |AccessShareLock|   
|4/46389|22928|true   |
|wildfly|relation  |pg_namespace_nspname_index |AccessShareLock|   
|4/46389|22928|true   |
|wildfly|relation  |pg_attribute_relid_attnum_index |AccessShareLock|   
|4/46389|22928|true   |
|wildfly|relation  |pg_attribute_relid_attnam_index |AccessShareLock|   
|4/46389|22928|true   |

|wildfly|relation  |pg_depend |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_constraint |AccessShareLock|   
|4/46389|22928|true   |

|wildfly|relation  |pg_class |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_namespace |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |pg_attribute |AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |information_schema.table_constraints 
|AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |information_schema.referential_constraints 
|AccessShareLock|   |4/46389|22928|true   |
|wildfly|relation  |information_schema.key_column_usage 
|AccessShareLock|   |4/46389|22928|true   |

|   |virtualxid| |ExclusiveLock  |   |4/46389|22928|true   |
|wildfly|relation  |pg_depend_depender_index |AccessShareLock|   
|4/46389|22928|true   |
|wildfly|relation  |pg_class_oid_index |AccessShareLock|   
|4/46389|22928|true   |
|wildfly|relation  |pg_constraint_oid_index |AccessShareLock|   
|4/46389|22928|true   |
|wildfly|relation  |pg_class_relname_nsp_index |AccessShareLock|   
|4/46389|22928|true   |
|wildfly|relation  |pg_depend_reference_index |AccessShareLock|   
|4/46389|22928|true   |
|wildfly|relation  |pg_class_tblspc_relfilenode_index 
|AccessShareLock|   |4/46389|22928|true   |



Indeed, this time it's PID 22928 that's hanging.



There is nothing in the logs as far as I can see. Configuration
variables log_min_messages and log_min_error_statement are on
'debug1'.
This is a snippet of the logs:

2021-07-09 20:35:16.374 CEST [30399] STATEMENT: START TRANSACTION
2021-07-09 20:35:18.703 CEST [30399] WARNING:  there is already a
transaction in progress

Are you having multiple BEGIN tx not matching commit/rollbacks ? not 
sure related, but fixing this will help reduce noise.



Actually, I'm not quite sure where those messages come from. They of 
course indicate starting a transaction that's already in place. But I'd 
have to dig into the framework we use to know where these come from. My 
gut feeling is that this is not part of the problem, rather just noise.




I am a bit out of ideas - does anyone have any tips where I should
look
to see what is causing the query to hang?


Although I am not an expe

Re: How to debug a connection that's "active" but hanging?

2021-07-12 Thread Jurrie Overgoor

On 12-07-2021 20:56, Vijaykumar Jain wrote:
On Mon, 12 Jul 2021 at 23:16, Tom Lane <mailto:t...@sss.pgh.pa.us>> wrote:



The backtraces you captured look like the query is not "hung", it's
just computing away.


He mentioned earlier that the query was hung as 'active' for 8 hours 
and on.


incase this is due to bad plan,
@Jurrie Overgoor <mailto:postgresql-mailingl...@jurr.org>  is it also 
possible for you to run manually


`vacuumdb  -a -v`  from the terminal, each time before you run your 
test suite for some runs, do you still get the same issue?


I have a feeling repeated runs may have caused a lot of bloat on some 
tables which might have not been reclaimed by autovacuum runs.



I configured Jenkins to run that command prior to executing the tests. I 
got 5 successful runs, no hanging queries. Then I reverted and ran 
again. The first and second run were ok; the third run hung again. So 
your hunch might be right.



On 12-07-2021 19:46, Tom Lane wrote:

You might need
to investigate by altering your application to capture "EXPLAIN ..."
output just before the troublesome query, so you can see if it gets
a different plan in the slow cases. 



Then I tried this. The query plans are indeed not consistent.

Most of the time the first line of the query plan is: Unique  
(cost=4892.35..4892.35 rows=1 width=64) [1]


I have seen other costs: 5818.30, 6350.85 and 6514.73. They all complete 
correctly. [2], [3], [4]


The plan that leaves the query hanging in the 'active' state starts 
with: Unique  (cost=241.81..241.82 rows=1 width=64) [5]


That's clearly much lower than the rest. So I suspect the planner making 
a 'wrong' guess there, causing a bad plan, and a long time to execute. 
For reference, the executed query is [6].


Now, where to go from here? Is this considered a bug in PostgreSQL, or 
am I misusing the database engine by doing DROP DATABASE and CREATE 
DATABASE over and over again? I must say that I never saw this behavior 
on PostgreSQL 9.6, so in that regard it might be considered a bug.?


What can I do to get to the bottom of this? Should I export the content 
of some metadata tables prior to executing the hanging query? Should I 
`vacuumdb -a -v` prior to logging the EXPLAIN for the hanging query?


With kind regards,

Jurrie


[1] https://jurr.org/PostgreSQL_13_hanging_query/normal.txt
[2] https://jurr.org/PostgreSQL_13_hanging_query/alt1.txt
[3] https://jurr.org/PostgreSQL_13_hanging_query/alt3.txt
[4] https://jurr.org/PostgreSQL_13_hanging_query/alt2.txt
[5] https://jurr.org/PostgreSQL_13_hanging_query/hang.txt
[6] https://jurr.org/PostgreSQL_13_hanging_query/query.txt




Re: How to debug a connection that's "active" but hanging?

2021-07-15 Thread Jurrie Overgoor

On 12-07-2021 23:21, Tom Lane wrote:

Jurrie Overgoor  writes:

Is this considered a bug in PostgreSQL, or
am I misusing the database engine by doing DROP DATABASE and CREATE
DATABASE over and over again?

It's not a bug.  I suppose in a perfect world the stats would
automatically be up to date all the time, but in the real world
it seems like the cost of that would be exorbitant.



I see - that makes sense.

To fix this issue, I made sure a `VACUUM` is done prior to running the 
unit tests. It's run as the database owner, so not all relations are 
vacuumed. But it seems to vacuum the necessary relations, so no more 
hanging queries.


Thanks Tom and Vijaykumar for your support!

With kind regards,

Jurrie