Troubleshooting a long running delete statement

2021-10-06 Thread Dirschel, Steve
Question:

How would one troubleshoot in Postgres why the delete was running so long?  My 
background is Oracle, where there are various statistics I might look at:
•   One could estimate the number of logical reads the delete should do 
based on expected number of rows to delete, expected logical reads against the 
table per row, expected logical reads against each index per row.
•   One could look in V$SQL and see how many logical reads the query was 
actually doing.
•   One could look at V$SESS_IO and see how many logical reads the session 
was doing.

In this case you would see the query was doing way more logical reads than 
expected and then try to think of scenarios that would cause that.

Here is what I could see in Postgres:
•   When I did an explain on the delete I could see it was full scanning 
the table. I did a full scan of the table interactively in less than 1 second, 
so the long runtime was not due to the full table scan.
•   I could not find the query in pg_stat_statements to see how many shared 
block reads/hits the query was doing to see if the numbers were extremely high. 
Based on the documentation, queries do not show up in pg_stat_statements until 
after they complete.
•   pg_stat_activity showed wait_event_type and wait_event were null for 
the session every time I looked (see the query sketch after this list), so the 
session was continually using CPU.
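
For reference, the check against pg_stat_activity was roughly along these lines 
(a sketch; the pid filter is just a placeholder for whatever identifies the session):

-- Watch the session running the delete from another connection
-- (12345 is a placeholder pid).
SELECT pid, state, wait_event_type, wait_event,
       now() - query_start AS runtime,
       query
FROM pg_stat_activity
WHERE pid = 12345;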

I started looking at table definitions (indexes, FKs, etc.) and comparing them to 
Oracle and noticed some indexes were missing.  I then could see the table being 
deleted from was a child table with a FK pointing to a parent table.  Finally I 
was able to see that the parent table was missing an index on the FK column, so 
for every row being deleted from the child it was full scanning the parent.  
It all makes sense after the fact, but I'm looking for a more methodical way to 
come to that conclusion by looking at database statistics.

Are there other statistics in Postgres I may have looked at to methodically 
come to the conclusion that the problem was the missing index on the parent FK 
column?

Thanks



Re: Troubleshooting a long running delete statement

2021-10-06 Thread Mladen Gogala

On 10/6/21 14:00, Dirschel, Steve wrote:


Are there other statistics in Postgres I may have looked at to 
methodically come to the conclusion that the problem was the missing 
index on the parent FK column?




Hi Steve,

First, check whether you have any triggers on the table. The best way of 
doing that is to use information_schema.triggers. I have seen triggers 
introduce some "mysterious" functionality in Oracle as well. Second, 
check constraints. Is the table you're deleting from the parent table of 
any foreign key constraints? If the constraints are defined with "ON 
DELETE CASCADE", you may be deleting more than you think. If they are not 
defined with "ON DELETE CASCADE" or "ON DELETE SET NULL", you would get 
an error. If that passes muster, then check the processes doing the 
most IO using "iotop" or "atop". I like the latter. You can then 
check what the busy processes are doing using "strace -e trace=file" and, 
for good measure, "perf top".
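
For example, something along these lines would list the triggers on the table 
and the foreign keys referencing it (a sketch; 'my_table' is a placeholder):

-- Triggers defined on the table being deleted from.
SELECT trigger_name, action_timing, event_manipulation, action_statement
FROM information_schema.triggers
WHERE event_object_table = 'my_table';

-- Foreign keys that reference the table, with their ON DELETE action
-- (confdeltype: 'c' = cascade, 'n' = set null, 'a' = no action, 'r' = restrict).
SELECT conname,
       conrelid::regclass AS referencing_table,
       confdeltype
FROM pg_catalog.pg_constraint
WHERE contype = 'f'
  AND confrelid = 'my_table'::regclass;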


Regards

--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com


Re: Troubleshooting a long running delete statement

2021-10-06 Thread Justin Pryzby
On Wed, Oct 06, 2021 at 06:00:07PM +, Dirschel, Steve wrote:
> •   When I did an explain on the delete I could see it was full scanning 
> the table. I did a full scan of the table interactively in less than 1 second 
> so the long runtime was not due to the full tablescan.

> I started looking at table definitions (indexes, FK's, etc.) and comparing to 
> Oracle and noticed some indexes missing.  I then could see the table being 
> deleted from was a child table with a FK pointing to a parent table.  Finally 
> I was able to see that the parent table was missing an index on the FK column 
> so for every row being deleted from the child it was full scanning the 
> parent.  All makes sense after the fact but I'm looking for a more methodical 
> way to come to that conclusion by looking at database statistics.
> 
> Are there other statistics in Postgres I may have looked at to methodically 
> come to the conclusion that the problem was the missing index on the parent 
> FK column?

I think explain (analyze on) would've helped you.

If I understand your scenario, it'd look like this:

|postgres=# explain (analyze) delete from t;
| Delete on t  (cost=0.00..145.00 rows=1 width=6) (actual time=10.124..10.136 rows=0 loops=1)
|   ->  Seq Scan on t  (cost=0.00..145.00 rows=1 width=6) (actual time=0.141..2.578 rows=1 loops=1)
| Planning Time: 0.484 ms
| Trigger for constraint u_i_fkey: time=4075.123 calls=1
| Execution Time: 4087.764 ms

You can see the query plan used for the FK trigger with auto_explain.

postgres=*# SET auto_explain.log_min_duration='0s'; SET client_min_messages=debug; SET auto_explain.log_nested_statements=on;
postgres=*# explain (analyze) delete from t;
|...
|Query Text: DELETE FROM ONLY "public"."u" WHERE $1 OPERATOR(pg_catalog.=) "i"
|Delete on u  (cost=0.00..214.00 rows=1 width=6) (actual rows=0 loops=1)
|  Buffers: shared hit=90
|  ->  Seq Scan on u  (cost=0.00..214.00 rows=1 width=6) (actual rows=1 loops=1)
|Filter: ($1 = i)
|Rows Removed by Filter: 8616
|Buffers: shared hit=89
|...
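
If auto_explain is not already preloaded on the server, the session setup would 
look roughly like this first (a sketch; LOAD of auto_explain normally needs a 
superuser unless the library is preloaded via shared_preload_libraries):

LOAD 'auto_explain';                          -- skip if preloaded on the server
SET auto_explain.log_min_duration = '0s';     -- log every statement's plan
SET auto_explain.log_analyze = on;            -- include actual row counts
SET auto_explain.log_buffers = on;            -- include shared hit/read counts
SET auto_explain.log_nested_statements = on;  -- include the FK trigger's query
SET client_min_messages = debug;              -- show the LOG lines in this session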




Re: Troubleshooting a long running delete statement

2021-10-06 Thread Laurenz Albe
On Wed, 2021-10-06 at 18:00 +, Dirschel, Steve wrote:
> Are there other statistics in Postgres I may have looked at to methodically 
> come to the conclusion that the problem was the missing index on the parent 
> FK column?

You could use the query from my article to find the missing indexes:
https://www.cybertec-postgresql.com/en/index-your-foreign-key/
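
A simplified sketch of that kind of check (not the exact query from the 
article), flagging foreign keys whose first column is not the leading column 
of any index on the referencing table:

SELECT c.conrelid::regclass  AS referencing_table,
       c.conname             AS fk_name,
       c.confrelid::regclass AS referenced_table
FROM pg_catalog.pg_constraint c
WHERE c.contype = 'f'
  AND NOT EXISTS (
        SELECT 1
        FROM pg_catalog.pg_index i
        WHERE i.indrelid = c.conrelid
          -- indkey is 0-based, conkey is 1-based; this only checks the
          -- FK's first column against each index's leading column
          AND i.indkey[0] = c.conkey[1]
      );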

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com





RE: [EXT] Re: Troubleshooting a long running delete statement

2021-10-06 Thread Dirschel, Steve





Thanks for the reply and I hope I’m replying to this e-mail correctly at the 
bottom of the chain.  We are running on AWS Aurora Postgres.  I assume strace 
-e isn’t an option given we don’t have access to the server, or are you aware of 
a method I could still do that without server access?



Regards

Steve






RE: [EXT] Re: Troubleshooting a long running delete statement

2021-10-06 Thread Dirschel, Steve



Thanks for the reply and the info above.  My question was more directed at how 
you can troubleshoot the active session running the query.  In the examples 
above you are actually executing the query.  If an application is executing the 
query I can't go in and re-execute it.  I also found by trial and error that if 
I do execute it interactively and CTRL-C out of it, a message is returned which 
gives hints about where to look next:

ERROR:  canceling statement due to user request
CONTEXT:  SQL statement "SELECT 1 FROM ONLY "cf4"."category_page" x WHERE 
$1::pg_catalog.text OPERATOR(pg_catalog.=) "prism_guid"::pg_catalog.text FOR 
KEY SHARE OF x"

That query above is hitting the parent table and my delete is against the child, 
so that info would be very helpful, but it was only available because I ran it 
interactively.  I’m mostly interested in what stats/info is available while it’s 
running to troubleshoot it.

Regards
Steve



Re: [EXT] Re: Troubleshooting a long running delete statement

2021-10-06 Thread Mladen Gogala


On 10/6/21 16:26, Dirschel, Steve wrote:
Thanks for the reply and I hope I’m replying to this e-mail correctly 
at the bottom of the chain.



Hey, it's not me, it's rules and regulations. And that's incredibly 
important on this group, or so I was led to believe :)


We are running on AWS aurora postgres.  I assume strace -e isn’t an 
option given we don’t have access to the server or are you aware of a 
method I could still do that without server access?

Regards
Steve
No, access to the OS is not an option with RDS. However, RDS comes with 
support. You will have to contact support. They may use strace for you.


--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com


Re: [EXT] Re: Troubleshooting a long running delete statement

2021-10-06 Thread Mladen Gogala


On 10/6/21 16:32, Dirschel, Steve wrote:
postgres=# explain (analyze) delete from t;  Delete on t 


I would try explain (analyze, timing, buffers). That would also give you 
the timing of each step so you can figure out which one takes the longest.
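
Something like this, keeping the same toy table name as earlier in the thread 
(note that EXPLAIN ANALYZE actually executes the delete):

EXPLAIN (ANALYZE, TIMING, BUFFERS) DELETE FROM t;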


Regards


--
Mladen Gogala
Database Consultant
Tel: (347) 321-1217
https://dbwhisperer.wordpress.com


Re: Troubleshooting a long running delete statement

2021-10-06 Thread Michael Lewis
On Wed, Oct 6, 2021 at 12:00 PM Dirschel, Steve <
[email protected]> wrote:

> Here is what I could see in Postgres:
>
>- When I did an explain on the delete I could see it was full scanning
>the table. I did a full scan of the table interactively in less than 1
>second so the long runtime was not due to the full tablescan.
>
>
If finding the rows is fast, but actually deleting them is slow and perhaps
won't even finish, I would strongly consider adding a where clause such
that a small fraction of the deletes would be done (perhaps in a
transaction that gets rolled back) and do the explain (analyze, buffers) on
that modified command. Yes, the planner may decide to use an index to find
which rows to delete, but if finding the rows was already fast and it is
the non-obvious work that we want to profile, then it should be fine to do
1% of the deletes and see how it performs and where the time goes.
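
A sketch of that approach, with hypothetical table and column names:

BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
DELETE FROM child_table        -- hypothetical table
WHERE some_key < 1000;         -- limit to a small fraction of the rows
ROLLBACK;                      -- nothing is actually deleted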



Re: [EXT] Re: Troubleshooting a long running delete statement

2021-10-06 Thread Maciek Sakrejda
On Wed, Oct 6, 2021 at 1:32 PM Dirschel, Steve <
[email protected]> wrote:

> Thanks for the reply and the info above.  My question was more directed at
> how can you troubleshoot the active session running the query.  In the
> examples above you are actually executing the query.  If an application is
> executing the query I can't go in an re-execute it.
>

This won't be immediately useful, but there's been a patch proposed for
Postgres 15 to allow logging the plan of a running query [1]. Progress
seems to have stalled a bit, but it seems like there was a fair amount of
interest, so I wouldn't count it out yet. If you have thoughts on the
proposed functionality, I'm sure thoughtful feedback would be appreciated.

Thanks,
Maciek

[1]:
https://www.postgresql.org/message-id/flat/cf8501bcd95ba4d727cbba886ba9eea8%40oss.nttdata.com

