Query hitting empty tables taking 48 minutes
I have a system running FreeBSD 11.1-STABLE, PostgreSQL 9.6.8, Java OpenJDK
1.8.0_131, jdbc 9.3-1104-jdbc41 which is exhibiting very bizarre behavior.
A query is executing against a couple of tables that have 1 and 0 records
in them: ds3.job_entry has 0 rows, ds3.blob has 1 row. If I execute the
query manually via the command line, it executes fine. There are no other
active queries, no locks. The system is idle, in our Dev Test group, so I
can leave it this way for a bit of time. The general software setup is in
production and I've seen nothing like this before. Even a system with 300M
ds3.blob entries executes this query fine.

Jun  7 17:24:21 blackpearl postgres[10670]: [7737-1] db=tapesystem,user=Administrator,app=[unknown],client=127.0.0.1 LOG:  duration: 2903612.206 ms  execute fetch from S_2037436/C_2037437: SELECT * FROM ds3.blob WHERE EXISTS (SELECT * FROM ds3.job_entry WHERE blob_id = ds3.blob.id AND (job_id = $1))

tapesystem=# explain analyze SELECT * FROM ds3.blob WHERE EXISTS (SELECT * FROM ds3.job_entry WHERE blob_id = ds3.blob.id AND (job_id = 'b51357cd-e07a-4c87-a50b-999c347a5c71'));

                                                            QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.84..6.89 rows=1 width=77) (actual time=0.044..0.044 rows=0 loops=1)
   ->  Index Scan using job_entry_job_id_idx on job_entry  (cost=0.42..2.44 rows=1 width=16) (actual time=0.042..0.042 rows=0 loops=1)
         Index Cond: (job_id = 'b51357cd-e07a-4c87-a50b-999c347a5c71'::uuid)
   ->  Index Scan using blob_pkey on blob  (cost=0.42..4.44 rows=1 width=77) (never executed)
         Index Cond: (id = job_entry.blob_id)
 Planning time: 0.388 ms
 Execution time: 0.118 ms

 pid   | client_port | runtime                                              | query_start                | datname    | state  | query                                                                                                           | usename
 10670 | 11211       | 0 years 0 mons 0 days 0 hours 43 mins 28.852273 secs | 2018-06-07 17:24:22.026384 | tapesystem | active | SELECT * FROM ds3.blob WHERE EXISTS (SELECT * FROM ds3.job_entry WHERE blob_id = ds3.blob.id AND (job_id = $1)) | Administrator

From the system with 300M ds3.blob entries, which works fine (along with
every other system in house):

                                                            QUERY PLAN
 Nested Loop  (cost=0.57..738052.90 rows=164379 width=75) (actual time=1.001..1947.029 rows=5 loops=1)
   ->  Seq Scan on job_entry  (cost=0.00..10039.50 rows=164379 width=16) (actual time=0.871..56.442 rows=5 loops=1)
         Filter: (job_id = 'ef27d2fa-2727-424e-8f44-da9e33a5ca05'::uuid)
         Rows Removed by Filter: 60001
   ->  Index Scan using blob_pkey on blob  (cost=0.57..4.42 rows=1 width=75) (actual time=0.037..0.037 rows=1 loops=5)
         Index Cond: (id = job_entry.blob_id)
 Planning time: 6.330 ms
 Execution time: 1951.564 ms

Please keep my CC of my work e-mail present.

Best,
Robert
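P.S. The activity snapshot above came from a monitoring query along these
lines (a sketch against 9.6's pg_stat_activity; the runtime column is
simply now() - query_start, and the exact query I used isn't shown here):

    -- Long-running active statements, longest first
    SELECT pid,
           client_port,
           now() - query_start AS runtime,
           query_start,
           datname,
           state,
           query,
           usename
    FROM pg_stat_activity
    WHERE state = 'active'
    ORDER BY runtime DESC NULLS LAST;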
Re: Query hitting empty tables taking 48 minutes
> On Jun 7, 2018, at 12:40 PM, Adrian Klaver wrote:
>
> On 06/07/2018 11:17 AM, Robert Creager wrote:
>> I have a system running FreeBSD 11.1-STABLE, PostgreSQL 9.6.8, Java
>> OpenJDK 1.8.0_131, jdbc 9.3-1104-jdbc41 which is exhibiting very bizarre
>> behavior. A query is executing against a couple of tables that have 1
>> and 0 records in them: ds3.job_entry has 0 rows, ds3.blob has 1 row. If
>> I execute the query manually via the command line, it executes fine.
>> There are no other active queries, no locks. The system is idle, in our
>> Dev Test group, so I can leave it this way for a bit of time. The
>> general software setup is in production and I've seen nothing like this
>> before. Even a system with 300M ds3.blob entries executes this query
>> fine.
>
> So I am assuming the problem query is being run through Java/jdbc, correct?

Yes.

> There is also the below in the log:
>
> " ... execute fetch from S_2037436/C_2037437 ..."

So, that means nothing to me. Something to you?

> My guess is that we will need to see the Java code that sets up and runs
> the query. Is that possible?

OMG, you had to ask ;-)  Not really, for two reasons. It's an actual
shipping product, and I'd have to send you so much code to get from the
source of the query down through the execute.... Now, the execute, on its
own, is below. m_statement is a java.sql.PreparedStatement. Keep in mind
this code is literally executing millions of times a day on a busy system.
Except for this special snowflake system...

private Executor executeQuery()
{
    final MonitoredWork monitoredWork = new MonitoredWork(
            StackTraceLogging.NONE,
            m_readableSql.getLogMessagePreExecution() );
    try
    {
        m_closeTransactionUponClose = ( null == m_transactionNumber );
        // One row at a time; the driver fetches through a cursor/portal
        m_statement.setFetchSize( 1 );
        final Duration duration = new Duration();
        m_resultSet = m_statement.executeQuery();
        m_readableSql.log( duration, null );
        return this;
    }
    catch ( final SQLException ex )
    {
        throw new DaoException(
                "Failed to execute: " +
                m_readableSql.getLogMessagePreExecution(),
                ex );
    }
    finally
    {
        monitoredWork.completed();
    }
}

>> Jun  7 17:24:21 blackpearl postgres[10670]: [7737-1] db=tapesystem,user=Administrator,app=[unknown],client=127.0.0.1 LOG:  duration: 2903612.206 ms  execute fetch from S_2037436/C_2037437: SELECT * FROM ds3.blob WHERE EXISTS (SELECT * FROM ds3.job_entry WHERE blob_id = ds3.blob.id AND (job_id = $1))
>> [...]
>> Please keep my CC of my work e-mail present.
>> Best,
>> Robert
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
Re: Query hitting empty tables taking 48 minutes
> On Jun 7, 2018, at 1:14 PM, Adrian Klaver wrote:
>
> On 06/07/2018 11:55 AM, Robert Creager wrote:
>>> On Jun 7, 2018, at 12:40 PM, Adrian Klaver wrote:
>>>
>>> On 06/07/2018 11:17 AM, Robert Creager wrote:
>>>> [...]
>>>
>>> So I am assuming the problem query is being run through Java/jdbc,
>>> correct?
>> Yes.
>>>
>>> There is also the below in the log:
>>>
>>> " ... execute fetch from S_2037436/C_2037437 ..."
>> So, that means nothing to me. Something to you?
>
> Just that the query being run from the code is going through a different
> path than the version you ran at the command line. Just trying to get to
> apples to apples if possible.
>
>>> My guess is that we will need to see the Java code that sets up and
>>> runs the query. Is that possible?
>> OMG, you had to ask ;-)  Not really, for two reasons. [...]
>
> I am not a Java programmer (hence the 'we' in my previous post), so
> someone who is will need to comment on the below. Though I have to
> believe the:
>
> m_statement.setFetchSize( 1 );
>
> has got to do with:
>
> " ... execute fetch from S_2037436/C_2037437 ..."
>
> In your OP you said the tables involved have 1 and 0 rows in them.

Yes.

> Is that from count(*) on each table?

Yes.

> Or is it for job_id = 'b51357cd-e07a-4c87-a50b-999c347a5c71'?

There is no job_id entry, that table is empty.

Best,
Robert

>> [executeQuery() code trimmed]
>
> --
> Adrian Klaver
> adrian.kla...@aklaver.com
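P.S. For anyone following along: with setFetchSize(1) the driver runs the
statement as a named prepared statement (S_xxx) with a portal (C_xxx) over
the extended protocol, fetching one row at a time, hence the "execute fetch
from S_2037436/C_2037437" in the log. A rough SQL-level analogue (a sketch
only; the driver does not literally issue these commands) would be:

    BEGIN;
    DECLARE c CURSOR FOR
        SELECT * FROM ds3.blob
        WHERE EXISTS (SELECT * FROM ds3.job_entry
                      WHERE blob_id = ds3.blob.id
                        AND job_id = 'b51357cd-e07a-4c87-a50b-999c347a5c71');
    FETCH 1 FROM c;  -- analogous to the driver's one-row fetches
    COMMIT;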
Re: Query hitting empty tables taking 48 minutes
> On Jun 7, 2018, at 3:34 PM, Tom Lane wrote:
>
> Robert Creager writes:
>> Jun  7 17:24:21 blackpearl postgres[10670]: [7737-1] db=tapesystem,user=Administrator,app=[unknown],client=127.0.0.1 LOG:  duration: 2903612.206 ms  execute fetch from S_2037436/C_2037437: SELECT * FROM ds3.blob WHERE EXISTS (SELECT * FROM ds3.job_entry WHERE blob_id = ds3.blob.id AND (job_id = $1))
>
>> tapesystem=# explain analyze SELECT * FROM ds3.blob WHERE EXISTS (SELECT * FROM ds3.job_entry WHERE blob_id = ds3.blob.id AND (job_id = 'b51357cd-e07a-4c87-a50b-999c347a5c71'));
>> [...]
>
> That's fairly bizarre, but important to notice here is that you don't
> have an apples-to-apples comparison. The query in the log is evidently
> parameterized, whereas your EXPLAIN isn't; it's got a hard-wired constant
> to compare to job_id. I'd suggest trying this in psql:
>
> PREPARE foo(uuid) AS SELECT * FROM ds3.blob WHERE EXISTS (SELECT * FROM
> ds3.job_entry WHERE blob_id = ds3.blob.id AND (job_id = $1));
>
> EXPLAIN ANALYZE EXECUTE foo('b51357cd-e07a-4c87-a50b-999c347a5c71');
>
> Assuming you get a sane plan the first time, repeat the EXPLAIN ANALYZE
> half a dozen times, and note whether the plan changes after six
> executions. (The point here is to see if the plancache will shift to
> a "generic" plan, and if so whether that's worse than a "custom" plan
> for the specific parameter value.)

The plan didn't change after bunches (> 6) of executions.

tapesystem=# EXPLAIN ANALYZE EXECUTE foo('b51357cd-e07a-4c87-a50b-999c347a5c71');

                                                            QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=0.84..6.89 rows=1 width=77) (actual time=0.032..0.032 rows=0 loops=1)
   ->  Index Scan using job_entry_job_id_idx on job_entry  (cost=0.42..2.44 rows=1 width=16) (actual time=0.031..0.031 rows=0 loops=1)
         Index Cond: (job_id = 'b51357cd-e07a-4c87-a50b-999c347a5c71'::uuid)
   ->  Index Scan using blob_pkey on blob  (cost=0.42..4.44 rows=1 width=77) (never executed)
         Index Cond: (id = job_entry.blob_id)
 Execution time: 0.096 ms
(6 rows)

Executing with the job_id shown in the stats of the empty table below
(didn't change after bunches of executions). The job_entry table has very
ephemeral data in general.

tapesystem=# EXPLAIN ANALYZE EXECUTE foo('cc54ca5d-0dca-4b35-acd9-e0fe69c6b247');

                                                            QUERY PLAN
 Hash Join  (cost=9582.63..21191.13 rows=300019 width=77) (actual time=22.679..22.679 rows=0 loops=1)
   Hash Cond: (job_entry.blob_id = blob.id)
   ->  Seq Scan on job_entry  (cost=0.00..7483.24 rows=300019 width=16) (actual time=22.677..22.677 rows=0 loops=1)
         Filter: (job_id = 'cc54ca5d-0dca-4b35-acd9-e0fe69c6b247'::uuid)
   ->  Hash  (cost=5832.28..5832.28 rows=300028 width=77) (never executed)
         ->  Se
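P.S. The hash join estimating 300019 rows on an empty job_entry smells like
stale planner statistics. A quick way to compare the planner's view with
the stats collector's counts (a sketch joining pg_class to
pg_stat_user_tables):

    -- reltuples is what the planner sees; n_live_tup/n_dead_tup come
    -- from the statistics collector
    SELECT c.relname,
           c.reltuples,
           s.n_live_tup,
           s.n_dead_tup,
           s.last_autovacuum,
           s.last_autoanalyze
    FROM pg_class c
    JOIN pg_stat_user_tables s ON s.relid = c.oid
    WHERE c.relname IN ('blob', 'job_entry');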
Re: Query hitting empty tables taking 48 minutes
Re-sending from a subscribed address (sigh).

On Jun 7, 2018, at 4:18 PM, Robert wrote:

> On Jun 7, 2018, at 2:15 PM, Laurenz Albe wrote:
>
> Robert Creager wrote:
>> I have a system running FreeBSD 11.1-STABLE, PostgreSQL 9.6.8, Java
>> OpenJDK 1.8.0_131, jdbc 9.3-1104-jdbc41 which is exhibiting very bizarre
>> behavior. A query is executing against a couple of tables that have 1
>> and 0 records in them: ds3.job_entry has 0 rows, ds3.blob has 1 row.
>> [...]
>
> The JDBC query is using a cursor since you are using setFetchSize(1).
>
> I can see two reasons why the plan might be different:
>
> 1. It is a prepared statement that has been executed often enough
>    for the generic plan to kick in, and the generic plan is bad.

Tom brought that up; see my response to him. It is a bad plan, but on a
table with no entries.

> 2. PostgreSQL chooses a fast startup plan because a cursor is used,
>    and that plan performs much worse.

The parameterized plan Tom had me look at starts with sequential scans,
which would be bad on the job_entry table (topping out at ~9M entries
worst case) and horrendous on the blob table (open-ended size, testing
with 300M entries on one of our systems).

> To see if 1. causes the problem, you could set
> m_statement.setPrepareThreshold(0);
> and see if that makes a difference.

If I can keep this machine and spin some code, I could
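P.S. On point 2: the planner's bias toward fast-startup plans under a
cursor is controlled by cursor_tuple_fraction, so another knob to try (a
sketch; the value is illustrative) would be:

    -- Default is 0.1; 1.0 makes the planner assume the cursor fetches
    -- every row, favoring the cheapest total-runtime plan
    SET cursor_tuple_fraction = 1.0;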
Re: Query hitting empty tables taking 48 minutes
> On Jun 7, 2018, at 4:58 PM, David G. Johnston wrote:
>
> On Thu, Jun 7, 2018 at 3:02 PM, Robert Creager wrote:
> [...]
> job_id | f | 1 | cc54ca5d-0dca-4b35-acd9-e0fe69c6b247
>
> IIUC, the system believes your job_entry table has 300k records, ALL of
> them having the UUID value ending in "*b247" - so it is unsurprising that
> it chooses to sequentially scan job_entry when it's given that ID to
> search for. And if it's given a different ID, it realizes it can
> accurately confirm the absence of the supplied value in the table by
> using the index.

Yes, I agree. And this explain execution time is fine. It's the 'same'
query running in the app for 43 minutes at a shot on the same tables
that's giving me heartburn.

> I would suspect that vacuuming these tables would solve your problem.
> Whether there is an issue beyond a lack of vacuuming, or related to
> auto-vacuum, I am unsure. Though at this point it may take a vacuum full
> to recover back to a sane state. Though ANALYZE by itself should clear
> up the statistical discrepancy.

Auto-vacuum is on, and I was thinking the same re VACUUM fixing it, but it
seems there may be a deeper problem here, hence calling in the people who
know things :-)

autovacuum_vacuum_threshold = 5000      # min number of row updates before vacuum
autovacuum_analyze_threshold = 5000     # min number of row updates before analyze
autovacuum_vacuum_scale_factor = 0.1    # fraction of table size before vacuum
autovacuum_analyze_scale_factor = 0.05  # fraction of table size before analyze

(Auto)Vacuum was run (no analyze) ~4.5 hours ago (if this query is
correct), but maybe another should have been triggered?

SELECT *,
       n_dead_tup > av_threshold AS "av_needed",
       CASE WHEN reltuples > 0
            THEN round(100.0 * n_dead_tup / reltuples)
            ELSE 0
       END AS pct_dead
FROM (SELECT N.nspname,
             C.relname,
             pg_stat_get_tuples_inserted(C.oid) AS n_tup_ins,
             pg_stat_get_tuples_updated(C.oid)  AS n_tup_upd,
             pg_stat_get_tuples_deleted(C.oid)  AS n_tup_del,
             pg_stat_get_live_tuples(C.oid)     AS n_live_tup,
             pg_stat_get_dead_tuples(C.oid)     AS n_dead_tup,
             C.reltuples                        AS reltuples,
             round(current_setting('autovacuum_vacuum_threshold')::INTEGER
                   + current_setting('autovacuum_vacuum_scale_factor')::NUMERIC
                     * C.reltuples)             AS av_threshold,
             date_trunc('minute',
                        greatest(pg_stat_get_last_vacuum_time(C.oid),
                                 pg_stat_get_last_autovacuum_time(C.oid)))
                                                AS last_vacuum,
             date_trunc('minute',
                        greatest(pg_stat_get_last_analyze_time(C.oid),
                                 pg_stat_get_last_autoanalyze_time(C.oid)))
                                                AS last_analyze
      FROM pg_class C
      LEFT JOIN pg_namespace N ON (N.oid = C.relnamespace)
      WHERE C.relkind IN ('r', 't')
        AND N.nspname NOT IN ('pg_catalog', 'information_schema')
        AND N.nspname !~ '^pg_toast') AS av
WHERE reltuples > 0
ORDER BY av_needed DESC, n_dead_tup DESC;

 nspname | relname   | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup | n_dead_tup | reltuples | av_threshold | last_vacuum            | last_analyze | av_needed | pct_dead
 ds3     | job_entry | 303658    | 815       | 303658    | 0          | 300022     | 300022    | 35002        | 2018-06-07 23:09:00.00 | NULL         | true      | 100

> But, I may be missing something, my experience and skill here is
> somewhat limited.

Ditto, at this level.

Best,
Robert
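P.S. Given the av_needed row above, the direct test of David's suggestion
would be a manual pass over the two tables (a sketch; ANALYZE alone
refreshes the planner statistics, VACUUM also removes the dead tuples):

    VACUUM (VERBOSE, ANALYZE) ds3.job_entry;
    VACUUM (VERBOSE, ANALYZE) ds3.blob;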
Re: Query hitting empty tables taking 48 minutes
On Jun 7, 2018, at 4:58 PM, David G. Johnston wrote:

> I would suspect that vacuuming these tables would solve your problem.
> Whether there is an issue beyond a lack of vacuuming, or related to
> auto-vacuum, I am unsure. Though at this point it may take a vacuum full
> to recover back to a sane state. Though ANALYZE by itself should clear
> up the statistical discrepancy.

A nightly VACUUM FULL which ran based on heuristics resolved the problem.
This would seem to point to a db problem more than an app problem? I'm
unsure how the app could have an effect of this magnitude on the database,
although I'd love to be told otherwise.

Best,
Robert
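P.S. For reference, the recovery path David described amounts to something
like this on the two tables (a sketch; our nightly job's heuristics aren't
shown here, and note VACUUM FULL rewrites each table under an ACCESS
EXCLUSIVE lock, blocking all other access):

    VACUUM FULL ANALYZE ds3.blob;
    VACUUM FULL ANALYZE ds3.job_entry;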
Re: Query hitting empty tables taking 48 minutes
> On Jun 8, 2018, at 10:23 AM, David G. Johnston wrote:
>
> Not sure what the right answer is, but it seems your database (those
> tables at least) is mis-configured for the workload being executed
> against them. Significantly increasing the aggressiveness of the
> auto-vacuum process and/or inserting manual vacuum analyze commands into
> your application at appropriate times are probably necessary.

I'm fine with changing up table parameters, which is the option that would
make sense for us (thanks for pointing that out). I have the auto-vacuum
threshold high because of other huge tables, and was not aware of the
per-table settings. I'll use this excuse one time: I inherited this setup,
now I own it :-)

I'm concerned about a query that's going against two tables that have had
300k entries in them (i.e. now empty and 2 entries) taking so long. Even
if those tables were full, the query should have taken no time at all. The
machine has 64GB memory, 12 physical cores (+12 hyperthreads), and the
storage is on a ZFS pool with 5 mirrored vdevs of 7.2k SAS drives. The
entire db size is 2.63GB, easily fitting into memory. This is a production
appliance, and it is built to handle the load. Obviously it needs some
intelligent tuning, though.

 nspname | relname   | n_tup_ins | n_tup_upd | n_tup_del | n_live_tup | n_dead_tup | reltuples | av_threshold | last_vacuum            | last_analyze | av_needed | pct_dead
 ds3     | blob      | 303498    | 2559      | 303496    | 2          | 0          | 2         | 5000         | 2018-06-08 04:35:00.00 | NULL         | false     | 0
 ds3     | job_entry | 303659    | 815       | 303659    | 0          | 0          | 0         | 5000         | 2018-06-08 04:35:00.00 | NULL         | false     | 0

Best,
Robert
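P.S. The per-table settings mentioned above are storage parameters; a
sketch for these two high-churn tables (the values are illustrative, not
tuned) might look like:

    -- Vacuum/analyze after far fewer row changes than the global
    -- thresholds allow
    ALTER TABLE ds3.job_entry SET (
        autovacuum_vacuum_threshold     = 50,
        autovacuum_vacuum_scale_factor  = 0.0,
        autovacuum_analyze_threshold    = 50,
        autovacuum_analyze_scale_factor = 0.0
    );
    ALTER TABLE ds3.blob SET (
        autovacuum_vacuum_threshold  = 50,
        autovacuum_analyze_threshold = 50
    );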
Re: Query hitting empty tables taking 48 minutes
> On Jun 7, 2018, at 4:18 PM, Robert wrote:
>
>> You can capture the execution plan of the bad statement by using
>> auto_explain; that would certainly shed more light on the problem.

A different query started showing up as the problem, and the auto_explain
with analyze shows an oddity: the total query duration is 11k seconds,
while the query itself is 3 seconds. I captured a ZFS snapshot as soon as
the problem was noticed.

Jun 16 23:15:30 blackpearl postgres[9860]: [79-1] db=tapesystem,user=Administrator,app=[unknown],client=127.0.0.1 LOG:  duration: 10856644.336 ms  plan:
Jun 16 23:15:30 blackpearl postgres[9860]: [79-2]   Query Text: DELETE FROM ds3.blob WHERE EXISTS (SELECT * FROM ds3.s3_object WHERE id = ds3.blob.object_id AND (bucket_id = $1))
Jun 16 23:15:30 blackpearl postgres[9860]: [79-3]   Delete on blob  (cost=1308.79..1312.82 rows=1 width=12) (actual time=3465.919..3465.919 rows=0 loops=1)
Jun 16 23:15:30 blackpearl postgres[9860]: [79-4]     ->  Nested Loop  (cost=1308.79..1312.82 rows=1 width=12) (actual time=50.293..2435.271 rows=30 loops=1)
Jun 16 23:15:30 blackpearl postgres[9860]: [79-5]           ->  Bitmap Heap Scan on s3_object  (cost=634.39..636.41 rows=1 width=22) (actual time=50.269..153.885 rows=30 loops=1)
Jun 16 23:15:30 blackpearl postgres[9860]: [79-6]                 Recheck Cond: (bucket_id = 'bc6e6b10-80ad-4329-9fb9-1a66d8c1505e'::uuid)
Jun 16 23:15:30 blackpearl postgres[9860]: [79-7]                 Heap Blocks: exact=3704
Jun 16 23:15:30 blackpearl postgres[9860]: [79-8]                 ->  Bitmap Index Scan on ds3_s3_object__bucket_id  (cost=0.00..634.39 rows=1 width=0) (actual time=49.552..49.552 rows=30 loops=1)
Jun 16 23:15:30 blackpearl postgres[9860]: [79-9]                       Index Cond: (bucket_id = 'bc6e6b10-80ad-4329-9fb9-1a66d8c1505e'::uuid)
Jun 16 23:15:30 blackpearl postgres[9860]: [79-10]          ->  Bitmap Heap Scan on blob  (cost=674.39..676.41 rows=1 width=22) (actual time=0.005..0.006 rows=1 loops=30)
Jun 16 23:15:30 blackpearl postgres[9860]: [79-11]                Recheck Cond: (object_id = s3_object.id)
Jun 16 23:15:30 blackpearl postgres[9860]: [79-12]                Heap Blocks: exact=30
Jun 16 23:15:30 blackpearl postgres[9860]: [79-13]                ->  Bitmap Index Scan on ds3_blob__object_id  (cost=0.00..674.39 rows=1 width=0) (actual time=0.004..0.004 rows=1 loops=30)
Jun 16 23:15:30 blackpearl postgres[9860]: [79-14]                      Index Cond: (object_id = s3_object.id)

Doing a ZFS rollback and executing the query shows what is happening,
although not to the extent above. If I read this correctly, it's the
constraint checks that are causing the query to take so long. I don't see
any server configuration that might allow those checks to be parallelized.
Is that possible?

tapesystem=# PREPARE foo(uuid) AS DELETE FROM ds3.blob WHERE EXISTS (SELECT * FROM ds3.s3_object WHERE id = ds3.blob.object_id AND (bucket_id = $1)); EXPLAIN ANALYZE EXECUTE foo('bc6e6b10-80ad-4329-9fb9-1a66d8c1505e');
PREPARE
                                                            QUERY PLAN
 Delete on blob  (cost=9555.07..21134.01 rows=23 width=12) (actual time=1516.140..1516.140 rows=0 loops=1)
   ->  Hash Join  (cost=9555.07..21134.01 rows=23 width=12) (actual time=237.816..621.306 rows=30 loops=1)
         Hash Cond: (s3_object.id = blob.object_id)
         ->  Seq Scan on s3_object  (cost=0.00..7454.04 rows=23 width=22) (actual time=0.027..148.503 rows=30 loops=1)
               Filter: (bucket_id = 'bc6e6b10-80ad-4329-9fb9-1a66d8c1505e'::uuid)
               Rows Removed by Filter: 3
         ->  Hash  (cost=5805.03..5805.03 rows=33 width=22) (actual time=235.219..235.219 rows=33 loops=1)
               Buckets: 524288  Batches: 1  Memory Usage: 19917kB
               ->  Seq Scan on blob  (cost=0.00..5805.03 rows=33 width=22) (actual time=0.038..114.107 rows=33 loops=1)
 Trigger for constraint blob_tape_blob_id_fkey: time=5389.627 calls=30
 Trigger for constraint multi_part_upload_placeholder_blob_id_fkey: time=4566.305 calls=30
 Trigger for constraint mul
Re: Query hitting empty tables taking 48 minutes
> On Jun 18, 2018, at 4:04 PM, Laurenz Albe wrote:
>
> Robert Creager wrote:
>> A different query started showing up as the problem, and the
>> auto_explain with analyze shows an oddity: the total query duration is
>> 11k seconds, while the query itself is 3 seconds. I captured a ZFS
>> snapshot as soon as the problem was noticed.
>>
>> db=tapesystem,user=Administrator,app=[unknown],client=127.0.0.1 LOG:  duration: 10856644.336 ms  plan:
>>
>> Query Text: DELETE FROM ds3.blob WHERE EXISTS (SELECT * FROM ds3.s3_object WHERE id = ds3.blob.object_id AND (bucket_id = $1))
>>
>> Delete on blob  (cost=1308.79..1312.82 rows=1 width=12) (actual time=3465.919..3465.919 rows=0 loops=1)
>>
>> [...]
>>
>> Doing a ZFS rollback and executing the query shows what is happening,
>> although not to the extent above. If I read this correctly, it's the
>> constraint checks that are causing the query to take so long. I don't
>> see any server configuration that might allow those checks to be
>> parallelized. Is that possible?
>>
>> tapesystem=# PREPARE foo(uuid) AS DELETE FROM ds3.blob WHERE EXISTS (SELECT * FROM ds3.s3_object WHERE id = ds3.blob.object_id AND (bucket_id = $1)); EXPLAIN ANALYZE EXECUTE foo('bc6e6b10-80ad-4329-9fb9-1a66d8c1505e');
>> PREPARE
>
> Are we missing an "EXECUTE foo('bc6e6b10-80ad-4329-9fb9-1a66d8c1505e')" here?

Nope, just hidden on the same line; this is the plan for that EXECUTE.

>> QUERY PLAN
>>
>> Delete on blob  (cost=9555.07..21134.01 rows=23 width=12) (actual time=1516.140..1516.140 rows=0 loops=1)
>> [...]
>> Trigger for constraint blob_tape_blob_id_fkey: time=5389.627 calls=30
>> [...]
>> Execution time: 85175.581 ms
>>
>> I executed a vacuum analyze, then a vacuum full analyze; neither changed
>> the timing. Other than removing constraints, is there any way to
>> address this?
>
> I cannot explain the discrepancy between the runtimes of 85 seconds
> versus 10857 seconds.

It would be nice if the auto_explain analyze included the other
information like the psql analyze does.

> But other than that, it sure looks like the foreign keys are missing an
> index on the source columns, leading to a sequential scan for each
> deletion and table.

You'd think, but they are present. As an example, the first constraint,
blob_tape_blob_id_fkey, is indexed.

create table blob
(
    [...]
    id uuid not null
        constraint blob_pkey primary key,
    [...]
);

create table blob_tape
(
    blob_id uuid not null
        constraint blob_tape_blob_id_fkey
            references ds3.blob
            on update cascade on delete cascade,
    id      uuid not null
        constraint blob_tape_pkey primary key,
    [...]
);

create index tape_blob_tape__blob_id
    on blob_tape (blob_id);

> Yours,
> Laurenz Albe
> --
> Cybertec | https://www.cybertec-postgresql.com
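P.S. To rule this out across the whole schema, a catalog query along these
lines lists foreign keys whose referencing column does not lead any index
(a sketch, simplified to single-column foreign keys):

    SELECT c.conrelid::regclass AS referencing_table,
           c.conname            AS constraint_name,
           a.attname            AS fk_column
    FROM pg_constraint c
    JOIN pg_attribute a
      ON  a.attrelid = c.conrelid
      AND a.attnum   = c.conkey[1]
    WHERE c.contype = 'f'
      AND NOT EXISTS (
            SELECT 1
            FROM pg_index i
            WHERE i.indrelid = c.conrelid
              AND i.indkey[0] = c.conkey[1]  -- FK column leads the index
          );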
Re: Query hitting empty tables taking 48 minutes
> On Jun 18, 2018, at 4:33 PM, Robert Creager wrote:
>
>> I cannot explain the discrepancy between the runtimes of 85 seconds
>> versus 10857 seconds.
>
> It would be nice if the auto_explain analyze included the other
> information like the psql analyze does.

Like this. I'll try again.

auto_explain.log_triggers (boolean)

    auto_explain.log_triggers causes trigger execution statistics to be
    included when an execution plan is logged. This parameter has no
    effect unless auto_explain.log_analyze is enabled. This parameter is
    off by default. Only superusers can change this setting.
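P.S. Putting it together, a minimal session-level setup to capture plans
with trigger timings (a sketch; LOAD requires superuser here, and the
module can equally be added to shared_preload_libraries):

    LOAD 'auto_explain';
    SET auto_explain.log_min_duration = '1s'; -- log anything slower than this
    SET auto_explain.log_analyze  = on;       -- include actual times/row counts
    SET auto_explain.log_triggers = on;       -- include per-constraint trigger times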