Extremely slow when query uses GIST exclusion index

2018-08-28 Thread David
Hi. My databases make heavy use of timestamp ranges, and they rely on GIST
exclusion constraints to ensure that the ranges are disjoint. I've noticed
that queries that hit the GIST indexes are EXTREMELY slow, and the queries
run much faster if I make trivial changes to avoid the GIST indexes.

Here's the setup for a test case. (Timings were collected on PostgreSQL
9.5.4 on x86_64-pc-linux-gnu, Intel Xeon E5 in a VM with 3.5 GB RAM):

CREATE TABLE app (
  pk SERIAL PRIMARY KEY,
  group_id TEXT NOT NULL,
  app_time TIMESTAMPTZ NOT NULL
);

CREATE TABLE group_span (
  pk SERIAL PRIMARY KEY,
  group_id TEXT NOT NULL,
  valid_period TSTZRANGE NOT NULL,
  EXCLUDE USING GIST (group_id WITH =, valid_period WITH &&)
);

CREATE TABLE member_span (
  pk SERIAL PRIMARY KEY,
  member_id TEXT NOT NULL,
  group_id TEXT NOT NULL,
  valid_period TSTZRANGE NOT NULL,
  EXCLUDE USING GIST
    (member_id WITH =, group_id WITH =, valid_period WITH &&)
);

-- Fill tables with some random data

INSERT INTO app (group_id, app_time)
SELECT
  MD5(CONCAT(GENERATE_SERIES(1, 1), RANDOM())),
  DATE_TRUNC('month', TIMESTAMPTZ '2000-01-01' +
 INTERVAL '3 years' * RANDOM());

-- Give groups a 1-year span, and give some groups a 2nd-year span:
INSERT INTO group_span (group_id, valid_period)
(SELECT
   group_id,
   TSTZRANGE(app_time, app_time + INTERVAL '1 year')
 FROM app)
UNION ALL
(SELECT
   group_id,
   TSTZRANGE(app_time + INTERVAL '1 year',
 app_time + INTERVAL '2 year')
 FROM app LIMIT 2000);

-- Create members with a random span within their group_span:
INSERT INTO member_span (member_id, group_id, valid_period)
SELECT
  MD5(RANDOM()::TEXT),
  group_id,
  TSTZRANGE(
    LOWER(valid_period),
    UPPER(valid_period) - DATE_TRUNC(
      'days',
      (UPPER(valid_period) - LOWER(valid_period)) * RANDOM()
    )
  )
FROM group_span;


Given this setup, here's a query that hits the GIST exclusion index on the
"member_span" table. It takes 38 sec on my machine:

SELECT *
FROM app
JOIN group_span ON
  app.group_id = group_span.group_id AND
  app.app_time <@ group_span.valid_period
JOIN member_span ON
  group_span.group_id = member_span.group_id AND
  group_span.valid_period && member_span.valid_period;

Here's the query plan for that query:

Nested Loop  (cost=319.27..776.39 rows=1 width=196) (actual time=15.370..38406.466 rows=1 loops=1)
  Join Filter: (app.group_id = member_span.group_id)
  ->  Hash Join  (cost=319.00..771.00 rows=12 width=104) (actual time=5.790..130.613 rows=1 loops=1)
        Hash Cond: (group_span.group_id = app.group_id)
        Join Filter: (app.app_time <@ group_span.valid_period)
        Rows Removed by Join Filter: 2000
        ->  Seq Scan on group_span  (cost=0.00..257.00 rows=12000 width=59) (actual time=0.005..16.282 rows=12000 loops=1)
        ->  Hash  (cost=194.00..194.00 rows=1 width=45) (actual time=5.758..5.758 rows=1 loops=1)
              Buckets: 16384  Batches: 1  Memory Usage: 910kB
              ->  Seq Scan on app  (cost=0.00..194.00 rows=1 width=45) (actual time=0.002..2.426 rows=1 loops=1)
  ->  Index Scan using member_span_member_id_group_id_valid_period_excl on member_span  (cost=0.28..0.44 rows=1 width=92) (actual time=1.988..3.817 rows=1 loops=1)
        Index Cond: ((group_id = group_span.group_id) AND (group_span.valid_period && valid_period))
Planning time: 0.784 ms
Execution time: 38410.227 ms

We can make a small tweak to the query to make it complicated enough that
the execution planner avoids the GIST index. In this particular case, we
can replace "app.app_time <@ group_span.valid_period" with the equivalent
"app.app_time >= LOWER(group_span.valid_period) AND
app.app_time < UPPER(group_span.valid_period)". This equivalent query is
MUCH faster:

SELECT *
FROM app
JOIN group_span ON
  app.group_id = group_span.group_id AND
  app.app_time >= LOWER(group_span.valid_period) AND
  app.app_time < UPPER(group_span.valid_period)
JOIN member_span ON
  group_span.group_id = member_span.group_id AND
  group_span.valid_period && member_span.valid_period;

It only takes 86 ms, even though it's doing 3 seq scans instead of using
the index:

Hash Join  (cost=953.71..1186.65 rows=8 width=196) (actual time=58.364..84.706 rows=1 loops=1)
  Hash Cond: (app.group_id = group_span.group_id)
  Join Filter: ((app.app_time >= lower(group_span.valid_period)) AND (app.app_time < upper(group_span.valid_period)))
  Rows Removed by Join Filter: 2000
  ->  Seq Scan on app  (cost=0.00..194.00 rows=1 width=45) (actual time=0.007..2.391 rows=1 loops=1)
  ->  Hash  (cost=952.81..952.81 rows=72 width=151) (actual time=58.343..58.343 rows=12000 loops=1)
        Buckets: 16384 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 2285kB
        ->  Hash Join  (cost=407.00..952.81 rows=72 width=151) (actual time=15.048..44.103 rows=12000 loops=1)
              Hash Cond: (member_span.group_id = group_span.group_id)
              Join Filter: (group_s

Re: Extremely slow when query uses GIST exclusion index

2018-08-29 Thread David
Thanks for your help investigating this! Follow-up below:

On Wed, Aug 29, 2018 at 7:25 AM, Andreas Kretschmer wrote:
>
> Okay, other solution. The problem is the nested loop, we can disable that:
>
> test=*# set enable_nestloop to false;


Is it OK to keep this off permanently in production? I thought these
settings were just for debugging, and once we've identified the culprit,
we're supposed to take other steps (?) to avoid the suboptimal execution
plan.

> your GiST-Index contains (member_id, group_id, valid_period), but your query
> is only on the latter 2 fields.


Yeah, I didn't really want a GIST index in the first place -- PostgreSQL
created it automatically as a side effect of the exclusion constraint that
I need.

Your suggestion to create *another* GIST index is an interesting
workaround. But we've seen that the query runs even faster if we don't use
the GIST index(es) at all. So is there any way to tell the planner to
avoid the GIST index altogether?

(Alternatively, could there be a bug that's causing PostgreSQL to
underestimate the cost of using the GIST index?)


>  Nested Loop  (cost=319.27..776.18 rows=1 width=196) (actual time=3.156..334.963 rows=1 loops=1)
>    Join Filter: (app.group_id = member_span.group_id)
>    ->  Hash Join  (cost=319.00..771.00 rows=12 width=104) (actual time=3.100..14.040 rows=1 loops=1)


Hm, also, it looks like one of the oddities of this query is that
PostgreSQL is severely underestimating the cardinality of the join. It
seems to think that the join will result in only 1 row, when the join
actually produces 10,000 rows. Maybe that's why the planner thinks that
using the GIST index is cheap? (I.e., the planner thought that it would
only need to do 1 GIST index lookup, which is cheaper than a sequential
scan; but in reality it has to do 10,000 GIST index lookups, which is much
more expensive than a scan.) Is there any way to help the planner better
estimate how big the join output is going to be?


Re: Batch insert heavily affecting query performance.

2017-12-27 Thread David Miller
Jean,
It is very likely you are running out of IOPS with that size of server. We have
several Postgres databases running at AWS. We consistently run out of IOPS on
our development servers due to the types of queries and the sizing of our
development databases. I would check the AWS monitoring graphs to determine the
cause. We typically see low CPU and high IOPS just prior to our degraded
performance. Our production environment runs provisioned IOPS to avoid this
very issue.
Regards, David 

From: Jean Baro
To: Jeremy Finzel
Cc: Danylo Hlynskyi; [email protected]
Sent: Wednesday, December 27, 2017 11:03 AM
Subject: Re: Batch insert heavily affecting query performance.
Sorry guys,
The performance problem is not caused by PG. 
Index Scan using idx_user_country on public.old_card  (cost=0.57..1854.66 rows=460 width=922) (actual time=3.442..76.606 rows=200 loops=1)
  Output: id, user_id, user_country, user_channel, user_role, created_by_system_key, created_by_username, created_at, last_modified_at, date_start, date_end, payload, tags, menu, deleted, campaign, correlation_id
  Index Cond: (((old_card.user_id)::text = '1234'::text) AND (old_card.user_country = 'BR'::bpchar))
  Buffers: shared hit=11 read=138 written=35
Planning time: 7.748 ms
Execution time: 76.755 ms
77ms on an 8GB database with 167MM rows and almost 500GB in size is amazing!!
Now we are investigating other bottlenecks: is it the creation of a new
connection to PG (no connection pooler at the moment, like PgBouncer)? Is it
the Lambda start-up time? Is it the network performance between PG and Lambda?
I am sorry for wasting your time guys, it helped us to find the problem though, 
even if it wasn't a PG problem. 
BTW, what a performance! I am impressed. 
Thanks PG community! 
On 27 Dec 2017 at 14:34, "Jean Baro" wrote:

Thanks Jeremy,
We will provide a more complete EXPLAIN as other people have suggested. 
I am glad we might end up with much better performance (currently each query
takes around 2 seconds!).
Cheers

On 27 Dec 2017 at 14:02, "Jeremy Finzel" wrote:




The EXPLAIN:

Index Scan using idx_user_country on card  (cost=0.57..1854.66 rows=460 width=922)
  Index Cond: (((user_id)::text = '4684'::text) AND (user_country = 'BR'::bpchar))

Show 3 runs of the full EXPLAIN ANALYZE plan on the given condition so that we
can also see cold vs warm cache performance.
There is definitely something wrong as there is no way a query like that should 
take 500ms.  Your instinct is correct there.




   

Re: pgaudit and create postgis extension logs a lot inserts

2018-01-19 Thread David Steele
On 1/19/18 6:05 AM, Magnus Hagander wrote:
> 
> 
> On Thu, Jan 18, 2018 at 6:54 PM, Joe Conway wrote:
> 
> On 01/18/2018 04:12 AM, Svensson Peter wrote:
> > When I then install  postgis extension in a database it writes a huge
> > amount of logs which slow down the server a lot.
> > Not only table creation and functions are logged,  even  all inserts in 
> > spatial_ref_sys are written to the audit-log.
> >
> > LOG:  AUDIT: SESSION,1,1,DDL,CREATE FUNCTION,,,"
> > ..
> > INSERT INTO ""spatial_ref_sys"" (""srid"",""auth_name"
> > 
> >
> > This behaviour make pgaudit useless in our environment due to the
> > overhead in log-file write.
> 
> How often do you intend to install PostGIS? Disable pgaudit, install
> PostGIS, enable pgaudit?
> 
> 
> Would it make sense for pgaudit to, at least by option, not include DDL
> statements that are generated as "sub-parts" of a CREATE EXTENSION? It
> should still log the CREATE EXTENSION of course, but not necessarily all
> the contents of it, since that's actually defined in the extension
> itself already? 
That's doable, but I think it could be abused if it were always on, and
installing extensions is generally not a daily activity.

It seems the best option in this case is to disable pgaudit before
installing postgis, or to install postgis first.

Regards,
-- 
-David
[email protected]



Re: pgaudit and create postgis extension logs a lot inserts

2018-01-19 Thread David Steele
Hi Peter,

On 1/18/18 7:12 AM, Svensson Peter wrote:
> 
> Also noticed that setting the session log to none (set
> pgaudit.log='none';) overrides the parameter from postgresql.conf, but
> does not get logged, and then you can do whatever you want without any
> audit.
> I suppose this change of the audit session log parameter should be
> logged to file?

pgaudit is not intended to audit the superuser and only a superuser can
set pgaudit.log.

However, you can limit superuser access with the set_user extension:
https://github.com/pgaudit/set_user

Regards,
-- 
-David
[email protected]



Slow performance after restoring a dump

2018-03-19 Thread David Osborne
=685 width=4) (actual time=0.014..0.014 rows=0 loops=1)
 Index Cond: (product_id = 2466420)
 Filter: (credit_stock_account_id = 3)
 Buffers: shared hit=3
   ->  Index Scan using account_trans_pkey on account_trans a  (cost=0.00..9.49 rows=1 width=8) (never executed)
 Index Cond: (account_trans_id = s.account_trans_id)
 Total runtime: 0.050 ms
(15 rows)


Regards,
-- 
David


Re: Slow performance after restoring a dump

2018-03-19 Thread David Osborne
Hi, yes I've run "analyse" against the newly restored database. Should that
be enough?

On 19 March 2018 at 15:35, Tom Lane  wrote:

> David Osborne  writes:
>
> The first question people will ask is did you re-ANALYZE the new
> database?  pg_dump doesn't take care of that for you, and auto-analyze
> might not think it needs to process the smaller tables.
>
>


Re: Slow performance after restoring a dump

2018-03-19 Thread David Osborne
That did the trick... thanks! Yes, perhaps a minor planner difference just
tipped us over the edge previously.

=> alter table stock_trans alter column product_id set STATISTICS 1000;


                                  QUERY PLAN
------------------------------------------------------------------------------
 Limit  (cost=3461.10..3461.10 rows=1 width=4) (actual time=0.014..0.014 rows=0 loops=1)
   Buffers: shared hit=3
   ->  Sort  (cost=3461.10..3461.75 rows=260 width=4) (actual time=0.013..0.013 rows=0 loops=1)
         Sort Key: a.trans_date DESC
         Sort Method: quicksort  Memory: 25kB
         Buffers: shared hit=3
         ->  Nested Loop  (cost=0.87..3459.80 rows=260 width=4) (actual time=0.008..0.008 rows=0 loops=1)
               Buffers: shared hit=3
               ->  Index Scan using stock_trans_product_idx on stock_trans s  (cost=0.43..1263.55 rows=260 width=4) (actual time=0.007..0.007 rows=0 loops=1)
                     Index Cond: (product_id = 2466420)
                     Filter: (credit_stock_account_id = 3)
                     Buffers: shared hit=3
               ->  Index Scan using account_trans_pkey on account_trans a  (cost=0.43..8.44 rows=1 width=8) (never executed)
                     Index Cond: (account_trans_id = s.account_trans_id)
 Planning time: 0.255 ms
 Execution time: 0.039 ms
(16 rows)




On 19 March 2018 at 16:22, Tom Lane  wrote:

> David Osborne  writes:
> > Hi, yes I've run "analyse" against the newly restored database. Should
> > that be enough?
>
> My apologies, you did say that further down in the original message.
> It looks like the core of the problem is the poor rowcount estimation
> here:
>
>  ->  Bitmap Index Scan on stock_trans_product_idx  (cost=0.00..31.42 rows=1465 width=0) (actual time=0.009..0.009 rows=0 loops=1)
>        Index Cond: (product_id = 2466420)
>        Buffers: shared hit=3
>
> You might be able to improve that by raising the statistics target
> for stock_trans.product_id.  I'm not sure why you weren't getting
> bitten by the same issue in 9.1; but the cost estimates aren't
> that far apart for the two plans, so maybe you were just lucky ...
>
> regards, tom lane
>



-- 
David Osborne
Qcode Software Limited
http://www.qcode.co.uk
T: +44 (0)1463 896484


Re: Slow planning time for custom function

2018-03-23 Thread David Rowley
On 24 March 2018 at 14:35, Andres Freund  wrote:
> How long does planning take if you repeat this? I wonder if a good chunk
> of those 1.8s is initial loading of plv8.

Maybe, but it could also be the execution of the function; after all,
the planner does invoke immutable functions:

# explain verbose select lower('TEST');
QUERY PLAN
---
 Result  (cost=0.00..0.01 rows=1 width=32)
   Output: 'test'::text
(2 rows)

Would be interesting to see what changes without the IMMUTABLE flag.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: functions: VOLATILE performs better than STABLE

2018-03-25 Thread David Rowley
On 25 March 2018 at 18:00, Laurenz Albe  wrote:
> Peter wrote:
>> Over all, VOLATILE performs better than STABLE.
>
> The reason is that the subquery with the VOLATILE function can be
> flattened; see the EXPLAIN (VERBOSE) output.
>
> There is no guarantee that less volatility means better performance.

Although, it would be nice.

TPC-H Q1 does appear to be crafted to allow databases with smarter
expression evaluation to get a better score.

It would probably require some sort of recursive expression evaluation
where at each level we check if that expression has already been seen,
if it has, then replace it with some sort of placeholder, then
evaluate each placeholder in the required order.

Probably the first part could be done during planning.  It would mean
targetlists would need to carry a bit more weight.

It would be an interesting project to work on, but I'm not planning to do it personally.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: help in analysis of execution plans

2018-05-05 Thread David Rowley
On 6 May 2018 at 03:16, Neto pr  wrote:
> Execution Plan 1 (query execution time 4.7 minutes):
> - https://explain.depesz.com/s/Ughh
> - Postgresql version 10.1 (default) with index on l_shipdate (table lineitem)
>
> Execution Plan 2 (query execution time 2.95 minutes):
> - https://explain.depesz.com/s/7Zb7
> - Postgresql version 9.5 (version with source code changed by me) with
> index on l_orderkey (table lineitem).
>
> Some doubts
> - Difference between GroupAggregate and Finalize GroupAggregate

A Finalize Aggregate node is required to combine the Partially
Aggregated records.  A Partially Aggregated result differs from normal
aggregation as the final function of the aggregate is not called
during partial aggregation.  This allows the finalize aggregate to
combine the partially aggregated results then call the final function.
Imagine an aggregate like AVG() where it goes and internally
calculates the sum and the count of non-null records.  A partial
aggregate node would return {sum, count}, but a normal aggregate would
return {sum / count}.  Having {sum, count} allows each partial
aggregated result to be combined allowing the average to be calculated
with the total_sum / total_count.
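
To make that concrete, here's a tiny standalone illustration in plain SQL
(the numbers and the {sum, count} representation are just for illustration;
this is not the executor's actual internal state format):

WITH partial_states (sum_x, count_x) AS (
  VALUES (100.0, 4),   -- partial aggregate state from worker 1: {sum, count}
         (50.0,  2)    -- partial aggregate state from worker 2
)
SELECT SUM(sum_x) / SUM(count_x) AS avg   -- combine the states, then "finalize"
FROM partial_states;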

> - because some algorithms show measurements on "Disk" and others on
> "Memory" example:
>  - External sort Disk: 52784kB
>  - quicksort Memory: 47770kB

Please read about work_mem in
https://www.postgresql.org/docs/current/static/runtime-config-resource.html

The reason 10.1 is slower with the parallel query is down to the
bitmap heap scan going lossy and scanning many more heap pages than it
expected.  You could solve this in various ways:

1. Increase work_mem enough to prevent the scan from going lossy (see
lossy=1605531 in your plan)
2. Turn off bitmap scans (SET enable_bitmapscan = off)
3. Cluster the table on l_shipdate
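
As a sketch, the three options above could be tried like this (the work_mem
value and the index name are only examples, not taken from your setup):

SET work_mem = '256MB';                          -- option 1: keep the bitmap exact
SET enable_bitmapscan = off;                     -- option 2: per-session, for testing
CLUSTER lineitem USING lineitem_l_shipdate_idx;  -- option 3: assumes this index name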

Unfortunately, parallel query will often choose a parallel plan that
utilises multiple workers for a less efficient plan when it estimates that
cost / n_workers is lower than the cheapest serial plan's cost.
This appears to also be a contributor to your problem. You may get the
9.5 performance if you disabled parallel query, or did one of the 3
options above.

You may also want to consider using a BRIN index on the l_shipdate
instead of a BTREE index. The cost estimation for BRIN may realise
that the bitmap heap scan is not a good option, although I'm not sure
it'll be better than what the current v10 plan is using.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Performance differential when 0 values present vs when 1 values present. Planner return 52k rows when 0 expected.

2022-04-27 Thread David Rowley
On Wed, 27 Apr 2022 at 19:54, Emil Iggland  wrote:
>
>  > You've got the wrong column order (for this query anyway) in that
>  > index.  It'd work a lot better if dataview were the first column;

> I might be misunderstanding you, but I assume that you are suggesting an
> index on (dataview, valuetimestamp).
> We have that index, it is the primary key. For some reason it isn't
> being selected.

I don't think that index can be used for your original query. It could
only be used if "channel" is unique in "valueseries" and you'd written
the query as:

select * from datavalue
where dataview = (select id from valueseries where channel = 752433)
ORDER BY VALUETIMESTAMP DESC
FETCH FIRST ROW only;

that would allow a backwards index scan using the (dataview,
valuetimestamp) index.  Because you're using the IN clause to possibly
look for multiple "dataview" values matching the given "channel", the
index range scan does not have a single point to start at.  What
you've done with the LATERAL query allows the index to be scanned once
for each "valueseries" row with a "channel" value matching your WHERE
clause.

I guess "channel" must not be the primary key to "valueseries" and
that's why you use an IN().

The above query would return an error if multiple rows were returned
by the subquery.

David




Re: Useless memoize path generated for unique join on primary keys

2022-05-03 Thread David Rowley
On Tue, 3 May 2022 at 23:05, Benjamin Coutu  wrote:
>   ->  Memoize  (cost=0.58..0.67 rows=1 width=16) (actual time=0.002..0.002 rows=0 loops=4067215)
>         Cache Key: e2.field, e2.index
>         Cache Mode: logical
>         Hits: 0  Misses: 4067215  Evictions: 3228355  Overflows: 0  Memory Usage: 65537kB
>         Buffers: shared hit=16268863
>         ->  Index Only Scan using pk_extdataregular on extdataregular e1  (cost=0.57..0.66 rows=1 width=16) (actual time=0.001..0.001 rows=0 loops=4067215)
>               Index Cond: ((field = e2.field) AND (index = e2.index))

> Why would it ever make sense to generate a memoize plan for a unique join?

It wouldn't ever make sense.

The problem is that estimate_num_groups() is used to estimate the
number of distinct values and that function does not know about
primary keys. There's no way the costing of Memoize would allow a
Memoize plan to be used if it thought all values were unique, so the
only possibility here is that ndistinct is being underestimated by
some amount that makes Memoize look like the most favourable plan.

You could see what the planner thinks about the ndistinct estimate on
field, index by doing:

EXPLAIN SELECT field,index FROM extdataregular GROUP BY 1,2;

Whatever you see in the final row estimate for that plan is what's
being fed into the Memoize costing code.

> I think this issue might tie in with the current discussion over on the 
> hackers mailing list [1]

I'd say it's a pretty different problem. The cache hit ratio
discussion on that thread talks about underestimating the hit ratio.
That particular problem could only lead to Memoize plans *not* being
chosen when they maybe should be. Not the other way around, which is
your case.

create statistics extdataregular_field_index_stats (ndistinct) on
field, index from extdataregular;
analyze extdataregular;

would likely put that right.

David




Re: Useless memoize path generated for unique join on primary keys

2022-05-03 Thread David Rowley
On Wed, 4 May 2022 at 00:21, Benjamin Coutu  wrote:
> Thanks David, using extended statistics for both (and only for both) tables 
> solved this problem.

Oh, whoops. I did get that backwards.  The estimate used by the
Memoize costing code is from the outer side of the join, which is the
extdataempty in this case. I don't think the
extdataregular_field_index_stats will do anything. It'll be the ones
you added on extdataempty that are making it work.
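
In other words, the statistics that matter here would be the ones on the
outer table, along these lines (same shape as the earlier command, just
targeting extdataempty; this assumes it has the same field/index columns):

CREATE STATISTICS extdataempty_field_index_stats (ndistinct)
  ON field, index FROM extdataempty;
ANALYZE extdataempty;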

> BTW, thank you for all your work on performance in recent releases.

Thanks for the feedback :)

David




Re: Window partial fetch optimization

2022-05-03 Thread David Rowley
On Wed, 4 May 2022 at 06:11, Levi Aul  wrote:
> It is our expectation that this query “should” be able to be cheap-to-compute 
> and effectively instantaneous. (It’s clear to us how we would make it so, 
> given a simple LMDB-like sorted key-value store: prefix-match on 
> holder_address; take the first row you find for the contract-address you’re 
> on; build a comparator key of (holder_address, contract_address, 
> highest-possible-version) and traverse to find the lowest row that sorts 
> greater than it; repeat.)
>
> Which might, in SQL, be expressed as something like this:
>
> WITH ranked_holder_balances AS (
> SELECT
> *,
> row_number() OVER w AS balance_rank
> FROM contract_balance_updates
> WHERE holder_address = '\x'::bytea
> WINDOW w AS (
> PARTITION BY holder_address, contract_address
> ORDER BY start_block_height DESC
> )
> ORDER BY holder_address ASC, contract_address ASC, start_block_height DESC
> )
> SELECT *
> FROM ranked_holder_balances
> WHERE balance_rank = 1

Yes, PostgreSQL 14 is not very smart about realising that WHERE
balance_rank = 1 is only going to match the first row of each window
partition. PostgreSQL 15 (coming later this year) should be better in
this regard as some work was done to teach the query planner about
monotonic window functions [1].  However, that change likely does not
do all you'd like here as the WindowAgg node still must consume and
throw away all tuples until it finds the first tuple belonging to the
next window partition.  It sounds like you really want "Skip Scans" or
"Loose Index Scans" which are implemented by some other RDBMS'.  I
imagine that even with the change to PostgreSQL 15 that it still
wouldn't be as fast as your DISTINCT ON example.

> WITH bup1 AS (
> SELECT DISTINCT bup.holder_address, bup.contract_address
> FROM contract_balance_updates bup
> WHERE bup.holder_address = 
> '\xe03c23519e18d64f144d2800e30e81b0065c48b5'::bytea
> ORDER BY bup.contract_address ASC
> )
> SELECT
>   bup1.holder_address,
>   bup1.contract_address,
>   (
> SELECT balance
> FROM contract_balance_updates bup2
> WHERE bup2.holder_address = bup1.holder_address
> AND bup2.contract_address = bup1.contract_address
> ORDER BY bup2.holder_address ASC, bup2.contract_address ASC, 
> bup2.start_block_height DESC
> LIMIT 1
>   ) AS balance
> FROM bup1

> I really don’t like this last approach; it scans twice, it’s surprising / 
> confusing for people maintaining the query, etc. I believe that, due to the 
> correlated subquery, the planning overhead is also O(N) with the number of 
> matched entities increases (though I don’t have a good test-case for this.)

No, the subquery is not replanned each time it is rescanned. It's
planned once and that same plan will be executed each time. So no O(n)
planning overhead.

> Is there any way to get PG to do what this last query is doing, purely using 
> window-functions / distinct on / etc.? Because, if there is, I can’t find it.
>
> It seems that PG can in fact do index-range-seeking (since that’s what it’s 
> doing when gathering the distinct contract_addresses in the last query.) It 
> seems intuitive to me that it should be using such an approach to filter for 
> rows in window/group-partitions, when a criteria+index that can be combined 
> to limit the size of the window/group are available to the planner. And that, 
> even when not able to be automatically inferred, it would make sense for 
> there to be control over such behaviour in SQL, using hypothetical syntax 
> like:

Unfortunately, DISTINCT can only be implemented with Hash Aggregate or
Sort / Index Scan + Unique.  We don't have anything currently which
will jump to the next highest key in an index.  There has been some
work on what we're starting to call "Skip Scans", but it's all still
work in progress.

You might find something useful in [2] which might help speed up your
DISTINCT query.
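
For what it's worth, the technique on that wiki page can be emulated with a
recursive CTE. A rough sketch against your table, reusing the
holder_address literal from your second query (otherwise untested):

WITH RECURSIVE t AS (
  (SELECT contract_address
   FROM contract_balance_updates
   WHERE holder_address = '\xe03c23519e18d64f144d2800e30e81b0065c48b5'::bytea
   ORDER BY contract_address
   LIMIT 1)
  UNION ALL
  SELECT (SELECT c.contract_address
          FROM contract_balance_updates c
          WHERE c.holder_address = '\xe03c23519e18d64f144d2800e30e81b0065c48b5'::bytea
            AND c.contract_address > t.contract_address
          ORDER BY c.contract_address
          LIMIT 1)            -- jump to the next distinct contract_address
  FROM t
  WHERE t.contract_address IS NOT NULL
)
SELECT contract_address FROM t WHERE contract_address IS NOT NULL;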

> -- for windows
> row_number() OVER (PARTITION BY x ORDER BY x LIMIT 10 OFFSET 3)
>
> -- for groups
> GROUP BY x, y, z (APPLYING LIMIT 20 OFFSET 5 PER GROUP)
>
>
> Does this make sense? Or is this something PG is already doing, and I just 
> haven’t found the right magic words / built my index correctly to unlock it? 
> (I notice that the last example is an index-only scan; would I get this 
> behaviour from the previous two queries if I made the index a covering index 
> such that those could be index-only scans as well?)

Unfortunately, there is no magic words here. PostgreSQL 14 simply has
no ability to know that row_number() is monotonically increasing,
therefore has no ability to skip any processing for rows that are
never needed.

David

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9d9c02ccd1aea8e9131d8f4edb21bf1687e40782
[2] https://wiki.postgresql.org/wiki/Loose_indexscan




Re: Why is there a Sort after an Index Only Scan?

2022-05-04 Thread David Rowley
On Thu, 5 May 2022 at 11:15, André Hänsel  wrote:
>
> Quick(?) question... why is there a Sort node after an Index Only Scan?
> Shouldn't the index already spit out sorted tuples?
>
> CREATE INDEX ON orders_test(shipping_date, order_id);
>
> EXPLAIN ANALYZE SELECT
> FROM orders_test
> WHERE TRUE
> AND shipping_date >= '2022-05-01'
> AND shipping_date <= '2022-05-01'
> ORDER BY order_id
> LIMIT 50;

Unfortunately, the query planner is not quite smart enough to realise
that your shipping_date clauses can only match a single value.
There's quite a bit more we could do with the planner's
EquivalenceClasses. There is a patch around [1] to help improve things in
this area, but it requires some more infrastructure to make it more
practical to do from a performance standpoint in the planner.

You'll get the plan you want if you rewrite the query and replace your
date range with shipping_date = '2022-05-01'.  Your use of WHERE TRUE
indicates to me that you might be building this query in an
application already, so maybe you can just tweak that application to
test if the start and end dates are the same and use equality when
they are.
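
For reference, the rewritten form would look something like this (the same
query as above, with the one-day range collapsed to an equality):

EXPLAIN ANALYZE SELECT
FROM orders_test
WHERE TRUE
AND shipping_date = '2022-05-01'
ORDER BY order_id
LIMIT 50;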

David

[1] https://commitfest.postgresql.org/38/3524/




Re: Postgresql 13 partitioning advice

2022-08-02 Thread David Rowley
On Tue, 2 Aug 2022 at 19:48, Ameya Bidwalkar wrote:
> We have a PostgreSQL 13 database where we have a single table with several
> million rows. We plan to partition it based on timestamp.
> We have been seeking advice on best practices for building this.
> This table will get lots of updates for the same rows during a short period
> of time. During this time the rows would be in a single partition.
> After this short time these rows would move to another partition, where no
> more updates take place on these rows, but there might be some SELECT
> queries running.
> We plan to have partitions based on months and then roll them up into a year
> and then archive these older partitions.
> One consultant we talked with told us this row movement between the
> partitions will have huge complications. But this was an issue during the
> Postgres 10 version.

Define "huge complications"?

The capabilities of partitioned tables have changed quite a bit since
the feature was added. It's very easy for knowledge to get out-dated
in this area.  I did quite a bit of work on them and I struggle to
remember off the top of my head which versions saw which improvements.
PG12 saw lots. See [1], search for "partition".

One possible complication is what is mentioned in [2] about
"serialization failure error".  UPDATEs that cause a tuple to move to
another partition can cause a serialization failure regardless of
transaction isolation level, not just in SERIALIZABLE transactions. If it's
not already, you might want to have your application retry transactions on
SQLSTATE 40001 errors.

Apart from that, assuming there's comparatively a small number of rows
in the partition being updated compared to the partition with the
static rows, then it sounds fairly efficient. As you describe it,  the
larger static partition is effectively INSERT only and auto-vacuum
will need to touch it only for tuple freezing work.  The smaller of
the two tables will receive more churn but will be faster to vacuum.
PG13 got a new feature that makes sure auto-vacuum also does the
rounds on INSERT-only tables too, so the static partition is not going
to be neglected until anti-wrap-around-autovacuums trigger, like they
would have in PG12 and earlier.

Another thing to consider is that an UPDATE of a non-partitioned table
has a chance at being a HOT update. That's possible if the tuple can
fit on the same page and does not update any of the indexed columns. A
HOT update means no indexes need to be updated so these perform faster
and require less space in WAL than a non-HOT update.  An UPDATE that
moves a tuple to another partition can never be a HOT update.  That's
something you might want to consider. If you're updating indexed
columns already then it's not a factor to consider. There's also
overhead to postgres having to find the partition for the newly
updated version of the tuple.  That's not hugely expensive, but it's
generally measurable. RANGE partitioned tables with a large number of
partitions will have the most overhead for this. HASH partitioned
tables, the least.

The best thing you can likely do is set up a scenario with pgbench and
compare the performance. pgbench is a pretty flexible tool that will
allow you to run certain queries X% of the time and even throttle the
workload at what you expect your production server to experience. You
could then run it overnight on a test server, or even for weeks and
see how auto-vacuum keeps up when compared to the non-partitioned
case. You can also check how much extra WAL is generated vs the
non-partitioned case.

> So we are seeking advice on the performance perspective and things we should 
> take care of along with manual vacuums on a regular schedule and indexing.
> Are there any tunables I should experiment with in particular ?

Perhaps if you want to keep a small high-churn table in check you might
want to consider whether autovacuum_naptime is set low enough. You may not
care if the space being consumed during the standard 1min
autovacuum_naptime is small enough not to be of concern.

David

[1] https://www.postgresql.org/docs/release/12.0/
[2] https://www.postgresql.org/docs/13/sql-update.html




Re: Catching up with performance & PostgreSQL 15

2022-11-29 Thread David Rowley
On Wed, 30 Nov 2022 at 03:31, Tom Lane  wrote:
>
> Alvaro Herrera  writes:
> > IMO it was a mistake to turn JIT on in the default config, so that's one
> > thing you'll likely want to change.
>
> I wouldn't necessarily go quite that far, but I do think that the
> default cost thresholds for invoking it are enormously too low,
> or else there are serious bugs in the cost-estimation algorithms
> for deciding when to use it.  A nearby example[1] of a sub-1-sec
> partitioned query that took 30sec after JIT was enabled makes me
> wonder if we're accounting correctly for per-partition JIT costs.

I'm very grateful for JIT. However, I do agree that the costs need to work.

The problem is that the threshold to turn JIT on does not consider how
many expressions need to be compiled. It's quite different to JIT
compile a simple one-node plan with a total cost of 10 than to JIT
compile a plan that costs the same but queries 1000 partitions. I
think we should be compiling expressions based on the cost of the
individual node rather than the total cost of the plan. We need to
make some changes so we can more easily determine the number of times
a given node will be executed before we can determine how worthwhile
JITting an expression in a node will be.
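
In the meantime, raising the JIT cost thresholds (rather than disabling JIT
entirely) is one way to work around it; the values below are only examples
to experiment with:

SET jit_above_cost = 500000;            -- default 100000
SET jit_optimize_above_cost = 2000000;  -- default 500000
SET jit_inline_above_cost = 2000000;    -- default 500000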

David

> [1] 
> https://www.postgresql.org/message-id/B6025887-D73F-4B5B-9925-4DA4B675F7E5%40elevated-dev.com




Re: wrong rows and cost estimation when generic plan

2022-12-05 Thread David Rowley
On Tue, 6 Dec 2022 at 18:28, James Pang (chaolpan)  wrote:
>   ->  Index Scan using idx_x_time on x  (cost=0.44..8.48 rows=1 width=2923) (actual time=8136.242..8136.242 rows=0 loops=1)
>         Index Cond: ((starttime = $7) AND (endtime = $8))
>         Filter: ((password IS NULL) AND ...(aid = $4) AND (bid = $5) AND (btype = $6) AND...
>         Rows Removed by Filter: 5534630

I wonder if you did:

create statistics x_starttime_endtime_stats  (ndistinct) on
starttime,endtime from x;
analyze x;

if the planner would come up with a higher estimate than what it's
getting for the above and cause it to use the other index instead.

> The optimizer is very complicated; could you point me to how the optimizer
> does selectivity estimation when building a generic plan in this case? For a
> custom plan the optimizer knows the boundparams values, but for a generic
> plan, planner() uses boundparams=NULL. Does it try to calculate an average
> value based on the MCV list of the index attributes (starttime, endtime)?

IIRC, generic plan estimates become based on distinct estimations
rather than histograms or MCVs.

David




Re: wrong rows and cost estimation when generic plan

2022-12-06 Thread David Rowley
On Tue, 6 Dec 2022 at 20:17, James Pang (chaolpan)  wrote:
>  Could you provide the function name for generic plan selectivity estimation?

If you look at eqsel_internal(), you'll see there are two functions
that it'll call var_eq_const() for Consts and otherwise
var_eq_non_const(). It'll take the non-Const path for planning generic
plans.

David




Re: DML sql execution time slow down PGv14 compared with PGv13

2022-12-15 Thread David Rowley
On Thu, 15 Dec 2022 at 21:12, James Pang (chaolpan)  wrote:
> We ran some load tests (DML inserts/deletes/updates on tens of hash
> partitioned tables) and found that PG v14 slowed down 10-15% compared with
> PG v13. Same test server, same schema, tables and data. From
> pg_stat_statements sql exec_time, we found mean_exec_time increased by
> 5%-25% for the same SQL statements. Both v14 and v13 give very fast SQL
> response times; we are just comparing the %diff of mean_exec_time between
> the SQL statements.

I tried this out on the tip of the PG13 and PG14 branch with the same
scale of pgbench as you mentioned and I don't see the same slowdown as
you do.

PG13:
tps = 1711.980109 (excluding connections establishing)

PG14:
tps = 1736.466835 (without initial connection time)

As for why yours might be slower.  You might want to have a look at
the EXPLAIN ANALYZE output for the UPDATE statements. You can recreate
the -M prepared by using PREPARE and EXECUTE. You might want to
execute the statements 6 times and see if the plan changes on the 6th
execution.  It's likely not impossible that PG14 is using custom
plans, whereas PG13 might be using generic plans for these updates.
There were some quite significant changes made to the query planner in
PG14 that changed how planning works for UPDATEs and DELETEs from
partitioned tables.  Perhaps there's some reason there that the
custom/generic plan choice might differ. I see no reason why INSERT
would have become slower. Both the query planning and the execution are
very different for INSERT.
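
For example, reproducing the "-M prepared" behaviour by hand might look
roughly like this (the statement is just an illustrative pgbench-style
update; substitute your actual workload's SQL):

PREPARE upd (int, int) AS
  UPDATE pgbench_accounts SET abalance = abalance + $2 WHERE aid = $1;

-- Run this six or more times; the plan chosen from the 6th execution
-- onwards is where a custom vs generic plan difference would show up.
EXPLAIN (ANALYZE) EXECUTE upd(1, 10);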

You might also want to have a look at what perf says. If you have the
debug symbols installed, then you could just watch "perf top --pid=". Maybe that will show you
something interesting.




Re: Postgres12 looking for possible HashAggregate issue workarounds?

2022-12-18 Thread David Rowley
On Sun, 18 Dec 2022 at 23:44, João Paulo Luís  wrote:
> Meanwhile, as a one-time workaround I've disabled the hashagg algorithm,

The way the query planner determines if Hash Aggregate's hash table
will fit in work_mem or not is based on the n_distinct estimate of the
columns being grouped on.  You may want to review what analyze set
n_distinct to on this table. That can be done by looking at:

select attname,n_distinct from pg_Stats where tablename =
'sentencesource' and attname = 'sentence';

If what that's set to does not seem realistic, then you can overwrite this with:

ALTER TABLE sentencesource ALTER COLUMN sentence SET (n_distinct = N);

Please see the paragraph in [1] about n_distinct.  Using an absolute
value is likely not a great idea if the table is going to grow. You
could maybe give it a better estimate about how many times values are
repeated by setting some negative value, as described in the
documents. You'll need to analyze the table again after changing this
setting.

David

[1] https://www.postgresql.org/docs/12/sql-altertable.html




Re: Getting an index scan to be a parallel index scan

2023-02-01 Thread David Rowley
On Wed, 1 Feb 2023 at 18:39, Alex Kaiser  wrote:
> postgres=# set force_parallel_mode = on;

There's been a bit of debate about that GUC and I'm wondering how you
came to the conclusion that it might help you. Can you share details
of how you found out about it and what made you choose to set it to
"on"?

David




Re: Getting an index scan to be a parallel index scan

2023-02-01 Thread David Rowley
On Thu, 2 Feb 2023 at 14:49, Thomas Munro  wrote:
> If I had more timerons myself, I'd like to try to make parallel
> function scans, or parallel CTE scans, work...

I've not really looked in detail but I thought parallel VALUES scan
might be easier than those two.

David




Re: Window Functions & Table Partitions

2023-02-08 Thread David Rowley
On Thu, 9 Feb 2023 at 10:45, Benjamin Tingle  wrote:
> Basically- window partition functions don't take advantage of existing table 
> partitions. I use window functions as a more powerful GROUP BY clause that 
> preserves row-by-row information- super handy for a lot of things.
>
> In particular, I want to use window functions on already partitioned tables, 
> like the below example:
>
> create table abb (a int, b int, g int) partition by hash(b)
> /* populate table etc... */
> select a, b, min(a) over (partition by b) as g from abb
>
> Ideally with a query plan like this:
>
> Window:
> Append:
> Sort on table_p0
> Sort on table_p1
> Sort on table_p2

There was some effort [1] in version 12 to take advantage of the order
defined by the partitioning scheme. The release notes [2] mention:

"Avoid sorting when partitions are already being scanned in the necessary order"

However, it's not 100% of what you need as there'd have to be a btree
index on abb(b) for the planner to notice.

Likely this could be made better so that add_paths_to_append_rel()
added the pathkeys defined by the partitioned table into
all_child_pathkeys if they didn't exist already. In fact, I've
attached a very quickly hacked together patch against master to do
this.  I've given it very little thought and it comes complete with
failing regression tests.

If you're interested in pursuing this then feel free to take the patch
to the pgsql-hackers mailing list and propose it. It's unlikely I'll
get time to do that for a while, but I will keep a branch locally with
it to remind me in case I do at some point in the future.

David

[1] 
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=959d00e9dbe4cfcf4a63bb655ac2c29a5e579246
[2] https://www.postgresql.org/docs/release/12.0/
diff --git a/src/backend/optimizer/path/allpaths.c b/src/backend/optimizer/path/allpaths.c
index ae0f9bdc8a..c4271c9179 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -1365,6 +1365,27 @@ add_paths_to_append_rel(PlannerInfo *root, RelOptInfo *rel,
 		}
 	}
 
+	if (rel->part_scheme != NULL && IS_SIMPLE_REL(rel) &&
+		partitions_are_ordered(rel->boundinfo, rel->live_parts))
+	{
+		List	   *partition_pathkeys;
+		bool		partial;
+
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  ForwardScanDirection,
+													  &partial);
+
+		if (!partial)
+			all_child_pathkeys = lappend(all_child_pathkeys, partition_pathkeys);
+
+		partition_pathkeys = build_partition_pathkeys(root, rel,
+													  BackwardScanDirection,
+													  &partial);
+
+		if (!partial)
+			all_child_pathkeys = lappend(all_child_pathkeys, partition_pathkeys);
+	}
+
 	/*
 	 * Collect lists of all the available path orderings and
 	 * parameterizations for all the children.  We use these as a


Re: Window Functions & Table Partitions

2023-02-20 Thread David Rowley
On Fri, 10 Feb 2023 at 06:40, Benjamin Tingle  wrote:
> Thanks for the helpful response david! I'll have a shot at getting the patch 
> to work myself & submitting to pgsql-hackers.

I took some time today for this and fixed up a few mistakes in the
patch and added it to the March commitfest [1].  Time is ticking away
for v16, so given this is a fairly trivial patch, I thought it might
be nice to have it.

Any discussion on the patch can be directed at [2]

David

[1] https://commitfest.postgresql.org/42/4198/
[2] 
https://www.postgresql.org/message-id/flat/caaphdvojkdbr3mr59jxmacybyhb6q_5qpru+dy93en8wm+x...@mail.gmail.com

> Ben
>
> On Wed, Feb 8, 2023 at 2:36 PM David Rowley  wrote:
>>
>> On Thu, 9 Feb 2023 at 10:45, Benjamin Tingle  wrote:
>> > Basically- window partition functions don't take advantage of existing 
>> > table partitions. I use window functions as a more powerful GROUP BY 
>> > clause that preserves row-by-row information- super handy for a lot of 
>> > things.
>> >
>> > In particular, I want to use window functions on already partitioned 
>> > tables, like the below example:
>> >
>> > create table abb (a int, b int, g int) partition by hash(b)
>> > /* populate table etc... */
>> > select a, b, min(a) over (partition by b) as g from abb
>> >
>> > Ideally with a query plan like this:
>> >
>> > Window:
>> > Append:
>> > Sort on table_p0
>> > Sort on table_p1
>> > Sort on table_p2
>>
>> There was some effort [1] in version 12 to take advantage of the order
>> defined by the partitioning scheme. The release notes [2] mention:
>>
>> "Avoid sorting when partitions are already being scanned in the necessary 
>> order"
>>
>> However, it's not 100% of what you need as there'd have to be a btree
>> index on abb(b) for the planner to notice.
>>
>> Likely this could be made better so that add_paths_to_append_rel()
>> added the pathkeys defined by the partitioned table into
>> all_child_pathkeys if they didn't exist already. In fact, I've
>> attached a very quickly hacked together patch against master to do
>> this.  I've given it very little thought and it comes complete with
>> failing regression tests.
>>
>> If you're interested in pursuing this then feel free to take the patch
>> to the pgsql-hackers mailing list and propose it. It's unlikely I'll
>> get time to do that for a while, but I will keep a branch locally with
>> it to remind me in case I do at some point in the future.
>>
>> David
>>
>> [1] 
>> https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=959d00e9dbe4cfcf4a63bb655ac2c29a5e579246
>> [2] https://www.postgresql.org/docs/release/12.0/
>
>
>
> --
>
> Ben(t).




Re: multicolumn partitioning help

2023-03-15 Thread David Rowley
On Wed, 15 Mar 2023 at 10:41, Laurenz Albe  wrote:
> I think you want subpartitioning, like
>
>   CREATE TABLE humans (
>  hash bytea,
>  fname text,
>  dob date
>   ) PARTITION BY LIST (EXTRACT (YEAR FROM dob));

This may be perfectly fine, but it is also important to highlight that
partitioning in this way may hinder partition pruning.

If the first level partitioned table was to be BY RANGE (dob); then
the partitions could be defined like FOR VALUES FROM ('2023-01-01') TO
('2024-01-01'). For a query that had something like WHERE dob =
'2023-03-16', then PostgreSQL could prune away all the partitions for
the other years.  The same wouldn't occur if the table was partitioned
by LIST (EXTRACT (YEAR FROM dob)) unless you added an AND EXTRACT (YEAR
FROM dob) = 2023 to the query's WHERE clause.
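
For comparison, the RANGE layout described above would look something like
this (table and columns taken from Laurenz's example; the partition name is
invented):

CREATE TABLE humans (
   hash bytea,
   fname text,
   dob date
) PARTITION BY RANGE (dob);

CREATE TABLE humans_2023 PARTITION OF humans
   FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

-- WHERE dob = '2023-03-16' can now prune every other year's partition.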

Robert, there are a few tips about partitioning in [1] that you may
wish to review.

David

[1] https://www.postgresql.org/docs/devel/ddl-partitioning.html




Re: multicolumn partitioning help

2023-03-15 Thread David Rowley
On Thu, 16 Mar 2023 at 00:47, James Robertson  wrote:
> or do we get?
>
> TopLevelTable
> |
> |> worker-thread 1 (default catch)
> |> worker thread 2 -> sub-table 1.1
> |> worker thread 3 -> sub-table 1.2
> |> worker thread 4 -> sub-table 1.n
> |
> |> worker-thread 5 (default catch)
> |> worker thread 6 -> sub-table 2.1
> |> worker thread 7 -> sub-table 2.2
> |> worker thread 8 -> sub-table 2.n

The planner generally flattens out the scans to each partition into a
single Append or MergeAppend node.  Append nodes can be parallelised.
Assuming there's no reason that a particular partition can't support
it, the parallel workers can be distributed to assist without
restriction to which partition they help with. Multiple workers can
even help with a single partition. Workers can then move over to
helping with other partitions when they're done with the partition
they've been working on. I believe some other databases do or did at
least restrict parallelism to 1 worker per partition (which perhaps is
why you raised this).  There's no such restriction with PostgreSQL.

> Summary:
> 1) if we create nested partitions, do we create performance issues:

If you create too many partitions, it can create performance issues.
You should look at the partitioning best practices section of the
documents for details about that. I recommend a careful read of those.

> 2) if nested partitions are the solutions, what is the point of multi-column 
> partitioning?

There are many reasons.  If you have multiple levels of partitioning,
then the partition pruning done during planning is going to have more
work to do as it'll be executed once, then once again for each
partitioned table remaining after running it for the first level.
Also, it seems much easier to PARTITION BY HASH(a,b) than to first do
HASH(a) then another level to HASH(b).  However, there may be
advantages to having multiple levels here as the planner would still
be able to prune partitions if the WHERE clause didn't contain any
quals like "b = ".  The key take away here is that they're
different, so we support both.
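
A quick sketch of the two shapes being compared (column names are just
placeholders):

-- Multi-column: one level; pruning needs quals on both a and b
CREATE TABLE t1 (a int, b int) PARTITION BY HASH (a, b);

-- Multi-level: pruning at the first level only needs a qual on a
CREATE TABLE t2 (a int, b int) PARTITION BY HASH (a);
CREATE TABLE t2_p0 PARTITION OF t2
  FOR VALUES WITH (MODULUS 4, REMAINDER 0)
  PARTITION BY HASH (b);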

> wish list) wouldn't it be neat if we could do multi-mode multi-column? like
> PARTITION BY RANGE (EXTRACT(YEAR FROM dob)) LIST (SUBSTRING(hash, 1, 1));

Effectively, multi-level partitioning gives you that; it's just that the
DDL is different from how you wrote it.

David




Re: speeding up grafana sensor-data query on raspberry pi 3

2023-04-16 Thread David Rowley
On Mon, 17 Apr 2023 at 05:00, Clemens Eisserer  wrote:
> Example:
> table smartmeter with non-null column ts (timestamp with time zone)
> and a BRIN index on ts, no pk to avoid a btree index.
> Sensor values are stored every 5s, so for 1 month there are about 370k
> rows - and in total the table currently holds about 3M rows.
> The query to display the values for 1 month takes ~3s, with the bitmap
> heap scan as well as aggregation taking up most of the time, with
> sorting in between.

I know you likely don't have much RAM to spare here, but more work_mem
might help, even just 16MBs might be enough. This would help the Sort
and to a lesser extent the Bitmap Heap Scan too.

Also, if you'd opted to use PostgreSQL 14 or 15, then you could have
performed CREATE STATISTICS on your GROUP BY clause expression and
then run ANALYZE.  That might cause the planner to flip to a Hash
Aggregate which would eliminate the Sort before aggregation.  You'd
only need to sort 236 rows after the Hash Aggregate for the ORDER BY.
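
A sketch of both suggestions (the work_mem value is arbitrary, and the
GROUP BY expression is assumed to be a date_trunc over ts since the query
itself isn't shown here):

SET work_mem = '16MB';

-- PostgreSQL 14+ only: extended statistics on the GROUP BY expression
CREATE STATISTICS smartmeter_ts_hour_stats
  ON (date_trunc('hour', ts)) FROM smartmeter;
ANALYZE smartmeter;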

Plus, what Justin said.

David




Re: High QPS, random index writes and vacuum

2023-04-17 Thread David Rowley
On Tue, 18 Apr 2023 at 12:35, peter plachta  wrote:
> I increased work_mem to 2Gb

maintenance_work_mem is the configuration option that vacuum uses to
control how much memory it'll make available for storage of dead
tuples. I believe 1GB would allow 178,956,970 tuples to be stored
before multiple passes would be required. The chunk of memory for dead
tuple storage is capped at 1GB.

David




Re: Partitioning update-heavy queue with hash partitions vs partial indexes

2023-08-10 Thread David Rowley
On Thu, 10 Aug 2023 at 20:36, Dorian Hoxha  wrote:
> > Do Not Use Partial Indexes as a Substitute for Partitioning
> > While a search in this larger index might have to descend through a couple 
> > more tree levels than a search in a smaller index, that's almost certainly 
> > going to be cheaper than the planner effort needed to select the 
> > appropriate one of the partial indexes. The core of the problem is that the 
> > system does not understand the relationship among the partial indexes, and 
> > will laboriously test each one to see if it's applicable to the current 
> > query.
>
> Would this be true in my case too?

Yes.  The process of determining which partial indexes are valid for
the given query must consider each index one at a time and validate
the index's WHERE clause against the query's WHERE clause to see if it
can be used.  There is no shortcut that sees you have a series of
partial indexes with WHERE id % 10 = N and just picks 1 index
without searching all of them.

> Is it faster for the planner to select a correct partition(hash partitioning 
> on `id` column) instead of a correct partial index like in my case? I don't 
> think I'll need more than ~32 partitions/partial-indexes in an extreme 
> scenario.

I mean, test it and find out, but probably, yes, the partition pruning
code for hash partitioning is an O(1) operation and is very fast.
Once the given Constants have been hashed, finding the partition is
just a single divide operation away.
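
For example, in a hash-partitioned layout like the following (names
invented for illustration), a query with WHERE id = 42 lets the planner
hash the constant and open exactly one partition:

CREATE TABLE queue (
  id bigint NOT NULL,
  payload text
) PARTITION BY HASH (id);

CREATE TABLE queue_p0 PARTITION OF queue FOR VALUES WITH (MODULUS 32, REMAINDER 0);
CREATE TABLE queue_p1 PARTITION OF queue FOR VALUES WITH (MODULUS 32, REMAINDER 1);
-- ... and so on up to REMAINDER 31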

David




Re: Slow query, possibly not using index

2023-08-28 Thread David Rowley
On Mon, 28 Aug 2023 at 19:21, Les  wrote:
> A more important question is, how can I find out why the index was not
> autovacuumed.

You should have a look at pg_stat_user_tables.  It'll let you know if
the table is being autovacuumed and how often.  If you're concerned
about autovacuum not running properly, then you might want to lower
log_autovacuum_min_duration.  Generally, anything that takes a
conflicting lock will cause autovacuum to cancel so that the
conflicting locker can get through.  Things like ALTER TABLE or even
an ANALYZE running will cancel most autovacuum runs on tables.

Also, this is a fairly large table and you do have the standard
autovacuum settings.  Going by pgstattuple, the table has 39652836
tuples.  Autovacuum will trigger when the statistics indicate that 20%
of tuples are dead, which is about 8 million tuples. Perhaps that's
enough for the index scan to have to skip over a large enough number
of dead tuples to make it slow. You might want to consider lowering
the autovacuum scale factor for this table.
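
Lowering the scale factor for just this table could look like this (the
value is only an example; pick something that suits your churn rate):

ALTER TABLE media.block SET (autovacuum_vacuum_scale_factor = 0.02);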

Also, ensure you're not doing anything like calling pg_stat_reset();

It might be worth showing us the output of:

select * from pg_stat_user_tables where relid = 'media.block'::regclass;

David




Re: Range partitioning query performance with date_trunc (vs timescaledb)

2023-08-29 Thread David Rowley
On Tue, 29 Aug 2023 at 19:40, Philippe Pepiot  wrote:
> I'm trying to implement some range partitioning on timeseries data. But it
> looks like some queries involving date_trunc() don't make use of partitioning.
>
> BEGIN;
> CREATE TABLE test (
> time TIMESTAMP WITHOUT TIME ZONE NOT NULL,
> value FLOAT NOT NULL
> ) PARTITION BY RANGE (time);
> CREATE INDEX test_time_idx ON test(time DESC);
> CREATE TABLE test_y2010 PARTITION OF test FOR VALUES FROM ('2020-01-01') TO 
> ('2021-01-01');
> CREATE TABLE test_y2011 PARTITION OF test FOR VALUES FROM ('2021-01-01') TO 
> ('2022-01-01');
> CREATE VIEW vtest AS SELECT DATE_TRUNC('year', time) AS time, SUM(value) AS 
> value FROM test GROUP BY 1;
> EXPLAIN (COSTS OFF) SELECT * FROM vtest WHERE time >= TIMESTAMP '2021-01-01';
> ROLLBACK;
>
> The plan query all partitions:

> I wonder if there is a way with a reasonable amount of SQL code to achieve 
> this
> with vanilla postgres ?

The only options I see for you are

1) partition by LIST(date_trunc('year', time)), or;
2) use a set-returning function instead of a view and pass the date
range you want to select from the underlying table via parameters.

I imagine you won't want to do #1. However, it would at least also
allow the aggregation to be performed before the Append if you SET
enable_partitionwise_aggregate=1.

#2 isn't as flexible as a view as you'd have to create another
function or expand the parameters of the existing one if you want to
add items to the WHERE clause.

Unfortunately, date_trunc is just a black box to partition pruning, so
it's not able to determine that DATE_TRUNC('year', time) >=
'2021-01-01'  is the same as time >= '2021-01-01'.  It would be
possible to make PostgreSQL do that, but that's a core code change,
not something that you can do from SQL.
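
For reference, option #1 might look something like this (just a sketch, a
variant of the example table above):

CREATE TABLE test (
    time TIMESTAMP WITHOUT TIME ZONE NOT NULL,
    value FLOAT NOT NULL
) PARTITION BY LIST (date_trunc('year', time));

CREATE TABLE test_y2020 PARTITION OF test FOR VALUES IN ('2020-01-01');
CREATE TABLE test_y2021 PARTITION OF test FOR VALUES IN ('2021-01-01');

-- queries that filter on the partition key expression itself should be prunable:
EXPLAIN (COSTS OFF)
SELECT date_trunc('year', time), sum(value)
FROM test
WHERE date_trunc('year', time) >= TIMESTAMP '2021-01-01'
GROUP BY 1;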

David




Re: Queries containing ORDER BY and LIMIT started to work slowly

2023-08-30 Thread David Rowley
On Thu, 31 Aug 2023 at 06:32, Rondat Flyag  wrote:
> I tried VACUUM ANALYZE for three tables, but without success. I also tried to 
> set enable_seqscan=off and the query took even more time. If I set 
> enable_sort=off then the query takes a lot of time and I cancel it.
>
> Please see the attached query plans.

It's a little hard to comment here as I don't see what the plan was
before when you were happy with the performance. I also see the
queries you mentioned in the initial email don't match the plans.
There's no table called "isbns" in the query. I guess this is "asins"?

Likely you could get a faster plan if there was an index on
asins_statistics (asin_id).  That would allow a query plan that scans
the isbns_value_key index and performs a parameterised nested loop on
asins_statistics using the asins_statistics (asin_id) index.  Looking
at your schema, I don't see that index, so it's pretty hard to guess
why the plan used to be faster.  Even if the books/asins merge join
used to take place first, there'd have been no efficient way to join
to the asins_statistics table and preserve the Merge Join's order (I'm
assuming non-parameterized nested loops would be inefficient in this
case). Doing that would have also required the asins_statistics
(asin_id) index.  Are you sure that index wasn't dropped?

However, likely it's a waste of time to try to figure out what the
plan used to be. Better to focus on trying to make it faster. I
suggest you create the asins_statistics (asin_id) index. However, I
can't say with any level of confidence that the planner would opt to
use that index if it did exist.   Lowering random_page_cost or
increasing effective_cache_size would increase the chances of that.
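
For example (the index name is illustrative, and the setting values are only a
starting point to experiment with in a single session first):

CREATE INDEX asins_statistics_asin_id_idx ON asins_statistics (asin_id);

-- values are illustrative only; test before changing anything globally
SET random_page_cost = 1.1;
SET effective_cache_size = '8GB';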

David




Re: Planning time is time-consuming

2023-09-11 Thread David Rowley
On Mon, 11 Sept 2023 at 18:16, Laurenz Albe  wrote:
> Also, there are quite a lot of indexes on "test_db_bench_1".  On a test 
> database, drop some
> indexes and see if that makes a difference.

Yeah, I count 3 that either have the key columns as some prefix of
another index or are just a duplicate of some other index.

Getting rid of those 3 will save some time in create_index_paths().

David




Re: Planning time is time-consuming

2023-09-11 Thread David Rowley
On Mon, 11 Sept 2023 at 21:54, Mikhail Balayan  wrote:
> Could it be a regression? I'll check it on PG14 when I get a chance.

I'm not sure if you're asking for help here because you need planning
to be faster than it currently is, or if it's because you believe that
planning should always be faster than execution. If you think the
latter, then you're mistaken. It seems to me that the complexity of
planning this query is much more complex than executing it.  The outer
side of the inner-most nested loop finds 0 rows, so it need not scan
the inner side, which results in that nested loop producing 0 rows,
therefore the outer side of none of the subsequent nested loops find
any rows. This is why you see "(never executed)" in the EXPLAIN
ANALYZE.

You could use perf record or perf top to dig into what's slow.

On the other hand, please report back if you find PG14 to be much faster here.

You could also experiment with a set of tables which are empty.  It's
possible getting the relation sizes are a factor to consider here.
mdnblocks() needs to do a bit more work when the relation has multiple
segments.

David




Re: Planning time is time-consuming

2023-09-11 Thread David Rowley
On Tue, 12 Sept 2023 at 02:27, Tom Lane  wrote:
>
> David Rowley  writes:
> > I'm not sure if you're asking for help here because you need planning
> > to be faster than it currently is, or if it's because you believe that
> > planning should always be faster than execution. If you think the
> > latter, then you're mistaken.
>
> Yeah.  I don't see anything particularly troubling here.  Taking
> circa three-quarters of a millisecond (on typical current hardware)
> to plan a four-way join on large tables is not unreasonable.

I took a few minutes to reverse engineer the tables in question (with
assistance from an AI bot) and ran the query in question.
Unsurprisingly, I also see planning as slower than execution, but with
planning about 12x slower than execution here, vs the reported ~18x.

Planning Time: 0.581 ms
Execution Time: 0.048 ms

Nothing alarming in perf top of executing the query in pgbench with -M
simple.  I think this confirms the problem is just with expectations.

   5.09%  postgres  [.] AllocSetAlloc
   2.99%  postgres  [.] SearchCatCacheInternal
   2.52%  postgres  [.] palloc
   2.38%  postgres  [.] expression_tree_walker_impl
   1.82%  postgres  [.] add_path_precheck
   1.78%  postgres  [.] add_path
   1.73%  postgres  [.] MemoryContextAllocZeroAligned
   1.63%  postgres  [.] base_yyparse
   1.61%  postgres  [.] CatalogCacheComputeHashValue
   1.38%  postgres  [.] try_nestloop_path
   1.36%  postgres  [.] stack_is_too_deep
   1.33%  postgres  [.] add_paths_to_joinrel
   1.19%  postgres  [.] core_yylex
   1.18%  postgres  [.] lappend
   1.15%  postgres  [.] initial_cost_nestloop
   1.13%  postgres  [.] hash_search_with_hash_value
   1.01%  postgres  [.] palloc0
   0.95%  postgres  [.] get_memoize_path
   0.90%  postgres  [.] equal
   0.88%  postgres  [.] get_eclass_for_sort_expr
   0.81%  postgres  [.] compare_pathkeys
   0.80%  postgres  [.] bms_is_subset
   0.77%  postgres  [.] ResourceArrayRemove
   0.77%  postgres  [.] check_stack_depth
   0.77%  libc.so.6 [.] __memmove_avx_unaligned_erms
   0.74%  libc.so.6 [.] __memset_avx2_unaligned
   0.73%  postgres  [.] AllocSetFree
   0.71%  postgres  [.] final_cost_nestloop
   0.69%  postgres  [.] compare_path_costs_fuzzily
   0.68%  postgres  [.] initial_cost_mergejoin
   0.64%  libc.so.6 [.] __memset_avx2_unaligned_erms
   0.61%  postgres  [.] create_nestloop_path
   0.61%  postgres  [.] examine_variable
   0.59%  postgres  [.] hash_bytes
   0.56%  postgres  [.] truncate_useless_pathkeys
   0.56%  postgres  [.] bms_overlap

David




Re: [EXTERNAL] Re: Performance down with JDBC 42

2023-11-05 Thread David Rowley
On Mon, 6 Nov 2023 at 08:37, Abraham, Danny  wrote:
>
> Both plans refer to the same DB.

JDBC is making use of prepared statements, whereas psql, unless you're
explicitly using PREPARE, is not.

> #1 – Fast – using psql or old JDBC driver

The absence of any $1 type parameters here shows that's a custom plan
that's planned specifically using the parameter values given.

> Slow – when using JDBC 42

Because this query has $1, $2, etc, that's a generic plan. When
looking up statistics, histogram bounds and MCV slots cannot be
checked; only ndistinct is used. If you have a skewed dataset, then
this might not be very good.

You might find things run better if you adjust postgresql.conf and set
plan_cache_mode = force_custom_plan then select pg_reload_conf();

Please also check the documentation so that you understand the full
implications for that.
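
For example, this could be tried in a single session first, and then applied
server-wide with ALTER SYSTEM (which writes postgresql.auto.conf) rather than
editing postgresql.conf by hand:

SET plan_cache_mode = force_custom_plan;   -- affects the current session only

ALTER SYSTEM SET plan_cache_mode = force_custom_plan;
SELECT pg_reload_conf();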

David




Re: I don't understand that EXPLAIN PLAN timings

2024-01-23 Thread David Rowley
On Tue, 23 Jan 2024 at 20:45, Jean-Christophe Boggio
 wrote:
> explain says actual time between 1.093→1.388 but final execution time says 
> 132.880ms?!?

The 1.388 indicates the total time spent in that node starting from
just before the node was executed for the first time up until the node
returned no more tuples.

The 132.88 ms is the total time spent to start up the plan, which
includes doing things like obtaining locks (if not already obtained
during planning), opening files and allocating memory for various
things needed for the plan.  After the top-level node returns NULL it
then will print out the plan before shutting the plan down. This
shutdown time is also included in the 132.88 ms.

I don't know where exactly the extra time is spent with your query,
but it must be in one of the operations that I mentioned which takes
place either before the top node is first executed or after the top
node returns NULL.

If you're using psql, if you do \timing on, how long does EXPLAIN take
without ANALYZE? That also goes through executor startup and shutdown.
It just skips the running the executor part.

David




Re: I don't understand that EXPLAIN PLAN timings

2024-01-25 Thread David Rowley
On Fri, 26 Jan 2024 at 02:31, Jean-Christophe Boggio
 wrote:
> You are absolutely correct : the EXPLAIN without ANALYZE gives about the same 
> results. Also, minimizing the amount of workmem in postgresql.conf changes 
> drastically the timings. So that means memory allocation is eating up a lot 
> of time _PER_QUERY_ ?

We do reuse malloc'd blocks when creating memory contexts, but I
believe only for 1kB and 8kB blocks.  That likely allows most small
allocations in the executor to be done without malloc.  I'm speaking in
vague terms as I don't have the exact numbers to hand, but larger
allocations will go directly to malloc.

There was a bug fixed in [1] that did cause behaviour like this, but
you seem to be on 14.10 which will have that fix.  Also, the 2nd plan
you sent has no Memoize nodes.

I do wonder now if it was a bad idea to make Memoize build the hash
table on plan startup rather than delaying that until we fetch the
first tuple. I see Hash Join only builds its table during executor
run.

> Since we have quite some RAM on our machines, I dedicated as much as possible 
> to workmem (initially I was allocating 1GB) but this looks quite 
> counterproductive (I didn't think that memory was allocated every time, I 
> thought it was "available" for the current query but not necessarily used). 
> Is this an issue specific to that version of PostgreSQL? (I guess no) Or can 
> this be hardware-related? Or OS-related (both systems on which I have done 
> tests are running Ubuntu, I will try on Debian)?

It would be good to narrow down which plan node is causing this.  Can
you try disabling various planner enable_* GUCs before running EXPLAIN
(SUMMARY ON) <your query> with \timing on and see if you can find
which enable_* GUC causes the EXPLAIN to run more quickly?  Just watch
out for variations in the timing of "Planning Time:". You're still
looking for a large portion of time not accounted for by planning
time.

I'd start with:

SET enable_memoize=0;
EXPLAIN (SUMMARY ON) <your query>;
RESET enable_memoize;

SET enable_hashjoin=0;
EXPLAIN (SUMMARY ON) <your query>;
RESET enable_hashjoin;

The following will show others that you could try.
select name,setting from pg_settings where name like 'enable%';

David

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=1e731ed12aa




Re: I don't understand that EXPLAIN PLAN timings

2024-01-25 Thread David Rowley
On Fri, 26 Jan 2024 at 17:23, Jean-Christophe Boggio
 wrote:
> Let me know if I can do anything to provide you with more useful
> benchmark. The DB is still very small so it is easy to do tests.

What I was looking to find out was if there was some enable_* GUC that
you could turn off that would make the unaccounted time that you were
complaining about go away.

Because it seems when you reduce work_mem this unaccounted for time
goes away, it makes me think that some executor node is allocating a
load of memory during executor startup.  I was hoping to find out
which node is the offending one by the process of elimination.

Are you still seeing this unaccounted for time with your notebook?
i.e. the relatively high "Execution Time" reported by EXPLAIN ANALYZE
and low total actual execution time on the plan's top-level node.

I probably didn't need to mention the planning time as it seems
unlikely that disabling an enable* GUC would result in increased
planning time. However, it does not seem impossible that that *could*
happen.

David




Re: Slow query in table where many rows were deleted. VACUUM FULL fixes it

2024-01-30 Thread David Rowley
On Wed, 31 Jan 2024 at 09:09, Philip Semanchuk
 wrote:
> So in your case those 5m rows that you deleted were probably still clogging 
> up your table until you ran VACUUM FULL.

It seems more likely to me that the VACUUM removed the rows and just
left empty pages in the table.  Since there's no index on expires_at,
the only way to answer that query is to Seq Scan and Seq Scan will
need to process those empty pages.  While that processing is very fast
if the page's item pointers array is empty, it could still be slow if
the page needs to be read from disk. Laurenz's request for the explain
(analyze, buffers) output with track_io_timing on will help confirm
this.

If it is just reading empty pages that's causing this issue then
adding that missing index would improve the situation after running
just plain VACUUM each time there's a bulk delete.
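
A sketch of that (the table name here is a placeholder, since it isn't shown
in this part of the thread):

-- my_table is a placeholder name
CREATE INDEX ON my_table (expires_at);

-- after future bulk deletes, plain VACUUM (rather than VACUUM FULL) should be enough:
VACUUM my_table;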

David




Table Partitioning and Indexes Performance Questions

2024-02-29 Thread David Kelly
I was told that partitioned table indexes must always start with the
partition key columns.

Is this always the case or does it depend on use case? When would you want
to create indexes in this way?

The documentation just mentions that it is not strictly necessary but can be
helpful. My understanding is partitions behave like normal tables. Each
gets its own index. So, I'd expect the reasoning behind creating the
index on the partition to be the same as if it were just a normal table
(assuming it has the same subset of data as the individual partition). Is
this a correct understanding?

Any other performance considerations when it comes to partitioned table
indexing? Specifically, partitioning by range where the range is a single
value.


Re: Plan selection based on worst case scenario

2024-05-29 Thread David Rowley
On Thu, 30 May 2024 at 13:03, Darwin O'Connor  wrote:
> Is there a PostgreSQL setting that can control how it judges plans?

There's nothing like that, unfortunately.

> Here is a recent example of a query that finds the last time at a stop 
> filtered for a certain route it has to look up another table to find. 
> PostgreSQL initially chose the plan that cost "37357.45..37357.45" rather 
> than the one that cost "1.15..61088.32".
>
> transsee=# explain analyze select t.time+coalesce(t.timespent,0)*interval '1 
> second' from trackstopscurr t join tracktrip r on r.a=0 and r.route='501' and 
> r.id=t.trackid and t.stopid='4514' order by t.time desc limit 1;

>  Limit  (cost=37357.45..37357.45 rows=1 width=16) (actual 
> time=2667.674..2694.047 rows=1 loops=1)

>->  Nested Loop  (cost=182.60..36357.34 rows=1 width=16) 
> (actual time=381.913..2659.412 rows=1342 loops=3)
>  ->  Parallel Bitmap Heap Scan on trackstopscurr t  
> (cost=182.03..19048.63 rows=2014 width=14) (actual time=380.467..1231.788 
> rows=8097 loops=3)
>Recheck Cond: ((stopid)::text = '4514'::text)
>Heap Blocks: exact=8103
>->  Bitmap Index Scan on trackstopscurr_2  
> (cost=0.00..180.82 rows=4833 width=0) (actual time=382.653..382.653 
> rows=24379 loops=1)
>  Index Cond: ((stopid)::text = '4514'::text)
>  ->  Index Scan using tracktrip_0 on tracktrip r  
> (cost=0.57..8.59 rows=1 width=4) (actual time=0.175..0.175 rows=0 loops=24290)
>Index Cond: (id = t.trackid)
>Filter: ((a = 0) AND ((route)::text = '501'::text))
>Rows Removed by Filter: 1
>  Planning Time: 0.228 ms
>  Execution Time: 2694.077 ms

The problem here is primarily down to the poor estimates for the scan
on tracktrip.  You can see that the Nested Loop estimates 1 row, so
therefore the LIMIT costing code thinks LIMIT 1 will require reading
all rows, all 1 of them.  In which case that's expected to cost
36357.34, which is cheaper than the other plan which costs 61088.32 to
get one row.

If you can fix the row estimate to even estimate 2 rows rather than 1,
then it'll choose the other plan.  An estimate of 2 rows would mean
the total cost of the best path after the limit would be 61088.32 / 2
= 30544.16, which is cheaper than the 36357.34 of the bad plan.

You could try ANALYZE on tracktrip, or perhaps increasing the
statistics targets on the columns being queried here.
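
For example, the per-column statistics targets could be raised like this
(1000 is just an illustrative value; the default is 100):

ALTER TABLE tracktrip ALTER COLUMN route SET STATISTICS 1000;
ALTER TABLE tracktrip ALTER COLUMN a SET STATISTICS 1000;
ANALYZE tracktrip;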

If there's a correlation between the "a" and "route" columns then you
might want to try CREATE STATISTICS:

CREATE STATISTICS ON a,route FROM tracktrip;
ANALYZE tracktrip;

David




Re: a lot of shared buffers hit when planning for a simple query with primary access path

2024-07-01 Thread David Rowley
On Mon, 1 Jul 2024 at 21:45, James Pang  wrote:
>Buffers: shared hit=110246 <<< here planning need access a lot of 
> buffers
>  Planning Time: 81.850 ms
>  Execution Time: 0.034 ms
>
>could you help why planning need a lot of shared buffers access ?

Perhaps you have lots of bloat in your system catalogue tables. That
could happen if you make heavy use of temporary tables. There are many
other reasons too. It's maybe worth doing some vacuum work on the
catalogue tables.
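
For example, something along these lines (run as a superuser; pg_class,
pg_attribute and pg_depend are the catalogues that tend to grow the most with
heavy temporary-table use):

VACUUM (VERBOSE, ANALYZE) pg_catalog.pg_class;
VACUUM (VERBOSE, ANALYZE) pg_catalog.pg_attribute;
VACUUM (VERBOSE, ANALYZE) pg_catalog.pg_depend;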

David




Re: a lot of shared buffers hit when planning for a simple query with primary access path

2024-07-01 Thread David Rowley
On Mon, 1 Jul 2024 at 22:20, Pavel Stehule  wrote:
> The planners get min/max range from indexes. So some user's indexes can be 
> bloated too with similar effect

I considered that, but it doesn't apply to this query as there are no
range quals.

David




Re: Hash Right join and seq scan

2024-07-04 Thread David Rowley
On Fri, 5 Jul 2024 at 12:50, James Pang  wrote:
>we have a daily vacuumdb and analyze job, generally speaking it's done in 
> seconds, sometimes it suddenly running more than tens of minutes with same 
> bind variable values and huge temp space got used  and at that time, explain 
> show  "Hash Anti join, Hash Right join" with seq scan two tables.

There was talk about adding costing for run-time partition pruning
factors but nothing was ever agreed, so nothing was done.  It's just
not that obvious to me how we'd do that.  If the Append had 10
partitions as subnodes, with an equality join condition, you could
assume we'll only match to 1 of those 10, but we've no idea at plan
time which one that'll be and the partitions might drastically vary in
size.  The best I think we could do is take the total cost of those 10
and divide by 10 to get the average cost.  It's much harder for range
conditions as those could match anything from 0 to all partitions. The
best suggestion I saw for that was to multiply the costs by
DEFAULT_INEQ_SEL.

I think for now, you might want to lower the random_page_cost or
increase effective_cache_size to encourage the nested loop -> index
scan plan. Good ranges for effective_cache_size are anywhere between 50
and 75% of your server's RAM.  However, that might not be ideal if your
server is under memory pressure from other running processes. It also
depends on how large shared_buffers is as a percentage of total RAM.
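
For example (the numbers are illustrative; per the above, effective_cache_size
at roughly 50-75% of RAM is a reasonable starting point on a dedicated server):

SET random_page_cost = 1.1;
SET effective_cache_size = '48GB';   -- e.g. ~75% of a server with 64GB of RAM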

David




Re: Hash Right join and seq scan

2024-07-05 Thread David Rowley
On Sat, 6 Jul 2024 at 02:43, James Pang  wrote:
>for nest loop path, since the first one estimated only "8" rows , 
> and they use partitionkeyid as joinkey and all are hash partitions , is it 
> better to estimate cost to  8 (loop times) * 1600 = 12800 (each one loop  map 
> to only 1 hash partition bitmap scan ,avg one partition cost), that's much 
> less than 398917.29 of all partitions ?

I'm not really sure where you're getting the numbers from here. The
outer side of the deepest nested loop has an 8 row estimate, not the
nested loop itself.  I'm unsure where the 1600 is from. I only see
1669.

As of now, we don't do a great job of costing for partition pruning
that will happen during execution.  We won't be inventing anything to
fix that in existing releases of PostgreSQL, so you'll need to either
adjust the code yourself, or find a workaround.

You've not shown us your schema, but perhaps enable_partitionwise_join
= on might help you. Other things that might help are further lowering
random_page_cost or raising effective_cache_size artificially high.
It's hard to tell from here how much random I/O is being costed into
the index scans.  You could determine this by checking if the nested
loop plan costs change as a result of doing further increases to
effective_cache_size. You could maybe nudge it up enough for it to win
over the hash join plan. It is possible that this won't work, however.

>  for secondary Nest Loop Anti join could be rows 299118 rows *  15.78(avg 
> index scan cost of one partition) = 4,720,082 that still much less than  
> 132168227.57 ?
>  for Hash Right join, is it possible to estimate by  8 seq partition 
> scan instead of all 32 hash partitions since the first query estimated 8 rows 
> only ?
>  extend statistics may help estimate count(partitionkeyid) based on 
> other columns bind variables, but looks like that did not help table join 
> case.

I can't quite follow this. You'll need to better explain where you're
getting these numbers for me to be able to understand.

David




Re: Has gen_random_uuid() gotten much slower in v17?

2024-09-11 Thread David Mullineux
Good idea. Thanks. I did check: it's not enabled by default, but just in
case I did another build, this time explicitly defining --disable-debug and
--disable-cassert. And I tested. Still slower than old versions.

This feels like a build configuration problem. Just can't put my finger on
it yet.

On Wed, 11 Sept 2024, 10:40 Peter Eisentraut,  wrote:

> On 10.09.24 15:58, David Mullineux wrote:
> > I'm getting a bit concerned by the slow performance of generating uuids
> > on the latest dev code versus older versions. Here I compare the time to
> > generate 50k random uuids. Both on the same machine.
> > I must be missing something.
>
> Are you sure that the 18devel installation isn't compiled with
> assertions enabled?
>
> The underlying code for gen_random_uuid() is virtually unchanged between
> PG14 and current.
>
>


Re: "set primary keys..." is missing when using hight values for transactions / scaling factor with pgbench

2018-06-26 Thread David Rowley
On 27 June 2018 at 00:21, Elvir Kurić  wrote:
> I have strange issue with pgbench where it fails to execute step to create
> primary keys when I specify scaling factor / transactions to some reasonable
> high value - eg. 8k.

The primary keys are only created in -i mode, which can't be used in
conjunction with the options you've mentioned.

pgbench will perform a vacuum before an actual test run, so perhaps
that's what you're seeing.  You may also have noticed it also didn't
perform the create tables and data population too without -i.


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Queue table that quickly grows causes query planner to choose poor plan

2018-06-27 Thread David Wheeler
Hi All,

I’m having performance trouble with a particular set of queries. It goes a bit 
like this

1) queue table is initially empty, and very narrow (1 bigint column)
2) we insert ~30 million rows into queue table
3) we do a join with queue table to delete from another table (delete from a
using queue where a.id = queue.id), but postgres stats say that queue table is
empty, so it uses a nested loop over all 30 million rows, taking forever

If I kill the query in 3 and let it run again after autoanalyze has done its
thing then it is very quick

This queue table is empty 99% of the time, and the query in 3 runs immediately
after step 2. Is there any likelihood that tweaking the autoanalyze params
would help in this case? I don’t want to explicitly analyze the table between
steps 2 and 3 either, as there are other patterns of use where for example 0
rows are inserted in step 2 and this is expected to run very very quickly. Do I
have any other options?

Postgres 9.5 ATM, but an upgrade is in planning.


Thanks in advance

David Wheeler
Software developer

E [email protected]
D +61 3 9663 3554  W http://dgitsystems.com
Level 8, 620 Bourke St, Melbourne VIC 3000.




Re: Queue table that quickly grows causes query planner to choose poor plan

2018-06-27 Thread David Wheeler
Hi Tom,

Thanks for your reply, that’s very helpful and informative.

Although there's no way to have any useful pg_statistic stats if you won't
do an ANALYZE, the planner nonetheless can see the table's current
physical size, and what it normally does is to multiply the last-reported
tuple density (reltuples/relpages) by the current size.  So if you're
getting an "empty table" estimate anyway, I have to suppose that the
table's state involves reltuples = 0 and relpages > 0.  That's not a
good place to be in; it constrains the planner to believe that the table
is in fact devoid of tuples, because that's what the last ANALYZE saw.

That appears to be correct. I assumed that because the table was analyzed and 
found to be empty then the autovacuum would probably have cleared all the 
tuples too, but that’s not the case.

 relpages |  reltuples
----------+-------------
        0 | 2.33795e+06

I am not following your aversion to sticking an ANALYZE in there,
either.  It's not like inserting 30 million rows would be free.

There are many usage profiles for these tables. Sometimes there will be a 
single insert of 30 million rows, sometimes there will be several inserts of up 
to 100 million rows each in different threads, sometimes there will be many 
(~80 000) inserts of 0 rows (for which an ANALYSE is simply a waste) - I don’t 
want to cause undue performance penalty on the other usage profiles.

But as Justin rightly points out I can selectively ANALYSE only when > x rows 
are inserted, which I think is the best way forward.
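
A sketch of that idea (the staging-table name and the threshold are made up
for illustration; the "if" check itself would live in the loading code):

-- bulk load step (staging_rows is a made-up name):
INSERT INTO queue (id)
SELECT id FROM staging_rows;

-- only when the reported row count exceeds the chosen threshold (say 1 million):
ANALYZE queue;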

David Wheeler
Software developer

E [email protected]
D +61 3 9663 3554  W http://dgitsystems.com
Level 8, 620 Bourke St, Melbourne VIC 3000.


On 28 Jun 2018, at 4:27 am, Tom Lane wrote:

David Wheeler writes:
I'm having performance trouble with a particular set of queries. It goes a bit 
like this

1) queue table is initially empty, and very narrow (1 bigint column)
2) we insert ~30 million rows into queue table
3) we do a join with queue table to delete from another table (delete from a
using queue where a.id = queue.id), but postgres stats say that queue
table is empty, so it uses a nested loop over all 30 million rows, taking
forever

Although there's no way to have any useful pg_statistic stats if you won't
do an ANALYZE, the planner nonetheless can see the table's current
physical size, and what it normally does is to multiply the last-reported
tuple density (reltuples/relpages) by the current size.  So if you're
getting an "empty table" estimate anyway, I have to suppose that the
table's state involves reltuples = 0 and relpages > 0.  That's not a
good place to be in; it constrains the planner to believe that the table
is in fact devoid of tuples, because that's what the last ANALYZE saw.

Now, the initial state for a freshly-created or freshly-truncated table
is *not* that.  It is reltuples = 0 and relpages = 0, representing an
undefined tuple density.  Given that, the planner will make some guess
about average tuple size --- which is likely to be a very good guess,
for a table with only fixed-width columns --- and then compute a rowcount
estimate using that plus the observed physical size.

So I think your problem comes from oscillating between really-empty
and not-at-all-empty, and not using an idiomatic way of going back
to the empty state.  Have you tried using TRUNCATE instead of DELETE?

This queue table is empty 99% of the time, and the query in 3 runs immediately
after step 2. Is there any likelihood that tweaking the autoanalyze params
would help in this case? I don't want to explicitly analyze the table between
steps 2 and 3 either, as there are other patterns of use where for example 0
rows are inserted in step 2 and this is expected to run very very quickly. Do I
have any other options?

I am not following your aversion to sticking an ANALYZE in there,
either.  It's not like inserting 30 million rows would be free.

regards, tom lane




Re: Performance difference in accessing differrent columns in a Postgres Table

2018-07-29 Thread David Rowley
On 29 July 2018 at 17:38, Dinesh Kumar  wrote:
> I found performance variance between accessing int1 and int200 column which
> is quite large.

Have a look at slot_deform_tuple and heap_deform_tuple. You'll see
that tuples are deformed starting at the first attribute. If you ask
for attribute 200 then it must deform 1-199 first.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: To keep indexes in memory, is large enough effective_cache_size enough?

2018-09-19 Thread David Rowley
On 19 September 2018 at 21:18, Sam R.  wrote:
> Ok. So, we should set also shared_buffers big.

It might not be quite as beneficial as you might think. If your
database is larger than RAM, often having a smaller shared_buffers
setting yields better performance.  The reason is that if you have a
very large shared_buffers, the same buffers can end up cached in both
the kernel page cache and shared buffers. If you have a smaller shared
buffers setting then the chances of that double buffering are reduced
and the chances of finding a page cached somewhere increase.

However, if your database is quite small and you can afford to fit all
your data in shared buffers, with enough free RAM for everything else,
then you might benefit from a large shared buffers, but it's important
to also consider that some operations, such as DROP TABLE, can become
slow if shared buffers is very large.

You might get more specific recommendations if you mention how much
RAM the server has and how big the data is now and will be in the
future.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: To keep indexes in memory, is large enough effective_cache_size enough?

2018-09-19 Thread David Rowley
On 19 September 2018 at 22:12, Kaixi Luo  wrote:
> Does a large shared_buffers impact checkpoint performance negatively? I was
> under the impression that everything inside shared_buffers must be written
> during a checkpoint.

Only the dirty buffers get written.

Also having too small a shared buffers can mean that buffers must be
written more than they'd otherwise need to be. If a buffer must be
evicted from shared buffers to make way for a new buffer then the
chances of having to evict a dirty buffer increases with smaller
shared buffers. Obviously, this dirty buffer needs to be written out
before the new buffer can be loaded in. In a worst-case scenario, a
backend performing a query would have to do this.  pg_stat_bgwriter is
your friend.
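
For example (a high buffers_backend count relative to buffers_clean and
buffers_checkpoint is the usual sign that backends are having to write out
dirty buffers themselves):

select checkpoints_timed, checkpoints_req,
       buffers_checkpoint, buffers_clean, maxwritten_clean,
       buffers_backend, buffers_backend_fsync, buffers_alloc
from pg_stat_bgwriter;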

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: To keep indexes in memory, is large enough effective_cache_size enough?

2018-09-19 Thread David Rowley
On 20 September 2018 at 15:19, Wei Shan  wrote:
> I believe you can use pg_prewarm to pin index or table to cache.
>
> https://www.postgresql.org/docs/current/static/pgprewarm.html

I think the key sentence in the document you linked to is:

"Prewarmed data also enjoys no special protection from cache
evictions, so it is possible that other system activity may evict the
newly prewarmed blocks shortly after they are read"

So this is not pinning. It's merely loading buffers into shared
buffers in the hope that they might be around long enough for you to
make the most of that effort.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: To keep indexes in memory, is large enough effective_cache_size enough?

2018-09-25 Thread David Rowley
On Tue, 25 Sep 2018 at 18:36, Sam R.  wrote:
> Regarding double buffering: I do not know how much double buffering would 
> slow down operations.
> It could also be possible to turn off kernel page cache on our DB server, to 
> avoid double buffering. Although, we may still keep it in use.

I think you've misunderstood double buffering.  The double buffering
itself does not slow anything down. If the buffer is in shared buffers
already then it does not need to look any further for it. Double
buffering only becomes an issue when buffers existing 2 times in
memory causes other useful buffers to appear 0 times.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: To keep indexes in memory, is large enough effective_cache_size enough?

2018-09-28 Thread David Rowley
On 28 September 2018 at 16:45, Sam R.  wrote:
> That was what I was suspecting a little. Double buffering may not matter in
> our case, because the whole server is meant for PostgreSQL only.
>
> In our case, we can e.g. reserve almost "all memory" for PostgreSQL (shared
> buffers etc.).
>
> Please correct me if I am wrong.

You mentioned above:

> RAM: 64 GB
> Data: 500 GB - 1.5 TB, for example.

If most of that data just sits on disk and is never read then you
might be right, but if the working set of the data is larger than RAM
then you might find you get better performance from smaller shared
buffers.

I think the best thing you can go and do is to go and test this. Write
some code that mocks up a realistic production workload and see where
you get the best performance.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: NOT IN vs. NOT EXISTS performance

2018-11-08 Thread David Rowley
On 9 November 2018 at 08:35, Lincoln Swaine-Moore
 wrote:
> My primary question is: why is this approach only possible (for data too
> large for memory) when using NOT EXISTS, and not when using NOT IN?
>
> I understand that there is a slight difference in the meaning of the two
> expressions, in that NOT IN will produce NULL if there are any NULL values
> in the right hand side (in this case there are none, and the queries should
> return the same COUNT). But if anything, I would expect that to improve
> performance of the NOT IN operation, since a single pass through that data
> should reveal if there are any NULL values, at which point that information
> could be used to short-circuit. So I am a bit baffled.

The problem is that the planner makes the plan and would have to know
beforehand that no NULLs could exist on either side of the join. For
more simple cases it could make use of NOT NULL constaints, but more
complex cases exist, such as:

SELECT * FROM t1 LEFT JOIN t2 ON t1.x = t2.y WHERE t2.y NOT IN(SELECT
z FROM t3);

There's a bit more reading about the complexity of this in [1]

[1] 
https://www.postgresql.org/message-id/flat/CAMkU%3D1zPVbez_HWao781L8PzFk%2Bd1J8VaJuhyjUHaRifk6OcUA%40mail.gmail.com#7c6d3178c18103d8508f3ec5982b1b8e
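
As a general illustration of the difference (table and column names here are
made up rather than taken from the original query):

-- NOT IN: the planner must be able to prove t2.y can never be NULL
-- before it can use an anti-join
SELECT count(*) FROM t1 WHERE t1.x NOT IN (SELECT y FROM t2);

-- NOT EXISTS: NULLs don't change the result, so an anti-join is always possible
SELECT count(*)
FROM t1
WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.y = t1.x);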

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Query with high planning time at version 11.1 compared versions 10.5 and 11.0

2018-11-27 Thread David Rowley
On Wed, 28 Nov 2018 at 03:16, Sanyo Moura  wrote:
> 11.0
> Planning Time: 7.238 ms
> Planning Time: 2.638 ms
>
> 11.5
> Planning Time: 15138.533 ms
> Execution Time: 2.310 ms

Does it still take that long after running ANALYZE on the partitioned table?

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: SQL Perfomance during autovacuum

2018-12-18 Thread David Rowley
On Wed, 19 Dec 2018 at 19:04, anand086  wrote:
> We have noticed this increase in execution times for the sqls only when
> autovacuum runs and it runs with prevent wraparound mode. I think during the
> autovacuum process the Buffers: shared hit are increasing causing increase
> in execution time.
>
> I need help with the approach to debug this issue. Is this expected
> behaviour wherein sql execution timing increases during the autovacuum? If
> so, what is the reason for the same?

This is unsurprising. There are various GUC settings designed to
throttle vacuum to help minimise this problem. The auto-vacuum process
is competing for the same resources as your query is, and is likely
loading many new buffers, therefore flushing buffers out of cache that
might be useful for your query.

Showing the output of:

select name,setting from pg_Settings where name like '%vacuum%';

may be of use here.

You'll particularly want to pay attention to the settings of
autovacuum_vacuum_cost_delay, autovacuum_vacuum_cost_limit and
vacuum_cost_limit.  The settings of vacuum_cost_page_dirty,
vacuum_cost_page_hit, vacuum_cost_page_miss matter too, but these are
less often changed by users.

You may be able to learn exactly what's going on with the query by doing:

set track_io_timing = on;
explain (analyze, buffers, timing) <your query>

both during the auto-vacuum run, and at a time when it's not running.

If the query plans of each match, then pay attention to the number of
buffers read and how long they took to read. If you find that these
don't explain the variation then something else is at fault, perhaps
CPU contention, or perhaps swapping due to high memory usage.

It also seems pretty strange that you should need to use DISTINCT on a
column that's named "id".

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Query Performance Issue

2018-12-28 Thread David Rowley
On Sat, 29 Dec 2018 at 04:32, Justin Pryzby  wrote:
> I think the solution is to upgrade (at least) to PG10 and CREATE STATISTICS
> (dependencies).

Unfortunately, I don't think that'll help this situation. Extended
statistics are currently only handled for base quals, not join quals.
See dependency_is_compatible_clause().

It would be interesting to see how far out the estimate is without the
version = 1 clause.  If just the recommended_content_id clause is
underestimated enough it could be enough to have the planner choose
the nested loop. Perhaps upping the stats on that column may help, but
it may only help so far as to reduce the chances of a nested loop. If
the number of distinct recommended_content_id values is higher than
the statistic targets and is skewed enough then there still may be
some magic values in there that end up causing a bad plan.

It would also be good to know what random_page_cost is set to, and
also if effective_cache_size isn't set too high.  Increasing
random_page_cost would help reduce the chances of this nested loop
plan, but it's a pretty global change and could also have a negative
effect on other queries.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Query Performance Issue

2018-12-29 Thread David Rowley
On Sat, 29 Dec 2018 at 20:15, Justin Pryzby  wrote:
> On Sat, Dec 29, 2018 at 07:58:28PM +1300, David Rowley wrote:
> > Unfortunately, I don't think that'll help this situation. Extended
> > statistics are currently only handled for base quals, not join quals.
> > See dependency_is_compatible_clause().
>
> Right, understand.
>
> Correct me if I'm wrong though, but I think the first major misestimate is in
> the scan, not the join:

I should have checked more carefully. Of course, they are base quals.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: select query does not pick up the right index

2019-01-02 Thread David Rowley
On Thu, 3 Jan 2019 at 05:28, Abadie Lana  wrote:
> I would appreciate any hints as this problem looks to me rather strange…I 
> tried to google it but in vain.
>
> select t.name, 
> c.smpl_time,c.nanosecs,c.float_val,c.num_val,c.str_val,c.datatype,c.array_val 
> from sample c, channel t where t.channel_id=c.channel_id and 
> t.name='BUIL-B36-VA-RT-RT1:CL0001-2-ABW'   order by c.smpl_time desc limit 5;
>
> takes 20mn to execute because it picks up the wrong index…see explain analyse 
> below. I would expect this query to use the (channel_id,smpl_time) but it 
> uses the smpl_time index.

[...]

> Any ideas, why the planner is not taking the right index?

The planner assumes that the required channel values are evenly
distributed through the scan of the index on smpl_time.  If your
required 5 rows were found quickly (i.e channels with recent sample
values), then the plan would have worked out well.  It looks like
'BUIL-B36-VA-RT-RT1:CL0001-2-ABW' is probably a channel which has some
very old sample values. I can see that from "Rows Removed by Join
Filter: 322099471",  meaning that on backwards scanning the smpl_time
index, that many rows were found not to match the channel you
requested.

The planner, by default, only has statistics to say how common each
channel is in the sample table. I think in this case, since the planner
has no knowledge of which channel_id it will be searching for (that's
only determined during execution), I suppose it must be using the
n_distinct estimate of the sample.channel_id column.  It would be interesting to
know how far off the n_distinct estimation is. You can find out with:

select stadistinct from pg_statistic where starelid='sample'::regclass
and staattnum = 1;
select count(*) from (select distinct channel_id from sample) s; --
this may take a while to run...

If the stadistinct estimate is far out from the reality, then you
could consider setting this manually with:

alter table sample alter column channel_id set (n_distinct = <your estimate>);

but keep in mind, that as the table evolves, whatever you set there
could become outdated.

Another method you could try to fix this would be to coax the planner
into doing something different by giving it a better index to work
with.

create index on channel(name, channel_id);

You didn't show us the details from the channel table, but if there's
not already an index like this, then adding one might reduce the cost of
a Merge Join; however, since the rows output from that join would be in
channel_id order, a Sort would be required, which would mean joining
all matching rows, not just the first 5 matches. How many rows actually
match will determine whether or not that's faster.

If you don't have luck with either of the above, then one other thing
you could try would be to disallow the planner from using the
smpl_time index by changing the order by to "ORDER BY c.smpl_time +
INTERVAL '0 sec'".  That's a bit of a hack, but we don't have anything
we officially call "query hints" in PostgreSQL, so often we're left to
solve issues like this with ugly tricks like that.

Also, going by:

> ->  Seq Scan on channel t  (cost=0.00..915.83 rows=1 width=41) (actual 
> time=4.683..7.885 rows=1 loops=1)

perhaps "name" is unique on the channel table?  (I doubt there's an
index/constraint to back that up, however, since such an index would
have likely been used here instead of the Seq Scan)

If so, and you can add a constraint to back that up, you might  be
able to reform the query to be:

select 'BUIL-B36-VA-RT-RT1:CL0001-2-ABW',
c.smpl_time,c.nanosecs,c.float_val,c.num_val,c.str_val,c.datatype,c.array_val
from sample c
WHERE c.channel_id = (SELECT channel_id FROM channel WHERE
name='BUIL-B36-VA-RT-RT1:CL0001-2-ABW')
order by c.smpl_time desc limit 5;

If you can do that then it's highly likely to be *very* fast to
execute since I see there's an index on (channel_id, smpl_time) on
each of the inherited tables.

(If our planner was smarter then in the presence of the correct unique
index, we could have rewritten the query as such automatically but
it's not / we don't.  I believe I've mentioned about improving this
somewhere in the distant past of the -hackers mailing list, although I
can't find it right now. I recall naming the idea "scalar value lookup
joins", but development didn't get much beyond thinking of that name)

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: select query does not pick up the right index

2019-01-03 Thread David Rowley
On Fri, 4 Jan 2019 at 01:57, Abadie Lana  wrote:
> 4) name is unique, constraint and index created. Right index is picked up and 
> query time is rather constant there 40sec.

That's surprisingly slow. Can you share the EXPLAIN (ANALYZE, BUFFERS) of that?


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: select query does not pick up the right index

2019-01-03 Thread David Rowley
> From: David Rowley 
> Sent: 03 January 2019 14:01
> That's surprisingly slow. Can you share the EXPLAIN (ANALYZE, BUFFERS) of 
> that?
>
> explain (analyze,buffers) select 
> 'BUIL-B36-VA-RT-RT1:CL0001-2-ABW',c.smpl_time,c.nanosecs,c.float_val,c.num_val,c.str_val,c.datatype,c.array_val
>  from sample c WHERE c.channel_id = (SELECT channel_id FROM channel WHERE 
> name='BUIL-B36-VA-RT-RT1:CL0001-2-ABW') order by c.smpl_time desc limit 5;


>->  Index Scan Backward using smpl_time_bx2_idx on 
> sample_buil_year c_5  (cost=0.56..2023925.30 rows=3162364
> width=320) (actual time=15167.330..15167.330 rows=0 loops=1)
>  Filter: (channel_id = $0)
>  Rows Removed by Filter: 50597834
>  Buffers: shared hit=25913147 read=713221
>->  Index Scan Backward using sample_time_cm_idx on 
> sample_ctrl_month c_6  (cost=0.56..1862587.12 rows=537562
>  width=77) (actual time=0.048..0.048 rows=0 loops=1)
>  Index Cond: (channel_id = $0)
>  Buffers: shared read=4
>->  Index Scan Backward using smpl_time_cmx2_idx on 
> sample_ctrl_year c_7  (cost=0.57..3186305.67 rows=2094186
>  width=68) (actual time=25847.549..25847.549 rows=0 loops=1)
>  Filter: (channel_id = $0)
>  Rows Removed by Filter: 79579075
>  Buffers: shared hit=49868991 read=1121715

Right, so you need to check your indexes on sample_ctrl_year and
sample_buil_year. You need an index on (channel_id, smpl_time) on
those.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: select query does not pick up the right index

2019-01-03 Thread David Rowley
On Fri, 4 Jan 2019 at 02:20, Abadie Lana  wrote:
> > From: David Rowley 
> > Sent: 03 January 2019 14:01
> Right, so you need to check your indexes on sample_ctrl_year and 
> sample_buil_year. You need an index on (channel_id, smpl_time) on those.

> These indexes exist already

That's interesting. The \d output indicates that the indexes are not
INVALID, so it's not all that obvious why the planner would choose a
lesser index to provide the required rows. One thought is that the
more suitable index is very bloated.  This would increase the
estimated cost of scanning the index and reduce the chances of the
index being selected by the query planner.

If you execute:

select indrelid::regclass as table_name, indexrelid::Regclass as
index_name,pg_size_pretty(pg_relation_size(indrelid))
table_size,pg_size_pretty(pg_relation_size(indexrelid)) index_size
from pg_index
where indrelid in('sample_ctrl_year'::regclass, 'sample_buil_year'::regclass)
order by indrelid::regclass::name, indexrelid::regclass::name;

This should show you the size of the tables and indexes in question.
If the sample_time_cy_idx and sample_time_by_idx indexes are very
large when compared with the size of their table, then it is likely
worth building a new index for these then dropping the old index then
retrying the re-written version of the query.  If this is a live
system then you can build the new indexes by using the CREATE INDEX
CONCURRENTLY command.  This will allow other DML operations to work
without being blocked. The old indexes can then be dropped with DROP
INDEX CONCURRENTLY.
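
For example (the new index name is illustrative; the same pattern applies to
sample_buil_year and its index):

-- index name is illustrative only
CREATE INDEX CONCURRENTLY sample_ctrl_year_chan_time_new
    ON sample_ctrl_year (channel_id, smpl_time);

DROP INDEX CONCURRENTLY sample_time_cy_idx;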

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: select query does not pick up the right index

2019-01-09 Thread David Rowley
On Thu, 10 Jan 2019 at 01:55, Abadie Lana  wrote:
> The other query suggested by D.Rowley has the same issue : still swap 
> activity is higher.
> explain analyze select 'BUIL-B36-VA-RT-RT1:CL0001-2-ABW', 
> c.smpl_time,c.nanosecs,c.float_val,c.num_val,c.str_val,c.datatype,c.array_val 
> from sample c where c.channel_id in (select channel_id from channel where 
> name ='BUIL-B36-VA-RT-RT1:CL0001-2-ABW')   order by c.smpl_time  desc limit 5;

This is not the query I suggested.  I mentioned if channel.name had a
unique index, you'd be able to do WHERE c.channel_id = (select
channel_id from channel where name = '...').  That's pretty different
to what you have above.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Parallel stats in execution plans

2019-01-16 Thread David Conlin

Hi folks -

I'm having trouble understanding what some of the stats mean in the 
execution plan output when parallel workers are used. I've tried to read 
up about it, but I haven't been able to find anything that explains what 
I'm seeing. Apologies in advance if there's documentation I've been too 
stupid to find.


I've run the following query. The "towns" table is a massive table that 
I created in order to get some big numbers on a parallel query - don't 
worry, this isn't a real query I want to make faster, just a silly 
example I'd like to understand.


EXPLAIN (ANALYZE, FORMAT JSON, BUFFERS, VERBOSE)
SELECT name, code, article
FROM towns
ORDER BY name ASC,
         code DESC;

The output looks like this:

[
  {
"Plan": {
  "Node Type": "Gather Merge", "Parallel Aware": false, "Startup Cost": 
1013948.54, "Total Cost": 1986244.55, "Plan Rows": 884, "Plan 
Width": 77, "Actual Startup Time": 42978.838, "Actual Total Time": 
60628.982, "Actual Rows": 1010, "Actual Loops": 1, "Output": ["name", "code", "article"], "Workers Planned": 2, "Workers Launched": 2, "Shared Hit Blocks": 29, 
"Shared Read Blocks": 47641, "Shared Dirtied Blocks": 0, "Shared Written 
Blocks": 0, "Local Hit Blocks": 0, "Local Read Blocks": 0, "Local 
Dirtied Blocks": 0, "Local Written Blocks": 0, "Temp Read Blocks": 
91342, "Temp Written Blocks": 91479, "Plans": [

{
  "Node Type": "Sort", "Parent Relationship": "Outer", "Parallel Aware": 
false, "Startup Cost": 1012948.52, "Total Cost": 1023365.25, "Plan 
Rows": 4166692, "Plan Width": 77, "Actual Startup Time": 42765.496, 
"Actual Total Time": 48526.168, "Actual Rows": 337, "Actual Loops": 
3, "Output": ["name", "code", "article"], "Sort Key": ["towns.name", "towns.code DESC"], "Sort Method": "external merge", "Sort Space Used": 283856, "Sort 
Space Type": "Disk", "Shared Hit Blocks": 170, "Shared Read Blocks": 
142762, "Shared Dirtied Blocks": 0, "Shared Written Blocks": 0, "Local 
Hit Blocks": 0, "Local Read Blocks": 0, "Local Dirtied Blocks": 0, 
"Local Written Blocks": 0, "Temp Read Blocks": 273289, "Temp Written 
Blocks": 273700, "Workers": [

{
  "Worker Number": 0, "Actual Startup Time": 42588.662, "Actual Total 
Time": 48456.662, "Actual Rows": 3277980, "Actual Loops": 1, "Shared Hit 
Blocks": 72, "Shared Read Blocks": 46794, "Shared Dirtied Blocks": 0, 
"Shared Written Blocks": 0, "Local Hit Blocks": 0, "Local Read Blocks": 
0, "Local Dirtied Blocks": 0, "Local Written Blocks": 0, "Temp Read 
Blocks": 89067, "Temp Written Blocks": 89202 }, {
  "Worker Number": 1, "Actual Startup Time": 42946.705, "Actual Total 
Time": 48799.414, "Actual Rows": 3385130, "Actual Loops": 1, "Shared Hit 
Blocks": 69, "Shared Read Blocks": 48327, "Shared Dirtied Blocks": 0, 
"Shared Written Blocks": 0, "Local Hit Blocks": 0, "Local Read Blocks": 
0, "Local Dirtied Blocks": 0, "Local Written Blocks": 0, "Temp Read 
Blocks": 92880, "Temp Written Blocks": 93019 }

  ], "Plans": [
{
  "Node Type": "Seq Scan", "Parent Relationship": "Outer", "Parallel 
Aware": true, "Relation Name": "towns", "Schema": "public", "Alias": 
"towns", "Startup Cost": 0.00, "Total Cost": 184524.92, "Plan Rows": 
4166692, "Plan Width": 77, "Actual Startup Time": 0.322, "Actual Total 
Time": 8305.886, "Actual Rows": 337, "Actual Loops": 3, "Output": ["name", "code", "article"], "Shared Hit Blocks": 96, "Shared Read Blocks": 142762, "Shared Dirtied 
Blocks": 0, "Shared Written Blocks": 0, "Local Hit Blocks": 0, "Local 
Read Blocks": 0, "Local Dirtied Blocks": 0, "Local Written Blocks": 0, 
"Temp Read Blocks": 0, "Temp Written Blocks": 0, "Workers": [

{
  "Worker Number": 0, "Actual Startup Time": 0.105, "Actual Total Time": 
8394.629, "Actual Rows": 3277980, "Actual Loops": 1, "Shared Hit 
Blocks": 35, "Shared Read Blocks": 46794, "Shared Dirtied Blocks": 0, 
"Shared Written Blocks": 0, "Local Hit Blocks": 0, "Local Read Blocks": 
0, "Local Dirtied Blocks": 0, "Local Written Blocks": 0, "Temp Read 
Blocks": 0, "Temp Written Blocks": 0 }, {
  "Worker Number": 1, "Actual Startup Time": 0.113, "Actual Total Time": 
8139.382, "Actual Rows": 3385130, "Actual Loops": 1, "Shared Hit 
Blocks": 32, "Shared Read Blocks": 48327, "Shared Dirtied Blocks": 0, 
"Shared Written Blocks": 0, "Local Hit Blocks": 0, "Local Read Blocks": 
0, "Local Dirtied Blocks": 0, "Local Written Blocks": 0, "Temp Read 
Blocks": 0, "Temp Written Blocks": 0 }

  ]
}
  ]
}
  ]
}, "Planning Time": 22.898, "Triggers": [
], "Execution Time": 61133.161 }
]

Or a more slimmed-down version, with just the confusing fields:

[
  {
"Plan": {
  "Node Type": "Gather Merge", "Parallel Aware": false,"Actual Total 
Time": 60628.982, "Actual Rows": 1010, "Actual Loops": 1,"Workers 
Planned": 2, "Workers Launched": 2,"Plans": [

{
  "Node Type": "Sort"

Re: Parallel stats in execution plans

2019-01-24 Thread David Conlin
It seems like no-one has any ideas on this - does anyone know anywhere 
else I can try to look/ask to find out more?


Is it possible that this is a bug?

Thanks

Dave

On 16/01/2019 11:31, David Conlin wrote:


Hi folks -

I'm having trouble understanding what some of the stats mean in the 
execution plan output when parallel workers are used. I've tried to 
read up about it, but I haven't been able to find anything that 
explains what I'm seeing. Apologies in advance if there's 
documentation I've been too stupid to find.


I've run the following query. The "towns" table is a massive table 
that I created in order to get some big numbers on a parallel query - 
don't worry, this isn't a real query I want to make faster, just a 
silly example I'd like to understand.


EXPLAIN (ANALYZE, FORMAT JSON, BUFFERS, VERBOSE)
SELECT name, code, article
FROM towns
ORDER BY nameASC,
  codeDESC;

The output looks like this:

[
   {
 "Plan": {
   "Node Type": "Gather Merge", "Parallel Aware": false, "Startup Cost": 
1013948.54, "Total Cost": 1986244.55, "Plan Rows": 884, "Plan 
Width": 77, "Actual Startup Time": 42978.838, "Actual Total Time": 
60628.982, "Actual Rows": 1010, "Actual Loops": 1, "Output": ["name", "code", "article"], "Workers Planned": 2, "Workers Launched": 2, "Shared Hit Blocks": 
29, "Shared Read Blocks": 47641, "Shared Dirtied Blocks": 0, "Shared 
Written Blocks": 0, "Local Hit Blocks": 0, "Local Read Blocks": 0, 
"Local Dirtied Blocks": 0, "Local Written Blocks": 0, "Temp Read 
Blocks": 91342, "Temp Written Blocks": 91479, "Plans": [

 {
   "Node Type": "Sort", "Parent Relationship": "Outer", "Parallel Aware": 
false, "Startup Cost": 1012948.52, "Total Cost": 1023365.25, "Plan 
Rows": 4166692, "Plan Width": 77, "Actual Startup Time": 42765.496, 
"Actual Total Time": 48526.168, "Actual Rows": 337, "Actual 
Loops": 3, "Output": ["name", "code", "article"], "Sort Key": ["towns.name", "towns.code DESC"], "Sort Method": "external merge", "Sort Space Used": 283856, "Sort 
Space Type": "Disk", "Shared Hit Blocks": 170, "Shared Read Blocks": 
142762, "Shared Dirtied Blocks": 0, "Shared Written Blocks": 0, "Local 
Hit Blocks": 0, "Local Read Blocks": 0, "Local Dirtied Blocks": 0, 
"Local Written Blocks": 0, "Temp Read Blocks": 273289, "Temp Written 
Blocks": 273700, "Workers": [

 {
   "Worker Number": 0, "Actual Startup Time": 42588.662, "Actual Total 
Time": 48456.662, "Actual Rows": 3277980, "Actual Loops": 1, "Shared 
Hit Blocks": 72, "Shared Read Blocks": 46794, "Shared Dirtied Blocks": 
0, "Shared Written Blocks": 0, "Local Hit Blocks": 0, "Local Read 
Blocks": 0, "Local Dirtied Blocks": 0, "Local Written Blocks": 0, 
"Temp Read Blocks": 89067, "Temp Written Blocks": 89202 }, {
   "Worker Number": 1, "Actual Startup Time": 42946.705, "Actual Total 
Time": 48799.414, "Actual Rows": 3385130, "Actual Loops": 1, "Shared 
Hit Blocks": 69, "Shared Read Blocks": 48327, "Shared Dirtied Blocks": 
0, "Shared Written Blocks": 0, "Local Hit Blocks": 0, "Local Read 
Blocks": 0, "Local Dirtied Blocks": 0, "Local Written Blocks": 0, 
"Temp Read Blocks": 92880, "Temp Written Blocks": 93019 }

   ], "Plans": [
 {
   "Node Type": "Seq Scan", "Parent Relationship": "Outer", "Parallel 
Aware": true, "Relation Name": "towns", "Schema": "public", "Alias": 
"towns", "Startup Cost": 0.00, "Total Cost": 184524.92, "Plan Rows": 
4166692, "Plan Width": 77, "Actual Startup Time": 0.322, "Actual Total 
Time": 8305.886, "Actual Rows": 337, "Actual Loops": 3, "Output": ["name", "code", "article"], "Shared Hit Blocks": 96, "Shared Read Blocks": 142762, "Shared 
Dirtied Blocks&

Re: Parallel stats in execution plans

2019-01-24 Thread David Rowley
On Thu, 17 Jan 2019 at 00:31, David Conlin  wrote:
> How the time values combine with parallelism. For example, each execution of 
> the sort node takes an average of 48.5s, over three loops. This makes a total 
> running time of 145.5s. Even if this was perfectly distributed between the 
> two workers, I would expect this to take 72.75s, which is more than the total 
> execution time, so it can't take this long.
>  How the row numbers combine with those in the "Workers" subkey. For example, 
> in the Sort node, worker #0 produces 3,277,980 rows, while worker #1 produces 
> 3,385,130 rows. The Sort node as a whole produces 3,333,337 rows per loop, 
> for a total of 10,000,010 (the value in the gather merge node). I would have 
> expected the number of rows produced by the two workers to sum to the number 
> produced by the Sort node as a whole, either per loop or in total.
> How the "Actual Loops" values combine with those in the "Workers" subkey. For 
> example, the "Sort" node has 3 loops, but each of the workers inside it have 
> 1 loop. I would have expected either:
>
> each of the workers to have done 3 loops (since the sort is executed 3 
> times), or
> the number of loops in the two workers to sum to three (if the three 
> executions of the sort are distributed across the two workers)

It's important to know that all of the actual row counts and actual
times are divided by the number of loops, which in this case is 3, one
per process working on that part of the plan.  There are two workers,
but the main process helps out too.
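
To make the arithmetic concrete, here is a sketch using the figures
quoted above (the small mismatches are just rounding):

SELECT 10000010 / 3.0  AS rows_per_loop,  -- ~3,333,337, the "Actual Rows" reported on the Sort node
       48526.168 * 3   AS summed_ms;      -- ~145,578 ms of combined time across the 3 processes;
                                          -- they run concurrently, so this can exceed wall-clock time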

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Zero throughput on a query on a very large table.

2019-01-24 Thread David Rowley
On Fri, 25 Jan 2019 at 19:24, Tom Lane  wrote:
> PS: On the third hand, you mention having created new indexes on this
> table with apparently not a lot of pain, which is a tad surprising
> if you don't have the patience to wait for a sort to finish.  How
> long did those index builds take?

It would certainly be good to look at psql's \d tmp_outpatient_rev
output to ensure that the index is not marked as INVALID.
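
A catalog query along these lines shows the same information without
psql (a sketch; the table name is taken from the thread):

SELECT indexrelid::regclass AS index_name, indisvalid
FROM pg_index
WHERE indrelid = 'tmp_outpatient_rev'::regclass;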


-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum big table taking hours and sometimes seconds

2019-02-06 Thread David Rowley
On Thu, 7 Feb 2019 at 00:17, Laurenz Albe  wrote:
>
> On Wed, 2019-02-06 at 12:29 +0200, Mariel Cherkassky wrote:
> > Now the question is how to handle or tune it ? Is there any change that I 
> > need to increase the cost_limit / cost_delay ?
>
> Maybe configuring autovacuum to run faster will help:
>
> alter table orig_table set (toast.autovacuum_vacuum_cost_limit = 2000);
>
> Or, more extreme:
>
> alter table orig_table set (toast.autovacuum_vacuum_cost_delay = 0);

Going by the block hits/misses/dirtied and the mentioned vacuum times,
it looks like auto-vacuum is set to the standard settings and if so it
spent about 100% of its time sleeping on the job.

It might be a better idea to consider changing the vacuum settings
globally rather than just for one table.

Running a vacuum_cost_limit of 200 is likely something you'd not want
to ever do with modern hardware... well maybe unless you just bought
the latest Raspberry PI, or something.  You should be tuning that
value to something that runs your vacuums to a speed you're happy with
but leaves enough IO and CPU for queries running on the database.

If you see that all auto-vacuum workers are busy more often than not,
then they're likely running too slowly and should be set to run more
quickly.
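
For example, a global change might look like this (only a sketch; the
value is an assumption and needs tuning for the hardware and workload):

ALTER SYSTEM SET vacuum_cost_limit = 2000;  -- default is 200
SELECT pg_reload_conf();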

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: autovacuum big table taking hours and sometimes seconds

2019-02-06 Thread David Rowley
On Thu, 7 Feb 2019 at 02:34, Mariel Cherkassky
 wrote:
> As I said, I set the next settings for the toasted table :
>
>  alter table orig_table set (toast.autovacuum_vacuum_scale_factor = 0);
>
>   alter table orig_table set (toast.autovacuum_vacuum_threshold  =1);

These settings don't control how fast auto-vacuum runs, just when it should run.

> Can you explain a little bit more why you decided that the autovacuum spent 
> it time on sleeping ?

Yeah, if you look at the following settings.

 vacuum_cost_limit   | 200
 vacuum_cost_page_dirty  | 20
 vacuum_cost_page_hit| 1
 vacuum_cost_page_miss   | 10
 autovacuum_vacuum_cost_delay| 20ms

I've tagged on the default setting for each of these.  Both vacuum and
auto-vacuum keep score of how many points they've accumulated while
running. 20 points for dirtying a page, 10 for a read that's not found
to be in shared_buffers, 1 for reading a buffer from shared buffers.
When vacuum_cost_limit points are reached (or
autovacuum_vacuum_cost_limit, if it's not -1), auto-vacuum sleeps for
autovacuum_vacuum_cost_delay; normal manual vacuums sleep for
vacuum_cost_delay.

In one of the log entries you saw:

> buffer usage: 4808221 hits, 6404148 misses, 6152603 dirtied
> avg read rate: 2.617 MiB/s, avg write rate: 2.514 MiB/s
> system usage: CPU 148.65s/70.06u sec elapsed 19119.55 sec

Doing a bit of maths to see how much time that vacuum should have slept for:

postgres=# select (4808221 * 1 + 6404148 * 10 + 6152603 * 20) / 200.0
* 20 / 1000;
  ?column?
------------
 19190.1761

That's remarkably close to the actual time of 19119.55 sec. If you do
the same for the other 3 vacuums then you'll see the same close match.

> I see the autovacuum statistics from the logs, how can I check that the 
> workers are busy very often ?

It would be nice if there was something better, but periodically doing:

SELECT count(*) FROM pg_stat_activity where query like 'autovacuum%';

will work.

> My vacuum limit is 200 right now, basically If vacuum runs on my toasted 
> table and reached 200 but it didnt finish to clean all the dead tuples, after 
> the nap, should it continue cleaning it or wait until the vacuum_threshold 
> hit again ?

Nap time is something else; maybe you're confusing it with the speed
of vacuum?  Napping is just the time auto-vacuum will wait between
checking for new tables to work on.  Having auto-vacuum run so slowly
is a probable cause of still having dead tuples after the vacuum...
likely because they became dead after the vacuum started.

I'd recommend reading the manual or Tomas Vondra's blog about vacuum
costs. It's not overly complex, once you understand what each of the
vacuum settings does.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Why isn't an index scan being used?

2019-02-19 Thread David Rowley
On Wed, 20 Feb 2019 at 13:11, Abi Noda  wrote:
> However, when I index the closed column, a bitmap scan is used instead of an 
> index scan, with slightly slower performance. Why isn't an index scan being 
> used, given that the exact same number of rows are at play as in my query on 
> the state column?

That's down to the planner's cost estimates. Likely it thinks that
either doing a bitmap scan is cheaper, or close enough that it does
not matter.

> How do I index closed in a way where an index scan is used?

The costing does account for the size of the index. If the
"closed_index" index is larger than the "state_index", then doing an
index scan on "closed_index" is going to be costed higher.

Most of this likely boils down to random_page_cost being a guess. You
may want to check that your effective_cache_size is set to something
like 75% of the machine's memory, and/or tweak random_page_cost down
if it's set to the standard setting of 4.  Modern SSDs are pretty fast
at random reads. HDDs, not so much.
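
As a sketch, assuming an SSD-backed machine with 16GB of RAM (the
values are illustrative, not recommendations):

ALTER SYSTEM SET effective_cache_size = '12GB';  -- roughly 75% of the assumed RAM
ALTER SYSTEM SET random_page_cost = 1.1;         -- down from the default of 4
SELECT pg_reload_conf();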

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Massive parallel queue table causes index deterioration, but REINDEX fails with deadlocks.

2019-02-23 Thread David Rowley
On Sun, 24 Feb 2019 at 10:06, Gunther  wrote:
> I am using an SQL queue for distributing work to massively parallel workers. 
> Workers come in servers with 12 parallel threads. One of those worker sets 
> handles 7 transactions per second. If I add a second one, for 24 parallel 
> workers, it scales to 14 /s. Even a third, for 36 parallel workers, I can add 
> to reach 21 /s. If I try a fourth set, 48 workers, I end up in trouble. But 
> that isn't even so much my problem rather than the fact that in short time, 
> the performance will deteriorate, and it looks like that is because the queue 
> index deteriorates and needs a REINDEX.

It sounds very much like auto-vacuum is simply unable to keep up with
the rate at which the table is being updated.   Please be aware, that
by default, auto-vacuum is configured to run fairly slowly so as not
to saturate low-end machines.

vacuum_cost_limit / autovacuum_vacuum_cost_limit control how many
"points" the vacuum process can accumulate before it sleeps for
autovacuum_vacuum_cost_delay / vacuum_cost_delay.

Additionally, after an auto-vacuum run completes it will wait for
autovacuum_naptime before checking again if any tables require some
attention.

I think you should be monitoring how many auto-vacuums workers are
busy during your runs. If you find that the "queue" table is being
vacuumed almost constantly, then you'll likely want to increase
vacuum_cost_limit / autovacuum_vacuum_cost_limit. You could get an
idea of how often this table is being auto-vacuumed by setting
log_autovacuum_min_duration to 0 and checking the logs.  Another way
to check would be to sample what: SELECT query FROM pg_stat_activity
WHERE query LIKE 'autovacuum%'; returns. You may find that all of the
workers are busy most of the time.  If so, that indicates that the
cost limits need to be raised.
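
A minimal sketch of the logging approach (the per-table variant assumes
the queue table is literally named "queue", which is a placeholder):

ALTER SYSTEM SET log_autovacuum_min_duration = 0;         -- log every auto-vacuum run
SELECT pg_reload_conf();
-- or only for the queue table:
ALTER TABLE queue SET (log_autovacuum_min_duration = 0);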

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: IS NOT DISTINCT FROM statement

2019-03-08 Thread David Rowley
On Sat, 9 Mar 2019 at 00:30, Artur Zając  wrote:
> Is there any reason that „NOT DISTINCT FROM” can’t be autotransformed to „=” 
> when value on right side of expression is not NULL or is this any way to use 
> index with „IS NOT DISTINCT FROM” statement?

Probably nothing other than nobody has done it yet.  It might be
reasonable to have some transformation stage called from
distribute_restrictinfo_to_rels() when adding single rel RestrictInfos
to RTE_RELATION base rels.  It's only these you can check for NOT NULL
constraints, i.e. not so possible with rtekinds such as RTE_FUNCTION
and the like.

It becomes more complex if you consider that someone might have added
a partial index on the relation that matches the IS NOT DISTINCT FROM
clause.  In this case, they might not be happy that their index can no
longer be used. Fixing that would require some careful surgery on
predicate_implied_by() to teach it about IS NOT DISTINCT FROM clauses.
However, that seems to go a step beyond what predicate_implied_by()
does for now. Currently, it only gets to know about quals. Not the
relations they belong to, so there'd be no way to know that the NOT
NULL constraint exists from there.  I'm not sure if there's a good
reason for this or not, it might be because it's not been required
before.  It gets more complex still if you want to consider other
quals in the list to prove not nullness.

In short, probably possible, but why not just write an equality
clause, if you know NULLs are not possible?
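
In other words, when NULLs genuinely can't occur on the right-hand
side, the index-friendly form is simply this (table and column names
are placeholders):

-- instead of:  SELECT id FROM sometable WHERE id1 IS NOT DISTINCT FROM 42
SELECT id FROM sometable WHERE id1 = 42;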

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: IS NOT DISTINCT FROM statement

2019-03-08 Thread David Rowley
On Sat, 9 Mar 2019 at 01:25, Artur Zając  wrote:
> I made some workaround. I made function:
>
> CREATE OR REPLACE FUNCTION smarteq(v1 int,v2 INT) RETURNS BOOL AS
> $BODY$
> SELECT (CASE WHEN v2 IS NULL THEN (v1 IS NULL) ELSE v1=v2 END);
> $BODY$ LANGUAGE 'sql' IMMUTABLE PARALLEL SAFE;

> explain analyze select id from sometable where smarteq(id1,NULL);
>   QUERY PLAN
> --
>  Bitmap Heap Scan on sometable  (cost=19338.59..57045.02 rows=882343 width=4) 
> (actual time=116.236..306.304 rows=881657 loops=1)
>Recheck Cond: (id1 IS NULL)
>Heap Blocks: exact=9581
>->  Bitmap Index Scan on sometable_index1  (cost=0.00..19118.00 
> rows=882343 width=0) (actual time=114.209..114.209 rows=892552 loops=1)
>  Index Cond: (id1 IS NULL)
>  Planning time: 0.135 ms
>  Execution time: 339.229 ms
>
> It looks like it works, but I must check if it will still works in plpgsql (I 
> expect some problems if query is prepared).

I think with either of those you'll just be at the mercy of whether a
generic or custom plan is chosen.  If you get a custom plan then
likely your case statement will be inlined and constant folded away,
but for a generic plan, that can't happen since those constants are
not consts, they're parameters.   Most likely, if you've got an index
on the column you'll perhaps always get a custom plan as the generic
plan would result in a seqscan and it would have to evaluate your case
statement for each row.  By default, generic plans are only considered
on the 6th query execution and are only chosen if the generic cost is
cheaper than the average custom plan cost + fuzz cost for planning.
PG12 gives you a bit more control over that with the plan_cache_mode
GUC, but... that's not out yet.
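
Once on PG12, that control is a one-liner (a sketch; it does not exist
in the version discussed here):

SET plan_cache_mode = force_custom_plan;   -- alternatives: force_generic_plan, auto (the default)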

However, possibly the cost of planning each execution is cheaper than
doing the seq scan, so you might be better off with this.  There is a
risk that the planner does for some reason choose a generic plan and
ends up doing the seq scan, but for that to happen likely the table
would have to be small, in which case it wouldn't matter or the costs
would have to be off, which might cause you some pain.

The transformation mentioned earlier could only work if the arguments
of the IS NOT DISTINCT FROM were Vars or Consts. It couldn't work with
Params since the values are unknown to the planner.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Poor man's partitioned index .... not being used?

2019-03-20 Thread David Rowley
On Thu, 21 Mar 2019 at 15:51, Gunther  wrote:
> foo=# CREATE UNIQUE INDEX Test_pk0 ON Test(id) WHERE mod(id,2) = 0;
> CREATE INDEX
> foo=# CREATE UNIQUE INDEX Test_pk1 ON Test(id) WHERE mod(id,2) = 1;
> CREATE INDEX
> foo=# ANALYZE Test;
> ANALYZE
> foo=# EXPLAIN SELECT * FROM Test WHERE id = '8934';
>   QUERY PLAN
> ---
>  Seq Scan on test  (cost=0.00..6569.61 rows=1 width=4)
>Filter: (id = 8934)
> (2 rows)
>
> Why is that index never used?

When the planner looks at partial indexes to see if they'll suit the
scan, the code that does the matching (predicate_implied_by()) simply
does not go to that much trouble to determine if it matches. If you
look at operator_predicate_proof() you'll see it requires the
expression on at least one side of the OpExpr to match your predicate.
Yours matches on neither side since "id" is wrapped up in a mod()
function call.

Certainly, predicate_implied_by() is by no means finished, new smarts
have been added to it over the years to allow it to prove more cases,
but each time something is added we still need to carefully weigh up
the additional overhead of the new code vs. possible benefits.

It may be possible to do something with immutable functions found in
the expr but someone doing so might have a hard time proving that it's
always safe to do so. For example, arg 2 of your mod() call is a
Const. If it had been another Var then it wouldn't be safe to use.
What other unsafe cases are there? Is there a way we can always
identify unsafe cases during planning? ... are the sorts of questions
someone implementing this would be faced with.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Poor man's partitioned index .... not being used?

2019-03-22 Thread David Rowley
On Fri, 22 Mar 2019 at 07:57, Gunther  wrote:
> foo=# PREPARE testplan(int) AS
> foo-#SELECT * FROM Test WHERE mod(id,2) = mod($1,2) AND id = $1;
> PREPARE
> foo=# EXPLAIN EXECUTE testplan(8934);
> QUERY PLAN
> --
>  Index Only Scan using test_pk0 on test  (cost=0.42..8.44 rows=1 width=4)
>Index Cond: (id = 8934)
> (2 rows)
>
> That's quite alright actually. Now the questions is, could we use this in a 
> nested loop query plan? That's where I think it can't work:

Not really. In that case, the parameters were replaced with the
specified values (a.k.a custom plan).  That happens for the first 5
executions of a prepared statement, and in this case likely the
planner will continue to use the custom plan since the generic plan
won't know that the partial index is okay to use and the plan costs
would likely go up enough that the custom plan would continue to be
favoured.

> foo=# SET enable_mergejoin TO off;
> SET
> foo=# EXPLAIN SELECT * FROM Test2 a LEFT OUTER JOIN Test b ON( mod(b.id,2) = 
> mod(a.id,2) AND b.id = a.id) LIMIT 10;
>QUERY PLAN
> 
>  Limit  (cost=0.00..102516.78 rows=10 width=8)
>->  Nested Loop Left Join  (cost=0.00..38238760.24 rows=3730 width=8)
>  Join Filter: ((b.id = a.id) AND (mod(b.id, 2) = mod(a.id, 2)))
>  ->  Seq Scan on test2 a  (cost=0.00..54.30 rows=3730 width=4)
>  ->  Materialize  (cost=0.00..9056.93 rows=388129 width=4)
>->  Seq Scan on test b  (cost=0.00..5599.29 rows=388129 
> width=4)
> (6 rows)
>
> It looks like it doesn't want to evaluate the mod(a.id, 2) before it moves to 
> the index query for the nested loop.

Whether partial indexes can be used or not is determined using only
quals that can be applied at the scan level.  In this case your qual
is a join qual, and since no other qual exists that can be evaluated
at the scan level where the index can be used, it's not
considered.  In any case, nothing there guarantees that one of your
indexes will match all records. For it to work, both of your indexes
would have to be scanned. It's not clear why you think that would be
any better than scanning just one index. I imagine it would only ever
be a win if you could eliminate one of the index scans with some qual
that guarantees that the index can't contain any records matching your
query.

> I wonder if there was a way of marking such expressions as safe in the query, 
> like suggesting a certain evaluation order, i.e.,
>
> SELECT * FROM Test2 a LEFT OUTER JOIN Test b ON(mod(b.id,2) = 
> EVAL(mod(a.id,2)) AND b.id = a.id) LIMIT 10;
>
> It's OK though. It just goes to show that in a case like this, it is best to 
> just go with the partitioned table anyway.

It sounds like you might want something like partition-wise join that
exists in PG11.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: Postgresql Sort cost Poor performance?

2019-04-02 Thread David Rowley
On Tue, 2 Apr 2019 at 20:00, tank.zhang <[email protected]> wrote:
> 2、 Adding a DISTINCT response time was very slow
>
> qis3_dp2=# SELECT COUNT(DISTINCT SVIN) AS CHECKCARNUM FROM
> QIS_CARPASSEDSTATION A WHERE 1=1 AND A.SSTATIONCD = 'VQ3_LYG' AND A.SLINENO
> IN ( '1F' , '2F' , '3F' ) AND A.DWORKDATE >=
> TO_DATE('2017-02-11','YYYY-MM-DD') AND A.DWORKDATE <=
> TO_DATE('2019-03-11','YYYY-MM-DD');
>  checkcarnum
> -
>  1071367
> (1 row)

That's because of how DISTINCT is implemented within an aggregate
function in PostgreSQL. Internally within the aggregate code in the
executor, a sort is performed on the entire input to the aggregate
node.  The planner is currently unable to make use of any indexes that
provide pre-sorted input.

One way to work around this would be to perform the DISTINCT and
COUNT(*) in separate stages using a subquery.

From your original query, something like:

SELECT COUNT(SVIN) AS CHECKCARNUM,SMTOC
FROM (
SELECT SMTOC,SVIN
FROM QIS_CARPASSEDSTATION A
WHERE 1=1 AND A.SSTATIONCD = 'VQ3_LYG'
AND A.SLINENO IN ( '1F' , '2F' , '3F' )
AND A.DWORKDATE >= TO_DATE('2017-02-11','YYYY-MM-DD')
AND A.DWORKDATE <= TO_DATE('2019-03-11','YYYY-MM-DD')
GROUP BY  SMTOC,SVIN
) A GROUP BY SMTOC;

An index something like:
CREATE INDEX ON QIS_CARPASSEDSTATION (SMTOC, SVIN, SSTATIONCD, DWORKDATE);

Should help speed up the subquery and provide pre-sorted input to the
outer aggregate.  If you like, you could add SLINENO to the end of the
index to allow an index-only scan which may result in further
performance improvements.
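
A sketch of that wider index, appending SLINENO to the column list
suggested above:

CREATE INDEX ON QIS_CARPASSEDSTATION (SMTOC, SVIN, SSTATIONCD, DWORKDATE, SLINENO);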

Without the index, you're forced to sort, but at least it's just one
sort instead of two.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Temporarily very slow planning time after a big delete

2019-05-20 Thread David Rowley
On Tue, 21 May 2019 at 12:44, Walter Smith  wrote:
>
> We had a mysterious (to us) slowdown today that I'm hoping someone can 
> explain just based on PG's principles of operation. It got better by itself 
> so it seems like it was "normal" behavior -- I just don't know what behavior 
> it was exhibiting.
>
> We have a table of user notifications containing about 80 million rows. It 
> gets a lot of inserts continually, and is cleaned up once a day. There are no 
> updates. In all history there have been about 330 million rows created.
>
> Today we deleted about 15 million rows in one transaction from this table. 
> Immediately afterwards, a particular SELECT started running very slowly -- 
> 500 to 3000 ms rather than the usual <1ms.
>
> We did an EXPLAIN ANALYZE on this select and it was still doing an index scan 
> as usual. The *planning time* for the query is what had gotten slow. The 
> query itself was still executing in <1ms.

It would be good to know which version you're running here.  It
sounds very much like get_actual_variable_range() will be
the culprit.  Basically, if a constant value that's being used by the
planner to determine row estimates falls outside the statistics
histogram and a btree index exists that we can use to look up the
actual bound of the data, then we do so in that function.  If you've
just deleted a bunch of rows then that index scan may have to traverse
a bunch of dead tuples before finding that first live tuple.  This
code has changed a few times in recent times, see fccebe421 and
3ca930fc3, which is why your version is of interest.

Apart from that, if you want to confirm that's the issue and you just
want it fixed, just VACUUM the table. You should likely be doing that
anyway directly after your bulk delete.
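
A minimal sketch (the table name is a placeholder for the
notifications table described above):

VACUUM VERBOSE user_notifications;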

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Analyze results in more expensive query plan

2019-05-20 Thread David Rowley
On Tue, 21 May 2019 at 08:23, Jeremy Altavilla
 wrote:
>
> Thanks for the suggestion. I created extended statistics objects for the two 
> tables in question. Unfortunately the resulting plan was the same (and had 
> the same estimates). It looks like the extended stats discovered a 
> potentially useful correlation on bag: "2, 3 => 1" (owner_id, bag_type_id => 
> id). I'm guessing this wasn't usable because the docs state "They are not 
> used to improve estimates for equality conditions comparing two columns".

I'd say that since the time spent planning is nearly 3x what is spent
during execution, you're wasting your time trying to speed up the
execution.  What you should be thinking about is using PREPAREd
statements to avoid the planning overhead completely.  If that's not
possible then you've more chance of reducing the time spent planning
by reducing the statistics on the table rather than adding more
planning overhead by adding extended stats.  You might want to
experiment with ALTER TABLE ... ALTER COLUMN ... SET STATISTICS ..;
and setting those down a bit then analyzing the tables again.
Although, that's likely only going to make a very small difference, if
any, than getting rid of the planning completely.
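
A concrete sketch of that, using the table and column names from the
quoted description (they may not match the real schema):

ALTER TABLE bag ALTER COLUMN owner_id SET STATISTICS 50;      -- down from the default of 100
ALTER TABLE bag ALTER COLUMN bag_type_id SET STATISTICS 50;
ANALYZE bag;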

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Temporarily very slow planning time after a big delete

2019-05-20 Thread David Rowley
On Tue, 21 May 2019 at 14:04, Walter Smith  wrote:
> I'm so sorry -- I meant to give the version, of course. It's 9.6.13.

https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=3ca930fc3
has been applied since then.

It would be good if you could confirm the problem is resolved after a
vacuum.  Maybe run VACUUM VERBOSE on the table and double check
there's not some large amount of tuples that are "nonremovable".

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: scans on table fail to be excluded by partition bounds

2019-06-26 Thread David Rowley
On Tue, 25 Jun 2019 at 05:31, Justin Pryzby  wrote:
> ts=# explain SELECT * FROM eric_enodeb_cell_metrics WHERE start_time BETWEEN 
> '2019-01-01 04:00' AND '2019-01-01 05:00' OR start_time BETWEEN '2019-01-02 
> 04:00' AND '2019-01-02 05:00';
>  Append  (cost=36.04..39668.56 rows=12817 width=2730)
>->  Bitmap Heap Scan on eric_enodeb_cell_20190101  (cost=36.04..19504.14 
> rows=6398 width=2730)
>  Recheck Cond: (((start_time >= '2019-01-01 04:00:00-05'::timestamp 
> with time zone) AND (start_time <= '2019-01-01 05:00:00-05'::timestamp with 
> time zone)) OR ((start_time >= '2019-01-02 04:00:00-05'::timestamp with time 
> zone) AND (start_time <= '2019-01-02 05:00:00-05'::timestamp with time zone)))
>  ->  BitmapOr  (cost=36.04..36.04 rows=6723 width=0)
>->  Bitmap Index Scan on eric_enodeb_cell_20190101_idx  
> (cost=0.00..16.81 rows=6465 width=0)
>  Index Cond: ((start_time >= '2019-01-01 
> 04:00:00-05'::timestamp with time zone) AND (start_time <= '2019-01-01 
> 05:00:00-05'::timestamp with time zone))
>->  Bitmap Index Scan on eric_enodeb_cell_20190101_idx  
> (cost=0.00..16.03 rows=259 width=0)
>  Index Cond: ((start_time >= '2019-01-02 
> 04:00:00-05'::timestamp with time zone) AND (start_time <= '2019-01-02 
> 05:00:00-05'::timestamp with time zone))
>->  Bitmap Heap Scan on eric_enodeb_cell_20190102  (cost=36.08..20100.34 
> rows=6419 width=2730)
>  Recheck Cond: (((start_time >= '2019-01-01 04:00:00-05'::timestamp 
> with time zone) AND (start_time <= '2019-01-01 05:00:00-05'::timestamp with 
> time zone)) OR ((start_time >= '2019-01-02 04:00:00-05'::timestamp with time 
> zone) AND (start_time <= '2019-01-02 05:00:00-05'::timestamp with time zone)))
>  ->  BitmapOr  (cost=36.08..36.08 rows=6982 width=0)
>->  Bitmap Index Scan on eric_enodeb_cell_20190102_idx  
> (cost=0.00..16.03 rows=259 width=0)
>  Index Cond: ((start_time >= '2019-01-01 
> 04:00:00-05'::timestamp with time zone) AND (start_time <= '2019-01-01 
> 05:00:00-05'::timestamp with time zone))
>->  Bitmap Index Scan on eric_enodeb_cell_20190102_idx  
> (cost=0.00..16.84 rows=6723 width=0)
>  Index Cond: ((start_time >= '2019-01-02 
> 04:00:00-05'::timestamp with time zone) AND (start_time <= '2019-01-02 
> 05:00:00-05'::timestamp with time zone))
>
> Is there some reason why the partition constraints aren't excluding any of the
> index scans ?

Yeah, we don't do anything to remove base quals that are redundant due
to the partition constraint.

There was a patch [1] to try and fix this but it's not seen any recent activity.

[1] https://commitfest.postgresql.org/19/1264/

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Planner performance in partitions

2019-08-12 Thread David Rowley
On Tue, 13 Aug 2019 at 08:03, Piotr Włodarczyk
 wrote:
> PostgreSQL11.5:
>
> select on main partition (public.book): planner: 60ms, execution: 5ms
> select on partitioned table (public.book_1-1000): planner: 2.7 ms, execution: 
> 2,4 ms
>
> PostgreSQL 12B3:
>
> select on main partition (public.book): planner: 2,5ms , execution: 1,2ms
> select on partitioned table (public.book_1-1000): planner: 2.5 ms, execution: 
> 1,2 ms
>
> So looking at above results we have two options:
>
> Wait for 12.0 stable version
> Wait for patches to 11 – PostgreSQL Team: can You do this?

You'll need to either reduce the number of partitions down to
something realistic or wait for 12.0.

The work done to speed up the planner with partitioned tables for v12
won't be going into v11.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Get the planner used by a query?

2019-10-09 Thread David Rowley
On Wed, 9 Oct 2019 at 19:21, Behrang Saeedzadeh  wrote:
>
> Is there a way to display the planner algorithm used by a query, either in 
> EXPLAIN or in a different way?

There's not really any simple way to know. If the number of relations
in the join search meets or exceeds geqo_threshold then it'll use the
genetic query optimizer. However, knowing exactly how many relations
are in the join search is not often simple since certain types of
subqueries can be pulled up into the main query and that can increase
the number of relations in the search.

If you don't mind writing C code, then you could write an extension
that hooks into join_search_hook and somehow outputs this information
to you before going on to call the geqo if the "enable_geqo &&
levels_needed >= geqo_threshold" condition is met.  Besides that, I
don't really know if there's any way.  You could try editing the
geqo_seed and seeing if the plan changes, but if it does not, then
that does not mean the geqo was not used, so doing it that way could
be quite error-prone.  You'd only be able to tell the geqo was being
used if you could confirm that changing geqo_seed did change the plan.
(And you could be certain the plan did not change for some other
reason like an auto-analyze).
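
The seed experiment might look like this (a sketch; remember that an
unchanged plan proves nothing):

SHOW geqo_threshold;   -- the geqo is only considered at or above this many relations (default 12)
SET geqo_seed = 0.5;   -- any value in [0, 1) other than the current setting
-- then re-run EXPLAIN on the query in question; if the plan changes, the geqo was involved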



--
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Query slow again after adding an `OR` operation (was: Slow PostgreSQL 10.6 query)

2019-10-09 Thread David Rowley
On Thu, 10 Oct 2019 at 01:07, Behrang Saeedzadeh  wrote:
>
> This is a follow up to 
> https://www.postgresql.org/message-id/flat/CAERAJ%2B-1buiJ%2B_JWEo0a9Ao-CVMWpgp%3DEnFx1dJtnB3WmMi2zQ%40mail.gmail.com
>
> The query (generated by Hibernate) got a bit more complex and performance 
> degraded again. I have uploaded all the details here (with changed table 
> names, etc.): https://github.com/behrangsa/slow-query
>
> In short, the new query is:

The query mostly appears slow due to the "Rows Removed By Filter" in
the OR condition. The only way to avoid scanning the entire
branch_invoices table would be to somehow write the query in such a
way that allows it to go on the inner side of the join.

You could do that if you ensure there's an index on branch_invoices
(branch_id) and format the query as:

SELECT inv.id   AS i_id,
   inv.invoice_date AS inv_d,
   inv.invoice_xid  AS inv_xid,
   inv.invoice_type AS inv_type,
   brs.branch_idAS br_id,
   cinvs.company_id AS c_id
FROM invoices inv
 LEFT OUTER JOIN branch_invoices brs ON inv.id = brs.invoice_id
 LEFT OUTER JOIN company_invoices cinvs ON inv.id = cinvs.invoice_id
 INNER JOIN branches br ON brs.branch_id = br.id
WHERE brs.branch_id IN (SELECT br1.id
FROM branches br1
 INNER JOIN access_rights ar1 ON
br1.id = ar1.branch_id
 INNER JOIN users usr1 ON ar1.user_id = usr1.id
 INNER JOIN groups grp1 ON
ar1.group_id = grp1.id
 INNER JOIN group_permissions gpr1 ON
grp1.id = gpr1.group_id
 INNER JOIN permissions prm1 ON
gpr1.permission_id = prm1.id
WHERE usr1.id = 1636
  AND prm1.code = 'C2'
  AND ar1.access_type = 'T1'
UNION ALL
SELECT br3.id
FROM companies cmp
 INNER JOIN branches br3 ON cmp.id =
br3.company_id
 INNER JOIN access_rights ar2 ON
cmp.id = ar2.company_id
 INNER JOIN users usr2 ON ar2.user_id = usr2.id
 INNER JOIN groups g2 ON ar2.group_id = g2.id
 INNER JOIN group_permissions gpr2 ON
g2.id = gpr2.group_id
 INNER JOIN permissions prm2 ON
gpr2.permission_id = prm2.id
WHERE usr2.id = 1636
  AND prm2.code = 'C2'
  AND ar2.access_type = 'T1')
ORDER BY inv.invoice_date DESC, br.name ASC
LIMIT 12;
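
The supporting index mentioned above would be along these lines (a
sketch):

CREATE INDEX ON branch_invoices (branch_id);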

The planner may then choose to pull up the subquery, uniquify it,
put it on the outside of a nested loop join, and then look up the
branch_invoices record using the index on branch_id.  I think this is
quite a likely plan since the planner estimates there's only going to
be 1 row from each of the subqueries.

Also note that the LEFT JOIN you have to branch_invoices is not
really a left join since you're insisting that the branch_id must be
in the first or 2nd sub-plan. There's no room for it to be NULL. The
planner will just convert that to an INNER JOIN with the above query
since that'll give it the flexibility to put the subquery in the IN
clause on the outside of the join (after having uniquified it).
You'll need to decide what you actually want the behaviour to be here.
If you do need those NULL rows then you'd better move your WHERE quals
down into the join condition for branch_invoices table. I'd suggest
testing with some mock-up data if you're uncertain of what I mean.

If you find that is faster and you can't rewrite the query due to it
having been generated by Hibernate, then that sounds like a problem
with Hibernate.  PostgreSQL does not currently attempt to do any
rewrites which convert OR clauses to use UNION or UNION ALL. No amount
of tweaking the planner settings is going to change that fact.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Change in CTE treatment in query plans?

2019-10-15 Thread David Conlin

Hi folks -

Does anyone know if there's been a change in the way values for CTEs are 
displayed in query plans?


I think that it used to be the case that, for keys that include the 
values of child nodes (eg "Shared Hit Blocks", or "Actual Total 
Time"), CTE scans included the CTE itself, even if it wasn't included as 
one of its children in the plan. If you didn't subtract the CTE scan, 
you would see surprising things, like sort operations reading table 
data, or the total time of the nodes in a single-threaded query plan 
adding up to significantly more than 100% of the total query time.


Now (I think since v11, but I'm not sure), it looks like these values 
only include the children listed in the plan. For example, I've seen CTE 
scans that have smaller times and buffers values than the CTE itself, 
which couldn't be true if the CTE was included in the scan.


I'm much less sure, but I *think* the same is also true of other 
InitPlan nodes - for example, if a node includes the filter "value > 
$1", its time and buffers used to (but no longer does) include the total 
for the InitPlan node which returned the value "$1".


Am I way off base with this, or did this change happen, and if so, am I 
right in thinking that it was changed in v11?


Thanks in advance

Dave







Re: Change in CTE treatment in query plans?

2019-10-21 Thread David Conlin

Hi Tom -

Thanks so much for getting back to me.

I didn't realise that the costs of init/sub plans would be spread across 
the call sites - I had (stupidly) assumed that each call site would 
include the full cost.


Having taken a couple of days to go back over the problems I was seeing, 
you were absolutely right - it was all to do with multiple call sites - 
the postgres version was just a red herring.


Thanks for your help & all the best,

Dave

On 17/10/2019 10:04, Tom Lane wrote:

David Conlin  writes:

Does anyone know if there's been a change in the way values for CTEs are
displayed in query plans?

Offhand I don't recall any such changes, nor does a cursory look
through explain.c find anything promising.

If you're concerned with a multiply-referenced CTE, one possibility
for funny results is that the blame for its execution cost could be
spread across the multiple call sites.  The same can happen with
initplans/subplans.  But I'm just guessing; you didn't show any
concrete examples so it's hard to be definite.

regards, tom lane





Slow planning, fast execution for particular 3-table query

2019-11-03 Thread David Wheeler
We’re having trouble working out why the planning time for this particular 
query is slow (~2.5s vs 0.9ms execution time). As you can see below, there are 
only 3 tables involved so it’s hard to imagine what decisions the planner has 
to make that take so long. After 5 runs the prepared-statement code kicks in 
and it becomes quick, but it’s quite infuriating for the first 5 runs given the 
execution is so much faster.

Are you able to give any tips what might be taking so long (and how we might 
improve it)?

We read elsewhere that someone had a “catalog stats file leak”, which I’m 
taking to mean a big pg_statistic table. Ours is 10mb, which doesn’t seem 
particularly large to me, but I don’t have much context for it. 
https://www.postgresql.org/message-id/CABWW-d21z_WgawkjXFQQviqm16oAx0KQvR6bLkRxvYQmhdByfg%40mail.gmail.com

Other queries (with 3 or more tables) in the same db seem to be planning much 
quicker.

The query:

explain (analyse) SELECT subscription_binding.subscription_binding,
 subscription_binding.tid,
 subscription.subscription_uuid,
 subscription_binding.subscription_binding_uuid,
 binding.binding_uuid,
 subscription_binding.start_time,
 subscription_binding.end_time,
 subscription_binding.timezone,
 now() >= subscription_binding.start_time AND 
(subscription_binding.end_time IS NULL OR now() <= 
subscription_binding.end_time) AS active
FROM jackpot.binding
  JOIN jackpot.subscription_binding USING (tid, binding)
  JOIN jackpot.subscription USING (tid, subscription)
where (tid=2082003407) AND 
(binding_uuid='4f61dcd5-97a0-4098-b9ae-c1546c31b2e6'::uuid) offset 0 limit 1000;

QUERY PLAN
--
Limit  (cost=1.29..25.38 rows=1 width=80) (actual time=0.770..0.771 rows=1 
loops=1)
   ->  Nested Loop  (cost=1.29..25.38 rows=1 width=80) (actual 
time=0.770..0.771 rows=1 loops=1)
 ->  Nested Loop  (cost=0.86..16.91 rows=1 width=76) (actual 
time=0.697..0.698 rows=1 loops=1)
   ->  Index Scan using binding_tid_binding_uuid_key on binding  
(cost=0.43..8.45 rows=1 width=28) (actual time=0.647..0.647 rows=1 loops=1)
 Index Cond: ((tid = 2082003407) AND (binding_uuid = 
'4f61dcd5-97a0-4098-b9ae-c1546c31b2e6'::uuid))
   ->  Index Scan using subscription_binding_idx on 
subscription_binding  (cost=0.43..8.45 rows=1 width=64) (actual 
time=0.045..0.046 rows=1 loops=1)
 Index Cond: ((tid = 2082003407) AND (binding = 
binding.binding))
 ->  Index Scan using subscription_pkey on subscription  
(cost=0.43..8.45 rows=1 width=28) (actual time=0.068..0.068 rows=1 loops=1)
   Index Cond: ((tid = 2082003407) AND (subscription = 
subscription_binding.subscription))
Planning time: 2429.682 ms
Execution time: 0.914 ms
(11 rows)

Postgres version 9.5.19

Each of the tables has between 3-4 indexes, and all the indexes include tid as 
first parameter. No partitions, no sign of a stray replication slot / 
uncommitted transaction / prepared transaction that may be holding up autovac, 
no sign of bloated indexes.

TIA!

Best regards,

David Wheeler
General Manager Bali Office
Bali T +62 361 475 2333   M +62 819 3660 9180
E [email protected]
Jl. Pura Mertasari No. 7, Sunset Road Abian Base
Kuta, Badung – Bali 80361, Indonesia
http://www.dgitsystems.com





Re: Slow planning, fast execution for particular 3-table query

2019-11-03 Thread David Wheeler
I'm not sure what "unusually large" is, but they're all < 1mb which is a little 
larger than some of our other comparable databases (mostly <300kb) but seems 
reasonable to me. 

Regards, 
 
David

On 4/11/19, 3:37 pm, "Laurenz Albe"  wrote:

    On Mon, 2019-11-04 at 03:04 +, David Wheeler wrote:
> We’re having trouble working out why the planning time for this 
particular query is slow
> (~2.5s vs 0.9ms execution time). As you can see below, there are only 3 
tables involved
> so it’s hard to imagine what decisions the planner has to make that take 
so long. After
> 5 runs the prepared-statement code kicks in and it becomes quick, but 
it’s quite
> infuriating for the first 5 runs given the execution is so much faster. 
> 
> Are you able to give any tips what might be taking so long (and how we 
might improve it)?
> 
[...]
> Planning time: 2429.682 ms
> 
> Execution time: 0.914 ms

Strange.
Are any of your catalog tables unusually large?

SELECT pg_relation_size(t.oid),
   t.relname
FROM pg_class AS t
   JOIN pg_namespace AS n ON t.relnamespace = n.oid
WHERE t.relkind = 'r'
ORDER BY pg_relation_size(t.oid) DESC
LIMIT 10;

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com





Re: Slow planning, fast execution for particular 3-table query

2019-11-03 Thread David Wheeler
>To see this issue, you have to have recently
>inserted or deleted a bunch of extremal values of the indexed join-key
>column.  And the problem only persists until those values become known
>committed-good, or known dead-to-everybody.  (Maybe you've got a
>long-running transaction somewhere, postponing the dead-to-everybody
>condition?)

There are no long-running transactions that have backend_xmin set in 
pg_stat_activity, if that's what you mean here. There are also no open prepared 
transactions or replication slots which I understand have a similar 
keeping-things-alive issue. 

These tables are biggish (hundreds of mb), but not changing so frequently that 
I'd expect large quantities of data to be inserted or deleted before autovac 
can get in there and clean it up. And certainly not in a single uncommitted 
transaction. 

I'll try reindexing each of the tables just to make sure it's not strange index 
imbalance or something causing the issue. 

Regards, 
 
David

On 4/11/19, 4:01 pm, "Tom Lane"  wrote:

David Wheeler  writes:
> We’re having trouble working out why the planning time for this
> particular query is slow (~2.5s vs 0.9ms execution time). As you can see
> below, there are only 3 tables involved so it’s hard to imagine what
> decisions the planner has to make that take so long.

I wonder whether this traces to the cost of trying to estimate the
largest/smallest value of an indexed column by looking into the index.
Normally that's pretty cheap, but if you have a lot of recently-inserted
or recently-deleted values at the end of the index, it can get painful.
AFAIR this only happens for columns that are equijoin keys, so the fact
that your query is a join is significant.

I'm not convinced that this is the problem, because it's a corner case
that few people hit.  To see this issue, you have to have recently
inserted or deleted a bunch of extremal values of the indexed join-key
column.  And the problem only persists until those values become known
committed-good, or known dead-to-everybody.  (Maybe you've got a
long-running transaction somewhere, postponing the dead-to-everybody
condition?)

> Postgres version 9.5.19

If this *is* the cause, v11 and up have a performance improvement that
you need:

https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=3ca930fc3

regards, tom lane




Re: Slow planning, fast execution for particular 3-table query

2019-11-06 Thread David Wheeler
>> I'll try reindexing each of the tables just to make sure it's not strange 
>> index imbalance or something causing the issue.
> I seen this issue few time - and reindex helps.

Alas our reindex doesn’t seem to have helped. I’m going to see if we can 
reproduce this on a non-prod environment so we can muck about a bit more. If we 
can reproduce it in a safe place, is there a tool we can use to get more info 
out of the query planner to find what it’s doing to take so long?

Regards,

David

From: Pavel Stehule 
Date: Monday, 4 November 2019 at 4:53 pm
To: David Wheeler 
Cc: Tom Lane , "[email protected]" 
, Cameron Redpath 

Subject: Re: Slow planning, fast execution for particular 3-table query



On Mon, 4 Nov 2019 at 6:17, David Wheeler  wrote:
>To see this issue, you have to have recently
>inserted or deleted a bunch of extremal values of the indexed join-key
>column.  And the problem only persists until those values become known
>committed-good, or known dead-to-everybody.  (Maybe you've got a
>long-running transaction somewhere, postponing the dead-to-everybody
>condition?)

There are no long-running transactions that have backend_xmin set in 
pg_stat_activity, if that's what you mean here. There are also no open prepared 
transactions or replication slots which I understand have a similar 
keeping-things-alive issue.

These tables are biggish (hundreds of mb), but not changing so frequently that 
I'd expect large quantities of data to be inserted or deleted before autovac 
can get in there and clean it up. And certainly not in a single uncommitted 
transaction.

I'll try reindexing each of the tables just to make sure it's not strange index 
imbalance or something causing the issue.

I seen this issue few time - and reindex helps.

Pavel


Regards,

David

On 4/11/19, 4:01 pm, "Tom Lane"  wrote:

David Wheeler  writes:
> We’re having trouble working out why the planning time for this
> particular query is slow (~2.5s vs 0.9ms execution time). As you can see
> below, there are only 3 tables involved so it’s hard to imagine what
> decisions the planner has to make that take so long.

I wonder whether this traces to the cost of trying to estimate the
largest/smallest value of an indexed column by looking into the index.
Normally that's pretty cheap, but if you have a lot of recently-inserted
or recently-deleted values at the end of the index, it can get painful.
AFAIR this only happens for columns that are equijoin keys, so the fact
that your query is a join is significant.

I'm not convinced that this is the problem, because it's a corner case
that few people hit.  To see this issue, you have to have recently
inserted or deleted a bunch of extremal values of the indexed join-key
column.  And the problem only persists until those values become known
committed-good, or known dead-to-everybody.  (Maybe you've got a
long-running transaction somewhere, postponing the dead-to-everybody
condition?)

> Postgres version 9.5.19

If this *is* the cause, v11 and up have a performance improvement that
you need:

https://git.postgresql.org/gitweb/?p=postgresql.git&a=commitdiff&h=3ca930fc3

regards, tom lane



Re: Slow planning, fast execution for particular 3-table query

2019-11-06 Thread David Wheeler
Is default_statistics_target set above default 100? I would assume that would 
reflect in the size of pg_statistic, but wanted to ask since increasing that 
from 100 to 1000 was the only time I have seen planning time explode. Are other 
queries slow to plan?

Looks like you’ve found it! Someone has set the target to 10k so that’s going 
to wildly increase planning time.
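
A quick way to check it and put it back to the default (a sketch; the
target could also have been set per column, per role or per database):

SHOW default_statistics_target;                     -- the default is 100
ALTER SYSTEM SET default_statistics_target = 100;
SELECT pg_reload_conf();
ANALYZE;                                            -- rebuild statistics at the smaller target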

Thanks for your help! And thanks to the others who chipped in along the way 😊

Regards,

David



Re: Slow planning, fast execution for particular 3-table query

2019-11-06 Thread David Rowley
On Thu, 7 Nov 2019 at 11:59, Justin Pryzby  wrote:
>
> On Mon, Nov 04, 2019 at 03:04:45AM +, David Wheeler wrote:
> > Postgres version 9.5.19
> > Each of the tables has between 3-4 indexes, and all the indexes include tid 
> > as first parameter.


> But note that index definition will be prohibited since:
>
> https://www.postgresql.org/docs/9.6/release-9-6.html
> |Disallow creation of indexes on system columns, except for OID columns 
> (David Rowley)
> |Such indexes were never considered supported, and would very possibly 
> misbehave since the system might change the system-column fields of a tuple 
> without updating indexes. However, previously there were no error checks to 
> prevent them from being created.

David will have meant the user column named "tid" rather than the
system column named "ctid".

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: Slow performance with trivial self-joins

2020-02-05 Thread David Rowley
On Thu, 6 Feb 2020 at 11:12, Adam Brusselback  wrote:
>
> > You can create a library of
> > reusable views that are small, easy-to-understand and readable. Then
> > you build them up into bigger views, and finally query from them. But
> > then you end up with lots of (hidden) self-joins.
>
> I will concur with this use case being pretty common, but also something I
> have actively avoided anywhere performance is important because of the
> lack of this optimization.
>
> Even still, I have 20+ views like that in my database.

I think the best direction to move in to push that forward would be to
go and benchmark the proposed patch and see if the overhead of
detecting the self joined relations is measurable with various queries
with varying numbers of joins.

It does not sound like it would be too much effort to look
through the rangetable for duplicate Oids and only do further
processing to attempt self-join removal if there are any. However, if that
effort happened to slow down all queries by say 5%, then perhaps it
would be a bad idea.  People's opinions don't really have much
traction for arguments on this. Unbiased and reproducible benchmarks
should be used as evidence to support discussion. Doing worst-case and
average-case benchmarks initially will save you time, as someone will
almost certainly ask if you don't do it.

(I've not been following the thread for the patch)

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services




Re: pg12 partitions show bad performance vs pg96

2020-03-08 Thread David Rowley
On Mon, 9 Mar 2020 at 05:05, Mariel Cherkassky
 wrote:
> PG12 :
>  Planning Time: 8.157 ms
>  Execution Time: 2.920 ms
> (22 rows)
>
>
> PG96 :
>  Planning time: 0.815 ms
>  Execution time: 0.158 ms
> (12 rows)

8 ms seems pretty slow to plan that query. Does the planning time
drop if you execute this multiple times in the same session? Does the
time change if you try again without any foreign keys?

The planning time for the partitioned case is certainly going to take
longer. Partitioned tables are never going to improve the times of
query planning. It's only possible that they'll improve the
performance during query execution.

For such small fast queries as the ones you've shown, it's important
to remember that more complex query plans (ones with more nodes) do
lead to longer times for executor startup and shutdown.   EXPLAIN
(without ANALYZE) will perform query planning and executor
startup/shutdown. If you enable \timing on in psql and test the
EXPLAIN performance of these queries in each version, then you might
get an idea of where the overheads are.

Additionally, you're unlikely to see performance improvements with
table partitioning unless you're accessing many rows and partitioning
allows the data locality of the rows that you are accessing to
improve. i.e accesses fewer buffers and/or improves cache hit ratios.
In PG12, if the partition pruning can be done during query planning
then the planning and executor startup overhead is much lower since
there are fewer relations to generate access paths for and fewer nodes
in the final plan. This also improves the situation during execution
as it means fewer locks to take and fewer nodes to startup/shutdown.

> As you can see, the results are better in pg96. This example only shows the 
> results for a small data set. In bigger data sets I get a bigger diff...

Can you share the results of that?

> I tried changing many postgresql.conf parameters that were added 
> (max_workers_per_gather,enable_partitionwise_join and so on..).

The former only does anything for parallel queries. None of the plans
you've shown are parallel ones.   The latter also does not count in
this case. It only counts when joining two identically partitioned
tables.

> I dont understand why in pg12 it scans all the partitions instead of the 
> relevant one..

If you'd specified a specific "device" in the query SQL, then the
query planner would know which partition to scan for that particular
device. However, since you're looking up the device in another table
and performing a join, the device is only known during query
execution. The plan nodes for the non-matching partitions do go
through executor startup, but they're not scanned during execution, as
you've seen with the "(never executed)" appearing in the EXPLAIN
ANALYZE output.  Since executor startup occurs before execution, the
device you mean is still unknown during executor startup, so the
executor must startup the nodes for all partitions that are in the
plan. Starting up a plan node is not free, but not really very
expensive either. However, the overhead of it might be quite large
proportionally in your case since the executor is doing so little
work.

The most basic guidelines for table partitioning are, don't partition
your tables unless it's a net win.   If partitioning was always
faster, we'd just have designed Postgres to implicitly partition all
of your tables for you. There are some other guidelines in [1].

[1] 
https://www.postgresql.org/docs/12/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE-BEST-PRACTICES

David




Re: pg12 partitions show bad performance vs pg96

2020-03-09 Thread David Rowley
On Tue, 10 Mar 2020 at 02:08, Mariel Cherkassky
 wrote:

> PG12 - 3 PARTITIONS
>
>  QUERY 
> PLAN
> 
>  Gather  (cost=1002.26..1563512.35 rows=10 width=44) (actual 
> time=22306.091..22309.209 rows=0 loops=1)
>Workers Planned: 2
>Workers Launched: 2
>->  Hash Join  (cost=2.26..1562511.35 rows=4 width=44) (actual 
> time=22299.412..22299.413 rows=0 loops=3)
>  Hash Cond: (da_2.device = de.id)
>  ->  Parallel Append  (cost=0.00..1562506.90 rows=814 width=37) 
> (actual time=22299.411..22299.411 rows=0 loops=3)
>->  Parallel Seq Scan on iot_data_2 da_2  
> (cost=0.00..584064.14 rows=305 width=37) (actual time=9076.535..9076.535 
> rows=0 loops=3)
>  Filter: ((metadata = 50) AND (lower(data) ~~ '50'::text))
>  Rows Removed by Filter: 12460009
>->  Parallel Seq Scan on iot_data_1 da_1  
> (cost=0.00..504948.69 rows=262 width=36) (actual time=10296.751..10296.751 
> rows=0 loops=2)
>  Filter: ((metadata = 50) AND (lower(data) ~~ '50'::text))
>  Rows Removed by Filter: 16158316
>->  Parallel Seq Scan on iot_data_0 da  (cost=0.00..473490.00 
> rows=247 width=37) (actual time=19075.081..19075.081 rows=0 loops=1)
>  Filter: ((metadata = 50) AND (lower(data) ~~ '50'::text))
>  Rows Removed by Filter: 30303339
>  ->  Hash  (cost=2.25..2.25 rows=1 width=7) (never executed)
>->  Seq Scan on iot_device de  (cost=0.00..2.25 rows=1 
> width=7) (never executed)
>  Filter: (name = '50a'::text)
>  Planning Time: 30.429 ms
>  Execution Time: 22309.364 ms
> (20 rows)

From what I can work out, the DDL you used here is:

-- you didn't seem to specify the DDL for iot_device, so I used:
create table iot_device (
id bigint primary key,
name text not null
);

insert into iot_device select x,x::Text || 'a' from generate_Series(1,100) x;

create table iot_data(id serial ,data text,metadata bigint,device
bigint references iot_device(id),primary key(id,device)) partition by
hash(device);
create table iot_data_0 partition of iot_data for values with (MODULUS
3, remainder 0);
create table iot_data_1 partition of iot_data for values with (MODULUS
3, remainder 1);
create table iot_data_2 partition of iot_data for values with (MODULUS
3, remainder 2);

insert into iot_data select
generate_series(1,1000),random()*10,random()*254,random()*99+1;
create index on iot_data(metadata,lower(data));
vacuum analyze iot_data;

In which case, you're getting a pretty different plan than I am. (I
admit that I only tried on current master and not PG12.2, however, I
see no reason that PG12.2 shouldn't produce the same plan)

I get:

# explain analyze select * from iot_data da,iot_device de where
de.name in ('50a') and de.id = da.device and da.metadata=50 and
lower(da.data) like '50';

  QUERY PLAN

 Nested Loop  (cost=0.56..28.04 rows=1 width=49) (actual
time=0.058..0.058 rows=0 loops=1)
   Join Filter: (da.device = de.id)
   ->  Seq Scan on iot_device de  (cost=0.00..2.25 rows=1 width=11)
(actual time=0.013..0.016 rows=1 loops=1)
 Filter: (name = '50a'::text)
 Rows Removed by Filter: 99
   ->  Append  (cost=0.56..25.76 rows=3 width=38) (actual
time=0.040..0.040 rows=0 loops=1)
 ->  Index Scan using iot_data_0_metadata_lower_idx on
iot_data_0 da_1  (cost=0.56..8.58 rows=1 width=38) (actual
time=0.020..0.020 rows=0 loops=1)
   Index Cond: ((metadata = 50) AND (lower(data) = '50'::text))
   Filter: (lower(data) ~~ '50'::text)
 ->  Index Scan using iot_data_1_metadata_lower_idx on
iot_data_1 da_2  (cost=0.56..8.58 rows=1 width=38) (actual
time=0.010..0.010 rows=0 loops=1)
   Index Cond: ((metadata = 50) AND (lower(data) = '50'::text))
   Filter: (lower(data) ~~ '50'::text)
 ->  Index Scan using iot_data_2_metadata_lower_idx on
iot_data_2 da_3  (cost=0.56..8.58 rows=1 width=38) (actual
time=0.009..0.009 rows=0 loops=1)
   Index Cond: ((metadata = 50) AND (lower(data) = '50'::text))
   Filter: (lower(data) ~~ '50'::text)
 Planning Time: 0.280 ms
 Execution Time: 0.094 ms
(17 rows)

Are you certain that you added an index on iot_data (metadata, lower(data)); ?
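
If you're not sure, something like this (just a quick check against the
pg_indexes view) will show which indexes the partitions actually have:

SELECT tablename, indexdef
FROM pg_indexes
WHERE tablename LIKE 'iot_data%';

On PG12, creating the index on the partitioned parent should also have
created a matching expression index on each partition, so you'd expect
to see the (metadata, lower(data)) index listed for every partition,
alongside the primary key indexes.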




Re: Duplicate WHERE condition changes performance and plan

2020-04-15 Thread David Rowley
On Thu, 16 Apr 2020 at 07:56, [email protected]  wrote:
> We have an odd issue where specifying the same where clause twice causes PG 
> to pick a much more efficent plan. We would like to know why.

> The EXPLAIN ANALYZE for both queries can be found here:-
> Query A: https://explain.depesz.com/s/lFuy
> Query B: https://explain.depesz.com/s/Jqmv

This is basically down to poor join selectivity estimation. The
duplicate not-equal clause is not removed by the planner, so its
selectivity is taken into account twice, which lowers the estimated
row count for the table named "problem". With that lower estimate,
the query planner thinks a nested loop will be the better plan,
which in this case it appears to be.
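
A contrived example (a throwaway table, not your schema) shows the
effect on the estimates:

CREATE TABLE t (x int);
INSERT INTO t SELECT g % 10 FROM generate_series(1, 100000) g;
ANALYZE t;

EXPLAIN SELECT * FROM t WHERE x <> 0;             -- roughly 90000 rows estimated
EXPLAIN SELECT * FROM t WHERE x <> 0 AND x <> 0;  -- roughly 81000 rows estimated

The second clause is redundant, but its selectivity (about 0.9) gets
multiplied in again, so the row estimate drops even though the result
is identical.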

Join selectivity estimations can use the most-common-values (MCV)
lists, which you can see by looking at the pg_stats view for the
tables and columns involved in the join condition.  ID columns are
often not good candidates for an MCV list, in which case the
ndistinct estimate is used instead.  If there's no MCV list in the
stats, check that the ndistinct estimate is reasonably accurate. If
there is an MCV list, you can make it bigger by increasing the
statistics targets on the join columns and running ANALYZE. Note:
planning can become slower when you increase the statistics targets.
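
For example (the column name below is just a placeholder for whichever
column appears in your join condition):

SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'problem' AND attname = 'some_join_col';

-- If the MCV list is present but too small, enlarge it and re-analyze:
ALTER TABLE problem ALTER COLUMN some_join_col SET STATISTICS 1000;
ANALYZE problem;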

Starting with PostgreSQL 9.6, foreign keys are also used to help with
join selectivity estimation. I see from the schema you posted that you
have a suitable foreign key, so you might want to add that to the list
of reasons to upgrade.

David



