Re: Different plan chosen when in lateral subquery

2017-12-06 Thread Laurenz Albe
Alex Reece wrote:
> I get a very different plan chosen when my query is in a lateral subquery vs. 
> standalone -- it doesn't use a key when joining on a table, instead opting to 
> do a hash join. Here is the query:
> 
>   select distinct on (sub.entity_id, sub.note_id, sub.series_id)
>  entity_id, note_id, series_id
>   from
>   (
>   select alloc.entity_id, alloc.note_id, alloc.series_id, 
> alloc.amount, inv.name
>   from public.portfolio_allocations alloc
>   JOIN contributions contrib on contrib.id = alloc.note_id
>   JOIN investments inv on inv.id = contrib.investment_id
>   where entity_id = '\x5787f132f50f7b03002cf835' and 
>   alloc.allocated_on <= dates.date
>   ) sub
> 
> And wrapped inside the lateral:
> 
> explain analyze
> select *
> from generate_series('2017-03-14 20:59:59.999'::TIMESTAMPTZ,  
>   current_timestamp::TIMESTAMP + INTERVAL '1 day', '24 hours') dates,
> LATERAL (
>   ...  ...
> ) lat
> 
> Run by itself, injecting a hard-coded value for dates.date, I get the expected 
> plan, which uses a key index on contributions:

[...]

>  ->  Nested Loop  (cost=0.17..14.23 rows=2 width=52) 
> (actual time=0.022..0.028 rows=2 loops=1)
>->  Index Scan using 
> portfolio_allocations_entity_id_allocated_on_idx on portfolio_allocations 
> alloc  (cost=0.09..6.05 rows=2 width=39) (actual time=0.012..0.014 
>  Index Cond: ((entity_id = 
> '\x5787f132f50f7b03002cf835'::bytea) AND (allocated_on <= '2017-03-14 
> 20:59:59.999+00'::timestamp with time zone))
>->  Index Scan using 
> contributions_id_accrue_from_idx on contributions contrib  (cost=0.08..4.09 
> rows=1 width=26) (actual time=0.005..0.005 rows=1 loops=2)
>  Index Cond: (id = alloc.note_id)

[...]

> But run in the lateral, it doesn't use the index:

[...]

> ->  Hash Join  (cost=10775.83..20355.61 rows=5724 
> width=52) (actual time=1.657..5.980 rows=6713 loops=267)
>Hash Cond: (alloc.note_id = contrib.id)
>->  Bitmap Heap Scan on portfolio_allocations 
> alloc  (cost=69.82..9628.13 rows=5724 width=39) (actual time=1.010..2.278 
> rows=6713 loops=267)
>  Recheck Cond: ((entity_id = 
> '\x5787f132f50f7b03002cf835'::bytea) AND (allocated_on <= date(dates.dates)))
>  Heap Blocks: exact=118074
>  ->  Bitmap Index Scan on 
> portfolio_allocations_entity_id_allocated_on_idx  (cost=0.00..69.53 rows=5724 
> width=0) (actual time=0.956..0.956 rows=6713 lo
>Index Cond: ((entity_id = 
> '\x5787f132f50f7b03002cf835'::bytea) AND (allocated_on <= date(dates.dates)))
>->  Hash  (cost=9464.85..9464.85 rows=354617 
> width=26) (actual time=169.792..169.792 rows=354617 loops=1)
>  Buckets: 524288  Batches: 1  Memory Usage: 
> 24296kB
>  ->  Seq Scan on contributions contrib  
> (cost=0.00..9464.85 rows=354617 width=26) (actual time=0.007..83.246 
> rows=354617 loops=1)

[...]

> I have a few questions here:
>   - Why doesn't it use the primary key index in either case?

I don't know about the first query; perhaps the primary key index is fragmented.
Compare the size of the indexes on disk.
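For example, something along these lines (just a sketch; adjust the names if your
tables or schemas differ):

select indexrelid::regclass as index_name,
       pg_size_pretty(pg_relation_size(indexrelid::regclass)) as index_size
from pg_index
where indrelid in ('contributions'::regclass, 'public.portfolio_allocations'::regclass);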
In the second query a sequential scan is used because PostgreSQL chooses a hash 
join.
That choice is made because the index scan returns 6713 rows rather than the 2
rows from the first query, probably because the date is different.

>   - Why isn't it choosing portfolio_allocations_pnsa, which seems like it 
> would prevent it from having to sort?

In a bitmap scan, the table is scanned in physical order, so the result
is not sorted in index order.
I don't know if PostgreSQL is smart enough to figure out that it could use an
index scan and preserve the order through the joins, thereby obviating the sort.
You could try to set enable_bitmapscan=off and see if things are different then.
Perhaps the cost of the slower index scan would outweigh the advantage of avoiding the sort.
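For example, in the same session (the LATERAL body elided as "..." is the same
subquery as above):

set enable_bitmapscan = off;
explain (analyze, buffers)
select *
from generate_series('2017-03-14 20:59:59.999'::TIMESTAMPTZ,
                     current_timestamp::TIMESTAMP + INTERVAL '1 day', '24 hours') dates,
LATERAL ( ... ) lat;
reset enable_bitmapscan;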

Yours,
Laurenz Albe



Re: Bitmap scan is undercosted? - overestimated correlation and cost_index

2017-12-06 Thread Justin Pryzby
On Tue, Dec 05, 2017 at 01:50:11PM -0500, Tom Lane wrote:
> Jeff Janes  writes:
> > On Dec 3, 2017 15:31, "Tom Lane"  wrote:
> >> Jeff Janes  writes:
> >>> But I do see that ties within the logical order of the column values are
> >>> broken to agree with the physical order.  That is wrong, right?  Is there
> >>> any argument that this is desirable?
> 
> >> Uh ... what do you propose doing instead?  We'd have to do something with
> >> ties, and it's not so obvious this way is wrong.
> 
> > Let them be tied.
...
> I thought some more about this.  What we really want the correlation stat
> to do is help us estimate how randomly an index-ordered scan will access
> the heap.  If the values we've sampled are all unequal then there's no
> particular issue.  However, if we have some group of equal values, we
> do not really know what order an indexscan will visit them in.  The
> existing correlation calculation is making the *most optimistic possible*
> assumption, that such a group will be visited exactly in heap order ---
> and that assumption isn't too defensible.

I'm interested in discussion regarding bitmap cost, since it would have helped
our case discussed here ~18 months ago:
https://www.postgresql.org/message-id/flat/20160524173914.GA11880%40telsasoft.com#[email protected]

...but remember: in Vitaliy's case (as opposed to mine), the index scan is
*faster* but is estimated at a higher cost than the bitmap scan (I have to keep
reminding myself).  So the rest of this discussion is about the
overly-optimistic cost estimate of index scans, which moves in the opposite
direction for this reported problem.  For the test cases I looked at, index
scans were used when RPC=1 (random_page_cost) and redundant conditions were
avoided, so I'm not sure if there's any remaining issue (but I haven't looked
at the latest cases Vitaliy sent).

> In any case, given that we do this calculation without regard
> to any specific index,

One solution is to compute stats (at least correlation) for all indices, not
just expression indexes.  I did that earlier this year while throwing around/out ideas.
https://www.postgresql.org/message-id/20170707234119.GN17566%40telsasoft.com

> We do have an idea, from the data we have, whether the duplicates are close
> together in the heap or spread all over.

I think you just mean pg_stats.correlation for all values, not just duplicates
(with the understanding that duplicates might be a large fraction of the
tuples, and carry a high weight in the correlation).

Another issue I noted in an earlier thread is that as table sizes increase, the
existing correlation computation approaches 1 for correlated insertions (like
"append-only" timestamps clustered around now()).  That happens because ANALYZE
samples only a fraction of the table, and thereby represents only large-scale
correlation, failing to an increasing degree to represent small-scale variations
between adjacent index TIDs, which have real cost (and for which the mitigation
by cache effects probably decreases with table size, too).  I think any solution
needs to handle this somehow.

Generated data demonstrating this (I reused this query so it's more complicated
than it needs to be):

[pryzbyj@database ~]$ time for sz in {,9{,9{,9{,9 ; do psql postgres 
-tc "DROP TABLE IF EXISTS t; CREATE TABLE t(i float, j int); CREATE INDEX ON 
t(i);INSERT INTO t SELECT i/9.0+pow(2,(-random())) FROM 
generate_series(1,$sz) i ORDER BY i; ANALYZE t; SELECT $sz, correlation, 
most_common_freqs[1] FROM pg_stats WHERE attname='i' AND tablename='t'"; done

  |0.187146 |  
9 |0.900629 |  
   99 |0.998772 |  
  999 |0.87 |  

Trying to keep it all in my own head: for a sufficiently large number of pages,
a bitmap scan should be preferred to an index scan, due to the reduced
random-page cost outweighing its overhead in CPU cost.  Probably by penalizing
index scans, not discounting bitmap scans.  Conceivably a correlation adjustment
could be conditionalized or weighted based on index_pages_fetched() ...
x = ln (x/99);
if (x>1) correlation/=x;

I think one could look at the fraction of duplicated index keys expected to be
returned: if we expect to return 1000 tuples, with 200 duplicates from the MCV
list, cost_index would multiply the correlation by (1 - 200/1000), meaning it
would use something closer to max_IO_cost rather than min_IO_cost.  I imagine
it'd be possible to limit this to only those MCVs which pass the quals - if none
pass, then few tuples may be returned, so apply no correction that would
(additionally) penalize the index scan.
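As a rough illustration only: the per-value frequencies such a correction would
draw on are the MCV arrays already exposed in pg_stats, e.g. for the generated
table above (where the list may well be empty, consistent with the blank
most_common_freqs[1] column in the output):

select most_common_vals, most_common_freqs
from pg_stats
where tablename = 't' and attname = 'i';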

In my tests, at one point I implemented idx_corr_fudge(), returning a value
like "fragmentation" from pgstatindex (which I'm sure is where I got the phrase
when reporting the problem).  That only uses the leaf nodes' "next" pointer,
and not the individual tuples, which probably works if there's a sufficient
number of repeated keys.
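For reference, that "fragmentation" figure comes from the pgstattuple extension;
a minimal way to look at it, assuming the default index name t_i_idx that the
CREATE INDEX ON t(i) above would have produced:

create extension if not exists pgstattuple;
select leaf_fragmentation from pgstatindex('t_i_idx');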

I think that's all for now..

Justin



Re: Bitmap scan is undercosted?

2017-12-06 Thread Vitaliy Garnashevich


What seems odd to me is that in different kinds of tests (with 
different frequency of column values):


i1 Rows Removed by Filter = 900156, 179792, 89762 (decreased a lot)
i1 buffers = 46983, 44373, 39928 (decreased, but not a lot)
i1 best case time = 756.045, 127.814, 79.492 (decreased a lot, as well 
as probably average case too)

i1 cost estimates = 67358.15, 48809.34, 46544.80 (did not decrease a lot)

i2 Rows Removed by Filter = 900835, 980350, 991389
i2 buffers = 46985, 46985, 46987
i2 best case time = 377.554, 346.481, 387.874
i2 cost estimates = 39166.34, 39247.83, 40167.34

It's odd that the large decrease in actual execution time for "i1" was not 
reflected enough in its cost estimates. The cost didn't even go below the "i2" 
costs.


I've added some logging, in order to get the actual numbers which were 
used for estimation.


--drop table if exists aaa;
--create table aaa as select floor(random()*100)::int num, (random()*10 < 1)::int flag from generate_series(1, 1000) id;
--analyze aaa;

--set enable_bitmapscan = off; set enable_indexscan = on;  set enable_seqscan = off;
--set seq_page_cost = 1.0; set random_page_cost = 1.0; set cpu_tuple_cost = 0.01; set cpu_index_tuple_cost = 0.005; set cpu_operator_cost = 0.0025;

--create index i1 on aaa  (num);
--drop index if exists i2;
--explain (analyze,verbose,costs,buffers) select * from aaa where num = 1 and flag = 1;


Index Scan using i1 on public.aaa  (cost=0.43..46697.59 rows=10641 width=8) (actual time=0.047..153.521 rows=9826 loops=1)
  Rows Removed by Filter: 89948

--drop index if exists i1;
--create index i2 on aaa  (flag);
--explain (analyze,verbose,costs,buffers) select * from aaa where num = 1 and flag = 1;

Index Scan using i2 on public.aaa  (cost=0.43..39583.11 rows=10641 width=8) (actual time=0.098..351.454 rows=9826 loops=1)
  Rows Removed by Filter: 990249


LOG:  cost_index:
    seq_page_cost=1.00, random_page_cost=1.00, cpu_tuple_cost=0.0100, cpu_index_tuple_cost=0.0050, cpu_operator_cost=0.0025, effective_cache_size=131072
    indexStartupCost=0.43, indexTotalCost=1103.94, indexSelectivity=0.01076667, indexCorrelation=0.00208220
    baserel->tuples=1033.00, baserel->pages=44248.00, baserel->allvisfrac=0.
    tuples_fetched=107667.00, pages_fetched=477.00
    max_IO_cost=44248., min_IO_cost=477., csquared=0.
    qpqual_cost.startup=0., qpqual_cost.per_tuple=0.0025, cpu_per_tuple=0.0125
    spc_seq_page_cost=1.00, spc_random_page_cost=1.00
    startup_cost=0.43, total_cost=46697.59

LOG:  cost_index:
    seq_page_cost=1.00, random_page_cost=1.00, cpu_tuple_cost=0.0100, cpu_index_tuple_cost=0.0050, cpu_operator_cost=0.0025, effective_cache_size=131072
    indexStartupCost=0.43, indexTotalCost=10123.93, indexSelectivity=0.0988, indexCorrelation=0.82505685
    baserel->tuples=1000.00, baserel->pages=44248.00, baserel->allvisfrac=0.
    tuples_fetched=988333.00, pages_fetched=4374.00
    max_IO_cost=44248., min_IO_cost=4374., csquared=0.6807
    qpqual_cost.startup=0., qpqual_cost.per_tuple=0.0025, cpu_per_tuple=0.0125
    spc_seq_page_cost=1.00, spc_random_page_cost=1.00
    startup_cost=0.43, total_cost=39583.11


Here is a breakdown of the total_cost into its components, for the i1 query and 
for the i2 query (some rounding was removed from the formula for brevity):


path->path.total_cost =
  (indexTotalCost + qpqual_cost.startup) +
  (max_IO_cost + csquared * (min_IO_cost - max_IO_cost)) +
  (cpu_tuple_cost + qpqual_cost.per_tuple) * (indexSelectivity * baserel->tuples);

path->path.total_cost =
  1103.94 + 0. +                                // 1103.94 +
  44248. + 0. * (477. - 44248.) +               // 44248.00 +
  (0.0100 + 0.0025) * (0.01076667 * 1033.00)    // 1345.84
  = 46697.78;                                   // = 46697.78;

path->path.total_cost =
  (indexTotalCost + qpqual_cost.startup) +
  (max_IO_cost + csquared * (min_IO_cost - max_IO_cost)) +
  (cpu_tuple_cost + qpqual_cost.per_tuple) * (indexSelectivity * baserel->tuples);

path->path.total_cost =
  10123.93 + 0. +                               // 10123.93 +
  44248. + 0.6807 * (4374. - 44248.) +          // 17105.77 +
  (0.0100 + 0.0025) * (0.0988 * 1000.00)        // 12354.17
  = 39583.86;                                   // = 39583.86;
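As a quick sanity check, the same arithmetic can be evaluated directly, using
the component values shown above:

select 1103.94  + (44248 + 0.0    * (477  - 44248)) + 1345.84  as i1_total,
       10123.93 + (44248 + 0.6807 * (4374 - 44248)) + 12354.17 as i2_total;
-- gives approximately 46697.78 and 39583.87, matching the totals above up to rounding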


PS.
The code used for logging:
/postgresql-9.3.1/src/backend/optimizer/path/costsize.c : cost_index()

    ereport(LOG,
            (errmsg("cost_index: \n"
                    "seq_page_cost=%.2f, random_page_cost=%.2f, cpu_tuple_cost=%.4f, cpu_index_tuple_cost=%.4f, cpu_operator_cost=%.4f, effective_cache_size=%.0f\n"
                    "indexStartupCost=%.2f, indexTotalCost=%.2f, indexSelectivity=%.8f, indexCorrelation=%.8f\n"
                    "baserel->tuples=%.2f, baserel->pages=%.2f, baserel->allvisfrac=%.8f\n"
                    "tuples_fetched=%.2f, pa

Re: insert and query performance on big string table with pg_trgm

2017-12-06 Thread Matthew Hall

> On Dec 5, 2017, at 11:23 PM, Sergei Kornilov  wrote:
> You have very slow (or busy) disks; this is not a postgresql issue. Reading 6760 * 8KB 
> in 70 seconds is a very bad result.
> 
> For better performance you need better disks, at least raid10 (not raid5). 
> Much more memory in shared_buffers can help with read performance and so 
> reduce disk utilization, but write operations will still be slow.
> 
> Sergei

Sergei,

Thanks so much for confirming; it really helps to know what to do. I thought the 
disk could be part of my issue, but I wanted to make sure I did all of the 
obvious tuning first. I have learned some very valuable things which I'll be 
able to use on future challenges like this.

Based on this advice from everyone, I'm setting up a box with more RAM, lots of 
SSDs, and RAID 10. I'll write back in a few more days after I've completed it.

I can also confirm that the previous advice about using a hash / digest based 
unique index seemed to make the loading process slower for me, not faster, 
which is an interesting result to consider for future users following this 
thread (if any). I don't yet have specific data on how much slower, because it's 
actually still going!

Sincerely,
Matthew.


Re: Bitmap scan is undercosted?

2017-12-06 Thread Jeff Janes
On Sun, Dec 3, 2017 at 1:15 PM, Vitaliy Garnashevich <
[email protected]> wrote:

> On 02/12/2017 23:17, Jeff Janes wrote:
>
> Right, so there is a cpu costing problem (which could only be fixed by
> hacking postgresql and recompiling it), but it is much smaller of a problem
> than the IO cost not being accurate due to the high hit rate.  Fixing the
> CPU costing problem is unlikely to make a difference to your real query.
> If you set the page costs to zero, what happens to your real query?
>
> I can't reproduce the exact issue on the real database any more. The query
> started to use the slow bitmap scan recently, and had been doing so for
> some time lately, but now it's switched back to use the index scan. The
> table involved in the query gets modified a lot. It has hundreds of
> millions of rows. Lots of new rows are appended to it every day, the oldest
> rows are sometimes removed. The table is analyzed at least daily. It's
> possible that statistics was updated and that caused the query to run
> differently. But I still would like to understand why that issue happened,
> and how to properly fix it, in case the issue returns.
>

While your test case displays some cost estimation issues, there is really
no reason to think that they are the same issues your real query shows.
Particularly since you said the difference was a factor of 30 in the real
case, rather than 3.  Any chance you can show EXPLAIN ANALYZE output for
the real query, but when it is acting up and when it is not?  Something in
the plans might stand out to us as the obvious problem.  On the other hand,
maybe nothing will stand out without having a replicable test case.  The
only way to know is to try.


>
>
>
>> But I doubt that the settings seq_page_cost = random_page_cost = 0.0
>> should actually be used.
>>
>
> Why not?  If your production server really has everything in memory during
> normal operation, that is the correct course of action.  If you ever
> restart the server, then you could have some unpleasant time getting it
> back up to speed again, but pg_prewarm could help with that.
>
> In the real database, not everything is in memory. There are 200GB+ of
> RAM, but DB is 500GB+. The table involved in the query itself is 60GB+ of
> data and 100GB+ of indexes. I'm running the test case in a way where all
> reads are done from RAM, only to make it easier to reproduce and to avoid
> unrelated effects.
>

Is everything that the particular query in question needs in memory, even
if other queries need things from disk?  Or does the problematic query also
need things from disk?  If the query does need to read things from disk,
the bitmap actually should be faster.  Which reinforces the idea that maybe
the issue brought up by your test case is not the same as the issue brought
up by your real case, even if they both point in the same direction.


> As far as know, costs in Postgres were designed to be relative to
> seq_page_cost, which for that reason is usually defined as 1.0. Even if
> everything would be in RAM, accesses to the pages would still not have zero
> cost. Setting 0.0 just seems too extreme, as all other non-zero costs would
> become infinitely bigger.
>

When exploring things, 0.0 certainly helps to simplify them.  Yes, 0.05 or
something similar might be better for a completely cached database.  The
problem is that it is very context dependent.  Reading a page from
shared_buffers when there is no contention from other processes for the
same page is probably less than 0.01.  If it is not in shared_buffers but
is in effective_cache_size, it is probably a few multiples of 0.01.  If
there is contention either for that specific page, or for available buffers
into which to read pages, then it could be substantially higher yet.
However, none of those are things the planner is aware of.
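For experimenting only, something like this can be tried in one session (the
0.05 is just the kind of rough guess discussed above, not a recommendation; the
query is the test-case query from this thread):

set seq_page_cost = 0.05;
set random_page_cost = 0.05;
explain (analyze, buffers) select * from aaa where num = 1 and flag = 1;
reset seq_page_cost;
reset random_page_cost;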

> If you really want to target the plan with the BitmapAnd, you should
> increase cpu_index_tuple_cost and/or cpu_operator_cost but not increase
> cpu_tuple_cost.  That is because the unselective bitmap index scan does
> not incur any cpu_tuple_cost, but does incur index_tuple and operator
> costs.  Unfortunately all other index scans in the system will also be
> skewed by such a change if you make the change system-wide.
>
> Exactly. I'd like to understand why the worse plan is being chosen, and 1)
> if it's fixable by tuning costs, to figure out the right settings which
> could be used in production, 2) if there is a bug in Postgres optimizer,
> then to bring some attention to it, so that it's eventually fixed in one of
> future releases, 3) if Postgres is supposed to work this way, then at least
> I (and people who ever read this thread) would understand it better.
>

I  would argue that it is planner "bug", (quotes because it doesn't give
wrong answers, just sub-optimal plans) but one that is very hard to pin
down, and also depends on the hardware you are running on.  Also, people
have made some optimizations to the machinery behind 

Re: Bitmap scan is undercosted? - boolean correlation

2017-12-06 Thread Jeff Janes
On Tue, Dec 5, 2017 at 10:50 AM, Tom Lane  wrote:

> Jeff Janes  writes:
> > On Dec 3, 2017 15:31, "Tom Lane"  wrote:
> >> Jeff Janes  writes:
> >>> But I do see that ties within the logical order of the column values are
> >>> broken to agree with the physical order.  That is wrong, right?  Is there
> >>> any argument that this is desirable?
>
> >> Uh ... what do you propose doing instead?  We'd have to do something with
> >> ties, and it's not so obvious this way is wrong.
>
> > Let them be tied.  If there are 10 distinct values, number the values 0
> > to 9, and all rows of a given distinct value get the same number for the
> > logical order axis.
> > Calling the correlation 0.8 when it is really 0.0 seems obviously wrong to
> > me.  Although if we switched btree to store duplicate values with tid as a
> > tie breaker, then maybe it wouldn't be as obviously wrong.
>
> I thought some more about this.  What we really want the correlation stat
> to do is help us estimate how randomly an index-ordered scan will access
> the heap.


The correlation is used in another place: estimating how much of the table
we will visit in the first place.  If the correlation is very high, then
scanning 10% of the index leaf pages means we will visit 10% of the table.
If the correlation is low, then we use Mackert and Lohman, and (in the case
of visiting 10% of the index) predict we will visit most of the table.
Assuming effective_cache_size is high, we will visit most of the table just
once, but still in a random order, because subsequent visits for the same
query will be found in the cache, rather than visiting the various pages
repeatedly and not finding them in cache each time.

In addition to estimating how much of the table we visit, we also estimate
how "sequential-like" those visits are, which is the use that you
describe.  Ideally, for that use case, we would know, for each distinct
value, how correlated the tids are with the leaf page ordering.  If the
index is freshly built, that is very high.  We visit 1/10 of the index,
which causes us to visit 100% of the table but in perfect order, plucking
1/10 of the tuples from each table page.

But visiting 100% of the table in physical order in order to pluck out 10%
of the tuples from each page is quite different than visiting 10% of the
table pages in physical order to pluck out 100% of the tuples from those
pages and 0% from the pages not visited.

...

> BTW, I disagree that "correlation = zero" is the right answer for this
> particular example.  If the btree is freshly built, then an index-order
> scan would visit all the heap pages in sequence to fetch "f" rows, and
> then visit them all in sequence again to fetch "t" rows, which is a whole
> lot better than the purely random access that zero correlation implies.
> So I think 0.8 or so is actually a perfectly reasonable answer when the
> index is fresh.  The trouble is just that it'd behoove us to derate that
> answer somewhat for the probability that the index isn't fresh.
>

But, for the case of "how much of the table do we visit at all",
correlation = zero is the right answer, even if it isn't the right answer
for "how sequentially do we visit whatever we visit".



> My first thought for a concrete fix was to use the mean position of
> a group of duplicates for the purposes of the correlation calculation,
> but on reflection that's clearly wrong.  We do have an idea, from the
> data we have, whether the duplicates are close together in the heap
> or spread all over.  Using only mean position would fail to distinguish
> those cases, but really we'd better penalize the spread-all-over case.
> I'm not sure how to do that.
>

Departing from correlations, we could also try to estimate "How many
different table pages does each index leaf page reference".  This could
capture functional dependencies which are strong, but not in the form of
linear correlations.  (The current extended statistics only captures
dependencies between user columns, not between one user column and one
system column such as table slot)

For whatever it's worth, here is my "let ties be ties" patch.

It breaks two regression tests due to plan changes, and both are cases
where maybe the plan ought to change for the very reason being discussed.
If I just put random gibberish into the correlation field, more regression
tests fail, so I think my implementation is not too far broken.

The accumulations into corr_ysum and corr_y2sum could trivially be pushed
down into the "if", and corr_xysum could as well with a little algebra.
But that seems like premature optimization for a proof-of-concept patch.


Cheers,

Jeff
diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index f952b3c..3369784 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -2309,6 +2309,8 @@ compute_scalar_stats(VacAttrStatsP stats,
 	bool		is_varwidth = (!stats->attrtype->typbyval &&
 			   stats->attrtype->typlen <