rows selectivity overestimate for @> operator for arrays

2022-05-27 Thread Alexey Ermakov

Hello, please look into the following example:

postgres=# create table test_array_selectivity as
    select array[id]::int[] as a from generate_series(1, 10000000) gs(id);
SELECT 10000000
postgres=# explain analyze select * from test_array_selectivity where a @> array[1];
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Seq Scan on test_array_selectivity  (cost=0.00..198531.00 rows=50000 width=32) (actual time=0.023..2639.029 rows=1 loops=1)
   Filter: (a @> '{1}'::integer[])
   Rows Removed by Filter: 9999999
 Planning Time: 0.078 ms
 Execution Time: 2639.038 ms
(5 rows)


For the row estimate, rows=50000 = 10000000 * 0.005, we are using the constant
DEFAULT_CONTAIN_SEL if I'm not mistaken, and we keep using it unless there is
something in most_common_elems (MCE) in the statistics, which in this case is
empty.
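
To see whether ANALYZE collected any most-common-elements at all, pg_stats
exposes the MCE list and its frequencies; a quick check against the example
table above (here both come back NULL, i.e. the list is empty):

select most_common_elems, most_common_elem_freqs
from pg_stats
where tablename = 'test_array_selectivity' and attname = 'a';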


If we have something in the MCE list then we can use a much better estimate
(https://github.com/postgres/postgres/blob/REL_14_STABLE/src/backend/utils/adt/array_selfuncs.c#L628):

elem_selec = Min(DEFAULT_CONTAIN_SEL, minfreq / 2)

For small tables we can get away with a larger stats target for the column,
but in this example even the maximum stats target of 10000 is not enough.
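
For reference, bumping the per-column target and re-analyzing looks like this
(a sketch; as noted above, with a table this size it still doesn't produce an
MCE list):

alter table test_array_selectivity alter column a set statistics 10000;
analyze test_array_selectivity;
-- re-check the estimate afterwards
explain select * from test_array_selectivity where a @> array[1];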



If I'm reading the sources correctly, an element's frequency in the sample
should be more than 0.0063/stats_target for it to make it into the MCE list:

https://github.com/postgres/postgres/blob/REL_14_STABLE/src/backend/utils/adt/array_typanalyze.c#L471

So if we store mostly one element per array and they're almost all distinct,
then in tables with more than stats_target/0.0063 rows (~1.58M for the maximum
stats target of 10000) we'll get the 0.005 constant for selectivity, which can
be a pretty bad estimate (in a real example it was off by 6-7 orders of
magnitude).
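
To put rough numbers on that threshold (a back-of-envelope sketch using the
0.0063 figure from above):

select target,
       round(target / 0.0063) as approx_max_rows_with_nonempty_mce
from (values (100), (1000), (10000)) as t(target);

-- for target = 10000 this gives ~1587302, i.e. the ~1.58M mentioned above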


I ran into this issue twice in the last year on two different projects, so
perhaps it's not a very rare situation. In the first case increasing the stats
target helped; in the second it didn't (a table with 150M rows), and I had to
use hacks to fix the plan.


This was on PostgreSQL 12.x and 14.3.

I'm not sure if there is a simple fix for this. Maybe store and use something
like n_distinct for elements in the selectivity estimation? Or perhaps we
should store something in the MCE list anyway, even if the frequency is low
(at least one element)?



--
Thanks,
Alexey Ermakov





Re: How to monitor Postgres real memory usage

2022-05-27 Thread Justin Pryzby
On Sat, May 28, 2022 at 01:40:14AM +0800, 徐志宇徐 wrote:
> vm.swappiness=0

I think this is related to the problem.

swappiness=0 means to *never* use swap, even if that means that processes are
killed.

If you really wanted that, you should remove the swap space.

Swap is extremely slow and worth avoiding, but this doesn't let you use it at
all.  You can't even look at your swap usage as a diagnostic measure to tell if
things had been paged out at some point.

I *suspect* the problem will go away if you set swappiness=1 in /proc (and in
sysctl.conf).
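
Something like the following (a sketch; exact file locations vary by distro):

# apply immediately (writes /proc/sys/vm/swappiness)
sysctl -w vm.swappiness=1

# persist across reboots by adding this line to /etc/sysctl.conf
# (or a file under /etc/sysctl.d/), then reload with "sysctl -p"
vm.swappiness = 1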

-- 
Justin




Re: rows selectivity overestimate for @> operator for arrays

2022-05-27 Thread Tom Lane
Alexey Ermakov  writes:
> So if we store mostly one element per array and they're almost all distinct,
> then in tables with more than stats_target/0.0063 rows (~1.58M for the maximum
> stats target of 10000) we'll get the 0.005 constant for selectivity.

Yeah.  There's a comment in array_selfuncs.c about

 * TODO: this estimate probably could be improved by using the distinct
 * elements count histogram.  For example, excepting the special case of
 * "column @> '{}'", we can multiply the calculated selectivity by the
 * fraction of nonempty arrays in the column.

but I'm not sure whether that's relevant here.
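
For reference, that histogram is what pg_stats exposes as elem_count_histogram,
e.g. against the example table:

select elem_count_histogram
from pg_stats
where tablename = 'test_array_selectivity' and attname = 'a';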

One thought is that if there is a pg_statistic row but it contains
no MCE list, we could assume that the column elements are all distinct
and see what sort of estimate that leads us to.
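
As a rough illustration of what that could buy (a back-of-envelope sketch, not
the actual planner logic): with all-distinct single-element arrays, a containment
probe for one element matches about one row, so the selectivity would be on the
order of 1/reltuples rather than the flat DEFAULT_CONTAIN_SEL (0.005):

select reltuples,
       1.0 / reltuples as all_distinct_selectivity,
       0.005 as default_contain_sel
from pg_class
where relname = 'test_array_selectivity';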

regards, tom lane