Re: REINDEXdb performance degrading gradually PG13.4

2022-06-01 Thread Praneel Devisetty
On Tue, May 31, 2022 at 9:12 PM David G. Johnston <
[email protected]> wrote:

> On Tuesday, May 31, 2022, Praneel Devisetty 
> wrote:
>
>> Initially it was processing 1000 tables per minute. Performance is
>> gradually dropping and now after 24 hr it was processing 90 tables per
>> minute.
>>
> That seems like a fairly problematic metric given the general vast
> disparities in size tables have.
>
> Building indexes is so IO heavy that the non-IO bottlenecks that exist
> likely have minimal impact on the overall time this rebuild of everything
> will take.  That said, I’ve never done anything at this scale before.  I
> wouldn’t be too surprised if per-session cache effects are coming into play
> given the number of objects involved and the assumption that each session
> used for parallelism is persistent.  I’m not sure how the parallelism works
> for managing the work queue though as it isn’t documented and I haven’t
> inspected the source code.
>

Could you please share more about per-session cache effects, or point me to
a link with more info?


Re: REINDEXdb performance degrading gradually PG13.4

2022-06-01 Thread Jeff Janes
On Tue, May 31, 2022 at 11:14 AM Praneel Devisetty <
[email protected]> wrote:

> Hi,
>
> We are trying to reindex 600k tables in a single database of size 2.7TB
> using the reindexdb utility in a shell script:
> reindexdb -v -d $dbname -h $hostname -U tkcsowner --concurrently -j $parallel -S $schema
>
What is the value of $parallel?  Are all the tables in the same schema?


> Initially it was processing 1000 tables per minute. Performance is
> gradually dropping and now after 24 hr it was processing 90 tables per
> minute.
>
I can't even get remotely close to 1000 per minute with those options, even
with only 10 single-index tables with all of them being empty.  Are you
sure that isn't 1000 per hour?
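
As a rough sanity check on the reported rates, here is a back-of-the-envelope
sketch using only the numbers from the original post (600k tables, 1000 and 90
tables per minute). It assumes a constant rate, which the report itself says
does not hold, and the function name is made up for illustration.

```shell
#!/bin/sh
# Figures taken from the original post; nothing here is measured.
tables=600000        # number of tables to reindex
fast_rate=1000       # tables per minute reported at the start
slow_rate=90         # tables per minute reported after 24 hours

# Hypothetical helper: minutes needed for the whole run at a constant
# rate (integer division, so the result is rounded down).
minutes_at_rate() {
    echo $(( $1 / $2 ))
}

fast_minutes=$(minutes_at_rate "$tables" "$fast_rate")
slow_minutes=$(minutes_at_rate "$tables" "$slow_rate")

echo "at $fast_rate/min: $fast_minutes min (about $(( fast_minutes / 60 )) h)"
echo "at $slow_rate/min: $slow_minutes min (about $(( slow_minutes / 60 )) h)"
```

At a steady 1000 per minute the whole run would take about 10 hours, so it
would have finished well before the 24 hour mark.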

Using --concurrently really hits the stats system hard (I'm not sure why).
Could you just omit that?  If it is running at 1000 per minute or even per
hour, does it really matter if the table is locked for as long as it takes
to reindex?

Cheers,

Jeff


Re: rows selectivity overestimate for @> operator for arrays

2022-06-01 Thread Jeff Janes
On Fri, May 27, 2022 at 12:19 PM Alexey Ermakov <
[email protected]> wrote:

> Hello, please look at the following example:
>
> postgres=# create table test_array_selectivity as select
> array[id]::int[] as a from generate_series(1, 1000) gs(id);
> SELECT 1000
> postgres=# explain analyze select * from test_array_selectivity where a
> @> array[1];
>                               QUERY PLAN
> ----------------------------------------------------------------------
>  Seq Scan on test_array_selectivity  (cost=0.00..198531.00 rows=5 width=32) (actual time=0.023..2639.029 rows=1 loops=1)
>    Filter: (a @> '{1}'::integer[])
>    Rows Removed by Filter: 999
>  Planning Time: 0.078 ms
>  Execution Time: 2639.038 ms
> (5 rows)
>
>
> For the row estimate rows=5 (1000*0.005) we are using the constant
> DEFAULT_CONTAIN_SEL, if I'm not mistaken,
> and we use it unless there is something in most_common_elems (MCE)
> in the statistics, which in this case is empty.
>
>
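
The rows=5 figure in the plan above can be reproduced with simple arithmetic.
A minimal sketch, assuming DEFAULT_CONTAIN_SEL is 0.005 (as the quoted message
implies) and that the estimate is just that constant multiplied by the table's
row count; the function name is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: estimated rows = table rows * default selectivity.
# awk is used because POSIX shell arithmetic is integer-only.
estimate_rows() {
    awk -v rows="$1" -v sel="$2" 'BEGIN { printf "%.0f\n", rows * sel }'
}

# 1000 rows in test_array_selectivity, 0.005 assumed for DEFAULT_CONTAIN_SEL.
estimate_rows 1000 0.005
```

The same fallback fraction applies whatever the real distribution of array
elements is, which is why the estimate is off when the MCE list is empty.
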
This was discussed before at
https://www.postgresql.org/message-id/flat/CAMkU%3D1x2W1gpEP3AQsrSA30uxQk1Sau5VDOLL4LkhWLwrOY8Lw%40mail.gmail.com

My solution was to always store at least one element in the MCE, even if
the sample size was too small to be reliable.  It would still be more
reliable than the alternative fallback assumption.  That patch still
applies and fixes your example, or improves it anyway and to an extent
directly related to the stats target size. (It also still has my bogus code
comments in which I confuse histogram with n_distinct).

Then some other people proposed more elaborate patches, and I never wrapped
my head around what they were doing differently or why the elaboration was
important.

Since you're willing to dig into the source code and since this is directly
applicable to you, maybe you would be willing to go over to pgsql-hackers
to revive, test, and review these proposals with an eye to getting them
applied in v16.

> I'm not sure if there is a simple fix for this, maybe store and use
> something like n_distinct for elements for selectivity estimation? Or
> perhaps we should store something in the MCE list anyway even if frequency
> is low (at least one element)?
>

n_distinct might be the best solution, but I don't see how it could be
adapted to the general array case.  If it could only work when the vast
majority of arrays had length 1, I think that would be too esoteric to be
accepted.

Cheers,

Jeff