Partitioned table statistics vs autoanalyze

2021-07-22 Thread Kamil Frydel

Hi,

we faced a performance issue when joining two partitioned tables 
(declarative partitioning): the planner chooses a nested loop while we 
expect a hash join.


The query and the plan are available here: https://explain.depesz.com/s/23r9

table_1 and table_2 are hash partitioned on the volume_id column. We usually 
run ANALYZE on the partitions; we do not run ANALYZE on the partitioned 
(parent) table.
However, if we run ANALYZE on the partitioned table, the planner starts 
choosing a hash join. For comparison, the execution with a nested loop takes 
about 15 minutes, while the same query with a hash join finishes in about 
1 minute. When ANALYZE is run on the partitioned table, Postgres inserts 
statistics for the partitioned table into pg_stats (pg_statistic); before 
that, only the partitions have statistics. We suspect this is the reason for 
the bad query plan.
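
For reference, a minimal sketch of the manual step and the check described 
above (table names as in the plan; the real schema is not shown here):

-- Analyze the parent partitioned tables explicitly; this is the step that
-- populates pg_stats with parent-level ("inherited") rows:
ANALYZE table_1;
ANALYZE table_2;

-- Verify that parent-level statistics now exist:
SELECT tablename, attname, inherited, n_distinct
FROM pg_stats
WHERE tablename IN ('table_1', 'table_2')
  AND attname = 'volume_id';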


The query is executed through a cursor, so, in order to avoid a parallel 
plan, I set max_parallel_workers_per_gather to 0 during the tests.


We found that a similar issue was discussed in the context of 
inheritance: 
https://www.postgresql.org/message-id/Pine.BSO.4.64.0904161836540.11937%40leary.csoft.net 
and the conclusion was to add the following paragraph to the 'analyze' doc:


> If the table being analyzed has one or more children, ANALYZE will 
gather statistics twice: once on the rows of the parent table only, and 
a second time on the rows of the parent table with all of its children. 
This second set of statistics is needed when planning queries that 
traverse the entire inheritance tree. The autovacuum daemon, however, 
will only consider inserts or updates on the parent table itself when 
deciding whether to trigger an automatic analyze for that table. If that 
table is rarely inserted into or updated, the inheritance statistics 
will not be up to date unless you run ANALYZE manually.

(https://www.postgresql.org/docs/13/sql-analyze.html)

I would appreciate it if anyone could shed some light on the following 
questions:
1) Is the above paragraph from the docs still valid in PG 13, and does it 
apply to declarative partitioning as well? Is running ANALYZE manually on a 
partitioned table needed to get proper plans for queries on partitioned 
tables? A declaratively partitioned table is "virtual" and does not hold any 
data itself, so it seems there are no statistics to gather from the parent 
table and the statistics from the partitions should be sufficient.
2) Why does the planner need these statistics at all, given that they appear 
unused in the query plan? The plan references only the partitions, not the 
partitioned table.


PostgreSQL version number:

                                   version
------------------------------------------------------------------------------
 PostgreSQL 13.3 (Ubuntu 13.3-1.pgdg16.04+1) on x86_64-pc-linux-gnu, 
 compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609, 64-bit
(1 row)


How you installed PostgreSQL: From Ubuntu 16 repositories.

Changes made to the settings in the postgresql.conf file:

                name                 |  current_setting   |       source
-------------------------------------+--------------------+--------------------
 application_name                    | psql               | client
 auto_explain.log_analyze            | on                 | configuration file
 auto_explain.log_min_duration       | 30s                | configuration file
 auto_explain.log_nested_statements  | on                 | configuration file
 auto_explain.log_timing             | off                | configuration file
 autovacuum_freeze_max_age           | 10                 | configuration file
 autovacuum_max_workers              | 6                  | configuration file
 autovacuum_vacuum_cost_delay        | 20ms               | configuration file
 autovacuum_vacuum_cost_limit        | 2000               | configuration file
 checkpoint_completion_target        | 0.9                | configuration file
 checkpoint_timeout                  | 15min              | configuration file
 cluster_name                        | 13/main            | configuration file
 cpu_index_tuple_cost                | 0.001              | configuration file
 cpu_operator_cost                   | 0.0005             | configuration file
 cursor_tuple_fraction               | 1                  | configuration file
 DateStyle                           | ISO, MDY           | configuration file
 default_statistics_target           | 200                | configuration file
 default_text_search_config          | pg_catalog.english | configuration file
 dynamic_shared_memory_type          | posix              | configuration file
 effective_cache_size                | 193385MB           | configuration file
 effective_io_concurrency            | 1000               | configuration file
 external_pid_file                   | /var/run/postgresql/13

Re: Partitioned table statistics vs autoanalyze

2021-07-22 Thread Justin Pryzby
On Thu, Jul 22, 2021 at 01:32:51PM +0200, Kamil Frydel wrote:
> table_1 and table_2 are hash partitioned using volume_id column. Usually we
> make analyze on partitions. We do not make analyze on the partitioned table
> (parent).
> However, if we run 'analyze' on the partitioned table then planner starts
> choosing hash join. As a comparison, the execution using nested loop takes
> about 15 minutes and if it is done using hash join then the query lasts for
> about 1 minute. When running 'analyze' for the partitioned table, postgres
> inserts statistics for the partitioned table into pg_stats (pg_statistics).
> Before that, there are only statistics for partitions. We suspect that this
> is the reason for selecting bad query plan.

> updated, the inheritance statistics will not be up to date unless you run
> ANALYZE manually.
> (https://www.postgresql.org/docs/13/sql-analyze.html)
> 
> I would appreciate if anyone could shed some light on the following
> questions:
> 1) Is this above paragraph from docs still valid in PG 13 and does it apply
> to declarative partitioning as well? Is running analyze manually on a
> partitioned table needed to get proper plans for queries on partitioned
> tables? Partitioned table (in the declarative way) is ”virtual” and does not
> keep any data so it seems that there are no statistics that can be gathered
> from the table itself and statistics from partitions should be sufficient.

Up through v13, autoanalyze doesn't collect stats on parent tables (neither
declarative partitioning nor inheritance).  I agree that this doesn't seem to
be well documented.  I think it should also be mentioned here:
https://www.postgresql.org/docs/current/routine-vacuuming.html#VACUUM-FOR-STATISTICS

In v14 (which is currently in beta), autoanalyze will process the partitioned
table automatically:
https://www.postgresql.org/docs/14/release-14.html
|Autovacuum now analyzes partitioned tables (Yuzuko Hosoya, Álvaro Herrera)
|Insert, update, and delete tuple counts from partitions are now propagated to 
their parent tables so autovacuum knows when to process them.

> 2) Why does the planner need these statistics since they seem to be unused
> in the query plan. The query plan uses only partitions, not the partitioned
> table.

The "inherited" stats are used when you SELECT FROM table.  The stats for the
individual table would be needed when you SELECT FROM ONLY table (which makes
no sense for a partitioned table).
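
For example (a sketch only, using the table and column from your report):

-- Planning this uses the inherited (whole-hierarchy) statistics:
EXPLAIN SELECT * FROM table_1 WHERE volume_id = 42;

-- Planning this would use the parent's own statistics, but a declaratively
-- partitioned parent holds no rows of its own, so FROM ONLY isn't meaningful here:
EXPLAIN SELECT * FROM ONLY table_1 WHERE volume_id = 42;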

-- 
Justin




RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread [email protected]
OK. Will do another round of testing.


-Original Message-
From: David Rowley  
Sent: Thursday, July 22, 2021 00:44
To: [email protected]
Cc: Tom Lane ; Peter Geoghegan ; Justin 
Pryzby ; [email protected]
Subject: Re: Big performance slowdown from 11.2 to 13.3

On Thu, 22 Jul 2021 at 16:37, [email protected]  
wrote:
> Seems like no cigar ☹ See plan pasted below. I changed the conf as follows:
>   - hash_mem_multiplier = '2'
>   - work_mem = '1GB'

> Batches: 5  Memory Usage: 2400305kB  Disk Usage: 126560kB

You might want to keep going higher with hash_mem_multiplier until you see no 
"Disk Usage" there.  As mentioned, v11 didn't spill to disk and just used all 
the memory it pleased.  That was a bit dangerous as it could result in OOM, so 
it was fixed.
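
Something like this, for example (the values are only illustrative):

SET work_mem = '1GB';
SET hash_mem_multiplier = 4;  -- then 6, 8, ... until EXPLAIN (ANALYZE) no longer
                              -- shows "Disk Usage" on the HashAggregate nodes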

David


Re: Partitioned table statistics vs autoanalyze

2021-07-22 Thread Kamil Frydel

In v14 (which is currently in beta), autoanalyze will process the partitioned
table automatically:
https://www.postgresql.org/docs/14/release-14.html
|Autovacuum now analyzes partitioned tables (Yuzuko Hosoya, Álvaro Herrera)
|Insert, update, and delete tuple counts from partitions are now propagated to 
their parent tables so autovacuum knows when to process them.



Thank you for the prompt reply! Changes in v14 sound promising.


--
Best regards
Kamil Frydel




RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread [email protected]


-Original Message-
From: [email protected]  
Sent: Thursday, July 22, 2021 09:37
To: David Rowley 
Cc: Tom Lane ; Peter Geoghegan ; Justin 
Pryzby ; [email protected]
Subject: RE: Big performance slowdown from 11.2 to 13.3

OK. Will do another round of testing.


-Original Message-
From: David Rowley  
Sent: Thursday, July 22, 2021 00:44
To: [email protected]
Cc: Tom Lane ; Peter Geoghegan ; Justin 
Pryzby ; [email protected]
Subject: Re: Big performance slowdown from 11.2 to 13.3

On Thu, 22 Jul 2021 at 16:37, [email protected]  
wrote:
> Seems like no cigar ☹ See plan pasted below. I changed the conf as follows:
>   - hash_mem_multiplier = '2'
>   - work_mem = '1GB'

> Batches: 5  Memory Usage: 2400305kB  Disk Usage: 126560kB

You might want to keep going higher with hash_mem_multiplier until you see no 
"Disk Usage" there.  As mentioned, v11 didn't spill to disk and just used all 
the memory it pleased.  That was a bit dangerous as it could result in OOM, so 
it was fixed.

David

--

Hello all,

So, I went possibly nuclear, and still no cigar. Something's not right.
- hash_mem_multiplier = '10'
- work_mem = '1GB'

The results are
Batches: 5  Memory Usage: 2,449,457kB  Disk Usage: 105,936kB
Execution Time: 1,837,126.766 ms

It's still spilling to disk and seems to cap at about 2.5GB of memory usage in 
spite of the configuration. More importantly:
  - I don't understand how spilling roughly 100MB to disk (which seems small to 
me and should be fast on our SSD) causes the query to slow down by a factor of 10.
  - At the very least, memory consumption on 11 seemed more moderate. This 
process of ours was running several of these types of queries concurrently and I 
don't think I ever saw the machine go over 40GB of memory usage.


HashAggregate  (cost=1774568.21..1774579.21 rows=200 width=1260) (actual 
time=84860.629..1836583.909 rows=722853 loops=1)
  Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk
  Batches: 1  Memory Usage: 1277985kB
  Buffers: shared hit=46 read=169822, temp read=13144 written=23035
  ->  HashAggregate  (cost=1360804.75..1374830.63 rows=1402588 width=56) 
(actual time=27890.422..39975.074 rows=13852618 loops=1)
Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk, 
assessmenticcqa_raw.iccqar_ques_code
Batches: 5  Memory Usage: 2449457kB  Disk Usage: 105936kB
Buffers: shared hit=32 read=169819, temp read=13144 written=23035
->  Seq Scan on assessmenticcqa_raw  (cost=0.00..1256856.62 
rows=13859750 width=38) (actual time=0.053..13623.310 rows=13852618 loops=1)
  Filter: ((iccqar_ques_code)::text = ANY ('{"DEBRIDEMENT 
DATE","DEBRIDEMENT THIS VISIT","DEBRIDEMENT TYPE","DEPTH (CM)","DEPTH 
DESCRIPTION","DOES PATIENT HAVE PAIN ASSOCIATED WITH THIS WOUND?","DRAIN 
PRESENT","DRAIN TYPE","EDGE / SURROUNDING TISSUE - 
MACERATION",EDGES,EPITHELIALIZATION,"EXUDATE AMOUNT","EXUDATE 
TYPE","GRANULATION TISSUE","INDICATE OTHER TYPE OF WOUND CLOSURE","INDICATE 
TYPE","INDICATE WOUND CLOSURE","IS THIS A CLOSED SURGICAL WOUND OR SUSPECTED 
DEEP TISSUE INJURY?","LENGTH (CM)","MEASUREMENTS TAKEN","NECROTIC TISSUE 
AMOUNT","NECROTIC TISSUE TYPE",ODOR,"OTHER COMMENTS REGARDING DEBRIDEMENT 
TYPE","OTHER COMMENTS REGARDING DRAIN TYPE","OTHER COMMENTS REGARDING PAIN 
INTERVENTIONS","OTHER COMMENTS REGARDING PAIN QUALITY","OTHER COMMENTS 
REGARDING REASON MEASUREMENTS NOT TAKEN","PAIN FREQUENCY","PAIN 
INTERVENTIONS","PAIN QUALITY","PERIPHERAL TISSUE EDEMA","PERIPHERAL TISSUE 
INDURATION","REASON MEASUREMENTS NOT TAKEN","RESPONSE TO PAIN 
INTERVENTIONS",SHAPE,"SIGNS AND SYMPTOMS OF INFECTION","SKIN COLOR SURROUNDING 
WOUND",STATE,"SURFACE AREA (SQ CM)","TOTAL NECROTIC TISSUE ESCHAR","TOTAL 
NECROTIC TISSUE SLOUGH",TUNNELING,"TUNNELING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","TUNNELING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK",UNDERMINING,"UNDERMINING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","UNDERMINING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK","WIDTH (CM)","WOUND PAIN LEVEL, WHERE 0 = \"NO PAIN\" AND 10 = 
\"WORST POSSIBLE PAIN\""}'::text[]))
  Rows Removed by Filter: 171680
  Buffers: shared hit=32 read=169819
Settings: effective_cache_size = '52GB', from_collapse_limit = '24', 
hash_mem_multiplier = '10', jit = 'off', jit_above_cost = '2e+08', 
jit_inline_above_cost = '5e+08', jit_optimize_above_cost = '5e+08', 
join_collapse_limit = '24', max_parallel_workers = '20', 
max_parallel_workers_per_gather = '8', random_page_cost = '1.1', temp_buffers = 
'4G

Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Tom Lane
"[email protected]"  writes:
> So, I went possibly nuclear, and still no cigar. Something's not right.
> - hash_mem_multiplier = '10'
> - work_mem = '1GB'

> The results are
>   Batches: 5  Memory Usage: 2,449,457kB  Disk Usage: 105,936kB
>   Execution Time: 1,837,126.766 ms

> It's still spilling to disk and seems to cap at 2.5GB of memory usage in 
> spite of configuration.

That is ... weird.  Maybe you have found a bug in the spill-to-disk logic;
it's quite new after all.  Can you extract a self-contained test case that
behaves this way?

regards, tom lane




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Tom Lane
I wrote:
> "[email protected]"  writes:
>> It's still spilling to disk and seems to cap at 2.5GB of memory usage in 
>> spite of configuration.

> That is ... weird.

Oh: see get_hash_mem:

hash_mem = (double) work_mem * hash_mem_multiplier;

/*
 * guc.c enforces a MAX_KILOBYTES limitation on work_mem in order to
 * support the assumption that raw derived byte values can be stored in
 * 'long' variables.  The returned hash_mem value must also meet this
 * assumption.
 *
 * We clamp the final value rather than throw an error because it should
 * be possible to set work_mem and hash_mem_multiplier independently.
 */
if (hash_mem < MAX_KILOBYTES)
return (int) hash_mem;

return MAX_KILOBYTES;

So basically, we now have a hard restriction that hashaggs can't use
more than INT_MAX kilobytes, or approximately 2.5GB, and this use case
is getting eaten alive by that restriction.  Seems like we need to
do something about that.

regards, tom lane




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread David Rowley
On Fri, 23 Jul 2021 at 03:56, Tom Lane  wrote:
> So basically, we now have a hard restriction that hashaggs can't use
> more than INT_MAX kilobytes, or approximately 2.5GB, and this use case
> is getting eaten alive by that restriction.  Seems like we need to
> do something about that.

Hmm, math check?

postgres=# select pg_size_pretty(power(2,31)::numeric*1024);
 pg_size_pretty
----------------
 2048 GB
(1 row)

David




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Peter Geoghegan
On Thu, Jul 22, 2021 at 8:45 AM Tom Lane  wrote:
> That is ... weird.  Maybe you have found a bug in the spill-to-disk logic;
> it's quite new after all.  Can you extract a self-contained test case that
> behaves this way?

I wonder if this has something to do with the way that the input data
is clustered. I recall noticing that that could significantly alter
the behavior of HashAggs as of Postgres 13.

-- 
Peter Geoghegan




RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread [email protected]



-Original Message-
From: Tom Lane  
Sent: Thursday, July 22, 2021 11:57
To: [email protected]
Cc: David Rowley ; Peter Geoghegan ; Justin 
Pryzby ; [email protected]
Subject: Re: Big performance slowdown from 11.2 to 13.3

I wrote:
> "[email protected]"  writes:
>> It's still spilling to disk and seems to cap at 2.5GB of memory usage in 
>> spite of configuration.

> That is ... weird.

Oh: see get_hash_mem:

hash_mem = (double) work_mem * hash_mem_multiplier;

/*
 * guc.c enforces a MAX_KILOBYTES limitation on work_mem in order to
 * support the assumption that raw derived byte values can be stored in
 * 'long' variables.  The returned hash_mem value must also meet this
 * assumption.
 *
 * We clamp the final value rather than throw an error because it should
 * be possible to set work_mem and hash_mem_multiplier independently.
 */
if (hash_mem < MAX_KILOBYTES)
return (int) hash_mem;

return MAX_KILOBYTES;

So basically, we now have a hard restriction that hashaggs can't use more than 
INT_MAX kilobytes, or approximately 2.5GB, and this use case is getting eaten 
alive by that restriction.  Seems like we need to do something about that.

regards, tom lane




--

Hello!


Ah... int vs long then? I tried going even further (multiplier=16) and this 
definitely seems to be the case.

Is it fair then to deduce that the total memory usage would be 2,400,305kB + 
126,560kB? Is that roughly what V11 is consuming under the covers?

Is it also expected that a spill-over of just 100MB (on top of 2.4GB of memory 
consumption) would cause the query to collapse like this? I am still not 
visualizing how that would happen. 100MB just seems so small, and our SSD is fast.

Generating a dataset would take me a lot of time. This is a clinical database 
so I cannot reuse the current table. I would have to entirely mock the use case 
and create a dummy dataset from scratch.



HashAggregate  (cost=1774568.21..1774579.21 rows=200 width=1260) (actual 
time=94618.303..1795311.542 rows=722853 loops=1)
  Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk
  Batches: 1  Memory Usage: 1277985kB
  Buffers: shared hit=14 read=169854, temp read=15777 written=27588
  ->  HashAggregate  (cost=1360804.75..1374830.63 rows=1402588 width=56) 
(actual time=30753.022..45384.558 rows=13852618 loops=1)
Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk, 
assessmenticcqa_raw.iccqar_ques_code
Batches: 5  Memory Usage: 2400305kB  Disk Usage: 126560kB
Buffers: shared read=169851, temp read=15777 written=27588
->  Seq Scan on assessmenticcqa_raw  (cost=0.00..1256856.62 
rows=13859750 width=38) (actual time=0.110..14342.258 rows=13852618 loops=1)
  Filter: ((iccqar_ques_code)::text = ANY ('{"DEBRIDEMENT 
DATE","DEBRIDEMENT THIS VISIT","DEBRIDEMENT TYPE","DEPTH (CM)","DEPTH 
DESCRIPTION","DOES PATIENT HAVE PAIN ASSOCIATED WITH THIS WOUND?","DRAIN 
PRESENT","DRAIN TYPE","EDGE / SURROUNDING TISSUE - 
MACERATION",EDGES,EPITHELIALIZATION,"EXUDATE AMOUNT","EXUDATE 
TYPE","GRANULATION TISSUE","INDICATE OTHER TYPE OF WOUND CLOSURE","INDICATE 
TYPE","INDICATE WOUND CLOSURE","IS THIS A CLOSED SURGICAL WOUND OR SUSPECTED 
DEEP TISSUE INJURY?","LENGTH (CM)","MEASUREMENTS TAKEN","NECROTIC TISSUE 
AMOUNT","NECROTIC TISSUE TYPE",ODOR,"OTHER COMMENTS REGARDING DEBRIDEMENT 
TYPE","OTHER COMMENTS REGARDING DRAIN TYPE","OTHER COMMENTS REGARDING PAIN 
INTERVENTIONS","OTHER COMMENTS REGARDING PAIN QUALITY","OTHER COMMENTS 
REGARDING REASON MEASUREMENTS NOT TAKEN","PAIN FREQUENCY","PAIN 
INTERVENTIONS","PAIN QUALITY","PERIPHERAL TISSUE EDEMA","PERIPHERAL TISSUE 
INDURATION","REASON MEASUREMENTS NOT TAKEN","RESPONSE TO PAIN 
INTERVENTIONS",SHAPE,"SIGNS AND SYMPTOMS OF INFECTION","SKIN COLOR SURROUNDING 
WOUND",STATE,"SURFACE AREA (SQ CM)","TOTAL NECROTIC TISSUE ESCHAR","TOTAL 
NECROTIC TISSUE SLOUGH",TUNNELING,"TUNNELING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","TUNNELING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK",UNDERMINING,"UNDERMINING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","UNDERMINING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK","WIDTH (CM)","WOUND PAIN LEVEL, WHERE 0 = \"NO PAIN\" AND 10 = 
\"WORST POSSIBLE PAIN\""}'::text[]))
  Rows Removed by Filter: 171680
  Buffers: shared read=169851
Settings: effective_cache_size = '52GB', from_collapse_limit = '24', 
hash_mem_multiplier = '16', jit = 'off', jit_above_cost = '2e+08', 
jit_inline_above_cost = '5e+08', jit_optimize_above_cost = '5e+08', 
join_collapse_limi

Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread David Rowley
On Fri, 23 Jul 2021 at 04:14, Peter Geoghegan  wrote:
>
> On Thu, Jul 22, 2021 at 8:45 AM Tom Lane  wrote:
> > That is ... weird.  Maybe you have found a bug in the spill-to-disk logic;
> > it's quite new after all.  Can you extract a self-contained test case that
> > behaves this way?
>
> I wonder if this has something to do with the way that the input data
> is clustered. I recall noticing that that could significantly alter
> the behavior of HashAggs as of Postgres 13.

Isn't it more likely to be reaching the group limit rather than the
memory limit?

if (input_groups * hashentrysize < hash_mem * 1024L)
{
if (num_partitions != NULL)
*num_partitions = 0;
*mem_limit = hash_mem * 1024L;
*ngroups_limit = *mem_limit / hashentrysize;
return;
}

There are 55 aggregates on a varchar(255). I think hashentrysize is
pretty big. If it was 255*55 then only 765591 groups fit in the 10GB
of memory.
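
Spelling that estimate out (entry size is the rough 255 * 55 assumption above):

SELECT (10::bigint * 1024 * 1024 * 1024) / (255 * 55) AS groups_that_fit;
-- 10GB in bytes divided by ~14,025 bytes per entry => 765591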

David




RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread [email protected]

-Original Message-
From: Peter Geoghegan  
Sent: Thursday, July 22, 2021 12:14
To: Tom Lane 
Cc: Jeff Davis ; [email protected]; David Rowley 
; Justin Pryzby ; 
[email protected]
Subject: Re: Big performance slowdown from 11.2 to 13.3

On Thu, Jul 22, 2021 at 8:45 AM Tom Lane  wrote:
> That is ... weird.  Maybe you have found a bug in the spill-to-disk 
> logic; it's quite new after all.  Can you extract a self-contained 
> test case that behaves this way?

I wonder if this has something to do with the way that the input data is 
clustered. I recall noticing that that could significantly alter the behavior 
of HashAggs as of Postgres 13.

--
Peter Geoghegan



I could execute that test and re-cluster against the index. But I believe 
that's already done? Let me check.

Thank you,
Laurent.


Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Tom Lane
David Rowley  writes:
> On Fri, 23 Jul 2021 at 03:56, Tom Lane  wrote:
>> So basically, we now have a hard restriction that hashaggs can't use
>> more than INT_MAX kilobytes, or approximately 2.5GB, and this use case
>> is getting eaten alive by that restriction.  Seems like we need to
>> do something about that.

> Hmm, math check?

Yeah, I should have said "2GB plus palloc slop".  It doesn't surprise
me a bit that we seem to be eating another 20% on top of the nominal
limit.

I think the right fix here is to remove the cap, which will require
changing get_hash_mem to return double, and then maybe some cascading
changes --- I've not looked at its callers.

regards, tom lane




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Justin Pryzby
On Thu, Jul 22, 2021 at 04:16:34PM +, [email protected] wrote:
> Is it fair then to deduce that the total memory usage would be 2,400,305kB + 
> 126,560kB? Is this what under the covers V11 is consuming more or less?

It might be helpful to know how much RAM v11 is using.

Could you run the query with log_executor_stats=on; client_min_messages=debug;

The interesting part is this:
!   7808 kB max resident size

-- 
Justin




RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread [email protected]


-Original Message-
From: David Rowley  
Sent: Thursday, July 22, 2021 12:18
To: Peter Geoghegan 
Cc: Tom Lane ; Jeff Davis ; 
[email protected]; Justin Pryzby ; 
[email protected]
Subject: Re: Big performance slowdown from 11.2 to 13.3

On Fri, 23 Jul 2021 at 04:14, Peter Geoghegan  wrote:
>
> On Thu, Jul 22, 2021 at 8:45 AM Tom Lane  wrote:
> > That is ... weird.  Maybe you have found a bug in the spill-to-disk 
> > logic; it's quite new after all.  Can you extract a self-contained 
> > test case that behaves this way?
>
> I wonder if this has something to do with the way that the input data 
> is clustered. I recall noticing that that could significantly alter 
> the behavior of HashAggs as of Postgres 13.

Isn't it more likely to be reaching the group limit rather than the memory 
limit?

if (input_groups * hashentrysize < hash_mem * 1024L)
{
    if (num_partitions != NULL)
        *num_partitions = 0;
    *mem_limit = hash_mem * 1024L;
    *ngroups_limit = *mem_limit / hashentrysize;
    return;
}

There are 55 aggregates on a varchar(255). I think hashentrysize is pretty big. 
If it was 255*55 then only 765591 groups fit in the 10GB of memory.

David



-

Hello,

So, FYI, the query I shared is actually a simpler use case of ours 😊 We do 
have a similar pivot query over 600 columns to create a large flat table for 
analysis on an even larger table. It takes about 15 min to run on V11, with 
heavy CPU usage and no particular memory usage spike that I can detect via 
Task Manager. We have been pushing PG hard to simplify the workflows of our 
analysts and data scientists downstream.
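
For context, a heavily simplified, hypothetical sketch of that kind of pivot 
(the column name iccqar_ans_val and the FILTER formulation are invented for 
illustration; our real queries have 55 to 600 such columns and are written 
differently):

SELECT iccqar_iccassmt_fk,
       max(iccqar_ans_val) FILTER (WHERE iccqar_ques_code = 'DEPTH (CM)')  AS depth_cm,
       max(iccqar_ans_val) FILTER (WHERE iccqar_ques_code = 'LENGTH (CM)') AS length_cm,
       max(iccqar_ans_val) FILTER (WHERE iccqar_ques_code = 'WIDTH (CM)')  AS width_cm
FROM assessmenticcqa_raw
GROUP BY iccqar_iccassmt_fk;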

Thank you,
Laurent.



Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Tom Lane
I wrote:
> I think the right fix here is to remove the cap, which will require
> changing get_hash_mem to return double, and then maybe some cascading
> changes --- I've not looked at its callers.

Or, actually, returning size_t would likely make the most sense.
We'd fold the 1024L multiplier in here too instead of doing that
at the callers.

regards, tom lane




RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread [email protected]



-Original Message-
From: Justin Pryzby  
Sent: Thursday, July 22, 2021 12:23
To: [email protected]
Cc: Tom Lane ; David Rowley ; Peter 
Geoghegan ; [email protected]
Subject: Re: Big performance slowdown from 11.2 to 13.3

On Thu, Jul 22, 2021 at 04:16:34PM +, [email protected] wrote:
> Is it fair then to deduce that the total memory usage would be 2,400,305kB + 
> 126,560kB? Is this what under the covers V11 is consuming more or less?

It might be helpful to know how much RAM v11 is using.

Could you run the query with log_executor_stats=on; client_min_messages=debug;

The interesting part is this:
!   7808 kB max resident size

-- 
Justin


---

Hello Justin,

> log_executor_stats=on; client_min_messages=debug;

Would the results then come in EXPLAIN or would I need to pick something up 
from the logs?

Thank you,
Laurent.






Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Justin Pryzby
On Thu, Jul 22, 2021 at 04:30:00PM +, [email protected] wrote:
> Hello Justin,
> 
> > log_executor_stats=on; client_min_messages=debug;
> 
> Would the results then come in EXPLAIN or would I need to pick something up 
> from the logs?

If you're running with psql, and client_min_messages=debug, then it'll show up
as a debug message to the client:

ts=# SET log_executor_stats=on; SET client_min_messages=debug; explain analyze 
SELECT 1;
SET
SET
LOG:  EXECUTOR STATISTICS
DETAIL:  ! system usage stats:
!   0.11 s user, 0.000209 s system, 0.000219 s elapsed
!   [0.040608 s user, 0.020304 s system total]
!   7808 kB max resident size
...

It can but doesn't have to use "explain" - that just avoids showing the query
output, since it's not what's interesting here.

-- 
Justin




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Peter Geoghegan
On Thu, Jul 22, 2021 at 9:21 AM Tom Lane  wrote:
> Yeah, I should have said "2GB plus palloc slop".  It doesn't surprise
> me a bit that we seem to be eating another 20% on top of the nominal
> limit.

MAX_KILOBYTES is the max_val for the work_mem GUC itself, and has been
for many years. The function get_hash_mem() returns a work_mem-style
int that callers refer to as hash_mem -- the convention is that
callers pretend that there is a work_mem style GUC (called hash_mem)
that they must access by calling get_hash_mem().

I don't see how it's possible for get_hash_mem() to be unable to
return a hash_mem value that could be represented by work_mem
directly. MAX_KILOBYTES is an annoyingly low limit on Windows, where
sizeof(long) is 4. But that's nothing new.

-- 
Peter Geoghegan




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Justin Pryzby
On Thu, Jul 22, 2021 at 09:36:02AM -0700, Peter Geoghegan wrote:
> I don't see how it's possible for get_hash_mem() to be unable to
> return a hash_mem value that could be represented by work_mem
> directly. MAX_KILOBYTES is an annoyingly low limit on Windows, where
> sizeof(long) is 4. But that's nothing new.

Oh.  So the problem seems to be that:

1) In v13, HashAgg now obeys work_mem*hash_mem_multiplier;
2) Under Windows, work_mem is limited to 2GB.

-- 
Justin




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Tom Lane
Peter Geoghegan  writes:
> On Thu, Jul 22, 2021 at 9:21 AM Tom Lane  wrote:
>> Yeah, I should have said "2GB plus palloc slop".  It doesn't surprise
>> me a bit that we seem to be eating another 20% on top of the nominal
>> limit.

> MAX_KILOBYTES is the max_val for the work_mem GUC itself, and has been
> for many years.

Right.  The point here is that before v13, hash aggregation was not
subject to the work_mem limit, nor any related limit.  If you did an
aggregation requiring more than 2GB-plus-slop, it would work just fine
as long as your machine had enough RAM.  Now, the performance sucks and
there is no knob you can turn to fix it.  That's unacceptable in my book.

regards, tom lane




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Peter Geoghegan
On Thu, Jul 22, 2021 at 9:42 AM Tom Lane  wrote:
> Right.  The point here is that before v13, hash aggregation was not
> subject to the work_mem limit, nor any related limit.  If you did an
> aggregation requiring more than 2GB-plus-slop, it would work just fine
> as long as your machine had enough RAM.  Now, the performance sucks and
> there is no knob you can turn to fix it.  That's unacceptable in my book.

Oh! That makes way more sense.

I suspect David's theory about hash_agg_set_limits()'s ngroup limit is
correct. It certainly seems like a good starting point.

-- 
Peter Geoghegan




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Tom Lane
Justin Pryzby  writes:
> Oh.  So the problem seems to be that:

> 1) In v13, HashAgg now obeys work_mem*hash_mem_multiplier;
> 2) Under Windows, work_mem is limited to 2GB.

And more to the point, work_mem*hash_mem_multiplier is *also* limited
to 2GB.  We didn't think that through very carefully.  The point of
the hash_mem_multiplier feature was to allow hash aggregation to still
consume more than the work_mem limit, but we failed to free it from
this 2GB limit.

You're right though that this is Windows-only; on machines with
64-bit "long" there's less of a problem.

regards, tom lane




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Peter Geoghegan
On Thu, Jul 22, 2021 at 9:53 AM Peter Geoghegan  wrote:
> I suspect David's theory about hash_agg_set_limits()'s ngroup limit is
> correct. It certainly seems like a good starting point.

I also suspect that if Laurent set work_mem and/or hash_mem_multiplier
*extremely* aggressively, then eventually the hash agg would be
in-memory. And without actually using all that much memory.

I'm not suggesting that that is a sensible resolution to Laurent's
complaint. I'm just pointing out that it's probably not fundamentally
impossible to make the hash agg avoid spilling through tuning these
GUCs. At least I see no evidence of that right now.

-- 
Peter Geoghegan




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Tom Lane
Peter Geoghegan  writes:
> I also suspect that if Laurent set work_mem and/or hash_mem_multiplier
> *extremely* aggressively, then eventually the hash agg would be
> in-memory. And without actually using all that much memory.

No, he already tried, upthread.  The trouble is that he's on a Windows
machine, so get_hash_mem is quasi-artificially constraining the product
to 2GB.  And he needs it to be a bit more than that.  Whether the
constraint is hitting at the ngroups stage or it's related to actual
memory consumption isn't that relevant.

What I'm wondering about is whether it's worth putting in a solution
for this issue in isolation, or whether we ought to embark on the
long-ignored project of getting rid of use of "long" for any
memory-size-related computations.  There would be no chance of
back-patching something like the latter into v13, though.

regards, tom lane




RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread [email protected]
I did try work_mem = 2000MB and hash_mem_multiplier = 16 😊 It seems to plateau 
at 2GB no matter what. This is what the EXPLAIN showed:

HashAggregate  (cost=1774568.21..1774579.21 rows=200 width=1260) (actual 
time=94618.303..1795311.542 rows=722853 loops=1)
  Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk
  Batches: 1  Memory Usage: 1277985kB
  Buffers: shared hit=14 read=169854, temp read=15777 written=27588
  ->  HashAggregate  (cost=1360804.75..1374830.63 rows=1402588 width=56) 
(actual time=30753.022..45384.558 rows=13852618 loops=1)
Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk, 
assessmenticcqa_raw.iccqar_ques_code
Batches: 5  Memory Usage: 2400305kB  Disk Usage: 126560kB
Buffers: shared read=169851, temp read=15777 written=27588
->  Seq Scan on assessmenticcqa_raw  (cost=0.00..1256856.62 
rows=13859750 width=38) (actual time=0.110..14342.258 rows=13852618 loops=1)
  Filter: ((iccqar_ques_code)::text = ANY ('{"DEBRIDEMENT 
DATE","DEBRIDEMENT THIS VISIT","DEBRIDEMENT TYPE","DEPTH (CM)","DEPTH 
DESCRIPTION","DOES PATIENT HAVE PAIN ASSOCIATED WITH THIS WOUND?","DRAIN 
PRESENT","DRAIN TYPE","EDGE / SURROUNDING TISSUE - 
MACERATION",EDGES,EPITHELIALIZATION,"EXUDATE AMOUNT","EXUDATE 
TYPE","GRANULATION TISSUE","INDICATE OTHER TYPE OF WOUND CLOSURE","INDICATE 
TYPE","INDICATE WOUND CLOSURE","IS THIS A CLOSED SURGICAL WOUND OR SUSPECTED 
DEEP TISSUE INJURY?","LENGTH (CM)","MEASUREMENTS TAKEN","NECROTIC TISSUE 
AMOUNT","NECROTIC TISSUE TYPE",ODOR,"OTHER COMMENTS REGARDING DEBRIDEMENT 
TYPE","OTHER COMMENTS REGARDING DRAIN TYPE","OTHER COMMENTS REGARDING PAIN 
INTERVENTIONS","OTHER COMMENTS REGARDING PAIN QUALITY","OTHER COMMENTS 
REGARDING REASON MEASUREMENTS NOT TAKEN","PAIN FREQUENCY","PAIN 
INTERVENTIONS","PAIN QUALITY","PERIPHERAL TISSUE EDEMA","PERIPHERAL TISSUE 
INDURATION","REASON MEASUREMENTS NOT TAKEN","RESPONSE TO PAIN 
INTERVENTIONS",SHAPE,"SIGNS AND SYMPTOMS OF INFECTION","SKIN COLOR SURROUNDING 
WOUND",STATE,"SURFACE AREA (SQ CM)","TOTAL NECROTIC TISSUE ESCHAR","TOTAL 
NECROTIC TISSUE SLOUGH",TUNNELING,"TUNNELING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","TUNNELING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK",UNDERMINING,"UNDERMINING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","UNDERMINING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK","WIDTH (CM)","WOUND PAIN LEVEL, WHERE 0 = \"NO PAIN\" AND 10 = 
\"WORST POSSIBLE PAIN\""}'::text[]))
  Rows Removed by Filter: 171680
  Buffers: shared read=169851
Settings: effective_cache_size = '52GB', from_collapse_limit = '24', 
hash_mem_multiplier = '16', jit = 'off', jit_above_cost = '2e+08', 
jit_inline_above_cost = '5e+08', jit_optimize_above_cost = '5e+08', 
join_collapse_limit = '24', max_parallel_workers = '20', 
max_parallel_workers_per_gather = '8', random_page_cost = '1.1', temp_buffers = 
'4GB', work_mem = ' 2000MB'
Planning:
  Buffers: shared hit=186 read=37
Planning Time: 55.709 ms
Execution Time: 1795921.717 ms





-Original Message-
From: Peter Geoghegan  
Sent: Thursday, July 22, 2021 13:05
To: Tom Lane 
Cc: David Rowley ; [email protected]; Justin Pryzby 
; [email protected]
Subject: Re: Big performance slowdown from 11.2 to 13.3

On Thu, Jul 22, 2021 at 9:53 AM Peter Geoghegan  wrote:
> I suspect David's theory about hash_agg_set_limits()'s ngroup limit is 
> correct. It certainly seems like a good starting point.

I also suspect that if Laurent set work_mem and/or hash_mem_multiplier
*extremely* aggressively, then eventually the hash agg would be in-memory. And 
without actually using all that much memory.

I'm not suggesting that that is a sensible resolution to Laurent's complaint. 
I'm just pointing out that it's probably not fundamentally impossible to make 
the hash agg avoid spilling through tuning these GUCs. At least I see no 
evidence of that right now.

--
Peter Geoghegan


RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread [email protected]



-Original Message-
From: Justin Pryzby  
Sent: Thursday, July 22, 2021 12:36
To: [email protected]
Cc: Tom Lane ; David Rowley ; Peter 
Geoghegan ; [email protected]
Subject: Re: Big performance slowdown from 11.2 to 13.3

On Thu, Jul 22, 2021 at 04:30:00PM +, [email protected] wrote:
> Hello Justin,
> 
> > log_executor_stats=on; client_min_messages=debug;
> 
> Would the results then come in EXPLAIN or would I need to pick something up 
> from the logs?

If you're running with psql, and client_min_messages=debug, then it'll show up 
as a debug message to the client:

ts=# SET log_executor_stats=on; SET client_min_messages=debug; explain analyze 
SELECT 1;
SET
SET
LOG:  EXECUTOR STATISTICS
DETAIL:  ! system usage stats:
!   0.11 s user, 0.000209 s system, 0.000219 s elapsed
!   [0.040608 s user, 0.020304 s system total]
!   7808 kB max resident size
...

It can but doesn't have to use "explain" - that just avoids showing the query 
output, since it's not what's interesting here.

--
Justin


-

Justin,

I tried this but I'm not seeing the max resident size in the output.

Pepper=# SET log_executor_stats=on; SET client_min_messages=debug; explain 
analyze SELECT 1;
SET
SET
LOG:  EXECUTOR STATISTICS
DETAIL:  ! system usage stats:
!   0.00 s user, 0.00 s system, 0.000109 s elapsed
!   [494.640625 s user, 19.171875 s system total]
 QUERY PLAN
------------------------------------------------------------------------------
 Result  (cost=0.00..0.01 rows=1 width=4) (actual time=0.002..0.002 rows=1 
loops=1)
 Planning Time: 0.041 ms
 Execution Time: 0.012 ms
(3 rows)


For my query:


LOG:  EXECUTOR STATISTICS
DETAIL:  ! system usage stats:
!   169.625000 s user, 5.843750 s system, 175.490088 s elapsed
!   [494.640625 s user, 19.171875 s system total]
 HashAggregate  (cost=1764285.18..1764296.18 rows=200 width=1260) (actual 
time=86323.813..174737.442 rows=723659 loops=1)
   Group Key: t.iccqa_iccassmt_fk
   Buffers: shared hit=364 read=170293, temp written=83229
   CTE t
 ->  HashAggregate  (cost=1343178.39..1356985.17 rows=1380678 width=56) 
(actual time=22594.053..32519.573 rows=13865785 loops=1)
   Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk, 
assessmenticcqa_raw.iccqar_ques_code
   Buffers: shared hit=364 read=170293
   ->  Seq Scan on assessmenticcqa_raw  (cost=0.00..1240682.76 
rows=13666084 width=38) (actual time=0.170..10714.598 rows=13865785 loops=1)
 Filter: ((iccqar_ques_code)::text = ANY ('{"DEBRIDEMENT 
DATE","DEBRIDEMENT THIS VISIT","DEBRIDEMENT TYPE","DEPTH (CM)","DEPTH 
DESCRIPTION","DOE
S PATIENT HAVE PAIN ASSOCIATED WITH THIS WOUND?","DRAIN PRESENT","DRAIN 
TYPE","EDGE / SURROUNDING TISSUE - MACERATION",EDGES,EPITHELIALIZATION,"EXUDATE 
AMOUNT",
"EXUDATE TYPE","GRANULATION TISSUE","INDICATE OTHER TYPE OF WOUND 
CLOSURE","INDICATE TYPE","INDICATE WOUND CLOSURE","IS THIS A CLOSED SURGICAL 
WOUND OR SUSPECTE
D DEEP TISSUE INJURY?","LENGTH (CM)","MEASUREMENTS TAKEN","NECROTIC TISSUE 
AMOUNT","NECROTIC TISSUE TYPE",ODOR,"OTHER COMMENTS REGARDING DEBRIDEMENT 
TYPE","OTHE
R COMMENTS REGARDING DRAIN TYPE","OTHER COMMENTS REGARDING PAIN 
INTERVENTIONS","OTHER COMMENTS REGARDING PAIN QUALITY","OTHER COMMENTS 
REGARDING REASON MEASUREM
ENTS NOT TAKEN","PAIN FREQUENCY","PAIN INTERVENTIONS","PAIN 
QUALITY","PERIPHERAL TISSUE EDEMA","PERIPHERAL TISSUE INDURATION","REASON 
MEASUREMENTS NOT TAKEN","R
ESPONSE TO PAIN INTERVENTIONS",SHAPE,"SIGNS AND SYMPTOMS OF INFECTION","SKIN 
COLOR SURROUNDING WOUND",STATE,"SURFACE AREA (SQ CM)","TOTAL NECROTIC TISSUE 
ESCHAR
","TOTAL NECROTIC TISSUE SLOUGH",TUNNELING,"TUNNELING SIZE(CM)/LOCATION - 12 - 
3 O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","TUNNELING 
SIZE(CM)/LO
CATION - 6 - 9 O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK",UNDERMINING,"UNDERMINING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","UNDERMINING SIZE(CM)/LOC
ATION - 3 - 6 O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 6 - 9 
O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 9 - 12 O''CLOCK","WIDTH (CM)","WOUND 
PAIN LEVEL, WHERE
 0 = \"NO PAIN\" AND 10 = \"WORST POSSIBLE PAIN\""}'::text[]))
 Rows Removed by Filter: 172390
 Buffers: shared hit=364 read=170293
   ->  CTE Scan on t  (cost=0.00..27613.56 rows=1380678 width=552) (actual 
time=22594.062..40248.874 rows=13865785 loops=1)
 Buffers: shared hit=364 read=170293, temp written=83229
 Planning Time: 0.728 ms
 Execution Time: 175482.904 ms
(15 rows)






RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread [email protected]


-Original Message-
From: Tom Lane  
Sent: Thursday, July 22, 2021 12:42
To: Peter Geoghegan 
Cc: David Rowley ; [email protected]; Justin Pryzby 
; [email protected]
Subject: Re: Big performance slowdown from 11.2 to 13.3

Peter Geoghegan  writes:
> On Thu, Jul 22, 2021 at 9:21 AM Tom Lane  wrote:
>> Yeah, I should have said "2GB plus palloc slop".  It doesn't surprise 
>> me a bit that we seem to be eating another 20% on top of the nominal 
>> limit.

> MAX_KILOBYTES is the max_val for the work_mem GUC itself, and has been 
> for many years.

Right.  The point here is that before v13, hash aggregation was not subject to 
the work_mem limit, nor any related limit.  If you did an aggregation requiring 
more than 2GB-plus-slop, it would work just fine as long as your machine had 
enough RAM.  Now, the performance sucks and there is no knob you can turn to 
fix it.  That's unacceptable in my book.

regards, tom lane


---

Hello all,

As a user of PG, we have taken pride over the last few years in tuning the heck 
out of the system and getting great performance compared to alternatives like 
SQLServer. The customers we work with typically have data centers and are 
overwhelmingly Windows shops: we won the battle to deploy a complex operational 
system on PG vs SQLServer, but Linux vs Windows was still a bridge too far for 
many. I am surprised that this limitation, introduced after V11, hasn't caused 
issues elsewhere, though. Are we doing things that are so far out of the normal? 
Are we early in pushing V13 to full production? 😊 Doing analytics with pivoted 
tables with hundreds of columns is not uncommon in our world.

As for the three other requests from the team:

Clustering:
==
I re-clustered the table on the index that drives the pivot logic but I didn't 
see any change:

cluster verbose assessmenticcqa_raw using 
assessmenticcqa_raw_idx_iccqar_assmt_ques;

HashAggregate  (cost=1774465.36..1774476.36 rows=200 width=1260) (actual 
time=80848.591..1763443.586 rows=722853 loops=1)
  Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk
  Batches: 1  Memory Usage: 1277985kB
  Buffers: shared hit=369 read=169577, temp read=15780 written=27584
  ->  HashAggregate  (cost=1360748.50..1374772.80 rows=1402430 width=56) 
(actual time=25475.554..38256.923 rows=13852618 loops=1)
Group Key: assessmenticcqa_raw.iccqar_iccassmt_fk, 
assessmenticcqa_raw.iccqar_ques_code
Batches: 5  Memory Usage: 2400305kB  Disk Usage: 126552kB
Buffers: shared hit=352 read=169577, temp read=15780 written=27584
->  Seq Scan on assessmenticcqa_raw  (cost=0.00..1256812.09 
rows=13858188 width=38) (actual time=0.085..11914.135 rows=13852618 loops=1)
  Filter: ((iccqar_ques_code)::text = ANY ('{"DEBRIDEMENT 
DATE","DEBRIDEMENT THIS VISIT","DEBRIDEMENT TYPE","DEPTH (CM)","DEPTH 
DESCRIPTION","DOES PATIENT HAVE PAIN ASSOCIATED WITH THIS WOUND?","DRAIN 
PRESENT","DRAIN TYPE","EDGE / SURROUNDING TISSUE - 
MACERATION",EDGES,EPITHELIALIZATION,"EXUDATE AMOUNT","EXUDATE 
TYPE","GRANULATION TISSUE","INDICATE OTHER TYPE OF WOUND CLOSURE","INDICATE 
TYPE","INDICATE WOUND CLOSURE","IS THIS A CLOSED SURGICAL WOUND OR SUSPECTED 
DEEP TISSUE INJURY?","LENGTH (CM)","MEASUREMENTS TAKEN","NECROTIC TISSUE 
AMOUNT","NECROTIC TISSUE TYPE",ODOR,"OTHER COMMENTS REGARDING DEBRIDEMENT 
TYPE","OTHER COMMENTS REGARDING DRAIN TYPE","OTHER COMMENTS REGARDING PAIN 
INTERVENTIONS","OTHER COMMENTS REGARDING PAIN QUALITY","OTHER COMMENTS 
REGARDING REASON MEASUREMENTS NOT TAKEN","PAIN FREQUENCY","PAIN 
INTERVENTIONS","PAIN QUALITY","PERIPHERAL TISSUE EDEMA","PERIPHERAL TISSUE 
INDURATION","REASON MEASUREMENTS NOT TAKEN","RESPONSE TO PAIN 
INTERVENTIONS",SHAPE,"SIGNS AND SYMPTOMS OF INFECTION","SKIN COLOR SURROUNDING 
WOUND",STATE,"SURFACE AREA (SQ CM)","TOTAL NECROTIC TISSUE ESCHAR","TOTAL 
NECROTIC TISSUE SLOUGH",TUNNELING,"TUNNELING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","TUNNELING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","TUNNELING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK",UNDERMINING,"UNDERMINING SIZE(CM)/LOCATION - 12 - 3 
O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 3 - 6 O''CLOCK","UNDERMINING 
SIZE(CM)/LOCATION - 6 - 9 O''CLOCK","UNDERMINING SIZE(CM)/LOCATION - 9 - 12 
O''CLOCK","WIDTH (CM)","WOUND PAIN LEVEL, WHERE 0 = \"NO PAIN\" AND 10 = 
\"WORST POSSIBLE PAIN\""}'::text[]))
  Rows Removed by Filter: 171680
  Buffers: shared hit=352 read=169577
Settings: effective_cache_size = '52GB', from_collapse_limit = '24', 
hash_mem_multiplier = '4', jit = 'off', jit_above_cost = '2e+08', 
jit_inline_above_cost = '5e+08', jit_optimize_above_cost = '5e+08', 
join_collapse_limit = '24', max_parallel_workers = '20', 
max_parallel_workers_per_gather = '8', random_page_cost = '1.1', temp_buffers = 
'4GB', work_mem = '1GB'
Planning:
  Buffers: shared hit=100 r

Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Peter Geoghegan
On Thu, Jul 22, 2021 at 10:11 AM Tom Lane  wrote:
> No, he already tried, upthread.  The trouble is that he's on a Windows
> machine, so get_hash_mem is quasi-artificially constraining the product
> to 2GB.  And he needs it to be a bit more than that.  Whether the
> constraint is hitting at the ngroups stage or it's related to actual
> memory consumption isn't that relevant.

Somehow I missed that part.

> What I'm wondering about is whether it's worth putting in a solution
> for this issue in isolation, or whether we ought to embark on the
> long-ignored project of getting rid of use of "long" for any
> memory-size-related computations.  There would be no chance of
> back-patching something like the latter into v13, though.

+1. Even if we assume that Windows is a low priority platform, in the
long run it'll be easier to make it more like every other platform.

The use of "long" is inherently suspect to me. It signals that the
programmer wants something wider than "int", even if the standard
doesn't actually require that "long" be wider. This seems to
contradict what we know to be true for Postgres, which is that in
general it's unsafe to assume that long is int64. It's not just
work_mem related calculations. There is also code like logtape.c,
which uses long for block numbers -- that also exposes us to risk on
Windows.

By requiring int64 be used instead of long, we don't actually increase
risk for non-Windows platforms to any significant degree. I'm pretty
sure that "long" means int64 on non-Windows 64-bit platforms anyway.

-- 
Peter Geoghegan




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Justin Pryzby
On Thu, Jul 22, 2021 at 05:26:26PM +, [email protected] wrote:
> I tried this but not seeing max resident size data output.

Oh.  Apparently, that's not supported under windows..

#if defined(HAVE_GETRUSAGE)
appendStringInfo(&str,
 "!\t%ld kB max resident size\n",




RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread [email protected]


-Original Message-
From: Justin Pryzby  
Sent: Thursday, July 22, 2021 13:29
To: [email protected]
Cc: Tom Lane ; David Rowley ; Peter 
Geoghegan ; [email protected]
Subject: Re: Big performance slowdown from 11.2 to 13.3

On Thu, Jul 22, 2021 at 05:26:26PM +, [email protected] wrote:
> I tried this but not seeing max resident size data output.

Oh.  Apparently, that's not supported under windows..

#if defined(HAVE_GETRUSAGE)
appendStringInfo(&str,
 "!\t%ld kB max resident size\n",




Hello,

Damn... I know Windows is a lower priority, and this is yet another issue, but 
in Healthcare, Windows is so prevalent everywhere...

Thank you,
Laurent.




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Tom Lane
Peter Geoghegan  writes:
> On Thu, Jul 22, 2021 at 10:11 AM Tom Lane  wrote:
>> What I'm wondering about is whether it's worth putting in a solution
>> for this issue in isolation, or whether we ought to embark on the
>> long-ignored project of getting rid of use of "long" for any
>> memory-size-related computations.  There would be no chance of
>> back-patching something like the latter into v13, though.

> By requiring int64 be used instead of long, we don't actually increase
> risk for non-Windows platforms to any significant degree. I'm pretty
> sure that "long" means int64 on non-Windows 64-bit platforms anyway.

Well, what we really ought to be using is size_t (a/k/a Size), at least
for memory-space-related calculations.  I don't have an opinion right
now about what logtape.c ought to use.  I do agree that avoiding "long"
altogether would be a good ultimate goal.

In the short term though, the question is whether we want to regard this
hashagg issue as something we need a fix for in v13/v14.  The fact that
it's Windows-only makes it slightly less pressing in my mind, but it's
still a regression that some people are going to hit.

regards, tom lane




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Peter Geoghegan
On Thu, Jul 22, 2021 at 10:33 AM [email protected]
 wrote:
> Damn... I know Windows is a lower priority, and this is yet another issue, 
> but in Healthcare, Windows is so prevalent everywhere...

To be clear, I didn't actually say that. I said that it doesn't matter
either way, as far as addressing this long standing "int64 vs long"
issue goes.

-- 
Peter Geoghegan




RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread [email protected]


-Original Message-
From: Peter Geoghegan  
Sent: Thursday, July 22, 2021 13:36
To: [email protected]
Cc: Justin Pryzby ; Tom Lane ; David 
Rowley ; [email protected]
Subject: Re: Big performance slowdown from 11.2 to 13.3

On Thu, Jul 22, 2021 at 10:33 AM [email protected] 
 wrote:
> Damn... I know Windows is a lower priority, and this is yet another issue, 
> but in Healthcare, Windows is so prevalent everywhere...

To be clear, I didn't actually say that. I said that it doesn't matter either 
way, as far as addressing this long standing "int64 vs long"
issue goes.

--
Peter Geoghegan


Yes, agreed Peter... The "lower priority" issue was mentioned, but not in terms 
of the applicability of the fix overall. Personally, I would prefer going the 
size_t route vs int/long/int64 in C/C++. Of course, as a user, I'd love a patch 
on V13 and something cleaner in V14.

Thanks,
Laurent.




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Peter Geoghegan
On Thu, Jul 22, 2021 at 10:35 AM Tom Lane  wrote:
> Well, what we really ought to be using is size_t (a/k/a Size), at least
> for memory-space-related calculations.  I don't have an opinion right
> now about what logtape.c ought to use.  I do agree that avoiding "long"
> altogether would be a good ultimate goal.

I assume that we often use "long" in contexts where a signed integer
type is required. Maybe this is not true in the case of the work_mem
style calculations. But I know that it works that way in logtape.c,
where -1 is a sentinel value.

We already use int64 (not size_t) in tuplesort.c for roughly the same
reason: LACKMEM() needs to work with negative values, to handle
certain edge cases.

> In the short term though, the question is whether we want to regard this
> hashagg issue as something we need a fix for in v13/v14.  The fact that
> it's Windows-only makes it slightly less pressing in my mind, but it's
> still a regression that some people are going to hit.

True. I worry about the potential for introducing new bugs on Windows
by backpatching a fix for this. Technically this restriction existed
in every other work_mem consumer on Windows. Of course this won't
matter much to users like Laurent.

-- 
Peter Geoghegan




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Ranier Vilela
On Thu, Jul 22, 2021 at 14:28, Peter Geoghegan  wrote:

> On Thu, Jul 22, 2021 at 10:11 AM Tom Lane  wrote:
> > No, he already tried, upthread.  The trouble is that he's on a Windows
> > machine, so get_hash_mem is quasi-artificially constraining the product
> > to 2GB.  And he needs it to be a bit more than that.  Whether the
> > constraint is hitting at the ngroups stage or it's related to actual
> > memory consumption isn't that relevant.
>
> Somehow I missed that part.
>
> > What I'm wondering about is whether it's worth putting in a solution
> > for this issue in isolation, or whether we ought to embark on the
> > long-ignored project of getting rid of use of "long" for any
> > memory-size-related computations.  There would be no chance of
> > back-patching something like the latter into v13, though.
>
> +1. Even if we assume that Windows is a low priority platform, in the
> long run it'll be easier to make it more like every other platform.
>
> The use of "long" is inherently suspect to me. It signals that the
> programmer wants something wider than "int", even if the standard
> doesn't actually require that "long" be wider. This seems to
> contradict what we know to be true for Postgres, which is that in
> general it's unsafe to assume that long is int64. It's not just
> work_mem related calculations. There is also code like logtape.c,
> which uses long for block numbers -- that also exposes us to risk on
> Windows.
>
> By requiring int64 be used instead of long, we don't actually increase
> risk for non-Windows platforms to any significant degree. I'm pretty
> sure that "long" means int64 on non-Windows 64-bit platforms anyway.
>
I wonder if similar issues don't arise from this [1].

(b/src/backend/optimizer/path/costsize.c)
cost_tuplesort uses *long* to store sort_mem_bytes.
I suggested switching to int64, but obviously to no avail.

+1 to switch long to int64.

regards,
Ranier Vilela

[1]
https://www.postgresql.org/message-id/CAApHDvqhUYHYGmovoGWJQ1%2BZ%2B50Mz%3DPV6bW%3DQYEh3Z%2BwZTufPQ%40mail.gmail.com


Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Alvaro Herrera
On 2021-Jul-22, [email protected] wrote:

> Yes, agreed Peter... The "lower priority" issue was mentioned, but not
> in terms of the applicability of the fix overall. Personally, I would
> prefer going the size_t route vs int/long/int64 in C/C++/. Of course,
> as a user, I'd love a patch on V13 and something cleaner in V14.

Just to clarify our terminology here.  "A patch" means any kind of
change to the source code, regardless of its cleanliness or
applicability to versions deemed stable.  You can have one patch which
is a ugly hack for a stable version that can't have invasive changes,
and another patch which is a clean, more maintainable version of a
totally different fix for the version in development.  We use the same
word, "patch", for both types of source code changes.

-- 
Álvaro Herrera PostgreSQL Developer  —  https://www.EnterpriseDB.com/




Re: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread Vijaykumar Jain
Just asking, I may be completely wrong.

Is this query parallel safe?
Can we force parallel workers, by setting a low parallel_setup_cost or
otherwise, to make use of scatter-gather and Partial HashAggregate(s)?
I am just assuming that more workers doing things in parallel would require
less disk spill per hash aggregate (or partial hash aggregate?), plus the
gather at the end.

I did some runs in my demo environment, not with the same query but with some
GROUP BY aggregates over around 25M rows, and it showed reasonable results,
not too far off. This was PG 14 on Ubuntu.
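
A rough sketch of the settings I mean (values are illustrative only, not a
tested recommendation):

SET parallel_setup_cost = 10;            -- make parallel plans cheap to start
SET parallel_tuple_cost = 0.001;         -- and cheap per row
SET max_parallel_workers_per_gather = 8;
-- then re-run EXPLAIN (ANALYZE) and look for Gather + Partial HashAggregate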


RE: Big performance slowdown from 11.2 to 13.3

2021-07-22 Thread [email protected]
I am not sure I understand this parameter well enough, but it is currently at 
its default value of 1000. I have read Robert's post 
(http://rhaas.blogspot.com/2018/06/using-forceparallelmode-correctly.html) and 
could play with those parameters, but I'm unsure whether what you are describing 
will unlock this 2GB limit.


From: Vijaykumar Jain 
Sent: Thursday, July 22, 2021 16:32
To: [email protected]
Cc: Justin Pryzby ; [email protected]
Subject: Re: Big performance slowdown from 11.2 to 13.3

Just asking, I may be completely wrong.

is this query parallel safe?
can we force parallel workers, by setting low parallel_setup_cost or otherwise 
to make use of scatter gather and Partial HashAggregate(s)?
I am just assuming more workers doing things in parallel, would require less 
disk spill per hash aggregate (or partial hash aggregate ?) and the scatter 
gather at the end.

I did some runs in my demo environment, not with the same query, some group by 
aggregates  with around 25M rows, and it showed reasonable results, not too off.
this was pg14 on ubuntu.