Re: Very slow query performance when using CTE

2025-04-03 Thread Michael Christofides
>
> CREATE STATISTICS st_simrun_component_metadata (dependencies) ON
> sim_run_id, key FROM sim_run_component_metadata;
> ANALYZE sim_run_component_metadata;
>
> When I run this query, no statistics are returned:
>
> SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid =
> stxoid),
>
> pg_mcv_list_items(stxdmcv) m WHERE stxname =
> 'st_simrun_component_metadata';
>
> Is there something I might have missed?
>

It looks like you created "dependencies" statistics, but then searched for
"mcv" statistics. To test if mcv helps, you could drop and recreate as:
CREATE STATISTICS st_simrun_component_metadata (mcv) ...
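Spelled out, the drop-and-recreate could look like this (a sketch reusing the statistics object and table names from your original command):

```sql
-- Replace the functional-dependencies statistics with an MCV list
DROP STATISTICS IF EXISTS st_simrun_component_metadata;
CREATE STATISTICS st_simrun_component_metadata (mcv)
    ON sim_run_id, key FROM sim_run_component_metadata;
ANALYZE sim_run_component_metadata;  -- populate the new statistics
```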

> The fetch from the table is rather fast. some milliseconds. But a
> subsequent sort operations takes very long time, for the amount of records
> fetched.
>

This does not seem to be the case for the slow cases you shared (those are
dominated by several millisecond index scans that are looped over 32k
times). So I assume you're talking about the fast case? If so, there is a
Sort that takes a couple of hundred milliseconds being done on disk (~15MB)
so you might also want to look into how fast that would be in memory (via
work_mem).
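A minimal session-level experiment could look like this (the 64MB value is an assumption; anything comfortably above the ~15MB spill should be enough):

```sql
-- Raise work_mem for this session only, then re-run the query under
-- EXPLAIN (ANALYZE, BUFFERS). The Sort node should then report
-- "quicksort  Memory: ..." instead of "external merge  Disk: ...".
SET work_mem = '64MB';
-- EXPLAIN (ANALYZE, BUFFERS) <your query here>;
RESET work_mem;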


> But, just like the estimated rows in the plan, it does not match the real
> amount of available data in the table:
>

I'm not sure what you mean by this, is it only that the row estimates are
still bad?

Regards,
Michael


Re: Very slow query performance when using CTE

2025-04-03 Thread Chris Joysn
> CREATE STATISTICS st_simrun_component_metadata (dependencies) ON
>> sim_run_id, key FROM sim_run_component_metadata;
>> ANALYZE sim_run_component_metadata;
>>
>> When I run this query, no statistics are returned:
>>
>> SELECT m.* FROM pg_statistic_ext join pg_statistic_ext_data on (oid =
>> stxoid),
>>
>> pg_mcv_list_items(stxdmcv) m WHERE stxname =
>> 'st_simrun_component_metadata';
>>
>> Is there something I might have missed?
>>
>
> It looks like you created "dependencies" statistics, but then searched for
> "mcv" statistics. To test if mcv helps, you could drop and recreate as:
> CREATE STATISTICS st_simrun_component_metadata (mcv) ...
>

oh, right. Thank you. However, I increased the statistics target to 1,
and there are some statistics in pg_statistic_ext. But I am not allowed to
access pg_statistic_ext_data.
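(Side note: pg_statistic_ext_data is readable only by superusers; ordinary users can query the pg_stats_ext view instead, which applies the usual per-column privilege checks. A sketch:

```sql
-- pg_stats_ext exposes extended statistics with privilege checks applied,
-- unlike the raw pg_statistic_ext_data catalog
SELECT statistics_name, kinds, most_common_vals, most_common_freqs
FROM pg_stats_ext
WHERE statistics_name = 'st_simrun_component_metadata';
```
)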


> The fetch from the table is rather fast. some milliseconds. But a
>> subsequent sort operations takes very long time, for the amount of records
>> fetched.
>>
>
> This does not seem to be the case for the slow cases you shared (those are
> dominated by several millisecond index scans that are looped over 32k
> times). So I assume you're talking about the fast case? If so, there is a
> Sort that takes a couple of hundred milliseconds being done on disk (~15MB)
> so you might also want to look into how fast that would be in memory (via
> work_mem).
>

What I see in the plan is that there is a CTE scan with 512,960,256 rows,
consuming 30 seconds. The CTE result set has ~12,632 rows. I do not
understand what makes the CTE scan explode so drastically.
I am referring to this plan: https://explain.dalibo.com/plan/0b6f789h973833b1

When I look at this, considering 12,632 rows in the CTE:
[image: image.png]
the left join is accessing/scanning the CTE result 40,608 times, and thus
reads 512,960,256 rows from the CTE (40,608 × 12,632). There is no index on
the CTE, so a full scan is needed each time.
When I remove the CTE and join against the real table instead, the index is
used and it is much faster.

My naive assumption was that using a CTE when its result is needed multiple
times would speed up queries. But, as this example shows, that is not
always the case. It may help for smaller CTE result sets, but CTEs will
most likely be used in joins, and joins against an unindexed CTE scan have
the potential to explode.

Are there approaches to address such situations? I cannot assume the row
distribution will always be like the one I face now; the query might have
to deal with even larger sub-result sets, as well as much smaller ones.
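For context, two workarounds I am aware of (sketches only, assuming PostgreSQL 12 or later; the column names are placeholders, not the real schema): asking the planner to inline the CTE, or materializing it once into an indexed temp table.

```sql
-- Option 1: let the planner inline the CTE so the table's index can be used
WITH metadata AS NOT MATERIALIZED (
    SELECT sim_run_id, key, value        -- columns assumed for illustration
    FROM sim_run_component_metadata
)
SELECT ...;                              -- rest of the query unchanged

-- Option 2: materialize once into a temp table and index it for the join
CREATE TEMP TABLE tmp_metadata AS
    SELECT sim_run_id, key, value FROM sim_run_component_metadata;
CREATE INDEX ON tmp_metadata (sim_run_id, key);
ANALYZE tmp_metadata;
```

The trade-off: `NOT MATERIALIZED` recomputes the CTE at each reference but keeps index access; the temp table computes it once and adds an index, at the cost of an extra write step.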

KR
Chris


Re: partition table optimizer join cost misestimation

2025-04-03 Thread James Pang
  Following your suggestion to increase statistics_target (I increased it
to 1000 for aa.mmm_id and cc.sss_id, then analyzed tablea and tablec
again), the optimizer now chooses the good SQL plan.
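For the archives, the change was along these lines (a sketch using the table and column names from this thread):

```sql
-- Raise the per-column statistics target on the join columns, then
-- re-gather statistics so the planner sees the finer-grained histograms
ALTER TABLE tablea ALTER COLUMN mmm_id SET STATISTICS 1000;
ALTER TABLE tablec ALTER COLUMN sss_id SET STATISTICS 1000;
ANALYZE tablea;
ANALYZE tablec;
```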

Thanks,

James

Andrei Lepikhov wrote on Thursday, April 3, 2025, at 4:44 PM:

> On 4/3/25 10:04, James Pang wrote:
> > one more comment: for vacuum/analyze, we enable autovacuum=on; that may
> > sometimes automatically analyze part of a partitioned table directly.
> I see some incoherence in the data provided. The ranges of the joining
> columns intersect only partially:
>
> cc.sss_id: 5 100 001 101 - 7 999 999 601
> aa.mmm_id: 2 005 242 651 - 5 726 786 022
>
> So the intersection range, 5 100 001 101 - 5 726 786 022, is about 10% of
> the whole range.
> But I don't see that reflected in the column statistics you provided, and
> Postgres may not see it either.
> So, first of all, I recommend increasing default_statistics_target, or
> just statistics_target on the partitioned tables only. For such big
> tables I usually set it to at least 2500.
> Also, don't rely on autovacuum for partitioned tables - to analyze one it
> needs to lock each partition, which is highly unlikely to happen.
> So: increase the statistics target, run ANALYZE on tablea and tablec, and
> let me know what happens. Maybe after the analyze the statistics will be
> more consistent.
>
> --
> regards, Andrei Lepikhov
>