Re: Too few rows expected by Planner on partitioned tables

2020-07-22 Thread Justin Pryzby
On Wed, Jul 22, 2020 at 06:33:17AM +, Julian Wolf wrote:
> Hello Justin,
> 
> 
> thank you very much for your fast response.
> 
> > Is there a correlation between daterange and spacial_feature_id ?
> 
> I am not entirely sure, what you mean by that. Basically, no, they are not 
> correlated - spatial features are places on a map, date ranges are time 
> periods. But, as they are both part of a primary key in this particular 
> table, they are correlated in some way as to be a part of uniquely 
> identifying a row.
> 
> 
> > Are the estimates good if you query on *only* daterange?  
> > spacial_feature_id ?
> Unfortunately no, they are not:

I checked and found that range types don't have "normal" statistics, and in
particular seem to use a poor ndistinct estimate..

/* Estimate that non-null values are unique */
stats->stadistinct = -1.0 * (1.0 - stats->stanullfrac);

You could try to cheat and hardcode a different ndistinct that's "less wrong"
by doing something like this:

ALTER TABLE t ALTER a SET (N_DISTINCT=-0.001);  ANALYZE t;

Maybe a better way is to create an index ON: lower(range),upper(range)
And then query: WHERE (lower(a),upper(a)) = (1,112);

Since you'd be storing the values separately in the index anyway, maybe this
means that range types won't work well for you for primary, searchable columns.

But if you're stuck with the schema, another kludge, if you want to do
something extra weird, is to remove statistics entirely by disabling
autoanalyze on the table and then manually run ANALYZE(columns) where columns
doesn't include the range column.  You'd have to remove the stats:

begin; DELETE FROM pg_statistic s USING pg_attribute a WHERE 
s.staattnum=a.attnum AND s.starelid=a.attrelid AND starelid='t'::regclass AND 
a.attname='a';

-- 
Justin




Re: Too few rows expected by Planner on partitioned tables

2020-07-22 Thread Justin Pryzby
On Tue, Jul 21, 2020 at 01:09:22PM +, Julian Wolf wrote:
> Hello,
> 
> A description of what you are trying to achieve and what results you expect:
> Our database is growing on a daily basis by about 2.5million rows per table 
> (2 at the moment). Because of that, we decided to partition the data, 
> especially, as we are pre-aggregating the data for weeks, months, quarters 
> and years. Every aggregation is stored in a separate partition:
> 
...
> Our problem is, that the planner always predicts one row to be returned, 
> although only a part of the primary key is queried. This problem exceeds 
> feasibility of performance rapidly - a query only involving a few days 
> already takes dozens of seconds. All tables are analyzed and pg_stats looks 
> reasonable IMHO.
...
> PRIMARY KEY ( daterange, spatial_feature_id, visitor_profile_id, 
> activity_type_combination_id,
>  activity_chain_id),
...
> ) PARTITION BY LIST (daterange);

> schemaname relname n_live_tup
> mobility_insights location_statistics_y2019m03d 23569853
> mobility_insights location_statistics_y2019m03w 19264373
> mobility_insights location_statistics_y2019m03 18105295

> Aggregate  (cost=2.79..2.80 rows=1 width=8) (actual time=143.073..143.073 
> rows=1 loops=1)
>   Buffers: shared hit=67334
>   ->  Index Scan using location_statistics_y2019m03w_pkey on 
> location_statistics_y2019m03w st  (cost=0.56..2.78 rows=1 width=8) (actual 
> time=0.026..117.284 rows=516277 loops=1)
> Index Cond: ((daterange = '[2019-03-04,2019-03-11)'::daterange) AND 
> (spatial_feature_id = 12675))
> Buffers: shared hit=67334

I guess this isn't actually the problem query, since it takes 143ms and not
dozens of seconds.  I don't know what is the problem query, but maybe it might
help to create an new index on spatial_feature_id, which could be scanned
rather than scanning the unique index.

Also, if daterange *and* spatial_feature_id are always *both* included, then
this might work:

postgres=# CREATE STATISTICS t_stats (mcv) ON daterange,spatial_feature_id FROM 
t ; ANALYZE t;

-- 
Justin




Re: Too few rows expected by Planner on partitioned tables

2020-07-22 Thread Julian Wolf
Hello Justin,


thank you very much for your fast response.

> Is there a correlation between daterange and spacial_feature_id ?

I am not entirely sure, what you mean by that. Basically, no, they are not 
correlated - spatial features are places on a map, date ranges are time 
periods. But, as they are both part of a primary key in this particular table, 
they are correlated in some way as to be a part of uniquely identifying a row.


> Are the estimates good if you query on *only* daterange?  spacial_feature_id ?
Unfortunately no, they are not:




EXPLAIN (ANALYZE , BUFFERS)
SELECT sum(visitors * n)
FROM location_statistics st
WHERE st.daterange = '[2019-03-04,2019-03-11)'::DATERANGE

QUERY PLAN
Aggregate  (cost=2.79..2.80 rows=1 width=8) (actual time=1143.393..1143.393 
rows=1 loops=1)
  Buffers: shared hit=304958
  ->  Index Scan using location_statistics_y2019m03w_pkey on 
location_statistics_y2019m03w st  (cost=0.56..2.78 rows=1 width=8) (actual 
time=0.024..931.645 rows=4296639 loops=1)
Index Cond: (daterange = '[2019-03-04,2019-03-11)'::daterange)
Buffers: shared hit=304958
Planning Time: 0.080 ms
Execution Time: 1143.421 ms



EXPLAIN (ANALYZE , BUFFERS)
SELECT sum(visitors * n)
FROM location_statistics_y2019m03w st
WHERE st.daterange = '[2019-03-04,2019-03-11)'::DATERANGE

QUERY PLAN
Aggregate  (cost=2.79..2.80 rows=1 width=8) (actual time=1126.819..1126.820 
rows=1 loops=1)
  Buffers: shared hit=304958
  ->  Index Scan using location_statistics_y2019m03w_pkey on 
location_statistics_y2019m03w st  (cost=0.56..2.78 rows=1 width=8) (actual 
time=0.023..763.852 rows=4296639 loops=1)
Index Cond: (daterange = '[2019-03-04,2019-03-11)'::daterange)
Buffers: shared hit=304958
Planning Time: 0.046 ms
Execution Time: 1126.845 ms


Checking only on the spatial_feature is not the same query, as the table 
contains 4 different date ranges. Furthermore, there is no index for this 
operation. Because of that, I can only invoke this query on one partition, 
otherwise the query would take days.

EXPLAIN (ANALYZE , BUFFERS)
SELECT sum(visitors * n)
FROM location_statistics_y2019m03w st
WHERE spatial_feature_id = 12675

QUERY PLAN
Finalize Aggregate  (cost=288490.25..288490.26 rows=1 width=8) (actual 
time=1131.593..1131.593 rows=1 loops=1)
  Buffers: shared hit=40156 read=139887
  ->  Gather  (cost=288490.03..288490.24 rows=2 width=8) (actual 
time=1131.499..1148.872 rows=2 loops=1)
Workers Planned: 2
Workers Launched: 1
Buffers: shared hit=40156 read=139887
->  Partial Aggregate  (cost=287490.03..287490.04 rows=1 width=8) 
(actual time=1118.578..1118.579 rows=1 loops=2)
  Buffers: shared hit=40156 read=139887
  ->  Parallel Seq Scan on location_statistics_y2019m03w st  
(cost=0.00..280378.27 rows=948235 width=8) (actual time=3.544..1032.899 
rows=1134146 loops=2)
Filter: (spatial_feature_id = 12675)
Rows Removed by Filter: 8498136
Buffers: shared hit=40156 read=139887
Planning Time: 0.218 ms
JIT:
  Functions: 12
  Options: Inlining false, Optimization false, Expressions true, Deforming true
  Timing: Generation 0.929 ms, Inlining 0.000 ms, Optimization 0.426 ms, 
Emission 6.300 ms, Total 7.655 ms
Execution Time: 1191.741 ms

The estimates seem to be good though.

Thanks in Advance

Julian

[http://www.invenium.io/images/invenium_triangle_64.png]
Julian P. Wolf | Invenium Data Insights GmbH
[email protected] | +43 664 88 199 013
Herrengasse 28 | 8010 Graz | www.invenium.io


From: Justin Pryzby 
Sent: Tuesday, July 21, 2020 7:27 PM
To: Julian Wolf 
Cc: pgsql-performance Postgres Mailing List 

Subject: Re: Too few rows expected by Planner on partitioned tables

On Tue, Jul 21, 2020 at 01:09:22PM +, Julian Wolf wrote:
> Our problem is, that the planner always predicts one row to be returned, 
> although only a part of the primary key is queried. This problem exceeds 
> feasibility of performance rapidly - a query only involving a few days 
> already takes dozens of seconds. All tables are analyzed and pg_stats looks 
> reasonable IMHO.

> daterangedaterange NOT NULL,
> spatial_feature_id   INTEGER,

> Aggregate  (cost=2.79..2.80 rows=1 width=8) (actual time=143.073..143.073 
> rows=1 loops=1)
>   Buffers: shared hit=67334
>   ->  Index Scan using location_statistics_y2019m03w_pkey on 
> location_statistics_y2019m03w st  (cost=0.56..2.78 rows=1 width=8) (actual 
> time=0.02