Statistics on array values

2020-02-02 Thread Marco Colli
Hello!

Let's say that you have a simple query like the following on a large table
(for a multi-tenant application):
SELECT "subscribers".* FROM "subscribers" WHERE "subscribers"."project_id"
= 123 AND (tags @> ARRAY['de']::varchar[]);

If you run EXPLAIN ANALYZE you can see that the stats are completely wrong.
For example, I get an estimated count of 3,500 rows, whereas the actual
result is 20 rows. This also results in bad query plans...
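To illustrate, the plan looks roughly like this (output abridged and
reconstructed from memory; the exact costs and node types will differ):

```sql
EXPLAIN ANALYZE
SELECT "subscribers".*
FROM "subscribers"
WHERE "subscribers"."project_id" = 123
  AND (tags @> ARRAY['de']::varchar[]);

-- Seq Scan on subscribers  (cost=... rows=3500 width=...)
--                          (actual time=... rows=20 loops=1)
--   Filter: ((project_id = 123) AND (tags @> '{de}'::character varying[]))
```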

In a previous discussion someone said that this wrong estimate occurs
because @> uses a fixed selectivity of 0.001, **regardless of the actual
data**!! Is that true? Is there any solution, or any plan to improve this
in future versions of PostgreSQL?

Finally, it would be useful to be able to use CREATE STATISTICS to show
PostgreSQL that there's a correlation between project_id and tag
values... but this is a further step. Currently I can create extended
statistics, however they seem to have no positive effect on the estimates
in the above case.
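For example, I tried something along these lines (a sketch; the statistics
name and kind are just my guess), with no visible effect on the estimate:

```sql
-- A guess at extended statistics on the two correlated columns.
-- Note: in PG 11, functional-dependency statistics are applied only to
-- simple equality clauses, so a containment clause such as
-- tags @> ARRAY['de'] is unlikely to benefit from them.
CREATE STATISTICS subscribers_project_tags (dependencies)
  ON project_id, tags FROM subscribers;
ANALYZE subscribers;
```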


Marco Colli


Re: Statistics on array values

2020-02-02 Thread Justin Pryzby
On Sun, Feb 02, 2020 at 03:18:19PM +0100, Marco Colli wrote:
> Hello!
> 
> Let's say that you have a simple query like the following on a large table
> (for a multi-tenant application):
> SELECT "subscribers".* FROM "subscribers" WHERE "subscribers"."project_id"
> = 123 AND (tags @> ARRAY['de']::varchar[]);
> 
> If you run EXPLAIN ANALYZE you can see that stats are completely wrong.
> For example I get an expected count of 3,500 rows whereas the actual
> result is 20 rows. This also results in bad query plans...

https://www.postgresql.org/message-id/CAMkU%3D1z%2BQijUWAYgeqeyw%2BAvD7adPgOmEnY%2BOcTw6qDVFtD7cQ%40mail.gmail.com
On Fri, Jan 10, 2020 at 12:12:52PM -0500, Jeff Janes wrote:
> Why is the estimate off by so much?  If you run a simple select, what the
> actual and expected number of rows WHERE project_id = 12345?  WHERE tags @>
> '{crt:2018_11}'?  Is one of those estimates way off reality, or is it only
> the conjunction which is deranged?

Could you respond to Jeff's inquiry?

Justin




Re: Statistics on array values

2020-02-02 Thread Marco Colli
> Is one of those estimates way off reality, or is it only the conjunction
> which is deranged?

The estimate is wrong *even with a single tag*, without the conjunction
(e.g. expected 3,500, actual 20). Then the conjunction can make the bias
even worse...

On Sun, Feb 2, 2020 at 3:23 PM Justin Pryzby  wrote:

> On Sun, Feb 02, 2020 at 03:18:19PM +0100, Marco Colli wrote:
> > Hello!
> >
> > Let's say that you have a simple query like the following on a large
> > table (for a multi-tenant application):
> > SELECT "subscribers".* FROM "subscribers" WHERE "subscribers"."project_id"
> > = 123 AND (tags @> ARRAY['de']::varchar[]);
> >
> > If you run EXPLAIN ANALYZE you can see that stats are completely wrong.
> > For example I get an expected count of 3,500 rows whereas the actual
> > result is 20 rows. This also results in bad query plans...
>
>
> https://www.postgresql.org/message-id/CAMkU%3D1z%2BQijUWAYgeqeyw%2BAvD7adPgOmEnY%2BOcTw6qDVFtD7cQ%40mail.gmail.com
> On Fri, Jan 10, 2020 at 12:12:52PM -0500, Jeff Janes wrote:
> > Why is the estimate off by so much?  If you run a simple select, what the
> > actual and expected number of rows WHERE project_id = 12345?  WHERE tags @>
> > '{crt:2018_11}'?  Is one of those estimates way off reality, or is it only
> > the conjunction which is deranged?
>
> Could you respond to Jeff's inquiry?
>
> Justin
>


Re: Statistics on array values

2020-02-02 Thread Tom Lane
Marco Colli  writes:
> Let's say that you have a simple query like the following on a large table
> (for a multi-tenant application):
> SELECT "subscribers".* FROM "subscribers" WHERE "subscribers"."project_id"
> = 123 AND (tags @> ARRAY['de']::varchar[]);

> If you run EXPLAIN ANALYZE you can see that stats are completely wrong.
> For example I get an expected count of 3,500 rows whereas the actual
> result is 20 rows. This also results in bad query plans...

> In a previous discussion someone said that this wrong estimate is because
> @> uses a fixed selectivity of 0.001, **regardless of actual data**!!
> Is that true?

Hasn't been true since 9.2.

You might get some insight from looking into the most_common_elems,
most_common_elem_freqs, and elem_count_histogram fields of the pg_stats
view.
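Something along these lines should show those fields (adjust the table and
column names to match your schema):

```sql
SELECT most_common_elems,
       most_common_elem_freqs,
       elem_count_histogram
FROM pg_stats
WHERE tablename = 'subscribers'
  AND attname = 'tags';
```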

It seems likely to me that increasing the statistics target for this array
column would help.  IIRC, estimates for values that don't show up in
most_common_elems are going to depend on the lowest frequency that *does*
show up there ... so if you want better resolution for non-common values,
you need more entries.

regards, tom lane




Re: Statistics on array values

2020-02-02 Thread Marco Colli
Thanks Tom for the clear explanation.
Unfortunately I don't see any actual improvement. I use PG 11 and I ran the
following commands:

ALTER TABLE subscriptions ALTER tags SET STATISTICS 1000;
ANALYZE subscriptions;

However the bias remains pretty much the same (slightly worse afterwards).
Any ideas?

On Sun, Feb 2, 2020 at 6:11 PM Tom Lane  wrote:

> Marco Colli  writes:
> > Let's say that you have a simple query like the following on a large
> > table (for a multi-tenant application):
> > SELECT "subscribers".* FROM "subscribers" WHERE "subscribers"."project_id"
> > = 123 AND (tags @> ARRAY['de']::varchar[]);
>
> > If you run EXPLAIN ANALYZE you can see that stats are completely wrong.
> > For example I get an expected count of 3,500 rows whereas the actual
> > result is 20 rows. This also results in bad query plans...
>
> > In a previous discussion someone said that this wrong estimate is because
> > @> uses a fixed selectivity of 0.001, **regardless of actual data**!!
> > Is that true?
>
> Hasn't been true since 9.2.
>
> You might get some insight from looking into the most_common_elems,
> most_common_elem_freqs, and elem_count_histogram fields of the pg_stats
> view.
>
> It seems likely to me that increasing the statistics target for this array
> column would help.  IIRC, estimates for values that don't show up in
> most_common_elems are going to depend on the lowest frequency that *does*
> show up there ... so if you want better resolution for non-common values,
> you need more entries.
>
> regards, tom lane
>


Re: Statistics on array values

2020-02-02 Thread Tom Lane
Marco Colli  writes:
> Unfortunately I don't get actual improvements. I use PG 11 and I run the
> following commands:
> ALTER TABLE subscriptions ALTER tags SET STATISTICS 1000;
> ANALYZE subscriptions;
> However the bias remains pretty much the same (slightly worse after). Any
> idea?

So what have you got in the pg_stats fields I asked about?
How big is this table anyway (how many rows, how many different tag
values)?
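For instance, something like this would answer both questions (assuming the
table and column names used earlier in the thread):

```sql
-- Approximate row count according to the planner:
SELECT reltuples::bigint AS approx_rows
FROM pg_class
WHERE relname = 'subscriptions';

-- Number of distinct tag values (exact, but can be slow on a big table):
SELECT count(DISTINCT tag) AS distinct_tags
FROM subscriptions, unnest(tags) AS tag;
```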

regards, tom lane




Re: Statistics on array values

2020-02-02 Thread Marco Colli
Sorry, I don't understand your exact question about pg_stats. In any case,
I cannot make strict assumptions about the data, because it varies greatly
from project to project (it's a SaaS) and over time. To give an idea, the
table has some tens of millions of rows, each project has from a few
thousand to a few million rows, and each project has its own tags that the
customer can define (unlimited tags for each row, but usually only 1 - 10
are actually used)

On Sun, Feb 2, 2020 at 7:32 PM Tom Lane  wrote:

> Marco Colli  writes:
> > Unfortunately I don't get actual improvements. I use PG 11 and I run the
> > following commands:
> > ALTER TABLE subscriptions ALTER tags SET STATISTICS 1000;
> > ANALYZE subscriptions;
> > However the bias remains pretty much the same (slightly worse after). Any
> > idea?
>
> So what have you got in the pg_stats fields I asked about?
> How big is this table anyway (how many rows, how many different tag
> values)?
>
> regards, tom lane
>