RE: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column

2025-06-06 Thread Mark Frost

> On 6/5/25 17:42, Mark Frost wrote:
> > Is there any good explanation for this behaviour? Preferably we’d like
> > some way for proper `most_common_elems` statistics to be collected in
> > our production database, in the hope that this influences the planner
> > to always select a good query plan.


> most_common_elems has a limited size, and if all the elements have the
> same freq, there's nothing we can do.

>  You could do: alter table test alter column tags set statistics X;
>  However, X is capped at 1…

Actually *any* most_common_elems stats would be fine, because the reasoning is:

  *   If the searched element is in most_common_elems, we know its frequency
  *   If it’s not, it’s less frequent than the least frequent most_common_elems entry

So in our case, when every row is unique, we’d only actually need the stats to
record a single most_common_elems entry (if only it would record one)
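
A minimal reproduction, reusing the table and column names from the
`alter table` suggestion above (our real schema is more involved):

    CREATE TABLE test (id int, tags text[]);

    -- Every element occurs exactly once, so nothing clears the
    -- lossy-counting cutoff during ANALYZE.
    INSERT INTO test
        SELECT i, ARRAY['tag_' || i] FROM generate_series(1, 100000) i;
    ANALYZE test;

    -- most_common_elems comes back NULL: no MCELEM stats were stored at
    -- all, and the planner falls back to its default selectivity.
    SELECT most_common_elems, most_common_elem_freqs
    FROM pg_stats
    WHERE tablename = 'test' AND attname = 'tags';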


Re: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column

2025-06-06 Thread Frédéric Yhuel

On 6/5/25 23:52, Tom Lane wrote:
> The idea of treating lack of MCELEM differently from complete
> lack of stats still seems to have merit, though.

Couldn't we count / estimate the number of distinct elements,
and use that instead of the default selectivity estimate?
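
Naively, that's the number computed by something like the query below
(a full scan only to make the idea concrete; ANALYZE would estimate it
from its sample):

    SELECT count(DISTINCT e) AS n_distinct_elems
    FROM test, unnest(tags) AS e;

The fallback selectivity for a single-element search could then be on
the order of 1 / n_distinct_elems instead of a hardwired constant.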

Re: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column

2025-06-06 Thread Tom Lane
Mark Frost writes:
> Actually *any* most_common_elems stats would be fine, because the reasoning 
> is:
>   *   If the searched element is in most_common_elems, we know its frequency
>   *   If it's not, it's less frequent than the least frequent most_common_elems entry
> So in our case, when every row is unique, we'd only actually need the stats to 
> record a single most_common_elems entry (if only it would record one)

Well, we don't have a most common element in this scenario --- the
whole point is that the occurrence counts resulting from the lossy
counting algorithm are too low to be trustworthy.  However, what we
do have is the cutoff frequency, and it seems to me that we could use
that as the estimate of the maximum frequency of the non-MCEs.

Attached is an extremely crude prototype patch for that.  What it
does is to create an MCELEM stats entry with zero MCEs, but containing
min and max frequencies equal to the cutoff frequency (plus the nulls
frequency, which we know accurately in any case).  Mark, this fixes
your example case, but I wonder if it fixes your original problem ---
are you in a position to test it?
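
Concretely, what you should see after ANALYZE with the patch applied is
an MCELEM entry with an empty element list but real frequency bounds,
along these lines (output sketched by hand, numbers invented):

    SELECT most_common_elems, most_common_elem_freqs
    FROM pg_stats
    WHERE tablename = 'test' AND attname = 'tags';

    -- most_common_elems      | {}
    -- most_common_elem_freqs | {0.0015,0.0015,0}

that is, no MCEs, just the trailing min/max/nulls frequencies, with min
and max both set to the cutoff frequency.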

Assuming we like this direction, the main thing that makes this a hack
not a polished patch is that I had to strongarm the code into storing
a zero-length values array.  What update_attstats would normally do
is leave the values column of the MCELEM stats slot NULL, which then
causes get_attstatsslot to throw a that-shouldn't-be-null error.
An alternative answer is to change get_attstatsslot to allow a null,
but I'm not sure that that's any cleaner.  Either way it seems like
there's a hazard of breaking some code that isn't expecting the case.

An alternative that feels cleaner but a good bit more invasive
is to get rid of the convention of storing the min/max/nulls
frequencies as extra entries in the MCELEM numbers entry ---
which surely is a hack too --- and put them into some new slot
type.  I'm not sure if that's enough nicer to be worth the
conversion pain.

Thoughts?

regards, tom lane

diff --git a/src/backend/commands/analyze.c b/src/backend/commands/analyze.c
index 4fffb76e557..c312de7d593 100644
--- a/src/backend/commands/analyze.c
+++ b/src/backend/commands/analyze.c
@@ -1733,7 +1733,10 @@ update_attstats(Oid relid, bool inh, int natts, VacAttrStats **vacattrstats)
 		i = Anum_pg_statistic_stavalues1 - 1;
 		for (k = 0; k < STATISTIC_NUM_SLOTS; k++)
 		{
-			if (stats->numvalues[k] > 0)
+			/* XXX ugly hack to allow zero-length MCELEM values arrays */
+			if (stats->numvalues[k] > 0 ||
+				(stats->numvalues[k] == 0 &&
+				 stats->stakind[k] == STATISTIC_KIND_MCELEM))
 			{
 				ArrayType  *arry;
 
diff --git a/src/backend/utils/adt/array_typanalyze.c b/src/backend/utils/adt/array_typanalyze.c
index 6f61629b977..9c6dd2852ab 100644
--- a/src/backend/utils/adt/array_typanalyze.c
+++ b/src/backend/utils/adt/array_typanalyze.c
@@ -497,6 +497,15 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
 		 * If we obtained more elements than we really want, get rid of those
 		 * with least frequencies.  The easiest way is to qsort the array into
 		 * descending frequency order and truncate the array.
+		 *
+		 * On the other extreme, we might have found no elements with
+		 * frequency above the cutoff.  Then there's nothing to store so far
+		 * as element values go, but we still want to record the cutoff
+		 * frequency, since array_selfuncs.c can use that as an upper bound on
+		 * the frequency of non-MCEs.  (Note: what we store is the minimum
+		 * frequency that would have been accepted as a valid MCE.  In
+		 * particular, this ensures we don't create a bogus stats entry with
+		 * frequency zero.)
 		 */
 		if (num_mcelem < track_len)
 		{
@@ -506,10 +515,14 @@ compute_array_stats(VacAttrStats *stats, AnalyzeAttrFetchFunc fetchfunc,
 			minfreq = sort_table[num_mcelem - 1]->frequency;
 		}
 		else
+		{
 			num_mcelem = track_len;
+			if (track_len == 0)
+				minfreq = maxfreq = cutoff_freq + 1;
+		}
 
 		/* Generate MCELEM slot entry */
-		if (num_mcelem > 0)
+		if (num_mcelem >= 0)
 		{
 			MemoryContext old_context;
 			Datum	   *mcelem_values;