Re: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column
I finally got around to testing your patch on a realistic data set. In
short, the patch worked beautifully even with the division by 2 removed.
In case it's helpful, the full write-up of my investigation can be found
at https://gist.github.com/mattlong/0617bec6e1cf5bc6b70c6c2951901df7

Your reasoning about the division by 2 being a way to approximate the
"average" non-MCE value seems logical. The patch as tested above, without
a division, yielded a row estimate of 28 for a non-MCE element on a table
with ~2.1 million rows where all non-MCE elements occur exactly once. For
our scenario/distribution of the data, the most conservative option, i.e.
no division, was good enough to produce optimal query plans, but this is
of course a very specific data distribution.

Thinking more generally, there are two cases: when the MCE list is full
and when it is NOT full.

When the MCE list is NOT full, the case that this patch addresses, the
data is by definition heavily skewed; minfreq will already be quite low.
Whether or not it gets divided by 2 (or some other number) probably won't
make much of a difference either way.

When the MCE list is full, the data is less skewed; there will generally
be a longer tail of element frequencies. Dividing by 2 will be more
impactful here in an attempt to approximate the "average" non-MCE element
frequency.

Putting that together, it seems like keeping the division by 2 would be
preferable?

On Tue, Sep 9, 2025 at 12:19 PM Tom Lane wrote:
>
> I wrote:
> > * The selectivity functions believe that the upper bound on the
> > frequency of non-MCEs is minfreq / 2, not the stored minfreq.
> > This seems like complete brain fade: there could easily be
> > elements with frequency just less than minfreq, and probably are
> > if the data distribution follows Zipf's law. I did not dig into
> > the git history, but I wonder if the divide-by-two business
> > predates the introduction of the lossy-counting algorithm, and
> > if so whether it was less insane with the original collection
> > algorithm. In any case, this patch removes the divisions by 2,
> > and makes some nearby cosmetic improvements.
>
> In the light of morning, I started to have second thoughts about
> this aspect of the patch. I checked the git history this time,
> and found that the oldest instance of "minfreq / 2" dates to
> 4e57668da which originated from this discussion:
>
> https://www.postgresql.org/message-id/flat/488DAEB8.3000402%40students.mimuw.edu.pl
>
> It's already coded like that in Jan's initial patch, and there
> was no discussion in the thread, so not a lot to be gleaned:
>
> +/*
> + * The element is not in MCELEM. Punt, but assure that the
> + * selectivity cannot be more than minfreq / 2.
> + */
> +return (Selectivity) Min(DEFAULT_TS_SEL, minfreq / 2);
>
> But looking at this, I think the problem is really that the comment
> fails to describe his thought process. We know that the frequency
> of this not-in-the-MCE-list value cannot be more than minfreq, but
> we have no idea how much less it is. I think the idea of the
> division might have been to assume that an "average" non-MCE value
> would have a frequency about half that of the lowest MCE. It does
> seem reasonable to use some number lower than the cutoff, although
> I dunno if 0.5 is exactly the right factor.
>
> So now I'm wondering if we should leave the divisions alone and
> instead fix the comments to explain why they are there. Bolstering
> this is that on the two test cases you guys submitted, the patch
> with the divisions gets a spot-on estimate (1 row) while removing
> the divisions yields an estimate of 2 rows. I don't put a lot of
> weight on that, since these are toy examples. But I wonder if you
> guys could test the patch on some of your real-world cases and
> see if it looks like we should keep the divisions.
>
> regards, tom lane
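To make the trade-off above concrete, here is a small standalone sketch of
the fallback estimate being debated. This is not the actual PostgreSQL
source; the function name, the DEFAULT_SEL constant, and the implied
minfreq value are stand-ins derived from the numbers above (a table of
roughly 2.1 million rows where the no-division estimate came out at 28
rows).

/*
 * Illustrative sketch only, not PostgreSQL source: the shape of the
 * fallback estimate for an element that is not in the MCE list.
 * DEFAULT_SEL stands in for the hard-wired default selectivity (the 0.5%
 * figure mentioned later in the thread).
 */
#include <stdio.h>

#define DEFAULT_SEL 0.005

static double
non_mce_selectivity(double minfreq, int divide_by_two)
{
    /*
     * minfreq is an upper bound on the true frequency of any non-MCE
     * element; dividing it by 2 guesses that an "average" non-MCE element
     * is about half as frequent as the least common MCE.
     */
    double      cap = divide_by_two ? minfreq / 2 : minfreq;

    return cap < DEFAULT_SEL ? cap : DEFAULT_SEL;
}

int
main(void)
{
    double      nrows = 2.1e6;          /* roughly the table size tested above */
    double      minfreq = 28.0 / nrows; /* cutoff implied by the 28-row estimate */

    printf("no division : %.0f rows\n", non_mce_selectivity(minfreq, 0) * nrows);
    printf("divide by 2 : %.0f rows\n", non_mce_selectivity(minfreq, 1) * nrows);
    return 0;
}

With these inputs the sketch prints 28 rows without the division and 14
rows with it, which is the whole difference under discussion for this
particular data distribution.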
Re: Poor row estimates from planner, stat `most_common_elems` sometimes missing for a text[] column
Not to let perfect be the enemy of better, but we're facing a variant of
this issue that would not be addressed by the proposed patch. If you're
interested, the real-world use case is described in
https://github.com/medplum/medplum/issues/7310. Rows have an array
consisting of one common element present in every row and one unique
element. With the statistics target at its default of 100, this pushes
the threshold at which row counts for the unique elements become vastly
overestimated down to 7,937 rows.

In this case the proposed patch does not kick in, since the
most_common_elems array is not empty. I'm not a statistician, so maybe
this wouldn't be valid, but it seems like using the highest frequency of
an element that did not qualify for the MCE list, instead of the 0.5%
default frequency, could be an elegant, if more invasive, solution.
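Purely to illustrate the shape of that suggestion, here is a hedged
sketch. It is not PostgreSQL code, and the names (missing_elem_selectivity,
cutoff_freq, DEFAULT_ELEM_SEL) are hypothetical; the 0.5% value is the
default mentioned above, and the 7,937-row table matches the repro below.

/*
 * Rough sketch of the idea, not actual PostgreSQL code: suppose ANALYZE
 * also recorded the highest frequency seen among elements that were
 * tracked but did NOT make the MCE cut ("cutoff_freq").  An element
 * missing from most_common_elems could then be capped by that value
 * instead of by a fixed 0.5% default.
 */
#include <stdbool.h>
#include <stdio.h>

#define DEFAULT_ELEM_SEL 0.005  /* the 0.5% default referred to above */

static double
missing_elem_selectivity(bool have_cutoff_freq, double cutoff_freq)
{
    if (have_cutoff_freq)
    {
        /* No non-MCE element can be more frequent than cutoff_freq. */
        return cutoff_freq < DEFAULT_ELEM_SEL ? cutoff_freq : DEFAULT_ELEM_SEL;
    }
    /* No extra statistic available: fall back to the fixed default. */
    return DEFAULT_ELEM_SEL;
}

int
main(void)
{
    double      nrows = 7937.0; /* the threshold table size from the repro below */

    /* Fixed 0.5% default: a unique element is estimated at ~40 rows. */
    printf("default cap: %.1f rows\n",
           missing_elem_selectivity(false, 0.0) * nrows);

    /* Capped by the observed non-qualifying frequency (~1/7937). */
    printf("cutoff cap : %.1f rows\n",
           missing_elem_selectivity(true, 1.0 / nrows) * nrows);
    return 0;
}

With 7,937 rows this prints roughly 39.7 rows for the fixed default versus
1.0 row when capped by the ~1/7,937 frequency of an element that just
missed the list, which is the overestimate this variant runs into.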
Simple repro of the problem:
CREATE TABLE public.test(
id integer,
tags text[]
);
INSERT INTO public.test (id, tags)
SELECT n, ARRAY['common', TO_CHAR(n, 'fm0000')]
FROM (SELECT generate_series(1, 7_937) AS n);
ANALYZE public.test;
SELECT most_common_elems, most_common_elem_freqs FROM pg_stats
WHERE schemaname = 'public' AND tablename = 'test' AND attname = 'tags';
EXPLAIN (ANALYZE,BUFFERS,VERBOSE)
SELECT * FROM test WHERE tags && ARRAY['common'];
EXPLAIN (ANALYZE,BUFFERS,VERBOSE)
SELECT * FROM test WHERE tags && ARRAY['0001'];
With 7,936 rows:
most_common_elems has 1,000 entries and looks like
{0001,0003,0009,...,common}
most_common_elem_freqs has 1,003 entries and looks like
{0.00012600806,0.00012600806,0.00012600806,...,1,0.00012600806,1,0}

With 7,937+ rows:
most_common_elems is {common}
most_common_elem_freqs is {1,1,1,0}

(In both cases the trailing three entries of most_common_elem_freqs are
the minimum and maximum of the element frequencies plus the frequency of
null elements, per the MCELEM slot convention mentioned in the quoted
message below.)
On Mon, Sep 8, 2025 at 11:27 AM Tom Lane wrote:
> I wrote:
> > Well, we don't have a most common element in this scenario --- the
> > whole point is that the occurrence counts resulting from the lossy
> > counting algorithm are too low to be trustworthy. However, what we
> > do have is the cutoff frequency, and it seems to me that we could use
> > that as the estimate of the maximum frequency of the non-MCEs.
>
> Here's a less crude patch for that. The array_typanalyze.c changes
> are the same as before, but I reconsidered what to do about this
> stumbling block:
>
> > Assuming we like this direction, the main thing that makes this a hack
> > not a polished patch is that I had to strongarm the code into storing
> > a zero-length values array. What update_attstats would normally do
> > is leave the values column of the MCELEM stats slot NULL, which then
> > causes get_attstatsslot to throw a that-shouldn't-be-null error.
> > An alternative answer is to change get_attstatsslot to allow a null,
> > but I'm not sure that that's any cleaner. Either way it seems like
> > there's a hazard of breaking some code that isn't expecting the case.
>
> I concluded that letting get_attstatsslot accept nulls is too risky;
> there is probably code that assumes that get_attstatsslot would've
> rejected that. Instead, this version changes update_attstats' rule
> for when to store an array rather than a null. Now it will do so
> if the passed-in stavalues pointer isn't null, even if numvalues
> is zero. I think that this doesn't change the outcome for any
> existing analyze routines because they wouldn't pass a data pointer
> if they have no values; and even if it does, storing an empty array
> not a null seems like it should be pretty harmless.
>
> > An alternative that feels cleaner but a good bit more invasive
> > is to get rid of the convention of storing the min/max/nulls
> > frequencies as extra entries in the MCELEM numbers entry ---
> > which surely is a hack too --- and put them into some new slot
> > type. I'm not sure if that's enough nicer to be worth the
> > conversion pain.
>
> A year ago I would have seriously considered doing it that way.
> But now that we have code to dump-n-restore stats, that code would
> have to be adjusted to convert the old representation. It's not
> worth it for this case.
>
> Hence, v1 attached, now with a commit message.
>
> regards, tom lane
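For reference, below the quote: a rough sketch of the before/after storage
rule described above. This is not the actual update_attstats/analyze.c
code; the struct and function names are stand-ins, but it captures the
described condition change (store an array whenever a values pointer was
supplied, even if it holds zero values).

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Stand-in for the per-slot fields relevant to the described change. */
typedef struct
{
    void       *stavalues;  /* element values supplied by the analyze routine */
    int         numvalues;  /* how many of them there are */
} SlotSketch;

/* Old rule, as described: store a values array only when at least one
 * value was supplied; otherwise leave the column NULL. */
static bool
store_values_old(const SlotSketch *slot)
{
    return slot->numvalues > 0;
}

/* New rule, as described: store a (possibly empty) values array whenever
 * the analyze routine supplied a values pointer at all. */
static bool
store_values_new(const SlotSketch *slot)
{
    return slot->stavalues != NULL;
}

int
main(void)
{
    /* The MCELEM-with-no-MCEs case: a pointer is supplied, but it points
     * at zero values. */
    static int  placeholder;
    SlotSketch  empty_mcelem = {&placeholder, 0};

    printf("old rule: %s\n", store_values_old(&empty_mcelem) ? "store array" : "store NULL");
    printf("new rule: %s\n", store_values_new(&empty_mcelem) ? "store array" : "store NULL");
    return 0;
}

As I understand the quoted description, the point of allowing the empty
array is exactly so that this case can be stored and later read back
without get_attstatsslot complaining about an unexpected null.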