Re: Slow performance of collate "en_US.utf8"

2025-02-28 Thread Joe Conway

On 2/28/25 17:49, Thomas Munro wrote:

On Sat, Mar 1, 2025 at 9:03 AM Joe Conway  wrote:

On 2/28/25 09:16, Laurenz Albe wrote:
> On Thu, 2025-02-27 at 16:54 +0300, Alexey Borschev wrote:
>> I see poor performance of text sorting of collate "en_US.utf8" in PG 17.4.
>
> I'd say that you would have to complain to the authors of the
> GNU C library, which provides this collation.

Yep -- glibc starting with version 2.21 has a massive performance
regression for certain cases and the glibc folks have basically said
they will not fix it. If you try the same thing on RHEL 7.x with glibc
2.17 it will perform about the same as ICU.


I've idly wondered if this is the culprit, do you know?

https://github.com/bminor/glibc/commit/0742aef6e52a935f9ccd69594831b56d807feef3


Yes, that was definitely the one that caused the regression. Note that 
if you look closely you will find there is a revert of that patch on 
glibc on certain distros. But not on RHEL and RHEL-alike.


Someone else pointed out this thread to me:
https://sourceware.org/bugzilla/show_bug.cgi?id=18441

Note the last message on that thread:
8<--
 Carlos O'Donell 2019-05-09 20:44:56 UTC

(In reply to vectoroc from comment #13)
> Hello. Is there any chance that the issues will be fixed? Unfortunately
> PostgreSQL Is unable to use ICU some base features (e.g in analyze
> operation).

We haven't had anyone working on strcoll_l performance improvements. So 
it's unlikely that this will get merged or reviewed any time soon.

8<--


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: [PERFORM] Unused index influencing sequential scan plan

2025-02-28 Thread Thom Brown
On Thu, 18 Oct 2012, 18:01 Thom Brown,  wrote:

> On 18 October 2012 17:52, Tom Lane  wrote:
> > Thom Brown  writes:
> >> On 18 October 2012 17:44, Tom Lane  wrote:
> >>> Thom Brown  writes:
>  And as a side note, how come it's impossible to get the planner to use
>  an index-only scan to satisfy the query (disabling sequential and
>  regular index scans)?
> >
> >>> Implementation restriction - we don't yet have a way to match
> index-only
> >>> scans to expressions.
> >
> >> Ah, I suspected it might be, but couldn't find notes on what scenarios
> >> it's yet to be able to work in.  Thanks.
> >
> > I forgot to mention that there is a klugy workaround: add the required
> > variable(s) as extra index columns.  That is,
> >
> > create index i on t (foo(x), x);
> >
> > The planner isn't terribly bright about this, but it will use that index
> > for a query that only requires foo(x), and it won't re-evaluate foo()
> > (though I think it will cost the plan on the assumption it does :-().
>
> Ah, yes, I've tested this and got it using an index-only scan, and it
> was faster than than the sequential scan (index only scan 5024.545 ms
> vs seq scan 6627.072 ms).
>
> So this is probably a dumb question, but is it possible to achieve the
> optimisation provided by index statistics but without the index, and
> without a messy workaround using a supplementary column which stores
> function-derived values?  If not, is that something which can be
> introduced?
>

A very late thanks for extended statistics, Tomas.

Thom

>


Re: Slow performance of collate "en_US.utf8"

2025-02-28 Thread Thomas Munro
On Sat, Mar 1, 2025 at 9:03 AM Joe Conway  wrote:
> On 2/28/25 09:16, Laurenz Albe wrote:
> > On Thu, 2025-02-27 at 16:54 +0300, Alexey Borschev wrote:
> >> I see poor performance of text sorting of collate "en_US.utf8" in PG 17.4.
> >
> > I'd say that you would have to complain to the authors of the
> > GNU C library, which provides this collation.
>
> Yep -- glibc starting with version 2.21 has a massive performance
> regression for certain cases and the glibc folks have basically said
> they will not fix it. If you try the same thing on RHEL 7.x with glibc
> 2.17 it will perform about the same as ICU.

I've idly wondered if this is the culprit, do you know?

https://github.com/bminor/glibc/commit/0742aef6e52a935f9ccd69594831b56d807feef3

It seems to have bet that strings either differ in primary weight as
early as they do in synthetic natural language tests, or not at all
because they are equal and that is detected with a fast-path binary
comparison.  The average first different character word-to-word in my
/usr/shared/dict/words is at position ~5.6 (some kind of worst case as
it is already sorted), cf 14+ multibyte sequences in OP's example,
which must be well outside their test parameters I would guess.  I
didn't read the code but the description has a miasma of
quadratic-catching-on-fire about it: it's now rescanning the secondary
weights with repeated traversals, because the cache they ripped out
wasn't pulling its own weight at small common prefix sizes, or
something like that?  I wonder if 2.21 also got faster for PostgreSQL
sorting /usr/share/dict/words as you might expect from that
description.  Database keys with long common prefixes *probably*
shouldn't be using natural language sorting anyway, so "don't do
that", but knowledge of collations is not well distributed... on the
other hand I suspect you can dream up some real natural language
examples that lose the bet too: sort "lastname, firstname" across a
country of 300 million, maybe?




Re: Slow performance of collate "en_US.utf8"

2025-02-28 Thread Joe Conway

On 2/28/25 09:16, Laurenz Albe wrote:

On Thu, 2025-02-27 at 16:54 +0300, Alexey Borschev wrote:

I see poor performance of text sorting of collate "en_US.utf8" in PG 17.4.


I'd say that you would have to complain to the authors of the
GNU C library, which provides this collation.


Yep -- glibc starting with version 2.21 has a massive performance 
regression for certain cases and the glibc folks have basically said 
they will not fix it. If you try the same thing on RHEL 7.x with glibc 
2.17 it will perform about the same as ICU.


If you are using pg17 you should consider using the new builtin 
collation provider -- it will perform almost as well as the 'C' locale. 
Something like:


CREATE DATABASE builtincoll LOCALE_PROVIDER builtin
BUILTIN_LOCALE 'C.UTF-8' TEMPLATE template0;


--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com




Re: Slow performance of collate "en_US.utf8"

2025-02-28 Thread Laurenz Albe
On Thu, 2025-02-27 at 16:54 +0300, Alexey Borschev wrote:
> I see poor performance of text sorting of collate "en_US.utf8" in PG 17.4.

I'd say that you would have to complain to the authors of the
GNU C library, which provides this collation.

Yours,
Laurenz Albe




Re: Slow performance of collate "en_US.utf8"

2025-02-28 Thread Achilleas Mantzios - cloud

Hi Alexey

On 2/27/25 15:54, Alexey Borschev wrote:


Hi everyone!

I see poor performance of text sorting of collate "en_US.utf8" in PG 17.4.

Test query:

explain (analyze, costs, buffers, verbose)

SELECT

('БвЁжЫйяЙSёYz&$' || gen.id) collate "en_US.utf8"

FROM generate_series(1, 1) AS gen(id)

order by 1 desc;

I've got execution time like:

Execution Time: 73.068 ms

Same poor result with ru_RU.UTF8.

With other collations time is much better:

explain (analyze, costs, buffers, verbose)

select

('БвЁжЫйяЙSёYz&$' || gen.id) collate "C"

from generate_series(1, 1) AS gen(id)

order by 1 desc;

Execution Time: 4.792 ms

explain (analyze, costs, buffers, verbose)

SELECT

('БвЁжЫйяЙSёYz&$' || gen.id) collate "C.utf8"

FROM generate_series(1, 1) AS gen(id)

order by 1 desc;

Execution Time: 7.473 ms

explain (analyze, costs, buffers, verbose)

select

('БвЁжЫйяЙSёYz&$' || gen.id) collate "und-x-icu"

from generate_series(1, 1) AS gen(id)

order by 1 desc;

Execution Time: 13.282 ms

Yes, collate C is fastest, ICU collations is ~ 2 slower then C, but 
"en_US.utf8" is ~ 10x slower!


I suspect it is some performance issue over there.

Can someone of PG hackers reproduce this please?

1) This PG17 instance was installed with default options, and initdb 
got en_US.utf8 as system default collation and created PG cluster with it.


It seems like most PG databases are created this way with en_US.utf8 
by default.


It seems you initialized the cluster with libc as the locale provider. 
Have you tried with icu ?


2) Typical text\varchar columns are created with DB default en_US.utf8 
and performs poor.


explain (analyze, costs, buffers, verbose)

select

('БвЁжЫйяЙSёYz&$' || gen.id)

from generate_series(1, 1) AS gen(id)

order by 1 desc;

Execution Time: 73.600 ms

3) The index search operations are also slower with en_US.utf8, but 
the difference is not as high.


Please see attached file with test table and indexes, with tests on 
index performance.


System Details:

postgres@borschev-pg-copydb1:~$ uname -a Linux borschev-pg-copydb1 
6.1.0-7-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.20-1 (2023-03-19) 
x86_64 GNU/Linux


postgres@borschev-pg-copydb1:~$ cat /etc/issue Debian GNU/Linux 
trixie/sid \n \l


select version();

PostgreSQL 17.4 (Debian 17.4-1.pgdg110+2) on x86_64-pc-linux-gnu, 
compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit


SELECT * FROM pg_config();

|name |setting |

|-|--|

|BINDIR |/usr/lib/postgresql/17/bin |

|DOCDIR |/usr/share/doc/postgresql-doc-17 |

|HTMLDIR |/usr/share/doc/postgresql-doc-17 |

|INCLUDEDIR |/usr/include/postgresql |

|PKGINCLUDEDIR |/usr/include/postgresql |

|INCLUDEDIR-SERVER|/usr/include/postgresql/17/server |

|LIBDIR |/usr/lib/x86_64-linux-gnu |

|PKGLIBDIR |/usr/lib/postgresql/17/lib |

|LOCALEDIR |/usr/share/locale |

|MANDIR |/usr/share/postgresql/17/man |

|SHAREDIR |/usr/share/postgresql/17 |

|SYSCONFDIR |/etc/postgresql-common |

|PGXS |/usr/lib/postgresql/17/lib/pgxs/src/makefiles/pgxs.mk |

|CONFIGURE | '--build=x86_64-linux-gnu' '--prefix=/usr' 
'--includedir=${prefix}/include' '--mandir=${prefix}/share/man' 
'--infodir=${prefix}/share/info' '--sysconfdir=/etc' 
'--localstatedir=/var' '--disable-option-checking' 
'--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu' 
'--run