Re: Slow performance of collate "en_US.utf8"
On 2/28/25 17:49, Thomas Munro wrote: On Sat, Mar 1, 2025 at 9:03 AM Joe Conway wrote: On 2/28/25 09:16, Laurenz Albe wrote: > On Thu, 2025-02-27 at 16:54 +0300, Alexey Borschev wrote: >> I see poor performance of text sorting of collate "en_US.utf8" in PG 17.4. > > I'd say that you would have to complain to the authors of the > GNU C library, which provides this collation. Yep -- glibc starting with version 2.21 has a massive performance regression for certain cases and the glibc folks have basically said they will not fix it. If you try the same thing on RHEL 7.x with glibc 2.17 it will perform about the same as ICU. I've idly wondered if this is the culprit, do you know? https://github.com/bminor/glibc/commit/0742aef6e52a935f9ccd69594831b56d807feef3 Yes, that was definitely the one that caused the regression. Note that if you look closely you will find there is a revert of that patch on glibc on certain distros. But not on RHEL and RHEL-alike. Someone else pointed out this thread to me: https://sourceware.org/bugzilla/show_bug.cgi?id=18441 Note the last message on that thread: 8<-- Carlos O'Donell 2019-05-09 20:44:56 UTC (In reply to vectoroc from comment #13) > Hello. Is there any chance that the issues will be fixed? Unfortunately > PostgreSQL Is unable to use ICU some base features (e.g in analyze > operation). We haven't had anyone working on strcoll_l performance improvements. So it's unlikely that this will get merged or reviewed any time soon. 8<-- -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: [PERFORM] Unused index influencing sequential scan plan
On Thu, 18 Oct 2012, 18:01 Thom Brown, wrote: > On 18 October 2012 17:52, Tom Lane wrote: > > Thom Brown writes: > >> On 18 October 2012 17:44, Tom Lane wrote: > >>> Thom Brown writes: > And as a side note, how come it's impossible to get the planner to use > an index-only scan to satisfy the query (disabling sequential and > regular index scans)? > > > >>> Implementation restriction - we don't yet have a way to match > index-only > >>> scans to expressions. > > > >> Ah, I suspected it might be, but couldn't find notes on what scenarios > >> it's yet to be able to work in. Thanks. > > > > I forgot to mention that there is a klugy workaround: add the required > > variable(s) as extra index columns. That is, > > > > create index i on t (foo(x), x); > > > > The planner isn't terribly bright about this, but it will use that index > > for a query that only requires foo(x), and it won't re-evaluate foo() > > (though I think it will cost the plan on the assumption it does :-(). > > Ah, yes, I've tested this and got it using an index-only scan, and it > was faster than than the sequential scan (index only scan 5024.545 ms > vs seq scan 6627.072 ms). > > So this is probably a dumb question, but is it possible to achieve the > optimisation provided by index statistics but without the index, and > without a messy workaround using a supplementary column which stores > function-derived values? If not, is that something which can be > introduced? > A very late thanks for extended statistics, Tomas. Thom >
Re: Slow performance of collate "en_US.utf8"
On Sat, Mar 1, 2025 at 9:03 AM Joe Conway wrote: > On 2/28/25 09:16, Laurenz Albe wrote: > > On Thu, 2025-02-27 at 16:54 +0300, Alexey Borschev wrote: > >> I see poor performance of text sorting of collate "en_US.utf8" in PG 17.4. > > > > I'd say that you would have to complain to the authors of the > > GNU C library, which provides this collation. > > Yep -- glibc starting with version 2.21 has a massive performance > regression for certain cases and the glibc folks have basically said > they will not fix it. If you try the same thing on RHEL 7.x with glibc > 2.17 it will perform about the same as ICU. I've idly wondered if this is the culprit, do you know? https://github.com/bminor/glibc/commit/0742aef6e52a935f9ccd69594831b56d807feef3 It seems to have bet that strings either differ in primary weight as early as they do in synthetic natural language tests, or not at all because they are equal and that is detected with a fast-path binary comparison. The average first different character word-to-word in my /usr/shared/dict/words is at position ~5.6 (some kind of worst case as it is already sorted), cf 14+ multibyte sequences in OP's example, which must be well outside their test parameters I would guess. I didn't read the code but the description has a miasma of quadratic-catching-on-fire about it: it's now rescanning the secondary weights with repeated traversals, because the cache they ripped out wasn't pulling its own weight at small common prefix sizes, or something like that? I wonder if 2.21 also got faster for PostgreSQL sorting /usr/share/dict/words as you might expect from that description. Database keys with long common prefixes *probably* shouldn't be using natural language sorting anyway, so "don't do that", but knowledge of collations is not well distributed... on the other hand I suspect you can dream up some real natural language examples that lose the bet too: sort "lastname, firstname" across a country of 300 million, maybe?
Re: Slow performance of collate "en_US.utf8"
On 2/28/25 09:16, Laurenz Albe wrote: On Thu, 2025-02-27 at 16:54 +0300, Alexey Borschev wrote: I see poor performance of text sorting of collate "en_US.utf8" in PG 17.4. I'd say that you would have to complain to the authors of the GNU C library, which provides this collation. Yep -- glibc starting with version 2.21 has a massive performance regression for certain cases and the glibc folks have basically said they will not fix it. If you try the same thing on RHEL 7.x with glibc 2.17 it will perform about the same as ICU. If you are using pg17 you should consider using the new builtin collation provider -- it will perform almost as well as the 'C' locale. Something like: CREATE DATABASE builtincoll LOCALE_PROVIDER builtin BUILTIN_LOCALE 'C.UTF-8' TEMPLATE template0; -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
Re: Slow performance of collate "en_US.utf8"
On Thu, 2025-02-27 at 16:54 +0300, Alexey Borschev wrote: > I see poor performance of text sorting of collate "en_US.utf8" in PG 17.4. I'd say that you would have to complain to the authors of the GNU C library, which provides this collation. Yours, Laurenz Albe
Re: Slow performance of collate "en_US.utf8"
Hi Alexey
On 2/27/25 15:54, Alexey Borschev wrote:
Hi everyone!
I see poor performance of text sorting of collate "en_US.utf8" in PG 17.4.
Test query:
explain (analyze, costs, buffers, verbose)
SELECT
('БвЁжЫйяЙSёYz&$' || gen.id) collate "en_US.utf8"
FROM generate_series(1, 1) AS gen(id)
order by 1 desc;
I've got execution time like:
Execution Time: 73.068 ms
Same poor result with ru_RU.UTF8.
With other collations time is much better:
explain (analyze, costs, buffers, verbose)
select
('БвЁжЫйяЙSёYz&$' || gen.id) collate "C"
from generate_series(1, 1) AS gen(id)
order by 1 desc;
Execution Time: 4.792 ms
explain (analyze, costs, buffers, verbose)
SELECT
('БвЁжЫйяЙSёYz&$' || gen.id) collate "C.utf8"
FROM generate_series(1, 1) AS gen(id)
order by 1 desc;
Execution Time: 7.473 ms
explain (analyze, costs, buffers, verbose)
select
('БвЁжЫйяЙSёYz&$' || gen.id) collate "und-x-icu"
from generate_series(1, 1) AS gen(id)
order by 1 desc;
Execution Time: 13.282 ms
Yes, collate C is fastest, ICU collations is ~ 2 slower then C, but
"en_US.utf8" is ~ 10x slower!
I suspect it is some performance issue over there.
Can someone of PG hackers reproduce this please?
1) This PG17 instance was installed with default options, and initdb
got en_US.utf8 as system default collation and created PG cluster with it.
It seems like most PG databases are created this way with en_US.utf8
by default.
It seems you initialized the cluster with libc as the locale provider.
Have you tried with icu ?
2) Typical text\varchar columns are created with DB default en_US.utf8
and performs poor.
explain (analyze, costs, buffers, verbose)
select
('БвЁжЫйяЙSёYz&$' || gen.id)
from generate_series(1, 1) AS gen(id)
order by 1 desc;
Execution Time: 73.600 ms
3) The index search operations are also slower with en_US.utf8, but
the difference is not as high.
Please see attached file with test table and indexes, with tests on
index performance.
System Details:
postgres@borschev-pg-copydb1:~$ uname -a Linux borschev-pg-copydb1
6.1.0-7-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.20-1 (2023-03-19)
x86_64 GNU/Linux
postgres@borschev-pg-copydb1:~$ cat /etc/issue Debian GNU/Linux
trixie/sid \n \l
select version();
PostgreSQL 17.4 (Debian 17.4-1.pgdg110+2) on x86_64-pc-linux-gnu,
compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
SELECT * FROM pg_config();
|name |setting |
|-|--|
|BINDIR |/usr/lib/postgresql/17/bin |
|DOCDIR |/usr/share/doc/postgresql-doc-17 |
|HTMLDIR |/usr/share/doc/postgresql-doc-17 |
|INCLUDEDIR |/usr/include/postgresql |
|PKGINCLUDEDIR |/usr/include/postgresql |
|INCLUDEDIR-SERVER|/usr/include/postgresql/17/server |
|LIBDIR |/usr/lib/x86_64-linux-gnu |
|PKGLIBDIR |/usr/lib/postgresql/17/lib |
|LOCALEDIR |/usr/share/locale |
|MANDIR |/usr/share/postgresql/17/man |
|SHAREDIR |/usr/share/postgresql/17 |
|SYSCONFDIR |/etc/postgresql-common |
|PGXS |/usr/lib/postgresql/17/lib/pgxs/src/makefiles/pgxs.mk |
|CONFIGURE | '--build=x86_64-linux-gnu' '--prefix=/usr'
'--includedir=${prefix}/include' '--mandir=${prefix}/share/man'
'--infodir=${prefix}/share/info' '--sysconfdir=/etc'
'--localstatedir=/var' '--disable-option-checking'
'--disable-silent-rules' '--libdir=${prefix}/lib/x86_64-linux-gnu'
'--run
