Re: Performance degradation in Index searches with special characters
Hi, Tom!
Thanks for your feedback. After looking into it further, it seems the
performance issue is indeed related to the default collation settings,
particularly when handling certain special characters like < in the glibc
strcoll_l function. This was confirmed during my testing on Debian 12 with
glibc version 2.36 (this OS and glibc are being used in our office's Docker
image: https://hub.docker.com/_/postgres).
My test database settings:
test_db=# \d+ test
                                            Table "public.test"
 Column |         Type          | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+-----------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 value  | character varying(10) |           | not null |         | extended |             |              |
Indexes:
    "idx_test" btree (value)
Access method: heap

test_db=# \l test_db
                                              List of databases
  Name   |  Owner   | Encoding | Locale Provider |  Collate   |   Ctype    | Locale | ICU Rules | Access privileges
---------+----------+----------+-----------------+------------+------------+--------+-----------+-------------------
 test_db | postgres | UTF8     | libc            | en_US.utf8 | en_US.utf8 |        |           |
(1 row)
strcoll_l tests:
root@715b19170a89:~# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
root@715b19170a89:~# ldd --version
ldd (Debian GLIBC 2.36-9+deb12u8) 2.36
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
root@715b19170a89:~# cat test.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>

int main(void) {
    char str1[] = "<";
    char *str2 = malloc(65536);   /* 65535 '<' characters + 1 for the null terminator */
    if (!str2) return 1;
    memset(str2, '<', 65535);
    str2[65535] = '\0';
    locale_t locale = newlocale(LC_COLLATE_MASK, "en_US.UTF-8", NULL);
    if (!locale) { free(str2); return 1; }
    /* Same libc call PostgreSQL reaches via varstr_cmp()/strncoll_libc(). */
    int result = strcoll_l(str1, str2, locale);
    printf("Comparison result: %d\n", result);
    freelocale(locale);
    free(str2);
    return 0;
}
root@715b19170a89:~# time ./test
Comparison result: -1
real 0m4.487s
user 0m4.483s
sys 0m0.003s
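
For what it's worth, the ~4.5 s above is in line with the ~4.5 s the indexed
lookup took in my original report, and the same cost shows up for a single
comparison with no table or index involved (as Tom also observed). A quick
check from psql, assuming \timing is on:

```sql
\timing on
-- Single comparison under the database default collation (en_US.utf8 here);
-- no table or index involved.
SELECT '<' <= repeat('<', 65536);
```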
I'm considering switching to ICU collations, as they might handle this more
efficiently. However, as far as I know, ICU isn't the default collation
provider in PostgreSQL, and switching to it in a live environment isn't
straightforward. My main concern is that glibc's default collation
(en_US.UTF-8) is widely used, and this behavior opens up the potential for a
Denial of Service (DoS) attack: if user input includes long strings of
repeated characters like <, a single comparison can take seconds, which
severely degrades performance in high-traffic scenarios.
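
If it helps anyone reproduce the comparison against ICU, this is roughly what
I plan to try (assuming the server was built with ICU support, in which case
initdb should have created the predefined "und-x-icu" collation; I haven't
benchmarked it yet):

```sql
-- Sketch only: same lookup, but forcing an ICU collation instead of libc.
-- Requires an ICU-enabled build; "und-x-icu" is created by initdb in that case.
SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE "und-x-icu";
```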
On Sun, 6 Oct 2024 at 19:39, Tom Lane wrote:
> Andrey Stikheev writes:
> >- Changing the collation to 'C' in the query significantly improves
> >performance.
>
> What collation are you using, pray tell? (And what database encoding?)
> >- Is this performance degradation expected due to collation handling of
> >certain special characters in PostgreSQL?
>
> It seems like a performance bug indeed, but not ours --- I'm thinking
> it must be in strcoll() or ICU.
> Trying it here (on a RHEL8 machine) with en_US.utf8 locale, I see
> something similar but not quite as bad:
>
> u8=# SELECT 1 FROM test WHERE value = repeat('<', 65536);
> ?column?
> --
> (0 rows)
>
> Time: 1383.033 ms (00:01.383)
>
> Poking into it with gdb says that the time is indeed spent inside
> strcoll:
>
> #0 get_next_seq (pass=1, indirect=0x7fabb4988ea8, extra=0x7fabb4984900
> "",
> table=0x7fabb490e2b0, weights=,
> rulesets=0x7fabb490e2a8 "\001\002\001\005\001\001\001\005", nrules=4,
> seq=) at strcoll_l.c:111
> #1 __GI___strcoll_l (s1=0x1785878 "<",
> s2=0x178587a '<' ..., l=)
> at strcoll_l.c:338
> #2 0x009527a6 in strncoll_libc (arg1=, len1=1,
> arg2=, len2=65536, locale=,
> locale=) at pg_locale.c:1964
> #3 0x009ac760 in varstr_cmp (arg1=0x7fabc2dcbfe9 "<", len1=1,
> arg2=0x17958cc '<' ..., len2=65536,
> collid=) at varlena.c:1567
> #4 0x009acfe3 in bttextcmp (fcinfo=0x7ffddd3b0590) at
> varlena.c:1820
> #5 0x009d75fa in FunctionCall2Coll (
> flinfo=flinfo@entry=0x7ffddd3b10e8, collation=,
> arg1=, arg2=) at fmgr.c:1161
> #6 0x00594948 in _bt_compare (rel=0x7fabcde7eed0,
> key=0x7ffddd3b10c0,
> page=, offnum=) at nbtsearch.c:762
Re: Performance degradation in Index searches with special characters
On 10/6/24 13:28, Andrey Stikheev wrote:
> Thanks for your feedback. After looking into it further, it seems the
> performance issue is indeed related to the default collation settings,
> particularly when handling certain special characters like < in the glibc
> strcoll_l function. This was confirmed during my testing on Debian 12 with
> glibc version 2.36 (this OS and glibc are being used in our office's Docker
> image: https://hub.docker.com/_/postgres).

This is not surprising. There is a performance regression that started in
glibc 2.21 with regard to sorting unicode. Test with RHEL 7.x (glibc 2.17)
and I bet you will see comparable results to ICU. The best answer in the
long term, IMHO, is likely to use the new built-in collation just released
in Postgres 17.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
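
A minimal sketch of what that last suggestion might look like on PostgreSQL 17
(the predefined pg_c_utf8 collation and the builtin locale provider are
assumed to be what is meant here; the database name is illustrative):

```sql
-- Sketch only, assuming PostgreSQL 17's builtin locale provider.
-- Per query, via the predefined pg_c_utf8 collation:
SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE pg_c_utf8;

-- Or for a whole new database:
CREATE DATABASE test_db_builtin
    TEMPLATE = template0
    LOCALE_PROVIDER = builtin
    BUILTIN_LOCALE = 'C.UTF-8';
```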
Re: Performance degradation in Index searches with special characters
Joe Conway writes:
> This is not surprising. There is a performance regression that started
> in glibc 2.21 with regard to sorting unicode. Test with RHEL 7.x (glibc
> 2.17) and I bet you will see comparable results to ICU. The best answer
> in the long term, IMHO, is likely to use the new built-in collation just
> released in Postgres 17.

It seems unrelated to unicode though --- I also reproduced the issue in a
database with LATIN1 encoding. Whatever, it is pretty awful, but the place
to be complaining to is the glibc maintainers. Not much we can do about it.

regards, tom lane
Performance degradation in Index searches with special characters
Dear PostgreSQL Community,
I am facing significant performance issues when executing queries that
involve string comparisons with special characters, such as <, #, !, @,
etc., especially when dealing with long strings. The query execution time
increases drastically when these characters are used, whereas queries with
alphabetic characters do not show such degradation. This behavior is
observed both on macOS (using the official postgres:17 image via Docker)
and on an Ubuntu 20.04 server running PostgreSQL in an LXC container.
Here is a minimal example:
testdb=# SELECT version();
version
---
PostgreSQL 17.0 (Debian 17.0-1.pgdg120+1) on aarch64-unknown-linux-gnu,
compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
(1 row)
testdb=# CREATE TABLE test (value VARCHAR(10) NOT NULL);
CREATE TABLE
Time: 3.562 ms
testdb=# CREATE INDEX idx_test ON test (value);
CREATE INDEX
Time: 3.080 ms
testdb=# INSERT INTO test (value) VALUES ('<');
INSERT 0 1
Time: 3.365 ms
testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536);
?column?
--
(0 rows)
Time: 4454.535 ms (00:04.455)
testdb=# SELECT 1 FROM test WHERE value = repeat('a', 65536);
?column?
--
(0 rows)
Time: 3.772 ms
testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536) || 'a';
?column?
--
(0 rows)
Time: 4.352 ms
testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE "C";
?column?
--
(0 rows)
Time: 3.299 ms
testdb=# SELECT 1 FROM test WHERE value = repeat('@', 8192);
?column?
--
(0 rows)
Time: 77.171 ms
testdb=# SELECT 1 FROM test WHERE value = repeat('@', 16384);
?column?
--
(0 rows)
Time: 325.190 ms
testdb=# SELECT 1 FROM test WHERE value = repeat('@', 32768);
?column?
--
(0 rows)
Time: 1154.850 ms (00:01.155)
testdb=# SELECT 1 FROM test WHERE value = repeat('@', 65536);
?column?
--
(0 rows)
Time: 4490.206 ms (00:04.490)
testdb=# explain (analyze, verbose, buffers, costs, settings, timing, wal)
SELECT 1 FROM test WHERE value = repeat('<', 65000);
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.test  (cost=4.20..13.67 rows=6 width=4) (actual time=4425.459..4425.459 rows=0 loops=1)
   Output: 1
   Recheck Cond: ((test.value)::text = ' ... a lot symbols ... '::text)
   Buffers: shared hit=1
   ->  Bitmap Index Scan on idx_test  (cost=0.00..4.20 rows=6 width=0) (actual time=4425.432..4425.432 rows=0 loops=1)
         Index Cond: ((test.value)::text = ' ... a lot symbols ...<<'::text)
         Buffers: shared hit=1
 Planning Time: 1.082 ms
 Execution Time: 4425.602 ms
(9 rows)
Time: 4433.001 ms (00:04.433)
Observations:
- The performance degradation occurs with certain special characters
like <, !, >, @, #, etc.
- Queries using alphabetic characters, or the same strings with another
character appended or prepended, execute much faster.
- The execution time grows roughly quadratically with the length of the
string of special characters: it roughly quadruples each time the length
doubles.
- Changing the collation to 'C' in the query significantly improves
performance.
Questions:
- Is this performance degradation expected due to collation handling of
certain special characters in PostgreSQL?
- Are there any recommendations to improve performance without changing
the column or database collation?
--
Best regards,
Andrey Stikheev
Re: Performance degradation in Index searches with special characters
Andrey Stikheev writes:
>- Changing the collation to 'C' in the query significantly improves
>performance.
What collation are you using, pray tell? (And what database encoding?)
>- Is this performance degradation expected due to collation handling of
>certain special characters in PostgreSQL?
It seems like a performance bug indeed, but not ours --- I'm thinking
it must be in strcoll() or ICU.
Trying it here (on a RHEL8 machine) with en_US.utf8 locale, I see
something similar but not quite as bad:
u8=# SELECT 1 FROM test WHERE value = repeat('<', 65536);
?column?
--
(0 rows)
Time: 1383.033 ms (00:01.383)
Poking into it with gdb says that the time is indeed spent inside
strcoll:
#0 get_next_seq (pass=1, indirect=0x7fabb4988ea8, extra=0x7fabb4984900 "",
table=0x7fabb490e2b0, weights=,
rulesets=0x7fabb490e2a8 "\001\002\001\005\001\001\001\005", nrules=4,
seq=) at strcoll_l.c:111
#1 __GI___strcoll_l (s1=0x1785878 "<",
s2=0x178587a '<' ..., l=)
at strcoll_l.c:338
#2 0x009527a6 in strncoll_libc (arg1=, len1=1,
arg2=, len2=65536, locale=,
locale=) at pg_locale.c:1964
#3 0x009ac760 in varstr_cmp (arg1=0x7fabc2dcbfe9 "<", len1=1,
arg2=0x17958cc '<' ..., len2=65536,
collid=) at varlena.c:1567
#4 0x009acfe3 in bttextcmp (fcinfo=0x7ffddd3b0590) at varlena.c:1820
#5 0x009d75fa in FunctionCall2Coll (
flinfo=flinfo@entry=0x7ffddd3b10e8, collation=,
arg1=, arg2=) at fmgr.c:1161
#6 0x00594948 in _bt_compare (rel=0x7fabcde7eed0, key=0x7ffddd3b10c0,
page=, offnum=) at nbtsearch.c:762
#7 0x00594e32 in _bt_binsrch (rel=rel@entry=0x7fabcde7eed0,
key=key@entry=0x7ffddd3b10c0, buf=) at nbtsearch.c:394
It's not the fault of the index machinery, because a single comparison
takes the same amount of time:
u8=# select '<' <= repeat('<', 65536);
?column?
--
t
(1 row)
Time: 1391.550 ms (00:01.392)
I didn't try it with ICU.
regards, tom lane
Re: Performance degradation in Index searches with special characters
Hi Andrey,
I have tried my best to answer your queries below:
### Performance Degradation with Special Characters in PostgreSQL
**Explanation**:
The performance degradation you're experiencing when using special
characters like `<`, `@`, `#`, etc., is likely due to how PostgreSQL
handles **collations**. By default, PostgreSQL uses locale-aware collations
provided by the operating system's libc (for example en_US.UTF-8) for
string comparisons, which apply multi-level comparison rules rather than a
simple byte-by-byte comparison. Some characters are handled much less
efficiently under certain locales, resulting in slower comparisons,
especially with longer strings.
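
One way to confirm which provider and collation actually apply to the
database and to the column (the datlocprovider column assumes PostgreSQL 15
or later):

```sql
-- Which locale provider and collation does the current database use?
SELECT datname, datlocprovider, datcollate, datctype
FROM pg_database
WHERE datname = current_database();

-- Which collation is attached to the column? ("default" means the
-- database default shown above)
SELECT attname, attcollation::regcollation AS collation
FROM pg_attribute
WHERE attrelid = 'test'::regclass AND attname = 'value';
```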
**Key Observations**:
1. **Alphabetic vs. Special Characters**:
- Queries with alphabetic characters perform significantly faster than
those with special characters.
- This is due to differences in how the collation handles comparisons
for alphabetic and non-alphabetic characters. Special characters often
require more computational resources for comparison, resulting in longer
execution times.
2. **Impact of Collation**:
- When you switched the collation to **"C"**, the performance improved
substantially. This is because the `"C"` collation uses **byte-wise**
comparisons rather than locale-aware sorting, which simplifies the
comparison logic, especially for special characters (see the sketch after
this list).
3. **String Length Impact**:
- As the string length increases, the performance degrades exponentially
when using special characters. This is due to the collation’s computational
complexity for each additional character comparison.
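
To make observation 2 concrete, a quick sketch showing that `"C"` and the
libc locale genuinely order strings differently, which is the trade-off
accepted in exchange for the faster byte-wise comparison (the en_US.utf8
collation name matches your setup and is assumed to exist in pg_collation):

```sql
-- Byte-wise order vs. locale order: 'a' (0x61) sorts after 'B' (0x42)
-- under "C", but before it under en_US.utf8.
SELECT 'a'::text < 'B'::text COLLATE "C"          AS byte_order,    -- false
       'a'::text < 'B'::text COLLATE "en_US.utf8" AS locale_order;  -- true
```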
**Recommendations**:
1. **Use the "C" Collation**:
- The most effective improvement you observed came from using the `"C"`
collation, which performs simple byte-by-byte comparisons. You can apply
the `"C"` collation at the query level or alter the column definition:
```sql
SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE "C";
```
or
```sql
ALTER TABLE test ALTER COLUMN value SET DATA TYPE VARCHAR(10) COLLATE "C";
```
2. **Index with an Alternative Collation**:
- Create an additional index on the column using the `"C"` collation.
PostgreSQL can use this index when the query performs its comparison with
`COLLATE "C"`:
```sql
CREATE INDEX idx_test_special_chars ON test (value COLLATE "C");
```
3. **Reduce String Length**:
- If possible, limit the length of the strings being compared (for example,
by validating or truncating user input before it reaches the query). The
performance impact grows with longer strings due to the nature of
locale-aware collation.
4. **Use Functional Indexes**:
- You can use functional indexes that apply transformations (e.g.,
trimming special characters or converting to ASCII) before comparison:
```sql
CREATE INDEX idx_test_transformed ON test (lower(value));
```
5. **Optimize Queries**:
- For repeated queries involving long strings with special characters,
try caching results or using materialized views where the collation
overhead is reduced by pre-computing.
6. **Custom Collations**:
- If you need locale-aware sorting but can compromise on certain
aspects, consider creating a **custom collation** that simplifies special
character handling.
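
As an illustration of point 6, a sketch of defining and using an alternative
collation (assumes an ICU-enabled build; the collation name is illustrative):

```sql
-- Illustrative only: an ICU-based collation that can be used per column or
-- per query instead of the libc default.
CREATE COLLATION IF NOT EXISTS en_us_icu (provider = icu, locale = 'en-US');

SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE en_us_icu;
```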
**Answers to Your Questions**:
1. **Is the performance degradation expected due to collation handling of
certain special characters?**
- Yes, this behavior is expected. Locale-aware collations can be complex
for special characters, leading to longer comparison times.
2. **Are there any recommendations to improve performance without changing
the column or database collation?**
- Without changing the collation entirely, you can use:
- **Query-level collation adjustments** (using the `"C"` collation in
specific queries).
- **Partial indexes** with the `"C"` collation.
- **Functional indexes** or optimizations like materialized views.
On Sun, Oct 6, 2024 at 3:53 PM Andrey Stikheev wrote:
> Dear PostgreSQL Community,
>
> I am facing significant performance issues when executing queries that
> involve string comparisons with special characters, such as <, #, !, @,
> etc., especially when dealing with long strings. The query execution time
> increases drastically when these characters are used, whereas queries with
> alphabetic characters do not show such degradation. This behavior is
> observed both on macOS (using the official postgres:17 image via Docker)
> and on an Ubuntu 20.04 server running PostgreSQL in an LXC container.
>
> Here is a minimal example:
>
> testdb=# SELECT version();
>
> version
>
>
>
> ---
>
> PostgreSQL 17.0 (Debian 17.0-1.pgdg120+1) on aarch64-unknown-linux-gnu,
> compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
>
> (1 row)
>
> testdb=# CREATE TABLE test (value VARCHAR(10) NOT NULL);
> CREATE TABLE
> Time: 3.562 ms
>
> testdb=# CREATE INDEX idx_test ON test (value);
> CREATE INDEX
Re: Performance degradation in Index searches with special characters
On Mon, Oct 7, 2024 at 9:02 AM Shiv Iyer wrote:
> - As the string length increases, the performance degrades exponentially
> when using special characters. This is due to the collation’s computational
> complexity for each additional character comparison.

That's a pretty interesting observation, worthy of a bug report. I don't
know the details offhand, but the algorithm used should be basically the
same, or more likely a simpler subset, of what other collation
implementations are using IIRC (don't quote me, but I think it's supposed
to be ISO 14651, which is essentially a subset of the UCA stuff that ICU is
using (it is "aligned with" UCA DUCET), without CLDR customisations and
perhaps various other complications), so it doesn't sound like it should be
fundamentally required to be *more* expensive than ICU...
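
For anyone who wants to attach numbers to such a bug report, a rough way to
time the two providers side by side from psql (assumes an ICU-enabled build
so that the predefined "en-US-x-icu" collation exists):

```sql
\timing on
-- libc path (glibc strcoll_l):
SELECT '<' <= repeat('<', 65536) COLLATE "en_US.utf8";
-- ICU path, for comparison:
SELECT '<' <= repeat('<', 65536) COLLATE "en-US-x-icu";
```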
