Re: Performance degradation in Index searches with special characters
Hi, Tom!
Thanks for your feedback. After looking into it further, it seems the
performance issue is indeed related to the default collation settings,
particularly when handling certain special characters like < in the glibc
strcoll_l function. This was confirmed during my testing on Debian 12 with
glibc version 2.36 (this OS and glibc are being used in our office's Docker
image: https://hub.docker.com/_/postgres).
My test database settings:
test_db=# \d+ test
                                            Table "public.test"
 Column |         Type          | Collation | Nullable | Default | Storage  | Compression | Stats target | Description
--------+-----------------------+-----------+----------+---------+----------+-------------+--------------+-------------
 value  | character varying(10) |           | not null |         | extended |             |              |
Indexes:
    "idx_test" btree (value)
Access method: heap

test_db=# \l test_db
                                              List of databases
  Name   |  Owner   | Encoding | Locale Provider |  Collate   |   Ctype    | Locale | ICU Rules | Access privileges
---------+----------+----------+-----------------+------------+------------+--------+-----------+-------------------
 test_db | postgres | UTF8     | libc            | en_US.utf8 | en_US.utf8 |        |           |
(1 row)
strcoll_l tests:
root@715b19170a89:~# cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"
root@715b19170a89:~# ldd --version
ldd (Debian GLIBC 2.36-9+deb12u8) 2.36
Copyright (C) 2022 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Roland McGrath and Ulrich Drepper.
root@715b19170a89:~# cat test.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <locale.h>

int main(void) {
    char str1[] = "<";
    char *str2 = malloc(65536);   /* 65535 '<' characters + 1 for the null terminator */
    if (!str2) return 1;
    memset(str2, '<', 65535);
    str2[65535] = '\0';
    locale_t locale = newlocale(LC_COLLATE_MASK, "en_US.UTF-8", NULL);
    if (!locale) { free(str2); return 1; }
    /* Same libc call PostgreSQL reaches via varstr_cmp()/strncoll_libc(). */
    int result = strcoll_l(str1, str2, locale);
    printf("Comparison result: %d\n", result);
    freelocale(locale);
    free(str2);
    return 0;
}
root@715b19170a89:~# time ./test
Comparison result: -1
real 0m4.487s
user 0m4.483s
sys 0m0.003s
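
For what it's worth, the ~4.5 s above is in line with the ~4.5 s the indexed
lookup took in my original report, and the same cost shows up for a single
comparison with no table or index involved (as Tom also observed). A quick
check from psql, assuming \timing is on:

```sql
\timing on
-- Single comparison under the database default collation (en_US.utf8 here);
-- no table or index involved.
SELECT '<' <= repeat('<', 65536);
```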
I'm considering switching to ICU collations, as they might handle this more
efficiently. However, as far as I know, ICU isn't the default collation
provider in PostgreSQL, and switching to it in a live environment isn't
straightforward. My main concern is that glibc's default collation
(en_US.UTF-8) is widely used, and this behavior opens up the potential for a
Denial of Service (DoS) attack: if user input includes long strings of
repeated characters like <, a single comparison can take seconds, which
severely degrades performance in high-traffic scenarios.
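
If it helps anyone reproduce the comparison against ICU, this is roughly what
I plan to try (assuming the server was built with ICU support, in which case
initdb should have created the predefined "und-x-icu" collation; I haven't
benchmarked it yet):

```sql
-- Sketch only: same lookup, but forcing an ICU collation instead of libc.
-- Requires an ICU-enabled build; "und-x-icu" is created by initdb in that case.
SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE "und-x-icu";
```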
On Sun, 6 Oct 2024 at 19:39, Tom Lane wrote:
> Andrey Stikheev writes:
> >- Changing the collation to 'C' in the query significantly improves
> >performance.
>
> What collation are you using, pray tell? (And what database encoding?)
> >- Is this performance degradation expected due to collation handling of
> >certain special characters in PostgreSQL?
>
> It seems like a performance bug indeed, but not ours --- I'm thinking
> it must be in strcoll() or ICU.
> Trying it here (on a RHEL8 machine) with en_US.utf8 locale, I see
> something similar but not quite as bad:
>
> u8=# SELECT 1 FROM test WHERE value = repeat('<', 65536);
> ?column?
> --
> (0 rows)
>
> Time: 1383.033 ms (00:01.383)
>
> Poking into it with gdb says that the time is indeed spent inside
> strcoll:
>
> #0 get_next_seq (pass=1, indirect=0x7fabb4988ea8, extra=0x7fabb4984900
> "",
> table=0x7fabb490e2b0, weights=,
> rulesets=0x7fabb490e2a8 "\001\002\001\005\001\001\001\005", nrules=4,
> seq=) at strcoll_l.c:111
> #1 __GI___strcoll_l (s1=0x1785878 "<",
> s2=0x178587a '<' ..., l=)
> at strcoll_l.c:338
> #2 0x009527a6 in strncoll_libc (arg1=, len1=1,
> arg2=, len2=65536, locale=,
> locale=) at pg_locale.c:1964
> #3 0x009ac760 in varstr_cmp (arg1=0x7fabc2dcbfe9 "<", len1=1,
> arg2=0x17958cc '<' ..., len2=65536,
> collid=) at varlena.c:1567
> #4 0x009acfe3 in bttextcmp (fcinfo=0x7ffddd3b0590) at
> varlena.c:1820
> #5 0x009d75fa in FunctionCall2Coll (
> flinfo=flinfo@entry=0x7ffddd3b10e8, collation=,
> arg1=, arg2=) at fmgr.c:1161
> #6 0x00594948 in _bt_compare (rel=0x7fabcde7eed0,
> key=0x7ffddd3b10c0,
> page=, offnum=) at nbtsearch.c:762
Re: Performance degradation in Index searches with special characters
On 10/6/24 13:28, Andrey Stikheev wrote:
> Thanks for your feedback. After looking into it further, it seems the
> performance issue is indeed related to the default collation settings,
> particularly when handling certain special characters like < in the glibc
> strcoll_l function. This was confirmed during my testing on Debian 12 with
> glibc version 2.36 (this OS and glibc are being used in our office's Docker
> image: https://hub.docker.com/_/postgres).

This is not surprising. There is a performance regression that started in
glibc 2.21 with regard to sorting unicode. Test with RHEL 7.x (glibc 2.17)
and I bet you will see comparable results to ICU. The best answer in the
long term, IMHO, is likely to use the new built-in collation just released
in Postgres 17.

--
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com
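
A minimal sketch of what that last suggestion might look like on PostgreSQL 17
(the predefined pg_c_utf8 collation and the builtin locale provider are
assumed to be what is meant here; the database name is illustrative):

```sql
-- Sketch only, assuming PostgreSQL 17's builtin locale provider.
-- Per query, via the predefined pg_c_utf8 collation:
SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE pg_c_utf8;

-- Or for a whole new database:
CREATE DATABASE test_db_builtin
    TEMPLATE = template0
    LOCALE_PROVIDER = builtin
    BUILTIN_LOCALE = 'C.UTF-8';
```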
Re: Performance degradation in Index searches with special characters
Joe Conway writes:
> This is not surprising. There is a performance regression that started
> in glibc 2.21 with regard to sorting unicode. Test with RHEL 7.x (glibc
> 2.17) and I bet you will see comparable results to ICU. The best answer
> in the long term, IMHO, is likely to use the new built-in collation just
> released in Postgres 17.

It seems unrelated to unicode though --- I also reproduced the issue in a
database with LATIN1 encoding. Whatever, it is pretty awful, but the place
to be complaining to is the glibc maintainers. Not much we can do about it.

regards, tom lane
Performance degradation in Index searches with special characters
Dear PostgreSQL Community,
I am facing significant performance issues when executing queries that
involve string comparisons with special characters, such as <, #, !, @,
etc., especially when dealing with long strings. The query execution time
increases drastically when these characters are used, whereas queries with
alphabetic characters do not show such degradation. This behavior is
observed both on macOS (using the official postgres:17 image via Docker)
and on an Ubuntu 20.04 server running PostgreSQL in an LXC container.
Here is a minimal example:
testdb=# SELECT version();
version
---
PostgreSQL 17.0 (Debian 17.0-1.pgdg120+1) on aarch64-unknown-linux-gnu,
compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
(1 row)
testdb=# CREATE TABLE test (value VARCHAR(10) NOT NULL);
CREATE TABLE
Time: 3.562 ms
testdb=# CREATE INDEX idx_test ON test (value);
CREATE INDEX
Time: 3.080 ms
testdb=# INSERT INTO test (value) VALUES ('<');
INSERT 0 1
Time: 3.365 ms
testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536);
?column?
--
(0 rows)
Time: 4454.535 ms (00:04.455)
testdb=# SELECT 1 FROM test WHERE value = repeat('a', 65536);
?column?
--
(0 rows)
Time: 3.772 ms
testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536) || 'a';
?column?
--
(0 rows)
Time: 4.352 ms
testdb=# SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE "C";
?column?
--
(0 rows)
Time: 3.299 ms
testdb=# SELECT 1 FROM test WHERE value = repeat('@', 8192);
?column?
--
(0 rows)
Time: 77.171 ms
testdb=# SELECT 1 FROM test WHERE value = repeat('@', 16384);
?column?
--
(0 rows)
Time: 325.190 ms
testdb=# SELECT 1 FROM test WHERE value = repeat('@', 32768);
?column?
--
(0 rows)
Time: 1154.850 ms (00:01.155)
testdb=# SELECT 1 FROM test WHERE value = repeat('@', 65536);
?column?
--
(0 rows)
Time: 4490.206 ms (00:04.490)
testdb=# explain (analyze, verbose, buffers, costs, settings, timing, wal)
SELECT 1 FROM test WHERE value = repeat('<', 65000);
                                                   QUERY PLAN
-----------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on public.test  (cost=4.20..13.67 rows=6 width=4) (actual time=4425.459..4425.459 rows=0 loops=1)
   Output: 1
   Recheck Cond: ((test.value)::text = ' ... a lot symbols ... '::text)
   Buffers: shared hit=1
   ->  Bitmap Index Scan on idx_test  (cost=0.00..4.20 rows=6 width=0) (actual time=4425.432..4425.432 rows=0 loops=1)
         Index Cond: ((test.value)::text = ' ... a lot symbols ...<<'::text)
         Buffers: shared hit=1
 Planning Time: 1.082 ms
 Execution Time: 4425.602 ms
(9 rows)
Time: 4433.001 ms (00:04.433)
Observations:
- The performance degradation occurs with certain special characters
like <, !, >, @, #, etc.
- Queries using alphabetic characters, or the same strings with another
character appended or prepended, execute much faster.
- The execution time grows roughly quadratically with the length of the
string of special characters: it roughly quadruples each time the length
doubles.
- Changing the collation to 'C' in the query significantly improves
performance.
Questions:
- Is this performance degradation expected due to collation handling of
certain special characters in PostgreSQL?
- Are there any recommendations to improve performance without changing
the column or database collation?
--
Best regards,
Andrey Stikheev
Re: Performance degradation in Index searches with special characters
Andrey Stikheev writes:
>- Changing the collation to 'C' in the query significantly improves
>performance.
What collation are you using, pray tell? (And what database encoding?)
>- Is this performance degradation expected due to collation handling of
>certain special characters in PostgreSQL?
It seems like a performance bug indeed, but not ours --- I'm thinking
it must be in strcoll() or ICU.
Trying it here (on a RHEL8 machine) with en_US.utf8 locale, I see
something similar but not quite as bad:
u8=# SELECT 1 FROM test WHERE value = repeat('<', 65536);
?column?
--
(0 rows)
Time: 1383.033 ms (00:01.383)
Poking into it with gdb says that the time is indeed spent inside
strcoll:
#0 get_next_seq (pass=1, indirect=0x7fabb4988ea8, extra=0x7fabb4984900 "",
table=0x7fabb490e2b0, weights=,
rulesets=0x7fabb490e2a8 "\001\002\001\005\001\001\001\005", nrules=4,
seq=) at strcoll_l.c:111
#1 __GI___strcoll_l (s1=0x1785878 "<",
s2=0x178587a '<' ..., l=)
at strcoll_l.c:338
#2 0x009527a6 in strncoll_libc (arg1=, len1=1,
arg2=, len2=65536, locale=,
locale=) at pg_locale.c:1964
#3 0x009ac760 in varstr_cmp (arg1=0x7fabc2dcbfe9 "<", len1=1,
arg2=0x17958cc '<' ..., len2=65536,
collid=) at varlena.c:1567
#4 0x009acfe3 in bttextcmp (fcinfo=0x7ffddd3b0590) at varlena.c:1820
#5 0x009d75fa in FunctionCall2Coll (
flinfo=flinfo@entry=0x7ffddd3b10e8, collation=,
arg1=, arg2=) at fmgr.c:1161
#6 0x00594948 in _bt_compare (rel=0x7fabcde7eed0, key=0x7ffddd3b10c0,
page=, offnum=) at nbtsearch.c:762
#7 0x00594e32 in _bt_binsrch (rel=rel@entry=0x7fabcde7eed0,
key=key@entry=0x7ffddd3b10c0, buf=) at nbtsearch.c:394
It's not the fault of the index machinery, because a single comparison
takes the same amount of time:
u8=# select '<' <= repeat('<', 65536);
?column?
--
t
(1 row)
Time: 1391.550 ms (00:01.392)
I didn't try it with ICU.
regards, tom lane
Re: Performance degradation in Index searches with special characters
Hi Andrey,
I have tried my best to answer your queries below:
### Performance Degradation with Special Characters in PostgreSQL
**Explanation**:
The performance degradation you're experiencing when using special
characters like `<`, `@`, `#`, etc., is likely due to how PostgreSQL
handles **collations**. By default, PostgreSQL uses locale-aware collations
provided by the operating system's libc (for example en_US.UTF-8) for
string comparisons, which apply multi-level comparison rules rather than a
simple byte-by-byte comparison. Some characters are handled much less
efficiently under certain locales, resulting in slower comparisons,
especially with longer strings.
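
One way to confirm which provider and collation actually apply to the
database and to the column (the datlocprovider column assumes PostgreSQL 15
or later):

```sql
-- Which locale provider and collation does the current database use?
SELECT datname, datlocprovider, datcollate, datctype
FROM pg_database
WHERE datname = current_database();

-- Which collation is attached to the column? ("default" means the
-- database default shown above)
SELECT attname, attcollation::regcollation AS collation
FROM pg_attribute
WHERE attrelid = 'test'::regclass AND attname = 'value';
```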
**Key Observations**:
1. **Alphabetic vs. Special Characters**:
- Queries with alphabetic characters perform significantly faster than
those with special characters.
- This is due to differences in how the collation handles comparisons
for alphabetic and non-alphabetic characters. Special characters often
require more computational resources for comparison, resulting in longer
execution times.
2. **Impact of Collation**:
- When you switched the collation to **"C"**, the performance improved
substantially. This is because the `"C"` collation uses **byte-wise**
comparisons rather than locale-aware sorting, which simplifies the
comparison logic, especially for special characters (see the sketch after
this list).
3. **String Length Impact**:
- As the string length increases, the performance degrades exponentially
when using special characters. This is due to the collation’s computational
complexity for each additional character comparison.
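
To make observation 2 concrete, a quick sketch showing that `"C"` and the
libc locale genuinely order strings differently, which is the trade-off
accepted in exchange for the faster byte-wise comparison (the en_US.utf8
collation name matches your setup and is assumed to exist in pg_collation):

```sql
-- Byte-wise order vs. locale order: 'a' (0x61) sorts after 'B' (0x42)
-- under "C", but before it under en_US.utf8.
SELECT 'a'::text < 'B'::text COLLATE "C"          AS byte_order,    -- false
       'a'::text < 'B'::text COLLATE "en_US.utf8" AS locale_order;  -- true
```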
**Recommendations**:
1. **Use the "C" Collation**:
- The most effective improvement you observed came from using the `"C"`
collation, which performs simple byte-by-byte comparisons. You can apply
the `"C"` collation at the query level or alter the column definition:
```sql
SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE "C";
```
or
```sql
ALTER TABLE test ALTER COLUMN value SET DATA TYPE VARCHAR(10) COLLATE "C";
```
2. **Index with an Alternative Collation**:
- Create an additional index on the column using the `"C"` collation.
PostgreSQL can use this index when the query performs its comparison with
`COLLATE "C"`:
```sql
CREATE INDEX idx_test_special_chars ON test (value COLLATE "C");
```
3. **Reduce String Length**:
- If possible, limit the length of the strings being compared (for example,
by validating or truncating user input before it reaches the query). The
performance impact grows with longer strings due to the nature of
locale-aware collation.
4. **Use Functional Indexes**:
- You can use functional indexes that apply transformations (e.g.,
trimming special characters or converting to ASCII) before comparison:
```sql
CREATE INDEX idx_test_transformed ON test (lower(value));
```
5. **Optimize Queries**:
- For repeated queries involving long strings with special characters,
try caching results or using materialized views where the collation
overhead is reduced by pre-computing.
6. **Custom Collations**:
- If you need locale-aware sorting but can compromise on certain
aspects, consider creating a **custom collation** that simplifies special
character handling.
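
As an illustration of point 6, a sketch of defining and using an alternative
collation (assumes an ICU-enabled build; the collation name is illustrative):

```sql
-- Illustrative only: an ICU-based collation that can be used per column or
-- per query instead of the libc default.
CREATE COLLATION IF NOT EXISTS en_us_icu (provider = icu, locale = 'en-US');

SELECT 1 FROM test WHERE value = repeat('<', 65536) COLLATE en_us_icu;
```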
**Answers to Your Questions**:
1. **Is the performance degradation expected due to collation handling of
certain special characters?**
- Yes, this behavior is expected. Locale-aware collations can be complex
for special characters, leading to longer comparison times.
2. **Are there any recommendations to improve performance without changing
the column or database collation?**
- Without changing the collation entirely, you can use:
- **Query-level collation adjustments** (using the `"C"` collation in
specific queries).
- **Partial indexes** with the `"C"` collation.
- **Functional indexes** or optimizations like materialized views.
On Sun, Oct 6, 2024 at 3:53 PM Andrey Stikheev wrote:
> Dear PostgreSQL Community,
>
> I am facing significant performance issues when executing queries that
> involve string comparisons with special characters, such as <, #, !, @,
> etc., especially when dealing with long strings. The query execution time
> increases drastically when these characters are used, whereas queries with
> alphabetic characters do not show such degradation. This behavior is
> observed both on macOS (using the official postgres:17 image via Docker)
> and on an Ubuntu 20.04 server running PostgreSQL in an LXC container.
>
> Here is a minimal example:
>
> testdb=# SELECT version();
>
> version
>
>
>
> ---
>
> PostgreSQL 17.0 (Debian 17.0-1.pgdg120+1) on aarch64-unknown-linux-gnu,
> compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
>
> (1 row)
>
> testdb=# CREATE TABLE test (value VARCHAR(10) NOT NULL);
> CREATE TABLE
> Time: 3.562 ms
>
> testdb=# CREATE INDEX idx_test ON test (value);
> CREATE INDEX
Re: Performance degradation in Index searches with special characters
On Mon, Oct 7, 2024 at 9:02 AM Shiv Iyer wrote:
> - As the string length increases, the performance degrades exponentially
> when using special characters. This is due to the collation’s computational
> complexity for each additional character comparison.

That's a pretty interesting observation, worthy of a bug report. I don't
know the details offhand, but the algorithm used should be basically the
same, or more likely a simpler subset, of what other collation
implementations are using IIRC (don't quote me, but I think it's supposed
to be ISO 14651, which is essentially a subset of the UCA stuff that ICU is
using (it is "aligned with" UCA DUCET), without CLDR customisations and
perhaps various other complications), so it doesn't sound like it should be
fundamentally required to be *more* expensive than ICU...
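
For anyone who wants to attach numbers to such a bug report, a rough way to
time the two providers side by side from psql (assumes an ICU-enabled build
so that the predefined "en-US-x-icu" collation exists):

```sql
\timing on
-- libc path (glibc strcoll_l):
SELECT '<' <= repeat('<', 65536) COLLATE "en_US.utf8";
-- ICU path, for comparison:
SELECT '<' <= repeat('<', 65536) COLLATE "en-US-x-icu";
```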
