ICU, locale and collation question

2023-05-08 Thread Oscar Carlberg

Hello,

We have a bunch of existing Postgres 10 clusters running on CentOS 7, 
which have been initialized (initdb) with these collation options;


-E 'UTF-8'
--lc-collate=sv_SE.UTF-8
--lc-ctype=sv_SE.UTF-8
--lc_monetary=sv_SE.UTF-8
--lc-numeric=sv_SE.UTF-8
--lc-time=sv_SE.UTF-8
--lc-messages=en_US.UTF-8

And createdb were provided with these locale options when setting up 
databases:

--lc-collate=sv_SE.UTF-8
--lc-ctype=sv_SE.UTF-8

\l in psql gives:

 Name |  Owner   | Encoding | Collate   
|    Ctype    |

--+--+--+-+-+
 test-db  | test-user    | UTF8 | 
sv_SE.UTF-8 | sv_SE.UTF-8 |



We're upgrading the servers using logical replication, and would like to 
take the opportunity to switch to ICU rather than relying on glibc, to 
avoid future problems with index corruption if using physical 
replication between servers with different versions of glibc.


We're trying to figure out the most correct way to configure postgres to 
achieve this. Currently we have:


-E 'UTF-8'
--locale-provider=icu
--icu-locale=sv-SE-x-icu

And createdb are provided with locale options:
--lc-collate=C
--lc-ctype=C

\l in psql now gives:

    Name |    Owner    | Encoding | Collate   |    
Ctype    | ICU Locale  | Locale Provider |

-+-+--+-+-+-+-+
 test-db | test-user   | UTF8 | C   | 
C   | sv-SE-x-icu | icu |


Is this a safe configuration to avoid index corruption, and other 
problems, while still being compatible with the previous locale 
settings? We have done some testing and it appears ORDER BY does sort 
rows according to Swedish localization in the ICU configured test-db.


We're uncertain since this blogpost -> 
https://peter.eisentraut.org/blog/2022/09/26/icu-features-in-postgresql-15
mentions that there are still some postgres code relying on libc locale 
facilities. Should we set lc-collate and lc-ctype to sv_SE.UTF-8 when 
creating databases in addition to the ICU options provided to initdb due 
to this? Will we still be safe from glibc related corruption as long as 
--locale-provider=icu --icu-locale=sv-SE-x-icu is set?


Best Regards,

Oscar


--
Innehållet i detta e-postmeddelande är konfidentiellt och avsett endast för 
adressaten.Varje spridning, kopiering eller utnyttjande av innehållet är 
förbjuden utan tillåtelse av avsändaren. Om detta meddelande av misstag 
gått till fel adressat vänligen radera det ursprungliga meddelandet och 
underrätta avsändaren via e-post





Re: ICU, locale and collation question

2023-05-08 Thread Oscar Carlberg

On 5/8/23 22:07, Laurenz Albe wrote:


Yes, that is safe.

But it is not compatible with the old setup when it comes to lc_time,
lc_messages and the others.  You should use sv_SE.UTF-8 for these
locale categories.

Note that that won't cause problems with upgrading the C library.

Yours,
Laurenz Albe


Thank you very much for clarifying!

Our initdb setup would then look like this for compatibility;
-E 'UTF-8'
--locale-provider=icu
--icu-locale=sv-SE-x-icu
--lc_monetary=sv_SE.UTF-8
--lc-numeric=sv_SE.UTF-8
--lc-time=sv_SE.UTF-8
--lc-messages=en_US.UTF-8

Should we still provide createdb with --lc-collate=C and --lc-ctype=C, 
or should we set those to sv_SE.UTF-8 as well?


Best regards,
Oscar


--
Innehållet i detta e-postmeddelande är konfidentiellt och avsett endast för 
adressaten.Varje spridning, kopiering eller utnyttjande av innehållet är 
förbjuden utan tillåtelse av avsändaren. Om detta meddelande av misstag 
gått till fel adressat vänligen radera det ursprungliga meddelandet och 
underrätta avsändaren via e-post