> > Before 7.4, UTF-8 was converted to ISO 10646 so that it could be
> > handled by the regex routines. There was a limitation in the regex
> > routines: they could not handle multibyte characters > 2 bytes. In other
> > words, only 16-bit UCS-2 was supported. That's why ISO 10646 > 0x10000
> > is rejected.
> 
> Is this still an issue? The sanity check is still in wchar.c, but I
> can store and retrieve UTF-8 characters quite fine in 8.0, and Paco
> Avila's example ("Cañón") works fine:
> 
>   # insert into test values('Cañón');
> 
>   # select * from test where x ~ '^C.*';
>      x
>    -------
>     Cañón
>  
>   # select * from test where x ~ 'ñó';
>      x
>    -------
>     Cañón
> 
> Does anybody have an example that fails?

If the character is not rejected by the sanity check, it is definitely
not a UTF-8 character longer than 3 bytes. Can you provide me with a hex
dump of the character?
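
For reference, here is a minimal sketch of how such a hex dump could be
produced, together with the sequence length implied by each lead byte.
The names utf8_seq_len and utf8_hex_dump are made up for this
illustration and are not PostgreSQL functions; the sketch only looks at
lead bytes and does not validate continuation bytes.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: UTF-8 sequence length implied by the lead byte,
 * or 0 if the byte cannot start a sequence. */
static int utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx: beyond UCS-2 */
    return 0;
}

/* Hypothetical helper: write the bytes of s into out as space-separated
 * hex. Returns the longest sequence length seen, or 0 if a lead byte is
 * invalid, a sequence is truncated, or out is too small. */
static int utf8_hex_dump(const char *s, char *out, size_t outsize)
{
    size_t len = strlen(s);
    size_t i = 0, pos = 0;
    int longest = 0;

    if (outsize == 0) return 0;
    out[0] = '\0';
    while (i < len) {
        int n = utf8_seq_len((unsigned char) s[i]);
        if (n == 0 || i + (size_t) n > len) return 0;
        if (n > longest) longest = n;
        for (int k = 0; k < n; k++) {
            /* up to 3 characters plus the terminating NUL */
            if (pos + 4 > outsize) return 0;
            pos += (size_t) snprintf(out + pos, outsize - pos,
                                     pos ? " %02X" : "%02X",
                                     (unsigned int) (unsigned char) s[i + k]);
        }
        i += (size_t) n;
    }
    return longest;
}
```

Called on the UTF-8 encoding of "Cañón", this yields the bytes
43 61 C3 B1 C3 B3 6E and a longest sequence length of 2, so that string
never reaches the range the sanity check is concerned with.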

> Tom, can this check in wchar.c finally be dropped?

I don't think things are that simple.
A quick glance through the source code showed me that at least the
following changes are necessary:

1) pg_utf2wchar_with_len() needs to be modified

2) character conversion maps under backend/utils/mb/Unicode need to
   be expanded to the UCS-4 range

3) the Perl script backend/utils/mb/Unicode/ucs2utf.pl needs to be
   modified to handle UCS-4

4) the utf field of the pg_local_to_utf structure needs to be modified
   so that it can store >4 bytes of data (currently it's an unsigned
   int). The same applies to pg_utf_to_local

5) item 4) implies that UtfToLocal/LocalToUtf need to be modified

etc...
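
To give an idea of what item 1) involves, here is a rough sketch of
decoding a UTF-8 sequence of up to 4 bytes into a 32-bit code point.
This is not the PostgreSQL code: utf8_to_ucs4 is a hypothetical name,
and the sketch does not reject overlong encodings or surrogates.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not PostgreSQL code: decode one UTF-8 sequence of
 * 1 to 4 bytes from s (len bytes available) into *cp. Returns the number
 * of bytes consumed, or 0 on a malformed or truncated sequence.
 * Overlong encodings and surrogates are NOT rejected in this sketch. */
static int utf8_to_ucs4(const unsigned char *s, size_t len, uint32_t *cp)
{
    uint32_t c;
    int n;

    if (len == 0) return 0;

    if (s[0] < 0x80)                { c = s[0];        n = 1; }
    else if ((s[0] & 0xE0) == 0xC0) { c = s[0] & 0x1F; n = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { c = s[0] & 0x0F; n = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { c = s[0] & 0x07; n = 4; } /* >= 0x10000 */
    else return 0;                  /* not a valid lead byte */

    if ((size_t) n > len) return 0; /* truncated sequence */

    for (int i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0; /* not a continuation byte */
        c = (c << 6) | (uint32_t) (s[i] & 0x3F);
    }
    *cp = c;
    return n;
}
```

The 4-byte branch is the part a 16-bit target cannot represent: the
bytes F0 90 80 80 decode to 0x10000, which does not fit in UCS-2.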
--
SRA OSS, Inc. Japan
Tatsuo Ishii
