> > Before 7.4, to be handled by the regex routines, UTF-8 was converted to
> > ISO 10646. There was a limitation in the regex routines in that they
> > cannot handle multibyte characters > 2 bytes. In other words, only 16-bit
> > UCS-2 is supported. That's why ISO 10646 > 0x10000 is rejected.
>
> Is this still an issue? The sanity check is still in wchar.c, but I
> can store and retrieve UTF-8 characters quite fine in 8.0, and Paco
> Avila's example ("Cañón") works fine:
>
> # insert into test values('Cañón');
>
> # select * from test where x ~ '^C.*';
>    x
> -------
>  Cañón
>
> # select * from test where x ~ 'ñó';
>    x
> -------
>  Cañón
>
> Does anybody have an example that fails?

If the character is not rejected by the sanity check, it is definitely
not a UTF-8 character of more than 3 bytes. Can you provide me a hex dump
of the character?

> Tom, can this check in wchar.c finally be dropped?

I think things are not that simple. A quick glance through the source
code showed me that at least the following changes are necessary:

1) pg_utf2wchar_with_len() needs to be modified

2) the character conversion maps under backend/utils/mb/Unicode need to
   be expanded to the UCS-4 range

3) the Perl script backend/utils/mb/Unicode/ucs2utf.pl needs to be
   modified to handle UCS-4

4) the pg_local_to_utf.utf structure needs to be modified so that it can
   store >4 bytes data (currently it's unsigned int). The same can be
   said of pg_utf_to_local

5) 4) implies that UtfToLocal/LocalToUtf need to be modified

etc...
--
SRA OSS, Inc. Japan
Tatsuo Ishii
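
To make the byte-length argument above concrete, here is a minimal sketch in C of decoding one UTF-8 sequence into a UCS-4 code point and testing whether the result still fits in 16-bit UCS-2. The names utf8_decode() and fits_ucs2() are hypothetical and exist only for this illustration; this is not the code in wchar.c or pg_utf2wchar_with_len(), and continuation bytes are deliberately not validated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper (not PostgreSQL code): decode one UTF-8 sequence
 * of 1 to 4 bytes starting at s into a UCS-4 code point stored in *cp.
 * Returns the number of bytes consumed, or 0 if the lead byte is not a
 * valid UTF-8 lead byte. Continuation bytes are assumed to be present
 * and well formed; a real decoder must check them.
 */
static int
utf8_decode(const unsigned char *s, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);

    if (s[0] < 0x80)                    /* 0xxxxxxx: 1 byte */
    {
        *cp = s[0];
        return 1;
    }
    if ((s[0] & 0xE0) == 0xC0)          /* 110xxxxx: 2 bytes */
    {
        *cp = ((uint32_t) (s[0] & 0x1F) << 6) |
              (uint32_t) (s[1] & 0x3F);
        return 2;
    }
    if ((s[0] & 0xF0) == 0xE0)          /* 1110xxxx: 3 bytes */
    {
        *cp = ((uint32_t) (s[0] & 0x0F) << 12) |
              ((uint32_t) (s[1] & 0x3F) << 6) |
              (uint32_t) (s[2] & 0x3F);
        return 3;
    }
    if ((s[0] & 0xF8) == 0xF0)          /* 11110xxx: 4 bytes */
    {
        *cp = ((uint32_t) (s[0] & 0x07) << 18) |
              ((uint32_t) (s[1] & 0x3F) << 12) |
              ((uint32_t) (s[2] & 0x3F) << 6) |
              (uint32_t) (s[3] & 0x3F);
        return 4;
    }
    return 0;                           /* stray continuation or invalid lead byte */
}

/* Hypothetical range test: nonzero if cp is representable in 16-bit UCS-2. */
static int
fits_ucs2(uint32_t cp)
{
    return cp < 0x10000;
}
```

With this sketch, the "ñ" in "Cañón" (bytes C3 B1) decodes to U+00F1 in 2 bytes and fits in UCS-2, which is consistent with that example passing. A 4-byte sequence such as F0 90 80 80 decodes to U+10000 and does not fit, which is the kind of character a 16-bit limit would have to reject.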