Tzafrir Cohen wrote:

On Sat, Apr 30, 2005 at 06:07:16PM +0800, Steve Underwood wrote:


Michael Giagnocavo wrote:

Michael Giagnocavo wrote:

Hmm, you're right. That doesn't look bad at all.

But... what about for comparisons and other Unicode operations? Do the
libraries available support some UTF-8 version of strcmp, strchr,
strcasecmp, etc.?

Some of them are easy (strcmp, for example). Most of them are harder, because they either need to know character boundaries, or need case mappings (strcasecmp, for example). Any function that searches for a 'char' in a string also won't work if the character being searched for is a multi-byte one.
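A sketch of that last point in Python, operating on raw UTF-8 bytes (the string and index values here are just illustrative): a strchr-style scan for a single byte can land in the middle of a multi-byte character, but searching for the character's complete UTF-8 byte sequence behaves like strstr and is safe, because no character's encoding ever appears inside another's.

```python
# Why a single-byte strchr-style search fails for multi-byte characters,
# while a substring search on the full encoded sequence works.
# Assumes all text is UTF-8.

text = "naïve café".encode("utf-8")

# 'é' encodes to two bytes: 0xC3 0xA9.
target = "é".encode("utf-8")
assert target == b"\xc3\xa9"

# A strchr-style scan can only look for one byte.  Searching for 0xC3
# finds the *lead byte* of the first two-byte character -- here the 'ï'
# in "naïve" (0xC3 0xAF) -- not the 'é' we wanted.
print(text.find(b"\xc3"))   # -> 2, the start of 'ï'

# Searching for the complete byte sequence is safe: UTF-8 is
# self-synchronizing, so this can only match a real 'é'.
print(text.find(target))    # -> 10, the start of 'é'
```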

Not even strcmp works, because of things like combining characters: Unicode
can represent the same character using different code point sequences, yet
they are still considered the same. Say, a Latin o with an accent mark.
Using wide chars internally solves these issues, and is most likely faster,
depending on the data.
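A concrete instance of this, sketched in Python (the byte strings are simply the two standard Unicode representations of 'é'): the precomposed and decomposed forms display identically but compare as different, whether you compare code points or raw UTF-8 bytes.

```python
# One character, two valid Unicode representations.
# U+00E9 is precomposed 'é'; U+0065 U+0301 is 'e' plus a combining acute.
composed = "\u00e9"     # 'é' as a single code point
decomposed = "e\u0301"  # 'e' + COMBINING ACUTE ACCENT

print(composed == decomposed)       # -> False: different code points
print(composed.encode("utf-8"))     # -> b'\xc3\xa9'
print(decomposed.encode("utf-8"))   # -> b'e\xcc\x81'

# A raw strcmp/memcmp on either the code points or the UTF-8 bytes
# reports these as different, even though they render identically.
```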

Too right. Look at IBM's internationalisation classes for Unicode. It takes megabytes of code to compare two strings.

Do you just want to tell if they're equal, or to sort them?

Telling if they're equal is basically simple: just compare the raw bytes.
One small twist: it may be required to use canonical Unicode strings (I
hope I'm using the right term here), so you first convert them to a
canonical form and then compare them. Or simpler: mandate that all strings
be in canonical form.
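That convert-then-compare approach can be sketched in Python, whose standard library exposes normalization via unicodedata.normalize (the function name canonical_equal is just illustrative; NFC is one of the standard canonical forms):

```python
import unicodedata

def canonical_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to canonical form (NFC)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

composed = "\u00e9"     # precomposed 'é'
decomposed = "e\u0301"  # 'e' + combining acute accent

print(composed == decomposed)                 # -> False: raw comparison
print(canonical_equal(composed, decomposed))  # -> True: canonical comparison

# Mandating canonical form means normalizing once, where strings enter
# the system; after that a plain byte comparison is sufficient:
assert (unicodedata.normalize("NFC", decomposed).encode("utf-8")
        == composed.encode("utf-8"))
```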

Sorting is a more complicated issue if you don't want the literal byte order.


To tell if two strings are equal (really equal, rather than just byte for byte the same) you must bring both copies to canonical form. Very little Unicode is in canonical form, so this is not a small twist. It is a big PITA that must be done in every case. The process of creating canonical Unicode is slow, complex, and takes lots of code. I am unclear whether insisting on canonical form is something you can really do, and whether it is a complete answer. You need to find a linguistics expert who knows Unicode inside out to get a proper answer. There seems to be very little software that does proper comparisons at present.

Having said this, for most data processing purposes this can be skipped, and a byte by byte comparison used. If we just define that all text is UTF-8, the only unavoidable complexity is ensuring strings do not overrun buffers while also not stopping mid-character. As others have shown, this is trivial. UTF-8 can be scanned forwards and backwards in a context free manner.
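That truncation logic can be sketched as follows (Python operating on raw UTF-8 bytes; a C version would be the same few comparisons). It relies on the property just mentioned: continuation bytes always have the form 10xxxxxx (0x80-0xBF), so scanning backwards to the nearest lead byte finds a character boundary without any other context.

```python
def utf8_truncate(data: bytes, limit: int) -> bytes:
    """Truncate UTF-8 bytes to at most `limit` bytes without splitting
    a character.  Continuation bytes look like 10xxxxxx (0x80-0xBF),
    so scan backwards from the cut point to the nearest lead byte."""
    if len(data) <= limit:
        return data
    cut = limit
    # Back up while pointing at a continuation byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]

s = "héllo".encode("utf-8")   # h + é (2 bytes) + l l o -> 6 bytes
print(utf8_truncate(s, 2))    # -> b'h': cutting at 2 would split 'é'
print(utf8_truncate(s, 3))    # -> b'h\xc3\xa9': 'é' fits whole
print(utf8_truncate(s, 6))    # -> full string, nothing to cut
```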

Unicode is one of the classic botchups. They had the opportunity to really clean things up. Instead they took all the mess from earlier codes, and added a whole lot more. :-(

Regards,
Steve

_______________________________________________
Asterisk-Dev mailing list
[email protected]
http://lists.digium.com/mailman/listinfo/asterisk-dev
