On Tuesday 07 December 2004 12:44 am, Peter Samuelson wrote:
> > Defining the character set as utf-8 means that any non-unicode
> > capable application is going to have issues, yes.
>
> Postulate an app that is ignorant of character sets - we'll call it
> "aptitude".  Fixing it to make it accept utf-8 and spit out the correct
> encoding for its LC_CTYPE is no harder than fixing it to make it accept
> iso-8859-1 and spit out the correct encoding for its LC_CTYPE.
>
> And if the app already deals with charset conversions but assumes
> iso-8859-1 input, then it's trivial to fix it to assume utf-8 input.
This is not true.  iso-8859-1 is an 8-bit charset, while Unicode is a
32-bit [0] charset.  Storing and manipulating iso-8859-1 strings requires
no changes to internal datatypes (only conversions for input and output);
storing and manipulating Unicode means you have to switch to a completely
different set of string-handling functions for all internal operations.

In C++ you might be able to partly finesse this by creating a replacement
string class, but if our program (call it "aptitude") is already using a
complex replacement string class for some tasks, and this class assumes
that characters are 8 bits wide, this might be a slightly non-trivial
task, especially compared to handling iso-8859-1.  Hypothetically
speaking. :-)

On the other hand, once the program is using Unicode internally, taking
iso-8859-1 as input and producing it as output should be no problem.

  Daniel

[0] According to the libc manual, only 16 bits have been assigned, but
    GNU systems use 32-bit encoding internally if the libc transcoding
    functions are used.

-- 
/------------------- Daniel Burrows <[EMAIL PROTECTED]> ------------------\
|                             swapon /dev/ram                             |
\--- News without the $$ -- National Public Radio -- http://www.npr.org ---/
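As a rough illustration of the kind of conversion being discussed, here is
a minimal sketch (not taken from aptitude; the helper names decode_input
and encode_output are made up, and error handling is reduced to returning
an empty string).  It decodes input in the locale's encoding -- UTF-8 or
iso-8859-1, depending on LC_CTYPE -- into wide strings for internal use,
and re-encodes wide strings for output:

// Sketch: locale-encoded input -> wide strings internally -> locale-encoded
// output.  Helper names are hypothetical; real code would need better error
// handling and, for stateful encodings, the restartable mbsrtowcs/wcsrtombs.

#include <clocale>
#include <cstdlib>
#include <iostream>
#include <string>
#include <vector>

// Decode a multibyte string (in the current locale's encoding) to wide chars.
std::wstring decode_input(const std::string &mb)
{
  std::vector<wchar_t> buf(mb.size() + 1);   // a wide char per byte is enough
  std::size_t n = std::mbstowcs(buf.data(), mb.c_str(), buf.size());
  if(n == static_cast<std::size_t>(-1))
    return L"";                              // invalid sequence for this locale
  return std::wstring(buf.data(), n);
}

// Encode a wide string back into the current locale's multibyte encoding.
std::string encode_output(const std::wstring &ws)
{
  std::vector<char> buf(ws.size() * MB_CUR_MAX + 1);
  std::size_t n = std::wcstombs(buf.data(), ws.c_str(), buf.size());
  if(n == static_cast<std::size_t>(-1))
    return "";                               // character not representable
  return std::string(buf.data(), n);
}

int main()
{
  std::setlocale(LC_ALL, "");                // honour LC_CTYPE from the env

  std::string line;
  std::getline(std::cin, line);

  std::wstring internal = decode_input(line); // all manipulation on wchar_t
  std::cout << encode_output(internal) << '\n';
  return 0;
}

The point of the sketch is the one made above: the conversions at the
edges are the easy part; it is every internal operation that assumed
8-bit characters which has to move to the wide-character (on GNU systems,
32-bit, per footnote [0]) string functions or an equivalent string class.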