> Should the translation be "accurate" or should it be "useful"?
I think it should be accurate for file systems. Such a "useful" translation is a good concept for displaying output (maybe even that of the DIR command) but not for actually working with the file system. Keyboard input can't map one key to several characters at once (unless you randomly (-; decide which one to use), so input handling should use one-to-one translation too.

> From a technical perspective, you will also at a minimum need to concern
> yourself with translating strings vs. translating single characters
> (UniCode strings can/should include an Endian-defining character at the
> beginning, as well as needing to define how the length of the string is
> determined), UTF-8 vs. UTF-16 vs. UTF-32, and Big- vs. Little-endian.
> None of this is trivial, and I think this is WAY too complicated to be
> in the kernel -- it should be a separate program/driver.

UTF-8 is independent of byte-order. The exact encoding (and byte-order) should always either be implicit (in the interface's or format's definition) or be marked in some way.

The definition of a string's length (possibly number of bytes/words/dwords, number of code-points, number of "characters") need not be addressed by such an interface. If there is a need for a buffer or string length (see below), a new interface should just define that all "length" fields/parameters give the length in bytes.

If there was a DOS (kernel) interface, it should probably accept a single character (usually one byte, two bytes for DBCS) encoded in the currently selected code page and return a Unicode code-point. All code-points fit into a 24-bit (= 3-byte) number, though such an interface can be limited to Unicode's BMP (16-bit numbers (= words)) like the DOSLFN/VC tables. Of course there should be an "accurate" reverse interface which accepts a 24-bit (or 16-bit) number and returns a one- or two-byte character in the current code page if one exists for that Unicode code-point.
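
To make the shape of such a pair of interfaces concrete, here is a rough sketch in Python (purely for illustration; a real DOS interface would be an interrupt call or similar). The function names to_unicode and from_unicode are made up, and Python's built-in cp437 and cp932 codecs merely stand in for whatever tables the kernel or a driver would use.

```python
# Hypothetical sketch only: the function names and the use of Python's
# codecs are assumptions, not part of any existing DOS interface.

def to_unicode(char_bytes, codepage="cp437"):
    """Translate one code-page character (one byte, or two for DBCS)
    into a Unicode code-point.

    Returns None if the bytes do not form exactly one valid character.
    """
    try:
        text = char_bytes.decode(codepage)
    except UnicodeDecodeError:
        return None
    return ord(text) if len(text) == 1 else None


def from_unicode(code_point, codepage="cp437"):
    """"Accurate" reverse translation: return the one- or two-byte
    character for this code-point, or None if the code page has none.
    No best-fit or "useful" substitution is attempted.
    """
    try:
        return chr(code_point).encode(codepage)  # strict error handling
    except (UnicodeEncodeError, ValueError):
        return None


if __name__ == "__main__":
    print(hex(to_unicode(b"\x81")))               # CP437 byte 0x81 is U+00FC
    print(from_unicode(0x00FC))                   # back to b'\x81'
    print(from_unicode(0x20AC))                   # Euro sign: None in CP437
    print(hex(to_unicode(b"\x82\xa0", "cp932")))  # two-byte DBCS character
```

The reverse function deliberately returns "no mapping" instead of guessing a look-alike character, which is the behaviour I would want for file names.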

Notably, some code pages might contain characters that should map to several code-points, and some code-points might require more than two bytes when represented in the current code page's encoding. A string translation interface might therefore be more appropriate. (As an aside, this would remove the need for a DBCS kludge because multi-byte mappings could be supported intrinsically.) In this case, the interface should exactly define what Unicode encoding to use (UTF-8, -16BE, -16LE, -32BE, -32LE) - applications have to figure out on their own what encoding their data uses.

Regards,
Christian

_______________________________________________
Freedos-devel mailing list
https://lists.sourceforge.net/lists/listinfo/freedos-devel
