Thanks for letting us know about this issue.

Asheesh's analysis is correct. Without linguistic analysis (which is probably more than we want to get into!) or user guidance, guessing the correct character set is a matter of trial and error of "which one of these character sets can do a lossless conversion of the Unicode source text."

Alpine offers the ability to provide user guidance by setting the posting-character-set to the user's preferred character set.

That list for trial and error is arbitrary, and is not intended to snub anyone. US-ASCII at one end, and UTF-8 at the other end are both no-brainers. ISO-8859-15 is for our Euro users, ISO-8859-1 is for legacy (note that most non-Unicode 8-bit stuff in North America is 8859-1), and ISO-2022-JP and KOI8-R are for the long-time Japanese and Russian user communities.
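For anyone who wants to see the trial-and-error idea concretely, here is a rough Python sketch. It is not Alpine's code. The function names are made up for illustration, the order of the middle entries in the list is my assumption (only US-ASCII first and UTF-8 last are stated above), and so is the way posting-character-set is handled here (tried right after US-ASCII, with UTF-8 as the fallback).

```python
# Hypothetical candidate list: only the endpoints (US-ASCII first,
# UTF-8 last) are stated in the message; the middle order is assumed.
CANDIDATES = ["US-ASCII", "ISO-8859-15", "ISO-8859-1",
              "ISO-2022-JP", "KOI8-R", "UTF-8"]


def encodes_losslessly(text, charset):
    """True if every character of text can be represented in charset."""
    try:
        text.encode(charset)  # strict mode raises on unmappable characters
    except UnicodeEncodeError:
        return False
    return True


def pick_charset(text, posting_charset=None):
    """Return the first candidate charset that can encode text losslessly.

    Assumption: when the user supplies a posting charset, it is tried
    right after US-ASCII and UTF-8 is the final fallback.
    """
    if posting_charset:
        candidates = ["US-ASCII", posting_charset, "UTF-8"]
    else:
        candidates = CANDIDATES
    for charset in candidates:
        if encodes_losslessly(text, charset):
            return charset
    raise ValueError("text cannot be encoded (e.g. lone surrogates)")


if __name__ == "__main__":
    for sample in ["hello", "café", "日本語", "สวัสดี"]:
        print(repr(sample), "->", pick_charset(sample))
```

Note that the first match wins, so the order of the list matters whenever more than one charset can represent the text.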

We probably could add to that list if there is a sufficient constituency; but as Asheesh says, the overlap between some ISO 8859 variants is such that a "wrong" (albeit lossless) encoding can still be chosen.
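The overlap is easy to demonstrate. The following Python sketch (the helper name and the three-charset list are just for illustration, not anything from Alpine) shows text that several ISO 8859 variants encode losslessly, and to identical bytes, so a lossless-conversion test alone cannot tell them apart.

```python
LATIN_VARIANTS = ["ISO-8859-1", "ISO-8859-2", "ISO-8859-15"]


def lossless_charsets(text, charsets):
    """Return every charset in charsets that can encode text losslessly."""
    matches = []
    for charset in charsets:
        try:
            text.encode(charset)  # strict mode raises on unmappable characters
        except UnicodeEncodeError:
            continue
        matches.append(charset)
    return matches


if __name__ == "__main__":
    # Only characters shared by all three variants: every charset matches.
    print(lossless_charsets("Straße", LATIN_VARIANTS))
    # Characters specific to one variant narrow the choice.
    print(lossless_charsets("Łódź", LATIN_VARIANTS))
    print(lossless_charsets("€", LATIN_VARIANTS))
```

Text that originated as ISO-8859-2 but happens to use only the shared characters would therefore be labeled with whichever matching charset comes first in the list.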

It's difficult to justify adding yet another Latin script variant charset while neglecting entire non-Latin scripts, e.g., Arabic, Chinese (simplified and traditional), Greek, Hebrew, Korean, Thai, etc. However, the more that is added to the list, the slower the overall process becomes for everyone, especially with the larger scripts.

There is one other possibility; we could have some mechanism to make a note of original source charset and try that charset specially. This would be similar to how Pine worked (but hopefully without all the incredible complexity and kluginess that the old Pine i18n code had!). Determining widths of "ambiguous" class characters would also be aided by such a mechanism. But we're talking possible futures here, and there's no guarantee that we'll do it. [I'm not the person to convince.]

In my personal opinion, "posting-character-set=UTF-8" is the only right setting; thus messages are either US-ASCII or UTF-8. However, practical considerations dictate otherwise for now, because some people will flame if they receive a message in UTF-8 rather than the local character set. That's why I have "posting-character-set=ISO-8859-1" in my own configuration, so I can't criticize others without facing a "pot, kettle, black" problem!

We are, however, interested in receiving feedback on this general issue. If a clear consensus evolves we would certainly consider it. Thanks!

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.

