Thanks for letting us know about this issue.
Asheesh's analysis is correct. Without linguistic analysis (which is
probably more than we want to get into!) or user guidance, guessing the
correct character set is a matter of trial and error of "which one of
these character sets can do a lossless conversion of the Unicode source
text."
Alpine offers the ability to provide user guidance by setting the
posting-character-set to the user's preferred character set.
That list for trial and error is arbitrary, and is not intended to snub
anyone. US-ASCII at one end, and UTF-8 at the other end are both
no-brainers. ISO-8859-15 is for our Euro users, ISO-8859-1 is for legacy
(note that most non-Unicode 8-bit stuff in North America is 8859-1), and
ISO-2022-JP and KOI8-R are for the long-time Japanese and Russian user
communities.
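
To make the trial-and-error idea concrete, here is a rough Python sketch
(not Alpine's actual code). The function name pick_charset, the
"preferred" parameter standing in for posting-character-set, and the
exact ordering of the middle candidates are all assumptions made for
illustration; only the set of charsets and the US-ASCII-first,
UTF-8-last arrangement come from the description above.

```python
# Hypothetical candidate list: US-ASCII first, UTF-8 last.
# The order of the entries in between is an assumption.
CANDIDATES = [
    "US-ASCII",
    "ISO-8859-15",
    "ISO-8859-1",
    "KOI8-R",
    "ISO-2022-JP",
    "UTF-8",
]


def pick_charset(text, preferred=None):
    """Return the first charset that can encode `text` losslessly.

    `preferred` models user guidance (something like a
    posting-character-set value) and is tried before the list.
    """
    order = ([preferred] if preferred else []) + CANDIDATES
    for name in order:
        try:
            # A strict encode raises if any character has no mapping,
            # so success means the conversion is lossless.
            text.encode(name)
        except UnicodeEncodeError:
            continue
        return name
    # UTF-8 can represent any well-formed Unicode text, so it is the
    # natural fallback.
    return "UTF-8"
```

With this ordering, plain English text comes out as US-ASCII, "café"
as ISO-8859-15, and Thai text falls all the way through to UTF-8.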
We probably could add to that list if there is a sufficient constituency;
but as Asheesh says, the overlap between some ISO 8859 variants is such
that a "wrong", albeit lossless, encoding can still be chosen.
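
The overlap is easy to see with a small Python sketch. The helper name
lossless_charsets is made up for this example, and the particular
variants tested are just a sample; the point is only that several ISO
8859 variants accept the same text, so a lossless result does not tell
you which one the author intended.

```python
def lossless_charsets(text, candidates):
    """Return every candidate that can encode `text` without loss."""
    usable = []
    for name in candidates:
        try:
            text.encode(name)  # strict mode: raises on unmappable characters
        except UnicodeEncodeError:
            continue
        usable.append(name)
    return usable


# "café" fits Latin-1, Latin-2 and Latin-9 alike, but not the Greek variant.
variants = ["ISO-8859-1", "ISO-8859-2", "ISO-8859-7", "ISO-8859-15"]
print(lossless_charsets("caf\u00e9", variants))
```

Whichever of the matching variants comes first in the trial list wins,
whether or not it is the one the sender would have picked.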
It's difficult to justify adding yet another Latin script variant charset
while neglecting entire non-Latin scripts, e.g., Arabic, Chinese
(simplified and traditional), Greek, Hebrew, Korean, Thai, etc. However,
the more that is added to the list, the slower the overall process for
everyone, especially with the larger scripts.
There is one other possibility; we could have some mechanism to make a
note of original source charset and try that charset specially. This
would be similar to how Pine worked (but hopefully without all the
incredible complexity and kluginess that the old Pine i18n code had!).
Determining widths of "ambiguous" class characters would also be aided by
such a mechanism. But we're talking possible futures here, and there's no
guarantee that we'll do it. [I'm not the person to convince.]
In my personal opinion, "posting-character-set=UTF-8" is the only right
setting; thus messages are either US-ASCII or UTF-8. However, practical
considerations dictate otherwise for now, because some people will flame
if they receive a message in UTF-8 rather than the local character set.
That's why I have "posting-character-set=ISO-8859-1" in my own configuration, so
I can't criticize others without facing a "pot, kettle, black" problem!
We are, however, interested in receiving feedback on this general issue.
If a clear consensus evolves we would certainly consider it. Thanks!
-- Mark --
http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.