George wrote:
On Mon, 2017-06-05 at 16:16 -0700, L A Walsh wrote:
George wrote:
On Mon, 2017-06-05 at 15:59 +0700, Peter & Kelly Passchier wrote:
On 05/06/2560 15:52, George wrote:
there's not a reliable mechanism in place to run a script in a
locale whose character encoding doesn't match that of the script
In my experience, running such scripts is no problem, but rendering
them correctly might depend on the client/editor.
It depends on the source and target encodings. For most pairs of
source and target encoding there is some case where reinterpreting a
string from the source encoding as a string in the target encoding
(without proper conversion) will result in an invalid string in the
target encoding. For instance, if a script were written in
ISO-8859-1, many possible sequences involving accented characters
would actually be invalid in UTF-8.
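As a quick illustration (assuming GNU iconv is available): 'é' is the single byte 0xE9 in ISO-8859-1, which is an incomplete multibyte sequence when reinterpreted as UTF-8:

```shell
# proper conversion from ISO-8859-1 succeeds:
printf 'caf\xe9' | iconv -f ISO-8859-1 -t UTF-8   # -> café
# but the same bytes read *as* UTF-8 are rejected, since 0xE9
# introduces a three-byte sequence that never completes:
printf 'caf\xe9' | iconv -f UTF-8 -t UTF-8        # iconv reports an error
```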
---
Um... I think you are answering a case that is different from the one
stated (i.e., the locale being the same as the one used in the script),
so no conversion should take place.
Eduardo's patch ... can only work correctly if the character set
configured in the locale is the same as the character set of the script.
----
Right. The 1st paragraph (written by you), above, mentions that.
Given the 1st paragraph (which no one is contesting), we are only
talking about the case where the run locale and script locale are the same.
The Passchiers wrote that regardless of such agreement, you can still find
editors that may ignore the locale or have no locale support at all,
and only display characters from the locale where the editor was written. While that is
also true, it can't really be helped: if your local editor only writes
in Chinese and the script is written in ASCII, you may be out of luck
in having it display properly.
Broadly speaking I think the approach taken in Eduardo's patch
(interpreting the byte sequence according to the rules of its
character encoding) is better than the approach taken in current
versions of Bash (letting 0x80-0xFF slide through the parser) - but
that approach only works if you know the correct character encoding to
use when processing the script. The information has to be provided in
the script somehow.
---
Not exactly -- the only variable-length encoding scheme that
Linux systems have had to worry about is UTF-8. So if you encounter
UTF-8 in the input, it is probable that you can use UTF-8 for the
whole script. Otherwise, use a binary decoding stream: let 0x80-0xFF
be treated either as the second half of a 256-byte charset, or as a
128-byte charset whose parity bit is not stripped but left "as-is".
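That heuristic can be sketched with iconv (assuming an iconv that rejects invalid input, as GNU iconv does): if the byte stream round-trips as UTF-8, treat the script as UTF-8; otherwise fall back to treating the bytes as-is. Note that pure ASCII also passes the check, which is harmless since ASCII is a subset of UTF-8:

```shell
guess_encoding() {
    # valid UTF-8 round-trips through iconv; anything else errors out
    if iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1; then
        echo UTF-8
    else
        echo binary/8-bit
    fi
}
printf 'echo caf\xc3\xa9\n' > /tmp/u.sh; guess_encoding /tmp/u.sh   # -> UTF-8
printf 'echo caf\xe9\n'     > /tmp/l.sh; guess_encoding /tmp/l.sh   # -> binary/8-bit
```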
The "file" utility is one example of a tool that can usually tell
the encoding of a text file -- at least telling the difference between
UTF-8, ASCII and some 8-bit local charset.
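For example, with GNU file's --mime-encoding option (availability and exact labels vary by version, so the second label is only typical):

```shell
# file(1) guesses the encoding from byte patterns in the content
printf 'echo caf\xc3\xa9\n' > /tmp/utf8.sh     # UTF-8 bytes for 'é'
printf 'echo caf\xe9\n'     > /tmp/latin1.sh   # ISO-8859-1 byte for 'é'
file --mime-encoding -b /tmp/utf8.sh     # -> utf-8
file --mime-encoding -b /tmp/latin1.sh   # typically iso-8859-1
```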
While such methods may not be 100% accurate, they are usually good
enough
for most usages where one isn't running (we hope) random scripts of unknown
origin off the web.
FWIW, I think we are in agreement, though it may not be clear! ;-)