Re: International characters in source files and SVN commit messages (was: RE:r1525975)

Mark Thomas Wed, 25 Sep 2013 08:54:38 -0700

On 25/09/2013 08:36, Konstantin Preißer wrote:
> Hi Mark,
> 
> thanks for the reply.
> 
>> -----Original Message----- From: Mark Thomas
>> [mailto:ma...@apache.org] Sent: Wednesday, September 25, 2013 5:01
>> PM
> 
>>> One way I can think would be to XML-encode such characters ("ß"
>>> as "&#xDF;"). However, personally I would rather not do this, but
>>> write such characters directly ("ß"), so that the source is
>>> better readable (and encodings like UTF-8 guarantee that the
>>> characters are interpreted the same on each system, independently
>>> from the system language or geographic location).
>> 
>> I don't like the idea of using XML encoding at all.
> 
> Just to avoid a misunderstanding, with "XML encoding" you mean
> numeric character references like &#nnn; ?


Yes.
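
For illustration, a numeric character reference is simply the character's code point written in decimal (&#223;) or hexadecimal (&#xDF;). A minimal sketch, assuming Python purely as the illustration language (nothing in this thread uses it):

```python
import html

# Encode every non-ASCII character as a decimal numeric character reference.
encoded = "Preißer".encode("ascii", "xmlcharrefreplace").decode("ascii")

# The decimal and the hexadecimal form decode back to the same character.
decoded_dec = html.unescape("&#223;")
decoded_hex = html.unescape("&#xDF;")
```

The encoded form is pure ASCII, so it survives any ASCII-compatible mail encoding, at the cost of readability in the source file.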

>>> Could it be possible to change SVN Commit E-Mail system so that
>>> it may interpret diffs as UTF-8 instead of ISO-8859-1 (assuming
>>> all files which contain bytes > 0x7F are encoded as UTF-8)? (Or,
>>> that it tries to decode it as UTF-8, and if it fails, decode it
>>> as ISO-8859-1 ?)
>> 
>> This is a question for infra. If UTF-8 fails then ISO-8859-1 is
>> going to fail as well.
> 
> I mean guessing the character encoding by first trying to decode the
> file as UTF-8 and, if that fails, assuming it was encoded as
> ISO-8859-1/Windows-1252. This approach seems to be used by some
> programs to decide whether a file was encoded as UTF-8 or as ANSI
> when it has no BOM.
> 
> For example, consider a file that contains only ASCII characters (<=
> 0x7F), stored as a single byte per character. As UTF-8 is
> ASCII-compatible, you will get the same result whether you interpret
> it as UTF-8 or as ISO-8859-1.
> 
> However, if you have a file that contains "äöü" (German umlaut
> characters) as ISO-8859-1 (bytes: E4 F6 FC), then UTF-8 decoding will
> fail because a byte that starts with 11xxxxxx (binary) must be
> followed by a byte that starts with 10xxxxxx, and here it is not; but
> decoding as ISO-8859-1 will succeed.
> 
> This approach to guessing the encoding (UTF-8 vs.
> ISO-8859-1/Windows-1252) seems to be used by programs like Notepad++
> when opening text files without a BOM, and by TortoiseSVN when
> displaying file changes. It seems to work well if your files are
> either UTF-8 or ISO-8859-1/Windows-1252 (or another local encoding).
> Of course, this will not always work, e.g. if a text file encoded
> with Windows-1252 actually contains text like "ÃŸ" (bytes C3 9F,
> which are also valid UTF-8 for "ß"). (Personally, for my projects I
> use UTF-8 for everything :) )
> 
> 
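
The fallback heuristic described above can be sketched as follows. This is only an illustration under stated assumptions: Python is used for brevity, `guess_decode` is a hypothetical helper name, and it is not how the SVN commit mailer, Notepad++, or TortoiseSVN is actually implemented.

```python
from typing import Tuple


def guess_decode(data: bytes) -> Tuple[str, str]:
    """Return (text, encoding_used) for the given bytes.

    Try strict UTF-8 first; if the bytes are not valid UTF-8, fall back
    to ISO-8859-1, which maps every byte value to a character and so
    never raises a decoding error.
    """
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        return data.decode("iso-8859-1"), "iso-8859-1"


# Pure ASCII decodes identically either way; UTF-8 is reported.
ascii_result = guess_decode(b"abc")

# "äöü" stored as ISO-8859-1 (E4 F6 FC) is not valid UTF-8, so the
# fallback is used.
latin1_result = guess_decode(bytes([0xE4, 0xF6, 0xFC]))

# The ambiguous case: bytes C3 9F are valid UTF-8, so they are read as
# "ß" even if the author meant two single-byte characters.
ambiguous_result = guess_decode(bytes([0xC3, 0x9F]))
```

The last case shows why this is a guess and not a guarantee: any single-byte text that happens to form valid UTF-8 sequences will be decoded as UTF-8.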
> I was asking because I saw some i18n files like
> "LocalStrings_ja.properties" that encode non-ASCII characters with
> "\uXXXX", and I'd like to know if it is okay to put characters like
> "ß" directly in the XML file without encoding them as numeric
> character references,

I'd say yes. Property files are a 'special' case:
http://stackoverflow.com/questions/4659929/how-to-use-utf-8-in-resource-properties-with-resourcebundle
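
Background on why property files are special: Java's `Properties.load(InputStream)` reads the file as ISO-8859-1, so characters outside that range have to be written as "\uXXXX" escapes, whereas an XML file declares its own encoding. A rough sketch of that escaping, assuming Python for illustration and a hypothetical helper name `to_properties_escapes` (it escapes everything above ASCII, which is stricter than strictly required):

```python
def to_properties_escapes(text: str) -> str:
    """Replace every non-ASCII character with \\uXXXX escapes.

    Characters outside the Basic Multilingual Plane are written as a
    UTF-16 surrogate pair, i.e. two \\uXXXX escapes.
    """
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
            continue
        # Encode to UTF-16 (big-endian, no BOM) and emit one escape
        # per 16-bit code unit.
        units = ch.encode("utf-16-be")
        for i in range(0, len(units), 2):
            out.append("\\u%04X" % int.from_bytes(units[i:i + 2], "big"))
    return "".join(out)


escaped = to_properties_escapes("Straße")
```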

> while the Commit E-Mails don't use UTF-8. If you are okay
> with this, then I don't mind changing the encoding for the SVN Commit
> E-Mails.

It doesn't bother me but I'm only one committer. I think this falls
under the category of: if someone cares enough about the commit
e-mails using UTF-8, then they need to work with infra to make that
happen. I'm happy with things as they are.

Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org
