On 25/09/2013 08:36, Konstantin Preißer wrote:
> Hi Mark,
>
> thanks for the reply.
>
>> -----Original Message-----
>> From: Mark Thomas [mailto:ma...@apache.org]
>> Sent: Wednesday, September 25, 2013 5:01 PM
>
>>> One way I can think of would be to XML-encode such characters
>>> ("ß" as "&#223;"). However, personally I would rather not do
>>> this, but write such characters directly ("ß"), so that the
>>> source is more readable (and encodings like UTF-8 guarantee that
>>> the characters are interpreted the same on each system,
>>> independently of the system language or geographic location).
>>
>> I don't like the idea of using XML encoding at all.
>
> Just to avoid a misunderstanding, with "XML encoding" you mean
> numeric character references like &#nnn; ?

Yes.

>>> Could it be possible to change the SVN Commit E-Mail system so
>>> that it may interpret diffs as UTF-8 instead of ISO-8859-1
>>> (assuming all files which contain bytes > 0x7F are encoded as
>>> UTF-8)? (Or, that it tries to decode it as UTF-8, and if it
>>> fails, decode it as ISO-8859-1 ?)
>>
>> This is a question for infra. If UTF-8 fails then ISO-8859-1 is
>> going to fail as well.
>
> I mean, to guess a character encoding by first decoding it as UTF-8,
> and if it fails, assume the file was encoded as
> ISO-8859-1/Windows-1252. This approach seems to be used by some
> programs to decide if the file was encoded as UTF-8 or as ANSI when
> it doesn't have BOM bytes.
>
> For example, consider a file that contains only ASCII characters
> (<= 0x7F) stored as a single byte per character. As UTF-8 is
> ASCII-compatible, you will get the same result whether you interpret
> it as UTF-8 or as ISO-8859-1.
>
> However, if you have a file that contains "äöü" (German umlaut
> characters) as ISO-8859-1 (Bytes: E4 F6 FC), then UTF-8 decoding will
> fail because the bytes after the one which starts with 11xxxxxx
> (binary) don't start with 10xxxxxx; but decoding as ISO-8859-1 will
> succeed.
>
> This approach to guess the encoding (UTF-8 vs.
> ISO-8859-1/Windows-1252) seems to be used by programs like Notepad++
> when opening text files without a BOM, and by TortoiseSVN when
> displaying file changes, and seems to work well if you have
> files with either UTF-8 or ISO-8859-1/Windows-1252 (or other local
> encodings). Of course, this will not always work, e.g. if your text
> file that is encoded with ISO-8859-1 actually contains text like
> "ÃŸ".
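
The UTF-8-first guess described above can be sketched in a few lines
of Python. This is only an illustration under stated assumptions: the
function name guess_decode is made up for this example, it is not what
Notepad++, TortoiseSVN or the ASF commit mailer actually run, and
ISO-8859-1 is used as the fallback because that decoder accepts every
byte value.

```python
from typing import Tuple


def guess_decode(data: bytes) -> Tuple[str, str]:
    """Guess the encoding of BOM-less bytes: try UTF-8, else ISO-8859-1.

    Hypothetical helper for illustration; returns (text, encoding_name).
    """
    try:
        # Strict decoding raises UnicodeDecodeError on malformed sequences,
        # e.g. a lead byte that is not followed by 10xxxxxx bytes.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # ISO-8859-1 maps each byte 0x00-0xFF to one character,
        # so this fallback cannot fail.
        return data.decode("iso-8859-1"), "iso-8859-1"


if __name__ == "__main__":
    samples = [
        b"plain ASCII",      # same result under both encodings
        b"\xe4\xf6\xfc",     # the umlauts stored as ISO-8859-1
        b"\xc3\xa4",         # "a umlaut" stored as UTF-8
        b"\xc3\x9f",         # ambiguous: also a valid UTF-8 sequence
    ]
    for raw in samples:
        text, encoding = guess_decode(raw)
        print(raw, "->", repr(text), "as", encoding)
```

The last sample is the failure case mentioned above: the bytes C3 9F
are reported as UTF-8 (and decoded to "ß") even if the author meant
two single-byte characters, so the guess is a heuristic, not a proof.
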
> (Personally, for my projects I use UTF-8 for everything :) )
>
>
> I was asking because I saw some i18n files like
> "LocalStrings_ja.properties" that encode non-ASCII characters with
> "\uXXXX", and I'd like to know if it is okay to put the "ß"
> character in the XML file without encoding it by a numeric character
> reference,

I'd say yes. Property files are a 'special' case:
http://stackoverflow.com/questions/4659929/how-to-use-utf-8-in-resource-properties-with-resourcebundle

> while the Commit E-Mails don't use UTF-8. If you are okay
> with this, then I don't mind changing the encoding for the SVN Commit
> E-Mails.

It doesn't bother me but I'm only one committer. I think this falls
under the category of: if someone cares enough about the commit
e-mails using UTF-8, then they need to work with infra to make that
happen. I'm happy with things as they are.

Mark

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org