-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On 2/11/20 2:37 AM, Martin Grigorov wrote:
> I guess you use Java 8. Newer versions of Java try UTF-8 first and
> then fallback to ISO-8859-1:
> https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/PropertyResourceBundle.html

Correct, I am using Java 8:

$ java -version
openjdk version "1.8.0_232"
OpenJDK Runtime Environment (build 1.8.0_232-8u232-b09-1~deb9u1-b09)
OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)

This is the version that Debian 9 provides. I could install a higher
patch version, but would it help?

On 2/11/20 6:38 AM, Mark Thomas wrote:
> On 10/02/2020 20:58, Christopher Schultz wrote:
>> All,
>>
>> I've recently begun making a change to my application's resource
>> bundles, converting them into UTF-8 for readability and
>> converting them to ISO-8859-1 during my build process to make
>> ResourceBundle happy.
>>
>> I have everything working, except that Eclipse still thinks that
>> my files ought to be ISO-8859-1 and ruins them when I load them.
>> Sometimes, it's very obvious and that's not a problem: a
>> developer will see that and fix it before continuing. But some
>> files are only *slightly* broken by this and someone might make a
>> mistake.
>
> I don't think we have seen this with Tomcat. Or have we (since we
> switched to UTF-8)?
>
> The thing that bugged me was having to manually switch properties
> files to UTF-8 to view them "properly". Your mail motivated me to
> track down where I can change that in Eclipse:
>
> Window->Preferences->General->Content Types
>
> and I have changed Java properties files to use UTF-8. So that is
> my personal niggle fixed. Thanks for the motivation.

Yes, this *will* fix things, but:

1. It's a global setting, so it can't be set on a per-project basis.
   That means you have to be willing to convert ALL your properties
   files across ALL your projects to UTF-8. That may be okay for some
   people, but not all.

2. This is a guess: Tomcat's ide-eclipse ant target can't set that
   setting for the Tomcat project(s) because it's a global setting.
   Therefore, anyone using Eclipse as an IDE will have to manually set
   their content-type in order to NOT damage any of the files we ship.

>> NOTE: We don't keep Eclipse settings in revision-control, so I
>> can't modify everyone's Eclipse configuration. We are using svn
>> and svn:mime-type is correctly set for these files; Eclipse just
>> ignores that.
>
> I've seen that too. While I found it rather annoying, it wasn't
> annoying enough to try and find a fix as that looked like it would
> require patching Eclipse and/or the svn plug-in.
>
>> Anyway, I found that adding a UTF-8 BOM to the beginning of the
>> file fixes that issue and Eclipse does the right thing.
>
> Ah. So Eclipse *is* doing content scanning. Interesting.

Well, it's not really *content* scanning. But a BOM is the official
way to tell the difference between a UTF-8 encoded file and one that
just happens to have a whole bunch of valid UTF-8 byte sequences
through (most of) the file.

>> As a sanity check, I looked at how Tomcat's files are laid out
>> and I don't see any BOMs.
>
> Correct. The only files in the code base that should have BOMs at
> the moment are the ones in the test web application (under
> bug49nnn) for testing the default Servlet's handling of files with
> BOMs.
>
>> Should we add BOMs? Is there any reason NOT to use a BOM? These
>> are file types that are officially supposed to be ISO-8859-1 but
>> everyone wants to handle them differently, so I think adding BOMs
>> might be a good idea so that editors are always informed of
>> exactly what's happening.
>>
>> WDYT?
>
> I was concerned that adding a BOM would cause problems when
> reading property files. I've seen reports of that with Java in the
> past. A quick test suggests that the issue is no longer present
> with latest Java 8.

I actually had another problem after I implemented all of this: any
property file without a blank and/or comment line at the top ended up
with a mangled and unusable *first* property key.

A file like this:

first.property=foo
second.property=bar

would end up like this after a trip through "native2ascii -encoding
UTF-8":

\ufefffirst.property=foo
second.property=bar

native2ascii stupidly interprets the UTF-8 BOM as an actual character,
and encodes it in the output. This appears to be a bug in (at least
old versions of) Java and/or native2ascii.

I've got local installations of Java 8, 11 (Adopt), 11 (Oracle), and
13 (OpenJDK), and only Java 8 has a "native2ascii" binary present. I
see ant's <native2ascii> task has its own implementation, but it's
probably very simple, just like the native2ascii program itself.
Java's Reader classes incorrectly interpret the BOM as an actual
character instead of an ignorable UTF-8 control sequence.

I can confirm that Java 13 still seems to have this problem: running
ant's <native2ascii> under Java 13 still corrupts the first line of
the file.

Ensuring that the first line of the file is a comment or a blank line
fixes things:

# BOM
first.property=foo
second.property=bar

becomes:

\ufeff# BOM
first.property=foo
second.property=bar

The first line is still mangled, but since it isn't a property
definition, none of the real keys are affected.

> With the use of POEditor and the import/export scripts we have, it
> would be unusual for someone to be editing any of the property
> files where UTF-8 vs ISO-8859-1 matters. Thinking about it a little
> more, there would be a need to do this to edit non-English strings
> in the older branches where the key doesn't exist in the latest
> code. That strikes me as a fairly rare use case.
>
> My other worry is that some editors will fail to handle the BOM
> correctly and we'll end up causing more issues than we solve. I've
> little basis for that worry other than (possibly out of date)
> experience.
>
> Overall, I guess I am -0 on adding BOMs.

Okay. This is a fairly recent change to Tomcat, and frankly, we (a)
don't get a huge number of outside contributions which include changes
to the localized properties files (except for the translation-only
contributions, which have been great!)
and (b) often ignore the non-English translations in the first place
because we are lazy.

I think maybe this can stay on the back-burner until we see if we end
up with any problems.

Does/can "checkstyle" check for valid UTF-8 byte sequences in
.properties files? I think that may be a helpful check to add if it's
not already in there.

- -chris
-----BEGIN PGP SIGNATURE-----
Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/

iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAl5CuasACgkQHPApP6U8
pFgKWBAAuQiF6fMD+LWDPkdiCWRIYPzPIPjSqHIOvn6iORC/RnJ2S2s8tsvu0K6E
IVypbd016lOP5Mn1hLGNU80eYPo3xNzz8GrZgjXImG+xeFcZ0VL+FGCkpsE6UrlT
LuxHi7Axq+sRhxf/iEuTxr/vS9sD5ggc5oc/TnVR1b1NETRX0M43uQFqoraOtHUE
mCW6KgzqteEu8ca00YH8k73eeCOhIUybFdTXBBaf5VgxT+uQhM0ogIUFkls0KbSE
sq+SCzIlb1ftSVI1Dp4ORRTH6sjaiBnboZLduJaBbyiqHCIBAwnyO++Qk3RBaWCS
4SoOfVF0LFGS5CRG/IZcKMhNctS/NzCa5ShsTFGhaDxqhn+CaaMq9jJlhNb7j1vG
La/+cSYSp9h63ZohMh5M2r9FbT3nP3q6Tt7N2X40ALGxpMReSf4zF/lV9feHT9wM
Yq4u6sPO7ACHfL+a4FST1jNPYeLJ4PfiSSv6LY663VZOg06JlVnT0P0SxWKvm7r8
Y38Guw0m75jWPhM1s0wNGYvQ8t2rCMvjpIIedptmuk9IGyfBux20ms9RGjiir1wB
BEdL/0opnJALG3qx1ver+vqfWMJbXpyUCnCPgVCPCtnprmSYrdpaif2hiGcIEqG+
Q5aS3KPvmXN722ORgSXpRn/5Lym2dznMH2alRLbo/Gz/z3g2k4w=
=T4mh
-----END PGP SIGNATURE-----
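
For what it's worth, both the BOM problem and the "valid UTF-8" check
discussed in this message can be sketched in plain Java. This is only
an illustration under my own assumptions, not what native2ascii, ant,
or checkstyle actually do: the class name BomSafeProperties and its
three methods are hypothetical. It decodes the bytes strictly as UTF-8
(so a stray ISO-8859-1 byte is reported instead of silently replaced)
and drops a leading U+FEFF before the text reaches Properties.load().

```java
import java.io.IOException;
import java.io.StringReader;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Properties;

// Hypothetical helper, not part of Tomcat, ant, or the JDK.
public class BomSafeProperties {

    /** Drops a single leading U+FEFF, which is how a decoded UTF-8 BOM shows up. */
    public static String stripBom(String text) {
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            return text.substring(1);
        }
        return text;
    }

    /** Decodes strictly as UTF-8: malformed input throws instead of being replaced. */
    public static String decodeStrictUtf8(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    /** Loads properties from UTF-8 bytes, tolerating an optional BOM. */
    public static Properties load(byte[] utf8Bytes) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(stripBom(decodeStrictUtf8(utf8Bytes))));
        return props;
    }
}
```

With something like this in front of the conversion step, the first key
should survive even when there is no comment or blank line at the top
of the file, and a file that was accidentally saved as ISO-8859-1
fails loudly rather than being quietly mangled.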