Re: UTF-8 properties files and BOMs

Christopher Schultz Tue, 11 Feb 2020 06:27:19 -0800

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On 2/11/20 2:37 AM, Martin Grigorov wrote:
> I guess you use Java 8. Newer versions of Java try UTF-8 first and
> then fall back to ISO-8859-1:
> https://docs.oracle.com/en/java/javase/11/docs/api/java.base/java/util/PropertyResourceBundle.html

Correct, I am using Java 8:

$ java -version
openjdk version "1.8.0_232"
OpenJDK Runtime Environment (build 1.8.0_232-8u232-b09-1~deb9u1-b09)
OpenJDK 64-Bit Server VM (build 25.232-b09, mixed mode)

This is the version that Debian 9 provides. I could install a higher
patch version, but would it help?
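
As far as I can tell, the UTF-8-first behaviour is a Java 9+ change,
so a newer 8u release presumably would not help. On Java 8 the
stream-based PropertyResourceBundle constructor decodes ISO-8859-1,
but the Reader-based constructor (there since Java 6) lets the caller
pick the encoding. A minimal sketch, with a made-up class name and
made-up sample data:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.PropertyResourceBundle;

public class Utf8BundleSketch {
    // Decode the bytes as UTF-8 explicitly instead of relying on the
    // JVM's default encoding for .properties input.
    public static PropertyResourceBundle loadUtf8(byte[] data) {
        try (Reader reader = new InputStreamReader(
                new ByteArrayInputStream(data), StandardCharsets.UTF_8)) {
            return new PropertyResourceBundle(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical sample: one key whose value contains non-ASCII text.
        byte[] data = "greeting=Gr\u00fc\u00dfe\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(loadUtf8(data).getString("greeting"));
    }
}
```

This only helps code that builds its own bundles. Anything going
through ResourceBundle.getBundle() on Java 8 still gets ISO-8859-1
unless a custom ResourceBundle.Control is supplied.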

On 2/11/20 6:38 AM, Mark Thomas wrote:
> On 10/02/2020 20:58, Christopher Schultz wrote:
>> All,
>> 
>> I've recently begun making a change to my application's resource 
>> bundles, converting them into UTF-8 for readability and
>> converting them to ISO-8859-1 during my build process to make
>> ResourceBundle happy.
>> 
>> I have everything working, except that Eclipse still thinks that
>> my files ought to be ISO-8859-1 and ruins them when I load them. 
>> Sometimes, it's very obvious and that's not a problem: a
>> developer will see that and fix it before continuing. But some
>> files are only *slightly* broken by this and someone might make a
>> mistake.
> 
> I don't think we have seen this with Tomcat. Or have we (since we 
> switched to UTF-8)?
> 
> The thing that bugged me was having to manually switch properties
> files to UTF-8 to view them "properly". Your mail motivated me to
> track down where I can change that in Eclipse:
> 
> Window->Preferences->General->Content Types
> 
> and I have changed Java properties files to use UTF-8. So that is
> my personal niggle fixed. Thanks for the motivation.

Yes, this *will* fix things, but:

1. It's a global setting, so it can't be set on a per-project basis.
That means you have to be willing to convert ALL your properties files
across ALL your projects to UTF-8. That may be okay for some people,
but not all.

2. This is a guess: Tomcat's ide-eclipse ant target can't set that
setting for the Tomcat project(s) because it's a global setting.
Therefore, anyone using Eclipse as an IDE will have to manually set
their content-type in order to NOT damage any of the files we ship.

>> NOTE: We don't keep Eclipse settings in revision-control, so I
>> can't modify everyone's Eclipse configuration. We are using svn
>> and svn:mime-type is correctly set for these files; Eclipse just
>> ignores that.
> 
> I've seen that too. While I found it rather annoying, it wasn't
> annoying enough to try and find a fix as that looked like it would
> require patching Eclipse and/or the svn plug-in.
> 
>> Anyway, I found that adding a UTF-8 BOM to the beginning of the
>> file fixes that issue and Eclipse does the right thing.
> 
> Ah. So Eclipse *is* doing content scanning. Interesting.

Well, it's not really *content* scanning. But a BOM is the explicit
way to tell the difference between a UTF-8 encoded file and one that
just happens to have a whole bunch of valid UTF-8 byte sequences
through (most of) the file.
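
To be concrete about what "has a BOM" means: the file starts with the
three bytes EF BB BF, which is U+FEFF encoded as UTF-8. I have not
read Eclipse's source, so this is not a claim about how Eclipse does
it, just a sketch of the check with a made-up class name:

```java
import java.nio.charset.StandardCharsets;

public class BomSniffSketch {
    // U+FEFF encoded as UTF-8 is the byte sequence EF BB BF.
    public static boolean startsWithUtf8Bom(byte[] data) {
        return data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF;
    }

    public static void main(String[] args) {
        byte[] withBom = "\uFEFFfirst.property=foo\n".getBytes(StandardCharsets.UTF_8);
        byte[] without = "first.property=foo\n".getBytes(StandardCharsets.UTF_8);
        System.out.println(startsWithUtf8Bom(withBom));
        System.out.println(startsWithUtf8Bom(without));
    }
}
```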

>> As a sanity check, I looked at how Tomcat's files are laid out
>> and I don't see any BOMs.
> 
> Correct. The only files in the code base that should have BOMs at
> the moment are the ones in the test web application (under
> bug49nnn) for testing the default Servlet's handling of files with
> BOMs.
> 
>> Should we add BOMs? Is there any reason NOT to use a BOM? These
>> are file types that are officially supposed to be ISO-8859-1 but
>> everyone wants to handle them differently, so I think adding BOMs
>> might be a good idea so that editors are always informed of
>> exactly what's happening.
>> 
>> WDYT?
> 
> I was concerned that adding a BOM would cause problems when
> reading property files. I've seen reports of that with Java in the
> past. A quick test suggests that the issue is no longer present
> with latest Java 8.

I actually had another problem after I implemented all of this: any
property file without a blank and/or comment line at the top ended up
with a mangled and unusable *first* property key. A file like this:

first.property=foo
second.property=bar

would end up like this after a trip through "native2ascii -encoding
UTF-8":

\ufefffirst.property=foo
second.property=bar

native2ascii stupidly interprets the UTF-8 BOM as an actual character,
and encodes it in the output.

This appears to be a bug in (at least old versions of) Java and/or
native2ascii. I've got local installations of Java 8, 11 (Adopt), 11
(Oracle), and 13 (OpenJDK), and only Java 8 has a "native2ascii"
binary present. I see ant's <native2ascii> task has its own
implementation, but it's probably very simple, just like the
native2ascii program itself. Java's Reader classes incorrectly
interpret the BOM as an actual character instead of an ignorable UTF-8
control sequence.

I can confirm that Java 13 still seems to have this problem: running
ant's <native2ascii> under Java 13 still corrupts the first line of
the file.
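
If it comes to that, the conversion is simple enough to do by hand
in the build. Below is a rough sketch of a BOM-aware replacement: it
decodes UTF-8, drops a leading U+FEFF if there is one, and escapes
everything outside ASCII. This is not the actual native2ascii or ant
implementation, and the class and method names are made up:

```java
import java.nio.charset.StandardCharsets;

public class BomAwareEscapeSketch {
    // Convert UTF-8 bytes to ASCII-only text, escaping non-ASCII
    // characters as backslash-u sequences and dropping a leading BOM.
    public static String toAscii(byte[] utf8) {
        String text = new String(utf8, StandardCharsets.UTF_8);
        // Java's UTF-8 decoder passes U+FEFF through as a character,
        // so it has to be removed explicitly.
        if (!text.isEmpty() && text.charAt(0) == '\uFEFF') {
            text = text.substring(1);
        }
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        byte[] input = "\uFEFFfirst.property=foo\nsecond.property=bar\n"
                .getBytes(StandardCharsets.UTF_8);
        System.out.print(toAscii(input));
    }
}
```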

Ensuring that the first line of the file is a comment or a blank line
fixes things:

# BOM
first.property=foo
second.property=bar

becomes:

\ufeff# BOM
first.property=foo
second.property=bar
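
Note that the escaped BOM line is not actually treated as a comment:
it starts with a backslash, not '#', so Properties.load() reads it as
a junk key. The point is only that the junk lands on a throwaway line
instead of on a real key. A small sketch showing both cases (the class
name is made up; the file contents are the ones above):

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

public class EscapedBomKeySketch {
    public static Properties parse(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Backslashes are doubled so the text contains the literal
        // six-character escape, as native2ascii wrote it to the file.
        Properties broken = parse(
                "\\ufefffirst.property=foo\nsecond.property=bar\n");
        Properties guarded = parse(
                "\\ufeff# BOM\nfirst.property=foo\nsecond.property=bar\n");

        // The first key of the unguarded file carries the BOM prefix,
        // so a lookup under the intended name finds nothing.
        System.out.println(broken.getProperty("first.property"));
        // With the sacrificial first line, the real key survives.
        System.out.println(guarded.getProperty("first.property"));
        // The sacrificial line shows up as an extra, harmless key.
        System.out.println(guarded.size());
    }
}
```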

> With the use of POEditor and the import/export scripts we have, it
> would be unusual for someone to be editing any of the property
> files where UTF-8 vs ISO-8859-1 matters. Thinking about it a little
> more, there would be a need to do this to edit non-English strings
> in the older branches where the key doesn't exist in the latest
> code. That strikes me as a fairly rare use case.
> 
> My other worry is that some editors will fail to handle the BOM 
> correctly and we'll end up causing more issues than we solve. I've 
> little basis for that worry other than (possibly out of date)
> experience.
> 
> Overall, I guess I am -0 on adding BOMs.

Okay. This is a fairly recent change to Tomcat, and frankly, we (a)
don't get a huge number of outside contributions which include changes
to the localized properties files (except for the translation-only
contributions, which have been great!) and (b) often ignore the
non-English translations in the first place because we are lazy.

I think maybe this can stay on the back-burner until we see if we end
up with any problems.

Does/can "checkstyle" check for valid UTF-8 byte sequences in
.properties files? I think that may be a helpful check to add if it's
not already in there.
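
I don't know whether checkstyle has such a check. If it doesn't, the
test itself is small in plain Java: run the bytes through a strict
UTF-8 decoder and see whether it complains. A sketch, with a made-up
class name, that could be wired into a build step or unit test:

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class Utf8ValidationSketch {
    // Returns true only if the whole input decodes as UTF-8 without
    // any malformed or unmappable sequences.
    public static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        byte[] utf8 = "key=caf\u00e9\n".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "key=caf\u00e9\n".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(isValidUtf8(utf8));
        System.out.println(isValidUtf8(latin1));
    }
}
```

A pure-ASCII file passes this check too, of course, so it only catches
files that were saved as ISO-8859-1 with non-ASCII characters in them.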

- -chris
-----BEGIN PGP SIGNATURE-----
Comment: Using GnuPG with Thunderbird - https://www.enigmail.net/

iQIzBAEBCAAdFiEEMmKgYcQvxMe7tcJcHPApP6U8pFgFAl5CuasACgkQHPApP6U8
pFgKWBAAuQiF6fMD+LWDPkdiCWRIYPzPIPjSqHIOvn6iORC/RnJ2S2s8tsvu0K6E
IVypbd016lOP5Mn1hLGNU80eYPo3xNzz8GrZgjXImG+xeFcZ0VL+FGCkpsE6UrlT
LuxHi7Axq+sRhxf/iEuTxr/vS9sD5ggc5oc/TnVR1b1NETRX0M43uQFqoraOtHUE
mCW6KgzqteEu8ca00YH8k73eeCOhIUybFdTXBBaf5VgxT+uQhM0ogIUFkls0KbSE
sq+SCzIlb1ftSVI1Dp4ORRTH6sjaiBnboZLduJaBbyiqHCIBAwnyO++Qk3RBaWCS
4SoOfVF0LFGS5CRG/IZcKMhNctS/NzCa5ShsTFGhaDxqhn+CaaMq9jJlhNb7j1vG
La/+cSYSp9h63ZohMh5M2r9FbT3nP3q6Tt7N2X40ALGxpMReSf4zF/lV9feHT9wM
Yq4u6sPO7ACHfL+a4FST1jNPYeLJ4PfiSSv6LY663VZOg06JlVnT0P0SxWKvm7r8
Y38Guw0m75jWPhM1s0wNGYvQ8t2rCMvjpIIedptmuk9IGyfBux20ms9RGjiir1wB
BEdL/0opnJALG3qx1ver+vqfWMJbXpyUCnCPgVCPCtnprmSYrdpaif2hiGcIEqG+
Q5aS3KPvmXN722ORgSXpRn/5Lym2dznMH2alRLbo/Gz/z3g2k4w=
=T4mh
-----END PGP SIGNATURE-----
