[Bug 63235] Defer potentially expensive Charset.availableCharsets() call

bugzilla Wed, 06 Mar 2019 11:05:16 -0800

https://bz.apache.org/bugzilla/show_bug.cgi?id=63235


--- Comment #5 from Phillip Webb <pw...@pivotal.io> ---
@Mark Thanks for pointing me at those previous issues, I should have thought to
search first!

It seems like I may have misunderstood the reason for the cache existing in the
first place. My original assumption was that it was necessary for
case-insensitive name lookups. It appears, on OpenJDK at least, calling
`Charset.forName` works with any case combination works. The following will run
without error:

        Set<String> names = new LinkedHashSet<>();
        Charset.availableCharsets().forEach((key, value) -> {
                names.add(value.name());
                names.addAll(value.aliases());
        });
        names.forEach(name -> {
                System.out.println(Charset.forName(name.toLowerCase()));
        });

So the main reasons for the cache, as I understand it, are:

1) Improve lookup performance for initial requests by having a hot cache
2) Prevent DoS attacks by limiting the impact of a user requesting an invalid
charset

Looking at each of these in turn:


1) Improve lookup performance for initial requests by having a hot cache

> populate the cache values at start-up for 'popular' charsets (ISO-8859-1, 
> UTF-8 plus any others we think are likely to be used)

I like this suggestion and I think this would be a good match for most users.
Popular charsets would be fast, and initial requests to others would take a
slight hit. The `Charset` class itself includes a cache, so this could be as
simple as calling `Charset.forName` a few times.


2) Prevent DoS attacks by limiting the impact of a user requesting an invalid
charset

Looking at the existing JDK `Charset` and `AbstractCharsetProvider`[1] class,
it appears that asking for charsets that don't exist does not have an impact on
memory. When I run the following code, I don't see a large change:

        for (int i = 0; i < 10000000; i++) {
                try {
                        Charset.forName("missing-" + i);
                }
                catch (Exception ex) {
                }
                if (i % 10000 == 0) {
                        System.out.println(Runtime.getRuntime().totalMemory());
                        System.out.println(Runtime.getRuntime().maxMemory());
                        System.out.println(Runtime.getRuntime().freeMemory());
                        System.out.println("--");
                }
        }

Of course, this might be JVM specific so it may well be worth still protecting
against that. One simple option might be to count lookup misses and after a
certain threshold switch back to a set limited by availableCharsets. Something
like:


    public static Charset getCharset(String enc) throws
UnsupportedEncodingException {
        Map<String, Charset> cache = encodingToCharsetCache;
        if (cache != null) {
            Charset charset = cache.get(enc.toLowerCase());
            if (charset == null) {
                throw new UnsupportedEncodingException(
                        sm.getString("b2cConverter.unknownEncoding", enc));
            }
        }
        try {
            return Charset.forName(enc);
        } catch (Exception ex) {
            int count = charsetMissCount.incrementAndGet();
            if (count > THRESHOLD) {
                encodingToCharsetCache = populateEncodingCache();
            }
            throw ex;
        }
    }


I'm happy to rework the PR if that's a sensible suggestion? Even if it were
opt-int, I think it would very nice to not have to pay the startup cost that
the existing code incurs, especially if you use an embedded Tomcat and restart
frequently during application development.

[1]
http://cr.openjdk.java.net/~sherman/6898310/webrev/src/share/classes/sun/nio/cs/AbstractCharsetProvider.java.html

-- 
You are receiving this mail because:
You are the assignee for the bug.
---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@tomcat.apache.org
For additional commands, e-mail: dev-h...@tomcat.apache.org

[Bug 63235] Defer potentially expensive Charset.availableCharsets() call

Reply via email to