Hi all, I am writing this up as this is a change I'd like to make in Tomcat 10 that I think is important to get right. It may also get back-ported.
This first arose in this mod_jk bug:
https://bz.apache.org/bugzilla/show_bug.cgi?id=62459

Ignoring the mod_jk aspects for now (they will come later), the bug report raises the important question of how to handle the case where the ID for a resource in a RESTful API includes a "/". At the moment, Tomcat does not handle this correctly. If ALLOW_ENCODED_SLASH is false, the request is rejected. If it is true, the %2F is decoded to "/" and the wrong resource identifier will be used. This is an edge case, but one I'd like to fix.

My research led me back to RFC 3986. Quoting from section 2.2:

<quote>
The purpose of reserved characters is to provide a set of delimiting characters that are distinguishable from other data within a URI. URIs that differ in the replacement of a reserved character with its corresponding percent-encoded octet are not equivalent. Percent-encoding a reserved character, or decoding a percent-encoded octet that corresponds to a reserved character, will change how the URI is interpreted by most applications. Thus, characters in the reserved set are protected from normalization and are therefore safe to be used by scheme-specific and producer-specific algorithms for delimiting data subcomponents within a URI.
</quote>

My reading of this is that there are some %nn sequences that we should *never* decode. The values we pass to applications for ServletPath, PathInfo etc. should still include these %nn sequences and the application should decode them.

My next thought was "Which %nn sequences should we leave alone?". That got me thinking about URIEncoding values and how to differentiate between a %nn sequence we wanted to leave alone and the same sequence appearing where a code point is represented by multiple bytes. Fortunately, RFC 7230 saves us from that complication as it requires all encodings to be supersets of US-ASCII.
Or to put it another way, whenever %nn appears with nn in the range 00 to 7F, that %nn sequence will *always* represent the equivalent US-ASCII code point.

So, that simplifies things a little as we go back to considering which %nn sequences we have to leave alone.

The starting point is "reserved" characters. From RFC 3986:

  reserved    = gen-delims / sub-delims
  gen-delims  = ":" / "/" / "?" / "#" / "[" / "]" / "@"
  sub-delims  = "!" / "$" / "&" / "'" / "(" / ")"
              / "*" / "+" / "," / ";" / "="

The URI we are talking about in Tomcat, at the point we %nn decode, is just the path. The path parameters and query string have been removed.

From RFC 7230:

  absolute-path = 1*( "/" segment )

and from RFC 3986:

  segment = *pchar
  pchar   = unreserved / pct-encoded / sub-delims / ":" / "@"

So the question is, which reserved characters cannot be safely decoded from their %nn form?

We know all sub-delims are safe to decode because:
- they are valid characters in a segment and, with the query string and path parameters removed, none of those characters have special meaning

That leaves gen-delims. Of those, ":" and "@" are explicitly allowed in a segment. So that leaves:

  "/" "?" "#" "[" "]"

"?" is the query delimiter but the query string has been removed so it is safe to %nn decode to "?".

"#" is the fragment delimiter. The fragment will never reach the server so it is safe to %nn decode to "#".

"[" and "]" are delimiters in the host but not the path so they are safe.

That just leaves "/".

My proposal is, therefore, actually very simple:

1. Remove the UDecoder.ALLOW_ENCODED_SLASH option.
2. Replace it with a per Connector setting that has three options:
   a) deny (equivalent to ALLOW_ENCODED_SLASH="false")
   b) decode (equivalent to ALLOW_ENCODED_SLASH="true")
   c) allow (leaves the %2F sequence as is)

Thoughts?

Mark
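
To make the three options concrete, here is a rough sketch of how they might behave when decoding a path. This is only an illustration and not Tomcat code: the class name, the enum EncodedSlashHandling and the method decodePath are all made up for this example, and it assumes the raw path is ASCII with a URIEncoding of UTF-8.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

public class EncodedSlashSketch {

    /** Hypothetical names for the three proposed per-Connector options. */
    public enum EncodedSlashHandling { DENY, DECODE, ALLOW }

    /**
     * Percent-decodes a path (path parameters and query string already
     * removed). Every %nn sequence is decoded except %2F, which is
     * handled according to the given mode.
     */
    public static String decodePath(String path, EncodedSlashHandling mode) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < path.length()) {
            char c = path.charAt(i);
            if (c != '%') {
                // Assumption for this sketch: the raw path is pure ASCII.
                if (c > 0x7F) {
                    throw new IllegalArgumentException("Non-ASCII character in path");
                }
                out.write(c);
                i++;
                continue;
            }
            if (i + 2 >= path.length()) {
                throw new IllegalArgumentException("Truncated %nn sequence");
            }
            int hi = Character.digit(path.charAt(i + 1), 16);
            int lo = Character.digit(path.charAt(i + 2), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("Invalid %nn sequence");
            }
            int b = (hi << 4) | lo;
            if (b == '/') {
                switch (mode) {
                    case DENY:
                        // Reject the request, as ALLOW_ENCODED_SLASH="false" does.
                        throw new IllegalArgumentException("Encoded slash not allowed");
                    case DECODE:
                        // Decode to "/", as ALLOW_ENCODED_SLASH="true" does.
                        out.write('/');
                        break;
                    case ALLOW:
                        // Pass the original three characters through untouched.
                        out.write('%');
                        out.write(path.charAt(i + 1));
                        out.write(path.charAt(i + 2));
                        break;
                }
            } else {
                out.write(b);
            }
            i += 3;
        }
        // Bytes >= 0x80 are interpreted using the assumed URIEncoding (UTF-8).
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

With "allow", a request for /api/items/a%2Fb keeps the %2F, so the application can split the path on "/" first and only then decode the last segment to the ID "a/b". With "decode", the same request becomes /api/items/a/b and the ID can no longer be told apart from two separate segments.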