2013/3/2 Mark Thomas <ma...@apache.org>:
> Due to bug 54602 [1] I have been writing some test cases to examine how we
> handle invalid byte sequences in URIs.
>
> My expectation was:
> - valid byte sequence for expected encoding -> 200 (assuming no other
>   problems)
> - invalid byte sequence for expected encoding -> 400
> - partial byte sequence for expected encoding -> 400
>
> However, that isn't what happens and I currently believe that this should
> happen. The purpose of this e-mail is, therefore, to get agreement on what
> should happen.
>
> There are multiple moving parts here so forgive me if this e-mail gets a
> little long. There are multiple decisions and I expect some to be less
> contentious than others.
>
> These issues were observed with UTF-8. Other encodings may have similar
> issues. My aim is to get a consistent approach regardless of encoding.
>
> Issue 1: URI ends with partial byte sequence
> Currently the partial byte sequences are ignored. I think the B2CConverter
> should throw an Exception if the full input (i.e. when endOfInput == true)
> ends with a partial byte sequence.
>
> Issue 2: URI ends with invalid byte sequence
> This appears to be a bug in the UTF-8 decoder provided by the JVM. [1] has
> provided one set of input bytes that triggers this. Currently the invalid
> data is ignored. I think that B2CConverter should throw an Exception as soon
> as it can determine that input is invalid. This would require:
> - switching to the Harmony based UTF-8 decoder used by WebSocket
> - further testing of the JRE and Harmony UTF-8 decoders to check for other
>   potential issues
>
> Issue 3: Fall back to 'ASCII'
> If the conversion fails (i.e. throws an exception for any reason) [2], the
> CoyoteAdapter attempts to decode the provided URI using 'ASCII' rather than
> the configured connector encoding. I say 'ASCII' because the comments say
> ASCII but it is actually ISO-8859-1.
> I don't believe it appropriate to fall back to anything here. The fall back
> code has been present since conversion support was added but I can't think
> of any scenario where this stands any chance of working reliably. I would
> like to remove this fall back code.
>
> I would like to make these changes in trunk and 7.0.x.
>
> I expect to have a similar discussion about request bodies once URIs are
> resolved where I have essentially the same view - a decoding error should
> lead to a request failure.
>
> Thoughts?
>
> Mark
>
>
> [1] https://issues.apache.org/bugzilla/show_bug.cgi?id=54602
> [2]
> http://svn.apache.org/viewvc/tomcat/trunk/java/org/apache/catalina/connector/CoyoteAdapter.java?view=annotate
> (line ~1054)

1. I have not tested, but as the proposed rejection happens in
CoyoteAdapter, I think it will happen too early for ErrorReportValve to
work. As such, a user will receive a response that consists of the HTTP
status line only, which browsers display as a blank page.

I suspect that such errors can be triggered easily by a human user, e.g.
by mistyping a URL. I would not like to respond to those with a blank
page.

2. In many cases it would make sense to know some correct part of the URL
in order to choose a web application, and so to handle the error in a
webapp-specific manner.

3. I remember discussions on the mailing lists regarding whether it is
possible to make uri-encoding a webapp-specific setting. Looking in
Bugzilla,

https://issues.apache.org/bugzilla/show_bug.cgi?id=50504
https://issues.apache.org/bugzilla/show_bug.cgi?id=48899

At least it means that there are people who are interested in smarter
handling of URLs that use the wrong encoding.

4. As such I think there are two ways to handle wrong and incomplete
characters:

a) replace them with a substitute character
b) throw an exception and fall back to ISO-8859-1 for the entire URL.

Going with a) will provide slightly better handling if the webapp name is
non-ASCII, as the webapp can be selected more reliably.
Going with b) should be easier to implement.

If an application has an ASCII name, either a) or b) works.

If the request is going to result in some error page, I think there is
not much difference between a) and b).

If someone is going to process such broken URLs more smartly, I think the
questions are
(1) whether one is able to detect broken URLs
(2) whether one is able to recover the original URL as submitted by the
client

Using ISO-8859-1 might make recovery easier, but I think regardless of
that
- one can use request.getRequestURI(), as that one is not affected by
  encodings.
- in a Valve one can access the bytes in the byte chunk and do custom
  decoding?
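
To make the difference between a) and b) in point 4 concrete, here is a minimal sketch using the JRE's java.nio.charset API. It is not Tomcat's B2CConverter or CoyoteAdapter code: the class and method names (UriDecodeSketch, decodeReplacing, decodeWithFallback) are hypothetical, it assumes the URI bytes are already percent-decoded, and it assumes Java 7+ for StandardCharsets. It also relies on the JRE UTF-8 decoder, which bug 54602 suggests may not flag every invalid input.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only; not part of Tomcat.
final class UriDecodeSketch {

    // Option a): wrong or incomplete sequences become the decoder's
    // replacement character (U+FFFD for UTF-8); valid parts survive.
    static String decodeReplacing(byte[] raw) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPLACE)
                    .onUnmappableCharacter(CodingErrorAction.REPLACE)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // Not expected when both actions are REPLACE.
            throw new IllegalStateException(e);
        }
    }

    // Option b): decode strictly; if anything is wrong, decode the
    // entire input as ISO-8859-1 instead.
    static String decodeWithFallback(byte[] raw) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(raw))
                    .toString();
        } catch (CharacterCodingException e) {
            // ISO-8859-1 maps every byte to a char, so this cannot fail
            // and the original bytes can be recovered from the result.
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }
}
```

With a) the valid prefix of the path is still decoded as UTF-8, which is what helps when the webapp name is non-ASCII. With b) the whole string changes encoding, but every original byte remains recoverable from the result.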

Best regards,
Konstantin Kolinko