On Jan 20, 2014, at 1:57 PM, Mark Thomas <ma...@apache.org> wrote: > Signed PGP part > On 20/01/2014 21:38, Jeremy Boynes wrote: > > I started to look at using HttpParser for the Cookie header but > > there are some differences in the way it works compared to the > > existing parser in Cookies that I wanted to check direction before > > getting too far in. > > > > The area I’m concerned about is the need to copy the bytes in > > order to parse the header. The Cookies parser relies heavily on > > MessageBytes and avoids copying to a String as far as possible. > > HttpParser, however, operates on a StringReader which requires > > converting to a String before parsing. > > > > After digging into the usage of Cookies I think there are only two > > places that read them: 1) Request#getCookies(), which needs to > > copy to Strings anyway in order to create the Cookie instances it > > returns 2) CoyoteAdapter#parseSessionCookiesId(), which parses the > > header and compares names as MessageBytes, only allocating a String > > for the value if the session cookie is found > > > > It’s this second one that has me concerned about switching to > > HttpParser as this gets called for every request. If we switch > > then there is going to be allocation and copying of the header that > > we currently don’t do. > > I share your concern. Worst case, we'll need to do a specific > MessageBytes based parse just for the session cookie. Assuming that > the session cookie name and value will remain US-ASCII (see no reason > why this should not be the case), we could get away with this as long > as we are mindful that there might be some UTF-8 we need to skip over.
I started implementing this and a patch can be found here: http://people.apache.org/~jboynes/patches/0013-Start-of-a-more-lenient-RFC6265-cookie-parser.patch The best illustration of how it is parsing is probably the test cases. The main difference is that it does not apply RFC2109 rules unless $Version=1 is specified and because of that I did not end up using as much of HttpParser as I’d hoped. I believe the implementation handles correctly formatted headers according to the rules from Netscape, RFC2109 and RFC6265. It is lenient with respect to malformed input, breaking the header down into pairs separated by semi-colons (allowing for RFC2109 quoting/escaping where applicable) and attempting to accept each pair in isolation. If an individual pair is invalid (e.g. contains disallowed characters), it defaults to accepting that rather than dropping the cookie. The only cookies that actually get dropped are ones that do not specify a name or ones where Cookie’s constructor throws an IAE on the name. What this means is that an application will see every cookie Cookie will allow, making it symmetrical, but it will not see cookies whose names the Servlet spec configuration (i.e. STRICT_NAMING) would disallow. I kept the de-quoting behavior with one difference: for V0 cookies, RFC6265 quoting rules are applied which means any escapes are not undone. IOW for V0 «"a\bc"» will be returned as «a\bc» whereas for V1 it would be returned as «abc». I’m not convinced this is useful and it may be more predictable for users if we simply returned the actual value, quotes and all, by default with a config option to enable unquoting for V1 values only (perhaps the same option as would enable alternative G1a). Feedback welcome, ideally in the form of a test case showing what should be returned for a header text input. Cheers Jeremy
signature.asc
Description: Message signed with OpenPGP using GPGMail