On Jan 20, 2014, at 1:57 PM, Mark Thomas <ma...@apache.org> wrote:

> Signed PGP part
> On 20/01/2014 21:38, Jeremy Boynes wrote:
> > I started to look at using HttpParser for the Cookie header but
> > there are some differences in the way it works compared to the
> > existing parser in Cookies that I wanted to check direction before
> > getting too far in.
> >
> > The area I’m concerned about is the need to copy the bytes in
> > order to parse the header. The Cookies parser relies heavily on
> > MessageBytes and avoids copying to a String as far as possible.
> > HttpParser, however, operates on a StringReader which requires
> > converting to a String before parsing.
> >
> > After digging into the usage of Cookies I think there are only two
> > places that read them: 1) Request#getCookies(), which needs to
> > copy to Strings anyway in order to create the Cookie instances it
> > returns 2) CoyoteAdapter#parseSessionCookiesId(), which parses the
> > header and compares names as MessageBytes, only allocating a String
> > for the value if the session cookie is found
> >
> > It’s this second one that has me concerned about switching to
> > HttpParser as this gets called for every request. If we switch
> > then there is going to be allocation and copying of the header that
> > we currently don’t do.
> 
> I share your concern. Worst case, we'll need to do a specific
> MessageBytes based parse just for the session cookie. Assuming that
> the session cookie name and value will remain US-ASCII (see no reason
> why this should not be the case), we could get away with this as long
> as we are mindful that there might be some UTF-8 we need to skip over.

I started implementing this and a patch can be found here:
  
http://people.apache.org/~jboynes/patches/0013-Start-of-a-more-lenient-RFC6265-cookie-parser.patch

The best illustration of how it is parsing is probably the test cases. The main 
difference is that it does not apply RFC2109 rules unless $Version=1 is 
specified and because of that I did not end up using as much of HttpParser as 
I’d hoped.

I believe the implementation handles correctly formatted headers according to 
the rules from Netscape, RFC2109 and RFC6265. 

It is lenient with respect to malformed input, breaking the header down into 
pairs separated by semi-colons (allowing for RFC2109 quoting/escaping where 
applicable) and attempting to accept each pair in isolation. If an individual 
pair is invalid (e.g. contains disallowed characters), it defaults to accepting 
that rather than dropping the cookie. The only cookies that actually get 
dropped are ones that do not specify a name or ones where Cookie’s constructor 
throws an IAE on the name. What this means is that an application will see 
every cookie Cookie will allow, making it symmetrical, but it will not see 
cookies whose names the Servlet spec configuration (i.e. STRICT_NAMING) would 
disallow.

I kept the de-quoting behavior with one difference: for V0 cookies, RFC6265 
quoting rules are applied which means any escapes are not undone. IOW for V0 
«"a\bc"» will be returned as «a\bc» whereas for V1 it would be returned as 
«abc». I’m not convinced this is useful and it may be more predictable for 
users if we simply returned the actual value, quotes and all, by default with a 
config option to enable unquoting for V1 values only (perhaps the same option 
as would enable alternative G1a).

Feedback welcome, ideally in the form of a test case showing what should be 
returned for a header text input.

Cheers
Jeremy

Attachment: signature.asc
Description: Message signed with OpenPGP using GPGMail

Reply via email to