I use HTTP::Request::Common to build an application/x-www-form-urlencoded
POST from a passed-in hash. The hash contains strings as values.
my $req = POST '/foo', \%parameters;
Internally this uses URI's query_form() to build the url-encoded body and sets
the Content-Type to application/x-www-form-urlencoded (without any charset parameter).
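You can see that by dumping a simple request; the output looks roughly like this:

use HTTP::Request::Common qw(POST);

my $req = POST '/foo', { ascii => 'Hello' };
print $req->as_string;

# prints something like:
#   POST /foo
#   Content-Length: 11
#   Content-Type: application/x-www-form-urlencoded
#
#   ascii=Hello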
The problem I have is that the %parameters hash contains valid Perl
character strings, but the resulting url-encoded request differs depending
on the mix of Perl strings -- by mix I mean strings with and without Perl's
internal utf8 flag. The resulting url-encoded body ends up containing either
latin1- or utf8-encoded octets.
It's not easy to know what charset to add to the request, and likewise,
things break if the server handling the request assumes it's a utf8
url-encoded request.
Perhaps some code might help:
Consider these three very normal, valid, *character* strings in Perl:
my $ascii = 'Hello';
my $latin1 = 'Ue: ' . chr(220);
my $unicode = "Happy \x{263A}";
What you would expect is that only $unicode would have Perl's utf8 flag
set. And indeed that is true:
print_var($_) for $ascii, $latin1, $unicode;
str [Hello] with flag: NO
str [Ue: Ü] with flag: NO
str [Happy ☺] with flag: YES
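(print_var is just a small helper -- the real one is in the test script linked
at the end, but it amounts to roughly this, with STDOUT given a UTF-8 layer so
the characters print cleanly:)

binmode STDOUT, ':encoding(UTF-8)';

sub print_var {
    my ($str) = @_;
    printf "str [%s] with flag: %s\n", $str, utf8::is_utf8($str) ? 'YES' : 'NO';
}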
And if the strings are concatenated Perl will utf8::upgrade $latin1, and
you can see that is true because the umlaut survived the trip from latin1
to utf8.
print_var( "Joined = '$ascii : $latin1 : $unicode'" );
str [Joined = 'Hello : Ue: Ü : Happy ☺'] with flag: YES
Those three strings are perfectly fine Perl character strings, and they can
be combined into a hash and fed to query_form(). The print_query helper below
simply passes a hashref to $uri->query_form and then prints $uri->query:
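(Again just a sketch of the helper -- the real one is in the linked test
script, but it is essentially:)

use URI;

sub print_query {
    my ($params) = @_;
    my $uri = URI->new('http:');
    $uri->query_form($params);
    my $q = $uri->query;
    printf "URI query_form = str [%s] with flag: %s\n",
        $q, utf8::is_utf8($q) ? 'YES' : 'NO';
}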
print_query( { ascii => $ascii } );
print_query( { ascii => $ascii, latin1 => $latin1 } );
print_query( { ascii => $ascii, latin1 => $latin1, unicode => $unicode } );
URI query_form = str [ascii=Hello] with flag: NO
URI query_form = str [ascii=Hello&*latin1=Ue%3A+%DC*] with flag: NO
URI query_form = str [ascii=Hello&unicode=Happy+%E2%98%BA&*latin1=Ue%3A+%C3%9C*] with flag: NO
The thing to notice here is how the encoding of $latin1 changed just because
the $unicode string was added to the hash. Things thus break when the server
tries to decode the query parameters, if it assumes the request is either
latin1 or utf8 encoded.
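To make the breakage concrete, here is roughly what a server that assumes a
utf8-encoded request would see for the two escapings of the same value:

use URI::Escape qw(uri_unescape);
use Encode qw(decode);

# the same character, escaped two different ways by query_form above
my $utf8_bytes   = uri_unescape('Ue%3A+%C3%9C');   # octets "Ue:+\xC3\x9C"
my $latin1_bytes = uri_unescape('Ue%3A+%DC');      # octets "Ue:+\xDC"

my $ok  = decode('UTF-8', $utf8_bytes);    # "Ue:+\x{DC}"   -- decodes cleanly to Ü
my $bad = decode('UTF-8', $latin1_bytes);  # "Ue:+\x{FFFD}" -- \xDC is not valid UTF-8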
The problem is that I have code that accepts a hash and passes it directly to
POST. But if there happens to be a latin1 string in there, the request changes
depending on whether there is also a string with the utf8 flag set.
Am I missing something here? Seems like if query_form is passed a hash
then the resulting encoding should not change based on what else is in that
hash.
I can think of two solutions. One would be to build the query string in a
different way by explicitly encoding to utf8 first:
use Encode qw(encode_utf8);
use URI::Escape qw(uri_escape);
# encode every key and value to utf8 octets, then escape those octets
my %encoded_params = map { uri_escape( encode_utf8($_) ) } %{$params};
my $query = join '&', map { "$_=$encoded_params{$_}" } keys %encoded_params;
Another way would be to explicitly utf8::upgrade every key and value in the
hash before query_form() does its work, as in the sketch below.
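An untested sketch of that second approach, using the same $params hashref
and $uri as above:

my %upgraded;
while ( my ($k, $v) = each %{$params} ) {
    utf8::upgrade($k);              # each() hands back copies, so upgrading here is safe
    utf8::upgrade($v);
    $upgraded{$k} = $v;
}
$uri->query_form( \%upgraded );     # every string now carries the utf8 flag,
                                    # so the escaping is consistently utf8 octets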
Obviously, that would break anyone who is only using latin1 strings and
assuming a latin1 url-encoded request body.
My ugly test script is here: http://hank.org/utf8post.pl
--
Bill Moseley
[email protected]