I use HTTP::Request::Common to build an application/x-www-form-urlencoded
POST from a passed-in hash. The hash contains strings as values.
my $req = POST '/foo', \%parameters;
Internally this uses URI's query_form() to build the url-encoded body and sets
the Content-Type to application/x-www-form-urlencoded (without any charset parameter).
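You can see that by dumping a simple request; the output looks roughly like this:

use HTTP::Request::Common qw(POST);

my $req = POST '/foo', { ascii => 'Hello' };
print $req->as_string;

# prints something like:
#   POST /foo
#   Content-Length: 11
#   Content-Type: application/x-www-form-urlencoded
#
#   ascii=Hello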
The problem I have is that the %parameters hash contains valid Perl
character strings, but the resulting url-encoded request differs depending
on the mix of Perl strings -- by mix I mean strings with and without Perl's
internal utf8 flag. The resulting url-encoded body ends up containing either
latin1- or utf8-encoded octets.
It's not easy to know what charset to add to the request, and likewise,
things break if the server handling the request assumes it's a utf8
url-encoded request.
Perhaps some code might help:
Consider these three very normal, valid, *character* strings in Perl:
my $ascii = 'Hello';
my $latin1 = 'Ue: ' . chr(220);
my $unicode = "Happy \x{263A}";
What you would expect is that only $unicode would have Perl's utf8 flag
set. And indeed that is true:
print_var($_) for $ascii, $latin1, $unicode;
str [Hello] with flag: NO
str [Ue: Ü] with flag: NO
str [Happy ☺] with flag: YES
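(print_var is just a small helper -- the real one is in the test script linked
at the end, but it amounts to roughly this, with STDOUT given a UTF-8 layer so
the characters print cleanly:)

binmode STDOUT, ':encoding(UTF-8)';

sub print_var {
    my ($str) = @_;
    printf "str [%s] with flag: %s\n", $str, utf8::is_utf8($str) ? 'YES' : 'NO';
}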
And if the strings are concatenated Perl will utf8::upgrade $latin1, and
you can see that is true because the umlaut survived the trip from latin1
to utf8.
print_var( "Joined = '$ascii : $latin1 : $unicode'" );
str [Joined = 'Hello : Ue: Ü : Happy ☺'] with flag: YES
Those three strings are perfectly fine Perl character strings, and they can
be combined into a hash and fed to query_form(). The print_query helper below
simply passes a hashref to $uri->query_form and then prints $uri->query:
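(Again just a sketch of the helper -- the real one is in the linked test
script, but it is essentially:)

use URI;

sub print_query {
    my ($params) = @_;
    my $uri = URI->new('http:');
    $uri->query_form($params);
    my $q = $uri->query;
    printf "URI query_form = str [%s] with flag: %s\n",
        $q, utf8::is_utf8($q) ? 'YES' : 'NO';
}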
print_query( { ascii => $ascii } );
print_query( { ascii => $ascii, latin1 => $latin1 } );
print_query( { ascii => $ascii, latin1 => $latin1, unicode => $unicode } );
URI query_form = str [ascii=Hello] with flag: NO
URI query_form = str [ascii=Hello&*latin1=Ue%3A+%DC*] with flag: NO
URI query_form = str [ascii=Hello&unicode=Happy+%E2%98%BA&*latin1=Ue%3A+%C3%9C*] with flag: NO
The thing to notice here is how the encoding of $latin1 changed just because
the $unicode string was added to the hash. Things thus break when the server
tries to decode the query parameters, if it assumes the request is either
latin1 or utf8 encoded.
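To make the breakage concrete, here is roughly what a server that assumes a
utf8-encoded request would see for the two escapings of the same value:

use URI::Escape qw(uri_unescape);
use Encode qw(decode);

# the same character, escaped two different ways by query_form above
my $utf8_bytes   = uri_unescape('Ue%3A+%C3%9C');   # octets "Ue:+\xC3\x9C"
my $latin1_bytes = uri_unescape('Ue%3A+%DC');      # octets "Ue:+\xDC"

my $ok  = decode('UTF-8', $utf8_bytes);    # "Ue:+\x{DC}"   -- decodes cleanly to Ü
my $bad = decode('UTF-8', $latin1_bytes);  # "Ue:+\x{FFFD}" -- \xDC is not valid UTF-8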
The problem is that I have code that accepts a hash and passes it directly to
POST. But if there happens to be a latin1 string in there, the request changes
depending on whether there is also a string with the utf8 flag set.
Am I missing something here? Seems like if query_form is passed a hash
then the resulting encoding should not change based on what else is in that
hash.
I can think of two solutions. One would be to build the query string in a
different way by explicitly encoding to utf8 first:
use Encode qw(encode_utf8);
use URI::Escape qw(uri_escape);
# encode every key and value to utf8 octets, then escape those octets
my %encoded_params = map { uri_escape( encode_utf8($_) ) } %{$params};
my $query = join '&', map { "$_=$encoded_params{$_}" } keys %encoded_params;
Another way would be to explicitly utf8::upgrade every key and value in the
hash before query_form() does its work, as in the sketch below.
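An untested sketch of that second approach, using the same $params hashref
and $uri as above:

my %upgraded;
while ( my ($k, $v) = each %{$params} ) {
    utf8::upgrade($k);              # each() hands back copies, so upgrading here is safe
    utf8::upgrade($v);
    $upgraded{$k} = $v;
}
$uri->query_form( \%upgraded );     # every string now carries the utf8 flag,
                                    # so the escaping is consistently utf8 octets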
Obviously, that would break anyone who is only using latin1 strings and
assuming a latin1 url-encoded request body.
My ugly test script is here: http://hank.org/utf8post.pl
--
Bill Moseley
[email protected]