On Dec 17, 2007 1:33 PM, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> >> It looks like SolrJ uses percent encoded UTF8 in the POST body for
> >> parameters, just as it does in the URL.
> >> Does anyone know if this double-encoding (percent encoding of UTF-8
> >> bytes) is a standard for application/x-www-form-urlencoded?
> >
> > I don't believe it is.
> >
>
> It is the way it is because it worked and then I moved on ;)  char-set
> stuff has always felt a bit like voodoo to me.
>
> I think we should do whatever is most standard and likely to work on
> most servers with limited fuss.

It looks to me like HttpClient is doing the encoding... I quickly
tried to change it via setting the header to include the charset, but
the body turns out the same:

$ nc -l -p 8983
POST /solr/select HTTP/1.1
Content-type: application/x-www-form-urlencoded; charset=UTF-8
User-Agent: Solr[org.apache.solr.client.solrj.impl.CommonsHttpSolrServer] 1.0
Host: localhost:8983
Content-Length: 42

q=features%3Ah%C3%A9llo&wt=xml&version=2.2

This charset declaration makes me uncomfortable though, as the body
is *not* straight UTF-8: it is UTF-8 bytes that have then been
percent-escaped, using the URI escaping scheme from
http://www.ietf.org/rfc/rfc2396.txt
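
For illustration, the two-step encoding seen in that body can be
reproduced outside of HttpClient. This is a minimal sketch in Python
(not SolrJ or HttpClient code), assuming the parameters were
q=features:héllo, wt=xml and version=2.2, as the captured body suggests:

```python
from urllib.parse import urlencode

# Parameters assumed from the captured request body above.
params = [("q", "features:h\u00e9llo"), ("wt", "xml"), ("version", "2.2")]

# urlencode() first encodes each value to UTF-8 bytes, then percent-escapes
# every byte outside the unreserved set (so ':' -> %3A, 'é' -> %C3%A9).
body = urlencode(params, encoding="utf-8")

# The result is pure ASCII, so its byte length is what Content-Length reports.
content_length = len(body.encode("ascii"))

print(body)
print(content_length)
```

The output matches the captured body and its Content-Length, which is
consistent with the client encoding to UTF-8 first and percent-escaping
second.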

Unfortunately, I haven't been able to find any standard that relates
the application/x-www-form-urlencoded MIME type to Unicode.  In
the absence of any special way to do it, it seems like it should just
obey the normal charset rules for POST bodies... of course we need to
work with what people have actually implemented.
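
To make the interoperability concern concrete: a receiver has to pick
a charset when it undoes the percent-escapes, and a wrong guess
silently corrupts the value. A small Python sketch, only an
illustration and not how Solr or any particular servlet container
decodes the body:

```python
from urllib.parse import parse_qs

body = "q=features%3Ah%C3%A9llo&wt=xml&version=2.2"

# Percent-escapes are turned back into bytes, then decoded with the
# charset the receiver assumes.
as_utf8 = parse_qs(body, encoding="utf-8")["q"][0]
as_latin1 = parse_qs(body, encoding="latin-1")["q"][0]

print(as_utf8)    # features:héllo
print(as_latin1)  # features:hÃ©llo
```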

-Yonik
