On Dec 17, 2007 1:33 PM, Ryan McKinley <[EMAIL PROTECTED]> wrote:
> >> It looks like SolrJ uses percent encoded UTF8 in the POST body for
> >> parameters, just as it does in the URL.
> >> Does anyone know if this double-encoding (percent encoding of UTF-8
> >> bytes) is a standard for application/x-www-form-urlencoded?
> >
> > I don't believe it is.
> >
>
> It is the way it is because it worked and then I moved on ;) char-set
> stuff has always felt a bit like voodoo to me.
>
> I think we should do whatever is most standard and likely to work on
> most servers with limited fuss.
It looks to me like HttpClient is doing the encoding...
I quickly tried to change it by setting the header to include the
charset, but the body turns out the same:

$ nc -l -p 8983
POST /solr/select HTTP/1.1
Content-type: application/x-www-form-urlencoded; charset=UTF-8
User-Agent: Solr[org.apache.solr.client.solrj.impl.CommonsHttpSolrServer] 1.0
Host: localhost:8983
Content-Length: 42

q=features%3Ah%C3%A9llo&wt=xml&version=2.2

This charset declaration makes me feel uncomfortable though, as the
body is *not* straight UTF8, but uses the double-coded URI standard
from http://www.ietf.org/rfc/rfc2396.txt

Unfortunately, I haven't been able to find any standard that relates
to the application/x-www-form-urlencoded MIME type and Unicode.
In the absence of any special way to do it, it seems like it should
just obey the normal charset rules for POST bodies... of course we
need to work with what people have actually implemented.

-Yonik
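For anyone who wants to reproduce the body shown in the nc capture
without going through SolrJ, here is a rough sketch. It uses Python's
standard urllib.parse rather than HttpClient, so it only illustrates
the same two-step encoding (text to UTF-8 bytes, then percent-escape
the bytes); it is not what HttpClient does internally. The Latin-1
line is my own addition for contrast, not something from the capture.

```python
from urllib.parse import urlencode, parse_qs

# Parameters taken from the captured request.
params = {"q": "features:héllo", "wt": "xml", "version": "2.2"}

# Step 1: each value is encoded to UTF-8 bytes.
# Step 2: every byte outside the unreserved set is percent-escaped.
body = urlencode(params, encoding="utf-8")
print(body)       # q=features%3Ah%C3%A9llo&wt=xml&version=2.2
print(len(body))  # 42, matching the Content-Length in the capture

# A receiver that assumes UTF-8 underneath the escapes gets the text back.
decoded = parse_qs(body, encoding="utf-8")
print(decoded["q"][0])

# Contrast (assumption for illustration): the same text escaped from
# Latin-1 bytes gives a different body, and nothing inside the body
# says which charset was used.
latin1_body = urlencode(params, encoding="latin-1")
print(latin1_body)  # q=features%3Ah%E9llo&wt=xml&version=2.2
```

The body itself is pure ASCII either way, which is why the charset
parameter on the Content-type header does not change the bytes on the
wire; it can only act as a hint for how to interpret the escapes.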