I'm trying to post a PDF along with a whole bunch of metadata fields to the ExtractingRequestHandler as multipart/form-data. It works fine except for the utf-8 character handling. Here is what my post looks like (abridged):
POST /solr/update/extract HTTP/1.1
TE: deflate,gzip;q=0.3
Connection: TE, close
Host: localhost:8983
Content-Length: 21418
Content-Type: multipart/form-data; boundary=wyAjGU0yDXmvWK8IWqY50a67Z2lsu2yU1UpEiPDX

--wyAjGU0yDXmvWK8IWqY50a67Z2lsu2yU1UpEiPDX
Content-Disposition: form-data; name=literal.title

smart >>‘<< quote
--wyAjGU0yDXmvWK8IWqY50a67Z2lsu2yU1UpEiPDX
Content-Disposition: form-data; name="myfile"; filename="text.pdf.1174588823"
Content-Type: application/pdf
Content-Transfer-Encoding: binary

...binary pdf data

I've verified on the network that the quote character, a LEFT SINGLE QUOTATION MARK (U+2018), is going across the wire as the UTF-8 bytes "e2 80 98", which is correct. However, when I search for the document in Solr, it comes back as the byte sequence "c3 a2 c2 80 c2 98", which I'm guessing means it is being UTF-8 encoded twice.

multipart/form-data is MIME, which as I understand it is supposed to be 7-bit, so I've tried encoding any non-ASCII fields as quoted-printable:

Content-Disposition: form-data; name=literal.title
Content-Transfer-Encoding: quoted-printable

smart >>=E2=80=98<< quote=

as well as base64:

Content-Disposition: form-data; name=literal.title
Content-Transfer-Encoding: base64

c21hcnQgPj7igJg8PCBxdW90ZSBmb29iYXI=

but what Solr puts in its index is just that literal value; it decodes neither the quoted-printable nor the base64.

I've also tried encoding the UTF-8 values as HTML entities, but Solr doesn't unescape those either, so any accented characters are stored as the HTML entities rather than as the Unicode characters.

Can anybody give me any pointers as to where I might be going wrong, where to look for solutions, or any different/better ways to handle this?

Thanks!
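
P.S. For reference, here is a minimal Python sketch of why I suspect double encoding. It assumes that something on the receiving side reads the UTF-8 bytes as ISO-8859-1 and then writes the result out as UTF-8 again. That is only my guess; I have not confirmed where, or whether, Solr does this. The helper name double_encode is made up for illustration.

```python
def double_encode(raw: bytes) -> bytes:
    """Misread UTF-8 bytes as ISO-8859-1, then encode the result as UTF-8 again.

    Assumption: this mimics what the receiving side might be doing; it is
    not taken from Solr's code.
    """
    misread = raw.decode("iso-8859-1")  # every byte becomes one code point
    return misread.encode("utf-8")      # bytes >= 0x80 now take two bytes each


def as_hex(data: bytes) -> str:
    """Format bytes the way they appear in a network dump."""
    return " ".join(f"{b:02x}" for b in data)


# LEFT SINGLE QUOTATION MARK (U+2018) as it goes across the wire.
wire_bytes = "\u2018".encode("utf-8")
stored_bytes = double_encode(wire_bytes)

print(as_hex(wire_bytes))    # e2 80 98
print(as_hex(stored_bytes))  # c3 a2 c2 80 c2 98
```

The second line of output matches the byte sequence I get back from Solr, which is consistent with the field value being decoded with a single-byte charset somewhere before it is indexed.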