Hi, my previous message was partially wrong:

Please note that ANY IMAGINABLE SOLUTION will use encoding/decoding; the
real question is "where should it happen?"
        A. the (Solr) Java container is responsible for UTF-16 <-> Java String
        B. the "client" converts UTF-8 <-> UTF-16 before submitting data to the
(Solr) Java container

And the correct answer is A, because Java internally stores all strings in
UTF-16, so the overhead of the (document) UTF-16 <-> (Java) UTF-16
conversion is absolutely minimal (and performance is the best possible,
although file sizes could be higher...)
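To illustrate the point about option A (a minimal sketch; the sample string is arbitrary): decoding UTF-16 bytes into a Java String is close to a straight copy, because Java's internal char[] already holds UTF-16 code units.

```java
import java.nio.charset.StandardCharsets;

public class Utf16RoundTrip {
    public static void main(String[] args) {
        String original = "héllo wörld";  // arbitrary sample text
        // Bytes as they would appear in a UTF-16 document (BOM included).
        byte[] documentBytes = original.getBytes(StandardCharsets.UTF_16);
        // Decoding UTF-16 -> Java String is nearly a copy, since Java
        // strings are UTF-16 internally.
        String decoded = new String(documentBytes, StandardCharsets.UTF_16);
        System.out.println(decoded.equals(original)); // prints "true"
    }
}
```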

You need to start Solr (the Tomcat JVM) with the parameter 

        java -Dfile.encoding=UTF-16

http://docs.oracle.com/javase/7/docs/api/java/nio/charset/Charset.html


And, possibly, configure HTTP Connector of Tomcat to UTF-16
        <Connector port="8080" URIEncoding="UTF-16"/>

(and set the proper charset in the "Content-Type" HTTP request header when
you POST your file to Solr)
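For instance, such a POST might carry the charset explicitly (a sketch only; the handler path and body are placeholders):

```
POST /solr/update HTTP/1.1
Host: localhost:8080
Content-Type: text/xml; charset=UTF-16

<add><doc>...</doc></add>
```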



-Fuad Efendi
http://www.tokenizer.ca




-----Original Message-----
From: Fuad Efendi [mailto:f...@efendi.ca] 
Sent: October-03-12 1:30 PM
To: solr-user@lucene.apache.org
Subject: RE: Can SOLR Index UTF-16 Text

Something is missing from the body of your email... As I pointed out in my
previous message, "in general" Solr can index _everything_ (provided that
you have a Tokenizer for it); but, in addition to _indexing_, you need
HTTP-based _search_ which must understand UTF-16 (for instance)

The easiest solution is to convert files to UTF-8 before indexing and to use
UTF-8 as the default Java character encoding ( java -Dfile.encoding=UTF-8
...; including even the Tomcat HTTP settings). This is really the
simplest... and the fastest performance-wise... and you should be able to
use the Highlighter feature, etc.
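As a concrete sketch of that conversion (the file names and sample text are hypothetical), a small Java utility can rewrite a UTF-16 file as UTF-8 before it is submitted for indexing:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ToUtf8 {
    // Decode with UTF-16 (this honours the BOM), re-encode as UTF-8.
    static void transcode(Path in, Path out) throws IOException {
        String text = new String(Files.readAllBytes(in), StandardCharsets.UTF_16);
        Files.write(out, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Demo on a temp file standing in for the extracted document.
        Path in = Files.createTempFile("doc", ".utf16");
        Path out = Files.createTempFile("doc", ".utf8");
        Files.write(in, "héllo Solr".getBytes(StandardCharsets.UTF_16));
        transcode(in, out);
        System.out.println(new String(Files.readAllBytes(out),
                StandardCharsets.UTF_8)); // prints "héllo Solr"
    }
}
```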


-Fuad Efendi
http://www.tokenizer.ca





-----Original Message-----
From: vybe3142 [mailto:vybe3...@gmail.com]
Sent: October-03-12 12:30 PM
To: solr-user@lucene.apache.org
Subject: Re: Can SOLR Index UTF-16 Text

Thanks for all the responses. Problem partially solved (see below)

1. In a sense, my question is theoretical since the input to our Solr server
is (currently) UTF-8 files produced by a third-party text extraction utility
(not Tika). On the server side, we read and index the text via a custom data
handler. Last week, I tried a UTF-16 file to see what would happen, and it
wasn't handled correctly, as explained in my original question.

2. The file is UTF-16


3. We can either (a) stream the data to SOLR in the call or (b) use the
stream.file parameter to provide the file path to the SOLR handler.

Assuming case (a)

Here's how the SolrJ request is constructed (code edited for conciseness)



If I replace the last line with

things work!

What would I need to do in case (b), where the raw file is loaded
remotely, i.e. my handler reads the file directly?



In this case, how can I control what the content type is ?
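(For what it's worth, a handler that reads the file directly can pin the decoding charset itself rather than rely on any content-type negotiation — a minimal sketch, where the temp file stands in for the file stream.file would point at:)

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class HandlerRead {
    public static void main(String[] args) throws IOException {
        // Stand-in for the file a handler would be pointed at remotely.
        Path file = Files.createTempFile("doc", ".txt");
        Files.write(file, "indexed text".getBytes(StandardCharsets.UTF_16));

        // Forcing UTF-16 here makes the handler independent of the
        // request's Content-Type header.
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(
                Files.newInputStream(file), StandardCharsets.UTF_16))) {
            System.out.println(reader.readLine()); // prints "indexed text"
        }
    }
}
```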

Thanks




--
View this message in context:
http://lucene.472066.n3.nabble.com/Can-SOLR-Index-UTF-16-Text-tp4010834p4011634.html
Sent from the Solr - User mailing list archive at Nabble.com.



