Something is missing from the body of your email... As I pointed out in my previous message, "in general" Solr can index _everything_ (provided that you have a Tokenizer for it); but in addition to _indexing_ you need an HTTP-based _search_ that understands UTF-16 (for instance).
The easiest solution is to convert the files to UTF-8 before indexing, and to use UTF-8 as the default Java character encoding (java -Dfile.encoding=UTF-8 ...; including even the Tomcat HTTP settings). This is really the simplest... and the fastest performance-wise... and you should be able to use the Highlighter feature, etc...

-Fuad Efendi
http://www.tokenizer.ca

-----Original Message-----
From: vybe3142 [mailto:vybe3...@gmail.com]
Sent: October-03-12 12:30 PM
To: solr-user@lucene.apache.org
Subject: Re: Can SOLR Index UTF-16 Text

Thanks for all the responses. Problem partially solved (see below).

1. In a sense, my question is theoretical, since the input to our SOLR server is (currently) UTF-8 files produced by a third-party text extraction utility (not Tika). On the server side, we read and index the text via a custom data handler. Last week, I tried a UTF-16 file to see what would happen, and it wasn't handled correctly, as explained in my original question.

2. The file is UTF-16.

3. We can either (a) stream the data to SOLR in the call, or (b) use the stream.file parameter to provide the file path to the SOLR handler.

Assuming case (a), here's how the SOLRJ request is constructed (code edited for conciseness): [snippet not preserved in the archive]

If I replace the last line with [snippet not preserved], things work!!!!

What would I need to do in case (b), where the raw file is loaded remotely, i.e. my handler reads the file directly? In this case, how can I control what the content type is?

Thanks
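The conversion step Fuad describes might look like this in Java. This is a minimal sketch, not anyone's actual tooling: the file names are hypothetical, and StandardCharsets.UTF_16 honors a BOM if one is present, defaulting to big-endian otherwise.

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class Utf16ToUtf8 {
        public static void main(String[] args) throws IOException {
            Path in  = Paths.get("doc-utf16.txt");  // hypothetical input file
            Path out = Paths.get("doc-utf8.txt");   // hypothetical output file

            // Decode the bytes as UTF-16 (a BOM, if present, selects byte order)...
            String text = new String(Files.readAllBytes(in), StandardCharsets.UTF_16);

            // ...and write the same text back out as UTF-8 for indexing.
            Files.write(out, text.getBytes(StandardCharsets.UTF_8));
        }
    }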
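For case (a), one way to keep the charset from being lost is to declare it on the content stream sent to Solr. A hedged SolrJ sketch follows; the Solr URL, handler path, and file name are placeholders, and this is not the poster's missing snippet, just one plausible shape for it.

    import java.io.File;

    import org.apache.solr.client.solrj.SolrServer;
    import org.apache.solr.client.solrj.impl.HttpSolrServer;
    import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;

    public class StreamUtf16 {
        public static void main(String[] args) throws Exception {
            SolrServer solr = new HttpSolrServer("http://localhost:8983/solr"); // placeholder URL

            // "/mycustomhandler" stands in for the custom data handler's path
            ContentStreamUpdateRequest req = new ContentStreamUpdateRequest("/mycustomhandler");

            // Declaring the charset in the content type tells the server how to
            // decode the bytes, instead of leaving it to a default encoding.
            req.addFile(new File("doc-utf16.txt"), "text/plain; charset=UTF-16");

            solr.request(req);
            solr.shutdown();
        }
    }

The charset=UTF-16 clause is the important part; presumably the poster's working fix was something along these lines.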
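For case (b), two hedged suggestions. On the request side, Solr's remote streaming accepts a stream.contentType parameter alongside stream.file, so the client can declare the type there (e.g. stream.contentType=text/plain;charset=UTF-16). On the handler side, the code can honor whatever charset the incoming content stream declares rather than assuming one. The sketch below is an assumption about how such a handler helper could look, not Solr's own logic (ContentStreamBase.getReader() in Solr does something similar internally); readStream and the UTF-8 fallback are hypothetical choices.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.nio.charset.Charset;

    import org.apache.solr.common.util.ContentStream;
    import org.apache.solr.common.util.ContentStreamBase;

    public final class StreamText {
        // Hypothetical helper: read one content stream into a String using the
        // charset declared in its content type, defaulting to UTF-8 if absent.
        static String readStream(ContentStream stream) throws Exception {
            String declared =
                    ContentStreamBase.getCharsetFromContentType(stream.getContentType());
            Charset charset = Charset.forName(declared != null ? declared : "UTF-8");

            BufferedReader reader =
                    new BufferedReader(new InputStreamReader(stream.getStream(), charset));
            try {
                StringBuilder sb = new StringBuilder();
                char[] buf = new char[8192];
                int n;
                while ((n = reader.read(buf)) != -1) {
                    sb.append(buf, 0, n);
                }
                return sb.toString();
            } finally {
                reader.close();
            }
        }
    }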