If it is a simple text file, does it start with the UTF-16 byte order mark (BOM)? See http://unicode.org/faq/utf_bom.html
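One quick way to check for the BOM is to look at the first two bytes of the file. Here is a minimal Python sketch; the function names are mine and purely illustrative, and it only tells you whether a UTF-16 BOM is present, not whether the rest of the file is valid UTF-16.

```python
import codecs

def detect_utf16_bom(data: bytes):
    """Return 'utf-16-le' or 'utf-16-be' if data starts with a UTF-16 BOM, else None.

    Caveat: the UTF-32 little-endian BOM (FF FE 00 00) also begins with
    FF FE, so a UTF-32 LE file would be reported here as 'utf-16-le'.
    """
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return "utf-16-le"
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return "utf-16-be"
    return None

def file_utf16_bom(path):
    """Read only the first two bytes of the file and check them for a BOM."""
    with open(path, "rb") as f:
        return detect_utf16_bom(f.read(2))
```

Calling file_utf16_bom() on the file you are trying to index should return a byte order if the BOM is there, or None if the file has no UTF-16 BOM.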
Also, do UTF-8 files work? If not, then your setup has a basic encoding problem. And when you post such a text file (for example, with curl), set the UTF-16 charset in the MIME type: I think it is "text/plain; charset=utf-16".

----- Original Message -----
| From: "Chris Hostetter" <hossman_luc...@fucit.org>
| To: solr-user@lucene.apache.org
| Sent: Friday, September 28, 2012 5:17:15 PM
| Subject: Re: Can SOLR Index UTF-16 Text
|
| : Our SOLR setup (4.0.BETA on Tomcat 6) works as expected when indexing UTF-8
| : files. Recently, however, we noticed that it has issues with indexing
| : certain text files, e.g. UTF-16 files. See attachment for an example
| : (tarred+zipped)
| :
| : tesla-utf16.txt
| : <http://lucene.472066.n3.nabble.com/file/n4010834/tesla-utf16.txt>
|
| No attachment came through to the list, and the URL nabble seems to have
| provided when you posted your message leads to a 404.
|
| In general, the question of "is indexing a UTF-16 file supported" largely
| depends on *how* you are indexing this file -- if it's plain text, are you
| parsing it yourself using some client code, and then sending it to solr,
| are you using DIH to read it from disk? are you using
| ExtractingRequestHandler?
|
| those are all very different ways to index data in Solr -- and what you
| are doing determines how/where the encoding of that file is processed.
|
| -Hoss