Hi,

You've likely got some non-character code points in your data, and they need to be stripped:
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]

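For illustration, here is a minimal Java sketch of that kind of stripping. It is not the NUTCH-1016 patch itself, and the class and method names are made up for this example. It assumes the data has already been decoded into a Java String, and it drops the 66 Unicode noncharacters (U+FDD0..U+FDEF, plus every code point whose low 16 bits are FFFE or FFFF):

```java
// Hypothetical helper, named for this example only; not part of Solr or Nutch.
class NoncharacterStripper {

    // True for the 66 Unicode noncharacters: U+FDD0..U+FDEF, and the last
    // two code points of every plane (U+FFFE, U+FFFF, U+1FFFE, ... U+10FFFF).
    static boolean isNoncharacter(int cp) {
        return (cp >= 0xFDD0 && cp <= 0xFDEF) || (cp & 0xFFFE) == 0xFFFE;
    }

    // Returns a copy of the input with all noncharacter code points removed.
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            // Read a whole code point so surrogate pairs are handled as one unit.
            int cp = input.codePointAt(i);
            if (!isNoncharacter(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }
}
```

You would call something like NoncharacterStripper.strip(fieldValue) on each field value before building the update XML. Iterating by code point rather than by char is a design choice here, so that supplementary-plane noncharacters such as U+1FFFE are caught as well.
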
See the patch for NUTCH-1016 for an example of how to strip them. It's easily ported to other languages.
https://issues.apache.org/jira/browse/NUTCH-1016

Cheers,

-----Original message-----
> From: Sujatha Arun <suja.a...@gmail.com>
> Sent: Tue 22-Jan-2013 12:35
> To: solr-user@lucene.apache.org
> Subject: solr 3.6.1 Indexing and utf8 issue
>
> Hi,
>
> We are on Solr 3.6.1 on Tomcat 5.5.25. The indexing of Polish content
> throws the following error.
>
> Caused by: com.ctc.wstx.exc.WstxIOException: Invalid UTF-8 middle byte 0x77
> (at char #166, byte #127)
>     at com.ctc.wstx.sr.StreamScanner.throwFromIOE(StreamScanner.java:708)
>     at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1086)
>     at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:309)
>     at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)
>     at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
>     ... 20 more
> Caused by: java.io.CharConversionException: Invalid UTF-8 middle byte 0x77
>
> I have added a patch to enable utf-8 encoding in the solrDispatchFilter.java file.
>
> The same content file in 1.3 with the utf8 patch works fine. Please find attached
> the content file.
>
> Please let me know what could be missing.
>
> Regards
> Sujatha