Hi,

Your data most likely contains Unicode noncharacter code points, and they need
to be stripped before indexing.
http://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:Noncharacter_Code_Point=True:]

See the patch for NUTCH-1016 for an example of how to strip them. It's easily 
ported to other languages.
https://issues.apache.org/jira/browse/NUTCH-1016
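
A rough sketch of the idea in Java, in case it helps. This is not the code from
the NUTCH-1016 patch; the class and method names are made up for illustration.
It assumes the "noncharacters" are the 66 code points Unicode defines as such
(U+FDD0..U+FDEF, plus the last two code points of every plane) and drops them
while leaving everything else, including Polish letters, untouched.

```java
/** Hypothetical helper; names are illustrative and not taken from Nutch or Solr. */
final class NoncharacterStripper {

    /** True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in any plane. */
    static boolean isNoncharacter(int codePoint) {
        return (codePoint >= 0xFDD0 && codePoint <= 0xFDEF)
                || (codePoint & 0xFFFE) == 0xFFFE;
    }

    /** Returns a copy of the input with all noncharacter code points removed. */
    static String strip(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // Read whole code points so surrogate pairs are handled as one unit.
            int codePoint = input.codePointAt(i);
            i += Character.charCount(codePoint);
            if (!isNoncharacter(codePoint)) {
                out.appendCodePoint(codePoint);
            }
        }
        return out.toString();
    }
}
```

You would presumably call something like NoncharacterStripper.strip(value) on
each field value before building the update request.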

Cheers,

 
 
-----Original message-----
> From:Sujatha Arun <suja.a...@gmail.com>
> Sent: Tue 22-Jan-2013 12:35
> To: solr-user@lucene.apache.org
> Subject: solr 3.6.1 Indexing and utf8 issue
> 
> Hi,
> 
> We are on Solr 3.6.1 on Tomcat 5.5.25. Indexing Polish content 
> throws the following error.
> 
> Caused by: com.ctc.wstx.exc.WstxIOException: Invalid UTF-8 middle byte 0x77 
> (at char #166, byte #127)
> at com.ctc.wstx.sr.StreamScanner.throwFromIOE(StreamScanner.java:708)
> at com.ctc.wstx.sr.BasicStreamReader.next(BasicStreamReader.java:1086)
> at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:309)
> at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:156)
> at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:79)
> ... 20 more
> Caused by: java.io.CharConversionException: Invalid UTF-8 middle byte 0x77
> 
> 
> 
> I have added a patch to enable UTF-8 encoding in the SolrDispatchFilter.java file.
> 
> The same content file works fine in 1.3 with the UTF-8 patch. Please find the 
> content file attached.
> 
> Please let me know what could be missing?
> 
> Regards
> Sujatha
> 
