I suggest avoiding illegal UTF-8 characters by pre-filtering your
content stream before loading.

Unicode   UTF-8(hex)
U+07FF    df bf
U+0800    e0 a0 80

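So there is no valid UTF-8 0xffff. The byte 0xff never occurs in UTF-8, and
the code point U+FFFF (which would encode as ef bf bf) is a noncharacter that
XML 1.0 does not allow. It is illegal.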
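
A minimal sketch of such a pre-filter in Java, assuming you can hook it in at
the point where field values are built, before they are sent to Solr. The class
and method names are made up for illustration and are not part of Nutch or
SolrJ. It drops every code point outside the XML 1.0 Char range, which
includes U+FFFE, U+FFFF, most control characters and unpaired surrogates.

```java
class XmlCharFilter {

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the input with every illegal code point dropped.
    static String stripIllegalXmlChars(String in) {
        if (in == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); ) {
            // An unpaired surrogate comes back as its own value and is rejected.
            int cp = in.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String dirty = "abc\uFFFFdef\u0000";
        System.out.println(stripIllegalXmlChars(dirty)); // prints "abcdef"
    }
}
```

Dropping the character is only one option; replacing it with a space or with
U+FFFD would keep word boundaries intact if that matters for your analysis
chain.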

Regards


On 27.06.2011 12:40, Markus Jelsma wrote:
Hi,

I came across the indexing error below. It happened in a huge batch update
from Nutch with SolrJ 3.1. Since the crawl was huge, it is very hard to trace
the error back to a specific document. So I'll try my luck here: has anyone
seen this before with SolrJ 3.1? Is there anything else on the Nutch side I
should have taken care of?

Thanks!


Jun 27, 2011 10:24:28 AM org.apache.solr.core.SolrCore execute
INFO: [] webapp=/solr path=/update params={wt=javabin&version=2} status=500 QTime=423
Jun 27, 2011 10:24:28 AM org.apache.solr.common.SolrException log
SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
         at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
         at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
         at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
         at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
         at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:287)
         at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:146)
         at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77)
         at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67)
         at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129)
         at org.apache.solr.core.SolrCore.execute(SolrCore.java:1368)
         at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:356)
         at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:252)
         at org.mortbay.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1212)
         at org.mortbay.jetty.servlet.ServletHandler.handle(ServletHandler.java:399)
         at org.mortbay.jetty.security.SecurityHandler.handle(SecurityHandler.java:216)
         at org.mortbay.jetty.servlet.SessionHandler.handle(SessionHandler.java:182)
         at org.mortbay.jetty.handler.ContextHandler.handle(ContextHandler.java:766)
         at org.mortbay.jetty.webapp.WebAppContext.handle(WebAppContext.java:450)
         at org.mortbay.jetty.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:230)
         at org.mortbay.jetty.handler.HandlerCollection.handle(HandlerCollection.java:114)
         at org.mortbay.jetty.handler.HandlerWrapper.handle(HandlerWrapper.java:152)
         at org.mortbay.jetty.Server.handle(Server.java:326)
         at org.mortbay.jetty.HttpConnection.handleRequest(HttpConnection.java:542)
         at org.mortbay.jetty.HttpConnection$RequestHandler.content(HttpConnection.java:945)
         at org.mortbay.jetty.HttpParser.parseNext(HttpParser.java:843)
         at org.mortbay.jetty.HttpParser.parseAvailable(HttpParser.java:218)
         at org.mortbay.jetty.HttpConnection.handle(HttpConnection.java:404)
         at org.mortbay.jetty.bio.SocketConnector$Connection.run(SocketConnector.java:228)
         at org.mortbay.thread.QueuedThreadPool$PoolThread.run(QueuedThreadPool.java:582)
Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xffff at char #1142033, byte #1155068)
         at com.ctc.wstx.io.UTF8Reader.reportInvalid(UTF8Reader.java:335)
         at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:249)
         at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
         at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
         at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
         at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
         at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
         at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
         at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
         at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
         ... 26 more


--
*************************************************************
Bernd Fehling                Universitätsbibliothek Bielefeld
Dipl.-Inform. (FH)                        Universitätsstr. 25
Tel. +49 521 106-4060                   Fax. +49 521 106-4052
bernd.fehl...@uni-bielefeld.de                33615 Bielefeld

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************
