Hello,

I am crawling some sites with Apache Nutch and indexing the results with Solr. This had been working fine until a few days ago. A crawl can contain 200K or more documents. When I send it to Solr for indexing with

    bin/nutch solrindex http://xxxx.com:8080/solr crawl/crawldb -linkdb crawl/linkdb crawl/segments/*

Nutch fails with "Solr server internal error". The Solr 4.1 log shows the error below.

It is very hard to find out which document is causing the problem. What I need is either a way to configure Solr so that it ignores documents containing bad data and continues indexing the remaining documents coming from Nutch, or, although I am new to Solr, perhaps I could write an update pre/post-processor plugin for the Solr update job that ignores XML errors. Is there a solution for this problem?

I appreciate your help.

java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #478803, byte #606190)
    at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18)
    at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657)
    at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
    at org.apache.solr.handler.loader.XMLLoader.readDoc(XMLLoader.java:393)
    at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:245)
    at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:173)
    at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
    at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:448)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:269)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:222)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:123)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:171)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:99)
    at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:931)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
    at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1004)
    at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:589)
    at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:312)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1110)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
    at java.lang.Thread.run(Thread.java:722)
Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xffff at char #478803, byte #606190)
    at com.ctc.wstx.io.UTF8Reader.reportInvalid(UTF8Reader.java:335)
    at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:249)
    at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101)
    at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84)
    at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57)
    at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992)
    at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628)
    at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126)
    at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649)
    ... 25 more

--
View this message in context: http://lucene.472066.n3.nabble.com/java-io-CharConversionException-Invalid-UTF-8-character-0xffff-at-char-478803-byte-606190-tp4055323.html
Sent from the Solr - User mailing list archive at Nabble.com.
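
To make the "ignore bad data" idea above concrete: the character named in the exception, 0xffff, is outside the set of characters that XML 1.0 allows, so an XML update containing it cannot be parsed. A minimal sketch of a client-side cleanup is shown below, assuming the field values can be filtered before the documents are posted to Solr. The class name XmlCharFilter and the method stripIllegalXmlChars are hypothetical names used only for illustration; they are not part of the Nutch or Solr API, and where such a filter would hook into the indexing job is not shown.

```java
// Hypothetical helper, not part of Nutch or Solr: removes code points that
// XML 1.0 does not allow, such as U+FFFF from the exception above.
final class XmlCharFilter {

    /** Returns true if the code point matches the XML 1.0 "Char" production. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Copies the input, dropping every code point that XML 1.0 forbids. */
    static String stripIllegalXmlChars(String input) {
        StringBuilder out = new StringBuilder(input.length());
        for (int i = 0; i < input.length(); ) {
            // Read a full code point so surrogate pairs are kept together.
            int cp = input.codePointAt(i);
            if (isLegalXmlChar(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample value with an embedded U+FFFF; the data is made up.
        String dirty = "title\uFFFF with bad char";
        System.out.println(stripIllegalXmlChars(dirty));
    }
}
```

This only removes the offending characters and keeps the rest of the document; it does not skip the document or tell you which one contained the bad data, so logging the document id whenever the cleaned string differs from the input may be a useful addition.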