I use solr 4 trunk to index some sites with nutch 1-2-rc4. When i try to index 300k documents with solr4 i get error. But when i use solr 1.4.1 version there is no problem like that. I install solr4 to tomcat5,6 jetty7,8 there is no change.
I use apache-solr-core-1.4.0.jar apache-solr-solrj-1.4.0.jar for solr 1.4.1 becouse of javabin errors. here is problematic chars. "Sao Tom���nd Princip���STP" SEVERE: java.lang.RuntimeException: [was class java.io.CharConversionException] Invalid UTF-8 character 0xffff at char #681112, byte #700315) at com.ctc.wstx.util.ExceptionUtil.throwRuntimeException(ExceptionUtil.java:18) at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:731) at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3657) at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809) at org.apache.solr.handler.XMLLoader.readDoc(XMLLoader.java:266) at org.apache.solr.handler.XMLLoader.processUpdate(XMLLoader.java:126) at org.apache.solr.handler.XMLLoader.load(XMLLoader.java:77) at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:67) at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:129) at org.apache.solr.core.SolrCore.execute(SolrCore.java:1308) at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:353) at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:248) at org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1323) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:476) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:119) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:480) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:225) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:937) at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:406) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:183) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:871) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:117) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:247) at org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:149) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:110) at org.eclipse.jetty.server.Server.handle(Server.java:346) at org.eclipse.jetty.server.HttpConnection.handleRequest(HttpConnection.java:589) at org.eclipse.jetty.server.HttpConnection$RequestHandler.content(HttpConnection.java:1065) at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:915) at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:220) at org.eclipse.jetty.server.HttpConnection.handle(HttpConnection.java:411) at org.eclipse.jetty.io.nio.SelectChannelEndPoint.handle(SelectChannelEndPoint.java:535) at org.eclipse.jetty.io.nio.SelectChannelEndPoint$1.run(SelectChannelEndPoint.java:40) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:529) at java.lang.Thread.run(Thread.java:662) Caused by: java.io.CharConversionException: Invalid UTF-8 character 0xffff at char #681112, byte #700315) at com.ctc.wstx.io.UTF8Reader.reportInvalid(UTF8Reader.java:335) at com.ctc.wstx.io.UTF8Reader.read(UTF8Reader.java:249) at com.ctc.wstx.io.MergedReader.read(MergedReader.java:101) at com.ctc.wstx.io.ReaderSource.readInto(ReaderSource.java:84) at com.ctc.wstx.io.BranchingReaderSource.readInto(BranchingReaderSource.java:57) at com.ctc.wstx.sr.StreamScanner.loadMore(StreamScanner.java:992) at com.ctc.wstx.sr.BasicStreamReader.readTextSecondary(BasicStreamReader.java:4628) at com.ctc.wstx.sr.BasicStreamReader.readCoalescedText(BasicStreamReader.java:4126) at com.ctc.wstx.sr.BasicStreamReader.finishToken(BasicStreamReader.java:3701) at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3649) ... 32 more -- View this message in context: http://lucene.472066.n3.nabble.com/strange-utf-8-problem-tp3094473p3094473.html Sent from the Solr - User mailing list archive at Nabble.com.