Also, there's a custom loader in that stack trace, and it is the culprit: com.lsegroup.solr.handler.CwsExtractingDocumentLoader
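
For what it's worth, here is a minimal sketch of what the SAXParseException in the quoted trace is complaining about. It uses only the Python standard library, not Tika or Solr, and the sample page and helper names (PAGE, parses_as_xml, html_tags) are made up for illustration. I'm also assuming the crawled pages are ordinary HTML, which the trace alone doesn't prove. A strict XML parser rejects an `<img>` with no end tag, while a lenient HTML parser reads the same markup without complaint.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical sample page: an HTML-style <img> with no closing tag.
PAGE = '<html><body><p>logo</p><img src="logo.png"></body></html>'


def parses_as_xml(text):
    """Return True if a strict XML parser accepts the text."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # Raised for mismatched tags such as <img> closed by </body>.
        return False


class TagCollector(HTMLParser):
    """Lenient HTML parser that records every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def html_tags(text):
    """Return the start tags found when the text is read as HTML."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.tags


if __name__ == "__main__":
    print(parses_as_xml(PAGE))
    print(html_tags(PAGE))
```

If the same thing is happening in the loader, the content is being handed to an XML parser when it should be treated as HTML, but that is a guess until someone looks at what the loader does around line 147.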
On Nov 14, 2013, at 10:20, Erick Erickson <erickerick...@gmail.com> wrote:

> It looks like bad data. The XML you're sending to Solr looks mal-formed, so I
> suspect this is completely outside of Solr's purview.
>
> Best,
> Erick
>
>
> On Thu, Nov 14, 2013 at 9:26 AM, Marcello Lorenzi <mlore...@sorint.it> wrote:
>
>> Hi,
>> I have installed a Solr 4.3 instance and we have configured manifoldcf to
>> pass web content to the shard collection, but during the crawling we have
>> noticed a lot of this exception:
>>
>> ERROR - 2013-11-14 15:13:57.954; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: org.apache.tika.exception.TikaException: XML parse error
>>     at com.lsegroup.solr.handler.CwsExtractingDocumentLoader.load(CwsExtractingDocumentLoader.java:150)
>>     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>     at org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:242)
>>     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1816)
>>     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:656)
>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:359)
>>     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:155)
>>     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:241)
>>     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:208)
>>     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:221)
>>     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:107)
>>     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:155)
>>     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:76)
>>     at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:934)
>>     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:90)
>>     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:515)
>>     at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:1012)
>>     at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:642)
>>     at org.apache.coyote.http11.Http11NioProtocol$Http11ConnectionHandler.process(Http11NioProtocol.java:223)
>>     at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1597)
>>     at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.run(NioEndpoint.java:1555)
>>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>     at java.lang.Thread.run(Thread.java:724)
>> Caused by: org.apache.tika.exception.TikaException: XML parse error
>>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:78)
>>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
>>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
>>     at com.lsegroup.solr.handler.CwsExtractingDocumentLoader.load(CwsExtractingDocumentLoader.java:147)
>>     ... 24 more
>> Caused by: org.xml.sax.SAXParseException; lineNumber: 91; columnNumber: 105; The element type "img" must be terminated by the matching end-tag "</img>".
>>     at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.createSAXParseException(ErrorHandlerWrapper.java:198)
>>     at com.sun.org.apache.xerces.internal.util.ErrorHandlerWrapper.fatalError(ErrorHandlerWrapper.java:177)
>>     at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:441)
>>     at com.sun.org.apache.xerces.internal.impl.XMLErrorReporter.reportError(XMLErrorReporter.java:368)
>>     at com.sun.org.apache.xerces.internal.impl.XMLScanner.reportFatalError(XMLScanner.java:1388)
>>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanEndElement(XMLDocumentFragmentScannerImpl.java:1753)
>>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl$FragmentContentDriver.next(XMLDocumentFragmentScannerImpl.java:2951)
>>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentScannerImpl.next(XMLDocumentScannerImpl.java:606)
>>     at com.sun.org.apache.xerces.internal.impl.XMLNSDocumentScannerImpl.next(XMLNSDocumentScannerImpl.java:116)
>>     at com.sun.org.apache.xerces.internal.impl.XMLDocumentFragmentScannerImpl.scanDocument(XMLDocumentFragmentScannerImpl.java:511)
>>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:846)
>>     at com.sun.org.apache.xerces.internal.parsers.XML11Configuration.parse(XML11Configuration.java:775)
>>     at com.sun.org.apache.xerces.internal.parsers.XMLParser.parse(XMLParser.java:123)
>>     at com.sun.org.apache.xerces.internal.parsers.AbstractSAXParser.parse(AbstractSAXParser.java:1210)
>>     at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl$JAXPSAXParser.parse(SAXParserImpl.java:628)
>>     at com.sun.org.apache.xerces.internal.jaxp.SAXParserImpl.parse(SAXParserImpl.java:332)
>>     at javax.xml.parsers.SAXParser.parse(SAXParser.java:195)
>>     at org.apache.tika.parser.xml.XMLParser.parse(XMLParser.java:72)
>>     ... 28 more
>>
>> Could it be not configured correctly the SOLR collection?
>>
>> Thanks,
>> Marcello
>>