Hi, I am using the Apache POI parser to parse a Word document and extract its text content, which I then pass to Solr. The Word document contains many pictures, graphs and tables. When I pass the extracted content to Solr, indexing fails. Here is the exception trace.
09:31:04,516 ERROR [STDERR] Mar 14, 2009 9:31:04 AM org.apache.solr.common.SolrException log
SEVERE: [com.ctc.wstx.exc.WstxLazyException] com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x7) not a valid XML character
 at [row,col {unknown-source}]: [40,18]
    at com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
    at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:729)
    at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3659)
    at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
    at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:327)
    at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
    at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
    at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
    at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
    at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
    at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
    at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
    at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
    at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:235)
    at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
    at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:190)
    at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:92)
    at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.process(SecurityContextEstablishmentValve.java:126)
    at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.invoke(SecurityContextEstablishmentValve.java:70)
    at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
    at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
    at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:158)
    at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
    at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:330)
    at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:828)
    at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:601)
    at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
    at java.lang.Thread.run(Thread.java:595)
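A note on the Solr trace: code 0x7 is a control character that XML 1.0 does not permit, even when written as a character entity. Word's binary format commonly uses it as a table cell mark, so text extracted from a document with tables is a plausible source, though the trace alone does not confirm that. The following is a minimal sketch, not a definitive fix, of replacing such characters with a space before the text is placed in the Solr update XML. The class name `XmlTextSanitizer` and the method `replaceInvalidXmlChars` are hypothetical and do not come from POI or Solr.

```java
// Hypothetical helper: not part of POI or Solr.
final class XmlTextSanitizer {

    private XmlTextSanitizer() {}

    // True if the code point is allowed by the XML 1.0 "Char" production.
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Replaces every code point that XML 1.0 forbids with a single space,
    // so that words on either side of a removed mark do not run together.
    static String replaceInvalidXmlChars(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(text.length());
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(' ');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Simulated extractor output containing two 0x7 marks.
        String raw = "Cell A\u0007Cell B\u0007 end of row";
        System.out.println(replaceInvalidXmlChars(raw));
    }
}
```

Under this assumption, the string returned by `we.getText()` would be passed through `replaceInvalidXmlChars` before it is sent to Solr. Whether a space or outright removal is the better replacement depends on how the field is analyzed.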
A second exception, this one from POI, is also being thrown:

09:31:04,828 ERROR [STDERR] java.io.IOException: Unable to read entire header; 130 bytes read; expected 512 bytes
09:31:04,828 ERROR [STDERR]     at org.apache.poi.poifs.storage.HeaderBlockReader.alertShortRead(HeaderBlockReader.java:130)
09:31:04,843 ERROR [STDERR]     at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:94)
09:31:04,843 ERROR [STDERR]     at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:151)
09:31:04,843 ERROR [STDERR]     at org.apache.poi.hwpf.HWPFDocument.verifyAndBuildPOIFS(HWPFDocument.java:133)
09:31:04,843 ERROR [STDERR]     at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:51)
09:31:04,859 ERROR [STDERR]     at com.apple.servlet.SearchApplicationServlet.parseWordFile(SearchApplicationServlet.java:963)
09:31:04,859 ERROR [STDERR]     at com.apple.servlet.SearchApplicationServlet.indexDirectory(SearchApplicationServlet.java:813)
09:31:04,859 ERROR [STDERR]     at com.apple.servlet.SearchApplicationServlet.index(SearchApplicationServlet.java:747)
09:31:04,859 ERROR [STDERR]     at com.apple.servlet.SearchApplicationServlet.processAdd(SearchApplicationServlet.java:331)
09:31:04,874 ERROR [STDERR]     at com.apple.servlet.SearchApplicationServlet.doGet(SearchApplicationServlet.java:160)
09:31:04,874 ERROR [STDERR]     at com.apple.servlet.SearchApplicationServlet.doPost(SearchApplicationServlet.java:306)
09:31:04,874 ERROR [STDERR]     at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
09:31:04,874 ERROR [STDERR]     at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
09:31:04,874 ERROR [STDERR]     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
09:31:04,890 ERROR [STDERR]     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
09:31:04,890 ERROR [STDERR]     at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
09:31:04,890 ERROR [STDERR]     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
09:31:04,890 ERROR [STDERR]     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
09:31:04,906 ERROR [STDERR]     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:235)
09:31:04,906 ERROR [STDERR]     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
09:31:04,906 ERROR [STDERR]     at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:190)
09:31:04,906 ERROR [STDERR]     at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:92)
09:31:04,906 ERROR [STDERR]     at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.process(SecurityContextEstablishmentValve.java:126)
09:31:04,921 ERROR [STDERR]     at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.invoke(SecurityContextEstablishmentValve.java:70)
09:31:04,921 ERROR [STDERR]     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
09:31:04,921 ERROR [STDERR]     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
09:31:04,968 ERROR [STDERR]     at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:158)
09:31:04,968 ERROR [STDERR]     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
09:31:04,968 ERROR [STDERR]     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:330)
09:31:04,968 ERROR [STDERR]     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:828)
09:31:04,968 ERROR [STDERR]     at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:601)
09:31:04,984 ERROR [STDERR]     at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
09:31:04,984 ERROR [STDERR]     at java.lang.Thread.run(Thread.java:595)

The source code is below.

    private static String parseWordFile(File f) {
        String text = null;
        try {
            WordExtractor we = new WordExtractor(new FileInputStream(f));
            text = we.getText();
        } catch (Exception ex) {
            System.out.println("exception occured for ::" + f.getName());
            ex.printStackTrace();
        }
        return text;
    }

Here WordExtractor belongs to the package org.apache.poi.hwpf.extractor.

I would highly appreciate quick help in resolving this.

Regards,
Suryasnat Das
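A note on the POI trace: the message says only 130 bytes could be read where a 512-byte header was expected, which suggests that at least one file handed to `WordExtractor` during directory indexing was shorter than an OLE2 header, for example a stub or temporary file rather than a real .doc (a guess; the trace does not name the file's contents). Below is a rough sketch of a pre-check that skips such files before parsing. The class `Ole2PreCheck` and method `looksLikeOle2` are hypothetical names; the 512-byte header size comes from the error message, and the eight signature bytes are the standard OLE2 compound-document magic number.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper: not part of POI.
final class Ole2PreCheck {

    // POI reported that it expects a 512-byte header block.
    private static final int HEADER_SIZE = 512;

    // Standard OLE2 compound-document signature.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, (byte) 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, (byte) 0x1A, (byte) 0xE1
    };

    private Ole2PreCheck() {}

    // Returns true only if the file is at least one header block long
    // and starts with the OLE2 signature.
    static boolean looksLikeOle2(File f) throws IOException {
        if (!f.isFile() || f.length() < HEADER_SIZE) {
            return false;
        }
        InputStream in = new FileInputStream(f);
        try {
            byte[] head = new byte[OLE2_MAGIC.length];
            int read = 0;
            while (read < head.length) {
                int n = in.read(head, read, head.length - read);
                if (n < 0) {
                    return false; // file ended before the signature was complete
                }
                read += n;
            }
            for (int i = 0; i < OLE2_MAGIC.length; i++) {
                if (head[i] != OLE2_MAGIC[i]) {
                    return false;
                }
            }
            return true;
        } finally {
            in.close(); // always release the file handle
        }
    }
}
```

With such a check, `parseWordFile` could return early (and log the file name) when `looksLikeOle2(f)` is false, instead of letting the constructor throw. The sketch also closes its stream in a `finally` block; the `FileInputStream` in `parseWordFile` is never closed, which may be worth fixing separately.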