Hi,

I am using the Apache POI parser to extract the text content of a Word
document, and I then pass that text to Solr. The Word document contains
many pictures, graphs, and tables. When I pass the content to Solr,
indexing fails. Here is the exception trace:

09:31:04,516 ERROR [STDERR] Mar 14, 2009 9:31:04 AM org.apache.solr.common.SolrException log
SEVERE: [com.ctc.wstx.exc.WstxLazyException] com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x7) not a valid XML character
 at [row,col {unknown-source}]: [40,18]
        at com.ctc.wstx.exc.WstxLazyException.throwLazily(WstxLazyException.java:45)
        at com.ctc.wstx.sr.StreamScanner.throwLazyError(StreamScanner.java:729)
        at com.ctc.wstx.sr.BasicStreamReader.safeFinishToken(BasicStreamReader.java:3659)
        at com.ctc.wstx.sr.BasicStreamReader.getText(BasicStreamReader.java:809)
        at org.apache.solr.handler.XmlUpdateRequestHandler.readDoc(XmlUpdateRequestHandler.java:327)
        at org.apache.solr.handler.XmlUpdateRequestHandler.processUpdate(XmlUpdateRequestHandler.java:195)
        at org.apache.solr.handler.XmlUpdateRequestHandler.handleRequestBody(XmlUpdateRequestHandler.java:123)
        at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:131)
        at org.apache.solr.core.SolrCore.execute(SolrCore.java:1204)
        at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:303)
        at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:232)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
        at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
        at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
        at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:235)
        at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
        at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:190)
        at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:92)
        at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.process(SecurityContextEstablishmentValve.java:126)
        at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.invoke(SecurityContextEstablishmentValve.java:70)
        at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
        at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
        at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:158)
        at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
        at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:330)
        at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:828)
        at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:601)
        at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
        at java.lang.Thread.run(Thread.java:595)
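
The parser message says that the text sent to Solr contains the character
with code 0x7, which XML 1.0 does not allow even as a character entity.
One possible workaround, not verified against this setup, is to replace
such characters before building the update request. The sketch below is
only an illustration: the class and method names are hypothetical, and
substituting a space (rather than dropping the character) is an assumption
made so that neighbouring words do not run together.

```java
// Hypothetical helper, not part of POI or Solr.
class XmlTextSanitizer {

    // True if the code point is allowed by the XML 1.0 "Char" production:
    // #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
    static boolean isValidXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    // Returns a copy of the text in which every character that XML 1.0
    // forbids (for example 0x7) is replaced by a single space.
    static String stripInvalidXmlChars(String text) {
        if (text == null) {
            return null;
        }
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (isValidXmlChar(cp)) {
                out.appendCodePoint(cp);
            } else {
                out.append(' ');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The 0x7 between the two words is replaced by a space.
        System.out.println(stripInvalidXmlChars("Cell A\u0007Cell B")); // prints "Cell A Cell B"
    }
}
```

With a helper like this, the string returned by the extractor would be
passed through stripInvalidXmlChars before it is placed in the Solr
document.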

POI also throws a second error:

09:31:04,828 ERROR [STDERR] java.io.IOException: Unable to read entire header; 130 bytes read; expected 512 bytes
09:31:04,828 ERROR [STDERR]     at org.apache.poi.poifs.storage.HeaderBlockReader.alertShortRead(HeaderBlockReader.java:130)
09:31:04,843 ERROR [STDERR]     at org.apache.poi.poifs.storage.HeaderBlockReader.<init>(HeaderBlockReader.java:94)
09:31:04,843 ERROR [STDERR]     at org.apache.poi.poifs.filesystem.POIFSFileSystem.<init>(POIFSFileSystem.java:151)
09:31:04,843 ERROR [STDERR]     at org.apache.poi.hwpf.HWPFDocument.verifyAndBuildPOIFS(HWPFDocument.java:133)
09:31:04,843 ERROR [STDERR]     at org.apache.poi.hwpf.extractor.WordExtractor.<init>(WordExtractor.java:51)
09:31:04,859 ERROR [STDERR]     at com.apple.servlet.SearchApplicationServlet.parseWordFile(SearchApplicationServlet.java:963)
09:31:04,859 ERROR [STDERR]     at com.apple.servlet.SearchApplicationServlet.indexDirectory(SearchApplicationServlet.java:813)
09:31:04,859 ERROR [STDERR]     at com.apple.servlet.SearchApplicationServlet.index(SearchApplicationServlet.java:747)
09:31:04,859 ERROR [STDERR]     at com.apple.servlet.SearchApplicationServlet.processAdd(SearchApplicationServlet.java:331)
09:31:04,874 ERROR [STDERR]     at com.apple.servlet.SearchApplicationServlet.doGet(SearchApplicationServlet.java:160)
09:31:04,874 ERROR [STDERR]     at com.apple.servlet.SearchApplicationServlet.doPost(SearchApplicationServlet.java:306)
09:31:04,874 ERROR [STDERR]     at javax.servlet.http.HttpServlet.service(HttpServlet.java:710)
09:31:04,874 ERROR [STDERR]     at javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
09:31:04,874 ERROR [STDERR]     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:290)
09:31:04,890 ERROR [STDERR]     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
09:31:04,890 ERROR [STDERR]     at org.jboss.web.tomcat.filters.ReplyHeaderFilter.doFilter(ReplyHeaderFilter.java:96)
09:31:04,890 ERROR [STDERR]     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:235)
09:31:04,890 ERROR [STDERR]     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:206)
09:31:04,906 ERROR [STDERR]     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:235)
09:31:04,906 ERROR [STDERR]     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:191)
09:31:04,906 ERROR [STDERR]     at org.jboss.web.tomcat.security.SecurityAssociationValve.invoke(SecurityAssociationValve.java:190)
09:31:04,906 ERROR [STDERR]     at org.jboss.web.tomcat.security.JaccContextValve.invoke(JaccContextValve.java:92)
09:31:04,906 ERROR [STDERR]     at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.process(SecurityContextEstablishmentValve.java:126)
09:31:04,921 ERROR [STDERR]     at org.jboss.web.tomcat.security.SecurityContextEstablishmentValve.invoke(SecurityContextEstablishmentValve.java:70)
09:31:04,921 ERROR [STDERR]     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:127)
09:31:04,921 ERROR [STDERR]     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:102)
09:31:04,968 ERROR [STDERR]     at org.jboss.web.tomcat.service.jca.CachedConnectionValve.invoke(CachedConnectionValve.java:158)
09:31:04,968 ERROR [STDERR]     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:109)
09:31:04,968 ERROR [STDERR]     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:330)
09:31:04,968 ERROR [STDERR]     at org.apache.coyote.http11.Http11Processor.process(Http11Processor.java:828)
09:31:04,968 ERROR [STDERR]     at org.apache.coyote.http11.Http11Protocol$Http11ConnectionHandler.process(Http11Protocol.java:601)
09:31:04,984 ERROR [STDERR]     at org.apache.tomcat.util.net.JIoEndpoint$Worker.run(JIoEndpoint.java:447)
09:31:04,984 ERROR [STDERR]     at java.lang.Thread.run(Thread.java:595)
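
According to this message, one of the files handed to the extractor is only
130 bytes long, while POIFS expects a 512-byte header, so that file cannot
be a complete binary Word document. A pre-check along the following lines
might let the indexer skip such files instead of failing on them. This is a
rough sketch under assumptions: the class and method names are hypothetical,
and it only tests the file length and the eight-byte OLE2 signature, which
does not prove that the file is actually a Word document.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper, not part of POI.
class Ole2Check {

    // Signature found at the start of every OLE2 compound document,
    // the container format used by binary .doc files.
    static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Header size taken from the "expected 512 bytes" error message.
    static final int HEADER_SIZE = 512;

    // Pure check: the file is long enough to hold a header and its
    // first bytes match the OLE2 signature.
    static boolean hasOle2Header(byte[] head, long totalLength) {
        if (totalLength < HEADER_SIZE || head.length < OLE2_MAGIC.length) {
            return false;
        }
        for (int i = 0; i < OLE2_MAGIC.length; i++) {
            if (head[i] != OLE2_MAGIC[i]) {
                return false;
            }
        }
        return true;
    }

    // Reads the first eight bytes of the file and applies the check above.
    static boolean looksLikeOle2(File f) throws IOException {
        if (!f.isFile()) {
            return false;
        }
        byte[] head = new byte[OLE2_MAGIC.length];
        InputStream in = new FileInputStream(f);
        try {
            int n = 0;
            while (n < head.length) {
                int r = in.read(head, n, head.length - n);
                if (r < 0) {
                    return false; // file ended before the signature was complete
                }
                n += r;
            }
        } finally {
            in.close();
        }
        return hasOle2Header(head, f.length());
    }

    public static void main(String[] args) throws IOException {
        for (String path : args) {
            System.out.println(path + " -> " + looksLikeOle2(new File(path)));
        }
    }
}
```

A directory walk could call looksLikeOle2(f) before constructing the
extractor and log the name of any file that fails the check.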

Here is the source code:

private static String parseWordFile(File f) {
    String text = null;
    try {
        WordExtractor we = new WordExtractor(new FileInputStream(f));
        text = we.getText();
    } catch (Exception ex) {
        System.out.println("exception occured for ::" + f.getName());
        ex.printStackTrace();
    }
    return text;
}
Here WordExtractor is the class from the package org.apache.poi.hwpf.extractor.

I would greatly appreciate quick help in resolving this.

Regards
Suryasnat Das
