Re: Indexing PDF and MS Office files

Andrea Gazzarini Tue, 14 Apr 2015 10:10:26 -0700

It seems something like https://issues.apache.org/jira/browse/TIKA-1251.

I see you're using Solr 4.10.2 which uses Tika 1.5 and that issue seemsto be fixed in Tika 1.6.


I agree with Erik: you should try with another version of Tika.

Best,
Andrea

On 04/14/2015 06:44 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:

Andrea,

Yes, I am using the stock schema.xml that comes with the example server of
Solr-4.10.2 Hence not sure why the PDF content is not getting extracted and
put into the content field in the index.

Please find the log information for the Parsing error below.


org.apache.solr.common.SolrException; org.apache.solr.common.SolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser@138b0c5
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368)
at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:953)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.TikaException: Unexpected
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
... 32 more
Caused by: java.lang.IllegalArgumentException: This paragraph is not the
first one in the table
at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:932)
at
org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:188)
at
org.apache.tika.parser.microsoft.WordExtractor.handleHeaderFooter(WordExtractor.java:172)
at
org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:98)
at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199)
at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
... 35 more

ERROR - 2015-04-14 14:51:21.151; org.apache.solr.common.SolrException;
null:org.apache.solr.common.SolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser@138b0c5
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:225)
at
org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
at
org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
at
org.apache.solr.core.RequestHandlers$LazyRequestHandlerWrapper.handleRequest(RequestHandlers.java:246)
at org.apache.solr.core.SolrCore.execute(SolrCore.java:1967)
at
org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:777)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:418)
at
org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:207)
at
org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1419)
at
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:455)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:137)
at
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:557)
at
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:231)
at
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1075)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:384)
at
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:193)
at
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1009)
at
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:135)
at
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:255)
at
org.eclipse.jetty.server.handler.HandlerCollection.handle(HandlerCollection.java:154)
at
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:116)
at org.eclipse.jetty.server.Server.handle(Server.java:368)
at
org.eclipse.jetty.server.AbstractHttpConnection.handleRequest(AbstractHttpConnection.java:489)
at
org.eclipse.jetty.server.BlockingHttpConnection.handleRequest(BlockingHttpConnection.java:53)
at
org.eclipse.jetty.server.AbstractHttpConnection.content(AbstractHttpConnection.java:953)
at
org.eclipse.jetty.server.AbstractHttpConnection$RequestHandler.content(AbstractHttpConnection.java:1014)
at org.eclipse.jetty.http.HttpParser.parseNext(HttpParser.java:953)
at org.eclipse.jetty.http.HttpParser.parseAvailable(HttpParser.java:240)
at
org.eclipse.jetty.server.BlockingHttpConnection.handle(BlockingHttpConnection.java:72)
at
org.eclipse.jetty.server.bio.SocketConnector$ConnectorEndPoint.run(SocketConnector.java:264)
at
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:608)
at
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:543)
at java.lang.Thread.run(Unknown Source)
Caused by: org.apache.tika.exception.TikaException: Unexpected
RuntimeException from org.apache.tika.parser.microsoft.OfficeParser@138b0c5
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:244)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:120)
at
org.apache.solr.handler.extraction.ExtractingDocumentLoader.load(ExtractingDocumentLoader.java:219)
... 32 more
Caused by: java.lang.IllegalArgumentException: This paragraph is not the
first one in the table
at org.apache.poi.hwpf.usermodel.Range.getTable(Range.java:932)
at
org.apache.tika.parser.microsoft.WordExtractor.handleParagraph(WordExtractor.java:188)
at
org.apache.tika.parser.microsoft.WordExtractor.handleHeaderFooter(WordExtractor.java:172)
at
org.apache.tika.parser.microsoft.WordExtractor.parse(WordExtractor.java:98)
at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:199)
at
org.apache.tika.parser.microsoft.OfficeParser.parse(OfficeParser.java:167)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:242)
... 35 more


Thanks & Regards
Vijay


On 14 April 2015 at 17:06, Vijaya Narayana Reddy Bhoomi Reddy <
vijaya.bhoomire...@whishworks.com> wrote:

Hi,

Here are the solr-config xml and the error log from Solr logs for your
reference. As mentioned earlier, I didnt make any changes to the
solr-config.xml as I am using the xml file out of the box one that came
with the default installation.

Please let me know your thoughts on why these issues are occuring.

Thanks & Regards
Vijay


*Vijay Bhoomireddy*, Big Data Architect

1000 Great West Road, Brentford, London, TW8 9DW
*T:  +44 20 3475 7980*
*M: **+44 7481 298 360*
*W: *ww <http://www.whishworks.com/>w.whishworks.com
<http://www.whishworks.com/>

<https://www.linkedin.com/company/whishworks>
<http://www.whishworks.com/blog/>  <https://twitter.com/WHISHWORKS>
<https://www.facebook.com/whishworksit>

On 14 April 2015 at 15:57, Vijaya Narayana Reddy Bhoomi Reddy <
vijaya.bhoomire...@whishworks.com> wrote:

Hi,

I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
.pptx, .xlx, and .xlx) files into Solr. I am facing the following issues.
Request to please let me know what is going wrong with the indexing
process.

I am using solr 4.10.2 and using the default example server configuration
that comes with Solr distribution.

PDF Files - Indexing as such works fine, but when I query using *.* in
the Solr Query console, metadata information is displayed properly.
However, the PDF content field is empty. This is happening for all PDF
files I have tried. I have tried with some proprietary files, PDF eBooks
etc. Whatever be the PDF file, content is not being displayed.

MS Office files -  For some office files, everything works perfect and
the extracted content is visible in the query console. However, for others,
I see the below error message during the indexing process.

*Exception in thread "main"
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser*


I am using SolrJ to index the documents and below is the code snippet
related to indexing. Please let me know where the issue is occurring.

                         static String solrServerURL = "
http://localhost:8983/solr";;
static SolrServer solrServer = new HttpSolrServer(solrServerURL);
                         static ContentStreamUpdateRequest indexingReq =
new
         ContentStreamUpdateRequest("/update/extract");

                         indexingReq.addFile(file, fileType);
indexingReq.setParam("literal.id", literalId);
indexingReq.setParam("uprefix", "attr_");
indexingReq.setParam("fmap.content", "content");
indexingReq.setParam("literal.fileurl", fileURL);
indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solrServer.request(indexingReq);

Thanks & Regards
Vijay

Re: Indexing PDF and MS Office files

Reply via email to