Re: Indexing PDF and MS Office files

Jack Krupansky Tue, 14 Apr 2015 20:19:12 -0700

Try doing a manual extraction request directly to Solr (not via SolrJ) and
use the extractOnly option to see if the content is actually extracted.


See:
https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
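
A minimal sketch of such a manual check, using only the JDK HTTP client instead of SolrJ. The class and method names (ExtractOnlyCheck, extractOnlyUrl, postFile) are hypothetical, the base URL is the one from the original message, and the default content type is an assumption. extractOnly=true asks Solr Cell to return what Tika extracted without indexing anything, so an empty body in the response would suggest the problem is in extraction rather than in the field mapping.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ExtractOnlyCheck {

    // Builds the extract-only request URL for the Solr Cell handler.
    // extractFormat=text requests plain text instead of the default XHTML.
    static String extractOnlyUrl(String solrBaseUrl) {
        return solrBaseUrl
                + "/update/extract?extractOnly=true&extractFormat=text&wt=json&indent=true";
    }

    // Reads a stream fully into a UTF-8 string; a null stream yields "".
    static String readAll(InputStream in) throws IOException {
        if (in == null) {
            return "";
        }
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            buf.write(chunk, 0, n);
        }
        return new String(buf.toByteArray(), StandardCharsets.UTF_8);
    }

    // POSTs the raw file bytes to the given URL and returns the response body
    // (the error body if Solr answers with a 4xx/5xx status).
    static String postFile(String url, Path file, String contentType) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", contentType);
        try (OutputStream out = conn.getOutputStream()) {
            Files.copy(file, out);
        }
        int status = conn.getResponseCode();
        try (InputStream in = status < 400 ? conn.getInputStream() : conn.getErrorStream()) {
            return readAll(in);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: ExtractOnlyCheck <file> [content-type]");
            return;
        }
        // Assumption: Solr runs at the URL used in the original message.
        String url = extractOnlyUrl("http://localhost:8983/solr");
        // Assumption: default to PDF when no content type is given.
        String contentType = args.length > 1 ? args[1] : "application/pdf";
        System.out.println(postFile(url, Paths.get(args[0]), contentType));
    }
}
```

Run it against one of the PDFs that indexes with an empty content field and against one of the Office files that triggers the OfficeParser exception, then inspect the returned text and any error message.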

Also, some PDF files actually have the content as a bitmap image, so no
text is extracted.


-- Jack Krupansky

On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy <
vijaya.bhoomire...@whishworks.com> wrote:

> Hi,
>
> I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
> .pptx, .xls, and .xlsx) into Solr and am facing the following issues.
> Please let me know what is going wrong with the indexing process.
>
> I am using Solr 4.10.2 with the default example server configuration
> that comes with the Solr distribution.
>
> PDF files - Indexing itself works fine, and when I query using *:* in the
> Solr Query console, the metadata is displayed properly. However, the PDF
> content field is empty. This happens for every PDF file I have tried,
> including some proprietary files and PDF eBooks.
>
> MS Office files - For some Office files, everything works perfectly and the
> extracted content is visible in the query console. For others, however, I
> see the error message below during indexing.
>
> Exception in thread "main"
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.microsoft.OfficeParser
>
>
> I am using SolrJ to index the documents and below is the code snippet
> related to indexing. Please let me know where the issue is occurring.
>
> static String solrServerURL = "http://localhost:8983/solr";
> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
> static ContentStreamUpdateRequest indexingReq =
>         new ContentStreamUpdateRequest("/update/extract");
>
> indexingReq.addFile(file, fileType);
> indexingReq.setParam("literal.id", literalId);
> indexingReq.setParam("uprefix", "attr_");
> indexingReq.setParam("fmap.content", "content");
> indexingReq.setParam("literal.fileurl", fileURL);
> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
> solrServer.request(indexingReq);
>
> Thanks & Regards
> Vijay
