Re: Indexing PDF and MS Office files

Andrea Gazzarini Tue, 14 Apr 2015 08:27:59 -0700

Hi Vijay,

Please paste an extract of your schema, where the "content" field (the field where the PDF text should be) and its type are declared.

For the other issue, please paste the whole stacktrace, because

org.apache.tika.parser.microsoft.OfficeParser

on its own says nothing. The complete stacktrace (or at least another three or four lines) should contain some other detail.
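
If it helps, here is a minimal sketch of one way to capture a complete stacktrace, including the "Caused by:" sections, as a string you can paste into a reply. It uses only the JDK; the class name StackTraceDump and the helper fullTrace are made up for illustration, and the exception thrown in main is just a stand-in for whatever your indexing call throws. Note that, as far as I know, a RemoteSolrException on the client usually carries only the server's error message, so the Tika part of the trace is more likely to be found in the Solr server log.

```java
import java.io.PrintWriter;
import java.io.StringWriter;

// Hypothetical helper class, not part of Solr or SolrJ.
class StackTraceDump {

    // Renders the full stack trace of t, including nested causes, as a String.
    static String fullTrace(Throwable t) {
        StringWriter buffer = new StringWriter();
        PrintWriter writer = new PrintWriter(buffer, true);
        t.printStackTrace(writer);
        writer.flush();
        return buffer.toString();
    }

    public static void main(String[] args) {
        try {
            // Stand-in for the real indexing call (an assumption for this sketch).
            throw new RuntimeException("outer",
                    new IllegalStateException("inner cause"));
        } catch (RuntimeException e) {
            // Print everything rather than only e.getMessage().
            System.out.println(fullTrace(e));
        }
    }
}
```

In your own code the try block would wrap the solrServer.request(...) call, and the catch block would log or print the result of fullTrace(e).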


Best,
Andrea

On 04/14/2015 04:57 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:

Hi,

I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
.pptx, .xls, and .xlsx) into Solr. I am facing the following issues.
Please let me know what is going wrong with the indexing process.

I am using Solr 4.10.2 with the default example server configuration
that comes with the Solr distribution.

PDF files - Indexing as such works fine, and when I query using *:* in the
Solr Query console, the metadata is displayed properly. However, the PDF
content field is empty. This happens for every PDF file I have tried,
including some proprietary files and PDF eBooks. Whatever the PDF file,
the content is not displayed.

MS Office files - For some Office files, everything works perfectly and the
extracted content is visible in the query console. However, for others, I
see the error message below during the indexing process.

*Exception in thread "main"
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from
org.apache.tika.parser.microsoft.OfficeParser*


I am using SolrJ to index the documents and below is the code snippet
related to indexing. Please let me know where the issue is occurring.

static String solrServerURL = "http://localhost:8983/solr";
static SolrServer solrServer = new HttpSolrServer(solrServerURL);
static ContentStreamUpdateRequest indexingReq =
        new ContentStreamUpdateRequest("/update/extract");

indexingReq.addFile(file, fileType);
indexingReq.setParam("literal.id", literalId);
indexingReq.setParam("uprefix", "attr_");
indexingReq.setParam("fmap.content", "content");
indexingReq.setParam("literal.fileurl", fileURL);
indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
solrServer.request(indexingReq);

Thanks & Regards
Vijay
