Vijay, you could try Excel files in different formats to rule out a problem with the Tika version being used.
Thanks
Murthy

On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes <trhodes...@gmail.com> wrote:

> Perhaps the PDF is protected and the content cannot be extracted?
>
> I have an unverified suspicion that the Tika shipped with Solr 4.10.2 may
> not support some/all Office 2013 document formats.
>
> On 4/14/2015 8:18 PM, Jack Krupansky wrote:
>
>> Try doing a manual extraction request directly to Solr (not via SolrJ) and
>> use the extractOnly option to see if the content is actually extracted.
>>
>> See:
>> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
>>
>> Also, some PDF files actually have the content as a bitmap image, so no
>> text is extracted.
>>
>> -- Jack Krupansky
>>
>> On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy <
>> vijaya.bhoomire...@whishworks.com> wrote:
>>
>>> Hi,
>>>
>>> I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
>>> .pptx, .xls, and .xlsx) into Solr. I am facing the following issues.
>>> Please let me know what is going wrong with the indexing process.
>>>
>>> I am using Solr 4.10.2 with the default example server configuration
>>> that comes with the Solr distribution.
>>>
>>> PDF files - Indexing as such works fine, and when I query using *.* in
>>> the Solr Query console, metadata information is displayed properly.
>>> However, the PDF content field is empty. This is happening for all PDF
>>> files I have tried. I have tried with some proprietary files, PDF eBooks
>>> etc. Whatever the PDF file, content is not being displayed.
>>>
>>> MS Office files - For some Office files, everything works perfectly and
>>> the extracted content is visible in the query console. However, for
>>> others, I see the below error message during the indexing process.
>>>
>>> *Exception in thread "main"
>>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>>> org.apache.tika.exception.TikaException: Unexpected RuntimeException
>>> from org.apache.tika.parser.microsoft.OfficeParser*
>>>
>>> I am using SolrJ to index the documents and below is the code snippet
>>> related to indexing. Please let me know where the issue is occurring.
>>>
>>> static String solrServerURL = "http://localhost:8983/solr";
>>> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
>>> static ContentStreamUpdateRequest indexingReq =
>>>         new ContentStreamUpdateRequest("/update/extract");
>>>
>>> indexingReq.addFile(file, fileType);
>>> indexingReq.setParam("literal.id", literalId);
>>> indexingReq.setParam("uprefix", "attr_");
>>> indexingReq.setParam("fmap.content", "content");
>>> indexingReq.setParam("literal.fileurl", fileURL);
>>> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
>>> solrServer.request(indexingReq);
>>>
>>> Thanks & Regards
>>> Vijay
>>>
>>> --
>>> The contents of this e-mail are confidential and for the exclusive use of
>>> the intended recipient. If you receive this e-mail in error please delete
>>> it from your system immediately and notify us either by e-mail or
>>> telephone. You should not copy, forward or otherwise disclose the content
>>> of the e-mail. The views expressed in this communication may not
>>> necessarily be the view held by WHISHWORKS.

--
Ph: 9845704792
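
For anyone who wants to try the extractOnly check suggested in the thread without going through SolrJ, here is a minimal sketch using only the JDK's HTTP client (Java 11+). The class and method names (ExtractOnlyCheck, extractOnlyUri, buildRequest) are made up for illustration, and the base URL assumes the default example server at http://localhost:8983/solr as in the original post. extractOnly, extractFormat and wt are Solr Cell request parameters; whether text comes back will depend on the file and on the Tika version bundled with your Solr.

```java
import java.io.FileNotFoundException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

// Hypothetical helper, not part of Solr or SolrJ.
final class ExtractOnlyCheck {

    // Builds the extract URL; extractOnly=true asks Solr Cell to return
    // the extracted content instead of indexing the document.
    static URI extractOnlyUri(String solrBaseUrl) {
        return URI.create(solrBaseUrl
                + "/update/extract?extractOnly=true&extractFormat=text&wt=json");
    }

    // Posts the raw file bytes as the request body, much like
    // curl --data-binary @file -H 'Content-Type: ...' would.
    static HttpRequest buildRequest(String solrBaseUrl, Path file, String contentType)
            throws FileNotFoundException {
        return HttpRequest.newBuilder(extractOnlyUri(solrBaseUrl))
                .header("Content-Type", contentType)
                .POST(HttpRequest.BodyPublishers.ofFile(file))
                .build();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: ExtractOnlyCheck <file> <content-type>");
            return;
        }
        // Assumption: default example server from the original post.
        HttpRequest request =
                buildRequest("http://localhost:8983/solr", Path.of(args[0]), args[1]);
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        // An empty content section here suggests extraction itself is the
        // problem (protected or image-only PDF, unsupported format),
        // rather than the SolrJ indexing code.
        System.out.println("HTTP " + response.statusCode());
        System.out.println(response.body());
    }
}
```

Running it against a PDF with content type application/pdf and then against one of the failing Office files should show whether Tika returns any text at all for each.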