Try doing a manual extraction request directly to Solr (not via SolrJ) and use the extractOnly option to see if the content is actually extracted.
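For what it's worth, a manual request like that could look roughly like the sketch below, which uses plain java.net rather than SolrJ. It assumes Solr is listening at http://localhost:8983/solr (the URL from your snippet) and that the stock /update/extract handler is enabled. The class name ExtractOnlyCheck and its helper methods are made up for illustration. With extractOnly=true nothing is indexed; Solr just returns what Tika extracted.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class, not part of Solr or SolrJ.
class ExtractOnlyCheck {

    // Builds the extract-only URL. extractOnly=true returns the Tika output
    // without indexing; extractFormat=text asks for plain text instead of XHTML.
    static String extractOnlyUrl(String solrBaseUrl) {
        String base = solrBaseUrl.endsWith("/")
                ? solrBaseUrl.substring(0, solrBaseUrl.length() - 1)
                : solrBaseUrl;
        return base + "/update/extract?extractOnly=true&extractFormat=text&wt=json&indent=true";
    }

    // Posts the raw file bytes to the given URL and returns the response body.
    static String postFile(String url, Path file, String contentType) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
        conn.setRequestMethod("POST");
        conn.setDoOutput(true);
        conn.setRequestProperty("Content-Type", contentType);
        try (OutputStream out = conn.getOutputStream()) {
            Files.copy(file, out);
        }
        int status = conn.getResponseCode();
        // On HTTP errors the body (e.g. a Tika stack trace) is on the error stream.
        InputStream body = status < 400 ? conn.getInputStream() : conn.getErrorStream();
        if (body == null) {
            return "HTTP " + status + " (no response body)";
        }
        try (InputStream in = body) {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buf.write(chunk, 0, n);
            }
            return buf.toString("UTF-8");
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: ExtractOnlyCheck <file> <content-type>");
            return;
        }
        // Assumed base URL, taken from the snippet in the original message.
        String url = extractOnlyUrl("http://localhost:8983/solr");
        System.out.println(postFile(url, Paths.get(args[0]), args[1]));
    }
}
```

Run it with a file path and its MIME type (for example application/pdf). If the text in the response is empty for a PDF, the problem is in extraction rather than in your field mapping.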
See: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

Also, some PDF files store their content as bitmap images (scanned pages, for example), so there is no text to extract.

-- Jack Krupansky

On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy <vijaya.bhoomire...@whishworks.com> wrote:

> Hi,
>
> I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
> .pptx, .xls, and .xlsx) into Solr. I am facing the following issues.
> Request to please let me know what is going wrong with the indexing
> process.
>
> I am using Solr 4.10.2 with the default example server configuration
> that comes with the Solr distribution.
>
> PDF Files - Indexing as such works fine, but when I query using *.* in the
> Solr Query console, metadata information is displayed properly. However,
> the PDF content field is empty. This is happening for all PDF files I have
> tried. I have tried with some proprietary files, PDF eBooks etc. Whatever
> be the PDF file, content is not being displayed.
>
> MS Office files - For some office files, everything works perfectly and the
> extracted content is visible in the query console. However, for others, I
> see the below error message during the indexing process.
>
> *Exception in thread "main"
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> org.apache.tika.exception.TikaException: Unexpected RuntimeException from
> org.apache.tika.parser.microsoft.OfficeParser*
>
> I am using SolrJ to index the documents and below is the code snippet
> related to indexing. Please let me know where the issue is occurring.
>
> static String solrServerURL = "http://localhost:8983/solr";
> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
> static ContentStreamUpdateRequest indexingReq = new ContentStreamUpdateRequest("/update/extract");
>
> indexingReq.addFile(file, fileType);
> indexingReq.setParam("literal.id", literalId);
> indexingReq.setParam("uprefix", "attr_");
> indexingReq.setParam("fmap.content", "content");
> indexingReq.setParam("literal.fileurl", fileURL);
> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
> solrServer.request(indexingReq);
>
> Thanks & Regards
> Vijay