Hi, I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt, .pptx, .xls, and .xlsx) into Solr and am running into the issues described below. Please let me know what is going wrong with the indexing process.
I am using Solr 4.10.2 with the default example server configuration that comes with the Solr distribution.

PDF files - Indexing itself works fine, and when I query using *:* in the Solr query console, the metadata is displayed properly. However, the PDF content field is empty. This happens for every PDF file I have tried, including some proprietary files and PDF eBooks. Whatever the PDF file, the content is not displayed.

MS Office files - For some Office files, everything works perfectly and the extracted content is visible in the query console. For others, I see the following error message during indexing:

Exception in thread "main" org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.microsoft.OfficeParser

I am using SolrJ to index the documents; the code snippet related to indexing is below. Please let me know where the issue is occurring.

    static String solrServerURL = "http://localhost:8983/solr";
    static SolrServer solrServer = new HttpSolrServer(solrServerURL);

    // Send the file to the extracting request handler (Solr Cell / Tika)
    static ContentStreamUpdateRequest indexingReq = new ContentStreamUpdateRequest("/update/extract");

    indexingReq.addFile(file, fileType);
    indexingReq.setParam("literal.id", literalId);
    // Prefix any field not defined in the schema with attr_
    indexingReq.setParam("uprefix", "attr_");
    // Map the extracted body text to the "content" field
    indexingReq.setParam("fmap.content", "content");
    indexingReq.setParam("literal.fileurl", fileURL);
    // Commit immediately, waiting for flush and for a new searcher
    indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    solrServer.request(indexingReq);

Thanks & Regards
Vijay
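
P.S. For context on the fileType argument in the snippet above: the second argument of addFile is a content type string. Below is a minimal sketch of one way such a value could be derived from the file extension, assuming the standard MIME types for the formats listed at the top of this mail. The ContentTypes class and its forFileName method are hypothetical names used only for illustration; they are not part of SolrJ or Tika, and this is not necessarily how fileType is computed in the code above.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only; not part of SolrJ or Tika.
final class ContentTypes {
    // Fallback used when the extension is missing or not in the table.
    static final String UNKNOWN = "application/octet-stream";

    private static final Map<String, String> BY_EXTENSION = new HashMap<String, String>();
    static {
        BY_EXTENSION.put("pdf", "application/pdf");
        BY_EXTENSION.put("doc", "application/msword");
        BY_EXTENSION.put("docx",
            "application/vnd.openxmlformats-officedocument.wordprocessingml.document");
        BY_EXTENSION.put("ppt", "application/vnd.ms-powerpoint");
        BY_EXTENSION.put("pptx",
            "application/vnd.openxmlformats-officedocument.presentationml.presentation");
        BY_EXTENSION.put("xls", "application/vnd.ms-excel");
        BY_EXTENSION.put("xlsx",
            "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet");
    }

    private ContentTypes() {}

    // Looks up the MIME type by the (case-insensitive) extension of fileName.
    static String forFileName(String fileName) {
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return UNKNOWN;
        }
        String ext = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        String type = BY_EXTENSION.get(ext);
        return type != null ? type : UNKNOWN;
    }
}
```

With a helper like this, the call in the snippet could be written as indexingReq.addFile(file, ContentTypes.forFileName(file.getName())). A lookup table is only one option; it keeps the mapping explicit but has to be extended by hand for any further file types.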