Re: Indexing PDF and MS Office files

Andrea Gazzarini Tue, 14 Apr 2015 09:25:23 -0700

Hi,

solrconfig.xml (especially if you didn't touch it) should be good. What about the schema? Are you using the one that comes with the download bundle, too?


I don't see the stack trace. Did you forget to paste it?

Best,
Andrea

On 04/14/2015 06:06 PM, Vijaya Narayana Reddy Bhoomi Reddy wrote:

Hi,

Here are the solrconfig.xml and the error log from the Solr logs for your reference. As mentioned earlier, I didn't make any changes to solrconfig.xml; I am using the out-of-the-box file that came with the default installation.


Please let me know your thoughts on why these issues are occurring.

Thanks & Regards
Vijay


        

*Vijay Bhoomireddy*, Big Data Architect

1000 Great West Road, Brentford, London, TW8 9DW

*T:* +44 20 3475 7980
*M:* +44 7481 298 360

*W:* www.whishworks.com <http://www.whishworks.com/>



On 14 April 2015 at 15:57, Vijaya Narayana Reddy Bhoomi Reddy <vijaya.bhoomire...@whishworks.com> wrote:


    Hi,

    I am trying to index PDF and Microsoft Office files (.doc, .docx,
    .ppt, .pptx, .xls, and .xlsx) into Solr and am facing the
    following issues. Please let me know what is going wrong with
    the indexing process.

    I am using Solr 4.10.2 with the default example server
    configuration that comes with the Solr distribution.

    PDF files - Indexing itself works fine, and when I query using
    *:* in the Solr query console, the metadata is displayed
    properly. However, the PDF content field is empty. This happens
    for every PDF file I have tried, including some proprietary
    files and PDF eBooks. Whatever the PDF file, the content is
    not displayed.

    MS Office files - For some Office files, everything works perfectly
    and the extracted content is visible in the query console.
    However, for others, I see the error message below during the
    indexing process.

    Exception in thread "main"
    org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
    org.apache.tika.exception.TikaException: Unexpected
    RuntimeException from org.apache.tika.parser.microsoft.OfficeParser

    I am using SolrJ to index the documents and below is the code
    snippet related to indexing. Please let me know where the issue is
    occurring.

    // Class-level fields
    static String solrServerURL = "http://localhost:8983/solr";
    static SolrServer solrServer = new HttpSolrServer(solrServerURL);
    static ContentStreamUpdateRequest indexingReq =
        new ContentStreamUpdateRequest("/update/extract");

    // Inside the indexing method: attach the file and set the extract parameters
    indexingReq.addFile(file, fileType);
    indexingReq.setParam("literal.id", literalId);
    indexingReq.setParam("uprefix", "attr_");
    indexingReq.setParam("fmap.content", "content");
    indexingReq.setParam("literal.fileurl", fileURL);
    // Commit after the update (waitFlush = true, waitSearcher = true)
    indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
    solrServer.request(indexingReq);

    Thanks & Regards
    Vijay
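
For reference, the parameters set in the snippet above can also be viewed as a plain HTTP query string against /update/extract. The following is a minimal sketch, not the poster's code: the class name ExtractUrlSketch, the helper buildExtractUrl, and the sample values (doc-1, the file URL) are hypothetical. It only builds the URL; it does not upload a file or contact Solr.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: shows the extract parameters from the thread as a query string.
class ExtractUrlSketch {

    // Returns "<solrBaseUrl>/update/extract?k1=v1&k2=v2" with URL-encoded keys and values.
    static String buildExtractUrl(String solrBaseUrl, Map<String, String> params) {
        StringBuilder url = new StringBuilder(solrBaseUrl).append("/update/extract");
        char separator = '?';
        for (Map.Entry<String, String> entry : params.entrySet()) {
            url.append(separator)
               .append(encode(entry.getKey()))
               .append('=')
               .append(encode(entry.getValue()));
            separator = '&';
        }
        return url.toString();
    }

    // Form-style URL encoding; UTF-8 is always available on the JVM.
    static String encode(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException ex) {
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // Same parameter names as in the SolrJ snippet; the values are made up.
        Map<String, String> params = new LinkedHashMap<>();
        params.put("literal.id", "doc-1");
        params.put("uprefix", "attr_");
        params.put("fmap.content", "content");
        params.put("literal.fileurl", "file:///tmp/sample.pdf");
        params.put("commit", "true");
        System.out.println(buildExtractUrl("http://localhost:8983/solr", params));
    }
}
```

Adding the extracting handler's extractOnly=true parameter to the same map should make Solr return what Tika extracted without indexing it, which may help check whether the PDF text is being extracted at all.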

