Thanks, Allison. I tried the changes you mentioned, but still no luck. I am using the code from the Lucidworks site that Erick provided, and I have now included your change as well. The issue persists: a small percentage of documents (both PDF and MS Office) still fail. Unfortunately, these documents are proprietary and client-confidential, so I am not sure whether they can be uploaded to Jira.
These files open normally in Adobe Reader and the MS Office tools.

Thanks & Regards
Vijay

On 16 April 2015 at 12:33, Allison, Timothy B. <talli...@mitre.org> wrote:

> I entirely agree with Erick -- it is best to isolate Tika in its own jvm
> if you can -- bad things can happen if you don't [1] [2].
>
> Erick's blog on SolrJ is fantastic. If you want to have Tika parse
> embedded documents/attachments, make sure to set the parser in the
> ParseContext before parsing:
>
> ParseContext context = new ParseContext();
> //add this line:
> context.set(Parser.class, _autoParser);
> InputStream input = new FileInputStream(file);
>
> Tika 1.8 is soon to be released. If that doesn't fix your problems,
> please submit stacktraces (and docs, if possible) to the Tika jira, and
> we'll try to make the fixes.
>
> Cheers,
>
> Tim
>
> [1] http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf
> [2] http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
>
> -----Original Message-----
> From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:vijaya.bhoomire...@whishworks.com]
> Sent: Thursday, April 16, 2015 7:10 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing PDF and MS Office files
>
> Erick,
>
> I tried indexing both ways - SolrJ / Tika's AutoParser as well as
> SolrCell's ExtractRequestHandler. The majority of the PDF and Word documents
> are getting parsed properly and indexed into Solr. However, a minority of
> them keep failing with either a PDFParser or an OfficeParser error.
>
> Not sure if this behaviour can be modified so that all the documents can be
> indexed. The business requirement we have is to index all the documents.
> However, if a small percentage of them fails, not sure what other ways
> exist to index them.
>
> Any help please?
>
>
> Thanks & Regards
> Vijay
>
>
> On 15 April 2015 at 15:20, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > There's quite a discussion here:
> > https://issues.apache.org/jira/browse/SOLR-7137
> >
> > But, I personally am not a huge fan of pushing all the work on to Solr.
> > In a production environment the Solr server is responsible for indexing,
> > parsing the docs through Tika, perhaps searching etc. This doesn't scale
> > all that well.
> >
> > So an alternative is to use SolrJ with Tika, which is totally independent
> > of what version of Tika is on the Solr server. Here's an example.
> >
> > http://lucidworks.com/blog/indexing-with-solrj/
> >
> > Best,
> > Erick
> >
> > On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
> > <vijaya.bhoomire...@whishworks.com> wrote:
> > > Thanks everyone for the responses. Now I am able to index PDF documents
> > > successfully. I have implemented manual extraction using Tika's
> > > AutoParser and PDF functionality is working fine. However, the error
> > > with some MS Office Word documents still persists.
> > >
> > > The error message is "java.lang.IllegalArgumentException: This paragraph
> > > is not the first one in the table" which will eventually result in
> > > "Unexpected RuntimeException from
> > > org.apache.tika.parser.microsoft.OfficeParser"
> > >
> > > Upon some reading, it looks like it's a bug in Tika 1.5 that seems to
> > > have been fixed in Tika 1.6
> > > (https://issues.apache.org/jira/browse/TIKA-1251).
> > > I am new to Solr / Tika and hence wondering whether I can change the
> > > Tika library alone to v1.6 without impacting any of the libraries within
> > > Solr 4.10.2? Please let me know your response and how to get around this
> > > issue.
> > >
> > > Many thanks in advance.
> > >
> > > Thanks & Regards
> > > Vijay
> > >
> > >
> > > On 15 April 2015 at 05:14, Shyam R <shyam.reme...@gmail.com> wrote:
> > >
> > >> Vijay,
> > >>
> > >> You could try different Excel files with different formats to rule out
> > >> that the issue is with the Tika version being used.
> > >>
> > >> Thanks
> > >> Murthy
> > >>
> > >> On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes <trhodes...@gmail.com>
> > >> wrote:
> > >>
> > >> > Perhaps the PDF is protected and the content cannot be extracted?
> > >> >
> > >> > I have an unverified suspicion that the Tika shipped with Solr 4.10.2
> > >> > may not support some/all Office 2013 document formats.
> > >> >
> > >> >
> > >> > On 4/14/2015 8:18 PM, Jack Krupansky wrote:
> > >> >
> > >> >> Try doing a manual extraction request directly to Solr (not via
> > >> >> SolrJ) and use the extractOnly option to see if the content is
> > >> >> actually extracted.
> > >> >>
> > >> >> See:
> > >> >> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
> > >> >>
> > >> >> Also, some PDF files actually have the content as a bitmap image, so
> > >> >> no text is extracted.
> > >> >>
> > >> >>
> > >> >> -- Jack Krupansky
> > >> >>
> > >> >> On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy
> > >> >> <vijaya.bhoomire...@whishworks.com> wrote:
> > >> >>
> > >> >> Hi,
> > >> >>>
> > >> >>> I am trying to index PDF and Microsoft Office files (.doc, .docx,
> > >> >>> .ppt, .pptx, .xls, and .xlsx) into Solr. I am facing the following
> > >> >>> issues. Please let me know what is going wrong with the indexing
> > >> >>> process.
> > >> >>>
> > >> >>> I am using Solr 4.10.2 with the default example server configuration
> > >> >>> that comes with the Solr distribution.
> > >> >>>
> > >> >>> PDF Files - Indexing as such works fine, but when I query using *.* in
> > >> >>> the Solr Query console, metadata information is displayed properly.
> > >> >>> However, the PDF content field is empty. This is happening for all PDF
> > >> >>> files I have tried. I have tried with some proprietary files, PDF
> > >> >>> eBooks etc. Whatever be the PDF file, content is not being displayed.
> > >> >>>
> > >> >>> MS Office files - For some office files, everything works perfect and
> > >> >>> the extracted content is visible in the query console. However, for
> > >> >>> others, I see the below error message during the indexing process.
> > >> >>>
> > >> >>> *Exception in thread "main"
> > >> >>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> > >> >>> org.apache.tika.exception.TikaException: Unexpected RuntimeException
> > >> >>> from org.apache.tika.parser.microsoft.OfficeParser*
> > >> >>>
> > >> >>>
> > >> >>> I am using SolrJ to index the documents and below is the code snippet
> > >> >>> related to indexing. Please let me know where the issue is occurring.
> > >> >>>
> > >> >>> static String solrServerURL = "http://localhost:8983/solr";
> > >> >>> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
> > >> >>> static ContentStreamUpdateRequest indexingReq =
> > >> >>>     new ContentStreamUpdateRequest("/update/extract");
> > >> >>>
> > >> >>> indexingReq.addFile(file, fileType);
> > >> >>> indexingReq.setParam("literal.id", literalId);
> > >> >>> indexingReq.setParam("uprefix", "attr_");
> > >> >>> indexingReq.setParam("fmap.content", "content");
> > >> >>> indexingReq.setParam("literal.fileurl", fileURL);
> > >> >>> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
> > >> >>> solrServer.request(indexingReq);
> > >> >>>
> > >> >>> Thanks & Regards
> > >> >>> Vijay
> > >> >>>
> > >> >>> --
> > >> >>> The contents of this e-mail are confidential and for the exclusive
> > >> >>> use of the intended recipient. If you receive this e-mail in error
> > >> >>> please delete it from your system immediately and notify us either by
> > >> >>> e-mail or telephone. You should not copy, forward or otherwise
> > >> >>> disclose the content of the e-mail. The views expressed in this
> > >> >>> communication may not necessarily be the view held by WHISHWORKS.
> > >> >>>
> > >>
> > >> --
> > >> Ph: 9845704792
> > >>
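
One possible way to approach the requirement that every document gets indexed, even when a few fail in PDFParser or OfficeParser, is to catch the failure per file and keep going. The failed files can then still be indexed with metadata only and retried later, for example after a Tika upgrade. The sketch below is only an illustration and is not taken from the Lucidworks example: Extractor, Result and extractAll are hypothetical names, and Extractor stands in for whatever Tika call actually does the parsing.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only. Extractor is a hypothetical stand-in for a Tika-based parse call;
// nothing here is part of the Tika or SolrJ APIs.
class BatchExtractSketch {

    /** Hypothetical abstraction over "parse this file and return its text". */
    interface Extractor {
        String extract(String path) throws Exception;
    }

    /** Outcome of a batch: text per parsed file, and an error description per failed file. */
    static final class Result {
        final Map<String, String> extracted = new LinkedHashMap<>();
        final Map<String, String> failures = new LinkedHashMap<>();
    }

    /** Tries every path; a parser exception on one file does not stop the others. */
    static Result extractAll(List<String> paths, Extractor extractor) {
        Result result = new Result();
        for (String path : paths) {
            try {
                result.extracted.put(path, extractor.extract(path));
            } catch (Exception e) {
                // Record the failure so the file can be indexed with metadata only,
                // reported, or retried with a newer parser version.
                result.failures.put(path, e.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Fake extractor that fails on .doc files, imitating the OfficeParser error.
        Extractor fake = path -> {
            if (path.endsWith(".doc")) {
                throw new IllegalArgumentException(
                        "This paragraph is not the first one in the table");
            }
            return "text of " + path;
        };
        Result r = extractAll(Arrays.asList("a.pdf", "b.doc", "c.docx"), fake);
        System.out.println("extracted: " + r.extracted.keySet());
        System.out.println("failed:    " + r.failures);
    }
}
```

The failures map doubles as the list of files whose stack traces could be reported to the Tika Jira. Note that this only handles exceptions; it does not protect against out-of-memory errors or hangs, which is one reason to run Tika in its own JVM as suggested above.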