Thanks Tim. I shall raise a Jira with the stack trace information.
Thanks & Regards
Vijay

On 16 April 2015 at 12:54, Allison, Timothy B. <talli...@mitre.org> wrote:

> This sounds like a Tika issue, let's move discussion to that list.
>
> If you are still having problems after you upgrade to Tika 1.8, please at
> least submit the stack traces (if you can) to the Tika jira. We may be
> able to find a document that triggers that stack trace in govdocs1 or the
> slice of CommonCrawl that Julien Nioche contributed to our eval effort.
>
> Tika is not perfect and it will fail on some files, but we are always
> working to improve it.
>
> Best,
>
> Tim
>
> -----Original Message-----
> From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:vijaya.bhoomire...@whishworks.com]
> Sent: Thursday, April 16, 2015 7:44 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing PDF and MS Office files
>
> Thanks Allison.
>
> I tried with the mentioned changes, but still no luck. I am using the code
> from the lucidworks site provided by Erick and have now included the changes
> mentioned by you. But the issue still persists, with a small percentage of
> documents (both PDF and MS Office documents) failing. Unfortunately, these
> documents are proprietary and client-confidential and hence I am not sure
> whether they can be uploaded into Jira.
>
> These files open normally in Adobe Reader and MS Office tools.
>
> Thanks & Regards
> Vijay
>
> On 16 April 2015 at 12:33, Allison, Timothy B. <talli...@mitre.org> wrote:
>
> > I entirely agree with Erick -- it is best to isolate Tika in its own jvm
> > if you can -- bad things can happen if you don't [1] [2].
> >
> > Erick's blog on SolrJ is fantastic. If you want to have Tika parse
> > embedded documents/attachments, make sure to set the parser in the
> > ParseContext before parsing:
> >
> >     ParseContext context = new ParseContext();
> >     //add this line:
> >     context.set(Parser.class, _autoParser);
> >     InputStream input = new FileInputStream(file);
> >
> > Tika 1.8 is soon to be released.
> > If that doesn't fix your problems, please submit stacktraces (and docs,
> > if possible) to the Tika jira, and we'll try to make the fixes.
> >
> > Cheers,
> >
> > Tim
> >
> > [1] http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf
> > [2] http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
> >
> > -----Original Message-----
> > From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:vijaya.bhoomire...@whishworks.com]
> > Sent: Thursday, April 16, 2015 7:10 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Indexing PDF and MS Office files
> >
> > Erick,
> >
> > I tried indexing both ways: SolrJ with Tika's AutoParser, as well as
> > SolrCell's ExtractRequestHandler. The majority of the PDF and Word
> > documents are getting parsed properly and indexed into Solr. However, a
> > minority of them keep failing with either a PDFParser or an OfficeParser
> > error.
> >
> > Not sure if this behaviour can be modified so that all the documents can
> > be indexed. The business requirement we have is to index all the
> > documents. However, if a small percentage of them fail, I am not sure
> > what other ways exist to index them.
> >
> > Any help please?
> >
> > Thanks & Regards
> > Vijay
> >
> > On 15 April 2015 at 15:20, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> > > There's quite a discussion here:
> > > https://issues.apache.org/jira/browse/SOLR-7137
> > >
> > > But, I personally am not a huge fan of pushing all the work on to Solr.
> > > In a production environment the Solr server is responsible for indexing,
> > > parsing the docs through Tika, perhaps searching etc. This doesn't scale
> > > all that well.
> > >
> > > So an alternative is to use SolrJ with Tika, which is totally independent
> > > of what version of Tika is on the Solr server. Here's an example.
> > >
> > > http://lucidworks.com/blog/indexing-with-solrj/
> > >
> > > Best,
> > > Erick
> > >
> > > On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
> > > <vijaya.bhoomire...@whishworks.com> wrote:
> > > > Thanks everyone for the responses. Now I am able to index PDF documents
> > > > successfully. I have implemented manual extraction using Tika's
> > > > AutoParser and PDF functionality is working fine. However, the error
> > > > with some MS Office Word documents still persists.
> > > >
> > > > The error message is "java.lang.IllegalArgumentException: This paragraph
> > > > is not the first one in the table" which will eventually result in
> > > > "Unexpected RuntimeException from
> > > > org.apache.tika.parser.microsoft.OfficeParser"
> > > >
> > > > Upon some reading, it looks like it's a bug in Tika 1.5 that seems to
> > > > have been fixed in Tika 1.6
> > > > ( https://issues.apache.org/jira/browse/TIKA-1251 ).
> > > > I am new to Solr / Tika and hence wondering whether I can change the
> > > > Tika library alone to v1.6 without impacting any of the libraries within
> > > > Solr 4.10.2? Please let me know your response and how to get around this
> > > > issue.
> > > >
> > > > Many thanks in advance.
> > > >
> > > > Thanks & Regards
> > > > Vijay
> > > >
> > > > On 15 April 2015 at 05:14, Shyam R <shyam.reme...@gmail.com> wrote:
> > > >
> > > >> Vijay,
> > > >>
> > > >> You could try different Excel files with different formats to rule out
> > > >> that the issue is with the Tika version being used.
> > > >>
> > > >> Thanks
> > > >> Murthy
> > > >>
> > > >> On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes <trhodes...@gmail.com>
> > > >> wrote:
> > > >>
> > > >> > Perhaps the PDF is protected and the content cannot be extracted?
> > > >> >
> > > >> > I have an unverified suspicion that the Tika shipped with Solr 4.10.2
> > > >> > may not support some/all Office 2013 document formats.
> > > >> >
> > > >> > On 4/14/2015 8:18 PM, Jack Krupansky wrote:
> > > >> >
> > > >> >> Try doing a manual extraction request directly to Solr (not via
> > > >> >> SolrJ) and use the extractOnly option to see if the content is
> > > >> >> actually extracted.
> > > >> >>
> > > >> >> See:
> > > >> >> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
> > > >> >>
> > > >> >> Also, some PDF files actually have the content as a bitmap image, so
> > > >> >> no text is extracted.
> > > >> >>
> > > >> >> -- Jack Krupansky
> > > >> >>
> > > >> >> On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy
> > > >> >> <vijaya.bhoomire...@whishworks.com> wrote:
> > > >> >>
> > > >> >>> Hi,
> > > >> >>>
> > > >> >>> I am trying to index PDF and Microsoft Office files (.doc, .docx,
> > > >> >>> .ppt, .pptx, .xls, and .xlsx) into Solr. I am facing the following
> > > >> >>> issues. Please let me know what is going wrong with the indexing
> > > >> >>> process.
> > > >> >>>
> > > >> >>> I am using Solr 4.10.2 with the default example server
> > > >> >>> configuration that comes with the Solr distribution.
> > > >> >>>
> > > >> >>> PDF Files - Indexing as such works fine, and when I query using *.*
> > > >> >>> in the Solr Query console, metadata information is displayed
> > > >> >>> properly. However, the PDF content field is empty. This is
> > > >> >>> happening for all PDF files I have tried. I have tried with some
> > > >> >>> proprietary files, PDF eBooks etc. Whatever the PDF file, content
> > > >> >>> is not being displayed.
> > > >> >>>
> > > >> >>> MS Office files - For some Office files, everything works perfectly
> > > >> >>> and the extracted content is visible in the query console.
> > > >> >>> However, for others, I see the below error message during the
> > > >> >>> indexing process.
> > > >> >>>
> > > >> >>> *Exception in thread "main"
> > > >> >>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> > > >> >>> org.apache.tika.exception.TikaException: Unexpected RuntimeException
> > > >> >>> from org.apache.tika.parser.microsoft.OfficeParser*
> > > >> >>>
> > > >> >>> I am using SolrJ to index the documents and below is the code
> > > >> >>> snippet related to indexing. Please let me know where the issue is
> > > >> >>> occurring.
> > > >> >>>
> > > >> >>>     static String solrServerURL = "http://localhost:8983/solr";
> > > >> >>>     static SolrServer solrServer = new HttpSolrServer(solrServerURL);
> > > >> >>>     static ContentStreamUpdateRequest indexingReq =
> > > >> >>>         new ContentStreamUpdateRequest("/update/extract");
> > > >> >>>
> > > >> >>>     indexingReq.addFile(file, fileType);
> > > >> >>>     indexingReq.setParam("literal.id", literalId);
> > > >> >>>     indexingReq.setParam("uprefix", "attr_");
> > > >> >>>     indexingReq.setParam("fmap.content", "content");
> > > >> >>>     indexingReq.setParam("literal.fileurl", fileURL);
> > > >> >>>     indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
> > > >> >>>     solrServer.request(indexingReq);
> > > >> >>>
> > > >> >>> Thanks & Regards
> > > >> >>> Vijay
> > > >> >>>
> > > >> >>> --
> > > >> >>> The contents of this e-mail are confidential and for the exclusive
> > > >> >>> use of the intended recipient. If you receive this e-mail in error
> > > >> >>> please delete it from your system immediately and notify us either
> > > >> >>> by e-mail or telephone. You should not copy, forward or otherwise
> > > >> >>> disclose the content of the e-mail.
> > > >> >>> The views expressed in this communication may not
> > > >> >>> necessarily be the view held by WHISHWORKS.
> > > >>
> > > >> --
> > > >> Ph: 9845704792
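
For the small share of documents that still fail, a batch indexer can at least keep going and collect the stack traces that Tim asks to be filed in the Tika jira. The following is a minimal sketch in plain Java, not code from the thread: `BatchExtractor`, `Result`, and `run` are hypothetical names and are not part of Solr, SolrJ, or Tika. The extractor is passed in as a `Function`, so a real parser call would have to be wrapped to fit; a real Tika parse also throws checked exceptions, which such a wrapper would need to catch or rethrow.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper for illustration; not a Solr, SolrJ, or Tika class.
class BatchExtractor {

    /** Outcome of one batch: extracted text per document id, and a stack trace per failed id. */
    static final class Result {
        final Map<String, String> extracted = new LinkedHashMap<>();
        final Map<String, String> failures = new LinkedHashMap<>();
    }

    /**
     * Applies the extractor to every document id. A RuntimeException thrown for
     * one document is recorded with its stack trace and does not stop the batch.
     */
    static Result run(List<String> docIds, Function<String, String> extractor) {
        Result result = new Result();
        for (String id : docIds) {
            try {
                result.extracted.put(id, extractor.apply(id));
            } catch (RuntimeException e) {
                // Keep the full trace so it can be attached to a bug report later.
                StringWriter trace = new StringWriter();
                PrintWriter writer = new PrintWriter(trace);
                e.printStackTrace(writer);
                writer.flush();
                result.failures.put(id, trace.toString());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in extractor: fails on one id to mimic a parser error.
        Function<String, String> fakeExtractor = id -> {
            if (id.equals("broken.doc")) {
                throw new IllegalArgumentException("simulated parser failure");
            }
            return "text of " + id;
        };
        Result r = run(Arrays.asList("a.pdf", "broken.doc", "c.docx"), fakeExtractor);
        System.out.println("extracted: " + r.extracted.keySet());
        System.out.println("failed: " + r.failures.keySet());
    }
}
```

The failed ids and their traces can then be reviewed separately, which fits the case where the documents themselves are confidential and only the stack traces can be shared.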