Re: Indexing PDF and MS Office files

Vijaya Narayana Reddy Bhoomi Reddy Thu, 16 Apr 2015 04:59:09 -0700

Thanks Tim.

I shall raise a Jira with the stack trace information.


Thanks & Regards
Vijay


On 16 April 2015 at 12:54, Allison, Timothy B. <talli...@mitre.org> wrote:

> This sounds like a Tika issue, let's move discussion to that list.
>
> If you are still having problems after you upgrade to Tika 1.8, please at
> least submit the stack traces (if you can) to the Tika jira.  We may be
> able to find a document that triggers that stack trace in govdocs1 or the
> slice of CommonCrawl that Julien Nioche contributed to our eval effort.
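
When the documents themselves are confidential, the stack traces can still be collected per file and shared on their own. The following is a minimal sketch of that idea using only the JDK. The `FileParser` interface is a hypothetical stand-in for whatever call actually parses a file (for example, a Tika parse); it is not part of Tika or SolrJ.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;

final class FailureReport {
    /** Hypothetical stand-in for the call that parses one file. */
    interface FileParser {
        void parse(String path) throws Exception;
    }

    /** Renders the full stack trace of a throwable as text. */
    static String stackTraceOf(Throwable t) {
        StringWriter sw = new StringWriter();
        t.printStackTrace(new PrintWriter(sw, true));
        return sw.toString();
    }

    /**
     * Parses every path without aborting the batch on a failure.
     * Returns a map of failing path -> stack trace, in input order.
     */
    static Map<String, String> collectFailures(Iterable<String> paths, FileParser parser) {
        Map<String, String> failures = new LinkedHashMap<>();
        for (String path : paths) {
            try {
                parser.parse(path);
            } catch (Exception e) {
                // Only Exceptions are caught here; Errors such as
                // OutOfMemoryError still propagate to the caller.
                failures.put(path, stackTraceOf(e));
            }
        }
        return failures;
    }
}
```

The resulting traces contain class names and line numbers but none of the document content, so they may be easier to attach to a Jira issue than the files themselves.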
>
> Tika is not perfect and it will fail on some files, but we are always
> working to improve it.
>
> Best,
>
>           Tim
>
> -----Original Message-----
> From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:vijaya.bhoomire...@whishworks.com]
> Sent: Thursday, April 16, 2015 7:44 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Indexing PDF and MS Office files
>
> Thanks Allison.
>
> I tried the changes you mentioned, but still no luck. I am using the code
> from the Lucidworks site provided by Erick, now with the changes you
> mentioned included. The issue still persists, with a small percentage of
> documents (both PDF and MS Office documents) failing. Unfortunately, these
> documents are proprietary and client-confidential, and hence I am not sure
> whether they can be uploaded into Jira.
>
> These files open normally in Adobe Reader and MS Office tools.
>
> Thanks & Regards
> Vijay
>
>
> On 16 April 2015 at 12:33, Allison, Timothy B. <talli...@mitre.org> wrote:
>
> > I entirely agree with Erick -- it is best to isolate Tika in its own jvm
> > if you can -- bad things can happen if you don't [1] [2].
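
One way to isolate parsing in its own JVM is to launch a child process per file (or per batch) and enforce a timeout, so that a hang or crash in the parser cannot take the indexing process down with it. The sketch below uses only the JDK. The heap size, the classpath value, and the `mainClass` entry point are assumptions for illustration; the thread does not prescribe any of them.

```java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

final class IsolatedParse {
    /**
     * Builds the command line for a child JVM that parses one file.
     * mainClass is a hypothetical entry point that you would write yourself.
     */
    static List<String> childCommand(String classpath, String mainClass, String file) {
        String javaBin = System.getProperty("java.home")
                + File.separator + "bin" + File.separator + "java";
        List<String> cmd = new ArrayList<>();
        cmd.add(javaBin);
        cmd.add("-Xmx512m"); // assumed heap cap so a bad file cannot exhaust the machine
        cmd.add("-cp");
        cmd.add(classpath);
        cmd.add(mainClass);
        cmd.add(file);
        return cmd;
    }

    /** Runs the child; returns its exit code, or -1 if it was killed after the timeout. */
    static int runWithTimeout(List<String> cmd, long timeoutSeconds)
            throws IOException, InterruptedException {
        // inheritIO() avoids blocking on unread stdout/stderr pipes.
        Process p = new ProcessBuilder(cmd).inheritIO().start();
        if (!p.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
            p.destroyForcibly();
            return -1;
        }
        return p.exitValue();
    }
}
```

A caller could treat any non-zero result as "this file failed" and move on to the next file, while the parent JVM stays healthy.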
> >
> > Erick's blog on SolrJ is fantastic.  If you want to have Tika parse
> > embedded documents/attachments, make sure to set the parser in the
> > ParseContext before parsing:
> >
> > ParseContext context = new ParseContext();
> > // add this line:
> > context.set(Parser.class, _autoParser);
> > InputStream input = new FileInputStream(file);
> >
> > Tika 1.8 is soon to be released.  If that doesn't fix your problems,
> > please submit stacktraces (and docs, if possible) to the Tika jira, and
> > we'll try to make the fixes.
> >
> > Cheers,
> >
> >             Tim
> >
> > [1] http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf
> > [2] http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
> > -----Original Message-----
> > From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:vijaya.bhoomire...@whishworks.com]
> > Sent: Thursday, April 16, 2015 7:10 AM
> > To: solr-user@lucene.apache.org
> > Subject: Re: Indexing PDF and MS Office files
> >
> > Erick,
> >
> > I tried indexing both ways: SolrJ with Tika's AutoParser, and SolrCell's
> > ExtractRequestHandler. The majority of the PDF and Word documents are
> > parsed properly and indexed into Solr. However, a minority of them keep
> > failing with either a PDFParser or an OfficeParser error.
> >
> > I am not sure whether this behaviour can be modified so that all the
> > documents can be indexed. Our business requirement is to index all the
> > documents, so if a small percentage of them fail, I am not sure what other
> > ways exist to index them.
> >
> > Any help please?
> >
> >
> > Thanks & Regards
> > Vijay
> >
> >
> >
> > On 15 April 2015 at 15:20, Erick Erickson <erickerick...@gmail.com> wrote:
> >
> > > There's quite a discussion here:
> > > https://issues.apache.org/jira/browse/SOLR-7137
> > >
> > > But I personally am not a huge fan of pushing all the work onto Solr. In
> > > a production environment the Solr server is then responsible for indexing,
> > > parsing the docs through Tika, perhaps searching, etc. This doesn't scale
> > > all that well.
> > >
> > > So an alternative is to use SolrJ with Tika, which is totally independent
> > > of what version of Tika is on the Solr server. Here's an example.
> > >
> > > http://lucidworks.com/blog/indexing-with-solrj/
> > >
> > > Best,
> > > Erick
> > >
> > > On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
> > > <vijaya.bhoomire...@whishworks.com> wrote:
> > > > Thanks everyone for the responses. Now I am able to index PDF documents
> > > > successfully. I have implemented manual extraction using Tika's
> > > > AutoParser and the PDF functionality is working fine. However, the error
> > > > with some MS Office Word documents still persists.
> > > >
> > > > The error message is "java.lang.IllegalArgumentException: This paragraph
> > > > is not the first one in the table", which eventually results in
> > > > "Unexpected RuntimeException from
> > > > org.apache.tika.parser.microsoft.OfficeParser".
> > > >
> > > > Upon some reading, it looks like it's a bug in Tika 1.5 that seems to
> > > > have been fixed in Tika 1.6
> > > > ( https://issues.apache.org/jira/browse/TIKA-1251 ). I am new to Solr /
> > > > Tika and hence wondering whether I can change the Tika library alone to
> > > > v1.6 without impacting any of the other libraries within Solr 4.10.2.
> > > > Please let me know your response and how to get around this issue.
> > > >
> > > > Many thanks in advance.
> > > >
> > > > Thanks & Regards
> > > > Vijay
> > > >
> > > >
> > > > On 15 April 2015 at 05:14, Shyam R <shyam.reme...@gmail.com> wrote:
> > > >
> > > >> Vijay,
> > > >>
> > > > >> You could try different Excel files in different formats to rule out
> > > > >> whether the issue is with the Tika version being used.
> > > >>
> > > >> Thanks
> > > >> Murthy
> > > >>
> > > > >> On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes <trhodes...@gmail.com> wrote:
> > > >>
> > > >> > Perhaps the PDF is protected and the content can not be extracted?
> > > >> >
> > > > >> > I have an unverified suspicion that the Tika shipped with Solr 4.10.2
> > > > >> > may not support some/all Office 2013 document formats.
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> >
> > > >> > On 4/14/2015 8:18 PM, Jack Krupansky wrote:
> > > >> >
> > > > >> >> Try doing a manual extraction request directly to Solr (not via SolrJ)
> > > > >> >> and use the extractOnly option to see if the content is actually
> > > > >> >> extracted.
> > > >> >>
> > > >> >> See:
> > > > >> >> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
> > > >> >>
> > > > >> >> Also, some PDF files actually have the content as a bitmap image, so
> > > > >> >> no text is extracted.
> > > >> >>
> > > >> >>
> > > >> >> -- Jack Krupansky
> > > >> >>
> > > > >> >> On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy
> > > > >> >> <vijaya.bhoomire...@whishworks.com> wrote:
> > > >> >>
> > > > >> >>> Hi,
> > > > >> >>>
> > > > >> >>> I am trying to index PDF and Microsoft Office files (.doc, .docx, .ppt,
> > > > >> >>> .pptx, .xls, and .xlsx) into Solr. I am facing the following issues.
> > > > >> >>> Please let me know what is going wrong with the indexing process.
> > > >> >>>
> > > > >> >>> I am using Solr 4.10.2 with the default example server configuration
> > > > >> >>> that comes with the Solr distribution.
> > > >> >>>
> > > > >> >>> PDF files - Indexing as such works fine, and when I query using *:* in
> > > > >> >>> the Solr Query console, metadata information is displayed properly.
> > > > >> >>> However, the PDF content field is empty. This is happening for all PDF
> > > > >> >>> files I have tried, including some proprietary files, PDF eBooks etc.
> > > > >> >>> Whatever the PDF file, content is not being displayed.
> > > >> >>>
> > > > >> >>> MS Office files - For some Office files, everything works perfectly and
> > > > >> >>> the extracted content is visible in the query console. However, for
> > > > >> >>> others, I see the below error message during the indexing process.
> > > >> >>>
> > > > >> >>> *Exception in thread "main"
> > > > >> >>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> > > > >> >>> org.apache.tika.exception.TikaException: Unexpected RuntimeException
> > > > >> >>> from org.apache.tika.parser.microsoft.OfficeParser*
> > > >> >>>
> > > >> >>>
> > > > >> >>> I am using SolrJ to index the documents and below is the code snippet
> > > > >> >>> related to indexing. Please let me know where the issue is occurring.
> > > >> >>>
> > > >> >>>                          static String solrServerURL = "
> > > >> >>> http://localhost:8983/solr";;
> > > >> >>> static SolrServer solrServer = new
> HttpSolrServer(solrServerURL);
> > > >> >>>                          static ContentStreamUpdateRequest
> > > indexingReq
> > > >> =
> > > >> >>> new
> > > >> >>>
> > > >> >>>      ContentStreamUpdateRequest("/update/extract");
> > > >> >>>
> > > >> >>>                          indexingReq.addFile(file, fileType);
> > > >> >>> indexingReq.setParam("literal.id", literalId);
> > > >> >>> indexingReq.setParam("uprefix", "attr_");
> > > >> >>> indexingReq.setParam("fmap.content", "content");
> > > >> >>> indexingReq.setParam("literal.fileurl", fileURL);
> > > >> >>> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true,
> > > true);
> > > >> >>> solrServer.request(indexingReq);
> > > >> >>>
> > > >> >>> Thanks & Regards
> > > >> >>> Vijay
> > > >> >>>
> > > >> >>> --
> > > > >> >>> The contents of this e-mail are confidential and for the exclusive use
> > > > >> >>> of the intended recipient. If you receive this e-mail in error please
> > > > >> >>> delete it from your system immediately and notify us either by e-mail
> > > > >> >>> or telephone. You should not copy, forward or otherwise disclose the
> > > > >> >>> content of the e-mail. The views expressed in this communication may
> > > > >> >>> not necessarily be the view held by WHISHWORKS.
> > > >> >>>
> > > >> >>>
> > > >> >
> > > >>
> > > >>
> > > >> --
> > > >> Ph: 9845704792
> > > >>
> > > >
> > > > --
> > > > The contents of this e-mail are confidential and for the exclusive
> use
> > of
> > > > the intended recipient. If you receive this e-mail in error please
> > delete
> > > > it from your system immediately and notify us either by e-mail or
> > > > telephone. You should not copy, forward or otherwise disclose the
> > content
> > > > of the e-mail. The views expressed in this communication may not
> > > > necessarily be the view held by WHISHWORKS.
> > >
> >
> >
>
>

