Re: Indexing PDF and MS Office files

Vijaya Narayana Reddy Bhoomi Reddy Thu, 16 Apr 2015 04:46:34 -0700

Thanks Allison.

I applied the changes you mentioned, but still no luck. I am using the code
from the Lucidworks site that Erick pointed to, with your changes included.
The issue persists: a small percentage of documents (both PDF and MS Office)
still fail. Unfortunately, these documents are proprietary and
client-confidential, so I am not sure whether they can be uploaded to Jira.
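
Since the confidential files cannot be shared, one option is to capture only
the stack trace for each failing file and keep indexing the rest. A minimal
sketch, assuming a hypothetical Extractor interface standing in for the real
Tika or SolrJ call (BatchExtractor, Extractor, and Report are illustrative
names, not part of Tika or Solr):

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: one bad document must not stop the whole batch. */
final class BatchExtractor {

    /** Stand-in for the real Tika/SolrJ call; it may throw any exception. */
    interface Extractor {
        String extract(String path) throws Exception;
    }

    /** Extracted text per path, plus a stack trace per failed path. */
    static final class Report {
        final Map<String, String> extracted = new LinkedHashMap<>();
        final Map<String, String> failures = new LinkedHashMap<>();
    }

    static Report run(List<String> paths, Extractor extractor) {
        Report report = new Report();
        for (String path : paths) {
            try {
                report.extracted.put(path, extractor.extract(path));
            } catch (Exception e) {
                // Keep the full trace so it can be reported without the file.
                StringWriter trace = new StringWriter();
                e.printStackTrace(new PrintWriter(trace, true));
                report.failures.put(path, trace.toString());
            }
        }
        return report;
    }

    public static void main(String[] args) {
        // Simulated extractor: every .doc file fails, everything else works.
        Report report = run(List.of("a.pdf", "b.doc"), path -> {
            if (path.endsWith(".doc")) {
                throw new IllegalStateException("simulated parser failure");
            }
            return "text of " + path;
        });
        System.out.println("extracted: " + report.extracted.keySet());
        System.out.println("failed:    " + report.failures.keySet());
    }
}
```

Only exceptions are caught here. Errors such as OutOfMemoryError are
deliberately left alone, which is one more argument for running the parser
in a separate JVM.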


These files normally open in Adobe Reader and MS Office tools.

Thanks & Regards
Vijay


On 16 April 2015 at 12:33, Allison, Timothy B. <[email protected]> wrote:

> I entirely agree with Erick -- it is best to isolate Tika in its own jvm
> if you can -- bad things can happen if you don't [1] [2].
>
> Erick's blog on SolrJ is fantastic.  If you want to have Tika parse
> embedded documents/attachments, make sure to set the parser in the
> ParseContext before parsing:
>
> ParseContext context = new ParseContext();
> //add this line:
> context.set(Parser.class, _autoParser);
> InputStream input = new FileInputStream(file);
>
> Tika 1.8 is soon to be released.  If that doesn't fix your problems,
> please submit stacktraces (and docs, if possible) to the Tika jira, and
> we'll try to make the fixes.
>
> Cheers,
>
>             Tim
>
> [1]
> http://events.linuxfoundation.org/sites/events/files/slides/1s_and_0s_1.pdf
> [2]
> http://events.linuxfoundation.org/sites/events/files/slides/TikaEval_ACNA15_allison_herceg_v2.pdf
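
One way to act on the isolation advice above is to run each parse in a
short-lived child JVM with a timeout, so that a hang or crash in the parser
cannot take down the indexing process. A rough sketch using only the JDK
(Java 11+). ExtractMain is a hypothetical main class (not part of Tika) that
would parse the file named in its first argument and print the text to
stdout, and the 512 MB heap cap is an arbitrary assumption:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

/** Illustrative helper for running a parser in a separate JVM. */
final class IsolatedParse {

    /** Builds the command line for the child JVM from caller-supplied values. */
    static List<String> childCommand(String classpath, String mainClass, String inputFile) {
        String javaBin = Path.of(System.getProperty("java.home"), "bin", "java").toString();
        List<String> cmd = new ArrayList<>();
        cmd.add(javaBin);
        cmd.add("-Xmx512m"); // assumed heap cap; tune for your documents
        cmd.add("-cp");
        cmd.add(classpath);
        cmd.add(mainClass);
        cmd.add(inputFile);
        return cmd;
    }

    /** Runs the command and returns its output; throws on timeout or non-zero exit. */
    static String runWithTimeout(List<String> command, long timeoutSeconds)
            throws IOException, InterruptedException {
        Path out = Files.createTempFile("extract", ".txt");
        try {
            // stderr is merged into the same file so a failure message
            // includes the child's stack trace.
            Process process = new ProcessBuilder(command)
                    .redirectErrorStream(true)
                    .redirectOutput(out.toFile())
                    .start();
            if (!process.waitFor(timeoutSeconds, TimeUnit.SECONDS)) {
                process.destroyForcibly();
                throw new IOException("parser timed out after " + timeoutSeconds + "s");
            }
            if (process.exitValue() != 0) {
                throw new IOException("parser exited with code " + process.exitValue()
                        + ": " + Files.readString(out));
            }
            return Files.readString(out);
        } finally {
            Files.deleteIfExists(out);
        }
    }

    public static void main(String[] args) {
        // Prints the command that would be launched; nothing is executed here.
        System.out.println(childCommand("lib/*", "ExtractMain", "sample.pdf"));
    }
}
```

Because stderr is merged into the output, parser warnings may end up mixed
into the extracted text on success. Redirecting stderr to its own file would
avoid that.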
> -----Original Message-----
> From: Vijaya Narayana Reddy Bhoomi Reddy [mailto:
> [email protected]]
> Sent: Thursday, April 16, 2015 7:10 AM
> To: [email protected]
> Subject: Re: Indexing PDF and MS Office files
>
> Erick,
>
> I tried indexing both ways: SolrJ with Tika's AutoParser, and SolrCell's
> ExtractRequestHandler. The majority of the PDF and Word documents are
> parsed properly and indexed into Solr. However, a minority of them keep
> failing with either a PDFParser or an OfficeParser error.
>
> I am not sure whether this behaviour can be changed so that all the
> documents can be indexed. Our business requirement is to index every
> document. If a small percentage of them fail, I am not sure what other
> ways exist to index them.
>
> Any help please?
>
>
> Thanks & Regards
> Vijay
>
>
>
> On 15 April 2015 at 15:20, Erick Erickson <[email protected]> wrote:
>
> > There's quite a discussion here:
> > https://issues.apache.org/jira/browse/SOLR-7137
> >
> > But I personally am not a huge fan of pushing all the work onto Solr. In a
> > production environment the Solr server is responsible for indexing,
> > parsing the docs through Tika, perhaps searching, etc. This doesn't scale
> > all that well.
> >
> > So an alternative is to use SolrJ with Tika, which is totally independent
> > of what version of Tika is on the Solr server. Here's an example.
> >
> > http://lucidworks.com/blog/indexing-with-solrj/
> >
> > Best,
> > Erick
> >
> > On Wed, Apr 15, 2015 at 4:46 AM, Vijaya Narayana Reddy Bhoomi Reddy
> > <[email protected]> wrote:
> > > Thanks everyone for the responses. Now I am able to index PDF documents
> > > successfully. I have implemented manual extraction using Tika's
> > > AutoParser, and the PDF functionality is working fine. However, the
> > > error with some MS Office Word documents still persists.
> > >
> > > The error message is "java.lang.IllegalArgumentException: This
> > > paragraph is not the first one in the table", which eventually results
> > > in "Unexpected RuntimeException from
> > > org.apache.tika.parser.microsoft.OfficeParser".
> > >
> > > From some reading, it looks like it's a bug in Tika 1.5 that seems to
> > > have been fixed in Tika 1.6
> > > ( https://issues.apache.org/jira/browse/TIKA-1251 ). I am new to Solr
> > > and Tika, and hence wondering whether I can change the Tika library
> > > alone to v1.6 without impacting any of the other libraries within Solr
> > > 4.10.2. Please let me know how to get around this issue.
> > >
> > > Many thanks in advance.
> > >
> > > Thanks & Regards
> > > Vijay
> > >
> > >
> > > On 15 April 2015 at 05:14, Shyam R <[email protected]> wrote:
> > >
> > >> Vijay,
> > >>
> > >> You could try different Excel files in different formats to rule out
> > >> an issue with the Tika version being used.
> > >>
> > >> Thanks
> > >> Murthy
> > >>
> > >> On Wed, Apr 15, 2015 at 9:35 AM, Terry Rhodes <[email protected]>
> > >> wrote:
> > >>
> > >> > Perhaps the PDF is protected and the content cannot be extracted?
> > >> >
> > >> > I have an unverified suspicion that the Tika shipped with Solr 4.10.2
> > >> > may not support some or all Office 2013 document formats.
> > >> >
> > >> >
> > >> >
> > >> >
> > >> >
> > >> > On 4/14/2015 8:18 PM, Jack Krupansky wrote:
> > >> >
> > >> >> Try doing a manual extraction request directly to Solr (not via
> > >> >> SolrJ) and use the extractOnly option to see if the content is
> > >> >> actually extracted.
> > >> >>
> > >> >> See:
> > >> >> https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika
> > >> >>
> > >> >> Also, some PDF files actually have the content as a bitmap image, so
> > >> >> no text is extracted.
> > >> >>
> > >> >>
> > >> >> -- Jack Krupansky
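
As a sketch of such a direct request from Java 11+ without SolrJ, assuming
the handler is mounted at /update/extract as in the SolrJ snippet in this
thread and that Solr is at http://localhost:8983/solr. ExtractOnlyCheck is an
illustrative name, and extractFormat=text and wt=json are optional
parameters added here only to make the response easier to read:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Path;

/** Illustrative helper: ask Solr Cell to extract a file without indexing it. */
final class ExtractOnlyCheck {

    /** Builds the extractOnly URL for a given Solr base URL. */
    static URI extractOnlyUri(String solrBaseUrl) {
        String base = solrBaseUrl.endsWith("/")
                ? solrBaseUrl.substring(0, solrBaseUrl.length() - 1)
                : solrBaseUrl;
        return URI.create(base + "/update/extract?extractOnly=true&extractFormat=text&wt=json");
    }

    /** POSTs the raw file body and returns the HTTP status plus response body. */
    static String extractOnly(String solrBaseUrl, Path file, String contentType) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(extractOnlyUri(solrBaseUrl))
                .header("Content-Type", contentType)
                .POST(HttpRequest.BodyPublishers.ofFile(file))
                .build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return response.statusCode() + "\n" + response.body();
    }

    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.out.println("usage: ExtractOnlyCheck <file> <content-type>");
            return;
        }
        // Base URL is an assumption taken from the snippet in this thread.
        System.out.println(extractOnly("http://localhost:8983/solr", Path.of(args[0]), args[1]));
    }
}
```

If the response body contains no text for a PDF that opens fine in a viewer,
that points at the extraction step rather than at the field mapping.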
> > >> >>
> > >> >> On Tue, Apr 14, 2015 at 10:57 AM, Vijaya Narayana Reddy Bhoomi Reddy
> > >> >> <[email protected]> wrote:
> > >> >>
> > >> >>> Hi,
> > >> >>>
> > >> >>> I am trying to index PDF and Microsoft Office files (.doc, .docx,
> > >> >>> .ppt, .pptx, .xls, and .xlsx) into Solr. I am facing the following
> > >> >>> issues. Please let me know what is going wrong with the indexing
> > >> >>> process.
> > >> >>>
> > >> >>> I am using Solr 4.10.2 with the default example server configuration
> > >> >>> that comes with the Solr distribution.
> > >> >>>
> > >> >>> PDF files - Indexing as such works fine, and when I query using *:*
> > >> >>> in the Solr query console, the metadata is displayed properly.
> > >> >>> However, the PDF content field is empty. This happens for every PDF
> > >> >>> file I have tried, including some proprietary files and PDF eBooks.
> > >> >>>
> > >> >>> MS Office files - For some Office files, everything works perfectly
> > >> >>> and the extracted content is visible in the query console. However,
> > >> >>> for others, I see the error message below during the indexing
> > >> >>> process.
> > >> >>>
> > >> >>> Exception in thread "main"
> > >> >>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> > >> >>> org.apache.tika.exception.TikaException: Unexpected RuntimeException
> > >> >>> from org.apache.tika.parser.microsoft.OfficeParser
> > >> >>>
> > >> >>>
> > >> >>> I am using SolrJ to index the documents and below is the code
> > snippet
> > >> >>> related to indexing. Please let me know where the issue is
> > occurring.
> > >> >>>
> > >> >>> static String solrServerURL = "http://localhost:8983/solr";
> > >> >>> static SolrServer solrServer = new HttpSolrServer(solrServerURL);
> > >> >>> static ContentStreamUpdateRequest indexingReq =
> > >> >>>     new ContentStreamUpdateRequest("/update/extract");
> > >> >>>
> > >> >>> indexingReq.addFile(file, fileType);
> > >> >>> indexingReq.setParam("literal.id", literalId);
> > >> >>> indexingReq.setParam("uprefix", "attr_");
> > >> >>> indexingReq.setParam("fmap.content", "content");
> > >> >>> indexingReq.setParam("literal.fileurl", fileURL);
> > >> >>> indexingReq.setAction(AbstractUpdateRequest.ACTION.COMMIT, true, true);
> > >> >>> solrServer.request(indexingReq);
> > >> >>>
> > >> >>> Thanks & Regards
> > >> >>> Vijay
> > >> >>>
> > >> >>> --
> > >> >>> The contents of this e-mail are confidential and for the exclusive
> > use
> > >> of
> > >> >>> the intended recipient. If you receive this e-mail in error please
> > >> delete
> > >> >>> it from your system immediately and notify us either by e-mail or
> > >> >>> telephone. You should not copy, forward or otherwise disclose the
> > >> content
> > >> >>> of the e-mail. The views expressed in this communication may not
> > >> >>> necessarily be the view held by WHISHWORKS.
> > >> >>>
> > >> >>>
> > >> >
> > >>
> > >>
> > >> --
> > >> Ph: 9845704792
> > >>
> > >
> >
>
>

