>At the end of the day it would be a much better architecture to parse the
> PDFs using plain standalone TikaServer

+1

Also, note that we added a -spawnChild switch to tika-server that will
run the server in a child process and kill and restart that child if
it hits an infinite loop, OOM, segfault, etc.  Your client will need
to handle tika-server being down for a second or two during restarts
and/or returning 503 while it is shutting down.
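
A minimal sketch of what that client-side handling might look like,
assuming tika-server listens on its default port 9998 and the client
PUTs documents to /tika. The function name, attempt count and delay
are illustrative choices, not anything tika-server prescribes:

```python
import time
import urllib.error
import urllib.request

TIKA_URL = "http://localhost:9998/tika"  # assumption: default tika-server port


def extract_text(pdf_bytes, url=TIKA_URL, attempts=5, delay=1.0,
                 opener=urllib.request.urlopen, sleep=time.sleep):
    """PUT a document to tika-server, retrying while the server restarts."""
    request = urllib.request.Request(
        url, data=pdf_bytes, method="PUT",
        headers={"Accept": "text/plain"},
    )
    for attempt in range(1, attempts + 1):
        try:
            with opener(request) as response:
                return response.read().decode("utf-8")
        except urllib.error.HTTPError as err:
            # 503: server is shutting down. Any other status is a real failure.
            if err.code != 503 or attempt == attempts:
                raise
        except urllib.error.URLError:
            # Connection refused: the child process is probably restarting.
            if attempt == attempts:
                raise
        sleep(delay * attempt)  # linear backoff before the next attempt
```

The opener and sleep parameters exist only so the retry logic can be
exercised without a running server; in normal use you would call
extract_text(open(path, "rb").read()) and leave them at their defaults.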

>In fact, moving the parsing to the client solved the problem!
Yay!
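
For anyone finding this thread later, here is a rough sketch of the
"build the Solr document on the client" step. It assumes the metadata
dict comes from tika-server (for example one entry of the JSON list
returned by /rmeta/text) and that documents are sent to Solr's JSON
update handler. The collection name "docs" and the field names in
FIELD_MAP are hypothetical; adapt them to your own schema:

```python
import json
import urllib.request

# Hypothetical collection name "docs"; adjust host, port and collection.
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"

# Hypothetical mapping from Tika metadata keys to Solr schema fields.
FIELD_MAP = {
    "dc:title": "title_t",
    "Content-Type": "content_type_s",
    "X-TIKA:content": "text_t",
}


def build_solr_doc(doc_id, tika_metadata):
    """Map one Tika metadata dict onto a flat Solr document."""
    doc = {"id": doc_id}
    for tika_key, solr_field in FIELD_MAP.items():
        value = tika_metadata.get(tika_key)
        if value is not None:
            # Trim surrounding whitespace from string values such as body text.
            doc[solr_field] = value.strip() if isinstance(value, str) else value
    return doc


def solr_update_request(docs, url=SOLR_UPDATE_URL):
    """Build (but do not send) a JSON update request for a batch of documents."""
    return urllib.request.Request(
        url,
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Sending the batch is then urllib.request.urlopen(solr_update_request(docs)).
Keeping the mapping in one place on the client is what gives you control
over which metadata reaches which schema field.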

On Mon, Feb 4, 2019 at 2:01 PM Monique Monteiro
<monique.lou...@gmail.com> wrote:
>
> Hi all,
>
> In fact, moving the parsing to the client solved the problem!
>
> Thanks!
> Monique
>
> On Thu, Jan 31, 2019 at 8:25 AM Jan Høydahl <jan....@cominvent.com> wrote:
>
> > Hi
> >
> > This is Apache Tika failing to parse a zip file, or possibly a
> > zip-formatted office file.
> > If you want help locating the source of the issue, you have to post the
> > full stack trace (which you'll find in solr.log on the server side).
> > You may also be able to configure Tika to work around it.
> >
> > Have you tried to specify ignoreTikaException=true on the request? See
> > https://lucene.apache.org/solr/guide/7_6/uploading-data-with-solr-cell-using-apache-tika.html
> >
> > At the end of the day it would be a much better architecture to parse the
> > PDFs using plain standalone TikaServer and then construct a Solr Document
> > in your Python code, which is then posted to Solr. The reason is that you
> > have much better control over parse errors and over how to map metadata to
> > your schema fields. Also, you don't want to overload Solr with all this
> > work: a parser that crashes or gets stuck in an infinite loop can even
> > bring down the whole Solr server.
> >
> > --
> > Jan Høydahl, search solution architect
> > Cominvent AS - www.cominvent.com
> >
> > > On 30 Jan 2019, at 20:49, Monique Monteiro <monique.lou...@gmail.com> wrote:
> > >
> > > Hi all,
> > >
> > > I'm writing a Python routine to upload thousands of PDF files to Solr,
> > > and after trying to upload some files, Solr reports the following error
> > > in an HTTP 500 response:
> > >
> > > "by: java.util.zip.DataFormatException: invalid distance too far back"
> > >
> > > Does anyone have any idea about how to overcome this?
> > >
> > > Thanks in advance,
> > > Monique Monteiro
> >
> >
>
> --
> Monique Monteiro
> Twitter: http://twitter.com/monilouise
