> At the end of the day it would be a much better architecture to parse the
> PDFs using plain standalone TikaServer

+1

Also, note that we added a -spawnChild switch to tika-server that will run the server in a child process and kill and restart the child process if there is an infinite loop, OOM, segfault, etc. Your client will need to handle tika-server being down for a second or two during restarts, and/or a 503 response while it is shutting down.

> In fact, moving the parsing to the client solved the problem!

Yay!

On Mon, Feb 4, 2019 at 2:01 PM Monique Monteiro <monique.lou...@gmail.com> wrote:
>
> Hi all,
>
> In fact, moving the parsing to the client solved the problem!
>
> Thanks!
> Monique
>
> On Thu, Jan 31, 2019 at 8:25 AM Jan Høydahl <jan....@cominvent.com> wrote:
>
> > Hi
> >
> > This is Apache Tika that cannot parse a zip file or possibly a zip
> > formatted office file.
> > You have to post the full stack trace (which you'll find in the solr.log
> > on server side)
> > if you want help in locating the source of the issue, you may be able to
> > configure Tika
> >
> > Have you tried to specify ignoreTikaException=true on the request? See
> > https://lucene.apache.org/solr/guide/7_6/uploading-data-with-solr-cell-using-apache-tika.html
> >
> > At the end of the day it would be a much better architecture to parse the
> > PDFs using plain standalone TikaServer and then construct a Solr Document
> > in your Python code which is then posted to Solr. The reason is that you
> > have much better control over parse errors and how to map metadata to your
> > schema fields. Also, you don't want to overload Solr with all this work; it
> > can even crash the whole Solr server if some parser crashes or gets stuck
> > in an infinite loop.
> >
> > --
> > Jan Høydahl, search solution architect
> > Cominvent AS - www.cominvent.com
> >
> > > On 30 Jan 2019, at 20:49, Monique Monteiro <monique.lou...@gmail.com> wrote:
> > >
> > > Hi all,
> > >
> > > I'm writing a Python routine to upload thousands of PDF files to Solr, and
> > > after trying to upload some files, Solr reports the following error in an
> > > HTTP 500 response:
> > >
> > > "by: java.util.zip.DataFormatException: invalid distance too far back"
> > >
> > > Does anyone have any idea about how to overcome this?
> > >
> > > Thanks in advance,
> > > Monique Monteiro
> > >
> >
> --
> Monique Monteiro
> Twitter: http://twitter.com/monilouise
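
For anyone finding this thread later, here is a rough Python sketch of the client-side approach discussed above: send each PDF to a standalone tika-server, retry while the server is down or answering 503 (as can happen during a -spawnChild restart), then build the Solr document yourself and post it. The host names, ports, the "docs" collection, the "content_txt" field and all function names are assumptions for illustration only, not anything taken from the original poster's setup.

```python
import json
import time
import urllib.error
import urllib.request

# Assumptions: tika-server on its default port, and a Solr collection named "docs".
TIKA_URL = "http://localhost:9998/tika"
SOLR_UPDATE_URL = "http://localhost:8983/solr/docs/update?commit=true"


def call_with_retries(send, attempts=5, delay=2.0, sleep=time.sleep):
    """Call send() until it succeeds, retrying on connection errors and HTTP 503."""
    for attempt in range(1, attempts + 1):
        try:
            return send()
        except urllib.error.HTTPError as err:
            # Only 503 is retried; other statuses (e.g. a parse failure) are raised.
            if err.code != 503 or attempt == attempts:
                raise
        except (urllib.error.URLError, ConnectionError):
            # Connection refused or dropped: the server may be restarting.
            if attempt == attempts:
                raise
        sleep(delay)


def extract_text(pdf_bytes):
    """PUT the raw file to tika-server and return the extracted plain text."""
    request = urllib.request.Request(
        TIKA_URL, data=pdf_bytes, method="PUT", headers={"Accept": "text/plain"}
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        return response.read().decode("utf-8")


def build_solr_doc(doc_id, text):
    """Map extracted text onto schema fields (field names are hypothetical)."""
    return {"id": doc_id, "content_txt": text}


def post_to_solr(doc):
    """Send one document to Solr's JSON update handler."""
    request = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=json.dumps([doc]).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.status


def index_pdf(path):
    """Parse one PDF via tika-server (with retries) and index it in Solr."""
    with open(path, "rb") as handle:
        pdf_bytes = handle.read()
    text = call_with_retries(lambda: extract_text(pdf_bytes))
    return post_to_solr(build_solr_doc(path, text))
```

Because the parse happens outside Solr, a file that fails in Tika surfaces as an ordinary exception in the client, which can log it and move on to the next PDF instead of taking the Solr server down with it.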