Re: by: java.util.zip.DataFormatException: invalid distance too far back reported by Solr API

Jan Høydahl Thu, 31 Jan 2019 02:25:48 -0800

Hi

This is Apache Tika failing to parse a zip file, or possibly a zip-formatted 
office file.
You have to post the full stack trace (which you'll find in solr.log on the 
server side) if you want help in locating the source of the issue. You may be 
able to configure Tika.


Have you tried to specify ignoreTikaException=true on the request? See 
https://lucene.apache.org/solr/guide/7_6/uploading-data-with-solr-cell-using-apache-tika.html
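
A minimal sketch of such a request from Python, using only the standard 
library. The Solr host, the core name "mycore" and the helper function names 
are assumptions for illustration and are not taken from the original post; 
ignoreTikaException, literal.id and commit are documented Solr Cell request 
parameters.

```python
import urllib.parse
import urllib.request

# Assumed for illustration: host, port and core name are placeholders.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/mycore/update/extract"


def build_extract_url(doc_id, base_url=SOLR_EXTRACT_URL):
    """Build an extract-handler URL that asks Solr to skip Tika parse errors."""
    params = {
        "literal.id": doc_id,            # unique key for the new document
        "ignoreTikaException": "true",   # do not fail the request on parse errors
        "commit": "true",
    }
    return base_url + "?" + urllib.parse.urlencode(params)


def post_pdf(path, doc_id):
    """POST the raw PDF bytes to Solr Cell and return the HTTP status code."""
    with open(path, "rb") as fh:
        body = fh.read()
    request = urllib.request.Request(
        build_extract_url(doc_id),
        data=body,
        headers={"Content-Type": "application/pdf"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status
```

Note that with this flag a file Tika cannot parse is still indexed, but only 
with whatever metadata was available, so such documents may end up without 
body text.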

At the end of the day, it would be a much better architecture to parse the PDFs 
using a plain standalone Tika Server and then construct a Solr document in your 
Python code, which is then posted to Solr. The reason is that you get much 
better control over parse errors and over how metadata maps to your schema 
fields. You also don't want to overload Solr with all this work: a parser that 
crashes or gets stuck in an infinite loop can even bring down the whole Solr 
server.
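
A rough sketch of that approach, assuming a Tika Server on its default port 
9998 and a Solr core named "mycore". The URLs, the schema fields "title" and 
"content", the "dc:title" metadata key and all function names are hypothetical 
and would need adjusting to your setup.

```python
import json
import urllib.request

# Assumed for illustration: default Tika Server port, placeholder Solr core.
TIKA_URL = "http://localhost:9998"
SOLR_UPDATE_URL = "http://localhost:8983/solr/mycore/update?commit=true"


def tika_put(endpoint, data, accept):
    """Send raw file bytes to a Tika Server endpoint and return the body."""
    request = urllib.request.Request(
        TIKA_URL + endpoint, data=data, headers={"Accept": accept}, method="PUT"
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return response.read().decode("utf-8")


def build_solr_doc(doc_id, text, metadata):
    """Map extracted text and Tika metadata onto (hypothetical) schema fields."""
    return {
        "id": doc_id,
        "title": metadata.get("dc:title", ""),  # metadata key is an assumption
        "content": text.strip(),
    }


def index_pdf(path, doc_id):
    """Parse one PDF with Tika and post it to Solr; return False on failure."""
    with open(path, "rb") as fh:
        data = fh.read()
    try:
        text = tika_put("/tika", data, "text/plain")
        metadata = json.loads(tika_put("/meta", data, "application/json"))
    except (OSError, ValueError) as exc:
        # urllib's URLError/HTTPError and timeouts are OSError subclasses;
        # a malformed metadata response raises ValueError.
        print(f"Skipping {path}: {exc}")
        return False
    payload = json.dumps([build_solr_doc(doc_id, text, metadata)]).encode("utf-8")
    request = urllib.request.Request(
        SOLR_UPDATE_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status == 200
```

Here a file that Tika cannot parse is logged and skipped in the Python 
process, so the failure never reaches Solr, and the timeout bounds how long a 
stuck parser can hold up the upload loop.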

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 30 Jan 2019, at 20:49, Monique Monteiro <monique.lou...@gmail.com> wrote:
> 
> Hi all,
> 
> I'm writing a Python routine to upload thousands of PDF files to Solr, and
> after trying to upload some files, Solr reports the following error in an
> HTTP 500 response:
> 
> "by: java.util.zip.DataFormatException: invalid distance too far back"
> 
> Does anyone have any idea about how to overcome this?
> 
> Thanks in advance,
> Monique Monteiro
