To: solr-user
Subject: Re: tikaparser docx file fails with exception
It is quite clear actually that the problem is this:
Caused by: java.io.CharConversionException: Characters larger than 4 bytes are
not supported: byte 0xb7 implies a length of more than 4
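
One rough way to narrow this down outside Solr is to check which parts of the docx fail strict UTF-8 decoding. The sketch below is only an illustration under some assumptions: that the file still opens as a zip container, that its XML parts are meant to be UTF-8 (UTF-16 parts would show up as false positives), and that the exception comes from one of those parts. The class and method names (DocxUtf8Check, isValidUtf8, invalidXmlParts) are hypothetical and are not part of Tika or Solr; only the JDK is used.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical diagnostic helper; not part of Tika or Solr.
class DocxUtf8Check {

    // Returns true if the bytes decode as strict UTF-8.
    // Malformed input is reported as an error instead of being replaced.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Lists the .xml and .rels entries of a zip container (such as a docx)
    // that do not decode as strict UTF-8. Assumes the parts are UTF-8 encoded.
    static List<String> invalidXmlParts(String path) throws IOException {
        List<String> bad = new ArrayList<>();
        try (ZipFile zip = new ZipFile(path)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                ZipEntry entry = entries.nextElement();
                String name = entry.getName();
                boolean xmlLike = name.endsWith(".xml") || name.endsWith(".rels");
                if (entry.isDirectory() || !xmlLike) {
                    continue;
                }
                try (InputStream in = zip.getInputStream(entry)) {
                    if (!isValidUtf8(in.readAllBytes())) {
                        bad.add(name);
                    }
                }
            }
        }
        return bad;
    }

    public static void main(String[] args) throws IOException {
        // Usage: java DocxUtf8Check.java path/to/file.docx
        for (String name : invalidXmlParts(args[0])) {
            System.out.println("Not valid UTF-8: " + name);
        }
    }
}
```

If the file does not open as a zip at all, ZipFile throws an exception, which would itself point to a damaged container rather than a single bad part.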
> Sent: Wednesday, November 04, 2015 4:33 PM
> To: solr-user
> Subject: Re: tikaparser docx file fails with exception
>
> Possibly a corrupt file? Tika does its best, but bad data is...bad data.
>
> You can experiment a bit with using Tika in Java, that might give you a
>
o the document.
-Original Message-
From: Erick Erickson [mailto:erickerick...@gmail.com]
Sent: Wednesday, November 04, 2015 4:33 PM
To: solr-user
Subject: Re: tikaparser docx file fails with exception
Possibly a corrupt file? Tika does its best, but bad data is...bad data.
You can experiment a bit with using Tika in Java, that might give you
a better idea of what's really going on, here's a SolrJ example:
https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/
Best,
Erick
On Wed, Nov 4,
I am trying to index a docx file and end up with the exception below.
I am not sure why it is erroring out. When I opened the docx I could see lots
of binary data, such as embedded pictures. Is there a possible solution to
this, or is it a bug? Only one such file fails. Rest of the file