RE: tikaparser docx file fails with exception

2015-11-06 Thread Allison, Timothy B.
To: solr-user Subject: Re: tikaparser docx file fails with exception It is quite clear actually that the problem is this: Caused by: java.io.CharConversionException: Characters larger than 4 bytes are not supported: byte 0xb7 implies a length of more than 4
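
For readers who want to see why a byte like 0xb7 trips a UTF-8 decoder, here is a minimal sketch using only the JDK. It does not reproduce the parser from the stack trace above, and the thread does not establish where in the docx the byte came from. The class name `Utf8Check` and its methods are hypothetical, introduced only for illustration. The point it shows: 0xb7 has the bit pattern 10xxxxxx, which is a UTF-8 continuation byte, so it cannot legally start a character, and a strict decoder rejects it when it appears on its own.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Solr or Tika.
final class Utf8Check {

    /** True if b has the bit pattern 10xxxxxx, i.e. a UTF-8 continuation byte. */
    static boolean isContinuationByte(int b) {
        return (b & 0xC0) == 0x80;
    }

    /** Strictly decodes data as UTF-8; returns false on any malformed sequence. */
    static boolean isValidUtf8(byte[] data) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A stray 0xB7 between two ASCII letters: not valid UTF-8.
        byte[] stray = {'a', (byte) 0xB7, 'b'};
        // U+00B7 (middle dot) properly encoded as the two bytes C2 B7.
        byte[] encoded = "a\u00B7b".getBytes(StandardCharsets.UTF_8);

        System.out.println("0xB7 is a continuation byte: " + isContinuationByte(0xB7));
        System.out.println("stray byte decodes cleanly:  " + isValidUtf8(stray));
        System.out.println("C2 B7 decodes cleanly:       " + isValidUtf8(encoded));
    }
}
```

Running a check like `isValidUtf8` over the text that was extracted from the failing document, before it is sent to Solr, is one way to confirm whether raw binary data is ending up in a field that is expected to hold UTF-8 text.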

Re: tikaparser docx file fails with exception

2015-11-05 Thread Alexandre Rafalovitch
> Sent: Wednesday, November 04, 2015 4:33 PM > To: solr-user > Subject: Re: tikaparser docx file fails with exception > > Possibly a corrupt file? Tika does its best, but bad data is...bad data. > > You can experiment a bit with using Tika in Java, that might give you a >

RE: tikaparser docx file fails with exception

2015-11-05 Thread Aswath Srinivasan (TMS)
o the document. -Original Message- From: Erick Erickson [mailto:erickerick...@gmail.com] Sent: Wednesday, November 04, 2015 4:33 PM To: solr-user Subject: Re: tikaparser docx file fails with exception Possibly a corrupt file? Tika does its best, but bad data is...bad data. You can exper

Re: tikaparser docx file fails with exception

2015-11-04 Thread Erick Erickson
Possibly a corrupt file? Tika does its best, but bad data is...bad data. You can experiment a bit with using Tika in Java, that might give you a better idea of what's really going on, here's a SolrJ example: https://lucidworks.com/blog/2012/02/14/indexing-with-solrj/ Best, Erick On Wed, Nov 4,

RE: tikaparser docx file fails with exception

2015-11-04 Thread Aswath Srinivasan (TMS)
Trying to index a document, a docx file, and ending up with the exception below. Not sure why it is erroring out. When I opened the docx I was able to see lots of binary data like embedded pictures etc. Is there a possible solution to this, or is it a bug? Only one such file fails. Rest of the file