Re: OCR not working occasionally

2017-03-27 Thread Zheng Lin Edwin Yeo
I have found that this solution from Tim Allison on Stack Overflow works: http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files Regards, Edwin On 19 March 2017 at 19:47, Zheng Lin Edwin Yeo wrote: > These are my settings in the PDFParser.properties file > under tik

Re: OCR not working occasionally

2017-03-19 Thread Zheng Lin Edwin Yeo
These are my settings in the PDFParser.properties file under tika-parsers-1.13.jar: enableAutoSpace true extractAnnotationText true sortByPosition false suppressDuplicateOverlappingText false extractAcroFormContent true extractInlineImages true extractUniqueInlineImagesOnly true checkExtractAccessPer
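
For anyone wanting to double-check a settings file like the one above, here is a minimal Python sketch. It assumes the file holds one whitespace-separated key/value pair per line (real Java .properties files also allow "=" and ":" separators, which this sketch does not handle). The function name parse_settings and the sample text are hypothetical; the sample uses only the values quoted in the message. As far as I understand, extractInlineImages has to be enabled in Tika 1.13 before embedded page images can be passed on to Tesseract, but treat that as an assumption to verify against your own setup.

```python
def parse_settings(text: str) -> dict:
    """Parse 'key value' lines into a dict, skipping blanks and # comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first run of whitespace only: key, then the rest.
        parts = line.split(None, 1)
        key = parts[0]
        value = parts[1].strip() if len(parts) > 1 else ""
        settings[key] = value
    return settings


# Sample built from the settings quoted in the message (illustrative only).
sample = """
enableAutoSpace true
extractAnnotationText true
sortByPosition false
suppressDuplicateOverlappingText false
extractAcroFormContent true
extractInlineImages true
extractUniqueInlineImagesOnly true
"""

settings = parse_settings(sample)
# Assumption: inline images must be extracted for OCR to see scanned pages.
ocr_ready = settings.get("extractInlineImages") == "true"
print("extractInlineImages enabled:", ocr_ready)
```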

Re: OCR not working occasionally

2017-03-18 Thread Zheng Lin Edwin Yeo
Hi Rick, Thanks for your reply. I saw this error message for the file that failed. Can I index such files alongside the other files that store text as an image, in the same indexing threads? 2017-03-19 01:02:26.610 INFO (qtp1543727556-19) [c:collection1 s:shard1 r:co

Re: OCR not working occasionally

2017-03-18 Thread Rick Leir
Hi Edwin, The PDF file format can store text as an image, in which case you need OCR to get the text. More commonly, however, text is not stored as an image in the PDF, and then you should not use OCR to get it. Do you get an error message when you have a failure? Cheers -- Rick On March 18, 201
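
The distinction Rick describes could be sketched roughly as follows in Python. This is only an illustration of the "use the text layer if there is one, otherwise fall back to OCR" idea, not how Tika or Solr actually decide. The callables extract_embedded_text and run_ocr are hypothetical placeholders for whatever extraction and OCR steps you use, and the 20-character threshold is an arbitrary assumption.

```python
from typing import Callable, Tuple


def extract_pdf_text(
    extract_embedded_text: Callable[[], str],
    run_ocr: Callable[[], str],
    min_chars: int = 20,
) -> Tuple[str, str]:
    """Return (text, method), where method is 'embedded' or 'ocr'.

    The embedded text layer is tried first; OCR is used only when that
    layer yields fewer than min_chars characters (assumed threshold).
    """
    text = extract_embedded_text().strip()
    if len(text) >= min_chars:
        return text, "embedded"
    # Little or no embedded text: the page is probably a scanned image.
    return run_ocr().strip(), "ocr"


# Usage with stand-in callables (hypothetical; no real PDF is read here).
text, method = extract_pdf_text(
    lambda: "",                      # pretend the PDF has no text layer
    lambda: "words read by OCR",     # pretend OCR output
)
print(method, "->", text)
```

Logging which method was used for each file may also help narrow down which attachments are the ones that occasionally fail.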

OCR not working occasionally

2017-03-18 Thread Zheng Lin Edwin Yeo
Hi, I'm facing an issue where Tesseract OCR is occasionally unable to extract the words in a PDF file attached to an EML file and index them into Solr. Most of the time, however, the text is extracted. What could be the reason that causes the file in the email attachment to be failed to e