I found that this solution from Tim Allison on Stack Overflow works:
http://stackoverflow.com/questions/32354209/apache-tika-extract-scanned-pdf-files
Regards,
Edwin
On 19 March 2017 at 19:47, Zheng Lin Edwin Yeo wrote:
> These are my settings in the PDFParser.properties file
> under tik
These are my settings in the PDFParser.properties file
under tika-parsers-1.13.jar:
enableAutoSpace true
extractAnnotationText true
sortByPosition false
suppressDuplicateOverlappingText false
extractAcroFormContent true
extractInlineImages true
extractUniqueInlineImagesOnly true
checkExtractAccessPer
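For anyone who wants to check these values programmatically, here is a rough sketch in plain Java. It only uses java.util.Properties, which accepts whitespace as the key/value separator, so the lines above load as-is. The class name PdfParserSettings and the helper isEnabled are made up for illustration and are not part of Tika. Only the seven complete settings quoted above are included. The assumption here, based on the linked Stack Overflow answer, is that extractInlineImages is the flag that matters most for scanned PDFs.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper class for illustration; not part of Tika.
public class PdfParserSettings {

    // The seven complete settings quoted in the message above.
    static final String SETTINGS = String.join("\n",
            "enableAutoSpace true",
            "extractAnnotationText true",
            "sortByPosition false",
            "suppressDuplicateOverlappingText false",
            "extractAcroFormContent true",
            "extractInlineImages true",
            "extractUniqueInlineImagesOnly true");

    // Parse properties-style text; whitespace separates key from value.
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Treat a missing key as disabled (an assumption of this sketch).
    static boolean isEnabled(Properties props, String key) {
        return Boolean.parseBoolean(props.getProperty(key, "false"));
    }

    public static void main(String[] args) {
        Properties props = load(SETTINGS);
        for (String key : props.stringPropertyNames()) {
            System.out.println(key + " = " + isEnabled(props, key));
        }
    }
}
```

Running main lists each setting with its parsed boolean value, which makes it easy to confirm that inline image extraction is actually switched on in the file you are editing.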
Hi Rick,
Thanks for your reply.
I saw this error message for the file that failed.
Can I index such files alongside the files that store text as an image,
in the same indexing threads?
2017-03-19 01:02:26.610 INFO (qtp1543727556-19) [c:collection1 s:shard1
r:co
Hi Edwin
The PDF file format can store text as an image, and then you need OCR to get
the text. However, text is more commonly not stored as an image in the PDF, and
then you should not use OCR to get the text.
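As a rough illustration of that distinction, here is a minimal sketch in Java. It assumes you have already run normal text extraction on the PDF and have the result as a string. The class OcrDecision and the method needsOcr are hypothetical names, and the rule itself (blank text layer means the page is probably an image) is only a heuristic, not something Tika or Solr does for you.

```java
// Hypothetical heuristic for illustration; not part of Tika or Solr.
public class OcrDecision {

    // A PDF whose ordinary text extraction yields nothing but whitespace
    // probably stores its text as an image, so OCR is worth trying.
    static boolean needsOcr(String extractedText) {
        return extractedText == null || extractedText.trim().isEmpty();
    }

    public static void main(String[] args) {
        System.out.println(needsOcr("Quarterly report, page 1")); // text layer present
        System.out.println(needsOcr("  \n\t"));                   // only whitespace
    }
}
```

A file that mixes a real text layer with scanned pages would slip past this check, so treat it as a first filter only.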
Do you get an error message when you have a failure?
Cheers -- Rick
On March 18, 201
Hi,
I'm facing an issue where Tesseract OCR occasionally fails to extract the
words from a PDF file attached to an EML file and index them into Solr.
Most of the time, however, the text is extracted.
What could cause the file in the email attachment to
fail to e