[ https://issues.apache.org/jira/browse/TIKA-3102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17106520#comment-17106520 ]
Tilman Hausherr commented on TIKA-3102:
---------------------------------------
Yes, you should use Tika's OCR feature (which uses Tesseract) to extract this
document. IIRC, OCR can be made conditional, i.e. it kicks in only when too
many characters fail to decode.
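
To illustrate the "only when too many characters don't get decoded" idea, here is a minimal Java sketch of such a check. It is not Tika's or PDFBox's implementation. The class name, the method names, and the threshold are hypothetical, and it assumes that undecodable glyphs show up in the extracted text as control characters (such as codes 30 and 31 from the log), U+FFFD, or private-use code points.

```java
/** Hypothetical helper for illustration; not part of Tika or PDFBox. */
final class OcrFallbackHeuristic {

    private OcrFallbackHeuristic() {
    }

    /**
     * Returns the fraction of non-layout characters that look undecoded.
     * Space, tab, CR and LF are skipped. Note that Character.isWhitespace
     * is deliberately not used, because it treats U+001C..U+001F as whitespace.
     */
    static double undecodedRatio(String text) {
        int total = 0;
        int suspect = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (cp == ' ' || cp == '\t' || cp == '\n' || cp == '\r') {
                continue; // layout characters carry no signal
            }
            total++;
            boolean undecoded = cp == 0xFFFD                       // replacement character
                    || Character.isISOControl(cp)                  // e.g. codes 30 and 31
                    || Character.getType(cp) == Character.PRIVATE_USE;
            if (undecoded) {
                suspect++;
            }
        }
        // Empty text yields 0.0; a real pipeline may want to treat that case separately.
        return total == 0 ? 0.0 : (double) suspect / total;
    }

    /** True when the share of suspect characters exceeds the (assumed) threshold. */
    static boolean shouldRunOcr(String extractedText, double maxUndecodedRatio) {
        return undecodedRatio(extractedText) > maxUndecodedRatio;
    }

    public static void main(String[] args) {
        String garbled = "\u001E\u001F\u001E\u001F ab";
        System.out.println(shouldRunOcr(garbled, 0.25));
        System.out.println(shouldRunOcr("Annual Report 2018", 0.25));
    }
}
```

A caller would run normal text extraction first, pass the result to shouldRunOcr, and only hand the page to the OCR path when it returns true. The 0.25 threshold above is an arbitrary example value.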
> Unmappable chars for PDF
> ------------------------
>
> Key: TIKA-3102
> URL: https://issues.apache.org/jira/browse/TIKA-3102
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.21, 1.24
> Reporter: Henning Vogt
> Priority: Major
>
> Parsing this PDF produces almost no output, even though the PDF is perfectly
> readable in Adobe PDF reader. If you select part of the text in Adobe PDF
> reader and copy it, the result is also byte garbage, like the output from Tika.
> PDF File:
> [https://s21.q4cdn.com/317678438/files/doc_financials/Annual/2018/2018-Financial-Report.pdf]
> I can't attach the file here directly, sorry.
> Error log:
> {code:java}
> 2020-05-13 08:25:53,008 WARN [pool-6-thread-1] o.a.pdfbox.pdmodel.font.PDSimpleFont No Unicode mapping for MT55 (30) in font JPGAGL+Helvetica-Bold
> 2020-05-13 08:25:53,008 WARN [pool-6-thread-1] o.a.pdfbox.pdmodel.font.PDSimpleFont No Unicode mapping for MT79 (31) in font JPGAGL+Helvetica-Bold
> .
> .
> . (goes on for ~6500 lines, also with a different font){code}
>