[
https://issues.apache.org/jira/browse/TIKA-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16566533#comment-16566533
]
Lior commented on TIKA-2702:
----------------------------
I'm seeing a similar issue with a docx file:
Using Apache Tika I get text with bullets inside, but when I use just
POI to extract the text, I don't see any bullets.
So I guess Tika isn't doing a 1:1 match with those libraries. Where can I
see the additional processing that Tika applies to the text?
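To narrow down exactly which lines differ between the two outputs, a small line-level comparison can help. The following is only a minimal sketch using the plain JDK: `ExtractionDiff`, `normalize`, and `linesOnlyIn` are hypothetical names (not part of Tika, PDFBox, or POI), and the two sample strings are placeholders standing in for whatever each extractor actually returns.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper for comparing two text extractions; not a Tika, PDFBox, or POI API.
final class ExtractionDiff {

    // Trim and collapse whitespace runs so that pure layout differences are not reported.
    static String normalize(String line) {
        return line.trim().replaceAll("\\s+", " ");
    }

    // Returns the normalized, non-empty lines of `left` that do not occur anywhere in `right`.
    static List<String> linesOnlyIn(String left, String right) {
        Set<String> rightLines = new HashSet<>();
        for (String line : right.split("\\R")) {
            rightLines.add(normalize(line));
        }
        List<String> onlyLeft = new ArrayList<>();
        for (String line : left.split("\\R")) {
            String normalized = normalize(line);
            if (!normalized.isEmpty() && !rightLines.contains(normalized)) {
                onlyLeft.add(normalized);
            }
        }
        return onlyLeft;
    }

    public static void main(String[] args) {
        // Placeholder strings: substitute the text returned by each extractor.
        String fromTika = "John Doe\nSoftware Engineer\nhttps://www.linkedin.com/in/jhonDo\n";
        String fromOtherLibrary = "John Doe\nSoftware   Engineer\n";
        for (String line : linesOnlyIn(fromTika, fromOtherLibrary)) {
            System.out.println("only in first output: " + line);
        }
    }
}
```

Calling `linesOnlyIn` in both directions shows what each side adds. This only localizes the differences; it does not explain which processing step produced them.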
> Different behavior between TIKA and pdfbox
> ------------------------------------------
>
> Key: TIKA-2702
> URL: https://issues.apache.org/jira/browse/TIKA-2702
> Project: Tika
> Issue Type: Bug
> Components: app
> Affects Versions: 1.18
> Reporter: Lior
> Priority: Major
>
> As far as I understand, Tika uses PDFBox to extract text from PDF
> files.
> During a side benchmark, I noticed that the text I get from
> PDFBox 2.0.9 and the text I get from Tika are not 100% the same. In
> most cases, when there is a hyperlink inside the PDF file, PDFBox ignores
> the link itself, while Tika extracts it as text, for example:
> https://www.linkedin.com/in/jhonDo
> mailto:[[email protected] |mailto:[email protected]]
>
> This is really a deal breaker for me: I use PDFBox in another
> process and I need the text to be identical, so I can't use Tika at
> the moment.