[ 
https://issues.apache.org/jira/browse/TIKA-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16568483#comment-16568483
 ] 

Tim Allison commented on TIKA-2702:
-----------------------------------

Right, there is no guarantee, and no intent, that Tika extracts the same text 
as PDFBox or POI.  What we extract is driven by user feedback.  We (very 
gratefully) rely on the underlying libraries, but we extract what users and 
devs ask for.

If you'd like to turn off our parsers and swap in your own for those file 
types, see: https://tika.apache.org/1.18/parser_guide.html
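As a rough sketch of what that swap can look like in a tika-config.xml: the 
fragment below excludes Tika's own PDF parser from the default parser set and 
registers a replacement for application/pdf.  The class 
com.example.MyPdfParser is hypothetical (a parser you would write yourself 
and put on the classpath); check the element names against the configuration 
docs for your Tika version before relying on this.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Keep all default parsers except Tika's own PDF parser -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- Hypothetical replacement parser, assumed to be on the classpath -->
    <parser class="com.example.MyPdfParser">
      <mime>application/pdf</mime>
    </parser>
  </parsers>
</properties>
```

The same pattern should apply to the POI-backed parsers: exclude the Tika 
parser class and map the relevant mime types to your own parser.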

To see our parser code, see: 
https://github.com/apache/tika/tree/master/tika-parsers/src/main/java/org/apache/tika/parser

> Different behavior between TIKA and pdfbox
> ------------------------------------------
>
>                 Key: TIKA-2702
>                 URL: https://issues.apache.org/jira/browse/TIKA-2702
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 1.18
>            Reporter: Lior
>            Priority: Major
>
> As far as I understand, Tika uses PDFBox to extract text from PDF 
> files.
> In a side benchmark I'm running, the text I get from PDFBox 2.0.9 and the 
> text I get from Tika are not 100% the same. In most cases, when there is a 
> hyperlink inside the PDF file, PDFBox ignores the link itself, while Tika 
> extracts its text, for example:
> https://www.linkedin.com/in/jhonDo
> mailto:[[email protected] |mailto:[email protected]]
>  
> This is really a deal breaker for me, because I'm using PDFBox for another 
> process and I need the text to be the same, so I can't use Tika at the 
> moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
