[jira] [Commented] (TIKA-2702) Different behavior between TIKA and pdfbox

Tim Allison (JIRA) Wed, 01 Aug 2018 05:57:49 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2702?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16565254#comment-16565254
 ]


Tim Allison commented on TIKA-2702:
-----------------------------------

If you want to prevent extraction of hyperlinks, we could parameterize that in 
Tika to allow you to turn it off via tika-config.xml.

However, there will likely be many other differences between how Tika extracts 
text from PDFs and how PDFBox's text extractor does.  IIRC, PDFBox's extractor 
doesn't handle embedded files, and it may not handle bookmarks, annotations, or 
any number of other things that I can't remember at the moment.
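
Until such a switch exists, one possible workaround is to post-filter the 
extracted text before comparing the two outputs. The sketch below is only an 
illustration under stated assumptions: it treats any line consisting solely of 
an http(s) URL or a mailto: target as an extracted hyperlink and drops it. The 
class name LinkLineFilter and its methods are hypothetical and are not part of 
Tika or PDFBox; the sample strings stand in for real extractor output.

```java
import java.util.Arrays;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

public class LinkLineFilter {
    // Assumption: an extracted hyperlink shows up as a line holding only
    // an http(s) URL or a mailto: target, with no surrounding words.
    private static final Pattern LINK_ONLY =
            Pattern.compile("^\\s*(https?://|mailto:)\\S+\\s*$", Pattern.CASE_INSENSITIVE);

    /** Removes lines that consist solely of a link target. */
    public static String stripLinkLines(String text) {
        return Arrays.stream(text.split("\\R"))
                .filter(line -> !LINK_ONLY.matcher(line).matches())
                .collect(Collectors.joining("\n"));
    }

    /** Drops link-only lines, then collapses whitespace so layout differences are ignored. */
    public static String normalize(String text) {
        return stripLinkLines(text).replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        // Made-up sample text, not real extractor output.
        String tikaText = "Jane Doe\nhttps://www.example.com/in/janedoe\nSoftware Engineer\n";
        String pdfboxText = "Jane Doe\nSoftware Engineer\n";
        System.out.println(normalize(tikaText).equals(normalize(pdfboxText))); // prints true
    }
}
```

Note that this heuristic would also drop a URL that is visibly printed on its 
own line in the PDF, which PDFBox would keep, so it narrows the gap rather than 
guaranteeing identical output.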

Can you help me understand why you need to have an exact 1:1 match?

> Different behavior between TIKA and pdfbox
> ------------------------------------------
>
>                 Key: TIKA-2702
>                 URL: https://issues.apache.org/jira/browse/TIKA-2702
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 1.18
>            Reporter: Lior
>            Priority: Major
>
> As far as I understand, TIKA uses pdfbox to extract text from pdf files.
> During a side benchmark, I'm seeing that the text I get from PDFBox 2.0.9 
> and the text I get from TIKA are not 100% the same. In most cases, when 
> there is a hyperlink inside the pdf file, pdfbox ignores the link itself, 
> while TIKA extracts its text, for example:
> https://www.linkedin.com/in/jhonDo
> mailto:[[email protected] |mailto:[email protected]]
>  
> This is really a deal breaker for me, because I'm using pdfbox for another 
> process I'm doing and I need the text to be the same, so I can't use TIKA at 
> the moment....



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
