[ 
https://issues.apache.org/jira/browse/TIKA-1963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15265657#comment-15265657
 ] 

Konstantin Avdeev commented on TIKA-1963:
-----------------------------------------

> you can't OCR a PDF
I believe I can understand that :)

Guys, a simple question again: is it possible, with the current implementation, to 
configure the toolkit to enable Tesseract for PDF only? If not, are there any 
plans to make the "high degree of control" even higher?
Thanks a lot!


> Configuring Parsers: "high degree of control over which parsers are or aren't 
> used" does not work
> -------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-1963
>                 URL: https://issues.apache.org/jira/browse/TIKA-1963
>             Project: Tika
>          Issue Type: Bug
>          Components: config
>    Affects Versions: 1.12
>         Environment: windows, java version "1.8.0_73", 64 bit
>            Reporter: Konstantin Avdeev
>
> Hi everybody!
> I'm trying to white-list a particular mime-type for OCR with the following 
> config:
> {code}
> <properties>
>   <parsers>
>     <parser class="org.apache.tika.parser.DefaultParser">
>       <mime-exclude>application/pdf</mime-exclude>
>       <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
>     </parser>
>     <parser class="org.apache.tika.parser.pdf.PDFParser">
>       <mime>application/pdf</mime>
>     </parser>
>   </parsers>
> </properties>
> {code}
> So the idea is to enable the Tesseract parser for the PDF format only.
> But this configuration disables Tesseract completely.
> Is this the expected behaviour or a bug?
> Thank you!
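
To make explicit what the config quoted above declares, here is a minimal sketch in plain JDK Java. It does not use Tika's own config loader: the class name TikaConfigSummary and the summarize helper are made up for illustration, and the code only reads the XML and lists each top-level parser with its mime, mime-exclude and parser-exclude entries.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of Tika: it only inspects the XML text.
class TikaConfigSummary {

    // The configuration from the issue description, unchanged.
    static final String CONFIG =
        "<properties>\n"
        + "  <parsers>\n"
        + "    <parser class=\"org.apache.tika.parser.DefaultParser\">\n"
        + "      <mime-exclude>application/pdf</mime-exclude>\n"
        + "      <parser-exclude class=\"org.apache.tika.parser.ocr.TesseractOCRParser\"/>\n"
        + "    </parser>\n"
        + "    <parser class=\"org.apache.tika.parser.pdf.PDFParser\">\n"
        + "      <mime>application/pdf</mime>\n"
        + "    </parser>\n"
        + "  </parsers>\n"
        + "</properties>\n";

    // Returns one line per top-level <parser>, followed by its child settings.
    static List<String> summarize(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("config is not well-formed XML", e);
        }
        Element parsers = (Element) doc.getElementsByTagName("parsers").item(0);
        List<String> lines = new ArrayList<>();
        for (Node n = parsers.getFirstChild(); n != null; n = n.getNextSibling()) {
            // Skip whitespace text nodes and anything that is not a <parser>.
            if (!(n instanceof Element) || !"parser".equals(n.getNodeName())) {
                continue;
            }
            Element parser = (Element) n;
            StringBuilder sb = new StringBuilder(parser.getAttribute("class"));
            for (Node c = parser.getFirstChild(); c != null; c = c.getNextSibling()) {
                if (!(c instanceof Element)) {
                    continue;
                }
                Element child = (Element) c;
                // <parser-exclude> carries a class attribute; <mime> and <mime-exclude> carry text.
                String value = child.hasAttribute("class")
                    ? child.getAttribute("class")
                    : child.getTextContent().trim();
                sb.append(" | ").append(child.getNodeName()).append('=').append(value);
            }
            lines.add(sb.toString());
        }
        return lines;
    }

    public static void main(String[] args) {
        summarize(CONFIG).forEach(System.out::println);
    }
}
```

Read this way, TesseractOCRParser appears only as an exclusion under DefaultParser and is not registered for any type elsewhere in the file, which is at least consistent with OCR never running under this config.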



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
