Hi,

> On 30.03.2017 at 14:25, Wouter De Borger <[email protected]> wrote:
> 
> Hi,
> 
> Thanks for the hint! I'll try to add some content there, as I can
> definitely use a garbage detector.
> 
> In this case, however, I was specifically trying to avoid using a
> statistical detector. PDFBox already knows there is a problem,

That is not the case here. From PDFBox's perspective everything is fine: it
extracts the text according to the definitions and information in the PDF.
To recognize that the result is garbage from a user's perspective, PDFBox
would have to 'understand' that the extracted text is not meaningful.
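
Since PDFBox cannot flag this itself, any check has to look at the extracted
text. Below is a minimal sketch of such a check in plain Java. The class
ExtractedTextCheck, its method names and the threshold are hypothetical and
not part of PDFBox. It only counts characters that rarely occur in real text
(U+FFFD, control, private-use and unassigned code points), so it will miss
garbage that consists of ordinary letters; that case needs a language-aware
detector of the kind discussed in TIKA-1443.

```java
// Hypothetical helper, not part of PDFBox: a crude check on already-extracted text.
final class ExtractedTextCheck {

    // Fraction of non-whitespace code points that rarely occur in real text:
    // U+FFFD, control characters, private-use, unassigned, or lone surrogates.
    // Returns 1.0 when there is no non-whitespace text at all.
    static double suspiciousRatio(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                continue; // layout only, says nothing about the encoding
            }
            total++;
            int type = Character.getType(cp);
            if (cp == 0xFFFD
                    || type == Character.CONTROL
                    || type == Character.PRIVATE_USE
                    || type == Character.UNASSIGNED
                    || type == Character.SURROGATE) {
                suspicious++;
            }
        }
        return total == 0 ? 1.0 : (double) suspicious / total;
    }

    // The threshold is an assumption to be tuned on real documents.
    static boolean shouldFallBackToOcr(String text, double threshold) {
        return suspiciousRatio(text) > threshold;
    }
}
```

The text passed in would be whatever PDFTextStripper.getText(document)
returned; a call such as ExtractedTextCheck.shouldFallBackToOcr(text, 0.1)
uses an arbitrary threshold of 10 percent. Empty output is treated as
suspicious as well, on the assumption that a page without any extractable
text is a candidate for OCR anyway.
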
BR
Maruan 

> so there is
> no need to examine the content to attempt to detect a problem.
> I would like to be able to capture the problem when and where it is known,
> as this is easier and more accurate.
> 
> Thanks,
> Wouter
> 
> On Thu, Mar 30, 2017 at 2:16 PM, Allison, Timothy B. <[email protected]>
> wrote:
> 
>> If you have any recommendations for the more general case, let us know on
>> TIKA-1443 [1].
>> 
>> [1] https://issues.apache.org/jira/browse/TIKA-1443
>> 
>> -----Original Message-----
>> From: Wouter De Borger [mailto:[email protected]]
>> Sent: Thursday, March 30, 2017 6:00 AM
>> To: [email protected]
>> Subject: Make PDFBox fail on bad pdf
>> 
>> Hi All,
>> 
>> When a PDF has a bad encoding, PDFBox produces garbage (as explained in the
>> FAQ https://pdfbox.apache.org/2.0/faq.html#gibberish).
>> 
>> Can I make PDFBox fail in this case instead of producing garbage?
>> 
>> (I'm working on a system that can also do OCR, so at the first sign of
>> trouble, I would like PDFBox to fail so that I can try OCR.)
>> 
>> Thanks,
>> Wouter
>> 
> 
> 
> 
> -- 
> Wouter De Borger, PhD
> Co-founder Inmanta
> www.inmanta.com
> Email: [email protected]


