Hi,

Thanks for the hint! I'll try to add some content there, as I can definitely use a garbage detector.

In this case, however, I was specifically trying to avoid using a statistical detector. PDFBox already knows there is a problem, so there is no need to examine the content to attempt to detect a problem. I would like to be able to capture the problem when and where it is known, as this is easier and more accurate.

Thanks,
Wouter

On Thu, Mar 30, 2017 at 2:16 PM, Allison, Timothy B. <[email protected]> wrote:

> If you have any recommendations for the more general case, let us know on
> TIKA-1443 [1].
>
> [1] https://issues.apache.org/jira/browse/TIKA-1443
>
> -----Original Message-----
> From: Wouter De Borger [mailto:[email protected]]
> Sent: Thursday, March 30, 2017 6:00 AM
> To: [email protected]
> Subject: Make PDFBox fail on bad pdf
>
> Hi All,
>
> When a pdf has bad encoding, PDFBox produces garbage (as explained in the
> FAQ https://pdfbox.apache.org/2.0/faq.html#gibberish).
>
> Can I make PDFBox fail in this case instead of producing garbage?
>
> (I'm working on a system that can also do OCR, so at the least sign of
> trouble, I would like PDF box to fail and try OCR.)
>
> Thanks,
> Wouter
>

--
Wouter De Borger, PhD
Co-founder
Inmanta
www.inmanta.com
Email: [email protected]

