An option that doesn't require changing PDFBox would be to create a custom log appender and capture the org.apache.pdfbox.pdmodel.font.PDSimpleFont log messages. You could then count them after the extraction and, if the count is above a certain threshold, discard the result of the text extraction.
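A rough sketch of that idea, not a definitive implementation. It assumes PDFBox's Commons Logging output ends up in java.util.logging, which is the fallback when no other backend is on the classpath; with Logback or Log4j you would write the equivalent appender for that framework instead. The class name UnicodeWarningCounter and the threshold are made up for illustration, and the warnings are simulated here so the example runs without a PDF.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

// Hypothetical helper: counts "No Unicode mapping" warnings on the font logger.
class UnicodeWarningCounter extends Handler {
    // Logger name taken from the log lines in this thread.
    static final String FONT_LOGGER = "org.apache.pdfbox.pdmodel.font.PDSimpleFont";

    private final AtomicInteger count = new AtomicInteger();

    @Override
    public void publish(LogRecord record) {
        String message = record.getMessage();
        boolean isWarning = record.getLevel().intValue() >= Level.WARNING.intValue();
        if (isWarning && message != null && message.startsWith("No Unicode mapping")) {
            count.incrementAndGet();
        }
    }

    @Override
    public void flush() {
        // Nothing is buffered.
    }

    @Override
    public void close() {
        // No resources to release.
    }

    int getCount() {
        return count.get();
    }

    public static void main(String[] args) {
        // Keep a strong reference so the logger is not garbage collected.
        Logger fontLogger = Logger.getLogger(FONT_LOGGER);
        UnicodeWarningCounter counter = new UnicodeWarningCounter();
        fontLogger.addHandler(counter);
        try {
            // In real code the text extraction would run here, e.g.
            // new PDFTextStripper().getText(document).
            // Simulated warnings stand in for it in this sketch:
            fontLogger.warning("No Unicode mapping for c49 (86) in font null");
            fontLogger.warning("No Unicode mapping for c103 (87) in font null");
        } finally {
            fontLogger.removeHandler(counter);
        }

        int threshold = 1; // assumed value, tune it for your documents
        if (counter.getCount() > threshold) {
            System.out.println("Too many unmapped characters, falling back to OCR");
        } else {
            System.out.println("Text extraction looks usable");
        }
    }
}
```

Note that the handler sees warnings from every thread, so if several documents are extracted concurrently the counts will be mixed; in that case you would need to key the counter by thread or by document.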
> On 30.03.2017 at 14:54, Wouter De Borger <[email protected]> wrote:
>
> Oh, sorry, my bad.
>
> The log lines are:
>
> 2017-46-30 14:46:04.788 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c49 (86) in font null
> 2017-46-30 14:46:04.788 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c103 (87) in font null
> 2017-46-30 14:46:04.789 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c59 (88) in font null
> 2017-46-30 14:46:04.792 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c86 (89) in font null
> 2017-46-30 14:46:04.792 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c122 (90) in font null
> 2017-46-30 14:46:04.795 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c174 (32) in font null
> 2017-46-30 14:46:04.795 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c104 (33) in font null
> 2017-46-30 14:46:04.795 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c231 (34) in font null
> 2017-46-30 14:46:04.796 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c175 (35) in font null
> 2017-46-30 14:46:04.796 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c99 (36) in font null
> 2017-46-30 14:46:04.796 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c98 (37) in font null
> 2017-46-30 14:46:04.802 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c76 (32) in font null
> 2017-46-30 14:46:04.802 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c101 (33) in font null
> 2017-46-30 14:46:04.803 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c114 (34) in font null
> 2017-46-30 14:46:04.803 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c109 (35) in font null
> 2017-46-30 14:46:04.803 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c98 (36) in font null
> 2017-46-30 14:46:04.803 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c111 (37) in font null
> 2017-46-30 14:46:04.804 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c117 (38) in font null
> 2017-46-30 14:46:04.804 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c115 (39) in font null
> 2017-46-30 14:46:04.804 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c110 (40) in font null
> 2017-46-30 14:46:04.804 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c116 (41) in font null
> 2017-46-30 14:46:04.805 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c97 (42) in font null
> 2017-46-30 14:46:04.805 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c108 (43) in font null
> 2017-46-30 14:46:04.805 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c100 (44) in font null
> 2017-46-30 14:46:04.805 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c39 (45) in font null
>
> I can't forward the PDF, as it contains banking information. I can try to
> get permission to pass it on, but I don't have much hope.
> The PDF is quite weird. It renders well, but pdftotext and Chrome are also
> unable to get meaningful text out of it.
> The PDF was created with PDF Converter 3.0.
>
> The header/footer of the PDF are extracted somewhat OK, but the body looks
> like this
>
> Wouter
>
> On Thu, Mar 30, 2017 at 2:42 PM, Maruan Sahyoun <[email protected]>
> wrote:
>
>>
>>> On 30.03.2017 at 14:37, Wouter De Borger <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> Well, PDFBox does know it can't decode the Unicode characters (as it
>>> outputs a stream of warnings). It would be nice if I could ask PDFBox how
>>> many undecodable characters a document has.
>>
>> Well, that's something you didn't mention before. Could you drop some of
>> the messages here so we know which one you are talking about?
>>
>> BR
>> Maruan
>>
>>>
>>> Wouter
>>>
>>> On Thu, Mar 30, 2017 at 2:29 PM, Maruan Sahyoun <[email protected]>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>>> On 30.03.2017 at 14:25, Wouter De Borger <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for the hint! I'll try to add some content there, as I can
>>>>> definitely use a garbage detector.
>>>>>
>>>>> In this case, however, I was specifically trying to avoid using a
>>>>> statistical detector. PDFBox already knows there is a problem,
>>>>
>>>> That is not the case here. From PDFBox's perspective everything is fine.
>>>> It's extracting the text according to the definition and information in the
>>>> PDF. That this is garbage from a user's perspective would mean that PDFBox
>>>> 'understands' that the extracted text is not meaningful.
>>>> BR
>>>> Maruan
>>>>
>>>>> so there is
>>>>> no need to examine the content to attempt to detect a problem.
>>>>> I would like to be able to capture the problem when and where it is known,
>>>>> as this is easier and more accurate.
>>>>>
>>>>> Thanks,
>>>>> Wouter
>>>>>
>>>>> On Thu, Mar 30, 2017 at 2:16 PM, Allison, Timothy B. <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> If you have any recommendations for the more general case, let us know on
>>>>>> TIKA-1443 [1].
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/TIKA-1443
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Wouter De Borger [mailto:[email protected]]
>>>>>> Sent: Thursday, March 30, 2017 6:00 AM
>>>>>> To: [email protected]
>>>>>> Subject: Make PDFBox fail on bad pdf
>>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> When a PDF has bad encoding, PDFBox produces garbage (as explained in the
>>>>>> FAQ https://pdfbox.apache.org/2.0/faq.html#gibberish).
>>>>>>
>>>>>> Can I make PDFBox fail in this case instead of producing garbage?
>>>>>>
>>>>>> (I'm working on a system that can also do OCR, so at the least sign of
>>>>>> trouble, I would like PDFBox to fail and try OCR.)
>>>>>>
>>>>>> Thanks,
>>>>>> Wouter
>>>>>>
>>>>>
>>>>> --
>>>>> Wouter De Borger, PhD
>>>>> Co-founder Inmanta
>>>>> www.inmanta.com
>>>>> Email: [email protected]
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]

