An option that doesn't require changing PDFBox would be to create a custom log appender and capture the org.apache.pdfbox.pdmodel.font.PDSimpleFont log messages. You could then count them afterwards and, if the count is above a certain threshold, discard the result of the text extraction.
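
A rough sketch of that idea, under some assumptions: PDFBox 2.0 logs through Apache Commons Logging, and here I assume the messages end up in java.util.logging (the usual fallback when no other backend is on the classpath). With Logback or Log4j you would write the equivalent appender or filter instead. The class name UnicodeWarningCounter, the exceeds() helper and the threshold are made up for illustration, and the warnings are simulated rather than produced by a real extraction.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Counts "No Unicode mapping" warnings logged under the PDSimpleFont logger name. */
class UnicodeWarningCounter extends Handler {
    // Logger name as it appears in the WARN lines quoted in this thread.
    static final String FONT_LOGGER = "org.apache.pdfbox.pdmodel.font.PDSimpleFont";

    private final AtomicInteger count = new AtomicInteger();

    @Override
    public void publish(LogRecord record) {
        String msg = record.getMessage();
        // Only count warnings (or worse) whose text matches the message seen in the logs.
        if (record.getLevel().intValue() >= Level.WARNING.intValue()
                && msg != null && msg.startsWith("No Unicode mapping")) {
            count.incrementAndGet();
        }
    }

    @Override public void flush() { }
    @Override public void close() { }

    public int getCount() { return count.get(); }

    /** Hypothetical policy: treat the extraction as unusable above this many warnings. */
    public boolean exceeds(int threshold) { return count.get() > threshold; }

    public static void main(String[] args) {
        Logger fontLogger = Logger.getLogger(FONT_LOGGER);
        UnicodeWarningCounter counter = new UnicodeWarningCounter();
        fontLogger.addHandler(counter);
        try {
            // Real code would run the text extraction here; two warnings are simulated instead.
            fontLogger.warning("No Unicode mapping for c49 (86) in font null");
            fontLogger.warning("No Unicode mapping for c103 (87) in font null");
        } finally {
            fontLogger.removeHandler(counter);
        }
        System.out.println("warnings=" + counter.getCount()
                + " fallBackToOcr=" + counter.exceeds(1));
    }
}
```

Note that the handler sees warnings from every thread that logs under that name, so if several documents are processed in parallel you would need to attribute the records per thread or per document before applying the threshold.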

> On 30.03.2017 at 14:54, Wouter De Borger <[email protected]> wrote:
>
> Oh, sorry, my bad.
>
> The log lines are:
>
> 2017-46-30 14:46:04.788 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c49 (86) in font null
> 2017-46-30 14:46:04.788 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c103 (87) in font null
> 2017-46-30 14:46:04.789 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c59 (88) in font null
> 2017-46-30 14:46:04.792 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c86 (89) in font null
> 2017-46-30 14:46:04.792 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c122 (90) in font null
> 2017-46-30 14:46:04.795 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c174 (32) in font null
> 2017-46-30 14:46:04.795 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c104 (33) in font null
> 2017-46-30 14:46:04.795 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c231 (34) in font null
> 2017-46-30 14:46:04.796 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c175 (35) in font null
> 2017-46-30 14:46:04.796 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c99 (36) in font null
> 2017-46-30 14:46:04.796 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c98 (37) in font null
> 2017-46-30 14:46:04.802 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c76 (32) in font null
> 2017-46-30 14:46:04.802 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c101 (33) in font null
> 2017-46-30 14:46:04.803 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c114 (34) in font null
> 2017-46-30 14:46:04.803 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c109 (35) in font null
> 2017-46-30 14:46:04.803 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c98 (36) in font null
> 2017-46-30 14:46:04.803 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c111 (37) in font null
> 2017-46-30 14:46:04.804 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c117 (38) in font null
> 2017-46-30 14:46:04.804 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c115 (39) in font null
> 2017-46-30 14:46:04.804 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c110 (40) in font null
> 2017-46-30 14:46:04.804 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c116 (41) in font null
> 2017-46-30 14:46:04.805 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c97 (42) in font null
> 2017-46-30 14:46:04.805 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c108 (43) in font null
> 2017-46-30 14:46:04.805 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c100 (44) in font null
> 2017-46-30 14:46:04.805 WARN --- [DefaultMessageListenerContainer-1] org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325) : No Unicode mapping for c39 (45) in font null
>
> I can't forward the PDF, as it contains banking information. I can try to
> get permission to pass it on, but I don't have much hope.
> The PDF is quite weird. It renders well, but pdftotext and Chrome are also
> unable to get meaningful text out of it.
> The PDF was created with PDF Converter 3.0.
>
> The header/footer of the PDF are extracted somewhat OK, but the body looks
> like this:
>
> "#$%&'(
> )*+$ ,% -,+$#$' .% /!&# $"0!'1%' 2&% 34 5"#&'+"6% %#7 .$#-!#8% 9 /!&#
> 6!"#%"7$' &"% +/+"6% #&'
> /!7'% 6!"7'+7 .*+##&'+"6% ": ;;<=>?;;@AA=><B
> C%# 6!".$7$!"# %" #%'+$%"7 "!7+11%"7 ,%# #&$/+"7%# D
> !"7+"7 "!1$"+,
> D >BE;;(;; FGH
> ... I&'8%
> D ? +"#
> J+&K
> D ?(L?; M ,*+"
> N$"78'O7# -+P+Q,%# +"7$6$-+7$/%1%"7 -+' -8'$!.% .*&" +"R
> S$ ,%# 1!.+,$78# .% 6%77% +/+"6% .8/%,!--8%# .+"# 6% 6!&''$%' /!&#
> 6!"/$%""%"7( T% /!&# $"/$7% 9
> 1% '%"/!P%' ,%# -$U6%# #&$/+"7%# +/+"7 ,% @; +/'$, A;V? D
> .. ,% .!&Q,% .% ,+ -'8#%"7% ,%77'%( .+78 %7 #$W"8X
> /!7'% 6!"7'+7 .*+##&'+"6% %7 7!&# ,%# +/%"+"7#( 2&$ #%'!"7 6!"#%'/8# -+' 34
> 5"#&'+"6% -%".+"7
> ,+ .&'8% .% ,*+/+"6%B
> IU# '86%-7$!" .% 6%# .!6&1%"7#( ,% 1!"7+"7 "!1$"+, .% ,*+/+"6%( .$1$"&8 .%#
> $"78'O7# +"7$6$-8#
> -!&' ,+ -'%1$U'% -8'$!.%( /!&# #%'+ -+P8B
> C+ .+7% .% -'$#% .% 6!&'# .% /!7'% +/+"6% #&' 6!"7'+7 %#7 0$K8% +& V%' .&
> 1!$# .% ,+ '86%-7$!" -+'
> 34 5"#&'+"6% .& .!&Q,% #$W"8 .% ,+ -'8#%"7%B C%#
>
> Wouter
>
> On Thu, Mar 30, 2017 at 2:42 PM, Maruan Sahyoun <[email protected]> wrote:
>
>>
>>> On 30.03.2017 at 14:37, Wouter De Borger <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> Well, PDFBox does know it can't decode the Unicode characters (as it
>>> outputs a stream of warnings). It would be nice if I could ask PDFBox how
>>> many undecodable characters a document has.
>>
>> Well, that's something you didn't mention before. Could you drop some of
>> the messages here so we know which one you are talking about?
>>
>> BR
>> Maruan
>>
>>>
>>> Wouter
>>>
>>> On Thu, Mar 30, 2017 at 2:29 PM, Maruan Sahyoun <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>>> On 30.03.2017 at 14:25, Wouter De Borger <[email protected]> wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for the hint! I'll try to add some content there, as I can
>>>>> definitely use a garbage detector.
>>>>>
>>>>> In this case, however, I was specifically trying to avoid using a
>>>>> statistical detector. PDFBox already knows there is a problem,
>>>>
>>>> That is not the case here. From PDFBox's perspective everything is fine.
>>>> It's extracting the text according to the definition and information in the
>>>> PDF. Recognizing that this is garbage from a user's perspective would
>>>> require PDFBox to 'understand' that the extracted text is not meaningful.
>>>> BR
>>>> Maruan
>>>>
>>>>> so there is
>>>>> no need to examine the content to attempt to detect a problem.
>>>>> I would like to be able to capture the problem when and where it is known,
>>>>> as this is easier and more accurate.
>>>>>
>>>>> Thanks,
>>>>> Wouter
>>>>>
>>>>> On Thu, Mar 30, 2017 at 2:16 PM, Allison, Timothy B. <[email protected]> wrote:
>>>>>
>>>>>> If you have any recommendations for the more general case, let us know on
>>>>>> TIKA-1443 [1].
>>>>>>
>>>>>> [1] https://issues.apache.org/jira/browse/TIKA-1443
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Wouter De Borger [mailto:[email protected]]
>>>>>> Sent: Thursday, March 30, 2017 6:00 AM
>>>>>> To: [email protected]
>>>>>> Subject: Make PDFBox fail on bad pdf
>>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> When a PDF has bad encoding, PDFBox produces garbage (as explained in the
>>>>>> FAQ https://pdfbox.apache.org/2.0/faq.html#gibberish).
>>>>>>
>>>>>> Can I make PDFBox fail in this case instead of producing garbage?
>>>>>>
>>>>>> (I'm working on a system that can also do OCR, so at the least sign of
>>>>>> trouble, I would like PDFBox to fail and try OCR.)
>>>>>>
>>>>>> Thanks,
>>>>>> Wouter
>>>>>>
>>>>>
>>>>> --
>>>>> Wouter De Borger, PhD
>>>>> Co-founder Inmanta
>>>>> www.inmanta.com
>>>>> Email: [email protected]
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]

