Re: Make PDFBox fail on bad pdf

Maruan Sahyoun Thu, 30 Mar 2017 06:51:46 -0700

One option that doesn't require changing PDFBox is to create a custom log
appender and capture the org.apache.pdfbox.pdmodel.font.PDSimpleFont log
messages. You could then count them afterwards and, if the count is above a
certain threshold, decide to discard the result of the text extraction.
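
A minimal sketch of that idea, under some assumptions: it uses a
java.util.logging Handler rather than a Logback or Log4j appender, which only
works if commons-logging (the logging facade PDFBox uses) is routed to
java.util.logging in your setup. The class name UnmappedGlyphCounter, the
helper countDuring and the threshold value are made up for illustration; the
logger name and message prefix are taken from the warnings quoted below. The
lambda in main only simulates the warnings, so replace it with your own
extraction call.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

/** Hypothetical helper (not part of PDFBox): counts "No Unicode mapping" warnings. */
public class UnmappedGlyphCounter extends Handler {
    // Logger name as it appears in the warnings quoted in this thread.
    static final String FONT_LOGGER = "org.apache.pdfbox.pdmodel.font.PDSimpleFont";

    private final AtomicInteger count = new AtomicInteger();

    @Override
    public void publish(LogRecord record) {
        String msg = record.getMessage();
        // Count only WARNING-or-higher records with the expected message prefix.
        if (record.getLevel().intValue() >= Level.WARNING.intValue()
                && msg != null && msg.startsWith("No Unicode mapping")) {
            count.incrementAndGet();
        }
    }

    @Override public void flush() { }
    @Override public void close() { }

    public int count() { return count.get(); }

    /** Runs the task with a counter attached and returns the number of warnings seen. */
    public static int countDuring(Runnable task) {
        Logger logger = Logger.getLogger(FONT_LOGGER);
        UnmappedGlyphCounter counter = new UnmappedGlyphCounter();
        logger.addHandler(counter);
        try {
            task.run();
        } finally {
            logger.removeHandler(counter); // always detach, even if the task throws
        }
        return counter.count();
    }

    public static void main(String[] args) {
        // Stand-in for the real text extraction: emit two warnings like the quoted ones.
        int warnings = countDuring(() -> {
            Logger log = Logger.getLogger(FONT_LOGGER);
            log.warning("No Unicode mapping for c49 (86) in font null");
            log.warning("No Unicode mapping for c103 (87) in font null");
        });
        int threshold = 1; // arbitrary value, tune it for your documents
        System.out.println(warnings > threshold ? "discard text, try OCR" : "keep text");
    }
}
```

Note that the handler sees warnings from every thread, so if several
documents are extracted in parallel the counts would be mixed; you would need
to filter per thread or run one extraction at a time.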


> On 30.03.2017 at 14:54, Wouter De Borger <[email protected]> wrote:
> 
> Oh, sorry, my bad.
> 
> The log lines are:
> 
> 2017-46-30 14:46:04.788 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c49 (86) in font null
> 2017-46-30 14:46:04.788 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c103 (87) in font null
> 2017-46-30 14:46:04.789 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c59 (88) in font null
> 2017-46-30 14:46:04.792 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c86 (89) in font null
> 2017-46-30 14:46:04.792 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c122 (90) in font null
> 2017-46-30 14:46:04.795 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c174 (32) in font null
> 2017-46-30 14:46:04.795 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c104 (33) in font null
> 2017-46-30 14:46:04.795 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c231 (34) in font null
> 2017-46-30 14:46:04.796 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c175 (35) in font null
> 2017-46-30 14:46:04.796 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c99 (36) in font null
> 2017-46-30 14:46:04.796 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c98 (37) in font null
> 2017-46-30 14:46:04.802 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c76 (32) in font null
> 2017-46-30 14:46:04.802 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c101 (33) in font null
> 2017-46-30 14:46:04.803 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c114 (34) in font null
> 2017-46-30 14:46:04.803 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c109 (35) in font null
> 2017-46-30 14:46:04.803 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c98 (36) in font null
> 2017-46-30 14:46:04.803 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c111 (37) in font null
> 2017-46-30 14:46:04.804 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c117 (38) in font null
> 2017-46-30 14:46:04.804 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c115 (39) in font null
> 2017-46-30 14:46:04.804 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c110 (40) in font null
> 2017-46-30 14:46:04.804 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c116 (41) in font null
> 2017-46-30 14:46:04.805 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c97 (42) in font null
> 2017-46-30 14:46:04.805 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c108 (43) in font null
> 2017-46-30 14:46:04.805 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c100 (44) in font null
> 2017-46-30 14:46:04.805 WARN --- [DefaultMessageListenerContainer-1]
> org.apache.pdfbox.pdmodel.font.PDSimpleFont.toUnicode(PDSimpleFont.java:325)
> : No Unicode mapping for c39 (45) in font null
> 
> I can't forward the PDF, as it contains banking information. I can try to
> get permission to pass it on, but I don't have much hope.
> The PDF is quite weird. It renders well, but pdftotext and Chrome are also
> unable to get meaningful text out of it.
> The PDF was created with PDF Converter 3.0.
> 
> The header/footer of the PDF are extracted somewhat OK, but the body looks
> like this:
> 
> "#$%&'(
> )*+$ ,% -,+$#$' .% /!&# $"0!'1%' 2&% 34 5"#&'+"6% %#7 .$#-!#8% 9 /!&#
> 6!"#%"7$' &"% +/+"6% #&'
> /!7'% 6!"7'+7 .*+##&'+"6% ": ;;<=>?;;@AA=><B
> C%# 6!".$7$!"# %" #%'+$%"7 "!7+11%"7 ,%# #&$/+"7%# D
> !"7+"7 "!1$"+,
> D >BE;;(;; FGH
> ... I&'8%
> D ? +"#
> J+&K
> D ?(L?; M ,*+"
> N$"78'O7# -+P+Q,%# +"7$6$-+7$/%1%"7 -+' -8'$!.% .*&" +"R
> S$ ,%# 1!.+,$78# .% 6%77% +/+"6% .8/%,!--8%# .+"# 6% 6!&''$%' /!&#
> 6!"/$%""%"7( T% /!&# $"/$7% 9
> 1% '%"/!P%' ,%# -$U6%# #&$/+"7%# +/+"7 ,% @; +/'$, A;V? D
> .. ,% .!&Q,% .% ,+ -'8#%"7% ,%77'%( .+78 %7 #$W"8X
> /!7'% 6!"7'+7 .*+##&'+"6% %7 7!&# ,%# +/%"+"7#( 2&$ #%'!"7 6!"#%'/8# -+' 34
> 5"#&'+"6% -%".+"7
> ,+ .&'8% .% ,*+/+"6%B
> IU# '86%-7$!" .% 6%# .!6&1%"7#( ,% 1!"7+"7 "!1$"+, .% ,*+/+"6%( .$1$"&8 .%#
> $"78'O7# +"7$6$-8#
> -!&' ,+ -'%1$U'% -8'$!.%( /!&# #%'+ -+P8B
> C+ .+7% .% -'$#% .% 6!&'# .% /!7'% +/+"6% #&' 6!"7'+7 %#7 0$K8% +& V%' .&
> 1!$# .% ,+ '86%-7$!" -+'
> 34 5"#&'+"6% .& .!&Q,% #$W"8 .% ,+ -'8#%"7%B C%#
> 
> Wouter
> 
> On Thu, Mar 30, 2017 at 2:42 PM, Maruan Sahyoun <[email protected]>
> wrote:
> 
>> 
>>> On 30.03.2017 at 14:37, Wouter De Borger <[email protected]> wrote:
>>> 
>>> Hi,
>>> 
>>> Well, PDFBox does know it can't decode the Unicode characters (as it
>>> outputs a stream of warnings). It would be nice if I could ask PDFBox how
>>> many undecodable characters a document has.
>> 
>> Well, that's something you didn't mention before. Could you post some of
>> the messages here so we know which ones you are talking about?
>> 
>> BR
>> Maruan
>> 
>>> 
>>> Wouter
>>> 
>>> On Thu, Mar 30, 2017 at 2:29 PM, Maruan Sahyoun <[email protected]>
>>> wrote:
>>> 
>>>> Hi,
>>>> 
>>>>> On 30.03.2017 at 14:25, Wouter De Borger <[email protected]> wrote:
>>>>> 
>>>>> Hi,
>>>>> 
>>>>> Thanks for the hint! I'll try to add some content there, as I can
>>>>> definitely use a garbage detector.
>>>>> 
>>>>> In this case, however, I was specifically trying to avoid using a
>>>>> statistical detector. PDFBox already knows there is a problem,
>>>> 
>>>> That is not the case here. From PDFBox's perspective everything is fine.
>>>> It's extracting the text according to the definitions and information in
>>>> the PDF. Recognizing that this is garbage from a user's perspective would
>>>> require PDFBox to 'understand' that the extracted text is not meaningful.
>>>> BR
>>>> Maruan
>>>> 
>>>>> so there is
>>>>> no need to examine the content to attempt to detect a problem.
>>>>> I would like to be able to capture the problem when and where it is
>>>>> known, as this is easier and more accurate.
>>>>> 
>>>>> Thanks,
>>>>> Wouter
>>>>> 
>>>>> On Thu, Mar 30, 2017 at 2:16 PM, Allison, Timothy B. <[email protected]>
>>>>> wrote:
>>>>> 
>>>>>> If you have any recommendations for the more general case, let us know
>>>>>> on TIKA-1443 [1].
>>>>>> 
>>>>>> [1] https://issues.apache.org/jira/browse/TIKA-1443
>>>>>> 
>>>>>> -----Original Message-----
>>>>>> From: Wouter De Borger [mailto:[email protected]]
>>>>>> Sent: Thursday, March 30, 2017 6:00 AM
>>>>>> To: [email protected]
>>>>>> Subject: Make PDFBox fail on bad pdf
>>>>>> 
>>>>>> Hi All,
>>>>>> 
>>>>>> When a PDF has bad encoding, PDFBox produces garbage (as explained in
>>>>>> the FAQ https://pdfbox.apache.org/2.0/faq.html#gibberish).
>>>>>> 
>>>>>> Can I make PDFBox fail in this case instead of producing garbage?
>>>>>> 
>>>>>> (I'm working on a system that can also do OCR, so at the slightest sign
>>>>>> of trouble, I would like PDFBox to fail and try OCR.)
>>>>>> 
>>>>>> Thanks,
>>>>>> Wouter
>>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Wouter De Borger, PhD
>>>>> Co-founder Inmanta
>>>>> www.inmanta.com
>>>>> Email: [email protected]
>>>> 
>>>> 
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>> 
>>>> 
>>> 
>>> 


