I have a few Russian PDFs that exhibit strange behavior when their text is
extracted with PDFTextStripper. I am attaching my PDF, but I'm not sure if that
is the correct thing to do. When I extract the PDF on Windows and specify UTF-8
encoding, the output is garbage. When I extract it on Windows without
specifying an encoding, the output is correct when viewed in UltraEdit. When
I extract it on Linux, the output is garbage no matter which encoding I specify.
It appears to me that the encoding isn't being read correctly from the PDF, and
that when the text is written out as UTF-8 it ends up double encoded. On
Windows I can detect this double encoding, rerun the extraction with no
encoding specified, and then convert the result to UTF-8 using iconv, and the
output is OK. This workaround does not work on Linux, though, because I cannot
get the file to extract correctly with any encoding there.
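
To illustrate what I mean by double encoding, here is a minimal sketch in plain
Java. It makes no PDFBox calls. The class and method names are made up for this
example, and the choice of ISO-8859-1 as the charset the bytes get misread in
is an assumption on my part; I have not confirmed that this is what actually
happens inside PDFTextStripper.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of PDFBox.
class DoubleEncodingSketch {

    // Simulates the suspected problem: correct UTF-8 bytes are misread one
    // byte per character (ISO-8859-1 is an assumption) and the resulting
    // string is then encoded as UTF-8 a second time.
    static byte[] doubleEncode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    // Reverses one round of double encoding: decode the outer UTF-8 layer,
    // map each character back to a single byte, then decode the inner UTF-8.
    static String repair(byte[] doubleEncoded) {
        String misread = new String(doubleEncoded, StandardCharsets.UTF_8);
        byte[] utf8 = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Unicode escapes for a six-letter Russian word, so this source file
        // does not itself depend on the compiler's default encoding.
        String original = "\u041F\u0440\u0438\u0432\u0435\u0442";
        byte[] broken = doubleEncode(original);
        System.out.println("bytes after double encoding: " + broken.length);
        System.out.println("round trip ok: " + original.equals(repair(broken)));
    }
}
```

If the misread charset on Windows is really windows-1252 rather than
ISO-8859-1, the repair step would need that charset instead, and some bytes
may not round-trip cleanly.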
Has anyone come across anything like this before, and if so, what can be done
to solve it? I am using the latest 0.8 build from the SVN repository. I only
recently started using PDFBox, so I am not very familiar with the code. Any
information will be helpful. Thanks.
 
-Adrian Romano
 
 
