I have a few Russian PDFs that exhibit strange behavior when their text is
extracted with PDFTextStripper. I am attaching my PDF, although I'm not sure
whether that is the correct thing to do. When I extract the PDF on Windows
with UTF-8 encoding, the output is garbage. When I extract it on Windows
without specifying an encoding, the output is correct when viewed in UltraEdit.
When I extract it on Linux with any encoding, the output is garbage.
It appears to me that the encoding isn't being read correctly from the PDF, and
that when the text is written out as UTF-8, it is being double-encoded. I can
detect this double encoding, re-run the extraction with no encoding specified,
and then convert the result to UTF-8 using iconv, and it is OK. However, this
workaround does not help on Linux, because I cannot get the file to extract
correctly with any encoding there.
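
To illustrate what I mean by double encoding, here is a minimal sketch in plain
Java (no PDFBox calls). It assumes the text was first encoded as UTF-8, then
misread as a single-byte charset, and then encoded as UTF-8 a second time. The
choice of ISO-8859-1 as that intermediate charset is only a guess on my part (on
a Russian Windows machine it could just as well be windows-1251), and the class
and method names are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {

    // Simulates the suspected bug: UTF-8 bytes are misread as ISO-8859-1
    // (an assumption) and the resulting characters are encoded as UTF-8 again.
    static byte[] doubleEncode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        String misread = new String(utf8, StandardCharsets.ISO_8859_1);
        return misread.getBytes(StandardCharsets.UTF_8);
    }

    // Reverses the double encoding: decode once as UTF-8, map each character
    // back to a single byte, then decode those bytes as UTF-8.
    static String repair(byte[] doubleEncoded) {
        String misread = new String(doubleEncoded, StandardCharsets.UTF_8);
        byte[] utf8 = misread.getBytes(StandardCharsets.ISO_8859_1);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Heuristic check: after one UTF-8 decode, every character fits in one
    // byte, at least one is non-ASCII, and those bytes are valid UTF-8.
    static boolean looksDoubleEncoded(byte[] bytes) {
        String once = new String(bytes, StandardCharsets.UTF_8);
        boolean hasNonAscii = false;
        for (char c : once.toCharArray()) {
            if (c > 0xFF) {
                return false; // real multi-byte text, not a byte-per-char misread
            }
            if (c > 0x7F) {
                hasNonAscii = true;
            }
        }
        if (!hasNonAscii) {
            return false; // pure ASCII looks the same either way
        }
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(once.getBytes(StandardCharsets.ISO_8859_1)));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // A Russian greeting, written with Unicode escapes so the source
        // file's own encoding does not matter.
        String original = "\u041f\u0440\u0438\u0432\u0435\u0442";
        byte[] mangled = doubleEncode(original);
        System.out.println(looksDoubleEncoded(mangled));
        System.out.println(repair(mangled).equals(original));
    }
}
```

The round trip through ISO-8859-1 is lossless because that charset maps every
byte value to a character. If the intermediate charset is really windows-1251,
the same approach may lose data, since Java's windows-1251 decoder has no
character for byte 0x98.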
Has anyone come across anything like this before, and if so, what can be done
to solve it? I am using the latest 0.8 build from the SVN repository. I only
recently started using PDFBox, so I am not very familiar with the code. Any
information would be helpful. Thanks.
-Adrian Romano