Hello,
I'm trying to read some data from a PDF file using PDFBox and write it to a
text file.
For English PDFs the code below works perfectly, but what if the PDF contains
other languages such as Arabic or Chinese?
I'm trying to figure out how to specify the encoding.
Can you please tell me what's wrong with my code?
My code:
/**
 * This method should read an Arabic PDF & write its contents in a text file.
 */
private void copyPDFText() {
    PDDocument pdfFile;
    try {
        String file = "C:\\index.pdf";
        pdfFile = PDDocument.load( file ); // Open this pdf to edit.

        // Specify the page to read :
        PDFTextStripper stripper = new PDFTextStripper();
        stripper.setStartPage( 3 );
        stripper.setEndPage( 3 );

        // Read page data (Here's the problem - It reads it as String & not
        // bytes as in text files):
        String pageData = stripper.getText( pdfFile );
        byte[] pageDataInBytes = pageData.getBytes();
        String decodedPageData = new String( pageDataInBytes, "Cp1256" );
        byte[] output = decodedPageData.getBytes( "UTF-8" );

        // Define the text file to write the data to & write the encoded
        // output to it :
        File outfile = new File( "C:\\index.txt" );
        FileOutputStream fout = new FileOutputStream( outfile );
        fout.write( output );
        fout.close();
        System.out.println( "Read/Write complete" );

        pdfFile.save( file );
        pdfFile.close();
    } catch (IOException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (COSVisitorException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    }
}
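
For comparison, here is a minimal sketch of just the write step in isolation.
It assumes the String returned by stripper.getText() already holds correctly
decoded Unicode text; if that holds, the getBytes() / "Cp1256" round trip is
not needed, and it can only preserve the text when the platform default
charset happens to be Cp1256. The class and method names (Utf8TextWriter,
writeUtf8) are hypothetical and not part of PDFBox. Only the standard library
is used, so it does not show whether PDFBox extracts the Arabic glyphs
correctly in the first place.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration only; not part of PDFBox.
final class Utf8TextWriter {

    private Utf8TextWriter() {
        // Static utility, no instances.
    }

    // Encodes the given text as UTF-8 and writes it to the target file.
    // The String is encoded exactly once, with an explicit charset, so the
    // platform default charset never influences the output.
    static void writeUtf8(String text, Path target) throws IOException {
        byte[] encoded = text.getBytes(StandardCharsets.UTF_8);
        Files.write(target, encoded);
    }
}
```

In the method above this would stand in for the three encoding lines and the
FileOutputStream, for example
Utf8TextWriter.writeUtf8( pageData, outfile.toPath() ), with the resulting
file opened as UTF-8 in the viewer or editor.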
Your help is really appreciated,
Hesham