Extracting Arabic text from a PDF

Hesham G. Fri, 05 Jun 2009 14:36:13 -0700

Hello,
 
I'm trying to find a way to read some data from a PDF file using PDFBox and 
write it to a text file.
For English PDFs the code below works perfectly. But what if the PDF contains 
other languages, such as Arabic or Chinese?
I'm trying to figure out how to specify the encoding.
Can you please tell me what's wrong with my code?


My code:

    /**
     * This method should read an Arabic PDF & write its contents in a text file.
     */
    private void copyPDFText() {
        PDDocument pdfFile;
        try {
            String file = "C:\\index.pdf";
            pdfFile = PDDocument.load( file ); // Open this pdf to edit.

            // Specify the page to read :
            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage( 3 );
            stripper.setEndPage( 3 );

            // Read page data (Here's the problem - It reads it as String & not bytes
            // as in text files):
            String pageData = stripper.getText( pdfFile );
            byte[] pageDataInBytes = pageData.getBytes();

            String decodedPageData = new String( pageDataInBytes, "Cp1256" );
            byte[] output = decodedPageData.getBytes( "UTF-8" );

            // Define the text file to write the data to & write the encoded output
            // to it :
            File outfile = new File( "C:\\index.txt" );
            FileOutputStream fout = new FileOutputStream( outfile);

            fout.write(output);
            fout.close();

            System.out.println( "Read/Write complete" );

            pdfFile.save( file );
            pdfFile.close();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        } catch (COSVisitorException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }



Your help is really appreciated,
Hesham
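
A possible direction, offered as a sketch and not as a confirmed fix: the
String returned by PDFTextStripper.getText() is already Unicode, so there is
no encoding left to specify on the reading side. The call to getBytes() with no
argument encodes with the platform default charset, and decoding those bytes
again as Cp1256 can only preserve the text if the default happens to match. The
example below does not use PDFBox at all. It uses a hard-coded Arabic string as
a stand-in for the extracted page text, and the class and method names
(ExtractedTextWriter, writeUtf8, roundTrip) are made up for illustration.
US-ASCII stands in for a platform default that cannot represent Arabic.
Whether PDFBox recovers the right characters in the first place still depends
on the fonts and mappings inside the PDF, which this sketch does not address.

```java
import java.io.IOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class ExtractedTextWriter {

    /** Writes already-decoded text to a file as UTF-8, with no byte round trip. */
    public static void writeUtf8(String text, Path target) throws IOException {
        try (Writer out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            out.write(text);
        }
    }

    /**
     * Mimics the round trip in the question: encode with one charset
     * (the platform default in the original code), then decode with another.
     */
    public static String roundTrip(String text, Charset encodeWith, Charset decodeWith) {
        return new String(text.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical stand-in for: String pageData = stripper.getText(pdfFile);
        String pageData = "\u0645\u0631\u062D\u0628\u0627"; // Arabic word "marhaban"

        // Direct write: the String goes straight to a UTF-8 encoder.
        Path target = Files.createTempFile("index", ".txt");
        writeUtf8(pageData, target);
        String readBack = new String(Files.readAllBytes(target), StandardCharsets.UTF_8);
        System.out.println("direct write preserved text: " + readBack.equals(pageData));

        // Round trip through a charset that cannot represent Arabic:
        // the unmappable characters are replaced during encoding and cannot be recovered.
        String mangled = roundTrip(pageData, StandardCharsets.US_ASCII, StandardCharsets.UTF_8);
        System.out.println("round trip preserved text: " + mangled.equals(pageData));
    }
}
```

In the original method this would amount to passing pageData directly to a
UTF-8 writer and dropping the getBytes() / new String(..., "Cp1256") pair. The
pdfFile.save( file ) call also looks unnecessary when the document is only
being read.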
