Problem extracting and processing text from a PDF

David Patterson Wed, 05 Apr 2017 12:47:08 -0700

Hello,



I’m trying to extract the text from a PDF that was saved from a Word
document.



I am using Release 2.0.5 of pdfbox and pdfbox-tools, with Java 8 on a
Windows machine.



I’m using the following code to get the text:



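import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

// pdfFile refers to the PDF being read
PDDocument pdDocument = PDDocument.load( pdfFile );
PDFTextStripper stripper = new PDFTextStripper();
// getText returns the text of the whole document as one String
String rawText = stripper.getText( pdDocument );

// end of code excerpt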



I’m running the same code on a collection of files. Most work as expected;
for those, the text of the Table of Contents looks like this:

5.15.1 ADDENDA.....................................................
................................. 1

5.15.2 YOU ARE HERE ..............................
.............................................. 2

5.15.3 INTRODUCTION ..............................
.............................................. 4



However, for two files, what I see is:

5.16 xxx SYSTEM PROCEDURES
............................................................
1

 ADDENDA......................................
......................................................... 1 5.16.1

YOU ARE HERE ..............................
........................................................
2 5.16.2

INTRODUCTION 
.......................................................................................
4 5.16.3



Note: the outline numbers (5.16.1, etc.) are at the end of the line, not at
the beginning.
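
For illustration only, a post-processing workaround could be sketched as
follows. It is not part of PDFBox and does not address why the order differs.
The class name TocLineFixer and the method fix are hypothetical, and the sketch
assumes each affected entry has already been joined into a single line that
ends with the page number followed by the outline number.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: moves a trailing outline number back to the front
// of a table-of-contents line. Plain java.util.regex, no PDFBox calls.
final class TocLineFixer {

    // group 1: title plus dot leader (at least two consecutive dots)
    // group 2: page number
    // group 3: outline number such as 5.16.1 (assumed to contain a dot)
    private static final Pattern TRAILING_OUTLINE_NUMBER = Pattern.compile(
            "^\\s*(.*?\\.{2,})\\s*(\\d+)\\s+(\\d+(?:\\.\\d+)+)\\s*$");

    static String fix(String line) {
        Matcher m = TRAILING_OUTLINE_NUMBER.matcher(line);
        if (!m.matches()) {
            // Lines that do not end in "page outline-number" are left alone.
            return line;
        }
        return m.group(3) + " " + m.group(1) + " " + m.group(2);
    }

    public static void main(String[] args) {
        // An affected line: the outline number is moved to the front.
        System.out.println(fix(" ADDENDA.......... 1 5.16.1"));
        // A line that is already in the expected order is returned unchanged.
        System.out.println(fix("5.15.1 ADDENDA.......... 1"));
    }
}
```

Entries that the extraction splits across several lines, as in the sample
above, would have to be joined before being passed to fix.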



A)  Is this a known, solvable problem?

B)  If not, is there a different way I can try to extract the data?

C)  If not, can I help debug/diagnose the problem? I cannot send the
offending PDF file out of my system.

Thanks



Dave Patterson
