Hello,
I’m trying to extract the text from a PDF that was saved from a Word document. I am using Release 2.0.5 of pdfbox and pdfbox-tools, with Java 8 on a Windows machine.

I’m using the following code to get the text:

    PDDocument pdDocument = PDDocument.load( pdfFile );
    PDFTextStripper stripper = new PDFTextStripper();
    String rawText = stripper.getText( pdDocument );
    // end of code excerpt

I’m running the same code on a collection of files. Most work as expected. I can see the following in the text of the Table of Contents:

    5.15.1 ADDENDA..................................................... ................................. 1
    5.15.2 YOU ARE HERE .............................. .............................................. 2
    5.15.3 INTRODUCTION .............................. .............................................. 4

However, for two files, what I see is:

    5.16 xxx SYSTEM PROCEDURES ............................................................ 1
    ADDENDA...................................... ......................................................... 1 5.16.1
    YOU ARE HERE .............................. ........................................................ 2 5.16.2
    INTRODUCTION ....................................................................................... 4 5.16.3

Note: the outline numbers (5.16.1, etc.) are at the end of the line, not at the beginning.

A) Is this a known, solvable problem?
B) If not, is there a different way I can try to extract the data?
C) If not, can I help debug/diagnose the problem? I cannot send the offending PDF file out of my system.

Thanks,
Dave Patterson

