Hi,
I use this code to extract the text from my PDF files before adding them to
the Lucene index:

    public Reader extractText(InputStream stream,
                              String type,
                              String encoding) throws IOException {
        try {
            PDFParser parser = new PDFParser(new BufferedInputStream(stream));
            try {
                parser.parse();
                PDDocument document = parser.getPDDocument();
                CharArrayWriter writer = new CharArrayWriter();

                PDFTextStripper stripper = new PDFTextStripper();
                stripper.setLineSeparator("\n");
                stripper.writeText(document, writer);

                return new CharArrayReader(writer.toCharArray());
            } finally {
                try {
                    PDDocument doc = parser.getPDDocument();
                    if (doc != null) {
                        doc.close();
                    }
                } catch (IOException e) {
                    // ignore
                }
            }
        } catch (Throwable e) {
            logger.log(Level.WARNING,
                       "Failed to extract PDF text content", e);
            return new StringReader("");
        } finally {
            stream.close();
        }
    }
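
The method hands back a Reader, so the caller has to drain it to see the
extracted text. Here is a minimal sketch of that step using only java.io.
The names ReaderDemo, readAll and fakeExtract are made up for illustration
and are not part of PDFBox or Lucene; fakeExtract only mimics the
CharArrayWriter/CharArrayReader hand-off from the method above so the
example runs without a PDF file.

```java
import java.io.CharArrayReader;
import java.io.CharArrayWriter;
import java.io.IOException;
import java.io.Reader;

public class ReaderDemo {

    // Hypothetical helper: reads everything from the Reader into a String
    // and closes the Reader when done.
    public static String readAll(Reader reader) throws IOException {
        try (Reader in = reader) {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        }
    }

    // Stand-in for extractText(...): buffers the text in a CharArrayWriter
    // and returns a Reader over a copy of that buffer, as the method above
    // does. No PDF parsing happens here.
    public static Reader fakeExtract(String text) {
        CharArrayWriter writer = new CharArrayWriter();
        writer.write(text, 0, text.length());
        return new CharArrayReader(writer.toCharArray());
    }
}
```

With the real method you would call readAll(extractText(stream, type,
encoding)) instead of fakeExtract. Note that on failure the method above
returns an empty reader, so an empty string can mean either "no text in the
PDF" or "extraction failed"; check the log to tell them apart.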


2008/12/10 NiTiN <[EMAIL PROTECTED]>

> Hi,
>
> I don't know how to extract all the content of a given PDF file using
> PDFBox. Please point me in the right direction.
>
>
> Thank you,
> NiTiN
>



-- 
Best regards

Daniel Manzke
