Hi,
take a look at the ExtractImages.java source code in /org/apache/pdfbox/tools/.
It shows the cases where you can take the encoded image data directly from
the stream and write it out as-is, instead of decoding it with getImage()
and re-encoding it.
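
To illustrate why the direct route is cheaper, here is a rough sketch that uses only the JDK, so it runs without PDFBox or a PDF at hand. It is not the ExtractImages code: the class DirectCopyDemo, its methods, and the generated sample JPEG are all made up for illustration, and the sample stands in for the JPEG bytes stored in a PDF image stream. In the PDFBox 2.0 sources the encoded bytes appear to come from a createInputStream overload that takes a list of stop filters, but please check ExtractImages.java for the version you use.

```java
import javax.imageio.ImageIO;
import java.awt.image.BufferedImage;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical demo class, not part of PDFBox.
public class DirectCopyDemo {

    /** Fast path: copy the already-encoded bytes through unchanged. */
    static void copyDirect(InputStream encoded, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        int n;
        while ((n = encoded.read(buffer)) != -1) {
            out.write(buffer, 0, n);
        }
    }

    /** Slow path: decode to a BufferedImage, then encode it again. */
    static void decodeAndReencode(InputStream encoded, String format, OutputStream out)
            throws IOException {
        BufferedImage image = ImageIO.read(encoded);
        if (image == null) {
            throw new IOException("No ImageIO reader could decode the data");
        }
        if (!ImageIO.write(image, format, out)) {
            throw new IOException("No ImageIO writer for format " + format);
        }
    }

    /** Builds a JPEG in memory; a stand-in for image data embedded in a PDF. */
    static byte[] sampleJpeg(int width, int height) throws IOException {
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                // Simple red/green gradient so the image is not a flat colour.
                image.setRGB(x, y, ((x * 255 / width) << 16) | ((y * 255 / height) << 8));
            }
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        ImageIO.write(image, "jpg", out);
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] jpeg = sampleJpeg(2000, 2000);

        long start = System.nanoTime();
        ByteArrayOutputStream direct = new ByteArrayOutputStream();
        copyDirect(new ByteArrayInputStream(jpeg), direct);
        long directMillis = (System.nanoTime() - start) / 1_000_000;

        start = System.nanoTime();
        ByteArrayOutputStream reencoded = new ByteArrayOutputStream();
        decodeAndReencode(new ByteArrayInputStream(jpeg), "jpg", reencoded);
        long reencodeMillis = (System.nanoTime() - start) / 1_000_000;

        System.out.println("direct copy:       " + directMillis + " ms");
        System.out.println("decode + reencode: " + reencodeMillis + " ms");
    }
}
```

The direct copy also leaves the image data byte-for-byte unchanged, whereas decoding and re-encoding a JPEG compresses it a second time. The direct route only applies when the stored data is already in a file format you can write out, which is why ExtractImages checks the suffix and colour space first.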
BR
Maruan
> Hi all,
>
> I have a use case where I need to extract the images and the text content
> from PDF documents.
> Comparing image extraction with text extraction, the time taken for
> image extraction is far too long.
>
> Furthermore, we compared the image extraction speed with the Linux command
> *pdfimages*, which was much faster than PDFBox.
>
> Is there anything I'm missing? I have included the snippet I used for
> image extraction here.
>
> Thanks
> Aravinth
>
>
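> // Imports assume PDFBox 2.x (pdfbox and pdfbox-tools artifacts).
> import java.awt.image.BufferedImage;
> import java.io.BufferedOutputStream;
> import java.io.File;
> import java.io.FileOutputStream;
> import org.apache.pdfbox.cos.COSName;
> import org.apache.pdfbox.pdmodel.PDDocument;
> import org.apache.pdfbox.pdmodel.PDPage;
> import org.apache.pdfbox.pdmodel.PDResources;
> import org.apache.pdfbox.pdmodel.graphics.PDXObject;
> import org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject;
> import org.apache.pdfbox.tools.imageio.ImageIOUtil;
>
> // Start of the overall measurement.
> long time = System.currentTimeMillis();
> PDDocument pdDocument = PDDocument.load(new File("test.pdf"));
> for (PDPage pdPage : pdDocument.getPages())
> {
>     PDResources resources = pdPage.getResources();
>     Iterable<COSName> xObjectNames = resources.getXObjectNames();
>     for (COSName cosName : xObjectNames)
>     {
>         PDXObject xObject = resources.getXObject(cosName);
>         if (xObject instanceof PDImageXObject)
>         {
>             PDImageXObject pdImageXObject = (PDImageXObject) xObject;
>             // Time the decoding of the image into a BufferedImage.
>             long start = System.currentTimeMillis();
>             BufferedImage image = pdImageXObject.getImage();
>             String nameName = cosName.getName();
>             System.out.println("Time taken for PDF image object " + nameName + " "
>                     + (System.currentTimeMillis() - start));
>             BufferedOutputStream output = new BufferedOutputStream(
>                     new FileOutputStream(nameName + "." + pdImageXObject.getSuffix()));
>             // Time the re-encoding and writing of the decoded image.
>             start = System.currentTimeMillis();
>             ImageIOUtil.writeImage(image, pdImageXObject.getSuffix(), output);
>             output.close();
>             System.out.println("Time taken for write to file object " + nameName + " "
>                     + (System.currentTimeMillis() - start));
>         }
>     }
> }
> pdDocument.close();
> System.err.println("Time taken for extracting for images "
>         + (System.currentTimeMillis() - time));
>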
>
> The PDF image extraction using pdfimages:
>
> long start = System.currentTimeMillis();
> ProcessBuilder processBuilder = new ProcessBuilder("pdfimages", "-j", "test.pdf", "out");
> Process process = processBuilder.start();
> // start() returns as soon as the process is launched, so wait for
> // pdfimages to exit before reading the clock.
> process.waitFor();
> System.out.println("Time taken for extracting images "
>         + (System.currentTimeMillis() - start));
>
--
Maruan Sahyoun