Thanks for the info! We would prefer to continue with PDFBox if possible.
Lowering the resolution would degrade the user experience for us.
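
To put numbers on the resolution trade-off, here is a rough sketch (plain
arithmetic, not PDFBox code) of how the raster size grows with DPI. It assumes
a US Letter page of 612 x 792 points and the default 72 points per inch; the
class and method names are made up for illustration.

```java
public class RasterSize {
    // Default PDF user space has 72 units ("points") per inch.
    static final double POINTS_PER_INCH = 72.0;

    /** Pixels along one side of the page for a length in points and a target DPI. */
    static long pixels(double points, double dpi) {
        return Math.round(points * dpi / POINTS_PER_INCH);
    }

    /** Total number of pixels in the rendered page image. */
    static long pixelCount(double widthPt, double heightPt, double dpi) {
        return pixels(widthPt, dpi) * pixels(heightPt, dpi);
    }

    public static void main(String[] args) {
        double widthPt = 612, heightPt = 792; // assumed page size: US Letter
        for (double dpi : new double[] {72, 150, 300}) {
            System.out.printf("%.0f DPI -> %d x %d = %d pixels%n",
                    dpi, pixels(widthPt, dpi), pixels(heightPt, dpi),
                    pixelCount(widthPt, heightPt, dpi));
        }
    }
}
```

Halving the DPI cuts the pixel count to a quarter, so the loss in detail is
large. Whether rendering time drops by the same factor depends on the page
content, so this says nothing definite about the slow pages.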

I am requesting permission to share the files. Once I have it, I will share
them here.
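
For context, the per-page parallel timing described in the quoted message
below can be sketched roughly as follows. This is an illustration only, not
the code that produced the logs: `PageTimer`, `timePages` and the `renderPage`
callback are hypothetical names, and the callback stands in for whatever
renders a single page. A parallel stream runs on the common ForkJoinPool plus
the calling thread, which is consistent with the thread names in the logs.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

public class PageTimer {
    /** Runs renderPage for each page index in parallel; returns elapsed seconds per page. */
    static Map<Integer, Double> timePages(int pageCount, IntConsumer renderPage) {
        Map<Integer, Double> seconds = new ConcurrentHashMap<>();
        IntStream.range(0, pageCount).parallel().forEach(page -> {
            long start = System.nanoTime();
            renderPage.accept(page); // hypothetical per-page rendering call
            seconds.put(page, (System.nanoTime() - start) / 1e9);
        });
        return seconds;
    }

    /** Stand-in for real rendering work: just sleeps for the given time. */
    static void pretendToRender(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    public static void main(String[] args) {
        // Assumption for the demo: pages 2 and 6 are the slow ones.
        Map<Integer, Double> result =
                timePages(11, page -> pretendToRender(page == 2 || page == 6 ? 200 : 10));
        result.entrySet().stream()
                .sorted(Map.Entry.comparingByValue())
                .forEach(e -> System.out.printf("Page %d takes %.3f.%n", e.getKey(), e.getValue()));
    }
}
```

With this layout the total wall-clock time is bounded below by the slowest
page, so one or two expensive pages dominate no matter how many workers are
available.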

On Fri, Feb 5, 2021 at 11:30 AM Tilman Hausherr <[email protected]>
wrote:

> On 05.02.2021 at 20:17, Ethan Huang wrote:
> > We are converting PDF files into images. The way we do it is by
> > splitting a single PDF file into several PDDocument objects, one per
> > page, and converting them in parallel.
> >
> >
> >
> > What I found is that pages with more objects take much longer to
> > process (see the logs below; times are in seconds).
> >
> > I cannot share the test file for now. I will need to ask for permission.
>
>
> Please do so. It is unlikely, but sometimes we do find optimization
> potential in PDFBox when a file is slow. However, in most cases we can't
> help.
>
>
> >
> > Is there a way to make it faster? Also, I see the following log output
> > for the pages that take longer to process.
>
> No, not really. You could lower the resolution.
>
>
> >
> > Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
> > INFO: Pattern surface is too large, will be clipped
> > Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
> > INFO: width: 4405.8223, height: -4405.8223
> > Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
> > INFO: XStep: 1707.63, YStep: 1707.63
> > Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
> > INFO: bbox: [-54.8253,-217.611,1652.8,1490.02]
> > Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
> > INFO: pattern matrix: [2.58008,0.0,0.0,-2.58008,0.0,540.0]
> > Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
> > INFO: concatenated matrix: [2.58008,0.0,0.0,-2.58008,0.0,540.0]
> >
> >
> > Logs showing the object count and processing duration per page for the
> > file with PDFBox:
> >
> >
> > [main] INFO doc.DocumentProcessorUtils - page 0 has 20 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 1 has 24 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 2 has 176 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 3 has 21 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 4 has 26 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 5 has 21 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 6 has 138 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 7 has 33 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 8 has 22 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 9 has 26 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 10 has 52 objs.
> >
> > [ForkJoinPool.commonPool-worker-10] INFO doc.Pdf2Image - Page 3 takes 0.803.
> > [ForkJoinPool.commonPool-worker-13] INFO doc.Pdf2Image - Page 8 takes 0.805.
> > [ForkJoinPool.commonPool-worker-8] INFO doc.Pdf2Image - Page 4 takes 0.822.
> > [ForkJoinPool.commonPool-worker-15] INFO doc.Pdf2Image - Page 0 takes 0.852.
> > [ForkJoinPool.commonPool-worker-11] INFO doc.Pdf2Image - Page 5 takes 0.892.
> > [ForkJoinPool.commonPool-worker-4] INFO doc.Pdf2Image - Page 1 takes 0.901.
> > [ForkJoinPool.commonPool-worker-6] INFO doc.Pdf2Image - Page 7 takes 0.962.
> > [ForkJoinPool.commonPool-worker-2] INFO doc.Pdf2Image - Page 9 takes 1.075.
> > [ForkJoinPool.commonPool-worker-1] INFO doc.Pdf2Image - Page 10 takes 73.145.
> > [ForkJoinPool.commonPool-worker-9] INFO doc.Pdf2Image - Page 2 takes 201.11.
> > [main] INFO doc.Pdf2Image - Page 6 takes 202.048.
>
>
> I don't think there is a correlation between the number of objects and
> the rendering time.
>
>
> >
> > Also I tried to use ImageMagick to do the same thing with the same DPI
> > and this is what I get, which seems much faster for pages with more
> > objects, although it is a bit slower than PDFBox for other pages.
> >
> > [main] INFO doc.DocumentProcessorUtils - page 0 has 20 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 1 has 24 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 2 has 176 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 3 has 21 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 4 has 26 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 5 has 21 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 6 has 138 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 7 has 33 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 8 has 22 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 9 has 26 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 10 has 52 objs.
> > [ForkJoinPool.commonPool-worker-2] INFO doc.ProcessDoc - Page 9 takes 1.684.
> > [ForkJoinPool.commonPool-worker-11] INFO doc.ProcessDoc - Page 1 takes 2.081.
> > [ForkJoinPool.commonPool-worker-8] INFO doc.ProcessDoc - Page 5 takes 2.095.
> > [ForkJoinPool.commonPool-worker-4] INFO doc.ProcessDoc - Page 8 takes 2.208.
> > [ForkJoinPool.commonPool-worker-15] INFO doc.ProcessDoc - Page 7 takes 2.336.
> > [ForkJoinPool.commonPool-worker-10] INFO doc.ProcessDoc - Page 3 takes 2.443.
> > [ForkJoinPool.commonPool-worker-13] INFO doc.ProcessDoc - Page 4 takes 2.485.
> > [ForkJoinPool.commonPool-worker-6] INFO doc.ProcessDoc - Page 0 takes 3.722.
> > [ForkJoinPool.commonPool-worker-1] INFO doc.ProcessDoc - Page 10 takes 3.765.
> > [main] INFO doc.ProcessDoc - Page 6 takes 4.479.
> > [ForkJoinPool.commonPool-worker-9] INFO doc.ProcessDoc - Page 2 takes 4.51.
> >
> ImageMagick uses Ghostscript, which is written in C, and they're 10
> years ahead of us. IMHO they are the best, just below Adobe.
>
> Tilman