On 05.02.2021 at 20:17, Ethan Huang wrote:
We are converting PDF files into images. We do this by splitting a
single PDF file into several PDDocument objects, one per page, and
converting them in parallel.
What I found is that pages with more objects take much longer to
process (see the logs below; times are in seconds).
I cannot share the test file for now. I will need to ask for permission.
Please do so. It is unlikely, but sometimes we do find potential for
optimization in PDFBox when a file is slow. However, in most cases we
can't help.
Is there a way to make it faster? Also, I see log messages like the
ones below for the pages that take longer to process.
No, not really. You could lower the resolution.
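To put the resolution suggestion in numbers: a PDF page is measured in points (1/72 inch), so each side of the rendered bitmap has points * dpi / 72 pixels, and the pixel count grows with the square of the DPI. The sketch below is plain Java and does not call PDFBox. The class and method names are made up for illustration, and the US Letter page size is an assumption, since the page size of the test file is not known. In PDFBox 2.x the DPI is the second argument of PDFRenderer.renderImageWithDPI(pageIndex, dpi).

```java
// Illustrative only: DpiScaling and its methods are hypothetical helpers, not PDFBox API.
final class DpiScaling {
    // PDF user space defaults to 72 units (points) per inch.
    static final double POINTS_PER_INCH = 72.0;

    // Pixels along a page edge of the given length (in points) at the given DPI.
    static int pixels(double points, double dpi) {
        return (int) Math.round(points * dpi / POINTS_PER_INCH);
    }

    // Total number of pixels in the rendered bitmap.
    static long pixelCount(double widthPt, double heightPt, double dpi) {
        return (long) pixels(widthPt, dpi) * pixels(heightPt, dpi);
    }

    public static void main(String[] args) {
        double widthPt = 612, heightPt = 792; // assumed US Letter page
        for (double dpi : new double[] {300, 150, 72}) {
            System.out.printf("%.0f dpi: %d x %d = %d pixels%n",
                    dpi, pixels(widthPt, dpi), pixels(heightPt, dpi),
                    pixelCount(widthPt, heightPt, dpi));
        }
    }
}
```

Halving the DPI quarters the number of pixels. How much of that carries over to rendering time depends on the page content, so it is worth measuring on the slow pages.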
Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
INFO: Pattern surface is too large, will be clipped
Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
INFO: width: 4405.8223, height: -4405.8223
Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
INFO: XStep: 1707.63, YStep: 1707.63
Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
INFO: bbox: [-54.8253,-217.611,1652.8,1490.02]
Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
INFO: pattern matrix: [2.58008,0.0,0.0,-2.58008,0.0,540.0]
Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
INFO: concatenated matrix: [2.58008,0.0,0.0,-2.58008,0.0,540.0]
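The numbers in these log lines appear to fit together: multiplying XStep and YStep by the scale entries of the concatenated matrix (2.58008 and -2.58008) reproduces the logged width and height. The sketch below only redoes that arithmetic in plain Java with values copied from the log. It is not PDFBox's own code, the class name is made up, and the reading of how TilingPaint arrives at the size is an assumption.

```java
// Illustrative arithmetic only; TilingLogCheck is a hypothetical name, not a PDFBox class.
final class TilingLogCheck {
    // Scales a pattern step by one diagonal entry of the matrix.
    static double scaled(double step, double scale) {
        return step * scale;
    }

    public static void main(String[] args) {
        double xStep = 1707.63, yStep = 1707.63;    // from "XStep: ..., YStep: ..."
        double scaleX = 2.58008, scaleY = -2.58008; // first and fourth matrix entries
        // The log prints rounded values, so expect agreement only to a few decimals.
        System.out.println("width  ~ " + scaled(xStep, scaleX));
        System.out.println("height ~ " + scaled(yStep, scaleY));
    }
}
```

Both results agree with the logged width and height to within rounding, which suggests the messages concern a single large tiling pattern on the page rather than the number of objects.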
Logs showing the object count and processing duration per page for the
file with PDFBox:
[main] INFO doc.DocumentProcessorUtils - page 0 has 20 objs.
[main] INFO doc.DocumentProcessorUtils - page 1 has 24 objs.
[main] INFO doc.DocumentProcessorUtils - page 2 has 176 objs.
[main] INFO doc.DocumentProcessorUtils - page 3 has 21 objs.
[main] INFO doc.DocumentProcessorUtils - page 4 has 26 objs.
[main] INFO doc.DocumentProcessorUtils - page 5 has 21 objs.
[main] INFO doc.DocumentProcessorUtils - page 6 has 138 objs.
[main] INFO doc.DocumentProcessorUtils - page 7 has 33 objs.
[main] INFO doc.DocumentProcessorUtils - page 8 has 22 objs.
[main] INFO doc.DocumentProcessorUtils - page 9 has 26 objs.
[main] INFO doc.DocumentProcessorUtils - page 10 has 52 objs.
[ForkJoinPool.commonPool-worker-10] INFO doc.Pdf2Image - Page 3 takes 0.803.
[ForkJoinPool.commonPool-worker-13] INFO doc.Pdf2Image - Page 8 takes 0.805.
[ForkJoinPool.commonPool-worker-8] INFO doc.Pdf2Image - Page 4 takes 0.822.
[ForkJoinPool.commonPool-worker-15] INFO doc.Pdf2Image - Page 0 takes 0.852.
[ForkJoinPool.commonPool-worker-11] INFO doc.Pdf2Image - Page 5 takes 0.892.
[ForkJoinPool.commonPool-worker-4] INFO doc.Pdf2Image - Page 1 takes 0.901.
[ForkJoinPool.commonPool-worker-6] INFO doc.Pdf2Image - Page 7 takes 0.962.
[ForkJoinPool.commonPool-worker-2] INFO doc.Pdf2Image - Page 9 takes 1.075.
[ForkJoinPool.commonPool-worker-1] INFO doc.Pdf2Image - Page 10 takes 73.145.
[ForkJoinPool.commonPool-worker-9] INFO doc.Pdf2Image - Page 2 takes 201.11.
[main] INFO doc.Pdf2Image - Page 6 takes 202.048.
I don't think there is a correlation between the number of objects and
the rendering time.
I also tried ImageMagick on the same file at the same DPI, and this is
what I get. It is much faster for the pages with more objects, although
a bit slower than PDFBox for the other pages.
[main] INFO doc.DocumentProcessorUtils - page 0 has 20 objs.
[main] INFO doc.DocumentProcessorUtils - page 1 has 24 objs.
[main] INFO doc.DocumentProcessorUtils - page 2 has 176 objs.
[main] INFO doc.DocumentProcessorUtils - page 3 has 21 objs.
[main] INFO doc.DocumentProcessorUtils - page 4 has 26 objs.
[main] INFO doc.DocumentProcessorUtils - page 5 has 21 objs.
[main] INFO doc.DocumentProcessorUtils - page 6 has 138 objs.
[main] INFO doc.DocumentProcessorUtils - page 7 has 33 objs.
[main] INFO doc.DocumentProcessorUtils - page 8 has 22 objs.
[main] INFO doc.DocumentProcessorUtils - page 9 has 26 objs.
[main] INFO doc.DocumentProcessorUtils - page 10 has 52 objs.
[ForkJoinPool.commonPool-worker-2] INFO doc.ProcessDoc - Page 9 takes 1.684.
[ForkJoinPool.commonPool-worker-11] INFO doc.ProcessDoc - Page 1 takes 2.081.
[ForkJoinPool.commonPool-worker-8] INFO doc.ProcessDoc - Page 5 takes 2.095.
[ForkJoinPool.commonPool-worker-4] INFO doc.ProcessDoc - Page 8 takes 2.208.
[ForkJoinPool.commonPool-worker-15] INFO doc.ProcessDoc - Page 7 takes 2.336.
[ForkJoinPool.commonPool-worker-10] INFO doc.ProcessDoc - Page 3 takes 2.443.
[ForkJoinPool.commonPool-worker-13] INFO doc.ProcessDoc - Page 4 takes 2.485.
[ForkJoinPool.commonPool-worker-6] INFO doc.ProcessDoc - Page 0 takes 3.722.
[ForkJoinPool.commonPool-worker-1] INFO doc.ProcessDoc - Page 10 takes 3.765.
[main] INFO doc.ProcessDoc - Page 6 takes 4.479.
[ForkJoinPool.commonPool-worker-9] INFO doc.ProcessDoc - Page 2 takes 4.51.
ImageMagick uses Ghostscript, which is written in C, and they're 10
years ahead of us. IMHO they are the best, just below Adobe.
Tilman