Thanks for the info! We would prefer to continue with PDFBox if possible. Lowering the resolution would noticeably degrade the user experience for us.
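To put the resolution trade-off in numbers, here is a rough sketch of how the output image size grows with DPI. It assumes the standard PDF convention of 72 user-space units per inch. The names `RenderSize`, `toPixels`, and `pixelCount` are made up for this illustration and are not PDFBox API, the page size is just an example, and PDFBox's own rounding may differ by a pixel.

```java
class RenderSize {
    // PDF user space has 72 units per inch, so pixels = points * dpi / 72.
    static long toPixels(double points, double dpi) {
        return Math.round(points * dpi / 72.0);
    }

    // Total number of pixels in the rendered image for a page of the given size.
    static long pixelCount(double widthPt, double heightPt, double dpi) {
        return toPixels(widthPt, dpi) * toPixels(heightPt, dpi);
    }

    public static void main(String[] args) {
        // US Letter (612 x 792 pt), chosen only as an example page size.
        double widthPt = 612, heightPt = 792;
        for (double dpi : new double[] {72, 150, 300}) {
            System.out.printf("%.0f DPI -> %d x %d px (%d pixels)%n",
                    dpi, toPixels(widthPt, dpi), toPixels(heightPt, dpi),
                    pixelCount(widthPt, heightPt, dpi));
        }
    }
}
```

Halving the DPI cuts the pixel count to a quarter, which is why it tends to help rendering time and also why it costs us sharpness.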
I have asked for permission to share the files. Once I have it, I will share them here.

On Fri, Feb 5, 2021 at 11:30 AM Tilman Hausherr <[email protected]> wrote:

> On 05.02.2021 at 20:17, Ethan Huang wrote:
> > We are converting PDF files into images, and the way we are doing it is
> > splitting a single PDF file into several PDDocument objects, one per
> > page, and converting them in parallel.
> >
> > What I found is that pages with more objects take much longer to
> > process (see the logs below; times are in seconds).
> >
> > I cannot share the test file for now. I will need to ask for permission.
>
> Please do so. It is unlikely, but sometimes we do find an optimization
> potential in PDFBox when a file is slow. However in most cases we can't
> help.
>
> > Is there a way to make it faster? Also, I see the following logs for
> > the pages requiring longer processing time.
>
> No, not really. You could lower the resolution.
>
> > Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
> > INFO: Pattern surface is too large, will be clipped
> > Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
> > INFO: width: 4405.8223, height: -4405.8223
> > Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
> > INFO: XStep: 1707.63, YStep: 1707.63
> > Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
> > INFO: bbox: [-54.8253,-217.611,1652.8,1490.02]
> > Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
> > INFO: pattern matrix: [2.58008,0.0,0.0,-2.58008,0.0,540.0]
> > Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
> > INFO: concatenated matrix: [2.58008,0.0,0.0,-2.58008,0.0,540.0]
> >
> > Logs showing the object count and processing duration per page for the
> > file with PDFBox:
> >
> > [main] INFO doc.DocumentProcessorUtils - page 0 has 20 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 1 has 24 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 2 has 176 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 3 has 21 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 4 has 26 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 5 has 21 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 6 has 138 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 7 has 33 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 8 has 22 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 9 has 26 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 10 has 52 objs.
> >
> > [ForkJoinPool.commonPool-worker-10] INFO doc.Pdf2Image - Page 3 takes 0.803.
> > [ForkJoinPool.commonPool-worker-13] INFO doc.Pdf2Image - Page 8 takes 0.805.
> > [ForkJoinPool.commonPool-worker-8] INFO doc.Pdf2Image - Page 4 takes 0.822.
> > [ForkJoinPool.commonPool-worker-15] INFO doc.Pdf2Image - Page 0 takes 0.852.
> > [ForkJoinPool.commonPool-worker-11] INFO doc.Pdf2Image - Page 5 takes 0.892.
> > [ForkJoinPool.commonPool-worker-4] INFO doc.Pdf2Image - Page 1 takes 0.901.
> > [ForkJoinPool.commonPool-worker-6] INFO doc.Pdf2Image - Page 7 takes 0.962.
> > [ForkJoinPool.commonPool-worker-2] INFO doc.Pdf2Image - Page 9 takes 1.075.
> > [ForkJoinPool.commonPool-worker-1] INFO doc.Pdf2Image - Page 10 takes 73.145.
> > [ForkJoinPool.commonPool-worker-9] INFO doc.Pdf2Image - Page 2 takes 201.11.
> > [main] INFO doc.Pdf2Image - Page 6 takes 202.048.
>
> I don't think there is a correlation between the number of objects and
> the rendering time.
>
> > Also, I tried using ImageMagick to do the same thing with the same DPI,
> > and this is what I get. It seems much faster for pages with more
> > objects, although it is a bit slower than PDFBox for the other pages.
> >
> > [main] INFO doc.DocumentProcessorUtils - page 0 has 20 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 1 has 24 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 2 has 176 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 3 has 21 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 4 has 26 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 5 has 21 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 6 has 138 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 7 has 33 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 8 has 22 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 9 has 26 objs.
> > [main] INFO doc.DocumentProcessorUtils - page 10 has 52 objs.
> >
> > [ForkJoinPool.commonPool-worker-2] INFO doc.ProcessDoc - Page 9 takes 1.684.
> > [ForkJoinPool.commonPool-worker-11] INFO doc.ProcessDoc - Page 1 takes 2.081.
> > [ForkJoinPool.commonPool-worker-8] INFO doc.ProcessDoc - Page 5 takes 2.095.
> > [ForkJoinPool.commonPool-worker-4] INFO doc.ProcessDoc - Page 8 takes 2.208.
> > [ForkJoinPool.commonPool-worker-15] INFO doc.ProcessDoc - Page 7 takes 2.336.
> > [ForkJoinPool.commonPool-worker-10] INFO doc.ProcessDoc - Page 3 takes 2.443.
> > [ForkJoinPool.commonPool-worker-13] INFO doc.ProcessDoc - Page 4 takes 2.485.
> > [ForkJoinPool.commonPool-worker-6] INFO doc.ProcessDoc - Page 0 takes 3.722.
> > [ForkJoinPool.commonPool-worker-1] INFO doc.ProcessDoc - Page 10 takes 3.765.
> > [main] INFO doc.ProcessDoc - Page 6 takes 4.479.
> > [ForkJoinPool.commonPool-worker-9] INFO doc.ProcessDoc - Page 2 takes 4.51.
>
> ImageMagick uses ghostscript which is written in C++, and they're 10
> years ahead of us. IMHO they are the best, just below Adobe.
>
> Tilman
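
For reference, per-page timings like the ones quoted above could be collected with something along the lines of the sketch below. It is only an illustration: `PageTimer`, `timePages`, and the `renderPage` callback are hypothetical names, and the callback stands in for whatever actually renders one page, which is not shown here. It uses a parallel stream, which runs on the common ForkJoinPool plus the calling thread; that would be consistent with the thread names in the logs, but it is an assumption about the setup.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.IntConsumer;
import java.util.stream.IntStream;

class PageTimer {
    // Runs renderPage once per page index, in parallel, and returns the
    // elapsed time in seconds for each page.
    static Map<Integer, Double> timePages(int pageCount, IntConsumer renderPage) {
        Map<Integer, Double> seconds = new ConcurrentHashMap<>();
        IntStream.range(0, pageCount).parallel().forEach(page -> {
            long start = System.nanoTime();
            renderPage.accept(page); // hypothetical: the real rendering call goes here
            seconds.put(page, (System.nanoTime() - start) / 1e9);
        });
        return seconds;
    }

    public static void main(String[] args) {
        // Placeholder workload that does nothing; replace with real rendering.
        Map<Integer, Double> result = timePages(11, page -> { });
        result.forEach((page, s) -> System.out.printf("Page %d takes %.3f.%n", page, s));
    }
}
```

With every page rendered in parallel, the wall-clock time is bounded by the slowest page (about 202 seconds for page 6 in the PDFBox run), so adding more parallelism does not help when a single page dominates.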

