Hi,

I forgot one thing: you can activate subsampling with PDFRenderer.setSubsamplingAllowed(). In some cases (large images) this makes rendering faster, with a slight loss of quality.

(However, there was a bug, see https://issues.apache.org/jira/browse/PDFBOX-5091 , so try the snapshot link mentioned at the bottom, or 2.0.20, to see whether your file renders faster.)

You could also see what happens when you set low-quality rendering hints.
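
As a rough sketch of what "low quality" hints could look like: the code below builds a java.awt.RenderingHints object that favours speed and applies it to a plain Java2D image, so it runs without PDFBox. The class and method names (LowQualityHints, speedHints) are made up for illustration. The commented lines assume that the PDFBox version in use has PDFRenderer.setRenderingHints(); please check the javadoc of your version. Whether this helps at all depends on the file.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

class LowQualityHints {

    /** Builds rendering hints that trade quality for speed (hypothetical helper). */
    static RenderingHints speedHints() {
        RenderingHints hints = new RenderingHints(
                RenderingHints.KEY_RENDERING, RenderingHints.VALUE_RENDER_SPEED);
        hints.put(RenderingHints.KEY_ANTIALIASING, RenderingHints.VALUE_ANTIALIAS_OFF);
        hints.put(RenderingHints.KEY_INTERPOLATION,
                RenderingHints.VALUE_INTERPOLATION_NEAREST_NEIGHBOR);
        return hints;
    }

    public static void main(String[] args) {
        RenderingHints hints = speedHints();

        // Plain Java2D demonstration on an off-screen image, no PDFBox required.
        BufferedImage image = new BufferedImage(100, 100, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = image.createGraphics();
        g.setRenderingHints(hints);
        g.drawLine(0, 0, 99, 99);
        g.dispose();

        // Assumption: with PDFBox on the classpath, the same hints would be
        // handed to the renderer together with subsampling, roughly like this:
        //   PDFRenderer renderer = new PDFRenderer(document);
        //   renderer.setSubsamplingAllowed(true);
        //   renderer.setRenderingHints(hints);
        System.out.println(hints.size() + " rendering hints set");
    }
}
```

Compare the timing and the output image of a slow page with and without these settings before deciding whether the quality loss is acceptable.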

Things that usually make rendering slow:
- thousands of images
- huge shadings
- very complex clipping paths

Tilman

On 05.02.2021 at 22:45, Ethan Huang wrote:
Thanks for the info! We would prefer to continue with PDFBox if possible.
Lowering the resolution would give our users a bad experience.

I am requesting permission to share the files. Once I have it, I will share
them here.

On Fri, Feb 5, 2021 at 11:30 AM Tilman Hausherr <[email protected]>
wrote:

On 05.02.2021 at 20:17, Ethan Huang wrote:
We are converting PDF files into images. We do this by splitting a single
PDF file into several PDDocument objects, one per page, and converting them
in parallel.



What I found is that pages with more objects take much longer to process
(see the logs below; times are in seconds).

I cannot share the test file for now. I will need to ask for permission.

Please do so. It is unlikely, but sometimes we do find an optimization
potential in PDFBox when a file is slow. However in most cases we can't
help.


Is there a way to make it faster? I also see the log messages below for the
pages that take longer to process.
No, not really. You could lower the resolution.
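
To put a rough number on the resolution trade-off: PDF page sizes are given in points (1/72 inch), so rendering at a given DPI scales each dimension by about dpi / 72, and the pixel count grows with the square of the DPI. The sketch below only does that arithmetic. The class name RenderSize and the US Letter page size (612 x 792 points) are assumptions for illustration, since the page size of the file in question is not known, and the exact rounding inside PDFBox may differ.

```java
class RenderSize {

    static final double POINTS_PER_INCH = 72.0;

    /** Approximate image size in pixels for a page given in points (hypothetical helper). */
    static int[] pixelSize(double widthPt, double heightPt, double dpi) {
        double scale = dpi / POINTS_PER_INCH;
        return new int[] {
            (int) Math.round(widthPt * scale),
            (int) Math.round(heightPt * scale)
        };
    }

    public static void main(String[] args) {
        // Assumed page size: US Letter, 612 x 792 points.
        double widthPt = 612;
        double heightPt = 792;

        for (double dpi : new double[] {300, 150, 72}) {
            int[] size = pixelSize(widthPt, heightPt, dpi);
            long pixels = (long) size[0] * size[1];
            System.out.println(dpi + " DPI: " + size[0] + " x " + size[1]
                    + " = " + pixels + " pixels");
        }
    }
}
```

Halving the DPI cuts the number of pixels to a quarter, which is why lowering the resolution tends to help, at the cost of image quality.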


Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
INFO: Pattern surface is too large, will be clipped
Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
INFO: width: 4405.8223, height: -4405.8223
Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
INFO: XStep: 1707.63, YStep: 1707.63
Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
INFO: bbox: [-54.8253,-217.611,1652.8,1490.02]
Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
INFO: pattern matrix: [2.58008,0.0,0.0,-2.58008,0.0,540.0]
Feb 04, 2021 5:39:20 PM org.apache.pdfbox.rendering.TilingPaint getAnchorRect
INFO: concatenated matrix: [2.58008,0.0,0.0,-2.58008,0.0,540.0]


Logs showing the object count and processing time per page for the file
with PDFBox:


[main] INFO doc.DocumentProcessorUtils - page 0 has 20 objs.
[main] INFO doc.DocumentProcessorUtils - page 1 has 24 objs.
[main] INFO doc.DocumentProcessorUtils - page 2 has 176 objs.
[main] INFO doc.DocumentProcessorUtils - page 3 has 21 objs.
[main] INFO doc.DocumentProcessorUtils - page 4 has 26 objs.
[main] INFO doc.DocumentProcessorUtils - page 5 has 21 objs.
[main] INFO doc.DocumentProcessorUtils - page 6 has 138 objs.
[main] INFO doc.DocumentProcessorUtils - page 7 has 33 objs.
[main] INFO doc.DocumentProcessorUtils - page 8 has 22 objs.
[main] INFO doc.DocumentProcessorUtils - page 9 has 26 objs.
[main] INFO doc.DocumentProcessorUtils - page 10 has 52 objs.

[ForkJoinPool.commonPool-worker-10] INFO doc.Pdf2Image - Page 3 takes 0.803.
[ForkJoinPool.commonPool-worker-13] INFO doc.Pdf2Image - Page 8 takes 0.805.
[ForkJoinPool.commonPool-worker-8] INFO doc.Pdf2Image - Page 4 takes 0.822.
[ForkJoinPool.commonPool-worker-15] INFO doc.Pdf2Image - Page 0 takes 0.852.
[ForkJoinPool.commonPool-worker-11] INFO doc.Pdf2Image - Page 5 takes 0.892.
[ForkJoinPool.commonPool-worker-4] INFO doc.Pdf2Image - Page 1 takes 0.901.
[ForkJoinPool.commonPool-worker-6] INFO doc.Pdf2Image - Page 7 takes 0.962.
[ForkJoinPool.commonPool-worker-2] INFO doc.Pdf2Image - Page 9 takes 1.075.
[ForkJoinPool.commonPool-worker-1] INFO doc.Pdf2Image - Page 10 takes 73.145.
[ForkJoinPool.commonPool-worker-9] INFO doc.Pdf2Image - Page 2 takes 201.11.
[main] INFO doc.Pdf2Image - Page 6 takes 202.048.

I don't think there is a correlation between the number of objects and
the rendering time.


I also tried ImageMagick for the same task at the same DPI. This is what I
get: it seems much faster for the pages with more objects, although it is a
bit slower than PDFBox for the other pages.

[main] INFO doc.DocumentProcessorUtils - page 0 has 20 objs.
[main] INFO doc.DocumentProcessorUtils - page 1 has 24 objs.
[main] INFO doc.DocumentProcessorUtils - page 2 has 176 objs.
[main] INFO doc.DocumentProcessorUtils - page 3 has 21 objs.
[main] INFO doc.DocumentProcessorUtils - page 4 has 26 objs.
[main] INFO doc.DocumentProcessorUtils - page 5 has 21 objs.
[main] INFO doc.DocumentProcessorUtils - page 6 has 138 objs.
[main] INFO doc.DocumentProcessorUtils - page 7 has 33 objs.
[main] INFO doc.DocumentProcessorUtils - page 8 has 22 objs.
[main] INFO doc.DocumentProcessorUtils - page 9 has 26 objs.
[main] INFO doc.DocumentProcessorUtils - page 10 has 52 objs.
[ForkJoinPool.commonPool-worker-2] INFO doc.ProcessDoc - Page 9 takes 1.684.
[ForkJoinPool.commonPool-worker-11] INFO doc.ProcessDoc - Page 1 takes 2.081.
[ForkJoinPool.commonPool-worker-8] INFO doc.ProcessDoc - Page 5 takes 2.095.
[ForkJoinPool.commonPool-worker-4] INFO doc.ProcessDoc - Page 8 takes 2.208.
[ForkJoinPool.commonPool-worker-15] INFO doc.ProcessDoc - Page 7 takes 2.336.
[ForkJoinPool.commonPool-worker-10] INFO doc.ProcessDoc - Page 3 takes 2.443.
[ForkJoinPool.commonPool-worker-13] INFO doc.ProcessDoc - Page 4 takes 2.485.
[ForkJoinPool.commonPool-worker-6] INFO doc.ProcessDoc - Page 0 takes 3.722.
[ForkJoinPool.commonPool-worker-1] INFO doc.ProcessDoc - Page 10 takes 3.765.
[main] INFO doc.ProcessDoc - Page 6 takes 4.479.
[ForkJoinPool.commonPool-worker-9] INFO doc.ProcessDoc - Page 2 takes 4.51.
ImageMagick uses Ghostscript, which is written in C, and they're 10
years ahead of us. IMHO they are the best, just below Adobe.

Tilman


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]



