Hi,
We're always interested in optimizations. Please do one thing at a time:
I have regression tests that render over 1000 PDF files, so we find
problems early and can "blame" them on a specific optimization. Submit
your changes as .diff / .patch files. You can also do PRs. Regarding code
formatting, I can do that automatically; however, it is important that
your change doesn't reformat existing code, so we can see what change you made.
Tilman
On 16.06.2021 at 20:23, Gunnar Brand wrote:
Hi.
I am using PDFBox to render PDF files into images. There is a certain file
I use as a benchmark for any PDF library, and PDFBox has some problems with
it (please note that almost all third-party PDF engines have issues with this
file):
https://archive.org/details/AlfaWaffenkatalog1911
Good news: PDFBox renders the file perfectly!
Bad news: it takes forever to do so (16 seconds for the first page in
PDFDebugger on my machine).
I asked myself why this is, identified and "fixed" several things, and
got the time down to 6 seconds.
I started fixing these issues earlier this year, but I can't work on it all the
time. (I noticed PDFBOX-5145, which was a good start but misses some things.)
The problem lies in the optimized nature of this file: it stores the white
of the background, the black of the text, an image mask for the text, and
the drawings separately. This is nothing new; I have a scan of a very old
magazine which was optimized from 90 to 9 MB in a similar way (but with slight
differences, so it loads in a second).
What you have is basically a low-res picture of white soup, a low-res picture
of black soup, a very, very high-res single-bit image mask (say
10000*10000 pixels) and a bunch of normal-res images for the drawings.
The difference from the fast PDF is that the image mask is applied to the black
soup image as a mask (the fast PDF renders it directly) and that the image mask
is stored as JBIG2 instead of CCITTFax.
Since this happens without the final target image resolution in mind, the
apply-mask step works on the full 10000*10000 pixels.
(Memory requirements: 12 MB for the bitmask, 100 MB for the 8-bit mask (luckily
single-bit masks get expanded to only 8 bit; anything else turns into RGB),
400 MB for the picture, plus an extra 400 MB for a pointless intermediate
image.)
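The figures above can be reproduced with some simple arithmetic. This is only a rough sketch: the class and method names are hypothetical, and it assumes a packed 1-bit mask with rows padded to whole bytes, one byte per pixel for the 8-bit mask, and four bytes per pixel for the RGB picture.

```java
// Hypothetical helper, not part of PDFBox: estimates raw pixel buffer sizes.
final class MaskMemoryEstimate {

    // 1 bit per pixel, each row padded to a whole number of bytes.
    static long bytesForBitmask(long width, long height) {
        return ((width + 7) / 8) * height;
    }

    // 1 byte per pixel (single-bit mask expanded to 8-bit gray).
    static long bytesFor8BitMask(long width, long height) {
        return width * height;
    }

    // 4 bytes per pixel (assumption: an int-packed RGB/ARGB image).
    static long bytesForRgb(long width, long height) {
        return width * height * 4;
    }

    public static void main(String[] args) {
        long w = 10_000, h = 10_000;
        System.out.println("bitmask:    " + bytesForBitmask(w, h) / 1e6 + " MB");
        System.out.println("8-bit mask: " + bytesFor8BitMask(w, h) / 1e6 + " MB");
        System.out.println("RGB image:  " + bytesForRgb(w, h) / 1e6 + " MB");
    }
}
```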
Things seen in apply mask:
* Scaling the image to the mask is very, very slow with bicubic
interpolation if you have a 10x scaling factor on each axis and a large
target. Bilinear should be used somehow in these cases (I used an area
enlargement of 16 as the threshold, but it probably should also take the
absolute number of pixels into account). This is a major performance gain
(about 2 seconds instead of many more). Nearest neighbor is even faster
(takes no time) but of course not an option.
* There is some wasteful image allocation happening (400 MB).
* The PDFBOX-5145 bulk copy works in a roundabout way that slows it down.
* It's possible to use direct alpha copying, which is even faster
(optional).
* Softmask code could use integer math, which is twice as fast as float
with negligible error (0.001%) (this is a bonus optimization).
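Two of the points above can be illustrated with a minimal sketch. It is not the actual patch: the class name, method names and the way the threshold is applied are assumptions made for illustration. The first method picks bilinear instead of bicubic once the area enlargement reaches 16; the second shows one common way to replace a floating-point multiply-and-divide by 255 with integer math (not necessarily the integer scheme measured above).

```java
import java.awt.RenderingHints;

// Hypothetical helper, not part of PDFBox.
final class MaskScalingSketch {

    // Assumption: area enlargement factor at which bicubic becomes too slow.
    static final double AREA_THRESHOLD = 16.0;

    // Returns the interpolation hint value to pass with
    // RenderingHints.KEY_INTERPOLATION when scaling src up to dst.
    static Object chooseInterpolation(int srcW, int srcH, int dstW, int dstH) {
        double areaFactor = ((double) dstW * dstH) / ((double) srcW * srcH);
        return areaFactor >= AREA_THRESHOLD
                ? RenderingHints.VALUE_INTERPOLATION_BILINEAR
                : RenderingHints.VALUE_INTERPOLATION_BICUBIC;
    }

    // Integer replacement for Math.round(c * a / 255.0f),
    // valid for c and a in the range 0..255.
    static int mulDiv255(int c, int a) {
        int t = c * a + 128;
        return (t + (t >> 8)) >> 8;
    }
}
```

A caller would set the returned value on a Graphics2D via setRenderingHint(RenderingHints.KEY_INTERPOLATION, ...) before drawing the scaled image.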
With this alone I shaved off almost half the time. I also looked at the mask
reading part:
* from1bit() could be optimized a bit (it also fails to issue a warning and
break the loop if subsampling is enabled).
* Reading the JBIG2 image in the JBIG2 library is very slow.
I understand that JBIG2 is way more complex than CCITTFax, but careful
investigation showed that of 2 seconds, 0.5 was spent decoding the image
itself (depending on page complexity this number can be lower or higher) and
1.5 converting the bitmap into a BufferedImage. I optimized those 1.5 seconds
down to a few milliseconds.
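One way such a conversion can be made cheap is to copy the packed bits in bulk instead of setting pixels one by one. The sketch below only illustrates that idea and is not the change described above: the class name and input layout are assumptions (1 bit per pixel, most significant bit first, rows padded to whole bytes, 1 = black), and it does not use the JBIG2 library's own bitmap type.

```java
import java.awt.image.BufferedImage;
import java.awt.image.DataBufferByte;
import java.awt.image.IndexColorModel;

// Hypothetical helper, not part of PDFBox or the JBIG2 library.
final class PackedBitmapToImage {

    static BufferedImage toImage(byte[] packed, int width, int height) {
        int stride = (width + 7) / 8; // bytes per row, padded to a full byte
        if (packed.length < stride * height) {
            throw new IllegalArgumentException("packed buffer too small");
        }
        // Two-entry palette: index 0 = white, index 1 = black.
        byte[] lut = {(byte) 0xFF, 0x00};
        IndexColorModel cm = new IndexColorModel(1, 2, lut, lut, lut);
        BufferedImage img =
                new BufferedImage(width, height, BufferedImage.TYPE_BYTE_BINARY, cm);
        // TYPE_BYTE_BINARY stores 1-bit pixels packed with the same row
        // stride, so the whole bitmap can be copied in a single call.
        byte[] dst = ((DataBufferByte) img.getRaster().getDataBuffer()).getData();
        System.arraycopy(packed, 0, dst, 0, stride * height);
        return img;
    }
}
```

If the source bitmap uses a different row stride or bit order, the copy would have to be done per row or per byte instead.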
If you are interested in any of this, I can clone the git repo and
"implement" my changes there, so you can pull anything that might be worth
it back into the main repo.
(What I can already say is that it's probably not going to be 100% compliant
with the formatting style: no leading tabs is one thing, but I can't guarantee
the whitespace around curly-bracket lines or the absence of single-line if
statements.)
Gunnar