On 28.08.23 at 13:30, bnncdv wrote:
> When migrating from 2.0 to 3.0 I noticed some operations were very slow,
> mainly the Splitter tool. With a big-ish file it would take *a lot* more
> memory/cpu (jdk8).
What exactly are you doing? I've tried to reproduce the issue and I've been successful with regard to the memory footprint, but I can't confirm the higher CPU usage.

I've split the PDF spec, a 32 MB file with more than 1,300 pages, into 2-page PDFs, and I can't see any difference in CPU usage whether I use a file or an input stream.
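
For reference, this is roughly the comparison I ran (a sketch, not the exact benchmark; the file name is a placeholder and CPU/heap were observed with external tooling):

import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.io.RandomAccessReadBuffer;
import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;

public class SplitComparison
{
    public static void main(String[] args) throws Exception
    {
        Path pdf = Paths.get("PDF32000_2008.pdf"); // placeholder for the spec file

        // variant 1: load from a file
        try (PDDocument doc = Loader.loadPDF(pdf.toFile()))
        {
            split(doc);
        }

        // variant 2: load from an input stream
        try (InputStream in = Files.newInputStream(pdf);
             PDDocument doc = Loader.loadPDF(new RandomAccessReadBuffer(in)))
        {
            split(doc);
        }
    }

    private static void split(PDDocument doc) throws Exception
    {
        Splitter splitter = new Splitter();
        splitter.setSplitAtPage(2); // 2-page chunks, as above
        List<PDDocument> parts = splitter.split(doc);
        for (PDDocument part : parts)
        {
            part.close();
        }
    }
}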

However, I was able to reproduce the regression in memory consumption and fixed/optimized it in [1].


> I believe the culprit is RandomAccessReadBuffer with input streams. This
> fully reads the stream in 4KB chunks (not a problem), however every time
We have to do that, as we need random access to the file. 2.0.x does the same.

> createView(..) is called (on every PDPage access I think) it calls a clone
> RARB constructor, and all its ByteBuffer chunks are duplicate()'d, which for
> bigger files with many pages means *tons* of wasted objects + calls (even
> if the underlying buf is the same). Simplifying that, for example by
> reusing the parent bufferList rather than duplicating it, uses the expected
> cpu/memory (I don't know the implications though).
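
To make the duplication pattern concrete, here is a simplified, hypothetical illustration of the two approaches described above (this is not the actual PDFBox source; ByteBuffer.duplicate() is the JDK call in question):

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class ChunkedBuffer
{
    final List<ByteBuffer> chunks;

    ChunkedBuffer(List<ByteBuffer> chunks)
    {
        this.chunks = chunks;
    }

    // the pattern described above: every view duplicate()'s every chunk,
    // e.g. a 100MB file in 4KB chunks costs 25,600 new objects per view
    ChunkedBuffer copyingView()
    {
        List<ByteBuffer> copy = new ArrayList<>(chunks.size());
        for (ByteBuffer chunk : chunks)
        {
            copy.add(chunk.duplicate()); // shares the bytes, but is a new object
        }
        return new ChunkedBuffer(copy);
    }

    // the suggested simplification: share the parent's list; a real
    // implementation would then need a per-view read position instead of
    // relying on each duplicate's independent position/limit
    ChunkedBuffer sharingView()
    {
        return new ChunkedBuffer(chunks);
    }
}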

> From simple observations Splitter seems to take 4x more cpu/heap. For
> example I'd assume with a 100MB file of 300 pages (normal enough if you
> deal with scanned docs) + input stream: 100MB = 25,600 chunks of 4KB * 300
> pages = 7,680,000 objects created+gc'd in a short time, at least.

> With smaller files (few pages) this isn't very noticeable, nor with
> RandomAccessReadBufferedFile (different handling). Passing a pre-read
> byte[] to RandomAccessReadBuffer works ok (minimal dupes).
RandomAccessReadBufferedFile has a built-in cache to avoid too many copies, see [1].
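
If you need to stay on the stream path in the meantime, buffering everything into a single byte[] first should avoid the long per-view chunk list (a sketch; Loader.loadPDF(byte[]) is a 3.0 API, the rest is plain JDK):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;

public class StreamWorkaround
{
    static PDDocument loadFromStream(InputStream in) throws IOException
    {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1)
        {
            out.write(buf, 0, n);
        }
        // one backing array means one chunk, so views have little to duplicate
        return Loader.loadPDF(out.toByteArray());
    }
}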

> RandomAccess.createBuffer(inputStream) in alpha3 was also ok but was removed
> in beta1. Either way, I don't think the code should be copying/duping so much
> and could be restructured, especially since the migration guide hints at
> using RandomAccessReadBuffer for input streams.
Alpha3 did the same as the final version 3.0.0. The removed method was redundant.
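
For anyone following along, the migration in question looks like this (the 2.0 and 3.0 calls as documented; inputStream is whatever stream you have):

// PDFBox 2.0.x
PDDocument doc20 = PDDocument.load(inputStream);

// PDFBox 3.0.x (org.apache.pdfbox.Loader, org.apache.pdfbox.io.RandomAccessReadBuffer)
PDDocument doc30 = Loader.loadPDF(new RandomAccessReadBuffer(inputStream));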

> Also, for RARB it'd make more sense to read chunks as needed in read()
> rather than all at once in the constructor I think (faster metadata
> querying). Incidentally, it may be useful to increase the default chunk size
> (or allow users to set it) to reduce fragmentation, since it's going to
> read the whole thing and PDFs < 4KB aren't that common I'd say.
We have to read all data, as we need random access to the PDF. In many cases one of the first steps is to jump to the end of the PDF to read the cross reference table/stream.
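
Schematically (this is not the actual parser code, just an illustration of why the tail has to be reachable first):

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;

public class TailRead
{
    // a PDF ends with "startxref\n<offset>\n%%EOF"; the offset points at the
    // cross reference table/stream, so parsing starts by seeking to the end
    static long findStartXref(RandomAccessFile file) throws IOException
    {
        int tailLen = (int) Math.min(1024, file.length());
        byte[] tail = new byte[tailLen];
        file.seek(file.length() - tailLen); // first action: jump to the end
        file.readFully(tail);
        String s = new String(tail, StandardCharsets.ISO_8859_1);
        int idx = s.lastIndexOf("startxref");
        if (idx < 0)
        {
            throw new IOException("startxref not found in the last " + tailLen + " bytes");
        }
        // the token after "startxref" is the byte offset to seek to next
        return Long.parseLong(s.substring(idx + "startxref".length()).trim().split("\\s+")[0]);
    }
}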

> (I don't have a publishable example at hand, but it can be easily replicated
> by using the PDFMergerUtility and joining the same non-tiny PDF N times,
> then splitting it.)
There has to be something special about your use case and/or PDF, as I can't reproduce the CPU issue; see above.
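
For completeness, such a reproduction would look something like this (a sketch; input.pdf and the repeat count are placeholders, and IOUtils.createMemoryOnlyStreamCache() is the 3.0 replacement for the old MemoryUsageSetting parameter as far as I know):

import java.io.File;
import org.apache.pdfbox.io.IOUtils;
import org.apache.pdfbox.multipdf.PDFMergerUtility;

public class BuildBigTestFile
{
    public static void main(String[] args) throws Exception
    {
        PDFMergerUtility merger = new PDFMergerUtility();
        for (int i = 0; i < 50; i++) // join the same non-tiny PDF N times
        {
            merger.addSource(new File("input.pdf"));
        }
        merger.setDestinationFileName("big.pdf");
        merger.mergeDocuments(IOUtils.createMemoryOnlyStreamCache());
        // then split big.pdf via an input stream as shown earlier
    }
}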


Andreas

> Thanks.



[1] https://issues.apache.org/jira/browse/PDFBOX-5685
