When migrating from 2.0 to 3.0 I noticed that some operations became very
slow, mainly the Splitter tool. With a big-ish file it takes *a lot* more
memory/CPU (JDK 8).

I believe the culprit is RandomAccessReadBuffer when used with input
streams. It fully reads the stream into 4 KB chunks up front (not a problem
in itself), but every time createView(..) is called (on every PDPage access,
I think) it goes through a clone constructor, and all of its ByteBuffer
chunks get duplicate()'d. For bigger files with many pages that means *tons*
of wasted objects and calls, even though the underlying buffer is the same.
Simplifying that, for example by reusing the parent bufferList instead of
duplicating it, brings CPU/memory back to the expected levels (though I
don't know the implications).
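
To make the idea concrete, here is a simplified sketch of the pattern as I
understand it (this is not the actual PDFBox source; names and structure are
just illustrative):

import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class ChunkedBuffer {
    final List<ByteBuffer> bufferList;  // the 4 KB chunks
    long position;                      // per-view read offset

    ChunkedBuffer(List<ByteBuffer> bufferList) {
        this.bufferList = bufferList;
    }

    // What seems to happen now: every view duplicates every chunk,
    // so N page accesses over M chunks create N*M short-lived objects.
    ChunkedBuffer cloneView() {
        List<ByteBuffer> copy = new ArrayList<>(bufferList.size());
        for (ByteBuffer chunk : bufferList) {
            copy.add(chunk.duplicate()); // same backing memory, new object each time
        }
        return new ChunkedBuffer(copy);
    }

    // What I mean by "reusing the parent bufferList": share the chunk list
    // and give each view only its own position (implications not checked).
    ChunkedBuffer sharedView() {
        return new ChunkedBuffer(bufferList);
    }
}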

From simple observations the Splitter seems to take about 4x more CPU/heap.
For example, with a 100 MB file of 300 pages (normal enough if you deal with
scanned docs) loaded from an input stream: 100 MB = 25,600 chunks of 4 KB,
times 300 pages = 7,680,000 objects created and GC'd in a short time, at
least.
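
Spelling that arithmetic out (the numbers are just my rough estimate):

public final class ChunkCountEstimate {
    public static void main(String[] args) {
        long fileBytes = 100L * 1024 * 1024;         // 100 MB input
        long chunkBytes = 4L * 1024;                 // 4 KB chunk size
        long chunks = fileBytes / chunkBytes;        // 25,600 chunks
        long pageAccesses = 300;                     // roughly one view per page
        System.out.println(chunks * pageAccesses);   // 7,680,000 duplicated chunk objects
    }
}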

With smaller files (a few pages) this isn't very noticeable, nor is it with
RandomAccessReadBufferedFile (which is handled differently). Passing a
pre-read byte[] to RandomAccessReadBuffer works fine (minimal duplication).
RandomAccess.createBuffer(inputStream) in alpha3 was also fine, but it was
removed in beta1. Either way, I don't think the code should be
copying/duplicating this much and it could be restructured, especially since
the migration guide hints at using RandomAccessReadBuffer for input streams.
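
For reference, the workaround I'm using for streams right now looks roughly
like this (plain JDK 8; nothing PDFBox-specific assumed beyond
Loader.loadPDF(byte[])):

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;

public final class LoadFromStream {
    public static PDDocument load(InputStream in) throws IOException {
        // Copy the stream into a single byte[] up front ...
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
        }
        // ... and let PDFBox wrap the array, which avoids the per-chunk
        // duplication described above.
        return Loader.loadPDF(out.toByteArray());
    }
}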

Also, for RandomAccessReadBuffer I think it would make more sense to read
chunks on demand in read() rather than all at once in the constructor
(faster metadata querying). Incidentally, it may be useful to increase the
default chunk size (or let users set it) to reduce fragmentation, since it's
going to read the whole thing anyway and PDFs under 4 KB aren't that common,
I'd say.
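
Purely as an illustration of "read chunks as needed" with a configurable
chunk size (again, not PDFBox code, just a sketch):

import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;

class LazyChunkBuffer {
    private final InputStream source;
    private final int chunkSize;
    private final List<byte[]> chunks = new ArrayList<>();
    private long buffered = 0;   // bytes pulled from the stream so far
    private long position = 0;   // current read offset
    private boolean eof = false;

    LazyChunkBuffer(InputStream source, int chunkSize) {
        this.source = source;
        this.chunkSize = chunkSize;
    }

    // Read one byte, pulling chunks from the stream only when needed.
    int read() throws IOException {
        while (position >= buffered && !eof) {
            loadNextChunk();
        }
        if (position >= buffered) {
            return -1; // end of stream
        }
        byte[] chunk = chunks.get((int) (position / chunkSize));
        int b = chunk[(int) (position % chunkSize)] & 0xFF;
        position++;
        return b;
    }

    private void loadNextChunk() throws IOException {
        byte[] chunk = new byte[chunkSize];
        int filled = 0;
        while (filled < chunkSize) {
            int n = source.read(chunk, filled, chunkSize - filled);
            if (n < 0) {
                eof = true;
                break;
            }
            filled += n;
        }
        if (filled > 0) {
            chunks.add(chunk);
            buffered += filled;
        }
    }
}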

(I don't have a publishable example at hand, but it can easily be replicated
by using PDFMergerUtility to join the same non-tiny PDF N times and then
splitting the result; see the sketch below.)
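
A rough sketch of that reproduction, in case it helps (paths and the copy
count are placeholders; I'm assuming the 3.0 Loader/Splitter/PDFMergerUtility
APIs):

import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.List;
import org.apache.pdfbox.Loader;
import org.apache.pdfbox.io.RandomAccessReadBuffer;
import org.apache.pdfbox.multipdf.PDFMergerUtility;
import org.apache.pdfbox.multipdf.Splitter;
import org.apache.pdfbox.pdmodel.PDDocument;

public final class SplitterRepro {
    public static void main(String[] args) throws Exception {
        File source = new File("input.pdf");   // any non-tiny PDF
        File merged = new File("merged.pdf");
        int copies = 20;

        // Build a big test file by appending the same document repeatedly.
        PDFMergerUtility merger = new PDFMergerUtility();
        try (PDDocument dest = Loader.loadPDF(source);
             PDDocument src = Loader.loadPDF(source)) {
            for (int i = 1; i < copies; i++) {
                merger.appendDocument(dest, src);
            }
            dest.save(merged);
        }

        // Load it through an InputStream-backed RandomAccessReadBuffer and
        // split it, watching heap/CPU while split() runs.
        try (InputStream in = new FileInputStream(merged);
             PDDocument doc = Loader.loadPDF(new RandomAccessReadBuffer(in))) {
            List<PDDocument> pages = new Splitter().split(doc);
            for (PDDocument page : pages) {
                page.close();
            }
        }
    }
}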

Thanks.
