I'm splitting a document into groups of 20 pages using the Splitter (PDFBox
3.0.3).
This works as expected: the sum of the group sizes (~77MB) is similar to
the full document size (~64MB).
*But if I remove the annotations from each page before splitting,* ten of
the resulting groups are 64MB each, and the sum of sizes (~660MB) is huge
compared to the original document (~64MB).
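For reference, totals like the ones above can be computed by adding up the
sizes of the saved files. The following is only a minimal sketch using
java.nio; the class and method names (SplitSizeCheck, totalBytes,
inflation) are hypothetical and not part of PDFBox.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper, not part of PDFBox: compares split output with the source file.
class SplitSizeCheck {

    // Sums the on-disk size, in bytes, of all split parts.
    static long totalBytes(List<Path> parts) throws IOException {
        long total = 0;
        for (Path part : parts) {
            total += Files.size(part);
        }
        return total;
    }

    // Ratio of the combined part size to the original file size.
    // A value far above 1 means data is being duplicated across parts.
    static double inflation(Path original, List<Path> parts) throws IOException {
        return (double) totalBytes(parts) / Files.size(original);
    }
}
```

With the numbers reported here, the ratio is roughly 77/64 in the normal
case and roughly 660/64 when the annotations are removed first.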
*Result without removing annotations:*
Permissions Size User Date Modified Name
.rw-rw-r--   10M joan 23 dic 16:00  'test 0.pdf'
.rw-rw-r--  7,9M joan 23 dic 16:00  'test 1.pdf'
.rw-rw-r--  6,9M joan 23 dic 16:00  'test 2.pdf'
.rw-rw-r--  6,2M joan 23 dic 16:00  'test 3.pdf'
.rw-rw-r--  3,1M joan 23 dic 16:00  'test 4.pdf'
.rw-rw-r--  6,5M joan 23 dic 16:00  'test 5.pdf'
.rw-rw-r--  6,8M joan 23 dic 16:00  'test 6.pdf'
.rw-rw-r--  4,3M joan 23 dic 16:00  'test 7.pdf'
.rw-rw-r--  5,0M joan 23 dic 16:00  'test 8.pdf'
.rw-rw-r--  2,8M joan 23 dic 16:00  'test 9.pdf'
.rw-rw-r--  5,4M joan 23 dic 16:00  'test 10.pdf'
.rw-rw-r--  4,7M joan 23 dic 16:00  'test 11.pdf'
.rw-rw-r--  3,5M joan 23 dic 16:00  'test 12.pdf'
.rw-rw-r--  3,4M joan 23 dic 16:00  'test 13.pdf'
.rw-rw-r--  815k joan 23 dic 16:00  'test 14.pdf'
*Result removing annotations:*
Permissions Size User Date Modified Name
.rw-rw-r--   10M joan 23 dic 16:53  'test 0.pdf'
.rw-rw-r--   64M joan 23 dic 16:53  'test 1.pdf'
.rw-rw-r--   64M joan 23 dic 16:53  'test 2.pdf'
.rw-rw-r--   64M joan 23 dic 16:53  'test 3.pdf'
.rw-rw-r--   64M joan 23 dic 16:53  'test 4.pdf'
.rw-rw-r--   64M joan 23 dic 16:53  'test 5.pdf'
.rw-rw-r--   64M joan 23 dic 16:53  'test 6.pdf'
.rw-rw-r--   64M joan 23 dic 16:53  'test 7.pdf'
.rw-rw-r--   64M joan 23 dic 16:53  'test 8.pdf'
.rw-rw-r--   64M joan 23 dic 16:53  'test 9.pdf'
.rw-rw-r--   64M joan 23 dic 16:53  'test 10.pdf'
.rw-rw-r--  4,7M joan 23 dic 16:53  'test 11.pdf'
.rw-rw-r--  3,5M joan 23 dic 16:53  'test 12.pdf'
.rw-rw-r--  3,4M joan 23 dic 16:53  'test 13.pdf'
.rw-rw-r--  833k joan 23 dic 16:53  'test 14.pdf'
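Assuming the files are numbered in the order the Splitter returns the
groups (the path supplier is not shown here, so this is an assumption),
the oversized files 'test 1.pdf' to 'test 10.pdf' would correspond to
pages 21 to 220 of the original. A small sketch of that mapping, with
hypothetical names and 1-based page numbers:

```java
// Hypothetical helper, not part of PDFBox: maps an output file index to the
// range of original pages it should contain when splitting every N pages.
class SplitGroups {

    // First original page (1-based) in the group with the given 0-based index.
    static int firstPage(int groupIndex, int splitAtPage) {
        return groupIndex * splitAtPage + 1;
    }

    // Last original page (1-based) in that group; the final group may be
    // shorter, so the result is capped at the document's page count.
    static int lastPage(int groupIndex, int splitAtPage, int totalPages) {
        return Math.min((groupIndex + 1) * splitAtPage, totalPages);
    }
}
```

The total page count is a parameter because it is not stated in this
message; with 15 output files of up to 20 pages it must be between 281
and 300.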
*Related code:*
private static List<Path> splitPdfByCleanAnnotations(Path fileToSplit,
        Supplier<Path> pathSupplier, int splitAtPage) throws IOException {
    Splitter splitter = new Splitter();
    splitter.setSplitAtPage(splitAtPage);
    try (var document = Loader.loadPDF(fileToSplit.toFile())) {
        // Without this call the split files have the normal sizes.
        clearAnnotations(document);
        return splitAndSave(pathSupplier, splitter, document);
    }
}

// Empties the annotation list of every page in the document.
private static void clearAnnotations(PDDocument document) throws IOException {
    for (int i = 0; i < document.getNumberOfPages(); i++) {
        document.getPage(i).getAnnotations().clear();
    }
}

// Splits the document, saves each part to a new path and closes it.
private static List<Path> splitAndSave(Supplier<Path> pathSupplier,
        Splitter splitter, PDDocument document) throws IOException {
    return splitter.split(document).stream()
            .map(d ->
                callOrLog(() -> {
                    try (d) {
                        Path path = pathSupplier.get();
                        d.save(path.toFile());
                        return path;
                    }
                })
            ).toList();
}
Here is the link to the PDF: https://file.io/KI2CFBB87H4c
Any idea why this is happening with this PDF?
Thanks!
P.S. We split hundreds of PDFs each day, and this is the first time we
have seen this issue.