Re: PDFBox 2.0.23 MergeUtility

viraf Tue, 30 Mar 2021 04:41:06 -0700

 I apologize if I was unclear in my earlier response.  There are two items in 
the initial e-mail.  The memory issue that I am facing, and a request to 
provide an abstraction to a resource.  
The APIs allow one to specify File(s) and InputStream(s) as sources that need 
to be merged by the PDFMergerUtility.  In my case I have a large number of 
sources (thousands) that are located in cloud storage.  Opening thousands of 
input streams and passing them to PDFMergerUtility is not efficient.  So, my 
request was to provide an abstraction such as:

    public interface InputStreamResource {
        String getDetails();
        InputStream getInputStream();
    }

This would allow me (the developer) to provide an appropriate implementation 
for the specific resource, and defer creating the InputStream until it is 
needed.  The following code in optimizedMergeDocuments would need to be 
modified to accept an InputStreamResource and call getInputStream to open the 
InputStream to the resource.


    private void optimizedMergeDocuments(MemoryUsageSetting memUsageSetting) throws IOException
    {
        PDDocument destination = null;
        try
        {
            destination = new PDDocument(memUsageSetting);
            PDFCloneUtility cloner = new PDFCloneUtility(destination);
            for (Object sourceObject : sources)
            {
                PDDocument sourceDoc = null;
                try
                {
                    if (sourceObject instanceof File)
                    {
                        sourceDoc = PDDocument.load((File) sourceObject, memUsageSetting);
                    }
                    else
                    {
                        sourceDoc = PDDocument.load((InputStream) sourceObject, memUsageSetting);
                    }

I am hoping that the above clarifies my request.  The abstraction provides 
great flexibility to accommodate downstream resource types.
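
To illustrate the idea, here is a minimal sketch of how such an abstraction could defer opening each stream until the merge loop reaches it. It deliberately does not touch PDFBox: BytesResource, LazyMergeSketch and consumeAll are hypothetical names made up for this example, the "throws IOException" on getInputStream is my own addition to the interface above, and the loop only counts bytes where the real code would load a PDDocument from the stream.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// The proposed abstraction: the stream is only opened when it is requested.
// (The throws clause is an assumption; the interface above does not declare it.)
interface InputStreamResource {
    String getDetails();
    InputStream getInputStream() throws IOException;
}

// Hypothetical implementation backed by a byte array. A cloud implementation
// would instead call its storage client inside getInputStream().
class BytesResource implements InputStreamResource {
    static int opened = 0; // number of streams opened so far, for demonstration only
    private final String name;
    private final byte[] data;

    BytesResource(String name, byte[] data) {
        this.name = name;
        this.data = data;
    }

    @Override
    public String getDetails() {
        return name;
    }

    @Override
    public InputStream getInputStream() {
        opened++;
        return new ByteArrayInputStream(data);
    }
}

class LazyMergeSketch {
    // Stand-in for the merge loop: opens, consumes and closes one source at a
    // time, so at most one stream is open at any moment.
    static long consumeAll(List<InputStreamResource> sources) {
        long total = 0;
        byte[] buffer = new byte[8192];
        for (InputStreamResource source : sources) {
            try (InputStream in = source.getInputStream()) {
                int n;
                while ((n = in.read(buffer)) != -1) {
                    total += n;
                }
            } catch (IOException e) {
                // getDetails() lets the error message name the failing source.
                throw new UncheckedIOException("Failed to read " + source.getDetails(), e);
            }
        }
        return total;
    }
}
```

With this shape, building the list of thousands of sources opens nothing; a stream exists only between getInputStream() and the end of the try block for that one source.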
Thanks
- viraf 

    On Monday, March 29, 2021, 11:03:51 PM EDT, Tilman Hausherr 
<[email protected]> wrote:  
 
 The document loading calls for File and InputStream are different, so I 
don't see how this would help. I also don't see how this would solve 
your memory problems.

Tilman

Am 30.03.2021 um 04:01 schrieb viraf:
>  I was not suggesting adding Spring to PDFBox.  I was suggesting that we add 
>an interface like Resource and allow developers to provide the implementation 
>for getInputStream.  This way we can support File, InputStream and various 
>cloud environments with little to no impact to PDFBox.  Furthermore the PDFBox 
>code would be cleaner as it would not need checks on the type of source.
> Could you expand on "Maybe it is a good idea to close the files from time to 
> time and not to wait
> until all are merged together.".  Each source (input stream) being merged 
> is closed once merged.  The consolidated PDF is not closed until all files 
> are merged.  Are you suggesting to close the consolidated PDF ?  If so, is 
> there a way to reopen it for merging ?
> Happy to try the new release.
> Thanks.
> - viraf
>
>
>      On Monday, March 29, 2021, 01:14:49 PM EDT, Andreas Lehmkuehler 
><[email protected]> wrote:
>  
>  Hi,
>
> Am 29.03.21 um 03:17 schrieb viraf:
>> I am using PDFBox 2.0.23 to merge a large number of single page searchable 
>> PDF files.  As these files are stored in the cloud, I had to make a copy of 
>> PDFMergerUtility::optimizedMergeDocuments, passing in a sourceObject other 
>> than a File or InputStream.
>> In support of cloud environments, I am requesting an enhancement to PDFBox 
>> allowing one to pass in an object that implements an interface such as the 
>> Resource in the SpringFramework.
> I'm afraid that won't happen as it would add one or more Spring jars as
> dependencies just to support a Resource.
>
>> In merging a large number of files, I frequently get OOM.  An examination 
>> in VisualVM indicates a large number of ScratchFile objects.
>> Looking for suggestions on how best to merge a large number (say 100K) of 
>> searchable PDF files generated during OCR (i.e. image + text).
> It is possible to reduce the usage of ScratchFile objects by using the main
> memory instead. Have a look at org.apache.pdfbox.io.MemoryUsageSetting for
> further details.
> Maybe it is a good idea to close the files from time to time and not to wait
> until all are merged together.
>
> If you are able to experiment a little you might wanna use the upcoming new
> major release 3.0.0. A first release candidate will be available in a few 
> days.
> It provides an on demand parser which doesn't use ScratchFiles for reading
> anymore, those are limited to writing.
>
> Andreas
>
>> Thanks
>> - viraf
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>    



