Re: Re: PDFTextStripperByArea extracts text only from 1 region, despite several regions being defined

Andreas Lehmkühler Wed, 22 Jul 2009 23:17:46 -0700

Hi Ismael,

thanks for your detailed description. I ran a similar test and hit the
same issue. Obviously only the next-to-last region is processed; the others
somehow get lost.
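
For anyone who wants to reproduce this, here is a minimal sketch of how the
four quadrant regions from the description quoted below could be built. It
uses only java.awt.geom, so it runs without PDFBox. The class name, the region
names and the 612 x 792 page size are hypothetical, and it assumes the region
coordinates start at the top-left corner of the page with y growing downwards.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

public class QuadrantRegions {

    // Splits a page of the given size into four equal quadrants.
    // The LinkedHashMap keeps the insertion order described in the report:
    // upper left, upper right, lower left, lower right.
    static Map<String, Rectangle2D> quadrants(float pageWidth, float pageHeight) {
        float halfW = pageWidth / 2f;
        float halfH = pageHeight / 2f;
        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        // Assumption: origin at the top-left corner, y increasing downwards.
        regions.put("upperLeft", new Rectangle2D.Float(0, 0, halfW, halfH));
        regions.put("upperRight", new Rectangle2D.Float(halfW, 0, halfW, halfH));
        regions.put("lowerLeft", new Rectangle2D.Float(0, halfH, halfW, halfH));
        regions.put("lowerRight", new Rectangle2D.Float(halfW, halfH, halfW, halfH));
        return regions;
    }

    public static void main(String[] args) {
        // 612 x 792 points is only an example page size (hypothetical).
        for (Map.Entry<String, Rectangle2D> e : quadrants(612f, 792f).entrySet()) {
            // With PDFBox on the classpath, each entry would be passed to
            // areaStripper.addRegion(e.getKey(), e.getValue()) as in the
            // code quoted below.
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Each entry can then be added to the stripper in this order before calling
extractRegions, which is the setup where only one region returned text.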


Please create an issue in JIRA [1] and attach your description.

Thanks in advance
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX

----- Original Message --------

Subject: Re: PDFTextStripperByArea extracts text only from 1 region, despite
several regions being defined
Sent: Tue, 21 Jul 2009
From: Ismael Hasan<[email protected]>

> Hi Andreas,
> 
> thanks for answering. I am now loading the document as you suggested.
> 
> I will rewrite the question, since I have been doing some testing:  I
> understand that the code in my first message should return the same
> result for both of the regions, since they are defined with the same
> parameters, but it does not. Testing it with more regions, it only
> retrieves the text from one:
> 
> I divide a page in 4 regions and add the regions to the stripper in
> the following order:
> 1-upper left, 2-upper right, 3-lower left, 4-lower right.
> 
> After calling the "extractRegions" function, only the text for the third
> one is retrieved.
> If I do not add the third region, only the text for region 2 is retrieved.
> 
> 
> I think this behaviour is strange, and it may not be what is expected. In
> the example you suggested,
> 'org.apache.pdfbox.examples.util.ExtractTextByArea', only one region
> is defined, so maybe the tool is not intended to extract several
> regions at a time.
> 
> Any answer will be appreciated,
> 
> thanks in advance,
> 
> Ismael
> 
> 2009/7/21 Andreas Lehmkühler <[email protected]>:
> > Hi Ismael,
> >
> > first of all, try to load the PDF with PDDocument doc =
> > PDDocument.load(file). You don't have to parse the document yourself. See
> > org.apache.pdfbox.examples.util.ExtractTextByArea for an example of
> > extracting text areas.
> > Why are you trying to extract the same region twice? Wouldn't it be easier to
> > just copy the result string?
> >
> > BR
> > Andreas Lehmkühler
> >
> > ----- Original Message --------
> >
> > Subject: PDFTextStripperByArea extracts text only from 1 region, despite
> > several regions being defined
> > Sent: Tue, 21 Jul 2009
> > From: Ismael Hasan<[email protected]>
> >
> >> Hello. I have a problem with the class
> >> "org.apache.pdfbox.util.PDFTextStripperByArea":
> >>
> >> If I add several regions to this class to extract the text from, it is
> >> only retrieved from one of them. The example I built creates two
> >> regions with the same values (with different names), adds them to the
> >> text stripper, and calls the "extractRegions" function.
> >>
> >> I would really appreciate it if someone could tell me what I am doing
> >> wrong, or whether this is a bug in the tool.
> >>
> >> Please, see at the end of the message the code with which I get this
> >> issue; the final result buffers (localResult1 and localResult2) have
> >> different content (one of them is empty). If you need a PDF document
> >> to reproduce this, please ask me for it.
> >>
> >> Thanks in advance,
> >> Ismael
> >>
> >>
> >>
> >> // Opening the document and getting the page
> >> PDFParser parser = new PDFParser(new ByteArrayInputStream(documentInBytes));
> >> parser.parse();
> >> PDDocument doc = parser.getPDDocument();
> >> PDPage page = (PDPage) doc.getDocumentCatalog().getAllPages().get(pageNumber);
> >>
> >> // Creating the stripper
> >> PDFTextStripperByArea areaStripper = new PDFTextStripperByArea();
> >>
> >> // Creation and addition of the regions to the stripper
> >> Rectangle2D rectangle = new Rectangle2D.Float();
> >> rectangle.setRect(0, 0, 500, 100);
> >> areaStripper.addRegion("1", rectangle);
> >>
> >> Rectangle2D rectangle2 = new Rectangle2D.Float();
> >> rectangle2.setRect(0, 0, 500, 100);
> >> areaStripper.addRegion("2", rectangle2);
> >>
> >> // Extracting the regions and getting the results
> >> areaStripper.extractRegions(page);
> >> String localResult1 = areaStripper.getTextForRegion("1");
> >> String localResult2 = areaStripper.getTextForRegion("2");
> >>
> >
> > --- End of Original Message ----
> >
> >
> 

--- End of Original Message ----
