Re: Re: PDFTextStripperByArea extracts text only from 1 region, despite several regions being defined

Ismael Hasan Thu, 23 Jul 2009 04:30:12 -0700

The issue can be found at

https://issues.apache.org/jira/browse/PDFBOX-495
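
For anyone trying to reproduce this, here is a rough sketch of how the four regions from the test quoted below could be computed. It only builds the rectangles with the standard library; the class name QuadrantRegions, the method quadrants, the region names and the 612 x 792 page size are hypothetical, and I am assuming a top-left origin with y growing downwards. The PDFBox call is left as a comment so the snippet runs without the library.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

public class QuadrantRegions {

    // Splits a page into four equal quadrants, keyed by name, in the order
    // upper left, upper right, lower left, lower right.
    // Assumption: origin at the top-left corner, y growing downwards.
    public static Map<String, Rectangle2D> quadrants(double pageWidth, double pageHeight) {
        double w = pageWidth / 2;
        double h = pageHeight / 2;
        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        regions.put("1-upper-left", new Rectangle2D.Double(0, 0, w, h));
        regions.put("2-upper-right", new Rectangle2D.Double(w, 0, w, h));
        regions.put("3-lower-left", new Rectangle2D.Double(0, h, w, h));
        regions.put("4-lower-right", new Rectangle2D.Double(w, h, w, h));
        return regions;
    }

    public static void main(String[] args) {
        // 612 x 792 points is an assumed page size, not taken from the test PDF.
        Map<String, Rectangle2D> regions = quadrants(612, 792);
        for (Map.Entry<String, Rectangle2D> e : regions.entrySet()) {
            // With PDFBox on the classpath, each entry would be registered via
            // areaStripper.addRegion(e.getKey(), e.getValue()) at this point.
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

After registering all four regions and calling extractRegions(page), each name should yield its own text from getTextForRegion; in my test only one of them does.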




2009/7/23 Andreas Lehmkühler <[email protected]>:
> Hi Ismael,
>
> thanks for your detailed description. I've made a similar test and ran into
> the same issue. Obviously only the next-to-last region is processed; the
> others are somehow getting lost.
>
> Please create an issue on jira [1] and attach your description.
>
> Thanks in advance
> Andreas Lehmkühler
>
> [1] https://issues.apache.org/jira/browse/PDFBOX
>
> ----- Original Message --------
>
> Subject: Re: PDFTextStripperByArea extracts text only from 1 region, despite
> several regions being defined
> Sent: Tue, 21 Jul 2009
> From: Ismael Hasan<[email protected]>
>
>> Hi Andreas,
>>
>> thanks for answering. I am now loading the document as you suggested.
>>
>> I will rewrite the question, since I have been doing some testing:  I
>> understand that the code in my first message should return the same
>> result for both of the regions, since they are defined with the same
>> parameters, but it does not. Testing it with more regions, it only
>> retrieves the text from one:
>>
>> I divide a page in 4 regions and add the regions to the stripper in
>> the following order:
>> 1-upper left, 2-upper right, 3-lower left, 4-lower right.
>>
>> After calling the "extractRegions" function, only the text for the third
>> one is retrieved.
>> If I do not add the third region, only the text for region 2 is retrieved.
>>
>>
>> I think this behaviour is strange, and it may not be what is expected. In
>> the example you suggested,
>> 'org.apache.pdfbox.examples.util.ExtractTextByArea', only one region
>> is defined, so maybe the tool is not intended to extract several
>> regions at a time.
>>
>> Any answer will be appreciated,
>>
>> thanks in advance,
>>
>> Ismael
>>
>> 2009/7/21 Andreas Lehmkühler <[email protected]>:
>> > Hi Ismael,
>> >
>> > first of all, try to load the PDF with PDDocument doc =
>> > PDDocument.load(file). You don't have to parse the doc on your own. See
>> > org.apache.pdfbox.examples.util.ExtractTextByArea as an example for
>> > extracting text areas.
>> > Why do you try to extract the same region twice? Wouldn't it be easier to
>> > just copy the result string?
>> >
>> > BR
>> > Andreas Lehmkühler
>> >
>> > ----- Original Message --------
>> >
>> > Subject: PDFTextStripperByArea extracts text only from 1 region, despite
>> > several regions being defined
>> > Sent: Tue, 21 Jul 2009
>> > From: Ismael Hasan<[email protected]>
>> >
>> >> Hello. I have a problem with the class
>> >> "org.apache.pdfbox.util.PDFTextStripperByArea":
>> >>
>> >> If I add several regions to this class to extract text from, the text
>> >> is retrieved from only one of them. The example I built creates two
>> >> regions with the same values (with different names), adds them to the
>> >> text stripper, and calls the "extractRegions" function.
>> >>
>> >> I would really appreciate it if someone could tell me what I am doing
>> >> wrong, or whether this is a bug in the tool.
>> >>
>> >> Please see the code at the end of this message, which reproduces the
>> >> issue; the final result buffers (localResult1 and localResult2) have
>> >> different content (one of them is empty). If you need a PDF document
>> >> to reproduce this, please ask me for it.
>> >>
>> >> Thanks in advance,
>> >> Ismael
>> >>
>> >>
>> >>
>> >> // Opening the document and getting the page
>> >> PDFParser parser = new PDFParser(new ByteArrayInputStream(documentInBytes));
>> >> parser.parse();
>> >> PDDocument doc = parser.getPDDocument();
>> >> PDPage page = (PDPage) doc.getDocumentCatalog().getAllPages().get(pageNumber);
>> >>
>> >> // Creating the stripper
>> >> PDFTextStripperByArea areaStripper = new PDFTextStripperByArea();
>> >>
>> >> // Creation and addition of the regions to the stripper
>> >> Rectangle2D rectangle = new Rectangle2D.Float();
>> >> rectangle.setRect(0, 0, 500, 100);
>> >> areaStripper.addRegion("1", rectangle);
>> >>
>> >> Rectangle2D rectangle2 = new Rectangle2D.Float();
>> >> rectangle2.setRect(0, 0, 500, 100);
>> >> areaStripper.addRegion("2", rectangle2);
>> >>
>> >> // Extracting the regions and getting the results
>> >> areaStripper.extractRegions(page);
>> >> String localResult1 = areaStripper.getTextForRegion("1");
>> >> String localResult2 = areaStripper.getTextForRegion("2");
>> >>
>> >
>> > --- End of original message ----
>> >
>> >
>>
>
> --- End of original message ----
>
>
