Hi Ismael, thanks for your detailed description. I ran a similar test and hit the same issue. Apparently only the next-to-last region is processed; the others somehow get lost.
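For the report, a reproduction along the following lines might be useful. It is only a sketch under stated assumptions: the helper `quadrants`, the region names and the 600 x 800 page size are made up for illustration, and the origin is assumed to be the upper left corner, which should be checked against how the stripper interprets coordinates. The PDFBox calls appear only in comments, using the method names from the code quoted further down.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

final class QuadrantRegions {

    /**
     * Splits a page of the given size into four equal regions, in the order
     * upper left, upper right, lower left, lower right.
     * Assumption: the origin is the upper left corner, y grows downwards.
     */
    public static Map<String, Rectangle2D> quadrants(double pageWidth, double pageHeight) {
        double w = pageWidth / 2;
        double h = pageHeight / 2;
        // LinkedHashMap keeps insertion order, so the regions can be
        // added to the stripper in a known, repeatable order.
        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        regions.put("upperLeft", new Rectangle2D.Double(0, 0, w, h));
        regions.put("upperRight", new Rectangle2D.Double(w, 0, w, h));
        regions.put("lowerLeft", new Rectangle2D.Double(0, h, w, h));
        regions.put("lowerRight", new Rectangle2D.Double(w, h, w, h));
        return regions;
    }

    public static void main(String[] args) {
        // 600 x 800 is an arbitrary example size, not taken from a real document.
        Map<String, Rectangle2D> regions = quadrants(600, 800);
        for (Map.Entry<String, Rectangle2D> e : regions.entrySet()) {
            // With PDFBox one would add each region here:
            //   areaStripper.addRegion(e.getKey(), e.getValue());
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
        // Then: areaStripper.extractRegions(page);
        // and check areaStripper.getTextForRegion(name) for every region name.
    }
}
```

If every region contains text on the page, each `getTextForRegion` call would be expected to return something non-empty; in the behaviour described in this thread, only one of them does.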
Please create an issue on jira [1] and attach your description.

Thanks in advance
Andreas Lehmkühler

[1] https://issues.apache.org/jira/browse/PDFBOX

----- Original message -----
Subject: Re: PDFTextStripperByArea extracts text only from 1 region, despite several regions being defined
Sent: Tue, 21 Jul 2009
From: Ismael Hasan<[email protected]>

> Hi Andreas,
>
> thanks for answering. I am now loading the document as you suggested.
>
> I will rewrite the question, since I have been doing some testing: I
> understand that the code in my first message should return the same
> result for both of the regions, since they are defined with the same
> parameters, but it does not. Testing it with more regions, it only
> retrieves the text from one:
>
> I divide a page in 4 regions and add the regions to the stripper in
> the following order:
> 1-upper left, 2-upper right, 3-lower left, 4-lower right.
>
> After calling the "extractRegions" function, only the text for the third
> one is retrieved.
> If I do not add the third region, only the text for region 2 is retrieved.
>
> I think this behaviour is strange, and it may not be what is expected. In
> the example you suggested,
> 'org.apache.pdfbox.examples.util.ExtractTextByArea', only one region
> is defined, so maybe the tool is not intended to extract several
> regions at a time.
>
> Any answer will be appreciated,
>
> thanks in advance,
>
> Ismael
>
> 2009/7/21 Andreas Lehmkühler <[email protected]>:
> > Hi Ismael,
> >
> > first of all try to load the pdf with PDDocument doc =
> > PDDocument.load(file). You don't have to parse the doc on your own. See
> > org.apache.pdfbox.examples.util.ExtractTextByArea as an example for
> > extracting text areas.
> > Why do you try to extract the same region twice? Wouldn't it be easier to
> > just copy the result string?
> >
> > BR
> > Andreas Lehmkühler
> >
> > ----- Original message -----
> >
> > Subject: PDFTextStripperByArea extracts text only from 1 region, despite
> > several regions being defined
> > Sent: Tue, 21 Jul 2009
> > From: Ismael Hasan<[email protected]>
> >
> >> Hello. I have a problem with the class
> >> "org.apache.pdfbox.util.PDFTextStripperByArea":
> >>
> >> If I add several regions to this class to extract the text from, it is
> >> only retrieved from one of them. The example I built creates two
> >> regions with the same values (with different names), adds them to the
> >> text stripper, and uses the "extractRegions" function.
> >>
> >> I would really appreciate it if someone could tell me what I am doing
> >> wrong, or if this is a bug in the tool.
> >>
> >> Please see at the end of the message the code with which I get this
> >> issue; the final result buffers (localResult1 and localResult2) have
> >> different content (one of them is empty). If you need a PDF document
> >> to reproduce this, please ask me for it.
> >>
> >> Thanks in advance,
> >> Ismael
> >>
> >> // Opening the document and getting the page
> >> PDFParser parser = new PDFParser(new ByteArrayInputStream(documentInBytes));
> >> parser.parse();
> >> PDDocument doc = parser.getPDDocument();
> >> PDPage page = (PDPage) doc.getDocumentCatalog().getAllPages().get(pageNumber);
> >>
> >> // Creating the stripper
> >> PDFTextStripperByArea areaStripper = new PDFTextStripperByArea();
> >>
> >> // Creation and addition of the regions to the stripper
> >> Rectangle2D rectangle = new Rectangle2D.Float();
> >> rectangle.setRect(0, 0, 500, 100);
> >> areaStripper.addRegion("1", rectangle);
> >>
> >> Rectangle2D rectangle2 = new Rectangle2D.Float();
> >> rectangle2.setRect(0, 0, 500, 100);
> >> areaStripper.addRegion("2", rectangle2);
> >>
> >> // Extracting the regions and getting the results
> >> areaStripper.extractRegions(page);
> >> String localResult1 = areaStripper.getTextForRegion("1");
> >> String localResult2 = areaStripper.getTextForRegion("2");
> >>
> >
> > --- End of original message ---
>
--- End of original message ---
