The issue can be found at https://issues.apache.org/jira/browse/PDFBOX-495
2009/7/23 Andreas Lehmkühler <[email protected]>:
> Hi Ismael,
>
> thanks for your detailed description. I've made a similar test and ran into
> the same issue. Obviously only the next-to-last region is processed; the
> others somehow get lost.
>
> Please create an issue on JIRA [1] and attach your description.
>
> Thanks in advance
> Andreas Lehmkühler
>
> [1] https://issues.apache.org/jira/browse/PDFBOX
>
> ----- Original message -----
>
> Subject: Re: PDFTextStripperByArea extracts text only from 1 region, despite
> several regions being defined
> Sent: Tue, 21 Jul 2009
> From: Ismael Hasan <[email protected]>
>
>> Hi Andreas,
>>
>> thanks for answering. I am now loading the document as you suggested.
>>
>> I will rewrite the question, since I have been doing some testing: I
>> understand that the code in my first message should return the same
>> result for both of the regions, since they are defined with the same
>> parameters, but it does not. Testing it with more regions, it only
>> retrieves the text from one:
>>
>> I divide a page into 4 regions and add the regions to the stripper in
>> the following order:
>> 1-upper left, 2-upper right, 3-lower left, 4-lower right.
>>
>> After calling the "extractRegions" function, only the text for the third
>> one is retrieved.
>> If I do not add the third region, only the text for region 2 is retrieved.
>>
>> I think this behaviour is strange, and it may not be what is expected. In
>> the example you suggested,
>> 'org.apache.pdfbox.examples.util.ExtractTextByArea', only one region
>> is defined, so maybe the tool is not intended to extract several
>> regions at a time.
>>
>> Any answer will be appreciated,
>>
>> thanks in advance,
>>
>> Ismael
>>
>> 2009/7/21 Andreas Lehmkühler <[email protected]>:
>> > Hi Ismael,
>> >
>> > first of all try to load the pdf with PDDocument doc = PDDocument.load(file).
>> > You don't have to parse the doc on your own.
>> > See org.apache.pdfbox.examples.util.ExtractTextByArea as an example for
>> > extracting text areas.
>> > Why do you try to extract the same region twice? Wouldn't it be easier to
>> > just copy the result string?
>> >
>> > BR
>> > Andreas Lehmkühler
>> >
>> > ----- Original message -----
>> >
>> > Subject: PDFTextStripperByArea extracts text only from 1 region, despite
>> > several regions being defined
>> > Sent: Tue, 21 Jul 2009
>> > From: Ismael Hasan <[email protected]>
>> >
>> >> Hello. I have a problem with the class
>> >> "org.apache.pdfbox.util.PDFTextStripperByArea":
>> >>
>> >> If I add several regions to this class to extract the text from, it is
>> >> only retrieved from one of them. The example I built creates two
>> >> regions with the same values (with different names), adds them to the
>> >> text stripper, and uses the "extractRegions" function.
>> >>
>> >> I would really appreciate it if someone could tell me what I am doing
>> >> wrong, or if this is a bug in the tool.
>> >>
>> >> Please see at the end of the message the code with which I get this
>> >> issue; the final result buffers (localResult1 and localResult2) have
>> >> different content (one of them is empty). If you need a PDF document
>> >> to reproduce this, please ask me for it.
>> >>
>> >> Thanks in advance,
>> >> Ismael
>> >>
>> >>
>> >> // Opening the document and getting the page
>> >> PDFParser parser = new PDFParser(new ByteArrayInputStream(documentInBytes));
>> >> parser.parse();
>> >> PDDocument doc = parser.getPDDocument();
>> >> PDPage page = (PDPage) doc.getDocumentCatalog().getAllPages().get(pageNumber);
>> >>
>> >> // Creating the stripper
>> >> PDFTextStripperByArea areaStripper = new PDFTextStripperByArea();
>> >>
>> >> // Creation and addition of the regions to the stripper
>> >> Rectangle2D rectangle = new Rectangle2D.Float();
>> >> rectangle.setRect(0, 0, 500, 100);
>> >> areaStripper.addRegion("1", rectangle);
>> >>
>> >> Rectangle2D rectangle2 = new Rectangle2D.Float();
>> >> rectangle2.setRect(0, 0, 500, 100);
>> >> areaStripper.addRegion("2", rectangle2);
>> >>
>> >> // Extracting the regions and getting the results
>> >> areaStripper.extractRegions(page);
>> >> String localResult1 = areaStripper.getTextForRegion("1");
>> >> String localResult2 = areaStripper.getTextForRegion("2");
>> >>
>> >
>> > --- End of original message ---
>> >
>
> --- End of original message ---
>
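
For anyone trying to reproduce the four-region case described in the thread, the split into upper left, upper right, lower left and lower right might be sketched as below. This is only an illustration: the class name QuadrantRegions and the method quadrants are hypothetical and not part of PDFBox, the 612 x 792 point page size is an assumed example, and the rectangles assume an origin at the top-left corner with y growing downward, which should be checked against the PDFBox version in use. The sketch only builds the rectangles; it does not work around the behaviour reported in the issue.

```java
import java.awt.geom.Rectangle2D;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper (not part of PDFBox): splits a page into four quadrants. */
final class QuadrantRegions {

    private QuadrantRegions() {}

    /**
     * Returns the four quadrants keyed "1".."4", in the order used in the thread.
     * Assumption: the origin is the top-left corner and y grows downward.
     */
    static Map<String, Rectangle2D> quadrants(double pageWidth, double pageHeight) {
        double w = pageWidth / 2;
        double h = pageHeight / 2;
        // LinkedHashMap keeps insertion order, so regions can be added in sequence.
        Map<String, Rectangle2D> regions = new LinkedHashMap<>();
        regions.put("1", new Rectangle2D.Double(0, 0, w, h)); // upper left
        regions.put("2", new Rectangle2D.Double(w, 0, w, h)); // upper right
        regions.put("3", new Rectangle2D.Double(0, h, w, h)); // lower left
        regions.put("4", new Rectangle2D.Double(w, h, w, h)); // lower right
        return regions;
    }

    public static void main(String[] args) {
        // 612 x 792 points is an assumed page size, used only for illustration.
        for (Map.Entry<String, Rectangle2D> e : quadrants(612, 792).entrySet()) {
            // With PDFBox, each entry would be passed to
            // areaStripper.addRegion(e.getKey(), e.getValue()) at this point.
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Each entry can then be handed to addRegion before calling extractRegions, exactly as in the quoted code, and getTextForRegion can be queried with the same keys.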
