Ah! I apologize. In my old code I had been calling processStream on a single PDPage, not processPage. Sorry, that was my mix-up.
I think I am good now using the setPage(PDPage) override for what I was looking to do.

Cheers,

Britt

Britt Fitch
Wired Informatics
265 Franklin St Ste 1702
Boston, MA 02110
http://wiredinformatics.com
[email protected]

> On Dec 4, 2015, at 3:53 PM, Tilman Hausherr <[email protected]> wrote:
>
> On 04.12.2015 at 21:39, britt fitch wrote:
>> Thanks very much for the quick replies!
>>
>> I think setting startPage & endPage will make it so you correctly only
>> extract the pages you want, but on every extraction it will iterate over
>> all pages first.
>>
>> For example, if you have a 100 page document and want to extract page 2 &
>> page 90, you will iterate over all 100 pages and process page 2, then
>> iterate over all 100 pages and process page 90.
>>
>> The 1.8 version allowed you to pass a single page to be processed. I’m
>> curious if that functionality was removed because of an issue or if it was
>> just a bug.
>
> Really? I looked at processPage(), and it does use currentPageNo, and I
> don't see a way to set that one from outside.
>
> On a second look, I think I understand what you mean: processPages() uses a
> list of pages, so you would set your own list. But this would mean trouble
> if you had set other variables.
> I assume this was changed in 2.0 as part of the page tree refactoring.
>
> Btw, this looping does indeed look weird, but I doubt you'll lose any time.
> The text extraction by itself does much more: it needs to loop through
> every glyph in the page you're extracting.
>
> Tilman
>
>>
>> It looks like I can get around this a bit by overriding startPage(PDPage)
>> and endPage(PDPage) though.
>>
>> Thanks again, I really appreciate all your feedback.
>>
>> Cheers,
>>
>> Britt
>>
>>> On Dec 4, 2015, at 3:07 PM, Tilman Hausherr <[email protected]> wrote:
>>>
>>> On 04.12.2015 at 20:56, britt fitch wrote:
>>>> Awesome, thanks. That takes care of #1 & 2.
>>>>
>>>> For #3, is the check on currentPageNo necessary?
>>>> Right now processPage must be called from processPages or nothing
>>>> happens. This has a negative effect for cases like mine, where I want to
>>>> override processTextPosition and handle different pages, or even if you
>>>> only want to extract data from particular pages.
>>>
>>> You can set the start and end page through the setters setStartPage() and
>>> setEndPage(). That's the official way to do it.
>>>
>>> Tilman
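
For readers following the thread, the iteration cost discussed above (a 100 page document, extracting page 2 and then page 90) can be illustrated with a small toy model. This is only a sketch of the behavior as described in the messages, not PDFBox source code: the class PageLoopModel and its fields are hypothetical, and only the names setStartPage, setEndPage, processPages and currentPageNo are borrowed from the discussion.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of a range-filtered page loop; hypothetical, not PDFBox code. */
public class PageLoopModel {
    /** Page numbers (1-based) that fell inside the range and were "extracted". */
    final List<Integer> processed = new ArrayList<>();
    /** Total number of pages the loop stepped over, extracted or not. */
    int visited = 0;

    private int startPage = 1;
    private int endPage = Integer.MAX_VALUE;

    void setStartPage(int startPage) { this.startPage = startPage; }
    void setEndPage(int endPage) { this.endPage = endPage; }

    /** Walks every page and extracts only those inside [startPage, endPage]. */
    void processPages(int pageCount) {
        for (int currentPageNo = 1; currentPageNo <= pageCount; currentPageNo++) {
            visited++;
            if (currentPageNo >= startPage && currentPageNo <= endPage) {
                processed.add(currentPageNo);
            }
        }
    }

    public static void main(String[] args) {
        PageLoopModel model = new PageLoopModel();
        // Two separate extractions, one per wanted page, as in the example.
        for (int wanted : new int[] {2, 90}) {
            model.setStartPage(wanted);
            model.setEndPage(wanted);
            model.processPages(100);
        }
        // prints: extracted=[2, 90] visited=200
        System.out.println("extracted=" + model.processed + " visited=" + model.visited);
    }
}
```

In this model the loop visits 200 pages to extract 2, which matches the concern raised in the thread. The skipped pages only cost an integer comparison each, which is consistent with the point that the per-glyph work on the extracted pages is likely to dominate.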

