Hi,

Recently I heard that some people want to retrieve the list of words from a PDF, as cpp's poppler::page::text_list() does, but with font information (e.g. the family name of the font).
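To make the request concrete, here is a minimal sketch of the kind of filtering such users seem to want. The names FontTextBox and filter_by_font are hypothetical, made up only for illustration; they are not part of poppler's cpp frontend, and the sketch assumes each text box could somehow carry the name of its font.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a text box that also carries font information.
// This is NOT poppler's API; it only illustrates the requested use case.
struct FontTextBox {
    std::string text;       // the word contained in the box
    std::string font_name;  // assumed font name, e.g. "Helvetica-Bold"
};

// Return only the boxes whose font name matches font_name exactly.
std::vector<FontTextBox> filter_by_font(const std::vector<FontTextBox>& boxes,
                                        const std::string& font_name) {
    std::vector<FontTextBox> out;
    for (const auto& box : boxes) {
        if (box.font_name == font_name) {
            out.push_back(box);
        }
    }
    return out;
}
```

With something like this, a caller could pass the list of boxes for a page together with a font name and get back only the words set in that font.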
Considering that office documents and academic articles often use different fonts for section titles and the main text, it is reasonable for people to expect something like "I want to retrieve the text boxes, but only the text boxes written in Helvetica-Bold". What is the right way to do this?

During the development of poppler::page::text_list(), I once tried to do this.

https://github.com/mpsuzuki/poppler/commit/8ce2556a62a90c034d7cea8b1dfd26715d03a8f0

(Note: this patch was written before the use of unique_ptr stabilized; more fixes are expected in the future.)

However, I feel it is slightly too big. Its changes are not only to the cpp frontend code, but also to poppler/FontInfo.{cc,h} and poppler/TextOutputDev.{cc,h}.

I want to ask a few questions...

Q-1) Does a request for text_box with font info fit poppler's scope? Or is there a better library to ask for such a feature?

Q-2) If this request fits poppler's scope, is enhancing the cpp frontend's poppler::page::text_list() the way to go? Or would a separate API for this purpose be better?

Q-3) My current patch modifies FontInfo and TextOutputDev of libpoppler itself. Is such a modification acceptable?

I would appreciate it if the maintainers could give some comments.

Regards,
mpsuzuki
