Hi, I want to ask some questions about the internal design of poppler-glib.
----------------------------------------------------------

Recently Albert accepted my proposal to extend the interface of TextOutputDev to give access to the raw text (layout/position info is not considered). At present only poppler-qt4 can use the extended API, but I don't want to restrict it to poppler-qt4. I'm trying to extend poppler-glib (and poppler-cpp next) to use the extended API.

Looking at the internal code that extracts text from a PDF, I found a difference between poppler-qt4 and poppler-glib. Adding a few new APIs to enable/disable raw-order mode is not sufficient for poppler-glib to access the raw text.

poppler-qt4
-----------
To get the text content from a page object, Poppler::Page::text() is invoked. In Poppler::Page::text(), a TextOutputDev is created, TextOutputDev::displayPageSlice() is invoked with the selection area, and TextOutputDev::getText() is invoked to obtain a GooString. Finally, the GooString is converted to a QString object and returned to the client.

poppler-glib
------------
To get the text content from a page object, TextOutputDev::getSelectionText() is used. It dumps the strings collected by a TextSelectionVisitor object. TextSelectionVisitor defines three methods to consume the text: visitBlock(), visitLine() and visitWord(). But only the visitLine() method is implemented. Because a "line" is defined by the analysis of the text layout, there are no lines in raw order.

---------------------------------------------------------------

An in-depth modification is required to keep the procedure similar between poppler-glib's physical-layout mode and poppler-glib's raw-order mode. Because no lines can be defined in raw-order mode, using visitLine() for raw-order mode would not be a good idea. Adding an implementation of visitWord() would be better.

Are there any features bound to the properties obtained by visitLine()? I don't want to plant a mine that blows up applications which assume that all text is collected by visitLine().
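
To make the visitWord() idea concrete, here is a minimal sketch of the two collection paths. It is only an illustration under my own assumptions: SketchWord, SketchVisitor, SketchDumper, dumpRawOrder() and dumpLayout() are hypothetical stand-ins, not poppler's real classes, and the method signatures are simplified and do not match those of TextSelectionVisitor.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a word extracted from a page (NOT poppler's TextWord).
struct SketchWord {
    std::string text;
    bool spaceAfter;  // whether a space should follow this word
};

// Hypothetical, simplified visitor interface (NOT poppler's TextSelectionVisitor).
class SketchVisitor {
public:
    virtual ~SketchVisitor() = default;
    // Layout mode: consume the words [begin, end) of one analysed line.
    virtual void visitLine(const std::vector<SketchWord>& words,
                           std::size_t begin, std::size_t end) = 0;
    // Raw-order mode: consume a single word; no line information is needed.
    virtual void visitWord(const SketchWord& word) = 0;
};

// Collects visited text into one string.
class SketchDumper : public SketchVisitor {
public:
    void visitLine(const std::vector<SketchWord>& words,
                   std::size_t begin, std::size_t end) override {
        assert(begin <= end && end <= words.size());
        for (std::size_t i = begin; i < end; ++i) {
            text_ += words[i].text;
            if (i + 1 < end && words[i].spaceAfter) {
                text_ += ' ';
            }
        }
        text_ += '\n';  // a line break exists only because layout analysis found a line
    }

    void visitWord(const SketchWord& word) override {
        text_ += word.text;
        if (word.spaceAfter) {
            text_ += ' ';
        }
    }

    const std::string& text() const { return text_; }

private:
    std::string text_;
};

// Raw-order path: words are visited one by one in content-stream order.
std::string dumpRawOrder(const std::vector<SketchWord>& words) {
    SketchDumper dumper;
    for (const SketchWord& word : words) {
        dumper.visitWord(word);
    }
    return dumper.text();
}

// Physical-layout path: every analysed line is visited as a unit.
std::string dumpLayout(const std::vector<std::vector<SketchWord>>& lines) {
    SketchDumper dumper;
    for (const std::vector<SketchWord>& line : lines) {
        dumper.visitLine(line, 0, line.size());
    }
    return dumper.text();
}
```

In this sketch the raw-order path never calls visitLine(), so a client that relies on per-line properties (for example the trailing newline) would see different output in the two modes. That is the kind of hidden dependency I am asking about.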
Regards,
mpsuzuki
