Excerpts from Daniel Garcia's message of Wed Jan 05 14:11:05 +0100 2011:
> On Wed, Sep 22, 2010 at 02:11:31PM +0200, carlosgc wrote:
> > Excerpts from suzuki toshiya's message of Wed Sep 15 12:16:22 +0200 2010:
> > > Hi,
> >
> > Hi,
> >
> > > Attached patches are the introduction of new API to access raw text.
> > > I wish some maintainer of poppler-glib can review it.
> >
> > Yes, sorry for the delay.
> >
> > > poppler-0.15.0_glib-lib.diff
> > >       patch to declare new function and its implementation
> > >
> >
> > I prefer poppler_page_get_raw_text(), rather than
> > poppler_page_get_selected_raw_text(), and always return the text of
> > the whole page. I don't see why you might want the selected text in
> > raw order.
>
> I've made that function. Here's the patch.

Thanks for the patch! Comments inline below.

> From 389d49e3413ce09601b574308bd6bbd46044e6b3 Mon Sep 17 00:00:00 2001
> From: danigm <[email protected]>
> Date: Wed, 5 Jan 2011 14:07:59 +0100
> Subject: [PATCH] [glib] Added poppler_page_get_raw_text function
>
> ---
>  glib/poppler-page.cc |   54 +++++++++++++++++++++++++++++++++++++++++++++++++-
>  glib/poppler-page.h  |    1 +
>  2 files changed, 54 insertions(+), 1 deletions(-)
>
> diff --git a/glib/poppler-page.cc b/glib/poppler-page.cc
> index a8e6b2d..8966f7e 100644
> --- a/glib/poppler-page.cc
> +++ b/glib/poppler-page.cc
> @@ -2117,7 +2117,7 @@ poppler_page_get_crop_box (PopplerPage *page, PopplerRectangle *rect)
>   * This array must be freed with g_free () when done.
>   *
>   * The position in the array represents an offset in the text returned by
> - * poppler_page_get_text()
> + * poppler_page_get_raw_text()

Why? If they are compatible, it is because they return the same text. I
guess get_text_layout() wants the text in reading order.

>   * Return value: %TRUE if the page contains text, %FALSE otherwise
>   *
> @@ -2200,3 +2200,55 @@ poppler_page_get_text_layout (PopplerPage *page,
>
>    return TRUE;
>  }
> +
> +/**
> + * poppler_page_get_raw_text:
> + * @page: A #PopplerPage

You should explain here exactly what raw_text() is, and why it is
different from get_text().

> + * Return value: a pointer to the text page in raw order
> + *               as a string

This is new API; add Since: 0.18 here and remember to add the symbol to
glib/reference/poppler-sections.txt.

> + **/
> +char *
> +poppler_page_get_raw_text (PopplerPage *page)
> +{
> +  TextPage *text;
> +  TextWordList *wordlist;
> +  TextWord *word, *nextword;
> +  char *craw_text;
> +  GooString *raw_text;
> +  int i = 0;
> +
> +  raw_text = new GooString();
> +
> +  g_return_val_if_fail (POPPLER_IS_PAGE (page), FALSE);

s/FALSE/NULL/

> +  text = poppler_page_get_text_page (page);
> +  wordlist = text->makeWordList (gFalse);
> +
> +  if (wordlist->getLength () <= 0)
> +    return NULL;

You are leaking wordlist and raw_text in this early return. Delete the
wordlist when length <= 0 and create raw_text after the if.

> +  for (i = 0; i < wordlist->getLength (); i++)
> +    {
> +      word = wordlist->get (i);
> +      raw_text->append (word->getText ());

word->getText() returns a newly allocated GooString and
GooString::append() copies the given string, so you are leaking the
GooString here.

> +      nextword = word->getNext ();
> +      if (nextword)
> +        {
> +          raw_text->append (' ');
> +        }
> +      else
> +        {
> +          raw_text->append ('\n');
> +        }

Don't use braces for single-line clauses. Here you could use something
like

raw_text->append (nextword ? ' ' : '\n');

> +    }
> +
> +  craw_text = g_strdup (raw_text->getCString ());

We can avoid this g_strdup() by using a GString instead of a GooString.

GString *raw_text = g_string_new (NULL);

raw_text = g_string_append_len (raw_text, wordText->getCString(), wordText->getLength());
raw_text = g_string_append_c (raw_text, nextword ? ' ' : '\n');

craw_text = g_string_free (raw_text, FALSE);

> +  delete wordlist;
> +  delete raw_text;
> +
> +  return craw_text;
> +}
> diff --git a/glib/poppler-page.h b/glib/poppler-page.h
> index d40c0ee..333cb23 100644
> --- a/glib/poppler-page.h
> +++ b/glib/poppler-page.h
> @@ -128,6 +128,7 @@ void                   poppler_page_get_crop_box         (PopplerPage        *page,
>  gboolean               poppler_page_get_text_layout      (PopplerPage        *page,
>                                                            PopplerRectangle  **rectangles,
>                                                            guint              *n_rectangles);
> +char                  *poppler_page_get_raw_text         (PopplerPage        *page);
>
>  /* A rectangle on a page, with coordinates in PDF points. */
>  #define POPPLER_TYPE_RECTANGLE             (poppler_rectangle_get_type ())

-- 
Carlos Garcia Campos
PGP key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x523E6462
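
For reference, a minimal self-contained sketch of the loop shape suggested in the comments above. It does not use poppler's or GLib's real classes: Word, its getText() and join_raw_text() are hypothetical stand-ins introduced only for illustration, and std::string takes the place of GString. The sketch assumes, as the review describes, that getText() hands back a newly allocated string owned by the caller; it shows that temporary being released on every iteration and the separator being chosen with a single conditional expression.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a text word. getText() returns a newly
// allocated string that the caller must free, mirroring the ownership
// rule described in the review (this is NOT poppler's TextWord).
struct Word {
    std::string text;
    const Word *next;  // next word on the same line, or nullptr

    explicit Word(std::string t, const Word *n = nullptr)
        : text(std::move(t)), next(n) {}

    std::string *getText() const { return new std::string(text); }
};

// Joins the words in list order: a space after a word that has a
// successor on its line, a newline after the last word of a line.
// An empty list yields an empty string (the real function would
// return NULL in that case).
std::string join_raw_text(const std::vector<const Word *> &words) {
    std::string out;
    for (const Word *w : words) {
        assert(w != nullptr);
        // Own the temporary so it is freed at the end of each iteration.
        std::unique_ptr<std::string> t(w->getText());
        out.append(*t);
        out.push_back(w->next ? ' ' : '\n');
    }
    return out;
}
```

In the real function the same ownership rule would apply to the GooString returned by word->getText(): delete it after its bytes have been appended.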
