The current poppler-dump (a test tool for the cpp frontend) has no option to
demonstrate the per-character bounding box feature.
The attached patch adds an option that demonstrates it. (I am not saying "this
is ready to use, please use it"; I want to understand your request and whether
existing features could cover part of it.)

The patched poppler-dump can work like this:

$ cpp/tests/poppler-dump --show-glyph-list test.pdf
Page 1/1:
---
[Please] @ ( x=72 y=72.624 w=61.32 h=21.6 )
        [0] @ ( x=72 y=72.624 w=13.344 h=21.6 )
        [1] @ ( x=85.344 y=72.624 w=6.672 h=21.6 )
        [2] @ ( x=92.016 y=72.624 w=10.656 h=21.6 )
        [3] @ ( x=102.672 y=72.624 w=10.656 h=21.6 )
        [4] @ ( x=113.328 y=72.624 w=9.336 h=21.6 )
        [5] @ ( x=122.664 y=72.624 w=10.656 h=21.6 )
[wait...] @ ( x=139.32 y=72.624 w=59.328 h=21.6 )
        [0] @ ( x=139.32 y=72.624 w=17.328 h=21.6 )
        [1] @ ( x=156.648 y=72.624 w=10.656 h=21.6 )
        [2] @ ( x=167.304 y=72.624 w=6.672 h=21.6 )
        [3] @ ( x=173.976 y=72.624 w=6.672 h=21.6 )
        [4] @ ( x=180.648 y=72.624 w=6 h=21.6 )
        [5] @ ( x=186.648 y=72.624 w=6 h=21.6 )
        [6] @ ( x=192.648 y=72.624 w=6 h=21.6 )
[If] @ ( x=72 y=112.428 w=7.992 h=10.8 )
        [0] @ ( x=72 y=112.428 w=3.996 h=10.8 )
        [1] @ ( x=75.996 y=112.428 w=3.996 h=10.8 )
[this] @ ( x=82.992 y=112.428 w=17.34 h=10.8 )
        [0] @ ( x=82.992 y=112.428 w=3.336 h=10.8 )
        [1] @ ( x=86.328 y=112.428 w=6 h=10.8 )
        [2] @ ( x=92.328 y=112.428 w=3.336 h=10.8 )
        [3] @ ( x=95.664 y=112.428 w=4.668 h=10.8 )
...
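
As a quick sanity check on the dump above: the glyph boxes of a word should
tile its word box, so their union should equal the word's bbox. Below is a
minimal, poppler-free sketch of that check. Rect is a hypothetical stand-in
for poppler::rectf, and union_of / nearly_equal are illustrative helpers that
are not part of the patch or of poppler; the numbers are the "Please" rows
copied from the output above.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical stand-in for poppler::rectf, so the sketch builds without poppler.
struct Rect {
    double x, y, w, h;
};

// Smallest rectangle enclosing all glyph boxes. Requires at least one glyph.
Rect union_of(const std::vector<Rect> &glyphs)
{
    assert(!glyphs.empty());
    double x0 = glyphs[0].x, y0 = glyphs[0].y;
    double x1 = glyphs[0].x + glyphs[0].w, y1 = glyphs[0].y + glyphs[0].h;
    for (const Rect &g : glyphs) {
        x0 = std::min(x0, g.x);
        y0 = std::min(y0, g.y);
        x1 = std::max(x1, g.x + g.w);
        y1 = std::max(y1, g.y + g.h);
    }
    return Rect{x0, y0, x1 - x0, y1 - y0};
}

// True when the two rectangles agree within eps on every component.
bool nearly_equal(const Rect &a, const Rect &b, double eps = 1e-3)
{
    return std::fabs(a.x - b.x) < eps && std::fabs(a.y - b.y) < eps
        && std::fabs(a.w - b.w) < eps && std::fabs(a.h - b.h) < eps;
}

// Word box and glyph boxes for "Please", copied from the dump above.
const Rect please_word = {72, 72.624, 61.32, 21.6};
const std::vector<Rect> please_glyphs = {
    {72,      72.624, 13.344, 21.6},
    {85.344,  72.624,  6.672, 21.6},
    {92.016,  72.624, 10.656, 21.6},
    {102.672, 72.624, 10.656, 21.6},
    {113.328, 72.624,  9.336, 21.6},
    {122.664, 72.624, 10.656, 21.6},
};
```

Calling nearly_equal(union_of(please_glyphs), please_word) from a small main()
should hold for these rows, since the six glyph widths add up to the word
width of 61.32. The tolerance of 1e-3 is an arbitrary choice for the three
decimal places printed in the dump.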

Regards,
mpsuzuki

suzuki toshiya wrote:
> Dear obsidian,
> 
> Too many posts about similar issues :-)
> I'm not sure whether the poppler maintainers are interested in enhancing
> pdftotext, but Jeroen and I have recently been working on the cpp frontend
> to add similar features.
> 
> In the latest version of poppler, the cpp frontend can retrieve the list of
> words with their bounding boxes, and it can also retrieve the bounding box
> of each glyph in a word.
> 
> --
> 
> I also proposed a patch to retrieve the font family and point size:
> https://lists.freedesktop.org/archives/poppler/2018-April/013035.html
> 
> It may still be waiting for the maintainers' review. The discussion and
> results can be found here:
> https://github.com/ropensci/pdftools/issues/29
> 
> --
> 
>> - style, i.e. none, bold, italic
> 
> If the document producer has a bold font and uses it in the document, such as
> Helvetica-Bold, it can be identified from the family name.
> But if the document producer has no bold font and lets the word processor
> synthesize an emboldened font, it would be difficult for the PDF renderer to
> recognize it as bold, because the emboldening is done by drawing the same
> glyph again with a subtle shift. Simple PDF renderers would be unable to
> distinguish "a normal font, layered" from "an emboldened font".
> 
> Regards,
> mpsuzuki
> 
> obsidian . wrote:
>> I'm using "pdftotext -bbox file.pdf" to convert a pdf file into html.
>>
>> Here's a sample line from the output:
>>     <word xMin="359.852025" yMin="462.548936" xMax="365.689478" yMax="467.681498">foo</word>
>>
>> Is there a way to get font information for every word like:
>> - font family, e.g. Verdana
>> - style, i.e. none, bold, italic
>> - size, e.g. font size 9
>>
>> I'm using pdftotext version 0.55.0 on Windows.
>>
>>
> 
> _______________________________________________
> poppler mailing list
> [email protected]
> https://lists.freedesktop.org/mailman/listinfo/poppler
diff --git a/cpp/tests/poppler-dump.cpp b/cpp/tests/poppler-dump.cpp
index a1a6825..a751480 100644
--- a/cpp/tests/poppler-dump.cpp
+++ b/cpp/tests/poppler-dump.cpp
@@ -52,6 +52,7 @@ bool show_pages = false;
 bool show_help = false;
 char show_text[32];
 bool show_text_list = false;
+bool show_glyph_list = false;
 poppler::page::text_layout_enum show_text_layout = poppler::page::physical_layout;
 
 static const ArgDesc the_args[] = {
@@ -75,6 +76,8 @@ static const ArgDesc the_args[] = {
       "show text (physical|raw) extracted from all pages" },
     { "--show-text-list",      argFlag, &show_text_list,       0,
       "show text list (experimental)" },
+    { "--show-glyph-list",     argFlag, &show_glyph_list,      0,
+      "show glyph list in each word (experimental)" },
     { "-h",                    argFlag,  &show_help,           0,
       "print usage information" },
     { "--help",                argFlag,  &show_help,           0,
@@ -348,6 +351,36 @@ static void print_page_text_list(poppler::page *p)
     std::cout << "---" << std::endl;
 }
 
+static void print_page_glyph_list(poppler::page *p)
+{
+    if (!p) {
+        std::cout << std::setw(out_width) << "Broken Page. Could not be parsed" << std::endl;
+        std::cout << std::endl;
+        return;
+    }
+    auto text_list = p->text_list();
+
+    std::cout << "---" << std::endl;
+    for (size_t i = 0; i < text_list.size(); i ++) {
+        poppler::rectf bbox = text_list[i].bbox();
+        poppler::ustring ustr = text_list[i].text();
+        std::cout << "[" << ustr << "] @ ";
+        std::cout << "( x=" << bbox.x() << " y=" << bbox.y() << " w=" << bbox.width() << " h=" << bbox.height() << " )";
+        std::cout << std::endl;
+
+        for (size_t j = 0; ; j++) {
+            poppler::rectf c_bbox = text_list[i].char_bbox(j);
+            if (c_bbox.x() == 0 && c_bbox.y() == 0 && c_bbox.width() == 0 && c_bbox.height() == 0)
+                break;
+            std::cout << "\t[" << j << "] @ ";
+            std::cout << "( x=" << c_bbox.x() << " y=" << c_bbox.y() << " w=" << c_bbox.width() << " h=" << c_bbox.height() << " )";
+            std::cout << std::endl;
+        }
+
+    }
+    std::cout << "---" << std::endl;
+}
+
 
 int main(int argc, char *argv[])
 {
@@ -432,6 +465,15 @@ int main(int argc, char *argv[])
             print_page_text_list(p.get());
         }
     }
+    if (show_glyph_list) {
+        const int pages = doc->pages();
+        for (int i = 0; i < pages; ++i) {
+            std::cout << "Page " << (i + 1) << "/" << pages << ":" << std::endl;
+            std::unique_ptr<poppler::page> p(doc->create_page(i));
+            print_page_glyph_list(p.get());
+        }
+    }
+
 
     return 0;
 }