Hello again, so I obviously forgot the attachment... |:-\ Sorry for that.
Regards, Adam

On 05.05.2018 at 08:16, Adam Reichold wrote:
> Hello mpsuzuki,
>
> attached is a version of your patch with some inline comments.
>
> Generally speaking, I would say that some well-defined format like JSON
> or YAML would be preferable to the ad-hoc encoding?
>
> Best regards,
> Adam
>
> On 03.05.2018 at 13:50, suzuki toshiya wrote:
>> Current poppler-dump (a testing tool of cpp-frontend) has no option to
>> demonstrate the per-character bbox feature.
>> Attached patch adds the option to demonstrate it (I'm not saying "this is
>> ready to use, please use", I want to understand your request and whether
>> existing features could cover some part of your requests).
>>
>> The patched poppler-dump can work like this:
>>
>> $ cpp/tests/poppler-dump --show-glyph-list test.pdf
>> Page 1/1:
>> ---
>> [Please] @ ( x=72 y=72.624 w=61.32 h=21.6 )
>>     [0] @ ( x=72 y=72.624 w=13.344 h=21.6 )
>>     [1] @ ( x=85.344 y=72.624 w=6.672 h=21.6 )
>>     [2] @ ( x=92.016 y=72.624 w=10.656 h=21.6 )
>>     [3] @ ( x=102.672 y=72.624 w=10.656 h=21.6 )
>>     [4] @ ( x=113.328 y=72.624 w=9.336 h=21.6 )
>>     [5] @ ( x=122.664 y=72.624 w=10.656 h=21.6 )
>> [wait...] @ ( x=139.32 y=72.624 w=59.328 h=21.6 )
>>     [0] @ ( x=139.32 y=72.624 w=17.328 h=21.6 )
>>     [1] @ ( x=156.648 y=72.624 w=10.656 h=21.6 )
>>     [2] @ ( x=167.304 y=72.624 w=6.672 h=21.6 )
>>     [3] @ ( x=173.976 y=72.624 w=6.672 h=21.6 )
>>     [4] @ ( x=180.648 y=72.624 w=6 h=21.6 )
>>     [5] @ ( x=186.648 y=72.624 w=6 h=21.6 )
>>     [6] @ ( x=192.648 y=72.624 w=6 h=21.6 )
>> [If] @ ( x=72 y=112.428 w=7.992 h=10.8 )
>>     [0] @ ( x=72 y=112.428 w=3.996 h=10.8 )
>>     [1] @ ( x=75.996 y=112.428 w=3.996 h=10.8 )
>> [this] @ ( x=82.992 y=112.428 w=17.34 h=10.8 )
>>     [0] @ ( x=82.992 y=112.428 w=3.336 h=10.8 )
>>     [1] @ ( x=86.328 y=112.428 w=6 h=10.8 )
>>     [2] @ ( x=92.328 y=112.428 w=3.336 h=10.8 )
>>     [3] @ ( x=95.664 y=112.428 w=4.668 h=10.8 )
>> ...
>>
>> Regards,
>> mpsuzuki
>>
>> suzuki toshiya wrote:
>>> Dear obsidian,
>>>
>>> Too many posts about similar issues :-)
>>> I'm not sure whether the poppler maintainers are interested in the
>>> enhancement of pdftotext, but recently Jeroen and I were working with
>>> cpp-frontend to have similar features.
>>>
>>> In the latest version of poppler, cpp-frontend has a feature to retrieve
>>> the list of words with bounding boxes, and it can retrieve the bounding
>>> box for each glyph in the word.
>>>
>>> --
>>>
>>> Also, I proposed a patch to retrieve the font family and point size:
>>> https://lists.freedesktop.org/archives/poppler/2018-April/013035.html
>>>
>>> It might be waiting for the maintainers' review. The discussion and
>>> result can be found here:
>>> https://github.com/ropensci/pdftools/issues/29
>>>
>>> --
>>>
>>>> - style, i.e. none, bold, italic
>>>
>>> If the document producer has a bold font and used it in the document,
>>> such as Helvetica-Bold, it would be found by the family name.
>>> But if the document producer has no bold font and lets the word processor
>>> software synthesize the emboldened font, it would be difficult for the
>>> PDF renderer to recognize it as an emboldened font, because the
>>> emboldening is done by showing the same glyph with subtle shifting.
>>> Simple PDF renderers would be unable to distinguish "normal font but
>>> layered" from "emboldened font".
>>>
>>> Regards,
>>> mpsuzuki
>>>
>>> obsidian . wrote:
>>>> I'm using "pdftotext -bbox file.pdf" to convert a pdf file into html.
>>>>
>>>> Here's a sample line from the output:
>>>> <word xMin="359.852025" yMin="462.548936" xMax="365.689478"
>>>> yMax="467.681498">foo</word>
>>>>
>>>> Is there a way to get font information for every word like:
>>>> - font family, e.g. Verdana
>>>> - style, i.e. none, bold, italic
>>>> - size, e.g. font size 9
>>>>
>>>> I'm using pdftotext version 0.55.0 on Windows.
>>>>
>>>
>>> _______________________________________________
>>> poppler mailing list
>>> [email protected]
>>> https://lists.freedesktop.org/mailman/listinfo/poppler
diff --git a/cpp/tests/poppler-dump.cpp b/cpp/tests/poppler-dump.cpp
index a1a6825..a751480 100644
--- a/cpp/tests/poppler-dump.cpp
+++ b/cpp/tests/poppler-dump.cpp
@@ -52,6 +52,7 @@ bool show_pages = false;
bool show_help = false;
char show_text[32];
bool show_text_list = false;
+bool show_glyph_list = false;
poppler::page::text_layout_enum show_text_layout = poppler::page::physical_layout;
static const ArgDesc the_args[] = {
@@ -75,6 +76,8 @@ static const ArgDesc the_args[] = {
"show text (physical|raw) extracted from all pages" },
{ "--show-text-list", argFlag, &show_text_list, 0,
"show text list (experimental)" },
+ { "--show-glyph-list", argFlag, &show_glyph_list, 0,
+ "show glyph list in each word (experimental)" },
{ "-h", argFlag, &show_help, 0,
"print usage information" },
{ "--help", argFlag, &show_help, 0,
@@ -348,6 +351,36 @@ static void print_page_text_list(poppler::page *p)
std::cout << "---" << std::endl;
}
+static void print_page_glyph_list(poppler::page *p)
+{
+ if (!p) {
+ std::cout << std::setw(out_width) << "Broken Page. Could not be parsed" << std::endl;
+ std::cout << std::endl;
Shouldn't this go to std::cerr? Also, "\n" is generally preferred over std::endl, since the latter also unnecessarily flushes the output stream.
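To illustrate the comment above, here is a minimal sketch. The helper `report_broken_page` and its self-check are hypothetical and not part of the patch; the point is only that the diagnostic goes to a stream chosen by the caller (std::cerr in the tool) and that the line ends with '\n' rather than std::endl.

```cpp
#include <cassert>
#include <iostream>
#include <sstream>

// Hypothetical helper (not part of the patch): report an unparsable page on
// the stream chosen by the caller. Passing std::cerr keeps the diagnostic out
// of the regular dump that goes to std::cout.
void report_broken_page(std::ostream &err)
{
    // '\n' only ends the line; std::endl would additionally flush the stream.
    err << "Broken Page. Could not be parsed" << '\n';
}

// Usage check: capture the message in a string stream instead of std::cerr.
void check_report_broken_page()
{
    std::ostringstream captured;
    report_broken_page(captured);
    assert(captured.str() == "Broken Page. Could not be parsed\n");
}
```

In poppler-dump the call would presumably be `report_broken_page(std::cerr);` before the early return.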
+ return;
+ }
+ auto text_list = p->text_list();
I guess this could be const auto text_list?
+ std::cout << "---" << std::endl;
+ for (size_t i = 0; i < text_list.size(); i ++) {
+ poppler::rectf bbox = text_list[i].bbox();
+ poppler::ustring ustr = text_list[i].text();
I would suggest a range-based for loop:
    for (const auto& word : text_list) {
        const auto bbox = word.bbox();
        const auto text = word.text();
+ std::cout << "[" << ustr << "] @ ";
+ std::cout << "( x=" << bbox.x() << " y=" << bbox.y() << " w=" << bbox.width() << " h=" << bbox.height() << " )";
+ std::cout << std::endl;
+
+ for (size_t j = 0; ; j++) {
+ poppler::rectf c_bbox = text_list[i].char_bbox(j);
+ if (c_bbox.x() == 0 && c_bbox.y() == 0 && c_bbox.width() == 0 && c_bbox.height() == 0)
+ break;
I think it would be nice if poppler::rectf had an is_null method to check this, so that it stays consistent everywhere?
Also, Qt for example uses negative width/height to indicate invalid rectangles, since all zeros is basically a valid, if degenerate, rectangle.
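A minimal sketch of what the two remarks above could look like together. The types `rect` and `word` are hypothetical stand-ins for poppler::rectf and poppler::text_box, and `is_null()` is the method proposed here, not an existing API as far as this thread shows. The assumption is that an out-of-range `char_bbox(j)` returns a rectangle with negative size as the sentinel, so an all-zero box remains a valid value.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for poppler::rectf.
struct rect {
    double x, y, width, height;

    // Proposed convention: a negative width or height marks "no rectangle",
    // so an all-zero rectangle stays a valid, merely degenerate, value.
    bool is_null() const { return width < 0.0 || height < 0.0; }
};

// Hypothetical stand-in for poppler::text_box.
struct word {
    std::vector<rect> glyphs;

    // Returns the negative-size sentinel when j is out of range.
    rect char_bbox(std::size_t j) const
    {
        return j < glyphs.size() ? glyphs[j] : rect{0.0, 0.0, -1.0, -1.0};
    }
};

// Walks the glyph boxes like the inner loop of the patch, but stops on
// is_null() instead of comparing all four fields with zero.
std::size_t count_glyphs(const word &w)
{
    std::size_t j = 0;
    while (!w.char_bbox(j).is_null())
        ++j;
    return j;
}

// Usage check: a degenerate all-zero first glyph must not end the loop early.
void check_count_glyphs()
{
    word w;
    w.glyphs = {rect{0.0, 0.0, 0.0, 0.0}, rect{72.0, 72.624, 13.344, 21.6}};
    assert(count_glyphs(w) == 2);
    assert(count_glyphs(word{}) == 0);
}
```

With the all-zero check from the patch, the first glyph in the usage check would end the loop immediately; the sentinel makes the termination condition independent of the glyph geometry.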
+ std::cout << "\t[" << j << "] @ ";
+ std::cout << "( x=" << c_bbox.x() << " y=" << c_bbox.y() << " w=" << c_bbox.width() << " h=" << c_bbox.height() << " )";
+ std::cout << std::endl;
+ }
+
+ }
+ std::cout << "---" << std::endl;
+}
+
int main(int argc, char *argv[])
{
@@ -432,6 +465,15 @@ int main(int argc, char *argv[])
print_page_text_list(p.get());
}
}
+ if (show_glyph_list) {
+ const int pages = doc->pages();
+ for (int i = 0; i < pages; ++i) {
+ std::cout << "Page " << (i + 1) << "/" << pages << ":" << std::endl;
+ std::unique_ptr<poppler::page> p(doc->create_page(i));
+ print_page_glyph_list(p.get());
+ }
+ }
+
return 0;
}
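Regarding the format question from the mail above, the following is a minimal sketch of how the same word and glyph list might be written as JSON instead of the ad-hoc encoding. The types `rect` and `word` and the functions `json_escape`, `write_rect`, and `write_json` are hypothetical and not poppler API. The sketch assumes the word text is already UTF-8 and that all coordinates are finite, and it escapes only quotes, backslashes, and control characters.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-ins for the data printed by --show-glyph-list.
struct rect { double x, y, w, h; };
struct word { std::string text; rect bbox; std::vector<rect> glyphs; };

// Escapes the characters JSON does not allow unescaped inside a string.
std::string json_escape(const std::string &s)
{
    std::string out;
    for (unsigned char c : s) {
        if (c == '"' || c == '\\') {
            out += '\\';
            out += static_cast<char>(c);
        } else if (c < 0x20) {
            char buf[8];
            std::snprintf(buf, sizeof buf, "\\u%04x", static_cast<unsigned>(c));
            out += buf;
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}

// Writes one rectangle as a JSON object with the same keys as the text dump.
void write_rect(std::ostream &out, const rect &r)
{
    out << "{\"x\":" << r.x << ",\"y\":" << r.y
        << ",\"w\":" << r.w << ",\"h\":" << r.h << "}";
}

// Writes all words of a page as a JSON array, one object per word.
void write_json(std::ostream &out, const std::vector<word> &words)
{
    out << "[";
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (i) out << ",";
        out << "{\"text\":\"" << json_escape(words[i].text) << "\",\"bbox\":";
        write_rect(out, words[i].bbox);
        out << ",\"glyphs\":[";
        for (std::size_t j = 0; j < words[i].glyphs.size(); ++j) {
            if (j) out << ",";
            write_rect(out, words[i].glyphs[j]);
        }
        out << "]}";
    }
    out << "]\n";
}

// Usage check with a single word that has one glyph box.
void check_write_json()
{
    word w;
    w.text = "a";
    w.bbox = rect{1.0, 2.0, 3.0, 4.0};
    w.glyphs = {rect{1.0, 2.0, 3.0, 4.0}};
    std::ostringstream out;
    write_json(out, std::vector<word>(1, w));
    assert(out.str() ==
           "[{\"text\":\"a\",\"bbox\":{\"x\":1,\"y\":2,\"w\":3,\"h\":4},"
           "\"glyphs\":[{\"x\":1,\"y\":2,\"w\":3,\"h\":4}]}]\n");
}
```

A consumer could then parse the output with any JSON library instead of matching the `[text] @ ( x=... )` lines by hand; whether JSON or YAML fits better is left open, as in the mail.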
