A minimal example of this in a simple C++ program: https://git.io/vAQFW
When running the example on a simple English PDF file, page->text() prints correctly; however, the metadata fields as well as the words from page->text_list() come out with the wrong encoding (see also the sketch after the quoted message below). What am I doing wrong here?

On Mon, Mar 5, 2018 at 3:10 PM, Jeroen Ooms <[email protected]> wrote:
> I'm testing the new page::text_list() function but I run into an old
> problem where the conversion of the ustring to UTF-8 doesn't do what I
> expect:
>
> byte_array buf = x.to_utf8();
> std::string y(buf.begin(), buf.end());
> const char * str = y.c_str();
>
> The resulting char * is not UTF-8. It contains random Chinese
> characters for PDF files with plain English ASCII text. I can work
> around the problem by using x.to_latin1(), which gives the correct
> text, mostly, but obviously that doesn't work for non-English text.
>
> I remember running into this before, for example when reading a
> toc_item->title() or a document->info_key(): the conversion to UTF-8 also
> doesn't seem to work. Perhaps I am misunderstanding how this works. Is
> there some limitation on PDFs or ustrings that limits their ability to
> be converted to UTF-8?
>
> Somehow I am not getting this problem for ustrings from the page->text()
> method.
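For reference, below is a minimal, self-contained sketch of the conversion pattern discussed above, written against the poppler-cpp headers (poppler-document.h, poppler-page.h, poppler-global.h). It is not the linked git.io example: the file path "test.pdf" and the choice of the "Title" info key are placeholders, and to_std_string() is just the to_utf8()/std::string conversion from the quoted message wrapped in a helper.

    #include <poppler-document.h>
    #include <poppler-global.h>
    #include <poppler-page.h>

    #include <iostream>
    #include <string>

    // Same conversion as in the quoted message:
    // ustring -> to_utf8() byte_array -> std::string.
    static std::string to_std_string(const poppler::ustring &u) {
        poppler::byte_array buf = u.to_utf8();
        return std::string(buf.begin(), buf.end());
    }

    int main() {
        // Placeholder path; substitute any plain-English PDF.
        poppler::document *doc = poppler::document::load_from_file("test.pdf");
        if (!doc) return 1;

        // A metadata field from the info dictionary ("Title" as an example).
        std::cout << "title: " << to_std_string(doc->info_key("Title")) << "\n";

        poppler::page *page = doc->create_page(0);
        if (!page) { delete doc; return 1; }

        // Whole-page text: converts to UTF-8 as expected.
        std::cout << "text: " << to_std_string(page->text()) << "\n";

        // Per-word boxes from text_list(): these are the ones that come out
        // with the wrong encoding in the runs described above.
        for (const poppler::text_box &box : page->text_list())
            std::cout << "word: " << to_std_string(box.text()) << "\n";

        delete page;
        delete doc;
        return 0;
    }

This assumes a poppler-cpp version that already has page::text_list(), and builds with e.g. g++ example.cpp $(pkg-config --cflags --libs poppler-cpp). The expectation is that all three conversions yield UTF-8; in the behaviour described above, only page->text() does.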
