Hello mpsuzuki,

from a glance at the code, page::text uses ustring::from_utf8 to convert Poppler's GooString into a ustring, which is correct as long as GlobalParams::textEncoding has its default value of "UTF-8". page::text_list, on the other hand, uses detail::unicode_GooString_to_ustring, which tries to guess the source encoding from a byte order marker.
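For illustration, here is a rough paraphrase (mine, not the exact Poppler code) of what such a BOM-based guess amounts to: if the buffer starts with the UTF-16BE byte order marker 0xFE 0xFF it is decoded as UTF-16BE, otherwise every byte becomes one UTF-16 code unit, i.e. the data is treated as Latin-1:

    #include <cstddef>
    #include <string>

    // Sketch of a BOM-guessing conversion (paraphrased, not Poppler's
    // actual code): UTF-16BE if the buffer starts with 0xFE 0xFF,
    // Latin-1 otherwise. UTF-8 input carries no such marker, so any
    // multi-byte UTF-8 sequence falls into the Latin-1 branch and is
    // mis-decoded byte by byte.
    std::u16string guess_and_convert(const char *data, std::size_t len)
    {
        std::u16string out;
        const bool utf16be = len >= 2
            && (data[0] & 0xff) == 0xfe
            && (data[1] & 0xff) == 0xff;
        if (utf16be) {
            for (std::size_t i = 2; i + 1 < len; i += 2) {
                out.push_back(char16_t(((data[i] & 0xff) << 8)
                                       | (data[i + 1] & 0xff)));
            }
        } else {
            for (std::size_t i = 0; i < len; ++i) {
                out.push_back(char16_t(static_cast<unsigned char>(data[i])));
            }
        }
        return out;
    }

Whenever the guess goes wrong, bytes are fused into or split out of the wrong UTF-16 code units, which can easily produce garbage like the CJK characters Jeroen reported.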
Personally, I see a few possibilities to fix things:

* Always assume GlobalParams::textEncoding == "UTF-8" for the cpp frontend and use ustring::from_utf8.
* Implement something similar to ustring::from_utf8 based on the capabilities of iconv, and use the actual value of GlobalParams::textEncoding to specify the source encoding (a rough sketch of this follows after the quoted thread below).
* Adjust the Poppler core so that the places relied on by the cpp frontend that currently read GlobalParams::textEncoding take a textEncoding parameter explicitly; the cpp frontend could then request whichever encoding it wants (UTF-8, or I guess even UTF-16 directly).

IMHO, this order is one of increasing effort, but also of increasing long-term maintainability.

Best regards,
Adam.

On 06.03.2018 at 09:00, suzuki toshiya wrote:
> Oh, I should take a look. Do you think any change of the public API
> of the cpp frontend is needed?
>
> Regards,
> mpsuzuki
>
> On 3/6/2018 12:29 AM, Jeroen Ooms wrote:
>> A minimal example of this in a simple C++ program: https://git.io/vAQFW
>>
>> When running the example on a simple English pdf file,
>> page->text() gets printed correctly, but the metadata fields as
>> well as the words from page->text_list() seem to get the wrong
>> encoding. What am I doing wrong here?
>>
>> On Mon, Mar 5, 2018 at 3:10 PM, Jeroen Ooms <[email protected]> wrote:
>>> I'm testing the new page::text_list() function, but I run into an old
>>> problem where the conversion of the ustring to UTF-8 doesn't do what I
>>> expect:
>>>
>>> byte_array buf = x.to_utf8();
>>> std::string y(buf.begin(), buf.end());
>>> const char *str = y.c_str();
>>>
>>> The resulting char * is not UTF-8. It contains random Chinese
>>> characters for pdf files with plain English ASCII text. I can work
>>> around the problem by using x.to_latin1(), which gives the correct
>>> text, mostly, but obviously that doesn't work for non-English text.
>>>
>>> I remember running into this before, for example when reading a
>>> toc_item->title() or a document->info_key(): the conversion to UTF-8
>>> also doesn't seem to work. Perhaps I am misunderstanding how this
>>> works. Is there some limitation on pdfs or ustrings that limits their
>>> ability to be converted to UTF-8?
>>>
>>> Somehow I am not getting this problem for ustrings from the
>>> page->text() method.
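P.S.: To make the second option above a bit more concrete, below is a rough, untested sketch of an iconv-based conversion that honours the actual value of GlobalParams::textEncoding. All names (from_text_encoding, src_encoding) are made up for illustration, and real code would need a proper conversion loop with E2BIG/EILSEQ handling:

    #include <cstddef>
    #include <iconv.h>
    #include <string>
    #include <vector>

    // Hypothetical helper (illustrative name, not existing Poppler API):
    // convert 'len' bytes of 'data', encoded in 'src_encoding' (whatever
    // GlobalParams::textEncoding is set to), into UTF-16 suitable for
    // building a cpp-frontend ustring. Assumes a little-endian host,
    // hence the "UTF-16LE" target.
    std::u16string from_text_encoding(const char *data, std::size_t len,
                                      const char *src_encoding)
    {
        iconv_t cd = iconv_open("UTF-16LE", src_encoding);
        if (cd == (iconv_t)-1) {
            return std::u16string(); // unknown encoding: give up
        }

        std::vector<char> out(len * 4 + 4); // generous worst-case buffer
        char *in_ptr = const_cast<char *>(data); // some platforms want const char **
        std::size_t in_left = len;
        char *out_ptr = out.data();
        std::size_t out_left = out.size();

        // A real implementation would loop and grow the buffer on E2BIG;
        // one call is enough for a sketch.
        iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
        iconv_close(cd);

        const std::size_t out_bytes = out.size() - out_left;
        return std::u16string(reinterpret_cast<const char16_t *>(out.data()),
                              out_bytes / 2);
    }

With something like that in place, the cpp frontend could pass GlobalParams::textEncoding as the source encoding wherever it currently assumes a fixed one.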
