Dear Adam,

Would the 2nd option, the iconv + GlobalParams::textEncoding solution, be something like this?
https://github.com/mpsuzuki/poppler/commit/80f088c4a94b151ddb90e05147a8dc4565e01d89
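To illustrate the general idea only (this is not taken from the commit), here is a minimal sketch of converting a byte string in a named encoding into UTF-16 code units with POSIX iconv. The helper name `to_utf16` is hypothetical and not part of poppler. The `encoding` argument merely stands in for whatever GlobalParams::textEncoding reports. The code assumes a platform whose iconv takes `char **` for the input buffer, as glibc does.

```cpp
#include <iconv.h>

#include <stdexcept>
#include <string>

// Hypothetical helper, NOT a poppler function: convert `bytes`, encoded in
// `encoding` (a stand-in for the value of GlobalParams::textEncoding), into
// UTF-16 code units in host byte order.
std::u16string to_utf16(const std::string &bytes, const char *encoding)
{
    // Use an explicit-endian target name so that iconv does not emit a BOM.
    const char16_t probe = 1;
    const bool little = *reinterpret_cast<const unsigned char *>(&probe) == 1;

    iconv_t cd = iconv_open(little ? "UTF-16LE" : "UTF-16BE", encoding);
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed: unsupported encoding");

    // Two code units per input byte is a generous bound for common encodings
    // (assumption); a too-small buffer is reported as an error below.
    std::u16string out(bytes.size() * 2 + 1, u'\0');

    char *in_ptr = const_cast<char *>(bytes.data());
    size_t in_left = bytes.size();
    char *out_ptr = reinterpret_cast<char *>(&out[0]);
    size_t out_left = out.size() * sizeof(char16_t);

    // Convert the input, then flush any pending shift state.
    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    if (rc != (size_t)-1)
        rc = iconv(cd, nullptr, nullptr, &out_ptr, &out_left);
    iconv_close(cd);

    if (rc == (size_t)-1)
        throw std::runtime_error("iconv failed: invalid input or buffer too small");

    // Drop the unused tail of the output buffer.
    out.resize(out.size() - out_left / sizeof(char16_t));
    return out;
}
```

With such a helper, `to_utf16(s, "UTF-8")` leaves US-ASCII unchanged apart from widening, while a character outside the BMP becomes a surrogate pair, which is where UTF-8 and UTF-16 handling visibly differ.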
Regards,
mpsuzuki

suzuki toshiya wrote:
> Oops, I'm very sorry for my mistake, which may have confused people into
> thinking my changes are in github.com/freedesktop. The right places are:
>
> sample PDF file
> https://github.com/mpsuzuki/poppler/blob/cpp-textlist-encoding-issue/cpp/tests/HereIsUSASCII.pdf
>
> the easiest (and oversimplified) fix for this issue
> https://github.com/mpsuzuki/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
>
> Regards,
> mpsuzuki
>
> suzuki toshiya wrote:
>> Dear Jeroen, Adam,
>>
>> Sorry for the long delay on this issue. I will try to draft
>> the solutions suggested by Adam.
>>
>> Yet I'm not sure whether what I'm seeing now is the same trouble as yours.
>> In my case, the testing PDF is:
>> https://rawgit.com/freedesktop/poppler/1fb325f4b0a92ca28c110d46dc7a9dba2ad94367/cpp/tests/HereIsUSASCII.pdf
>> (maybe I should provide a PDF showing surrogate characters to
>> clarify the difference between UTF-8 & UTF-16)
>> I see your testing code shows the same outputs for ASCII, but
>> different outputs for Cyrillic etc. So, the encodings used by text()
>> and textlist() are different, although their types are the same
>> (ustring). It should be fixed. However, US-ASCII characters
>> are not garbled. If that's different from the trouble you're
>> seeing, please let me know.
>>
>> Now the easiest solution, using ustring::from_utf8(), is drafted.
>> https://github.com/freedesktop/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
>> Please check if it works for you. I think it works well in my
>> environment.
>>
>> I will proceed to the next one, implementing something like
>> ustring::from_utf8() which reflects GlobalParams::textEncoding.
>>
>> Regards,
>> mpsuzuki
>>
>>
>> Adam Reichold wrote:
>>> Hello Jeroen,
>>>
>>> On 06.03.2018 at 12:59, Jeroen Ooms wrote:
>>>> On Tue, Mar 6, 2018 at 10:31 AM, Adam Reichold
>>>> <[email protected]> wrote:
>>>>> Hello mpsuzuki,
>>>>>
>>>>> from a glance at the code, it seems page::text uses ustring::from_utf8
>>>>> to convert Poppler's GooString into ustring which seems correct if
>>>>> GlobalParams::textEncoding has its default value of "UTF-8".
>>>> I don't understand this part. Why is textEncoding a global property?
>>>> Shouldn't this be a property of a single PDF document? Is there some way
>>>> I can read a document's encoding from the C++ API (without including
>>>> GlobalParams.h)?
>>>>
>>>> The PDF spec states that different strings may have different
>>>> encodings. Perhaps it would be possible to expose an encoding field in
>>>> the ustring class? If there were a way to know the encoding of a
>>>> ustring, I could get the raw data and convert it to a suitable encoding
>>>> myself. This would be much better than making assumptions.
>>> This is not the encoding of the text in the PDF document, but the
>>> encoding of the GooStrings that are returned by the internal Poppler API.
>>> Also I think the ustring class is intended to always store UTF-16
>>> encoded data.
>>>
>>> Best regards, Adam.
>>>
>>>
>> _______________________________________________
>> poppler mailing list
>> [email protected]
>> https://lists.freedesktop.org/mailman/listinfo/poppler
