Oops, I'm sorry for my mistake, which may have given the impression that my changes are in github.com/freedesktop. The correct locations are:

Sample PDF file:
https://github.com/mpsuzuki/poppler/blob/cpp-textlist-encoding-issue/cpp/tests/HereIsUSASCII.pdf

The easiest (and oversimplified) fix for this issue:
https://github.com/mpsuzuki/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9

Regards,
mpsuzuki

suzuki toshiya wrote:
> Dear Jeroen, Adam,
>
> Sorry for the long delay on this issue. I will try to draft
> the solutions suggested by Adam.
>
> However, I'm not sure that what I'm seeing now is the same problem
> you are seeing. In my case, the test PDF is:
> https://rawgit.com/freedesktop/poppler/1fb325f4b0a92ca28c110d46dc7a9dba2ad94367/cpp/tests/HereIsUSASCII.pdf
> (maybe I should provide a PDF showing surrogate characters to
> clarify the difference between UTF-8 & UTF-16)
> I see that your test code shows the same output for ASCII, but
> different output for Cyrillic etc. So the encodings used by text()
> and textlist() are different, although their types are the same
> (ustring). That should be fixed. However, US-ASCII characters
> are not garbled. If this is different from the problem you're
> seeing, please let me know.
>
> The easiest solution, using ustring::from_utf8(), is now drafted:
> https://github.com/freedesktop/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
> Please check whether it works for you. I think it works well in my
> environment.
>
> I will proceed to the next one, implementing something like
> ustring::from_utf8() which reflects GlobalParams::textEncoding.
>
> Regards,
> mpsuzuki
>
>
> Adam Reichold wrote:
>> Hello Jeroen,
>>
>> Am 06.03.2018 um 12:59 schrieb Jeroen Ooms:
>>> On Tue, Mar 6, 2018 at 10:31 AM, Adam Reichold
>>> <[email protected]> wrote:
>>>> Hello mpsuzuki,
>>>>
>>>> from a glance at the code, it seems page::text uses ustring::from_utf8
>>>> to convert Poppler's GooString into ustring which seems correct if
>>>> GlobalParams::textEncoding has its default value of "UTF-8" .
>>> I don't understand this part. Why is textEncoding a global property?
>>> Shouldn't this be a property of single pdf document? Is there some way
>>> I can read a document's encoding from the C++ api (without including
>>> GlobalParams.h).
>>>
>>> The pdf spec states that different strings may have different
>>> encodings. Perhaps it would be possible to expose an encoding field in
>>> the ustring class? If there would be a way to know the encoding of a
>>> ustring, I can get the raw data and convert it to a suitable encoding
>>> myself. This would be much better than making assumptions.
>> This is not the encoding of the text in the PDF document, but the
>> encoding of the GooString that are returned by the internal Poppler API.
>> Also I think the ustring class is intended to always store UTF-16
>> encoded data.
>>
>> Best regards, Adam.
>>
>>
>
> _______________________________________________
> poppler mailing list
> [email protected]
> https://lists.freedesktop.org/mailman/listinfo/poppler
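
For reference, the UTF-8 to UTF-16 step discussed in this thread can be sketched in plain C++ as below. This is only an illustration of why ASCII survives a mix-up while Cyrillic and surrogate-range characters do not; it is not poppler's implementation of ustring::from_utf8(). The function name utf8_to_utf16 is hypothetical, and the sketch replaces malformed input with U+FFFD but does not reject overlong forms or UTF-8-encoded surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not part of poppler): decode a UTF-8 byte string
// into UTF-16 code units. Malformed sequences become U+FFFD.
std::vector<std::uint16_t> utf8_to_utf16(const std::string &in)
{
    std::vector<std::uint16_t> out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected

        if (b0 < 0x80) {
            cp = b0;           // US-ASCII: one byte, same value in UTF-16
        } else if ((b0 & 0xE0) == 0xC0) {
            cp = b0 & 0x1F; extra = 1;
        } else if ((b0 & 0xF0) == 0xE0) {
            cp = b0 & 0x0F; extra = 2;
        } else if ((b0 & 0xF8) == 0xF0) {
            cp = b0 & 0x07; extra = 3;
        } else {
            out.push_back(0xFFFD); // stray continuation or invalid lead byte
            ++i;
            continue;
        }

        if (i + extra >= in.size()) {
            out.push_back(0xFFFD); // sequence truncated at end of input
            break;
        }

        bool ok = true;
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80) { ok = false; break; }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (!ok || cp > 0x10FFFF) {
            out.push_back(0xFFFD);
            ++i;
            continue;
        }
        i += extra + 1;

        if (cp < 0x10000) {
            out.push_back(static_cast<std::uint16_t>(cp));
        } else {
            // Outside the BMP: split into a high/low surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<std::uint16_t>(0xD800 | (cp >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 | (cp & 0x3FF)));
        }
    }
    return out;
}
```

With this sketch, an ASCII byte maps to a code unit of the same value, a two-byte Cyrillic sequence collapses to one code unit, and a four-byte sequence expands to two code units (a surrogate pair), which is the case a test PDF with surrogate characters would exercise.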
