On Wednesday, 14 March 2018, at 18:17:47 CET, suzuki toshiya wrote:
> Dear Adam,
>
> The 2nd option, iconv + GlobalParams::textEncoding solution might be
> something like:
> https://github.com/mpsuzuki/poppler/commit/80f088c4a94b151ddb90e05147a8dc4565e01d89 ?
Seems a bit too much to me.

I've personally had no time to test the other solution you sent (replacing
unicode_GooString_to_ustring with from_utf8), but if that one works, it seems
much simpler and more straightforward, and I'd like to commit that.

Cheers,
  Albert

>
> Regards,
> mpsuzuki
>
> suzuki toshiya wrote:
> > Oops, I'm quite sorry for my mistake, which confused people as
> > if my bits were in github.com/freedesktop. The right places are:
> >
> > sample PDF file
> > https://github.com/mpsuzuki/poppler/blob/cpp-textlist-encoding-issue/cpp/tests/HereIsUSASCII.pdf
> >
> > the easiest (and oversimplified) fix for this issue
> > https://github.com/mpsuzuki/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
> >
> > Regards,
> > mpsuzuki
> >
> > suzuki toshiya wrote:
> >> Dear Jeroen, Adam,
> >>
> >> Sorry for the long latency on this issue. I will try to draft
> >> the solutions suggested by Adam.
> >>
> >> Yet I'm not sure whether what I'm seeing now is the same trouble as yours.
> >> In my case, the testing PDF is:
> >> https://rawgit.com/freedesktop/poppler/1fb325f4b0a92ca28c110d46dc7a9dba2ad94367/cpp/tests/HereIsUSASCII.pdf
> >> (maybe I should provide a PDF showing surrogate characters to
> >> clarify the difference between UTF-8 & UTF-16)
> >> I see your testing code shows the same outputs for ASCII, but
> >> different outputs for Cyrillic etc. So, the encodings used by text()
> >> and textlist() are different, although their types are the same
> >> (ustring). It should be fixed. However, US-ASCII characters
> >> are not garbled. If it's different from the trouble you're
> >> seeing, please let me know.
> >>
> >> Now the easiest solution, using ustring::from_utf8(), is drafted.
> >> https://github.com/freedesktop/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
> >> Please check if it works for you. I think it works well in my
> >> environment.
> >>
> >> I will proceed to the next one, implementing something like
> >> ustring::from_utf8() which reflects GlobalParams::textEncoding.
> >>
> >> Regards,
> >> mpsuzuki
> >>
> >> Adam Reichold wrote:
> >>> Hello Jeroen,
> >>>
> >>> On 06.03.2018 at 12:59, Jeroen Ooms wrote:
> >>>> On Tue, Mar 6, 2018 at 10:31 AM, Adam Reichold
> >>>> <[email protected]> wrote:
> >>>>> Hello mpsuzuki,
> >>>>>
> >>>>> from a glance at the code, it seems page::text uses ustring::from_utf8
> >>>>> to convert Poppler's GooString into ustring, which seems correct if
> >>>>> GlobalParams::textEncoding has its default value of "UTF-8".
> >>>>
> >>>> I don't understand this part. Why is textEncoding a global property?
> >>>> Shouldn't this be a property of a single PDF document? Is there some way
> >>>> I can read a document's encoding from the C++ API (without including
> >>>> GlobalParams.h)?
> >>>>
> >>>> The PDF spec states that different strings may have different
> >>>> encodings. Perhaps it would be possible to expose an encoding field in
> >>>> the ustring class? If there were a way to know the encoding of a
> >>>> ustring, I could get the raw data and convert it to a suitable encoding
> >>>> myself. This would be much better than making assumptions.
> >>>
> >>> This is not the encoding of the text in the PDF document, but the
> >>> encoding of the GooStrings that are returned by the internal Poppler API.
> >>> Also, I think the ustring class is intended to always store UTF-16
> >>> encoded data.
> >>>
> >>> Best regards, Adam.
> >>
> >> _______________________________________________
> >> poppler mailing list
> >> [email protected]
> >> https://lists.freedesktop.org/mailman/listinfo/poppler
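
For readers following the thread, here is a minimal sketch of why US-ASCII
text can look fine while Cyrillic text differs when UTF-8 bytes end up in a
UTF-16 container without being decoded. It is only an illustration under
stated assumptions: widen_bytes and utf8_to_utf16 are hypothetical helper
names, not Poppler functions, and the sketch does not claim to reproduce what
unicode_GooString_to_ustring or ustring::from_utf8 actually do internally.
The decoder assumes well-formed UTF-8 and does no validation.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not Poppler API): copy each input byte into one
// 16-bit unit without decoding. This is only correct for US-ASCII input.
std::u16string widen_bytes(const std::string& bytes) {
    std::u16string out;
    for (unsigned char b : bytes) {
        out.push_back(static_cast<char16_t>(b));
    }
    return out;
}

// Hypothetical helper (not Poppler API): decode UTF-8 into UTF-16 code
// units. Assumes the input is well-formed; malformed input is not detected.
std::u16string utf8_to_utf16(const std::string& s) {
    std::u16string out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char b0 = static_cast<unsigned char>(s[i]);
        char32_t cp;
        std::size_t len;
        if (b0 < 0x80) {                 // 1-byte sequence (US-ASCII)
            cp = b0;
            len = 1;
        } else if ((b0 & 0xE0) == 0xC0) { // 2-byte sequence
            cp = b0 & 0x1F;
            len = 2;
        } else if ((b0 & 0xF0) == 0xE0) { // 3-byte sequence
            cp = b0 & 0x0F;
            len = 3;
        } else {                          // 4-byte sequence
            cp = b0 & 0x07;
            len = 4;
        }
        // Fold in the low six bits of each continuation byte.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Code points above the BMP become a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

With these helpers, both functions give the same result for "USA", but for
the UTF-8 bytes D0 96 (U+0416, a Cyrillic letter) the decoder yields one code
unit while the byte-widening variant yields two unrelated ones. A character
outside the BMP, such as U+1F600, decodes to a surrogate pair, which is the
kind of test case mentioned above for telling UTF-8 and UTF-16 apart.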
