On Thursday, 15 March 2018, at 14:20:52 CET, suzuki toshiya wrote:
> Dear Albert,
>
> Thank you, I'm glad to hear that one of the directions could be
> acceptable. Maybe the consideration of GlobalParams::textEncoding
> could be discussed in the future, when the cpp frontend introduces an API
> to modify it to non-Unicode values.

Honestly, I don't think that makes any sense; why would you want that?

Cheers,
  Albert

> Now I'm discussing with Jeroen how to fix the other metadata
> (not related to the text_list() API), please wait a while.
>
> Regards,
> mpsuzuki
>
> Albert Astals Cid wrote:
> > On Wednesday, 14 March 2018, at 18:17:47 CET, suzuki toshiya wrote:
> >> Dear Adam,
> >>
> >> The 2nd option, the iconv + GlobalParams::textEncoding solution, might be
> >> something like:
> >> https://github.com/mpsuzuki/poppler/commit/80f088c4a94b151ddb90e05147a8dc4565e01d89 ?
> >
> > Seems a bit too much to me.
> >
> > I've personally had no time to test the other solution you sent
> > (replacing unicode_GooString_to_ustring with from_utf8), but if that one
> > works, it seems much simpler and more straightforward and I'd like to
> > commit that.
> >
> > Cheers,
> >
> > Albert
> >>
> >> Regards,
> >> mpsuzuki
> >>
> >> suzuki toshiya wrote:
> >>> Oops, I'm quite sorry for my mistake, which confused people into
> >>> thinking my bits are in github.com/freedesktop. The right places are:
> >>>
> >>> sample PDF file
> >>> https://github.com/mpsuzuki/poppler/blob/cpp-textlist-encoding-issue/cpp/tests/HereIsUSASCII.pdf
> >>>
> >>> the easiest (and oversimplified) fix for this issue
> >>> https://github.com/mpsuzuki/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
> >>>
> >>> Regards,
> >>> mpsuzuki
> >>>
> >>> suzuki toshiya wrote:
> >>>> Dear Jeroen, Adam,
> >>>>
> >>>> Sorry for the long latency on this issue. I will try to draft
> >>>> the solutions suggested by Adam.
> >>>>
> >>>> Yet I'm not sure whether what I'm seeing now is the same trouble as yours.
> >>>> In my case, the testing PDF is:
> >>>> https://rawgit.com/freedesktop/poppler/1fb325f4b0a92ca28c110d46dc7a9dba2ad94367/cpp/tests/HereIsUSASCII.pdf
> >>>> (maybe I should provide a PDF showing surrogate characters to
> >>>> clarify the difference between UTF-8 & UTF-16)
> >>>> I see your testing code shows the same outputs for ASCII, but
> >>>> different outputs for Cyrillic etc. So, the encodings used by text()
> >>>> and text_list() are different, although their types are the same
> >>>> (ustring). It should be fixed. However, US-ASCII characters
> >>>> are not garbled. If that's different from the trouble you're
> >>>> seeing, please let me know.
> >>>>
> >>>> Now the easiest solution, using ustring::from_utf8(), is drafted:
> >>>> https://github.com/freedesktop/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
> >>>> Please check if it works for you. I think it works well in my
> >>>> environment.
> >>>>
> >>>> I will proceed to the next one, implementing something like
> >>>> ustring::from_utf8() which reflects GlobalParams::textEncoding.
> >>>>
> >>>> Regards,
> >>>> mpsuzuki
> >>>>
> >>>> Adam Reichold wrote:
> >>>>> Hello Jeroen,
> >>>>>
> >>>>> On 06.03.2018 at 12:59, Jeroen Ooms wrote:
> >>>>>> On Tue, Mar 6, 2018 at 10:31 AM, Adam Reichold
> >>>>>> <[email protected]> wrote:
> >>>>>>> Hello mpsuzuki,
> >>>>>>>
> >>>>>>> from a glance at the code, it seems page::text uses
> >>>>>>> ustring::from_utf8 to convert Poppler's GooString into ustring,
> >>>>>>> which seems correct if GlobalParams::textEncoding has its
> >>>>>>> default value of "UTF-8".
> >>>>>>
> >>>>>> I don't understand this part. Why is textEncoding a global property?
> >>>>>> Shouldn't this be a property of a single PDF document? Is there some
> >>>>>> way I can read a document's encoding from the C++ API (without
> >>>>>> including GlobalParams.h)?
> >>>>>>
> >>>>>> The PDF spec states that different strings may have different
> >>>>>> encodings. Perhaps it would be possible to expose an encoding field
> >>>>>> in the ustring class? If there were a way to know the encoding of a
> >>>>>> ustring, I could get the raw data and convert it to a suitable
> >>>>>> encoding myself. This would be much better than making assumptions.
> >>>>>
> >>>>> This is not the encoding of the text in the PDF document, but the
> >>>>> encoding of the GooStrings that are returned by the internal Poppler
> >>>>> API. Also, I think the ustring class is intended to always store
> >>>>> UTF-16 encoded data.
> >>>>>
> >>>>> Best regards, Adam.
> >>>>
> >>>> _______________________________________________
> >>>> poppler mailing list
> >>>> [email protected]
> >>>> https://lists.freedesktop.org/mailman/listinfo/poppler
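
For readers following the UTF-8 vs. UTF-16 point in this thread, here is a
minimal standalone sketch of why US-ASCII output can look identical while
Cyrillic (or anything needing surrogates) comes out different. It is not
Poppler code: utf8_to_utf16() and widen_bytes() are hypothetical helpers
written only for illustration, the decoder assumes well-formed UTF-8 and does
no validation, and the byte-widening function is just one plausible way a
mismatch could arise, not a claim about what unicode_GooString_to_ustring
actually does.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper (not part of Poppler): decode UTF-8 bytes into
// UTF-16 code units. Assumes well-formed input; no error handling.
std::u16string utf8_to_utf16(const std::string &in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp;   // decoded code point
        std::size_t len;    // number of bytes in this sequence
        if (b0 < 0x80)             { cp = b0;        len = 1; }
        else if ((b0 >> 5) == 0x6) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 >> 4) == 0xE) { cp = b0 & 0x0F; len = 3; }
        else                       { cp = b0 & 0x07; len = 4; }
        // Fold in the continuation bytes, 6 payload bits each.
        for (std::size_t k = 1; k < len && i + k < in.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp >= 0x10000) {
            // Outside the BMP: emit a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<char16_t>(cp));
        }
    }
    return out;
}

// Hypothetical "mismatched" conversion: treat every byte as one 16-bit unit.
std::u16string widen_bytes(const std::string &in) {
    std::u16string out;
    for (unsigned char b : in)
        out.push_back(static_cast<char16_t>(b));
    return out;
}
```

With these two helpers, a pure ASCII string such as "PDF" gives the same
code units either way, whereas the two UTF-8 bytes of a Cyrillic letter
decode to one UTF-16 unit with utf8_to_utf16() but stay as two unrelated
units with widen_bytes(). That matches the symptom described above: ASCII
is not garbled, other scripts are.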
