On Sunday, 18 March 2018, at 19:39:41 CET, Adam Reichold wrote:
> Hello Albert,
>
> On 18.03.2018 at 18:52, Albert Astals Cid wrote:
> > On Sunday, 18 March 2018, at 15:46:42 CET, Adam Reichold wrote:
> >> Hello mpsuzuki,
> >>
> >> On 18.03.2018 at 13:18, suzuki toshiya wrote:
> >>> Dear Albert,
> >>>
> >>> please let me confirm your thought.
> >>>
> >>>>> Maybe the consideration of GlobalParams::textEncoding
> >>>>> would be discussed in future when cpp frontend introduces an API
> >>>>> to modify it to non-Unicode values.
> >>>>
> >>>> Honestly i don't think that makes any sense, why would you want that?
> >>>
> >>> do you mean that "for cpp frontend, no need to care the cases that
> >>> non-Unicode encoding is specified in GlobalParams::textEncoding" ?
> >>>
> >>> if so, its reason would be "because text(), text_list(), etc return
> >>> the texts by ustring objects, thus, even if the clients can set
> >>> GlobalParams::textEncoding to preferred non-Unicode encoding, they
> >>> cannot retrieve the text in the preferred non-Unicode encoding.
> >>> therefore, no need to expose GlobalParams::setTextEncoding() via
> >>> cpp frontend" ?
> >>>
> >>> if this is what you meant, I agree that no need to care the cases
> >>> that non-Unicode encoding in GlobalParams::textEncoding.
> >>>
> >>> The reason why I tried to care such cases was: some utils (like
> >>> pdftotext) allow users to specify non-Unicode encoding, so I was
> >>> wondering whether something similar would be added to cpp frontend
> >>> in future. If there's no such, it's good news for me.
> >>>
> >>> Sorry for lengthy confirmation!
> >>
> >> I think you might be confusing two distinct interfaces:
> >>
> >> * The CPP frontend and the actual user application: There should be no
> >> mentioning of GlobalParams here, since this is an internal
> >> implementation detail (that we want to get rid of if at all possible)
> >> and the user application should not know or care about it. So we
> >> definitely should not expose GlobalParams directly or
> >> GlobalParams::setTextEncoding indirectly.
> >>
> >> * The internal Poppler API and the CPP frontend: The CPP frontend
> >> currently assumes that GlobalParams::textEncoding is "UTF-8" which is
> >> almost alright as it does not expose GlobalParams, and hence the user
> >> application cannot change it and relying on the default value is fine.
> >> This should only break if the default value changes (and hence the CPP
> >> frontend needs to be adjusted) or the user application circumvents the
> >> CPP frontend by using the internal API directly (but this seems its own
> >> fault IMHO).
> >
> > Exactly, as far as the cpp frontend is concerned
> > GlobalParams::textEncoding is always "UTF-8" so that's all you need to
> > care about.
> >
> >> Of course, ideally we would not have GlobalParams and the CPP frontend
> >> would pass in the desired encoding everywhere text is extracted using
> >> the Poppler API.
> >
> > I disagree, there's no point in letting the user of poppler choose which
> > encoding the strings should be returned, if she wants to use a different
> > encoding, she can do the conversion on the application side.
>
> I did not mean that the end-user application should decide, just the
> frontend, i.e. the CPP frontend seems to have decided that it will
> always present text as "ustring" which is UTF-16 encoded.
> Hence it would be more efficient to just request the Poppler core to
> return UTF-16 encoded data within GooString instead of UTF-8 and then
> converting to UTF-16 before giving it to the application.
> (The part about specifying the desired encoding whenever text is
> extracted is only about avoiding global state as much as possible and IMHO
> desirable in any case.)

Ah, ok, that makes some sense, yes. On the other hand, it means the cpp
frontend would be using a less "used" GlobalParams::textEncoding value and
might get unique bugs because of that, but yeah, ideally we would not have
bugs and what you suggest would be somewhat more efficient.

Cheers,
  Albert

>
> Best regards, Adam.
>
> > It is slightly different for pdftotext since that's an end user
> > application so it makes sense letting the user specify the output she
> > wants, but for the cpp API there's going to be code on top of it so if
> > further conversion is needed it can be done there.
> >
> > Cheers,
> >
> > Albert
> >>
> >> It could then also just request UTF-16 encoding for its
> >> ustring representation instead of always converting UTF-8 to UTF-16
> >> before passing it to the user application.
> >>
> >> Best regards, Adam.
> >>
> >>> Regards,
> >>> mpsuzuki
> >>>
> >>> Albert Astals Cid wrote:
> >>>> On Thursday, 15 March 2018, at 14:20:52 CET, suzuki toshiya wrote:
> >>>>> Dear Albert,
> >>>>>
> >>>>> Thank you, I'm glad to hear that one of the direction could be
> >>>>> acceptable. Maybe the consideration of GlobalParams::textEncoding
> >>>>> would be discussed in future when cpp frontend introduces an API
> >>>>> to modify it to non-Unicode values.
> >>>>
> >>>> Honestly i don't think that makes any sense, why would you want that?
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Albert
> >>>>>
> >>>>> Now I'm discussing with Jeroen about how to fix other metadata
> >>>>> (not related with text_list() API), please wait a while.
> >>>>>
> >>>>> Regards,
> >>>>> mpsuzuki
> >>>>>
> >>>>> Albert Astals Cid wrote:
> >>>>>> On Wednesday, 14 March 2018, at 18:17:47 CET, suzuki toshiya wrote:
> >>>>>>> Dear Adam,
> >>>>>>>
> >>>>>>> The 2nd option, iconv + GlobalParams::textEncoding solution might be
> >>>>>>> something like:
> >>>>>>> https://github.com/mpsuzuki/poppler/commit/80f088c4a94b151ddb90e05147a8dc4565e01d89 ?
> >>>>>>
> >>>>>> Seems a bit too much to me.
> >>>>>>
> >>>>>> I've personally had no time to test the other solution you sent
> >>>>>> (replacing unicode_GooString_to_ustring with from_utf8), but if that
> >>>>>> one works, it seems much simpler and more straightforward and I'd
> >>>>>> like to commit that.
> >>>>>>
> >>>>>> Cheers,
> >>>>>>
> >>>>>> Albert
> >>>>>>>
> >>>>>>> Regards,
> >>>>>>> mpsuzuki
> >>>>>>>
> >>>>>>> suzuki toshiya wrote:
> >>>>>>>> Oops, I'm quite sorry for my mistake which make people confused as
> >>>>>>>> if my bits are in github.com/freedesktop. The right places are:
> >>>>>>>>
> >>>>>>>> sample PDF file
> >>>>>>>> https://github.com/mpsuzuki/poppler/blob/cpp-textlist-encoding-issue/cpp/tests/HereIsUSASCII.pdf
> >>>>>>>>
> >>>>>>>> a easiest (and oversimplified) fix for this issue
> >>>>>>>> https://github.com/mpsuzuki/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
> >>>>>>>>
> >>>>>>>> Regards,
> >>>>>>>> mpsuzuki
> >>>>>>>>
> >>>>>>>> suzuki toshiya wrote:
> >>>>>>>>> Dear Jeroen, Adam,
> >>>>>>>>>
> >>>>>>>>> Sorry for long latency about this issue. I would try to draft
> >>>>>>>>> the solutions suggested by Adam.
> >>>>>>>>>
> >>>>>>>>> Yet I'm not sure what I'm seeing now is same trouble with you.
> >>>>>>>>> In my case, the testing PDF is:
> >>>>>>>>> https://rawgit.com/freedesktop/poppler/1fb325f4b0a92ca28c110d46dc7a9dba2ad94367/cpp/tests/HereIsUSASCII.pdf
> >>>>>>>>> (maybe I should provide a PDF showing surrogate characters to
> >>>>>>>>> clarify the difference of UTF-8 & UTF-16)
> >>>>>>>>> I see your testing code shows same outputs for ASCII, but
> >>>>>>>>> different outputs for Cyrill etc. So, the encodings by text()
> >>>>>>>>> and textlist() are different, although their types are same
> >>>>>>>>> (ustring). It should be fixed. However, US-ASCII characters
> >>>>>>>>> are not garbled. If it's different from the trouble you're
> >>>>>>>>> seeing, please let me know.
> >>>>>>>>>
> >>>>>>>>> Now the easiest solution, using ustring::from_utf8() is drafted.
> >>>>>>>>> https://github.com/freedesktop/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
> >>>>>>>>> Please check if it works for you. I think it works well in my
> >>>>>>>>> environment.
> >>>>>>>>>
> >>>>>>>>> I would proceed to the next one, implementing something like
> >>>>>>>>> ustring::from_utf8() which reflects GlobalParams::textEncoding.
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> mpsuzuki
> >>>>>>>>>
> >>>>>>>>> Adam Reichold wrote:
> >>>>>>>>>> Hello Jeroen,
> >>>>>>>>>>
> >>>>>>>>>> On 06.03.2018 at 12:59, Jeroen Ooms wrote:
> >>>>>>>>>>> On Tue, Mar 6, 2018 at 10:31 AM, Adam Reichold
> >>>>>>>>>>> <[email protected]> wrote:
> >>>>>>>>>>>> Hello mpsuzuki,
> >>>>>>>>>>>>
> >>>>>>>>>>>> from a glance at the code, it seems page::text uses
> >>>>>>>>>>>> ustring::from_utf8 to convert Poppler's GooString into
> >>>>>>>>>>>> ustring which seems correct if GlobalParams::textEncoding
> >>>>>>>>>>>> has its default value of "UTF-8".
> >>>>>>>>>>>
> >>>>>>>>>>> I don't understand this part. Why is textEncoding a global
> >>>>>>>>>>> property? Shouldn't this be a property of single pdf document?
> >>>>>>>>>>> Is there some way I can read a document's encoding from the
> >>>>>>>>>>> C++ api (without including GlobalParams.h)?
> >>>>>>>>>>>
> >>>>>>>>>>> The pdf spec states that different strings may have different
> >>>>>>>>>>> encodings. Perhaps it would be possible to expose an encoding
> >>>>>>>>>>> field in the ustring class? If there would be a way to know the
> >>>>>>>>>>> encoding of a ustring, I can get the raw data and convert it to
> >>>>>>>>>>> a suitable encoding myself. This would be much better than
> >>>>>>>>>>> making assumptions.
> >>>>>>>>>>
> >>>>>>>>>> This is not the encoding of the text in the PDF document, but the
> >>>>>>>>>> encoding of the GooString that are returned by the internal
> >>>>>>>>>> Poppler API.
> >>>>>>>>>> Also I think the ustring class is intended to always store UTF-16
> >>>>>>>>>> encoded data.
> >>>>>>>>>>
> >>>>>>>>>> Best regards, Adam.

_______________________________________________
poppler mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/poppler
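
For illustration only: the UTF-8 to UTF-16 step debated in this thread (the core hands back UTF-8 bytes, the cpp frontend turns them into UTF-16 code units) can be sketched roughly as below. This is a minimal standalone sketch, not Poppler's implementation: utf8_to_utf16 is a hypothetical helper, and std::u16string stands in for ustring on the assumption, stated in the thread, that ustring is meant to hold UTF-16 data.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>

// Hypothetical helper (not a Poppler function): decode UTF-8 bytes into
// UTF-16 code units. Kept minimal: it rejects bad lead bytes, bad
// continuation bytes, truncated input and code points above U+10FFFF, but
// it does NOT reject overlong forms or UTF-8-encoded surrogates.
std::u16string utf8_to_utf16(const std::string& in) {
    std::u16string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char b0 = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;  // code point being assembled
        std::size_t len = 0;   // total bytes in this sequence
        if (b0 < 0x80)                { cp = b0;        len = 1; }
        else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; }
        else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; }
        else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");

        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");

        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char b = static_cast<unsigned char>(in[i + k]);
            if ((b & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (b & 0x3F);
        }
        i += len;

        if (cp > 0x10FFFF)
            throw std::invalid_argument("code point out of Unicode range");

        if (cp < 0x10000) {
            // BMP character: one UTF-16 code unit.
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Supplementary character: encode as a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}
```

Calling utf8_to_utf16("\xD0\xAF") (Cyrillic "Я") gives the single code unit 0x042F, while a character outside the BMP becomes a surrogate pair. This per-character decoding is the extra work Adam's suggestion of requesting UTF-16 from the core aims to avoid.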
