On Thursday, 12 April 2018, at 10:33:33 CEST, suzuki toshiya wrote:
> Dear Jeroen,
>
> Please let me prepare some data for a regression test.
> The data I've tested are mainly ASCII or UTF-16BE data.
> I should check PDFEncoding data cases (if anybody already has something
> appropriate, please let me know).

I have around 1700 PDFs here, collected from random bugs, so if you give me a
patch and a test/regression program that outputs something that can be
diff'ed, I can "easily" compare the before and after.

Cheers,
  Albert

>
> Regards,
> mpsuzuki
>
> Jeroen Ooms wrote:
> > FYI the encoding problems still exist in the master branch today. I am
> > very interested in this patch by mpsuzuki, what can we do to move this
> > forward?
> >
> > On Wed, Mar 28, 2018 at 2:26 PM, suzuki toshiya
> > <[email protected]> wrote:
> >> Dear Adam,
> >>
> >> Adam Reichold wrote:
> >>>> I see. Where is the appropriate place to add documentation for the
> >>>> poppler::ustring class itself?
> >>>
> >>> Personally, I would suggest Doxygen comments in the public header.
> >>
> >> Thanks! Now I'm trying to write them... I also found that the Doxygen
> >> comments for text_list need improvement.
> >>
> >> While checking the existing functions (to add documentation),
> >> I found a few inconsistencies about the BOM.
> >>
> >> * ustring::to_latin1(): this function does not use iconv();
> >>   it just casts between unsigned short and char. A BOM cannot be
> >>   converted to Latin-1, but the existence of a BOM is not checked.
> >>   If the stored UTF-16 has a BOM, a broken 8-bit character is
> >>   inserted at the beginning of the result.
> >>
> >> * ustring::from_latin1(): this function does not use iconv()
> >>   either. No BOM is inserted at the beginning; a BOM-less UTF-16
> >>   string is created.
> >>
> >> * ustring::to_utf8(): whether a BOM is emitted is decided by iconv().
> >>
> >> * ustring::from_utf8(): assumes that iconv() returns UTF-16 with a BOM.
> >>
> >> I will collect the Debian software packages depending on libpoppler-cpp
> >> and check how they use the ustring object. From my rough check there
> >> are fewer than 10, so checking all of them would not be too
> >> time-consuming.
> >> If there is software that always skips the
> >> first character of the UTF-16 (on the assumption that a
> >> ustring is always UTF-16 with a BOM), some discussion is
> >> needed.
> >>
> >> Regards,
> >> mpsuzuki
> >>
> >> _______________________________________________
> >> poppler mailing list
> >> [email protected]
> >> https://lists.freedesktop.org/mailman/listinfo/poppler
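
To make the to_latin1() point from the quoted notes concrete, here is a minimal, self-contained sketch of what a cast-only narrowing does to a leading BOM, next to a variant that skips it. This is not poppler's code: utf16_buffer, widen_latin1, narrow_naive, narrow_skip_bom and run_examples are hypothetical names, and the buffer is a plain std::vector<unsigned short> rather than poppler::ustring. It only assumes the behaviour described in the thread (a per-unit cast, no iconv(), no BOM check).

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a UTF-16 buffer; this is NOT poppler::ustring.
using utf16_buffer = std::vector<unsigned short>;

constexpr unsigned short kBom = 0xFEFF;  // U+FEFF BYTE ORDER MARK

// Widen each Latin-1 byte to one UTF-16 unit. No BOM is prepended,
// matching what the thread reports for from_latin1().
utf16_buffer widen_latin1(const std::string &s) {
    utf16_buffer out;
    out.reserve(s.size());
    for (unsigned char c : s) out.push_back(c);
    return out;
}

// Cast-only narrowing with no BOM check, as the thread describes for
// to_latin1(). Only the low 8 bits of each unit survive.
std::string narrow_naive(const utf16_buffer &u) {
    std::string out;
    out.reserve(u.size());
    for (unsigned short unit : u)
        out.push_back(static_cast<char>(unit & 0xFF));
    return out;
}

// Same narrowing, but a leading BOM (if any) is skipped first.
std::string narrow_skip_bom(const utf16_buffer &u) {
    const std::size_t start = (!u.empty() && u.front() == kBom) ? 1 : 0;
    std::string out;
    out.reserve(u.size() - start);
    for (std::size_t i = start; i < u.size(); ++i)
        out.push_back(static_cast<char>(u[i] & 0xFF));
    return out;
}

// Small self-check showing the difference between the two variants.
void run_examples() {
    const utf16_buffer plain = widen_latin1("abc");
    utf16_buffer with_bom = plain;
    with_bom.insert(with_bom.begin(), kBom);

    assert(narrow_naive(plain) == "abc");
    assert(narrow_naive(with_bom).size() == 4);  // one stray leading byte
    assert(narrow_skip_bom(with_bom) == "abc");
    assert(narrow_skip_bom(plain) == "abc");     // no BOM: unchanged
}
```

With the naive variant, the BOM unit 0xFEFF is truncated to the single byte 0xFF, which is the "broken 8-bit" character mentioned above. Whether callers should strip the BOM themselves or the library should do it is exactly the open question in the thread; the sketch does not settle that.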
