El diumenge, 25 de març de 2018, a les 5:39:18 CEST, suzuki toshiya va escriure: > Hi all, > > Finally I think I found the root of issue and I can propose a fix. > pre-patch situation is like this: > https://travis-ci.org/mpsuzuki/poppler/builds/357212162 > > post-patch situation is like this: > https://travis-ci.org/mpsuzuki/poppler/builds/357956103 > > My fix consists from 2 parts.
Can you post the patches either to bugzilla or as attachments to the mailing list? I don't feel confortable using github. > > part 1) > I replaced all detail::unicode_GooString_to_ustring() by > ustring::from_utf8(), this was suggested by Adam. > > https://github.com/mpsuzuki/poppler/commit/7404f5effa8e303399e5101d54ff954ee > 5153e44 > > I think this rather simple fix was already reviewed by Albert. > > part 2) > UTF-16 handling needs some improvement. the issue was reported > by Jeroen. > > https://github.com/mpsuzuki/poppler/commit/b3230c7098b891da0b92742264d78c9bd > 86750bd > > 2-a: PDF's document metadata /Info dict provides sometimes > ASCII (or UTF-8?) string, sometimes UTF-16 string (with BOM). > current implementation assumes it is always UTF-8. I inserted > a conditional to check BOM and deal as UTF-16 if BOM is found. > for ustring object creation by UTF-16, I added a private method > ustring::from_utf16(). I'm afraid there should be a room for > discussion about the better design of UTF-16 in ustring(), > I don't make ustring::from_utf16 at present. > > 2-b: the term "UTF-16" is not always same in various iconv(). > "UTF-16" is sometimes regarded as UTF-16BE, sometimes as UTF-16LE. > please compare the results of "./encoding ./hello.pdf" by Linux > and Mac OS X. > Linux: https://travis-ci.org/mpsuzuki/poppler/jobs/357212163#L1106 > Mac OS X: https://travis-ci.org/mpsuzuki/poppler/jobs/357212164#L2463 > > my fix is conditionalization of the name by WORDS_BIGENDIAN > macro (which is not newly-introduced, it was already used > in the source CairoOutputDev). > > 2-c: unneeded twiced sized buffer is skipped. > > ======================== > > There is a room of the discussion about the design how ustring > object store UTF-16 data. > > There is a direction thinking as the buffer should include BOM > always, to clarify the byteorder. > > There is another direction thinking as we should not expect the > exist of BOM (sometimes it may have, sometimes it may lack), > because the handling of BOM is more complicated than std::basic_string's > existing manipulations (like concatenation, splicing, replacing > etc). > > I took the 2nd direction in my proposed fix, but other experts > think the 1st direction would be better. > > In fact, I'm unfamiliar with how the cpp-frontend users think > about a BOM in ustring object. If there are so many existing > implementations assuming as if ustring always starts with a BOM > (and they have their own routines for the concatenation, splicing > and replacing), we should care for that. Please let me hear how > the users think. I guess this is not very important, I mean after all it was broken so i guess noone could make it really work? what we need to do is document exactly how it behaves (since it seems to be a bit under documented now). Cheers, Albert > > Regards, > mpsuzuki > > Jeroen Ooms wrote: > > Thanks everyone for the work on this issue, really appreciate the > > input. Also excited about mpsuzuki's suggestion to include font data > > with the text_list, this will be super helpful. > > > > I have updated and cleaned my example code a little bit to make it > > easier to test these issues. The updated test program and a small pdf > > file (with expected output) are now here: > > https://github.com/jeroen/popplertest . This makes it easier to update > > the example for new features/issues. > > > > It seems there are at least 3 places where encoding is not behaving as > > expected in 0.63: in the text_list and info_keys and toc. I think > > mpsuzuki's patch addresses these problems. > > _______________________________________________ > > poppler mailing list > > [email protected] > > https://lists.freedesktop.org/mailman/listinfo/poppler > > _______________________________________________ > poppler mailing list > [email protected] > https://lists.freedesktop.org/mailman/listinfo/poppler _______________________________________________ poppler mailing list [email protected] https://lists.freedesktop.org/mailman/listinfo/poppler
