Dear Albert,

Thank you for response. From now, I would post proposed patch
as attachment - if too big, I would file to bugzilla. Here I
attach pt1 and pt2 as separated attachments.

In fact, I'm unfamiliar with how the cpp-frontend users think
about a BOM in ustring object. If there are so many existing
implementations assuming as if ustring always starts with a BOM
(and they have their own routines for the concatenation, splicing
and replacing), we should care for that. Please let me hear how
the users think.

I guess this is not very important, I mean after all it was broken so i guess noone could make it really work?

OK... I found a few comments from Unicode on BOM:
        http://www.unicode.org/faq/utf_bom.html#bom6
It seems that handing BOM during the string is not straight-
forward (just removing it can cause semantically difference),
so I suggest to use non-BOM approach (of current patch),
until some complains from the users/reviewers.

what we need to do is document exactly how it behaves (since it seems to be a bit under documented now).

I see. where is the appropriate place to add a document of
poppler::ustring class itself?

Regards,
mpsuzuki

P.S.
Thanks to Jeroen too for the info how to check the commit quickly,
it is very helpful to make the attachments of this :-)

Jeroen Ooms wrote:
On Mon, Mar 26, 2018 at 10:06 PM, Albert Astals Cid <[email protected]> wrote:
El diumenge, 25 de març de 2018, a les 5:39:18 CEST, suzuki toshiya va
escriure:
Hi all,

Finally I think I found the root of issue and I can propose a fix.
pre-patch situation is like this:
https://travis-ci.org/mpsuzuki/poppler/builds/357212162

post-patch situation is like this:
https://travis-ci.org/mpsuzuki/poppler/builds/357956103

My fix consists from 2 parts.
Can you post the patches either to bugzilla or as attachments to the mailing
list? I don't feel confortable using github.

One easy way to get patches from Github without actually having to
pull in the repository is simply appending ".diff" or  ".patch to the
Github commit URLs that he posted:

 - 
https://github.com/mpsuzuki/poppler/commit/7404f5effa8e303399e5101d54ff954ee5153e44.diff
 - 
https://github.com/mpsuzuki/poppler/commit/b3230c7098b891da0b92742264d78c9bd86750bd.diff
_______________________________________________
poppler mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/poppler

diff --git a/cpp/poppler-document.cpp b/cpp/poppler-document.cpp
index 477a7556..ae1b493e 100644
--- a/cpp/poppler-document.cpp
+++ b/cpp/poppler-document.cpp
@@ -342,7 +342,7 @@ ustring document::info_key(const std::string &key) const
         return ustring();
     }
 
-    return detail::unicode_GooString_to_ustring(goo_value.get());
+    return ustring::from_utf8(goo_value->getCString(), goo_value->getLength());
 }
 
 /**
@@ -433,7 +433,7 @@ ustring document::get_title() const
         return ustring();
     }
 
-    return detail::unicode_GooString_to_ustring(goo_title.get());
+    return ustring::from_utf8(goo_title->getCString(), goo_title->getLength());
 }
 
 /**
@@ -477,7 +477,7 @@ ustring document::get_author() const
         return ustring();
     }
 
-    return detail::unicode_GooString_to_ustring(goo_author.get());
+    return ustring::from_utf8(goo_author->getCString(), goo_author->getLength());
 }
 
 /**
@@ -521,7 +521,7 @@ ustring document::get_subject() const
         return ustring();
     }
 
-    return detail::unicode_GooString_to_ustring(goo_subject.get());
+    return ustring::from_utf8(goo_subject->getCString(), goo_subject->getLength());
 }
 
 /**
@@ -565,7 +565,7 @@ ustring document::get_keywords() const
         return ustring();
     }
 
-    return detail::unicode_GooString_to_ustring(goo_keywords.get());
+    return ustring::from_utf8(goo_keywords->getCString(), goo_keywords->getLength());
 }
 
 /**
@@ -609,7 +609,7 @@ ustring document::get_creator() const
         return ustring();
     }
 
-    return detail::unicode_GooString_to_ustring(goo_creator.get());
+    return ustring::from_utf8(goo_creator->getCString(), goo_creator->getLength());
 }
 
 /**
@@ -653,7 +653,7 @@ ustring document::get_producer() const
         return ustring();
     }
 
-    return detail::unicode_GooString_to_ustring(goo_producer.get());
+    return ustring::from_utf8(goo_producer->getCString(), goo_producer->getLength());
 }
 
 /**
@@ -838,7 +838,7 @@ ustring document::metadata() const
 {
     std::unique_ptr<GooString> md(d->doc->getCatalog()->readMetadata());
     if (md.get()) {
-        return detail::unicode_GooString_to_ustring(md.get());
+        return ustring::from_utf8(md->getCString(), md->getLength());
     }
     return ustring();
 }
diff --git a/cpp/poppler-embedded-file.cpp b/cpp/poppler-embedded-file.cpp
index 0ed5a6be..8c1c08c9 100644
--- a/cpp/poppler-embedded-file.cpp
+++ b/cpp/poppler-embedded-file.cpp
@@ -88,7 +88,7 @@ std::string embedded_file::name() const
 ustring embedded_file::description() const
 {
     GooString *goo = d->file_spec->getDescription();
-    return goo ? detail::unicode_GooString_to_ustring(goo) : ustring();
+    return goo ? ustring::from_utf8(goo->getCString(), goo->getLength()) : ustring();
 }
 
 /**
diff --git a/cpp/poppler-page.cpp b/cpp/poppler-page.cpp
index a4ebfe51..70b0ba4a 100644
--- a/cpp/poppler-page.cpp
+++ b/cpp/poppler-page.cpp
@@ -163,7 +163,7 @@ ustring page::label() const
         return ustring();
     }
 
-    return detail::unicode_GooString_to_ustring(&goo);
+    return ustring::from_utf8(goo.getCString(), goo.getLength());
 }
 
 /**
@@ -285,7 +285,7 @@ ustring page::text(const rectf &r, text_layout_enum layout_mode) const
     } else {
         s.reset(td.getText(r.left(), r.top(), r.right(), r.bottom()));
     }
-    return ustring::from_utf8(s->getCString());
+    return ustring::from_utf8(s->getCString(), s->getLength());
 }
 
 /*
@@ -355,7 +355,7 @@ std::vector<text_box> page::text_list() const
             TextWord *word = word_list->get(i);
 
             std::unique_ptr<GooString> gooWord{word->getText()};
-            ustring ustr = detail::unicode_GooString_to_ustring(gooWord.get());
+            ustring ustr = ustring::from_utf8(gooWord->getCString(), gooWord->getLength());
 
             double xMin, yMin, xMax, yMax;
             word->getBBox(&xMin, &yMin, &xMax, &yMax);
diff --git a/cpp/poppler-private.cpp b/cpp/poppler-private.cpp
index d326d1b3..27efa054 100644
--- a/cpp/poppler-private.cpp
+++ b/cpp/poppler-private.cpp
@@ -56,41 +56,6 @@ rectf detail::pdfrectangle_to_rectf(const PDFRectangle &pdfrect)
     return rectf(pdfrect.x1, pdfrect.y1, pdfrect.x2 - pdfrect.x1, pdfrect.y2 - pdfrect.y1);
 }
 
-ustring detail::unicode_GooString_to_ustring(GooString *str)
-{
-    const char *data = str->getCString();
-    const int len = str->getLength();
-
-    int i = 0;
-    bool is_unicode = false;
-    if ((data[0] & 0xff) == 0xfe && (len > 1 && (data[1] & 0xff) == 0xff)) {
-        is_unicode = true;
-        i = 2;
-    }
-    ustring::size_type ret_len = len - i;
-    if (is_unicode) {
-        ret_len >>= 1;
-    }
-    ustring ret(ret_len, 0);
-    size_t ret_index = 0;
-    ustring::value_type u;
-    if (is_unicode) {
-        while (i < len) {
-            u = ((data[i] & 0xff) << 8) | (data[i + 1] & 0xff);
-            i += 2;
-            ret[ret_index++] = u;
-        }
-    } else {
-        while (i < len) {
-            u = data[i] & 0xff;
-            ++i;
-            ret[ret_index++] = u;
-        }
-    }
-
-    return ret;
-}
-
 ustring detail::unicode_to_ustring(const Unicode *u, int length)
 {
     ustring str(length * 2, 0);
diff --git a/cpp/poppler-private.h b/cpp/poppler-private.h
index d954bdb2..fdba2d35 100644
--- a/cpp/poppler-private.h
+++ b/cpp/poppler-private.h
@@ -49,7 +49,6 @@ void error_function(void *data, ErrorCategory category, Goffset pos, char *msg);
 
 rectf pdfrectangle_to_rectf(const PDFRectangle &pdfrect);
 
-ustring unicode_GooString_to_ustring(GooString *str);
 ustring unicode_to_ustring(const Unicode *u, int length);
 GooString* ustring_to_unicode_GooString(const ustring &str);
 
diff --git a/cpp/poppler-document.cpp b/cpp/poppler-document.cpp
index ae1b493e..92ebdf51 100644
--- a/cpp/poppler-document.cpp
+++ b/cpp/poppler-document.cpp
@@ -342,7 +342,14 @@ ustring document::info_key(const std::string &key) const
         return ustring();
     }
 
-    return ustring::from_utf8(goo_value->getCString(), goo_value->getLength());
+    if (goo_value->hasUnicodeMarker()) {
+        // if a BOM is found in the octet buffer, we assume as it is UTF-16.
+        // some iconv() generates UTF-8 with unneeded BOM in UTF-8 expression,
+        // the beginning BOM is skipped.
+        return ustring::from_utf16(goo_value->getCString() + 2, goo_value->getLength() - 2);
+    } else {
+        return ustring::from_utf8(goo_value->getCString(), goo_value->getLength());
+    }
 }
 
 /**
diff --git a/cpp/poppler-global.cpp b/cpp/poppler-global.cpp
index cc51e3d3..db52b90b 100644
--- a/cpp/poppler-global.cpp
+++ b/cpp/poppler-global.cpp
@@ -226,7 +226,11 @@ byte_array ustring::to_utf8() const
         return byte_array();
     }
 
-    MiniIconv ic("UTF-8", "UTF-16");
+#ifdef WORDS_BIGENDIAN
+    MiniIconv ic("UTF-8", "UTF-16BE");
+#else
+    MiniIconv ic("UTF-8", "UTF-16LE");
+#endif
     if (!ic.is_valid()) {
         return byte_array();
     }
@@ -265,6 +269,26 @@ std::string ustring::to_latin1() const
     return ret;
 }
 
+ustring ustring::from_utf16(const char *str, int len)
+{
+
+    // strlen for UTF-16 buffer.
+    if (len < 0) {
+        for (int i = 0; ; i += 2) {
+            if (str[i] == 0 && str[i + 1] == 0) {
+                len = i;
+                break;
+            }
+        }
+    }
+
+    ustring ret((len/2), 0);
+    for (int i = 0; i < len; i += 2) {
+        ret[i / 2] = (str[i] << 8) + (str[i + 1]);
+    }
+    return ret;
+}
+
 ustring ustring::from_utf8(const char *str, int len)
 {
     if (len <= 0) {
@@ -274,7 +298,11 @@ ustring ustring::from_utf8(const char *str, int len)
         }
     }
 
-    MiniIconv ic("UTF-16", "UTF-8");
+#ifdef WORDS_BIGENDIAN
+    MiniIconv ic("UTF-16BE", "UTF-8");
+#else
+    MiniIconv ic("UTF-16LE", "UTF-8");
+#endif
     if (!ic.is_valid()) {
         return ustring();
     }
diff --git a/cpp/poppler-global.h b/cpp/poppler-global.h
index eb7ec244..8a0ff4b6 100644
--- a/cpp/poppler-global.h
+++ b/cpp/poppler-global.h
@@ -98,6 +98,9 @@ class POPPLER_CPP_EXPORT ustring : public std::basic_string<unsigned short>
     ustring(const std::string &);
     operator std::string() const;
     ustring& operator=(const std::string &);
+
+    static ustring from_utf16(const char *str, int len);
+    friend class document;
 };
 #ifdef _MSC_VER
 #pragma warning(pop)
diff --git a/cpp/poppler-private.cpp b/cpp/poppler-private.cpp
index 27efa054..711f37ab 100644
--- a/cpp/poppler-private.cpp
+++ b/cpp/poppler-private.cpp
@@ -58,7 +58,7 @@ rectf detail::pdfrectangle_to_rectf(const PDFRectangle &pdfrect)
 
 ustring detail::unicode_to_ustring(const Unicode *u, int length)
 {
-    ustring str(length * 2, 0);
+    ustring str(length, 0);
     ustring::iterator it = str.begin();
     const Unicode *uu = u;
     for (int i = 0; i < length; ++i) {
_______________________________________________
poppler mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/poppler

Reply via email to