Dear Albert,

Thank you, I'm glad to hear that one of the directions could be
acceptable. Support for GlobalParams::textEncoding could perhaps be
discussed in the future, when the cpp frontend introduces an API
to set it to non-Unicode values.

I'm now discussing with Jeroen how to fix the other metadata
(unrelated to the text_list() API); please wait a while.

Regards,
mpsuzuki

Albert Astals Cid wrote:
On Wednesday, 14 March 2018, at 18:17:47 CET, suzuki toshiya wrote:
Dear Adam,

The 2nd option, the iconv + GlobalParams::textEncoding solution, might be
something like:
https://github.com/mpsuzuki/poppler/commit/80f088c4a94b151ddb90e05147a8dc4565e01d89 ?

Seems a bit too much to me.

I've personally had no time to test the other solution you sent (replacing unicode_GooString_to_ustring with from_utf8), but if that one works, it seems much simpler and more straightforward, and I'd like to commit that.

Cheers,
  Albert

Regards,
mpsuzuki

suzuki toshiya wrote:
Oops, I'm very sorry for my mistake, which may have confused people into
thinking my changes are in github.com/freedesktop. The right places are:

sample PDF file
https://github.com/mpsuzuki/poppler/blob/cpp-textlist-encoding-issue/cpp/tests/HereIsUSASCII.pdf

the easiest (and oversimplified) fix for this issue
https://github.com/mpsuzuki/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9

Regards,
mpsuzuki

suzuki toshiya wrote:
Dear Jeroen, Adam,

Sorry for the long delay on this issue. I will try to draft
the solutions suggested by Adam.

However, I'm not sure whether what I'm seeing is the same problem
you are seeing. In my case, the test PDF is:
https://rawgit.com/freedesktop/poppler/1fb325f4b0a92ca28c110d46dc7a9dba2ad94367/cpp/tests/HereIsUSASCII.pdf
(maybe I should provide a PDF containing surrogate characters to
clarify the difference between UTF-8 and UTF-16)
I see that your test code shows the same output for ASCII, but
different output for Cyrillic etc. So the encodings used by text()
and text_list() are different, although their types are the same
(ustring). This should be fixed. However, US-ASCII characters
are not garbled. If that differs from the problem you're
seeing, please let me know.
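
To illustrate why ASCII can come out identical while Cyrillic does not, here is a minimal, self-contained sketch. It does not use poppler at all: utf16_string, decode_utf8() and widen_bytes() are hypothetical names made up for this example, and widen_bytes() only models the assumption that the second code path copies each byte into one 16-bit unit without decoding. Malformed UTF-8 is not handled.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a string of UTF-16 code units (not poppler::ustring).
using utf16_string = std::vector<std::uint16_t>;

// Decode UTF-8 bytes into UTF-16 code units.
// Assumes well-formed UTF-8; error handling is deliberately omitted.
utf16_string decode_utf8(const std::string &in)
{
    utf16_string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes that follow
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F;
            extra = 1;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F;
            extra = 2;
        } else {
            cp = lead & 0x07;
            extra = 3;
        }
        for (std::size_t k = 1; k <= extra && i + k < in.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        }
        i += extra + 1;
        if (cp >= 0x10000) {
            // Code points above the BMP become a surrogate pair in UTF-16.
            cp -= 0x10000;
            out.push_back(static_cast<std::uint16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<std::uint16_t>(0xDC00 + (cp & 0x3FF)));
        } else {
            out.push_back(static_cast<std::uint16_t>(cp));
        }
    }
    return out;
}

// Copy each byte into its own 16-bit unit without any decoding.
// This models the assumed behaviour of the mismatching code path.
utf16_string widen_bytes(const std::string &in)
{
    utf16_string out;
    for (char c : in) {
        out.push_back(static_cast<unsigned char>(c));
    }
    return out;
}
```

For pure US-ASCII input both functions return the same units, so nothing looks garbled. For the two UTF-8 bytes D0 96 (U+0416), decode_utf8() yields the single unit 0x0416, while widen_bytes() yields the two units 0x00D0 and 0x0096. A character outside the BMP would additionally show the surrogate pair that only exists in UTF-16.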

The easiest solution, using ustring::from_utf8(), is now drafted:
https://github.com/freedesktop/poppler/commit/9f2ac5773553a1b83bc8b7bbfc0bc72728fc85f9
Please check whether it works for you. I think it works well in my
environment.

I will proceed to the next one: implementing something like
ustring::from_utf8() that respects GlobalParams::textEncoding.

Regards,
mpsuzuki

Adam Reichold wrote:
Hello Jeroen,

On 06.03.2018 at 12:59, Jeroen Ooms wrote:
On Tue, Mar 6, 2018 at 10:31 AM, Adam Reichold <[email protected]> wrote:
Hello mpsuzuki,

from a glance at the code, it seems page::text uses ustring::from_utf8
to convert Poppler's GooString into ustring, which seems correct if
GlobalParams::textEncoding has its default value of "UTF-8".

I don't understand this part. Why is textEncoding a global property?
Shouldn't this be a property of a single PDF document? Is there some way
I can read a document's encoding from the C++ API (without including
GlobalParams.h)?

The PDF spec states that different strings may have different
encodings. Perhaps it would be possible to expose an encoding field in
the ustring class? If there were a way to know the encoding of a
ustring, I could get the raw data and convert it to a suitable encoding
myself. This would be much better than making assumptions.

This is not the encoding of the text in the PDF document, but the
encoding of the GooStrings that are returned by the internal Poppler API.
Also, I think the ustring class is intended to always store UTF-16
encoded data.

Best regards, Adam.
_______________________________________________
poppler mailing list
[email protected]
https://lists.freedesktop.org/mailman/listinfo/poppler