[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Juraj Sukop
(Sorry if this messes up the thread order; it is meant as a reply to the
original RFC.)

Dear list,

Newbie here. After much hesitation I decided to put forward a use case
that bothers me about the current proposal. Disclaimer: I am writing a
library which is directly affected by this.

As you may know, PDF operates over bytes and an integer or floating-point
number is written down as-is, for example "100" or "1.23".

However, the proposal drops "%d", "%f" and "%x" formats and the suggested
workaround for writing down a number is to use ".encode('ascii')", which I
think has two problems:

One is that it needs to construct one additional object per formatting
operation compared to Python 2; it is not uncommon for a PDF file to
contain millions of numbers.

The second problem is that, in my eyes, it is very counter-intuitive to
require the use of str only to get formatting on bytes. Consider the case
where a large bytes object is created out of many smaller bytes objects. If
I wanted to format one part, I would have to use str for it instead. For
example:

content = b''.join([
    b'header',
    b'some dictionary structure',
    b'part 1 abc',
    ('part 2 %.3f' % number).encode('ascii'),
    b'trailer'])

In the case of PDF, an embedded image looks like:

10 0 obj
  << /Type /XObject
     /Width 100
     /Height 100
     /Alternates 15 0 R
     /Length 2167
  >>
stream
...binary image data...
endstream
endobj

Because of the image it makes sense to store such a structure inside bytes.
On the other hand, there may well be another "obj" which contains the
coordinates of Bezier paths:

11 0 obj
...
stream
0.5 0.1 0.2 RG
300 300 m
300 400 400 400 400 300 c
b
endstream
endobj

To summarize, there are cases which mix "binary" and "text" and, in my
opinion, dropping the bytes-formatting of numbers makes it more complicated
than it was. I would appreciate any explanation of how:

b'%.1f %.1f %.1f RG' % (r, g, b)

is more confusing than:

b'%s %s %s RG' % tuple(map(lambda x: (u'%.1f' % x).encode('ascii'),
                           (r, g, b)))

A similar situation exists for HTTP ("Content-Length: 123") and ASCII STL
("vertex 1.0 0.0 0.0").
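
For illustration only, the str-based workaround could be wrapped in a small
helper so that the call site stays close to the bytes-formatting form. The
name fmt_ascii is hypothetical (it is not part of the proposal or of any
library), and the sketch assumes the formatted result is pure ASCII:

```python
def fmt_ascii(template, *args):
    """Format in str space, then encode the result as ASCII bytes.

    Hypothetical helper: `template` is a str using %-style format codes.
    """
    return (template % args).encode('ascii')


r, g, b = 0.5, 0.1, 0.2
line = fmt_ascii('%.1f %.1f %.1f RG', r, g, b)
```

This only tidies the call site. It still builds an intermediate str for
every formatting operation, so it does not address the first problem above.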

Thanks and have a nice day,

Juraj Sukop

PS: If the proposal ends up without number formatting, it would be nice
for it to include a set of guidelines or examples on how to port Python 2
format strings to Python 3.
___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Juraj Sukop
On Fri, Jan 10, 2014 at 10:52 PM, Chris Barker wrote:

> On Fri, Jan 10, 2014 at 9:17 AM, Juraj Sukop wrote:
>
>> As you may know, PDF operates over bytes and an integer or floating-point
>> number is written down as-is, for example "100" or "1.23".
>>
>
> Just to be clear here -- is PDF specifically bytes+ascii?
>
> Or could there be some-other-encoding unicode in there?
>

From the specs: "At the most fundamental level, a PDF file is a sequence of
8-bit bytes." But it is also possible to represent a PDF using printable
ASCII + whitespace by using escapes and "filters". Then, there are also
"text strings" which might be encoded in UTF-16.

What this all means is that the PDF objects are expressed in ASCII,
"stream" objects like images and fonts may have a binary part, and I have
never seen those UTF-16 strings.


> u"stream\n%s\nendstream\nendobj"%binary_data.decode('latin-1')
>

The argument for dropping "%f" et al. has been that if something is text,
then it should be Unicode. Conversely, if it is not text, then it should
not be Unicode.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Juraj Sukop
On Fri, Jan 10, 2014 at 11:12 PM, Victor Stinner wrote:

>
> Why not build "10 0 obj ... stream" and "endstream endobj" in
> Unicode and then encode to ASCII? Example:
>
> data = b''.join((
>   ("%d %d obj ... stream" % (10, 0)).encode('ascii'),
>   binary_image_data,
>   ("endstream endobj").encode('ascii'),
> ))
>

The key is "encode to ASCII", which means that the result is bytes. Then,
there is this "11 0 obj" which should also be bytes. But it has no
"binary_image_data", only lots of numbers waiting to be somehow converted
to bytes. I already mentioned the problems with ".encode('ascii')" but it
does not stop there. Numbers may appear not only inside "streams" but almost
anywhere: the header contains the PDF version, an image has to have "width"
and "height", and at the end of the PDF there is a structure containing the
offsets of all the objects in the file. Basically, calling
".encode('ascii')" on every possible number is neither simple nor pretty.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-10 Thread Juraj Sukop
On Sat, Jan 11, 2014 at 12:49 AM, Antoine Pitrou wrote:

> Also, when you say you've never encountered UTF-16 text in PDFs, it
> sounds like those people who've never encountered any non-ASCII data in
> their programs.


Let me clarify: in PDF one does not think in terms of "writing text in
Unicode". Instead, one records the sequence of "character codes" which
correspond to "glyphs", or the glyph IDs directly. That's because one
Unicode character may have more than one glyph and several characters can
be shown as one glyph.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Juraj Sukop
On Sat, Jan 11, 2014 at 5:14 AM, Cameron Simpson wrote:

>
> Hi Juraj,
>

Hello Cameron.


>   data = b' '.join( bytify( [ 10, 0, obj, binary_image_data, ... ] ) )
>

Thanks for the suggestion! The problem with "bytify" is that some items
might require different formatting than others. For example, the
"Cross-Reference Table" uses three different formats: a non-padded
integer ("1"), a 10-digit integer ("0000000003") and a 5-digit integer
("65535").
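
As a rough sketch of why a single generic conversion is not enough, the
three formats could be produced with explicit format codes via the
str-then-encode workaround. The helper name xref_entry is hypothetical, and
the 20-byte entry layout (10-digit number, 5-digit generation, a keyword
and a 2-byte end-of-line) is my reading of the PDF cross-reference table,
so treat it as an assumption:

```python
def xref_entry(offset, generation, in_use=True):
    """Build one cross-reference entry as ASCII bytes (hypothetical helper).

    Assumed layout: zero-padded 10-digit number, zero-padded 5-digit
    generation, 'n' (in use) or 'f' (free), then a 2-byte end-of-line.
    """
    keyword = 'n' if in_use else 'f'
    return ('%010d %05d %s \n' % (offset, generation, keyword)).encode('ascii')


# Subsection header: two non-padded integers.
subsection_header = ('%d %d\n' % (0, 2)).encode('ascii')
free_entry = xref_entry(0, 65535, in_use=False)
used_entry = xref_entry(17, 0)
```

Each item needs its own width and padding, which is exactly what a format
string expresses and what a uniform "bytify" step cannot know.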


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-11 Thread Juraj Sukop
On Sat, Jan 11, 2014 at 6:36 AM, Steven D'Aprano wrote:

>
> I'm sorry, I don't understand what you mean here. I'm honestly not
> trying to be difficult, but you sound confident that you understand what
> you are doing, but your description doesn't make sense to me. To me, it
> looks like you are conflating bytes and ASCII characters, that is,
> assuming that characters "are" in some sense identical to their ASCII
> representation. Let me explain:
>
> The integer that in English is written as 100 is represented in memory
> as bytes 0x0064 (assuming a big-endian C short), so when you say "an
> integer is written down AS-IS" (emphasis added), to me that says that
> the PDF file includes the bytes 0x0064. But then you go on to write the
> three character string "100", which (assuming ASCII) is the bytes
> 0x313030. Going from the C short to the ASCII representation 0x313030 is
> nothing like inserting the int "as-is". To put it another way, the
> Python 2 '%d' format code does not just copy bytes.
>

Sorry, I should've included an example: when I said "as-is" I meant "1",
"0", "0", so that would be your "0x313030".


> If you consider PDF as binary with occasional pieces of ASCII text, then
> working with bytes makes sense. But I wonder whether it might be better
> to consider PDF as mostly text with some binary bytes. Even though the
> bulk of the PDF will be binary, the interesting bits are text. E.g. your
> example:
>
> Even though the binary image data is probably much, much larger in
> length than the text shown above, it's (probably) trivial to deal with:
> convert your image data into bytes, decode those bytes into Latin-1,
> then concatenate the Latin-1 string into the text above.
>

This is similar to what Chris Barker suggested. I am not trying to be
difficult here either, but please explain one thing to me. Treating bytes
as if they were Latin-1 is a bad idea, and that is why "%f" got dropped in
the first place, right? How is it then all right to put an image inside a
Unicode string?

Also, apart from the in/out conversions, do any other difficulties come to
your mind?

> Please also take note that in Python 3.3 and better, the internal
> representation of Unicode strings containing only code points up to 255
> (i.e. pure ASCII or pure Latin-1) is very efficient, using only one byte
> per character.
>

I guess you meant [C]Python...

In any case, thanks for the detailed reply.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Juraj Sukop
On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano wrote:

> On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:
>
> > AFAIK (and just for the record), there could be both Latin1 text and
> > UTF-16 in a PDF (and other encodings too), depending on the font used:
> [...]
> > In Python2, txt is just a str, but in Python3 handling everything as
> > latin1 string obviously doesn't work for TTF in this case.
>
> Nobody is suggesting that you use Latin-1 for *everything*. We're
> suggesting that you use it for blobs of binary data that represent
> arbitrary bytes. First you have to get your binary data in the first
> place, using whatever technique is necessary.


Just to check I understood what you are saying. Instead of writing:

content = b'\n'.join([
    b'header',
    b'part 2 %.3f' % number,
    binary_image_data,
    utf16_string.encode('utf-16be'),
    b'trailer'])

it should now look like:

content = '\n'.join([
    'header',
    'part 2 %.3f' % number,
    binary_image_data.decode('latin-1'),
    utf16_string.encode('utf-16be').decode('latin-1'),
    'trailer']).encode('latin-1')

Correct?
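
A quick check that the two spellings yield the same bytes, since Latin-1
maps every byte value 0..255 to the code point with the same number. The
sample values are made up for illustration, and the bytes-space version
formats the number through the str-then-encode workaround because bytes
have no "%f" under the current proposal:

```python
number = 1.23
binary_image_data = bytes(range(256))   # stand-in covering every byte value
utf16_string = 'Łódź'                    # sample text with code points above U+00FF

# Join in bytes space.
via_bytes = b'\n'.join([
    b'header',
    ('part 2 %.3f' % number).encode('ascii'),
    binary_image_data,
    utf16_string.encode('utf-16be'),
    b'trailer'])

# Join in text space, carrying the raw bytes through Latin-1.
via_text = '\n'.join([
    'header',
    'part 2 %.3f' % number,
    binary_image_data.decode('latin-1'),
    utf16_string.encode('utf-16be').decode('latin-1'),
    'trailer']).encode('latin-1')
```

The results are equal, so the question is about readability and cost
rather than about correctness of the output.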


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Juraj Sukop
On Sun, Jan 12, 2014 at 2:16 PM, Nick Coghlan wrote:

> Why are you proposing to do the *join* in text space? Encode all the parts
> separately, concatenate them with b'\n'.join() (or whatever separator is
> appropriate). It's only the *text formatting operation* that needs to be
> done in text space and then explicitly encoded (and this example doesn't
> even need latin-1,ASCII is sufficient):
>
I apparently misunderstood what Steven was suggesting, thanks for the
clarification.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-12 Thread Juraj Sukop
Wait a second, this is how I understood it but what Nick said made me think
otherwise...

On Sun, Jan 12, 2014 at 6:22 PM, Steven D'Aprano wrote:

> On Sun, Jan 12, 2014 at 12:52:18PM +0100, Juraj Sukop wrote:
> > On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano wrote:
> >
> > Just to check I understood what you are saying. Instead of writing:
> >
> > content = b'\n'.join([
> >     b'header',
> >     b'part 2 %.3f' % number,
> >     binary_image_data,
> >     utf16_string.encode('utf-16be'),
> >     b'trailer'])
>
> Which doesn't work, since bytes don't support %f in Python 3.
>

I know; this was an example of the ideal (for me, anyway) way of
formatting bytes.


> First, "utf16_string" confuses me. What is it? If it is a Unicode
> string, i.e.:
>

It is a Unicode string which happens to contain code points above U+00FF
(as with the TTF example above), so that it triggers a memory
representation of at least 2 bytes per character in CPython 3.3+. I agree,
I chose the variable name poorly, my bad.


>
> content = '\n'.join([
>     'header',
>     'part 2 %.3f' % number,
>     binary_image_data.decode('latin-1'),
>     utf16_string,  # Misleading name, actually Unicode string
>     'trailer'])
>

Which, because of that horribly named variable, prevents the use of a
simple memcpy and makes the image data occupy far more memory than it did
as plain bytes.
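
A small sketch of the memory point, assuming CPython 3.3 or later with its
flexible string representation. The sizes reported by sys.getsizeof are
implementation details, and the stand-in data is made up:

```python
import sys

image_bytes = bytes(range(256)) * 40        # 10240 bytes of stand-in image data
as_latin1 = image_bytes.decode('latin-1')   # every code point is <= U+00FF
with_wide_char = as_latin1 + '\u0141'       # add one code point above U+00FF

# On CPython 3.3+ the first string stores one byte per character, while a
# single wider character makes the whole string use at least two.
narrow_size = sys.getsizeof(as_latin1)
wide_size = sys.getsizeof(with_wide_char)
```

On such an interpreter wide_size comes out at roughly twice narrow_size,
even though only one character needed the wider storage.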


> Both examples assume that you intend to do further processing of content
> before sending it, and will encode just before sending:
>

Not really, I was interested in comparing it to bytes formatting, hence it
included the "encode()" as well.


Re: [Python-Dev] Internal representation of strings and Micropython

2014-06-04 Thread Juraj Sukop
On Wed, Jun 4, 2014 at 11:36 AM, Stephen J. Turnbull wrote:

>
> I think you really need to check what the applications are in detail.
> UTF-8 costs about 35% more storage for Japanese, and even more for
> Chinese, than does UTF-16.


"UTF-8 can be smaller even for Asian languages, e.g.: front page of
Wikipedia Japan: 83 kB in UTF-8, 144 kB in UTF-16"
From http://www.lua.org/wshop12/Ierusalimschy.pdf (p. 12)
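
Both observations can hold at once, as a tiny sketch with made-up sample
strings shows: pure Japanese text is larger in UTF-8, while the same text
wrapped in ASCII markup can be smaller in UTF-8. The "utf-16-le" codec is
used so that no byte order mark is counted:

```python
def encoded_sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes for the given str."""
    return len(text.encode('utf-8')), len(text.encode('utf-16-le'))


plain = '日本語'                       # three BMP characters, no markup
marked_up = '<p class="x">日本語</p>'  # same text inside ASCII markup

plain_utf8, plain_utf16 = encoded_sizes(plain)
markup_utf8, markup_utf16 = encoded_sizes(marked_up)
```

Here the plain text takes 9 bytes in UTF-8 against 6 in UTF-16, and the
marked-up version takes 26 against 40, so the outcome depends on how much
ASCII surrounds the text.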