[Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
(Sorry if this messes up the thread order; it is meant as a reply to the original RFC.)

Dear list,

newbie here. After much hesitation I decided to put forward a use case which bothers me about the current proposal. Disclaimer: I happen to write a library which is directly influenced by this.

As you may know, PDF operates over bytes, and an integer or floating-point number is written down as-is, for example "100" or "1.23".

However, the proposal drops the "%d", "%f" and "%x" formats, and the suggested workaround for writing down a number is to use ".encode('ascii')", which I think has two problems:

One is that it needs to construct one additional object per formatting operation, as opposed to Python 2; it is not uncommon for a PDF file to contain millions of numbers.

The second problem is that, in my eyes, it is very counter-intuitive to require the use of str only to get formatting on bytes. Consider the case where a large bytes object is created out of many smaller bytes objects. If I wanted to format a part, I would have to use str instead. For example:

    content = b''.join([
        b'header',
        b'some dictionary structure',
        b'part 1 abc',
        ('part 2 %.3f' % number).encode('ascii'),
        b'trailer'])

In the case of PDF, the embedding of an image into PDF looks like:

    10 0 obj
    << /Type /XObject
       /Width 100
       /Height 100
       /Alternates 15 0 R
       /Length 2167
    >>
    stream
    ...binary image data...
    endstream
    endobj

Because of the image, it makes sense to store such a structure inside bytes. On the other hand, there may well be another "obj" which contains the coordinates of Bezier paths:

    11 0 obj
    ...
    stream
    0.5 0.1 0.2 RG
    300 300 m
    300 400 400 400 400 300 c
    b
    endstream
    endobj

To summarize, there are cases which mix "binary" and "text" and, in my opinion, dropping the bytes-formatting of numbers makes it more complicated than it was.

I would appreciate any explanation on how:

    b'%.1f %.1f %.1f RG' % (r, g, b)

is more confusing than:

    b'%s %s %s RG' % tuple(map(lambda x: (u'%.1f' % x).encode('ascii'), (r, g, b)))

A similar situation exists for HTTP ("Content-Length: 123") and ASCII STL ("vertex 1.0 0.0 0.0").

Thanks and have a nice day,

Juraj Sukop

PS: In case the proposal will not include the number formatting, it would be nice to list there a set of guidelines or examples on how to proceed with porting Python 2 formats to Python 3.

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com
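Editorial sketch of the two spellings compared in the message above, writing the "RG" operator either by formatting in str and encoding, or by formatting directly in bytes. The function names are made up for illustration. The bytes form was not available when this message was written; it assumes Python 3.5 or later, where PEP 461 added %-formatting (including %d, %f and %x) to bytes.

```python
def rg_via_str(r, g, b):
    """Format in text space, then encode (works on any Python 3)."""
    return ('%.1f %.1f %.1f RG' % (r, g, b)).encode('ascii')


def rg_via_bytes(r, g, b):
    """Format directly in bytes (assumes Python 3.5+, PEP 461)."""
    return b'%.1f %.1f %.1f RG' % (r, g, b)


# Both spellings produce the same PDF operator line.
line_from_str = rg_via_str(0.5, 0.1, 0.2)
line_from_bytes = rg_via_bytes(0.5, 0.1, 0.2)
```

The str route creates an intermediate str object per call, which is the per-number overhead the message is concerned about.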
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, Jan 10, 2014 at 10:52 PM, Chris Barker wrote:

> On Fri, Jan 10, 2014 at 9:17 AM, Juraj Sukop wrote:
>
>> As you may know, PDF operates over bytes and an integer or floating-point
>> number is written down as-is, for example "100" or "1.23".
>
> Just to be clear here -- is PDF specifically bytes+ascii?
>
> Or could there be some-other-encoding unicode in there?

From the specs: "At the most fundamental level, a PDF file is a sequence of 8-bit bytes." But it is also possible to represent a PDF using printable ASCII + whitespace by using escapes and "filters". Then, there are also "text strings" which might be encoded in UTF-16.

What this all means is that the PDF objects are expressed in ASCII, "stream" objects like images and fonts may have a binary part, and I have never seen those UTF-16 strings.

> u"stream\n%s\nendstream\nendobj"%binary_data.decode('latin-1')

The argument for dropping "%f" et al. has been that if something is text, then it should be Unicode. Conversely, if it is not text, then it should not be Unicode.
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Fri, Jan 10, 2014 at 11:12 PM, Victor Stinner wrote:

> Why not building "10 0 obj ... stream" and "endstream endobj" in
> Unicode and then encode to ASCII? Example:
>
> data = b''.join((
>     ("%d %d obj ... stream" % (10, 0)).encode('ascii'),
>     binary_image_data,
>     ("endstream endobj").encode('ascii'),
> ))

The key is "encode to ASCII", which means that the result is bytes. Then, there is this "11 0 obj" which should also be bytes. But it has no "binary_image_data", only lots of numbers waiting to be somehow converted to bytes.

I already mentioned the problems with ".encode('ascii')", but it does not stop there. Numbers may appear not only inside "streams" but almost anywhere: in the header there is the PDF version, an image has to have a "width" and "height", and at the end of the PDF there is a structure containing offsets to all of the objects in the file. Basically, having to ".encode('ascii')" every possible number is not exactly simple or pretty.
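Editorial sketch of the approach quoted above, applied to the image object from the first message: the numeric header is formatted as str and encoded to ASCII, and the binary payload is passed through untouched. The function name and parameters are hypothetical, and the dictionary is simplified, so this is not a complete or valid image XObject.

```python
def image_object(obj_num, width, height, image_data):
    """Build a simplified PDF stream object around raw image bytes."""
    # Every number in the dictionary goes through str formatting first.
    header = (
        '%d 0 obj\n'
        '<< /Type /XObject /Width %d /Height %d /Length %d >>\n'
        'stream\n' % (obj_num, width, height, len(image_data))
    ).encode('ascii')
    # The binary payload is never decoded; it is joined as bytes.
    return b''.join((header, image_data, b'\nendstream\nendobj\n'))


obj = image_object(10, 100, 100, b'\x00\xff\x10')
```

Each place where a number appears needs its own format-then-encode step, which is the repetition the reply objects to.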
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sat, Jan 11, 2014 at 12:49 AM, Antoine Pitrou wrote:

> Also, when you say you've never encountered UTF-16 text in PDFs, it
> sounds like those people who've never encountered any non-ASCII data in
> their programs.

Let me clarify: one does not think in terms of "writing text in Unicode" in PDF. Instead, one records the sequence of "character codes" which correspond to "glyphs", or the glyph IDs directly. That's because one Unicode character may have more than one glyph, and several characters can be shown as one glyph.
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sat, Jan 11, 2014 at 5:14 AM, Cameron Simpson wrote:

> Hi Juraj,

Hello Cameron.

> data = b' '.join( bytify( [ 10, 0, obj, binary_image_data, ... ] ) )

Thanks for the suggestion! The problem with "bytify" is that some items might require different formatting than other items. For example, in the "Cross-Reference Table" there are three different formats: a non-padded integer ("1") and zero-padded 10- and 5-digit integers ("0000000003", "65535").
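Editorial sketch of why one generic "bytify" does not fit a PDF cross-reference table: a section header uses non-padded integers, while each entry uses a zero-padded 10-digit field and a zero-padded 5-digit field. The function names are hypothetical, and the 20-byte entry layout is written from memory of the PDF specification, so treat it as an assumption to check against the spec.

```python
def xref_subsection_header(first_obj, count):
    """Non-padded integers, e.g. b'0 6\\n'."""
    return ('%d %d\n' % (first_obj, count)).encode('ascii')


def xref_entry(field, generation, in_use=True):
    """Zero-padded 10-digit and 5-digit integers plus an n/f flag.

    `field` is a byte offset for in-use entries; the entry is assumed
    to be exactly 20 bytes including its two-byte end of line.
    """
    flag = 'n' if in_use else 'f'
    return ('%010d %05d %s \n' % (field, generation, flag)).encode('ascii')


header = xref_subsection_header(0, 6)
free_entry = xref_entry(3, 65535, in_use=False)
used_entry = xref_entry(17, 0)
```

The same integer type needs three different widths depending on where it is written, so the caller has to pick the format per field.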
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sat, Jan 11, 2014 at 6:36 AM, Steven D'Aprano wrote:

> I'm sorry, I don't understand what you mean here. I'm honestly not
> trying to be difficult, but you sound confident that you understand what
> you are doing, but your description doesn't make sense to me. To me, it
> looks like you are conflating bytes and ASCII characters, that is,
> assuming that characters "are" in some sense identical to their ASCII
> representation. Let me explain:
>
> The integer that in English is written as 100 is represented in memory
> as bytes 0x0064 (assuming a big-endian C short), so when you say "an
> integer is written down AS-IS" (emphasis added), to me that says that
> the PDF file includes the bytes 0x0064. But then you go on to write the
> three character string "100", which (assuming ASCII) is the bytes
> 0x313030. Going from the C short to the ASCII representation 0x313030 is
> nothing like inserting the int "as-is". To put it another way, the
> Python 2 '%d' format code does not just copy bytes.

Sorry, I should've included an example: when I said "as-is" I meant "1", "0", "0", so that would be your "0x313030".

> If you consider PDF as binary with occasional pieces of ASCII text, then
> working with bytes makes sense. But I wonder whether it might be better
> to consider PDF as mostly text with some binary bytes. Even though the
> bulk of the PDF will be binary, the interesting bits are text. E.g. your
> example:
>
> Even though the binary image data is probably much, much larger in
> length than the text shown above, it's (probably) trivial to deal with:
> convert your image data into bytes, decode those bytes into Latin-1,
> then concatenate the Latin-1 string into the text above.

This is similar to what Chris Barker suggested. I am also not trying to be difficult here, but please explain one thing to me. To treat bytes as if they were Latin-1 is a bad idea; that's why "%f" got dropped in the first place, right? How is it then alright to put an image inside a Unicode string?

Also, apart from the in/out conversions, do any other difficulties come to your mind?

> Please also take note that in Python 3.3 and better, the internal
> representation of Unicode strings containing only code points up to 255
> (i.e. pure ASCII or pure Latin-1) is very efficient, using only one byte
> per character.

I guess you meant [C]Python...

In any case, thanks for the detailed reply.
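Editorial sketch of the two readings of "as-is" discussed above: the in-memory bytes of a big-endian C short versus the ASCII digits that PDF actually stores. It uses only the standard struct module; the variable names are illustrative.

```python
import struct

n = 100

# What a big-endian C short holding 100 looks like in memory: 0x0064.
as_c_short = struct.pack('>h', n)

# What PDF writes for the number 100: the ASCII digits "1", "0", "0",
# that is, the bytes 0x31 0x30 0x30.
as_ascii_digits = str(n).encode('ascii')
```

PDF numbers use the second form, which is why writing them is a formatting operation and not a memory copy.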
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano wrote:

> On Sat, Jan 11, 2014 at 08:13:39PM -0200, Mariano Reingart wrote:
>
> > AFAIK (and just for the record), there could be both Latin1 text and UTF-16
> > in a PDF (and other encodings too), depending on the font used:
> [...]
> > In Python2, txt is just a str, but in Python3 handling everything as latin1
> > string obviously doesn't work for TTF in this case.
>
> Nobody is suggesting that you use Latin-1 for *everything*. We're
> suggesting that you use it for blobs of binary data that represent
> arbitrary bytes. First you have to get your binary data in the first
> place, using whatever technique is necessary.

Just to check I understood what you are saying. Instead of writing:

    content = b'\n'.join([
        b'header',
        b'part 2 %.3f' % number,
        binary_image_data,
        utf16_string.encode('utf-16be'),
        b'trailer'])

it should now look like:

    content = '\n'.join([
        'header',
        'part 2 %.3f' % number,
        binary_image_data.decode('latin-1'),
        utf16_string.encode('utf-16be').decode('latin-1'),
        'trailer']).encode('latin-1')

Correct?
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
On Sun, Jan 12, 2014 at 2:16 PM, Nick Coghlan wrote:

> Why are you proposing to do the *join* in text space? Encode all the parts
> separately, concatenate them with b'\n'.join() (or whatever separator is
> appropriate). It's only the *text formatting operation* that needs to be
> done in text space and then explicitly encoded (and this example doesn't
> even need latin-1, ASCII is sufficient):

I apparently misunderstood what Steven was suggesting; thanks for the clarification.
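Editorial sketch contrasting the two strategies discussed in this exchange. The code from the quoted message is not preserved here, so both functions are reconstructions under assumed names: one encodes each part separately and joins in bytes space, the other passes every byte through Latin-1 and joins in text space.

```python
def join_in_bytes(number, binary_image_data, text):
    """Format in str, encode each part, then join in bytes space."""
    return b'\n'.join([
        b'header',
        ('part 2 %.3f' % number).encode('ascii'),  # ASCII is sufficient
        binary_image_data,                         # never decoded
        text.encode('utf-16be'),
        b'trailer'])


def join_in_text(number, binary_image_data, text):
    """Route every byte through Latin-1, join in str, encode at the end."""
    return '\n'.join([
        'header',
        'part 2 %.3f' % number,
        binary_image_data.decode('latin-1'),
        text.encode('utf-16be').decode('latin-1'),
        'trailer']).encode('latin-1')


image = bytes(range(256))           # stand-in for arbitrary binary data
from_bytes = join_in_bytes(1.5, image, '\u0100')
from_text = join_in_text(1.5, image, '\u0100')
```

Both produce identical output because Latin-1 maps bytes 0 to 255 one-to-one onto code points; the bytes-space join simply avoids decoding and re-encoding the binary parts.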
Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5
Wait a second, this is how I understood it, but what Nick said made me think otherwise...

On Sun, Jan 12, 2014 at 6:22 PM, Steven D'Aprano wrote:

> On Sun, Jan 12, 2014 at 12:52:18PM +0100, Juraj Sukop wrote:
> > On Sun, Jan 12, 2014 at 2:35 AM, Steven D'Aprano wrote:
> >
> > Just to check I understood what you are saying. Instead of writing:
> >
> >     content = b'\n'.join([
> >         b'header',
> >         b'part 2 %.3f' % number,
> >         binary_image_data,
> >         utf16_string.encode('utf-16be'),
> >         b'trailer'])
>
> Which doesn't work, since bytes don't support %f in Python 3.

I know, and this was an example of the ideal (for me, anyway) way of formatting bytes.

> First, "utf16_string" confuses me. What is it? If it is a Unicode
> string, i.e.:

It is a Unicode string which happens to contain code points outside U+00FF (as with the TTF example above), so that it triggers the (at least) 2-byte memory representation in CPython 3.3+. I agree, I chose the variable name poorly, my bad.

>     content = '\n'.join([
>         'header',
>         'part 2 %.3f' % number,
>         binary_image_data.decode('latin-1'),
>         utf16_string,  # Misleading name, actually Unicode string
>         'trailer'])

Which, because of that horribly-named variable, prevents the use of a simple memcpy and makes the image data occupy way more memory than when it was in simple bytes.

> Both examples assume that you intend to do further processing of content
> before sending it, and will encode just before sending:

Not really, I was interested in comparing it to bytes formatting, hence it included the "encode()" as well.
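Editorial sketch of the memory point made above. It assumes CPython 3.3 or later, where a str is stored with 1, 2 or 4 bytes per character depending on its widest code point; other implementations may behave differently, and the exact sizes vary by version, so only the relative growth is meaningful.

```python
import sys

# Binary data decoded as Latin-1: every code point is <= U+00FF,
# so CPython can store it with one byte per character.
image_text = bytes(range(256)).decode('latin-1') * 4
narrow_size = sys.getsizeof(image_text)

# Appending a single code point above U+00FF forces the whole
# resulting string into the two-bytes-per-character representation.
wide_size = sys.getsizeof(image_text + '\u0100')
```

On CPython the second string is roughly twice as large as the first, even though only one character was added.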
Re: [Python-Dev] Internal representation of strings and Micropython
On Wed, Jun 4, 2014 at 11:36 AM, Stephen J. Turnbull wrote:

> I think you really need to check what the applications are in detail.
> UTF-8 costs about 35% more storage for Japanese, and even more for
> Chinese, than does UTF-16.

"UTF-8 can be smaller even for Asian languages, e.g.: front page of Wikipedia Japan: 83 kB in UTF-8, 144 kB in UTF-16"

From http://www.lua.org/wshop12/Ierusalimschy.pdf (p. 12)
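Editorial sketch of how both statements above can hold at once: plain Japanese text is larger in UTF-8, but once it is wrapped in ASCII markup (as on a web page) UTF-8 can come out smaller. The sample strings and the helper name are made up for illustration and are not taken from the cited slides.

```python
def encoded_sizes(text):
    """Return (UTF-8 size, UTF-16 size) in bytes, without a BOM."""
    return len(text.encode('utf-8')), len(text.encode('utf-16-le'))


plain = 'こんにちは世界'                          # 7 characters, all in the BMP
markup = '<p class="greeting">' + plain + '</p>'  # same text plus ASCII tags

plain_utf8, plain_utf16 = encoded_sizes(plain)
markup_utf8, markup_utf16 = encoded_sizes(markup)
```

Each kana or kanji here costs 3 bytes in UTF-8 and 2 in UTF-16, while each ASCII character costs 1 and 2 respectively, so the balance depends on how much markup surrounds the text.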