Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread Mark Shannon

On 06/01/14 13:24, Victor Stinner wrote:

> Hi,
>
> bytes % args and bytes.format(args) are requested by Mercurial and

[snip]

I'm opposed to adding methods to bytes for this, as I think it goes 
against the reason for the separation of str and bytes in the first place.


str objects are pieces of text, a sequence of Unicode characters.
In other words, they have meaning independent of their context.

bytes are just a sequence of 8-bit clumps.
The meaning of bytes depends on the encoding, yet the proposed methods
take no encoding while still presuming meaning.

What does b'%s' % 7 do?
u'%s' % 7 calls 7 .__str__() which returns a (unicode) string.
By implication b'%s' % 7 would call 7 .__str__() and ...
And then what? Use the "default" encoding? ASCII?
Explicit is better than implicit.
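
A minimal sketch of the explicit route, assuming ASCII decimal digits are what the wire format wants (the helper name `ascii_decimal` and the header line are illustrative, not from the thread):

```python
def ascii_decimal(value: int) -> bytes:
    """Render an integer as ASCII decimal digits, naming the encoding explicitly."""
    return str(value).encode("ascii")


# Hypothetical header line assembled without any implicit conversion.
header = b"Content-Length: " + ascii_decimal(7) + b"\r\n"
```

The encoding choice is visible at the call site, which is the property the "explicit is better than implicit" argument is after.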

I am not opposed to adding new functionality, as long as it is not 
overloading the % operator or format() method.


binascii.format() perhaps?

Cheers,
Mark.

___
Python-Dev mailing list
Python-Dev@python.org
https://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
https://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread M.-A. Lemburg
On 06.01.2014 14:24, Victor Stinner wrote:
> Hi,
> 
> bytes % args and bytes.format(args) are requested by Mercurial and
> Twisted projects. The issue #3982 was stuck because nobody proposed a
> complete definition of the "new" features. Here is a try as a PEP.
> 
> The PEP is a draft with open questions. First, I'm not sure that both
> bytes%args and bytes.format(args) are needed. The implementation of
> .format() is more complex, so why not only adding bytes%args?

+1 on doing all of this.

I'd simply copy over the Python 2 PyString code and start working
from there.

Re-adding these features makes life a lot easier in situations where you
have to work on data which is encoded text using multiple (sometimes
even unknown) encodings in a single data chunk. Think MIME messages,
mbox files, diffs, etc.

In such situations you often know the encoding of the part you're
working on (in most cases ASCII), but not necessarily the encodings
of other parts of the chunks.

You could work around this by decoding from Latin-1, then using Unicode
methods and encoding back to Latin-1, but the risk of letting Mojibake
enter your application in uncontrolled ways is high.
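
The Latin-1 workaround described above could look roughly like this. It is only a sketch: the function name and sample data are made up for illustration, and it relies on the fact that Latin-1 maps every byte value to exactly one code point.

```python
def interpolate_via_latin1(template: bytes, *parts: bytes) -> bytes:
    """Format bytes by round-tripping through str using Latin-1."""
    text_template = template.decode("latin-1")
    text_parts = tuple(part.decode("latin-1") for part in parts)
    return (text_template % text_parts).encode("latin-1")


utf8_value = "\u00e9".encode("utf-8")  # b'\xc3\xa9'
line = interpolate_via_latin1(b"%s: %s\r\n", b"Subject", utf8_value)

# The hazard: the intermediate str is not real text. If it leaks out of
# this function, the two UTF-8 bytes appear as two Latin-1 characters.
leaked = utf8_value.decode("latin-1")
```

The bytes survive the round trip unchanged, but `leaked` is exactly the kind of Mojibake that can escape if the intermediate string is ever treated as text.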

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, Jan 08 2014)
>>> Python Projects, Consulting and Support ...   http://www.egenix.com/
>>> mxODBC.Zope/Plone.Database.Adapter ...   http://zope.egenix.com/
>>> mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


: Try our mxODBC.Connect Python Database Interface for free ! ::

   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread Victor Stinner
Hi,

2014/1/8 Mark Shannon :
> I'm opposed to adding methods to bytes for this, as I think it goes against
> the reason for the separation of str and bytes in the first place.

Well, sometimes practicality beats purity. Many developers
complained that Python 3 is too strict. The motivation of the PEP is
to ease the transition from Python 2 to Python 3 and to make it possible
to write the same code base for the two versions.

> bytes are just a sequence of 8bit clumps.
> The meaning of bytes depends on the encoding, but the proposed methods will
> have no encoding, but presume meaning.

Many protocols mix ASCII text with binary bytes. For example, an HTTP
server writes headers and then copies the content of a binary file (e.g.
a PNG picture, a gzipped HTML page, whatever) *in the same stream*. There
are many similar examples. Just another one: PDF mixes ASCII text with
binary.
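
To make the HTTP case concrete, here is a hedged sketch of how such a stream can be assembled without any bytes formatting operator, using only concatenation and `bytes.join` (the function name and the sample body are assumptions for illustration):

```python
def http_response(body: bytes, content_type: bytes) -> bytes:
    """Build ASCII headers followed by an arbitrary binary body."""
    headers = [
        b"HTTP/1.1 200 OK",
        b"Content-Type: " + content_type,
        # The length has to be converted to ASCII digits by hand.
        b"Content-Length: " + str(len(body)).encode("ascii"),
    ]
    return b"\r\n".join(headers) + b"\r\n\r\n" + body


png_like = b"\x89PNG\r\n\x1a\n" + bytes(4)  # not a real image, just binary data
response = http_response(png_like, b"image/png")
```

This works, but every header line needs its own concatenation, which is the inconvenience the PEP is trying to remove.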

> What does b'%s' % 7 do?

See Examples of the PEP:

b'a%sc%s' % (b'b', 4) gives b'abc4'

(so b'%s' % 7 gives b'7')

> u'%s' % 7 calls 7 .__str__() which returns a (unicode) string.
> By implication b'%s' % 7 would call 7 .__str__() and ...

Why do you think so? bytes and str will have two separate
implementations, but might share some functions. CPython already has a
"stringlib" which shares as much code as possible between bytes and
str. For example, the "fastsearch" code is shared.

> And then what? Use the "default" encoding? ASCII?

Bytes have no encoding. There are just bytes :-)

IMO the typical use case will be b'%s: %s' % (b'Header', binary_data)

> I am not opposed to adding new functionality, as long as it is not
> overloading the % operator or format() method.

Ok, I will record your opposition in the PEP.

> binascii.format() perhaps?

Please read the Rationale of the PEP again, binascii.format() doesn't
solve the described use case.

Victor


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread Victor Stinner
Hi,

2014/1/8 M.-A. Lemburg :
> I'd simply copy over the Python 2 PyString code and start working
> from there.

It's not possible to directly reuse all the Python 2 code because some
helpers have been modified to work on Unicode. PEP 460 also adds
more work for other implementations of Python.

IMO some formatting commands must not be implemented. For example,
alignment is used to display something on screen, not in network
protocols or binary file formats. That is also why issue #3982 was
stuck: we must define exactly the feature set of the new methods
(bytes % args, bytes.format).

Victor


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread Chris Angelico
On Wed, Jan 8, 2014 at 9:12 PM, Victor Stinner  wrote:
> IMO some formatting commands must not be implemented. For example,
> alignment is used to display something on screen, not in network
> protocols or binary file formats.

Must not, or need not? I can understand that those sorts of features
would be less valuable, but they do make sense.

ChrisA


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread Antoine Pitrou
On Wed, 08 Jan 2014 13:51:36 +0900
"Stephen J. Turnbull"  wrote:
> Benjamin Peterson writes:
> 
>  > I agree. This is a very important, much-requested feature for low-level
>  > networking code.
> 
> I hear it's much-requested, but is there any description of typical
> use cases?  The ones I've seen on this list and on -ideas are
> typically stream-oriented, and seem like they would be perfectly
> well-served in terms of code readability and algorithmic accuracy by
> reading with .decode('ascii', errors='surrogateescape') and writing
> with .encode() and the same parameters (or as latin1).

It's a matter of convenience. Sometimes you're just interpolating bytes
data together and it's a bit suboptimal to have to do a
decode()-encode() dance around that.
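
For reference, the decode()-encode() dance with `surrogateescape` could be sketched as follows. The header name and payload are invented for the example; the codec name and error handler are the ones mentioned in the quoted text.

```python
raw_name = b"X-Blob"
raw_value = b"\xff\xfe\x00data"  # arbitrary non-ASCII bytes

# Bytes >= 0x80 become lone surrogates (U+DC80..U+DCFF) instead of failing.
name = raw_name.decode("ascii", errors="surrogateescape")
value = raw_value.decode("ascii", errors="surrogateescape")

# Ordinary str formatting in the middle.
line = "%s: %s\r\n" % (name, value)

# Encoding with the same error handler restores the original bytes.
wire = line.encode("ascii", errors="surrogateescape")
```

The result is byte-for-byte what direct interpolation would give, at the cost of two conversions per value.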

That said, the whole issue is slightly overblown as well: network
programming in 3.x is perfectly reasonable, as the existence of Tornado
and Tulip shows.

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread Antoine Pitrou
On Wed, 8 Jan 2014 11:02:19 +0100
Victor Stinner  wrote:
> 
> > What does b'%s' % 7 do?
> 
> See Examples of the PEP:
> 
> b'a%sc%s' % (b'b', 4) gives b'abc4'
[...]
> > And then what? Use the "default" encoding? ASCII?
> 
> Bytes have no encoding. There are just bytes :-)

Therefore you shouldn't accept integers. It does not make sense to
format 4 as b'4'.

> IMO the typical use case will be b'%s: %s' % (b'Header', binary_data)

Agreed.

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread M.-A. Lemburg
On 08.01.2014 11:12, Victor Stinner wrote:
> Hi,
> 
> 2014/1/8 M.-A. Lemburg :
>> I'd simply copy over the Python 2 PyString code and start working
>> from there.
> 
> It's not possible to reuse directly all Python 2 code because some
> helpers have been modified to work on Unicode. The PEP 460 adds also
> more work to other implementations of Python.
> 
> IMO some formatting commands must not be implemented. For example,
> alignment is used to display something on screen, not in network
> protocols or binary file formats. It's also why the issue #3982 was
> stuck, we must define exactly the feature set of the new methods
> (bytes % args, bytes.format).

I'd use practicality beats purity here.

As I mentioned in my reply, such formatting methods would indeed be
used on data that is text. It's just that this text would be embedded
inside an otherwise binary blob.

You could do the alignment in Unicode first, then encode it and format
it into the binary blob, but really: why bother with that extra
round-trip ?

The main purpose of the re-addition would be to simplify porting
applications to Python 3, while keeping them compatible with
Python 2 as well.

If you need to do the Unicode round-trip just to align a
string in some fixed-size field, you might as well convert
the whole operation to a function which deals with all this
based on whether Python 2 or 3 is running, and you'd lose
the intended simplification of the re-addition.
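
As an illustration of the fixed-size field case, here is a small sketch that pads directly on bytes with `bytes.ljust`, without a Unicode round-trip. The helper name `fixed_field` and the record layout are hypothetical.

```python
def fixed_field(data: bytes, width: int, fill: bytes = b" ") -> bytes:
    """Left-align data in a field of exactly `width` bytes."""
    if len(data) > width:
        raise ValueError("data does not fit in the field")
    return data.ljust(width, fill)


# An 8-byte space-padded name followed by a 4-byte NUL-padded code.
record = fixed_field(b"abc", 8) + fixed_field(b"42", 4, b"\x00")
```

This covers padding only; it does not replace width specifiers embedded in an existing format string, which is what ported Python 2 code would typically contain.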

PS: The PEP mentions having to code for Python 3.0-3.4 as well,
which don't support the new methods. I think it's perfectly
fine for newly ported code to require Python 2.7/3.5+. After
all, the porting effort will take some time as well.

-- 
Marc-Andre Lemburg
eGenix.com



Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread Antoine Pitrou

Hi Victor,

On Mon, 6 Jan 2014 14:24:50 +0100
Victor Stinner  wrote:
> Hi,
> 
> bytes % args and bytes.format(args) are requested by Mercurial and
> Twisted projects. The issue #3982 was stuck because nobody proposed a
> complete definition of the "new" features. Here is a try as a PEP.

There is a good use case at:
https://mail.python.org/pipermail/python-ideas/2014-January/024803.html

Regards

Antoine.


> 
> The PEP is a draft with open questions. First, I'm not sure that both
> bytes%args and bytes.format(args) are needed. The implementation of
> .format() is more complex, so why not only adding bytes%args? Then,
> the following points must be decided to define the complete list of
> supported features (formatters):
> 
> * Format integer to hexadecimal? ``%x`` and ``%X``
> * Format integer to octal? ``%o``
> * Format integer to binary? ``{!b}``
> * Alignment?
> * Truncating? Truncate or raise an error?
> * format keywords? ``b'{arg}'.format(arg=5)``
> * ``str % dict`` ? ``b'%(arg)s' % {'arg': 5}``
> * Floating point number?
> * ``%i``, ``%u`` and ``%d`` formats for integer numbers?
> * Signed number? ``%+i`` and ``%-i``
> 
> 
> HTML version of the PEP:
> http://www.python.org/dev/peps/pep-0460/
> 
> Inline copy:
> 
> PEP: 460
> Title: Add bytes % args and bytes.format(args) to Python 3.5
> Version: $Revision$
> Last-Modified: $Date$
> Author: Victor Stinner 
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 6-Jan-2014
> Python-Version: 3.5
> 
> 
> Abstract
> ========
> 
> Add ``bytes % args`` operator and ``bytes.format(args)`` method to
> Python 3.5.
> 
> 
> Rationale
> =========
> 
> ``bytes % args`` and ``bytes.format(args)`` were removed in Python
> 3. This operator and this method are requested by Mercurial and Twisted
> developers to ease porting their projects to Python 3.
>
> Python 3 suggests formatting text first and then encoding to bytes. In
> some cases, this does not make sense because the arguments are byte strings.
> A typical use is a network protocol which is binary, since data are
> sent to and received from sockets. For example, SMTP, SIP, HTTP, IMAP,
> POP, FTP are ASCII commands interspersed with binary data.
>
> Using multiple ``bytes + bytes`` instructions is inefficient because it
> requires temporary buffers and copies which are slow and waste memory.
> Python 3.3 optimizes ``str2 += str1`` but not ``bytes2 += bytes1``.
>
> ``bytes % args`` and ``bytes.format(args)`` have been requested since 2008,
> even before the first release of Python 3.0 (see issue #3982).
>
> ``struct.pack()`` is incomplete. For example, a number cannot be
> formatted as decimal and it does not support padding byte strings.
> 
> Mercurial 2.8 still supports Python 2.4.
> 
> 
> Needed and excluded features
> ============================
> 
> Needed features
> 
> * Bytes strings: bytes, bytearray and memoryview types
> * Format integer numbers as decimal
> * Padding with spaces and null bytes
> * "%s" should use the buffer protocol, not str()
> 
> The feature set is minimal to keep the implementation as simple as
> possible to limit the cost of the implementation. ``str % args`` and
> ``str.format(args)`` are already complex and difficult to maintain, the
> code is heavily optimized.
> 
> Excluded features:
> 
> * no implicit conversion from Unicode to bytes (ex: encode to ASCII or
>   to Latin1)
> * Locale support (``{!n}`` format for numbers). Locales are related to
>   text and usually to an encoding.
> * ``repr()``, ``ascii()``: ``%r``, ``{!r}``, ``%a`` and ``{!a}``
>   formats. ``repr()`` and ``ascii()`` are used to debug, the output is
>   displayed in a terminal or a graphical widget. They are more related to
>   text.
> * Attribute access: ``{obj.attr}``
> * Indexing: ``{dict[key]}``
> * Features of struct.pack(). For example, format a number as 32 bit unsigned
>   integer in network endian. The ``struct.pack()`` can be used to prepare
>   arguments, the implementation should be kept simple.
> * Features of int.to_bytes().
> * Features of ctypes.
> * New format protocol like a new ``__bformat__()`` method. Since the
>   list of supported types is short, there is no need to add a new
>   protocol. Other types must be explicitly cast.
> * Alternate format for integer. For example, ``'{|#x}'.format(0x123)``
>   to get ``0x123``. It is more related to debug, and the prefix can
>   easily be written in the format string (ex: ``0x%x``).
> * Relation with format() and the __format__() protocol. bytes.format()
>   and str.format() are unrelated.
> 
> Unknown:
> 
> * Format integer to hexadecimal? ``%x`` and ``%X``
> * Format integer to octal? ``%o``
> * Format integer to binary? ``{!b}``
> * Alignment?
> * Truncating? Truncate or raise an error?
> * format keywords? ``b'{arg}'.format(arg=5)``
> * ``str % dict`` ? ``b'%(arg)s' % {'arg': 5}``
> * Floating point number?
> * ``%i``, ``%u`` and ``%d`` formats for integer numbers?
> * Signed number? ``%+i`` and ``%-i``
> 
> 
> bytes %

Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread Daniel Holth
On Tue, Jan 7, 2014 at 10:36 AM, Stephen J. Turnbull  wrote:
> Daniel Holth writes:
>
>  > Isn't it true that if you have bytes > 127 or surrogate escapes then
>  > encoding to latin1 is no longer as fast as memcpy?
>
> Be careful.  As phrased, the question makes no sense.  You don't "have
> bytes" when you are encoding, you have characters.
>
> If you mean "what happens when my str contains characters in the range
> 128-255?", the answer is encoding a str in 8-bit representation to
> latin1 is effectively memcpy.  If you read in latin1, it's memcpy all
> the way (unless you combine it with a non-latin1 string, in which case
> you're in the cases below).
>
> If you mean "what happens when my str contains characters above 255",
> you have to truncate 16-bit units to 8 bit units; no memcpy.
>
> Surrogates require >= 16 bits; no memcpy.

That is neat.
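
A rough way to observe the storage difference described above, assuming CPython 3.3 or later (the exact sizes are an implementation detail, so the sketch only compares them):

```python
import sys

latin1_text = "\u00e9" * 100         # every character is <= 255
wide_text = "\u00e9" * 99 + "\u0100"  # one character above 255

# On CPython 3.3+ the first string is stored with 1 byte per character,
# the second with at least 2 bytes per character.
latin1_size = sys.getsizeof(latin1_text)
wide_size = sys.getsizeof(wide_text)

encoded = latin1_text.encode("latin-1")  # one output byte per character

# Without an error handler, a character above 255 cannot be encoded at all.
try:
    wide_text.encode("latin-1")
    wide_encodes = True
except UnicodeEncodeError:
    wide_encodes = False
```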


Re: [Python-Dev] [RELEASED] Python 3.4.0b2

2014-01-08 Thread Bob Hanson
[Top-post fixed (use-case is an exception to the GvR rule ;-) )
and some attributions restored with my additional comments
following for the ease of future readers.]

TL;DR: Outbound-connection attempts seem to be happening only to
me, therefore, most likely not a Python problem -- but some
problem at my end. Thanks to all.

On Mon, 6 Jan 2014 05:43:38 -1000, Guido van Rossum wrote:

> On Mon, Jan 6, 2014 at 5:29 AM, Bob Hanson  wrote:
>
> > [For the record: I'm running 32bit Windows XP (Pro) SP2 and
> > installing "for all users."]
> >
> > TL;DR: No matter what I tried this morning re uninstalling and
> > reinstalling 3.4.0b2, pip or no pip, MSI still tried to connect
> > to the Akamai URLs.
> >
> > On Sun, 05 Jan 2014 23:06:49 -0500, R. David Murray wrote:
> >
> > > On Sun, 05 Jan 2014 19:32:15 -0800, Bob Hanson wrote:
> > > 
> > > > Still wondering why [...] msiexec.exe [is] trying to connect out while
> > > > installing 3.4.0b2 from my harddrive...?
> > >
> > > The ensurepip developers will have to say for sure, but my understanding
> > > is that it does *not* go out to the network.  On the other hand, it is
> > > conceivable that pip 1.5, unlike the earlier version in Beta1, is doing
> > > some sort of "up to date check" that it shouldn't be doing in the
> > > ensurepip scenario.
> > >
> > > I presume you did have the installer install pip.
> > 
> > To be honest, I forgot all about pip [...] didn't
> > even notice a checkbox for that option.
> >
> > > If you haven't already, you might try reinstalling and unchecking
> > > that option, and see if msiexec still tries to go out to the
> > > network.  That would confirm it is ensurepip that is the issue
> > > (although that does seem most likely).
> >
> > [...snip synopsis of various uninstall-reinstall dances...]
> >
> > So, whatever I have tried -- pip or no pip -- msiexec.exe still
> > attempts to connect to those Akamai URLs.
> 
> Since MSIEXEC.EXE is a legit binary (not coming from our packager) and
> Akamai is a legitimate company (MS most likely has an agreement with
> them), at this point I would assume that there's something that
> MSIEXEC.EXE wants to get from Akamai, which is unintentionally but
> harmlessly triggered by the Python install. Could it be checking for
> upgrades?

When I read this comment of yours, Guido, I immediately started
wondering about this. You may well be right -- indeed, I have a
very old install (c.2007) which has not been updated (other than
one or three new MS "drivers"). 

Perhaps the Python 3.4.0b2 MSI installer uses a new capability,
which, as you say, causes the installer to at least attempt to
upgrade...?

In any event, as there have been no other reports, this seems to be
something happening only to me. As such, it seems to be not a
Python problem, but some misconfiguration on my own system, say.

If I retain interest in investigating this, and if I *do* find an
actual problem with Python, I'll post again.

Thanks go to you, Guido, as well as to Tim and all the others who
helped me with this.

Regards,
Bob Hanson

-- 
Write once, read many.



Re: [Python-Dev] [RELEASED] Python 3.4.0b2

2014-01-08 Thread Nick Coghlan
On 9 January 2014 00:43, Bob Hanson  wrote:
> When I read this comment of yours, Guido, I immediately started
> wondering about this. You may well be right -- indeed, I have a
> very old install (c.2007) which has not been updated (other than
> one or three new MS "drivers").
>
> Perhaps the Python 3.4.0b2 MSI installer uses a new capability,
> which, as you say, causes the installer to at least attempt to
> upgrade...?

I believe the pip bootstrapping involves an MSI feature we haven't
previously used (MvL would be able to confirm). If so, then MSI may be
looking for a new version to interpret that new setting.

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Changing Clinic's output

2014-01-08 Thread Barry Warsaw
On Jan 07, 2014, at 10:39 PM, Serhiy Storchaka wrote:

>Only this option will solve all my issues.

How hard would it be to put together some sample branches that provide
concrete examples of the various options?

My own opinion could easily be influenced by having some hands-on time with
actual code, and I suspect even Guido could be influenced if he could pull
some things up in his editor and take a look around.

-Barry


Re: [Python-Dev] Changing Clinic's output

2014-01-08 Thread Larry Hastings

On 01/08/2014 07:08 AM, Barry Warsaw wrote:
> On Jan 07, 2014, at 10:39 PM, Serhiy Storchaka wrote:
>
> > Only this option will solve all my issues.
>
> How hard would it be to put together some sample branches that provide
> concrete examples of the various options?
>
> My own opinion could easily be influenced by having some hands-on time with
> actual code, and I suspect even Guido could be influenced if he could pull
> some things up in his editor and take a look around.


I plan to prototype the "accumulator" later today.  It probably wouldn't 
be hard to make the prototype support writing out to a separate file, so 
I'll try to do that too.



//arry/


Re: [Python-Dev] Changing Clinic's output

2014-01-08 Thread Brett Cannon
On Tue, Jan 7, 2014 at 7:07 PM, Larry Hastings  wrote:

>
>
> On 01/07/2014 03:38 PM, Brett Cannon wrote:
>
> On Tue, Jan 7, 2014 at 6:24 PM, Larry Hastings  wrote:
>
>>  For what it's worth, if we use the "accumulator" approach I propose
>> that the generated code doesn't go at the very end of the file.  Instead, I
>> suggest they should go *near* the end, below the implementations of the
>> module / class methods, but above the methoddef/type structures and the
>> module init function.
>>
>
>  If it is accumulated in a single location should it just be a single
> block for everything towards the end? Then forward declarations would go
> away (you could still have it as a comment to copy-and-paste where you
> define the implementation) and you can have a single macro for the
> PyMethodDef values, each class, etc. If you accumulated the PyMethodDef
> values into a single macro it would help make up for the convenience lost
> of converting a function by just cutting the old call signature up to the
> new *_impl() function.
>
>
> I *think* that would complicate some use cases.  People occasionally call
> these parsing functions from other functions, or spread their methoddef /
> typeobject structures throughout the file rather than putting them all at
> the end.
>
> I'm proposing that the blob of text immediately between the Clinic input
> and the body of the impl contain (newlines added here for clarity):
>
> static char *parsing_function_doc;
>
> static PyObject *
> parsing_function(...);
>
> #define PARSING_FUNCTION_METHODDEF \
> { ... }
>
> static PyObject *
> parsing_function_impl(...)
>
> Then the "accumulator" would get the text of the docstring and the
> definition of the parsing_function.
>
>
> On the other hand, if we wanted to take this opportunity to force everyone
> to standardize (all methoddefs and typeobjects go at the end!) we could
> probably make it work with one giant block near the end.
>
> Or I could make it flexible on what went into the accumulator and what
> went into the normal output block, and the default could be
> everything-in-the-accumulator.  Making the common easy and the uncommon
> possible and all that.  Yeah, that seems best.
>

So let's make this idea concrete to focus a possible discussion. Using the
example from the Clinic HOWTO and converting to how I see it working:



/*[clinic input]
pickle.Pickler.dump

obj: 'O'
The object to be pickled.
/

Write a pickled representation of obj to the open file.
[clinic start generated code]*/

static PyObject *
pickle_Pickler_dump_impl(PyObject *self, PyObject *obj)
/*[clinic end generated code:
checksum=3bd30745bf206a48f8b576a1da3d90f55a0a4187]*/
{
    /* Check whether the Pickler was initialized correctly (issue3664).
       Developers often forget to call __init__() in their subclasses, which
       would trigger a segfault without this check. */
    if (self->write == NULL) {
        PyErr_Format(PicklingError,
                     "Pickler.__init__() was not called by %s.__init__()",
                     Py_TYPE(self)->tp_name);
        return NULL;
    }

    if (_Pickler_ClearBuffer(self) < 0)
        return NULL;
    ...
}

...

/*[clinic accumulate]*/
PyDoc_STRVAR(pickle_Pickler_dump__doc__,
"Write a pickled representation of obj to the open file.\n"
"\n"

static PyObject *
_pickle_Pickler_dump(PyObject *args)
{
  ...
  return pickle_Pickler_dump_impl(...);
}

#define _PICKLE_PICKLER_DUMP_METHODDEF \
    {"dump", (PyCFunction)_pickle_Pickler_dump, METH_O, \
     _pickle_Pickler_dump__doc__},

... any other pickle.Pickler Clinic stuff that does not directly involve
the impl function ...

#define _PICKLE_PICKLER_METHODDEF_ACCUMULATED  \
  _PICKLE_PICKLER_DUMP_METHODDEF  \
  ... any other MethodDef entries for pickle.Pickler

/*[clinic end accumulate: checksum=0123456789]*/

... pickle.Pickler struct where _PICKLE_PICKLER_METHODDEF_ACCUMULATED is
all that is needed for the non-magical class methods ...


###


Another potential perk of gathering the Clinic output is that if we
take it to its logical conclusion, then you can start to do things like
define a method like pickle.Pickler.__init__, etc., have Clinic handle
docstrings for modules and classes, and then it can end up spitting out the
type struct entirely for you, negating the typical need to do all of that
by hand (I don't know about the rest of you but I always just copy and
paste that struct anyway, so having a tool slot in the right method names
for the right positions would save me busy work). It could then go as far
as spitting out the module initialization function definition line, and
then all you would need to do is fill that in; Clinic could handle all
other module-level details for you in the very common case.
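
To make the accumulated-macro idea a bit more tangible, here is a purely hypothetical Python sketch of how a generator might derive the macro names used in the example above. It is not Argument Clinic's actual code; the function names and the second method name are assumptions for illustration.

```python
def methoddef_macro(qualified_name: str) -> str:
    """'pickle.Pickler.dump' -> '_PICKLE_PICKLER_DUMP_METHODDEF'."""
    return "_" + qualified_name.replace(".", "_").upper() + "_METHODDEF"


def accumulated_macro(class_name: str, methods: list) -> str:
    """Emit one #define that chains the per-method METHODDEF macros."""
    header = "#define _" + class_name.replace(".", "_").upper() + "_METHODDEF_ACCUMULATED"
    entries = ["    " + methoddef_macro(class_name + "." + name) for name in methods]
    # Join with backslash continuations so the C macro spans several lines.
    return " \\\n".join([header] + entries)


macro_text = accumulated_macro("pickle.Pickler", ["dump", "clear_memo"])
```

Printing `macro_text` yields a `#define` whose body lists one per-method macro per line, which is the shape a hand-written type definition could then reference once.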

Re: [Python-Dev] Changing Clinic's output

2014-01-08 Thread Larry Hastings



On 01/08/2014 07:33 AM, Brett Cannon wrote:
> So let's make this idea concrete to focus a possible discussion. Using
> the example from the Clinic HOWTO and converting to how I see it working:
>
> [...]


Yep.  And what I was proposing is much the same, except there are a 
couple extra lines in the "generated code" section.  I'd keep the 
#define for the methoddef there, and add a prototype for the generated 
parsing function (_pickle_Pickler_dump) and the docstring.





/*[clinic input]
pickle.Pickler.dump

obj: 'O'
The object to be pickled.
/

Write a pickled representation of obj to the open file.
[clinic start generated code]*/
PyDoc_VAR(pickle_Pickler_dump__doc__);

static PyObject *
_pickle_Pickler_dump(PyObject *args);

#define _PICKLE_PICKLER_DUMP_METHODDEF \
    {"dump", (PyCFunction)_pickle_Pickler_dump, METH_O, \
     _pickle_Pickler_dump__doc__},

static PyObject *
pickle_Pickler_dump_impl(PyObject *self, PyObject *obj)
/*[clinic end generated code:
checksum=3bd30745bf206a48f8b576a1da3d90f55a0a4187]*/
{
    /* Check whether the Pickler was initialized correctly (issue3664).
       Developers often forget to call __init__() in their subclasses, which
       would trigger a segfault without this check. */
    ...
}

> Another potential perk of doing a gathering of Clinic output is that
> if we take it to its logical conclusion, then you can start to do
> things like define a method like pickle.Pickler.__init__, etc., have
> Clinic handle docstrings for modules and classes, and then it can end
> up spitting out the type struct entirely for you, negating the typical
> need to do all of that by hand (I don't know about the rest of you but
> I always just copy and paste that struct anyway, so having a tool slot
> in the right method names for the right positions would save me busy
> work).


Surely new code should use the functional API for creating types?  
Anyway, yes, in the future it would be nice to get rid of a bunch of the 
busywork associated with implementing a Python builtin type, and 
Argument Clinic could definitely help with that.



//arry/


Re: [Python-Dev] Changing Clinic's output

2014-01-08 Thread Brett Cannon
On Wed, Jan 8, 2014 at 10:46 AM, Larry Hastings  wrote:

>
>
> On 01/08/2014 07:33 AM, Brett Cannon wrote:
>
> So let's make this idea concrete to focus a possible discussion. Using the
> example from the Clinic HOWTO and converting to how I see it working:
> [...]
>
>
> Yep.  And what I was proposing is much the same, except there are a couple
> extra lines in the "generated code" section.  I'd keep the #define for the
> methoddef there, and add a prototype for the generated parsing function
> (_pickle_Pickler_dump) and the docstring.
>

I assume that's for flexibility in case someone has their module structured
in a way that doesn't lend itself to having it all accumulated at the end
of the file? Or is there something I'm overlooking? I would assume being
able to put the accumulator block wherever you want with enough forward
declarations would still be enough to allow for it to work with almost any
structured format of a file and have almost all the generated code in a
single place. I can definitely live with what you are proposing, just
trying to understand the logic as shifting almost all generated stuff in a
single place does make Clinic comments read like fancy docstrings which is
nice.


>
>
>   
>
>  /*[clinic input]
> pickle.Pickler.dump
>
>  obj: 'O'
> The object to be pickled.
> /
>
>  Write a pickled representation of obj to the open file.
> [clinic start generated code]*/
>  PyDoc_VAR(pickle_Pickler_dump__doc__);
>
> static PyObject *
> _pickle_Pickler_dump(PyObject *args);
>
>  #define _PICKLE_PICKLER_DUMP_METHODDEF\
> {"dump", (PyCFunction)_pickle_Pickler_dump, METH_O,
> _pickle_Pickler_dump__doc__},
>
>  static PyObject *
> pickle_Pickler_dump_impl(PyObject *self, PyObject *obj)
> /*[clinic end generated code:
> checksum=3bd30745bf206a48f8b576a1da3d90f55a0a4187]*/
> {
> /* Check whether the Pickler was initialized correctly (issue3664).
>    Developers often forget to call __init__() in their subclasses, which
>    would trigger a segfault without this check. */
> ...
> }
>
>  Another potential perk of doing a gathering of Clinic output is that if
> we take it to its logical conclusion, then you can start to do things like
> define a method like pickle.Pickler.__init__, etc., have Clinic handle
> docstrings for modules and classes, and then it can end up spitting out the
> type struct entirely for you, negating the typical need to do all of that
> by hand (I don't know about the rest of you but I always just copy and
> paste that struct anyway, so having a tool slot in the right method names
> for the right positions would save me busy work).
>
>
> Surely new code should use the functional API for creating types?
>

Yes. Shows how long it has been since I have written a C type from scratch.
=)


>   Anyway, yes, in the future it would be nice to get rid of a bunch of the
> busywork associated with implementing a Python builtin type, and Argument
> Clinic could definitely help with that.
>

I think that will be the big long-term win; taking out nearly all
boilerplate in creating an extension module and maintaining it (in case
something changes, e.g. the module init function signature).


Re: [Python-Dev] Changing Clinic's output

2014-01-08 Thread Larry Hastings


On 01/08/2014 08:04 AM, Brett Cannon wrote:
On Wed, Jan 8, 2014 at 10:46 AM, Larry Hastings wrote:


Yep.  And what I was proposing is much the same, except there are
a couple extra lines in the "generated code" section.  I'd keep
the #define for the methoddef there, and add a prototype for the
generated parsing function (_pickle_Pickler_dump) and the docstring.


I assume that's for flexibility in case someone has their module 
structured in a way that doesn't lend itself to having it all 
accumulated at the end of the file? Or is there something I'm overlooking?


No, you're not overlooking anything, that's why.  It's for files that 
have getsetdef / methoddef / typeobject structures all over the place 
instead of keeping them all at the end.  My mindset is trying to avoid 
requiring big changes for Argument Clinic support like "step 87: now 
move all your getsetdef / methoddef / typeobject to the end of your 
file, below the accumulator output block".  Argument Clinic is 
contributing enough churn as it is don'tchathink!



//arry/


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread Ethan Furman

On 01/08/2014 02:28 AM, Antoine Pitrou wrote:

On Wed, 8 Jan 2014 11:02:19 +0100
Victor Stinner  wrote:



What does b'%s' % 7 do?


See Examples of the PEP:

b'a%sc%s' % (b'b', 4) gives b'abc4'

[...]

And then what? Use the "default" encoding? ASCII?


Bytes have no encoding. There are just bytes :-)


Therefore you shouldn't accept integers. It does not make sense to
format 4 as b'4'.


Agreed.  I would have thought that it would result in b'\x04'.

--
~Ethan~


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread Victor Stinner
2014/1/8 Ethan Furman :
>> Therefore you shouldn't accept integers. It does not make sense to
>> format 4 as b'4'.
>
> Agreed.  I would have thought that it would result in b'\x04'.

The PEP proposes b'%c' % 4 => b'\x04'.

Antoine gave me a good argument against supporting b'%s' % int: how
would int subclasses be handled? int has no __bytes__() nor
__bformat__() method. bytes(int) returns a string of null bytes.

It's maybe simpler to only support the %s format with bytes-like objects
(bytes, bytearray, memoryview).
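
For reference, the three results being compared in this subthread can be spelled out with operations that already exist in every Python 3 release. This is only a sketch of the existing behaviour; it assumes nothing about how the proposed b'%s' or b'%c' would end up being specified, and the variable names are made up for illustration.

```python
# bytes(int) does not format the number: it builds a zero-filled
# buffer of that length (the "string of null bytes" mentioned above).
zero_filled = bytes(4)

# Decimal ASCII digits require going through str explicitly.
decimal_text = str(4).encode("ascii")

# The single byte with value 4 can be built in two equivalent ways.
single_byte = bytes([4])
same_byte = (4).to_bytes(1, "big")

assert zero_filled == b"\x00\x00\x00\x00"
assert decimal_text == b"4"
assert single_byte == same_byte == b"\x04"
```

Each spelling states its intent explicitly, which is the ambiguity a bare b'%s' % 4 would have to resolve implicitly.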

Victor


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread Stefan Behnel
Victor Stinner, 07.01.2014 19:14:
> 2014/1/7 Stefan Behnel:
>> Victor Stinner, 06.01.2014 14:24:
>>> ``struct.pack()`` is incomplete. For example, a number cannot be
>>> formatted as decimal and it does not support padding bytes string.
>>
>> Then what about extending the struct module in a way that makes it cover
>> more use cases like these?
> 
> The idea of the PEP is to simplify the porting work of Twisted and
> Mercurial developers. So the same code should work on Python 2 and
> Python 3.

Is it really a requirement that existing Py2 code must work unchanged in
Py3? Why can't someone write a third-party library that does what these
projects need, and that works in both Py2 and Py3, so that these projects
can be modified to use that library and thus get on with their porting to
Py3? Or rather one library that does what some projects need and another
one that does what other projects need, because it's quite likely that the
requirements are not really as largely identical as it seems when seen
through the old and milky Py2 glasses.

One idea behind designing Py3 was to simplify the language. Getting all Py2
"features" back in doesn't help on that path. If something can easily be
done in an external module, I think it should be done there.
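
To make the struct.pack() limitation quoted above concrete, here is a small sketch using only the standard library. It shows that struct produces fixed-width binary fields, while decimal digits and space padding currently need a detour through str or the bytes methods. The value and widths are arbitrary examples.

```python
import struct

value = 1234

# struct.pack() emits fixed-width binary fields; ">I" is a 4-byte
# big-endian unsigned integer.
packed = struct.pack(">I", value)

# Decimal ASCII digits, as many text-like wire protocols expect,
# have to be produced via str and then encoded.
decimal = str(value).encode("ascii")

# Padding a bytes string to a fixed width is available as a bytes method.
padded = b"abc".ljust(8, b" ")

assert packed == b"\x00\x00\x04\xd2"
assert decimal == b"1234"
assert padded == b"abc     "
```

A third-party helper of the kind suggested here would essentially wrap these steps behind one formatting call.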

Stefan




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread Eric Snow
On Wed, Jan 8, 2014 at 3:40 AM, M.-A. Lemburg  wrote:
> PS: The PEP mentions having to code for Python 3.0-3.4 as well,
> which don't support the new methods. I think it's perfectly
> fine to have newly ported code to require Python 2.7/3.5+. After
> all, the porting effort will take some time as well.

tl;dr We must get the relevant library projects involved in this
discussion.  I prefer Nick's solution to the problem at hand.


I've mostly stayed out of this discussion because I neither have many
unicode-related use-cases nor a deep understanding of all the issues.
However, my investment in the community is such that I've been
following these discussions and hope to add what I can in what few
places I chime in. :)


Requiring 3.5 may be tricky though.  How soon will 3.5 show up in OS
distros or be their system Python?  Getting 3.5 on their system may
not be a simple option for some (perhaps too few to matter?) and may
be seen as too onerous to others.  This effort is meant to ease
porting to Python 3 and not as just a carrot like most other new
features.

It boils down to 3.5 being *the* target for porting from 2.7.
Otherwise we'd be better off adding a new type to 3.5 for the
wire-protocol use cases and providing a 2.7/3.x backport on the
cheeseshop that would facilitate porting such code bases to 3.5.  My
understanding is that is basically what Nick has proposed (sorry,
Nick, if I've misunderstood).  The latter approach makes more sense to
me.

However, it seems like this whole discussion is motivated by a
particular group of library projects.  Regardless of what we discuss
or the solutions on which we resolve, we'd be making a mistake if we
did not do our utmost to ensure those projects are directly involved
in these discussions.

-eric


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread Antoine Pitrou
On Wed, 8 Jan 2014 11:16:49 -0700
Eric Snow  wrote:
> 
> It boils down to 3.5 being *the* target for porting from 2.7.

No. Please let's stop being self-deprecating. 3.3 is fine as a porting
target, as the many high-profile libraries which have already been
ported can attest.

> Otherwise we'd be better off adding a new type to 3.5 for the
> wire-protocol use cases

I'm completely opposed to a new type.

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread Eric Snow
On Mon, Jan 6, 2014 at 6:24 AM, Victor Stinner  wrote:
> Abstract
> ========
>
> Add ``bytes % args`` operator and ``bytes.format(args)`` method to
> Python 3.5.
>
>
> Rationale
> =========
>
> ``bytes % args`` and ``bytes.format(args)`` have been removed in Python
> 3. This operator and this method are requested by Mercurial and Twisted
> developers to ease porting their project to Python 3.
>
> Python 3 suggests to format text first and then encode to bytes. In
> some cases, it does not make sense because arguments are bytes strings.
> Typical usage is a network protocol which is binary, since data are
> sent to and received from sockets. For example, SMTP, SIP, HTTP, IMAP,
> POP, FTP are ASCII commands interspersed with binary data.
>
> Using multiple ``bytes + bytes`` instructions is inefficient because it
> requires temporary buffers and copies which are slow and waste memory.
> Python 3.3 optimizes ``str2 += str1`` but not ``bytes2 += bytes1``.
>
> ``bytes % args`` and ``bytes.format(args)`` have been requested since 2008, even
> before the first release of Python 3.0 (see issue #3982).
>
> ``struct.pack()`` is incomplete. For example, a number cannot be
> formatted as decimal and it does not support padding bytes string.
>
> Mercurial 2.8 still supports Python 2.4.
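
As a point of comparison for the rationale above, this is roughly what assembling a protocol message looks like today without bytes formatting, while still avoiding a chain of ``bytes + bytes``. It is a minimal sketch: build_request, the request line and the host name are hypothetical and only serve to illustrate the pattern.

```python
def build_request(host, body):
    """Assemble a small HTTP/1.1 request from bytes parts (illustrative only)."""
    parts = [
        b"POST /upload HTTP/1.1\r\n",
        b"Host: ", host, b"\r\n",
        # The decimal length has to go through str, then be encoded.
        b"Content-Length: ", str(len(body)).encode("ascii"), b"\r\n",
        b"\r\n",
        body,
    ]
    # b"".join() sizes the result once and copies each part a single
    # time, instead of creating a temporary object for every "+".
    return b"".join(parts)


request = build_request(b"example.org", b"\xff\x00binary payload")
```

The join avoids the temporary buffers, but the template is no longer visible at a glance, which is the readability cost that the PEP's supporters are objecting to.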

As an alternative, we could provide an import hook via some channel
(cheeseshop? recipe?) that converts just b'' formatting into some
Python 3 equivalent (when run under Python 3).  The argument against
such import hooks is usually that they have an adverse impact on the
output of tracebacks.  However, I'd expect most b'' formatting to
happen on a single line and that the replacement source would stay on
that single line.

Such an import hook would lessen the desire for bytes formatting.  As
I mentioned elsewhere, Nick's counter-proposal of a separate
wire-protocol-friendly type makes more sense to me than adding
formatting to Python 3's bytes type.  As others have opined,
formatting a bytes object is out of place.  The need is limited in
scope and audience, but apparently real.  Adding that capability
directly to bytes in 3.5 should be a last resort to which we appeal
only when we exhaust our other options.

-eric


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread Antoine Pitrou
On Wed, 8 Jan 2014 11:59:51 -0700
Eric Snow  wrote:
> As others have opined,
> formatting a bytes object is out of place.

However, interpolating a bytes object isn't out of place, and it is
what a minimal "formatting" primitive could do.
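
A rough sketch of what such a minimal interpolation primitive could look like in pure Python, assuming it only accepts bytes-like arguments and only understands a bare %s placeholder. The interpolate name is hypothetical, there is no %% escape, and this is not the PEP's specification, just an illustration of "interpolation without formatting".

```python
def interpolate(template, *args):
    """Replace each b'%s' in template with the matching bytes-like argument."""
    pieces = template.split(b"%s")
    if len(pieces) - 1 != len(args):
        raise TypeError(
            "template has %d placeholders, got %d arguments"
            % (len(pieces) - 1, len(args))
        )
    out = bytearray(pieces[0])
    for arg, piece in zip(args, pieces[1:]):
        # memoryview() only accepts objects supporting the buffer
        # protocol, so str and int arguments raise TypeError here.
        out += memoryview(arg).tobytes()
        out += piece
    return bytes(out)


result = interpolate(b"a%sc%s", b"b", bytearray(b"4"))
assert result == b"abc4"
```

Nothing here has to decide how to turn a number or a str into bytes; the caller must do that conversion explicitly before interpolating.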

Regards

Antoine.




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread Stefan Behnel
Victor Stinner, 06.01.2014 14:24:
> Abstract
> ========
> Add ``bytes % args`` operator and ``bytes.format(args)`` method to
> Python 3.5.

Here is a counterproposal. Let someone who needs this feature write a
library that does byte string formatting. That properly handles it, a full
featured tool set. Write it in Cython if you need raw speed, that will also
help in making it run in both Python 2 and Python 3, or in providing easy
integration with buffers like the array module, various byte containers,
NumPy, etc.

I'm confident that this will show that the current Py2 code that
(legitimately) does byte string formatting can actually be improved,
simplified or sped up, at least in some corners. I'm sure Py2 byte string
formatting wasn't perfect for this use case either, it just happened to be
there, so everyone used it and worked around its particular quirks for the
particular use case at hand. (Think of "%s" % some_unicode_value, for example.)

Instead of waiting for 3.5, a third party library allows users to get
started porting their code earlier, and to make it work unchanged with
Python versions before 3.5.

Stefan




Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread Matt Billenstein
On Wed, Jan 08, 2014 at 07:12:06PM +0100, Stefan Behnel wrote:
> Why can't someone write a third-party library that does what these projects
> need, and that works in both Py2 and Py3, so that these projects can be
> modified to use that library and thus get on with their porting to Py3?

Apologies if this is out of place and slightly OT and soap-boxey...

Does it not strike anyone here how odd it is that one would need a library to
manipulate binary data in a programming language with "batteries included" on a
binary computer?  And maybe you can do it with existing facilities in both
versions of Python, although in python3, I need to understand what bytes,
format, ascii, and surrogateescape mean - among other things.

I started in Python blissfully unaware of unicode - it was a different time for
sure, but what I knew from C worked pretty much the same in Python - I could
read some binary data out of a file, twiddle some bits, and write it back out
again without any of these complexities - life was good and granted I was
naive, but it made Python approachable for me and I enjoyed it.  I stuck with
it and learned about unicode and the complexities of encoding data and now I'm
astonished at how many professional programmers don't know the slightest bit
about it and how horribly munged some data you can consume on the web might be
- I agree it's all quite a mess.

So now I'm getting more serious about Python3 and my fear is that the
development community (python3) has fractured from the user community (python2)
in that they've built something that solves their problems (to oversimplify
let's say a webapp) - sure, a bunch of stuff got fixed along the way and we gave
the users division they would expect (3/2 == 1.5), but somewhere what I felt
was more like a hobbyist language has become big and complex and "we need to
protect our users from doing the wrong thing."

And I think everyone was well intentioned - and python3 covers most of the
bases, but working with binary data is not only a "wire-protocol programmer's"
problem.  Needing a library to wrap
bytesthing.format('ascii', 'surrogateescape')
or some such thing makes python3 less approachable for those who haven't
learned that yet - which was almost all of us at some point when we started
programming.

I appreciate everyone's hard work - I'm confident the community will cross the
2-3 chasm and I hope we preserve the approachability I first came to love about
Python when I started using it for all sorts of applications.

thx

m

-- 
Matt Billenstein
m...@vazor.com
http://www.vazor.com/


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread Antoine Pitrou

Hi,

Another remark about the PEP: it should define bytearray % args and
bytearray.format(args) as well.

Regards

Antoine.



On Mon, 6 Jan 2014 14:24:50 +0100
Victor Stinner  wrote:

> Hi,
> 
> bytes % args and bytes.format(args) are requested by Mercurial and
> Twisted projects. The issue #3982 was stuck because nobody proposed a
> complete definition of the "new" features. Here is a try as a PEP.
> 
> The PEP is a draft with open questions. First, I'm not sure that both
> bytes%args and bytes.format(args) are needed. The implementation of
> .format() is more complex, so why not only adding bytes%args? Then,
> the following points must be decided to define the complete list of
> supported features (formatters):
> 
> * Format integer to hexadecimal? ``%x`` and ``%X``
> * Format integer to octal? ``%o``
> * Format integer to binary? ``{!b}``
> * Alignment?
> * Truncating? Truncate or raise an error?
> * format keywords? ``b'{arg}'.format(arg=5)``
> * ``str % dict`` ? ``b'%(arg)s' % {'arg': 5}``
> * Floating point number?
> * ``%i``, ``%u`` and ``%d`` formats for integer numbers?
> * Signed number? ``%+i`` and ``%-i``
> 
> 
> HTML version of the PEP:
> http://www.python.org/dev/peps/pep-0460/
> 
> Inline copy:
> 
> PEP: 460
> Title: Add bytes % args and bytes.format(args) to Python 3.5
> Version: $Revision$
> Last-Modified: $Date$
> Author: Victor Stinner 
> Status: Draft
> Type: Standards Track
> Content-Type: text/x-rst
> Created: 6-Jan-2014
> Python-Version: 3.5
> 
> 
> Abstract
> ========
> 
> Add ``bytes % args`` operator and ``bytes.format(args)`` method to
> Python 3.5.
> 
> 
> Rationale
> =========
> 
> ``bytes % args`` and ``bytes.format(args)`` have been removed in Python
> 3. This operator and this method are requested by Mercurial and Twisted
> developers to ease porting their project to Python 3.
> 
> Python 3 suggests to format text first and then encode to bytes. In
> some cases, it does not make sense because arguments are bytes strings.
> Typical usage is a network protocol which is binary, since data are
> sent to and received from sockets. For example, SMTP, SIP, HTTP, IMAP,
> POP, FTP are ASCII commands interspersed with binary data.
> 
> Using multiple ``bytes + bytes`` instructions is inefficient because it
> requires temporary buffers and copies which are slow and waste memory.
> Python 3.3 optimizes ``str2 += str1`` but not ``bytes2 += bytes1``.
> 
> ``bytes % args`` and ``bytes.format(args)`` have been requested since 2008, even
> before the first release of Python 3.0 (see issue #3982).
> 
> ``struct.pack()`` is incomplete. For example, a number cannot be
> formatted as decimal and it does not support padding bytes string.
> 
> Mercurial 2.8 still supports Python 2.4.
> 
> 
> Needed and excluded features
> ============================
> 
> Needed features
> 
> * Bytes strings: bytes, bytearray and memoryview types
> * Format integer numbers as decimal
> * Padding with spaces and null bytes
> * "%s" should use the buffer protocol, not str()
> 
> The feature set is minimal to keep the implementation as simple as
> possible to limit the cost of the implementation. ``str % args`` and
> ``str.format(args)`` are already complex and difficult to maintain, the
> code is heavily optimized.
> 
> Excluded features:
> 
> * no implicit conversion from Unicode to bytes (ex: encode to ASCII or
>   to Latin1)
> * Locale support (``{!n}`` format for numbers). Locales are related to
>   text and usually to an encoding.
> * ``repr()``, ``ascii()``: ``%r``, ``{!r}``, ``%a`` and ``{!a}``
>   formats. ``repr()`` and ``ascii()`` are used to debug, the output is
>   displayed in a terminal or a graphical widget. They are more related to
>   text.
> * Attribute access: ``{obj.attr}``
> * Indexing: ``{dict[key]}``
> * Features of struct.pack(). For example, format a number as 32 bit unsigned
>   integer in network endian. The ``struct.pack()`` can be used to prepare
>   arguments, the implementation should be kept simple.
> * Features of int.to_bytes().
> * Features of ctypes.
> * New format protocol like a new ``__bformat__()`` method. Since the list of
>   supported types is short, there is no need to add a new protocol.
>   Other types must be explicitly cast.
> * Alternate format for integer. For example, ``'{:#x}'.format(0x123)``
>   to get ``0x123``. It is more related to debug, and the prefix can
>   easily be written in the format string (ex: ``0x%x``).
> * Relation with format() and the __format__() protocol. bytes.format()
>   and str.format() are unrelated.
> 
> Unknown:
> 
> * Format integer to hexadecimal? ``%x`` and ``%X``
> * Format integer to octal? ``%o``
> * Format integer to binary? ``{!b}``
> * Alignment?
> * Truncating? Truncate or raise an error?
> * format keywords? ``b'{arg}'.format(arg=5)``
> * ``str % dict`` ? ``b'%(arg)s' % {'arg': 5}``
> * Floating point number?
> * ``%i``, ``%u`` and ``%d`` formats for integer numbers?
> * Signed number? ``%+i`` and ``%-i``
> 
> 
> bytes % args
>

Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread Daniel Holth
On Wed, Jan 8, 2014 at 2:17 PM, Stefan Behnel  wrote:
> Victor Stinner, 06.01.2014 14:24:
>> Abstract
>> 
>> Add ``bytes % args`` operator and ``bytes.format(args)`` method to
>> Python 3.5.
>
> Here is a counterproposal. Let someone who needs this feature write a
> library that does byte string formatting. That properly handles it, a full
> featured tool set. Write it in Cython if you need raw speed, that will also
> help in making it run in both Python 2 and Python 3, or in providing easy
> integration with buffers like the array module, various byte containers,
> NumPy, etc.

> I'm confident that this will show that the current Py2 code that
> (legitimately) does byte string formatting can actually be improved,
> simplified or sped up, at least in some corners. I'm sure Py2 byte string
> formatting wasn't perfect for this use case either, it just happened to be
> there, so everyone used it and worked around its particular quirks for the
> particular use case at hand. (Think of "%s" % some_unicode_value, for example.)
>
> Instead of waiting for 3.5, a third party library allows users to get
> started porting their code earlier, and to make it work unchanged with
> Python versions before 3.5.

Maybe we can enumerate some of the stated drawbacks of b''.format()

Convenient string processing tools for bytes will make people ignore
Unicode or fail to notice it or do it wrong? (As opposed to the
alternative causing them to learn how to process and produce Unicode
correctly?)

Similar APIs on bytes and str will prevent implicit "assert
isinstance(x, str)" checks?

More-prevalent bytes will propagate across the program causing bugs?
A-la open(b'filename').name vs open('filename').name ?

It will take a long time.

Hopeful benefits may include easier porting and greater Py3 adoption,
less encoding dances and/or decoding non-Unicode into Unicode just to
make things work, hopefully fewer surrogate-encoded bytes and
therefore fewer encoding-bugs-distant-from-source-of-invalid-text, ...


Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add bytes...)

2014-01-08 Thread R. David Murray
On Wed, 08 Jan 2014 19:22:08 +, "Matt Billenstein"  wrote:
> I started in Python blissfully unaware of unicode - it was a different time for
> sure, but what I knew from C worked pretty much the same in Python - I could
> read some binary data out of a file, twiddle some bits, and write it back out
> again without any of these complexities - life was good and granted I was
> naive, but it made Python approachable for me and I enjoyed it.  I stuck with
> it and learned about unicode and the complexities of encoding data and now I'm
> astonished at how many professional programmers don't know the slightest bit
> about it and how horribly munged some data you can consume on the web might be
> - I agree it's all quite a mess.
> 
> So now I'm getting more serious about Python3 and my fear is that the
> development community (python3) has fractured from the user community (python2)
> in that they've built something that solves their problems (to oversimplify
> let's say a webapp) - sure, a bunch of stuff got fixed along the way and we gave
> the users division they would expect (3/2 == 1.5), but somewhere what I felt

I believe this is a mis-perception.  I think Python3 is *simpler* and
*less complex* than Python2, both at the Python language level and at
the CPython implementation level.  (I'm using a definition of these
terms that roughly works out to "easier to understand".)

That was part of the point.  Python3 is *easier* to use for new projects
than Python2.  I'm not speaking from theory here, I've written and worked
on non-trivial new projects in both versions.[1]

It is true that in Python3 you *must* learn the difference between
bytes and strings.  But in the modern world, you had better learn to do
that anyway, and learn to do it right up front.  If you don't want to,
I suppose you could stay stuck in an earlier age and keep using Python2.

It also is true that it would be nice to have a more convenient API
for, as Antoine put it, interpolating into a binary stream.  But
really, the vast majority of programs have no need to do that.  It
is pretty much only the low level libraries, most of them dealing
with data-interchange (wire protocols), that would use this.

> was more like a hobbyist language has become big and complex and "we need to
> protect our users from doing the wrong thing."

As I just learned recently, Python was always intended to be a "real"
programming language, and not a hobbyist language :)  But it was also
always meant to be easy to learn and use.

Python3's goal is to make it *easier* to do the *right* thing.  The fact
that in some cases it also makes it harder to do the wrong thing is
mostly a consequence of making it easier to do the right thing.

Python's philosophy is still one of "consenting adults", despite a few
voices agitating for preventing users from shooting themselves in the
foot.  But making "the one obvious way to do it" easy, and consequently
making the other ways harder, fits in to its overall philosophy just fine.
As does trying to prevent the wrong thing from happening *by accident*
(read: mojibake).

--David

[1] I also find it easier to maintain my python3 programs than I do my
python2 programs, probably because I've gotten used to the convenience
of the new Python3 features, and miss them when working in Python2.

[2] With perfect hindsight I think we'd have focused more right from
the start on single-codebase, rather than on 2to3; but perfect hindsight
doesn't do you any good when it comes to foresight.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread Antoine Pitrou

Hi,

With Victor's consent, I overhauled PEP 460 and made the feature set
more restricted and consistent with the bytes/str separation. However, I
also added bytearray into the mix, as bytearray objects should
generally support the same operations as bytes (and they can be useful
*especially* for network programming).

Regards

Antoine.



On Mon, 6 Jan 2014 14:24:50 +0100
Victor Stinner  wrote:
> Hi,
> 
> bytes % args and bytes.format(args) are requested by Mercurial and
> Twisted projects. The issue #3982 was stuck because nobody proposed a
> complete definition of the "new" features. Here is a try as a PEP.
> 
> The PEP is a draft with open questions. First, I'm not sure that both
> bytes%args and bytes.format(args) are needed. The implementation of
> .format() is more complex, so why not only adding bytes%args? Then,
> the following points must be decided to define the complete list of
> supported features (formatters):




Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add bytes...)

2014-01-08 Thread Kristján Valur Jónsson

Believe it or not, sometimes you really don't care about encodings.
Sometimes you just want to parse text files.  Python 3 forces you to think 
about abstract concepts like encodings when all you want is to open that .txt 
file on the drive and extract some phone numbers and merge in some email 
addresses.  What encoding does the file have?  Do I care?  Must I care?
I have lots of little utilities, to help me with day to day stuff like this.  
One fine morning I decided to start using Python 3 for the job.  Imagine my 
surprise when it turned out to make my job more complicated, not easier.  
Suddenly I had to start thinking about stuff that hadn't mattered at all, and 
still didn't really matter.  All it did was complicate things for no benefit.  

Python forcing you to think about this is like the cashier at the hardware 
store who won't let you buy the hammer you brought to the cash register because 
you don't know what wood its handle is made of.

Sure, Python should make it easier to do the *right* thing.  That's equivalent 
to placing the indicator selector at a convenient place near the steering 
wheel.  What it shouldn't do, is make the flashing of the indicator mandatory 
whenever you turn the wheel.

All of this talk is positive, though.  The fact that these topics have finally 
reached the halls of python-dev is an indication that people out there are 
_trying_ to move to 3.3 :)

Cheers,

K


From: Python-Dev [python-dev-bounces+kristjan=ccpgames@python.org] on 
behalf of R. David Murray [rdmur...@bitdance.com]
Sent: Wednesday, January 08, 2014 21:29
To: python-dev@python.org
Subject: Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add   
bytes...)


...
It is true that in Python3 you *must* learn the difference between
bytes and strings.  But in the modern world, you had better learn to do
that anyway, and learn to do it right up front.  If you don't want to,
I suppose you could stay stuck in an earlier age and keep using Python2.

...

Python3's goal is to make it *easier* to do the *right* thing.  The fact
that in some cases it also makes it harder to do the wrong thing is
mostly a consequence of making it easier to do the right thing.


Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add bytes...)

2014-01-08 Thread Joao S. O. Bueno
On 8 January 2014 20:04, Kristján Valur Jónsson  wrote:
> Believe it or not, sometimes you really don't care about encodings.
> Sometimes you just want to parse text files.  Python 3 forces you to think 
> about abstract concepts like encodings when all you want is to open that .txt 
> file on the drive and extract some phone numbers and merge in some email 
> addresses.  What encoding does the file have?  Do I care?  Must I care?

Kristján, the answer is obviously "yes you must" :-)


Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add bytes...)

2014-01-08 Thread Victor Stinner
Hi,

>  Python 3 forces you to think about abstract concepts like encodings when all 
> you want is to open that .txt file on the drive and extract some phone 
> numbers and merge in some email addresses.

You can open a text file using ascii + surrogateescape, or just open
the file in binary.
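
A minimal sketch of the two options mentioned above. The file name and sample bytes are made up for illustration; the only assumption is that the data is ASCII-compatible with a few bytes of unknown encoding mixed in.

```python
import os
import tempfile

# Hypothetical sample data: ASCII text plus one byte (0xfa) of unknown encoding.
raw = b"name,phone\nJ\xfan,555-1234\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "contacts.txt")
    with open(path, "wb") as f:
        f.write(raw)

    # Option 1: decode as ASCII; undecodable bytes become lone surrogates.
    with open(path, encoding="ascii", errors="surrogateescape", newline="") as f:
        text = f.read()

    # Encoding with the same settings restores the original bytes exactly.
    round_tripped = text.encode("ascii", errors="surrogateescape")

    # Option 2: skip decoding entirely and work on bytes.
    with open(path, "rb") as f:
        data = f.read()
```

With option 1 the usual str methods work on the ASCII parts, and the unknown byte survives a round trip as long as the same error handler is used when writing.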

Victor


Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add bytes...)

2014-01-08 Thread R. David Murray
On Wed, 08 Jan 2014 22:04:56 +,  wrote:
> Believe it or not, sometimes you really don't care about encodings.
> Sometimes you just want to parse text files.  Python 3 forces you to
> think about abstract concepts like encodings when all you want is to
> open that .txt file on the drive and extract some phone numbers and
> merge in some email addresses.  What encoding does the file have?  Do
> I care?  Must I care?

Why *do* you care?  Isn't your system configured for utf-8, and all your
.txt files encoded with utf-8 by default?  Or at least configured
with a single consistent encoding?  If that's the case, Python3
doesn't make you think about the encoding.  Knowing the right encoding
is different from needing to know the difference between text and bytes;
you only need to worry about encodings when your system isn't configured
consistently to begin with.

If you do have to care, your little utilities only work by accident in
Python2, and must have produced mojibake when the encoding was wrong,
unless I'm completely confused.  So yeah, sorting that out is harder if
you were just living with the mojibake before...but if so I'm surprised
you haven't wanted to fix that before this.

--David


Re: [Python-Dev] Python3 "complexity"

2014-01-08 Thread Ben Finney
Kristján Valur Jónsson  writes:

> Believe it or not, sometimes you really don't care about encodings.
> Sometimes you just want to parse text files.

Files don't contain text, they contain bytes. Bytes only become text
when filtered through the correct encoding.

Python should not guess the encoding if it's unknown. Without the right
encoding, you don't get text, you get partial or complete gibberish.

So, if what you want is to parse text and not get gibberish, you need to
*tell* Python what the encoding is. That's a brute fact of the world of
text in computing.
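
A small illustration of the "gibberish" case, using a name from this thread as sample data. Decoding with the wrong encoding may not raise any error at all; it can silently produce the wrong text.

```python
original = "Kristján"
data = original.encode("utf-8")      # b'Kristj\xc3\xa1n'

right = data.decode("utf-8")
wrong = data.decode("latin-1")       # no error is raised, just wrong text

print(right)   # Kristján
print(wrong)   # KristjÃ¡n
```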

> Python 3 forces you to think about abstract concepts like encodings
> when all you want is to open that .txt file on the drive and extract
> some phone numbers and merge in some email addresses.  What encoding
> does the file have?  Do I care?  Must I care?

Yes, you must.

> Python forcing you to think about this is like the cashier at the
> hardware store who won't let you buy the hammer you brought to the
> cash register because you don't know what wood its handle is made of.

The cashier is making a mistake: the hammer, regardless of the wood in
the handle, still functions just fine as a hammer. Hence, the question
is unimportant to the purpose.

The same is not true of changing the encoding for text. The encoding
matters, and the programmer needs to care.

-- 
 \ “How wonderful that we have met with a paradox. Now we have |
  `\some hope of making progress.” —Niels Bohr |
_o__)  |
Ben Finney



Re: [Python-Dev] Python3 "complexity"

2014-01-08 Thread MRAB

On 2014-01-09 00:07, Ben Finney wrote:
> Kristján Valur Jónsson  writes:
>
>> Believe it or not, sometimes you really don't care about encodings.
>> Sometimes you just want to parse text files.
>
> Files don't contain text, they contain bytes. Bytes only become text
> when filtered through the correct encoding.
>
> Python should not guess the encoding if it's unknown. Without the right
> encoding, you don't get text, you get partial or complete gibberish.
>
> So, if what you want is to parse text and not get gibberish, you need to
> *tell* Python what the encoding is. That's a brute fact of the world of
> text in computing.
>
>> Python 3 forces you to think about abstract concepts like encodings
>> when all you want is to open that .txt file on the drive and extract
>> some phone numbers and merge in some email addresses.  What encoding
>> does the file have?  Do I care?  Must I care?
>
> Yes, you must.
>
>> Python forcing you to think about this is like the cashier at the
>> hardware store who won't let you buy the hammer you brought to the
>> cash register because you don't know what wood its handle is made of.
>
> The cashier is making a mistake: the hammer, regardless of the wood in
> the handle, still functions just fine as a hammer. Hence, the question
> is unimportant to the purpose.

On the other hand:

"I need a new battery."

"What kind of battery?"

"I don't care!"

> The same is not true of changing the encoding for text. The encoding
> matters, and the programmer needs to care.





Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add bytes...)

2014-01-08 Thread Isaac Morland

On Wed, 8 Jan 2014, Kristján Valur Jónsson wrote:

> Believe it or not, sometimes you really don't care about encodings.
> Sometimes you just want to parse text files.  Python 3 forces you to 
> think about abstract concepts like encodings when all you want is to 
> open that .txt file on the drive and extract some phone numbers and 
> merge in some email addresses.  What encoding does the file have?  Do I 
> care?  Must I care?


Mostly staying out of this, but I need to say something here.

If you don't know what encoding the file has, you don't know what bytes 
correspond to phone numbers.  So yes, you must care, or else you simply 
cannot write your code.


Of course, in practice, it's probably encoded in an ASCII-compatible 
encoding, so '0' encodes as the single byte 0x30.  Whether it's UTF-8, 
ISO-8859-1, or something else that is ASCII-compatible doesn't really 
matter.


So, as a practical matter, you can just use ISO-8859-1, even though in 
principle this is totally wrong.  Then ASCII is one byte per character as 
you expect, and all other bytes will round-trip unchanged.  Just don't do 
any non-trivial processing on non-ASCII characters.
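
A minimal sketch of that round trip, with made-up sample data. It relies only on the fact that ISO-8859-1 maps byte N to code point N for all 256 byte values.

```python
raw = bytes(range(256))              # every possible byte value

text = raw.decode("iso-8859-1")      # never fails: byte N -> code point N
round_tripped = text.encode("iso-8859-1")

# ASCII-range processing is safe even with unknown high bytes present.
line = b"J\xf3n,555-1234\n".decode("iso-8859-1")
name, phone = line.rstrip("\n").split(",")
out = (phone + "," + name + "\n").encode("iso-8859-1")
```

The high byte in the name passes through untouched, but the program has no idea which character it really stands for, which is why non-trivial processing of the non-ASCII part is off limits.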


I don't see how it could be made any simpler without going back to making 
it easy for people to pretend the issue doesn't exist at all and bringing 
back the attendant confusion and problems.


> I have lots of little utilities, to help me with day to day stuff like 
> this.  One fine morning I decided to start using Python 3 for the job. 
> Imagine my surprise when it turned out to make my job more complicated, 
> not easier.  Suddenly I had to start thinking about stuff that hadn't 
> mattered at all, and still didn't really matter.  All it did was 
> complicate things for no benefit.
>
> [...]
>
> All of this talk is positive, though.  The fact that these topics have 
> finally reached the halls of python-dev is an indication that people out 
> there are _trying_ to move to 3.3 :)


Agreed.

Isaac Morland   CSCF Web Guru
DC 2619, x36650 WWW Software Specialist


Re: [Python-Dev] Python3 "complexity"

2014-01-08 Thread Mark Lawrence

On 09/01/2014 00:21, MRAB wrote:

> "I need a new battery."
>
> "What kind of battery?"
>
> "I don't care!"


A neat summary of the draft requirements specification for Python 2.8.

--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence



Re: [Python-Dev] Python3 "complexity"

2014-01-08 Thread Kristján Valur Jónsson
Still playing the devil's advocate:
I didn't used to must.  Why must I must now?  Did the universe just shift when 
I fired up python3?
Things were demonstrably working just fine before without doing so.

K


From: Python-Dev [python-dev-bounces+kristjan=ccpgames@python.org] on 
behalf of Ben Finney [ben+pyt...@benfinney.id.au]
Sent: Thursday, January 09, 2014 00:07
To: python-dev@python.org
Subject: Re: [Python-Dev] Python3 "complexity"

Kristján Valur Jónsson  writes:

> Python 3 forces you to think about abstract concepts like encodings
> when all you want is to open that .txt file on the drive and extract
> some phone numbers and merge in some email addresses.  What encoding
> does the file have?  Do I care?  Must I care?

Yes, you must.


Re: [Python-Dev] Python3 "complexity"

2014-01-08 Thread Ben Finney
MRAB  writes:

> On 2014-01-09 00:07, Ben Finney wrote:
> > Kristján Valur Jónsson  writes:
> >> Python 3 forces you to think about abstract concepts like encodings
> >> when all you want is to open that .txt file on the drive and
> >> extract some phone numbers and merge in some email addresses. What
> >> encoding does the file have? Do I care? Must I care?
> >
> > Yes, you must.
> >
> >> Python forcing you to think about this is like the cashier at the
> >> hardware store who won't let you buy the hammer you brought to the
> >> cash register because you don't know what wood its handle is made
> >> of.
> >
> > The cashier is making a mistake: the hammer, regardless of the wood in
> > the handle, still functions just fine as a hammer. Hence, the question
> > is unimportant to the purpose.
>
> On the other hand:
>
> "I need a new battery."
>
> "What kind of battery?"
>
> "I don't care!"

That's a much better analogy. The customer may not care, but the
question is essential and must be answered; if the supplier guesses what
the customer wants, they are doing the customer a disservice.

If the customer insists the supplier just give them a battery which will
work regardless of what type of battery the device requires, the
*customer is wrong*. Such customers need to be educated about the
necessity to care about details they may have no interest in, if they
want to get their device working reliably.

We can all work toward a world where there is just one encoding which
works for all text and no other encodings to confuse the matter. Until
then, everyone needs to deal with the world as it is.

(good sigmonster, have a cookie)

-- 
 \ “Ours is a world where people don't know what they want and are |
  `\   willing to go through hell to get it.” —Donald Robert Perry |
_o__)  Marquis |
Ben Finney



Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add bytes...)

2014-01-08 Thread Kristján Valur Jónsson
Just to avoid confusion, let me state up front that I am very well aware of 
encodings and all that, having internationalized one largish app in python 2.x. 
 I know the problems that 2.x had with tracking down the source of errors and 
understand the beautiful concept of encodings on the boundary.

However:
For a  lot of data processing and tools, encoding isn't an issue.  Either you 
assume ascii, or you're working with something like latin1.  A single byte 
encoding.  This is because you're working with a text file that _you_ wrote.  
And you're not assigning any semantics to the characters.  If there is actual 
"text" in there it is just english, not Norwegian or Turkish. A byte read at 
code 0xfa doesn't mean anything special.  It's just that, a byte with that 
value.  The file system doesn't have any default encoding.  A file on disk is 
just a file on disk consisting of bytes.  There can never be any wrong 
encoding, no mojibake.

With python 2, you can read that file into a string object.  You can scan for 
your field delimiter, e.g. a comma, split up your string, interpolate some 
binary data, spit it out again.  All without ever thinking about encodings.  

Even though the file is conceptually encoded in something, if you insist on 
attaching a particular semantic meaning to every ordinal value, whatever that 
meaning is is in many cases irrelevant to the program.

I understand that surrogateescape allows you to do this.  But it is an awkward 
extra step and forces an extra layer of needless semantics on to that guy that 
just wants to read a file.  Sure, vegetarians and people with allergies like to 
read the list of ingredients on everything that they eat.  But others are just 
omnivores and want to be able to eat whatever is on the table, and not worry 
about what it is made of.
And yes, you can read the file in binary mode but then you end up with those 
bytes objects that we have just found that are tedious to work with.

So, what I'm saying is that at least I have a very common use case that has 
just become a) more confusing (having to needlessly derail the train of thought 
about the data processing to be done by thinking about text encodings) and b) 
more complicated.
Not sure if there is anything to be done about it though :)

I think there might be a different analogy:  Having to specify an encoding is 
like having strong typing.  In Python 2.7, we _can_ forego that and just 
duck-type our strings :)

K

From: Python-Dev [python-dev-bounces+kristjan=ccpgames@python.org] on 
behalf of R. David Murray [rdmur...@bitdance.com]
Sent: Wednesday, January 08, 2014 23:40
To: python-dev@python.org
Subject: Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add   
bytes...)


Why *do* you care?  Isn't your system configured for utf-8, and all your
.txt files encoded with utf-8 by default?  Or at least configured
with a single consistent encoding?  If that's the case, Python3
doesn't make you think about the encoding.  Knowing the right encoding
is different from needing to know the difference between text and bytes;
you only need to worry about encodings when your system isn't configured
consistently to begin with.

If you do have to care, your little utilities only work by accident in
Python2, and must have produced mojibake when the encoding was wrong,
unless I'm completely confused.  So yeah, sorting that out is harder if
you were just living with the mojibake before...but if so I'm surprised
you haven't wanted to fix that before this.


Re: [Python-Dev] Python3 "complexity"

2014-01-08 Thread Ben Finney
Kristján Valur Jónsson  writes:

> I didn't used to must.  Why must I must now?  Did the universe just
> shift when I fired up python3?

In a sense, yes. The world of software has been shifting for decades, as
a result of broader changes in how different segments of humanity have
changed their interactions, and thereby changed their expectations of
what computers can do with their data.

While for some programmers, in past decades, it used to be reasonable to
stick one's head in the sand and ignore all encodings except one
privileged local encoding, that is no longer reasonable today. As a
result, it is incumbent on any programmer working with text to care
about text encodings.

You've likely already seen it, but the point I'm making is better made
in this essay <http://www.joelonsoftware.com/articles/Unicode.html>.

-- 
 \己所不欲、勿施于人。 |
  `\(What is undesirable to you, do not do to others.) |
_o__) —孔夫子 Confucius, 551 BCE – 479 BCE |
Ben Finney



Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add bytes...)

2014-01-08 Thread Mark Lawrence

On 09/01/2014 00:12, Kristján Valur Jónsson wrote:

> Just to avoid confusion, let me state up front that I am very well aware of 
> encodings and all that, having internationalized one largish app in python 2.x. 
>  I know the problems that 2.x had with tracking down the source of errors and 
> understand the beautiful concept of encodings on the boundary.
>
> However:
> For a lot of data processing and tools, encoding isn't an issue.  Either you 
> assume ascii, or you're working with something like latin1.  A single byte 
> encoding.  This is because you're working with a text file that _you_ wrote.  
> And you're not assigning any semantics to the characters.  If there is actual 
> "text" in there it is just english, not Norwegian or Turkish. A byte read at 
> code 0xfa doesn't mean anything special.  It's just that, a byte with that 
> value.  The file system doesn't have any default encoding.  A file on disk is 
> just a file on disk consisting of bytes.  There can never be any wrong 
> encoding, no mojibake.
>
> With python 2, you can read that file into a string object.  You can scan for 
> your field delimiter, e.g. a comma, split up your string, interpolate some 
> binary data, spit it out again.  All without ever thinking about encodings.
>
> Even though the file is conceptually encoded in something, if you insist on 
> attaching a particular semantic meaning to every ordinal value, whatever that 
> meaning is is in many cases irrelevant to the program.
>
> I understand that surrogateescape allows you to do this.  But it is an awkward 
> extra step and forces an extra layer of needless semantics on to that guy that 
> just wants to read a file.  Sure, vegetarians and people with allergies like to 
> read the list of ingredients on everything that they eat.  But others are just 
> omnivores and want to be able to eat whatever is on the table, and not worry 
> about what it is made of.
> And yes, you can read the file in binary mode but then you end up with those 
> bytes objects that we have just found that are tedious to work with.


All I can say is that I've been using python 3 for years and wouldn't 
know what a surrogateescape was if you were to hit me around the head 
with it.  I open my files, I process them, and Python kindly closes them 
for me via a context manager.  So if you're not bothered about encoding, 
where has the "awkward extra step and forces an extra layer of needless 
semantics" bit come from?


--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence



Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add bytes...)

2014-01-08 Thread R. David Murray
On Thu, 09 Jan 2014 00:12:57 +,  wrote:
> I think there might be a different analogy:  Having to specify an
> encoding is like having strong typing.  In Python 2.7, we _can_ forego
> that and just duck-type our strings :)

Python is a strongly typed language.

Saying that python2 let you duck type bytestrings (ie: postpone the
decision as to what encoding they were in until the last minute) is an
interesting perspective...but as we know it led to many many program bugs.
Which were the result, essentially, of a failure to strongly type the
string and bytes types the way other python types are strongly typed.
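
A short illustration of that strong typing in Python 3, with made-up values. Mixing str and bytes raises an error instead of silently guessing an encoding.

```python
text = "phone: "
data = b"555-1234"
error = None

try:
    combined = text + data           # Python 3 refuses to mix str and bytes
except TypeError as exc:
    error = exc
    combined = text + data.decode("ascii")   # the decoding step is explicit
```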

However, I do now understand your use case better, even though I wouldn't
myself write programs like that.  Or, rather, I make sure all my files
are in the same encoding (utf-8).  I suppose that this is because I,
as an English-speaking USAian, came late to the need for non-ascii
characters, after utf-8 was already well established.  The rest of
the world didn't have that luxury.

--David


Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add bytes...)

2014-01-08 Thread Terry Reedy

On 1/8/2014 5:04 PM, Kristján Valur Jónsson wrote:

> Believe it or not, sometimes you really don't care about encodings.
> Sometimes you just want to parse text files.  Python 3 forces you to
> think about abstract concepts like encodings when all you want is to
> open that .txt file on the drive and extract some phone numbers and


I suspect that you would do that by looking for the bytes that can be 
interpreted as ascii digits. That will work fine as long as the .txt 
file has an ascii-compatible encoding. As soon as it does not, the 
little utility fails. It also fails with non-European digits, such as 
are used in Arabic and Indic writings.


Even if you are in an environment where all .txt files are encoded in 
utf-8, it will be easier to look for non-ascii digits in decoded unicode 
strings.
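
A minimal sketch of the difference, using a made-up sample string that contains Arabic-Indic digits followed by ASCII digits. It assumes the standard `re` module's default behaviour: `\d` in a str pattern matches any Unicode decimal digit, while `\d` in a bytes pattern matches only ASCII 0-9.

```python
import re

# U+0660..U+0664 are the Arabic-Indic digits zero to four.
text = "tel: \u0660\u0661\u0662\u0663\u0664 / 01234"
data = text.encode("utf-8")

# On str, \d matches any Unicode decimal digit by default.
str_hits = re.findall(r"\d+", text)

# On bytes, \d only matches the ASCII bytes 0-9.
bytes_hits = re.findall(rb"\d+", data)

# int() also understands Unicode decimal digits in str.
value = int(str_hits[0])
```

The bytes search finds only the ASCII run, while the str search finds both.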



> merge in some email addresses.  What encoding does the file have?  Do
> I care?  Must I care?


If the email addresses have non-ascii characters, then you must.

...

> All this talk is positive, though.  The fact that these topics
> have finally reached the halls of python-dev is an indication that
> people out there are _trying_ to move to 3.3 :)


That is an interesting observation, worth keeping in mind among the turmoil.

--
Terry Jan Reedy




Re: [Python-Dev] Python3 "complexity"

2014-01-08 Thread Chris Angelico
On Thu, Jan 9, 2014 at 11:21 AM, MRAB  wrote:
> On the other hand:
>
> "I need a new battery."
>
> "What kind of battery?"
>
> "I don't care!"

Or, bringing it back to Python: How do you write a set out to a file?

foo = {1, 2, 4, 8, 16, 32}
open("foo.txt","w").write(foo)  # Uh... nope!

I don't want to have to worry about how it's formatted! I just want to
write that set out and have someone read it in later!
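
To make the point concrete, here is a sketch in which the writer has to pick a format. JSON is just one possible choice (an assumption for illustration), and `io.StringIO` stands in for the text file.

```python
import io
import json

foo = {1, 2, 4, 8, 16, 32}
error = None

try:
    io.StringIO().write(foo)         # a text stream only accepts str
except TypeError as exc:
    error = exc

# One possible choice of format: a sorted JSON array.
serialized = json.dumps(sorted(foo))

# The reader has to know the format to get the set back.
restored = set(json.loads(serialized))
```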

A text string is just as abstract as any other complex type. For some
reason, we've grown up thinking that "ABCD" == \x41\x42\x43\x44 ==
"ABCD", even though it's just as logical for those bytes to represent
12.1414 or 1094861636 or 1145258561. There's no difference between
encoding one thing to bytes and encoding another thing to bytes, and
it's critical to get those encodes/decodes right.
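
The numbers above can be checked with the standard `struct` module. This sketch interprets the same four bytes as ASCII text, as a big-endian and a little-endian unsigned 32-bit integer, and as a big-endian IEEE 754 single-precision float.

```python
import struct

data = b"ABCD"                               # the bytes 0x41 0x42 0x43 0x44

as_text = data.decode("ascii")               # 'ABCD'
(as_be_int,) = struct.unpack(">I", data)     # 1094861636
(as_le_int,) = struct.unpack("<I", data)     # 1145258561
(as_float,) = struct.unpack(">f", data)      # about 12.1414
```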

ChrisA


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread INADA Naoki
> And I think everyone was well intentioned - and python3 covers most of the
> bases, but working with binary data is not only a "wire-protocol
> programmer's" problem.  Needing a library to wrap
> bytesthing.format('ascii', 'surrogateescape') or some such thing makes
> python3 less approachable for those who haven't learned that yet - which
> was almost all of us at some point when we started programming.
>
Totally agree with you.


-- 
INADA Naoki  


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread MRAB

On 2014-01-06 13:24, Victor Stinner wrote:
> Hi,
>
> bytes % args and bytes.format(args) are requested by Mercurial and
> Twisted projects. The issue #3982 was stuck because nobody proposed a
> complete definition of the "new" features. Here is a try as a PEP.
>
> The PEP is a draft with open questions. First, I'm not sure that both
> bytes%args and bytes.format(args) are needed. The implementation of
> .format() is more complex, so why not only adding bytes%args? Then,
> the following points must be decided to define the complete list of
> supported features (formatters):
>
> * Format integer to hexadecimal? ``%x`` and ``%X``
> * Format integer to octal? ``%o``
> * Format integer to binary? ``{!b}``
> * Alignment?
> * Truncating? Truncate or raise an error?
> * format keywords? ``b'{arg}'.format(arg=5)``
> * ``str % dict`` ? ``b'%(arg)s' % {'arg': 5}``
> * Floating point number?
> * ``%i``, ``%u`` and ``%d`` formats for integer numbers?
> * Signed number? ``%+i`` and ``%-i``


I'm thinking that the "i" format could be used for signed integers and
the "u" for unsigned integers. The width would be the number of bytes.
You would also need to have a way of specifying the endianness.

For example:

>>> b'{:<2i}'.format(256)
b'\x01\x00'
>>> b'{:>2i}'.format(256)
b'\x00\x01'

Perhaps the width should default to 1 in the cases of "i" and "u":

>>> b'{:i}'.format(-1)
b'\xFF'
>>> b'{:u}'.format(255)
b'\xFF'
>>> b'{:i}'.format(255)
ValueError: ...
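
For comparison, the byte layouts sketched above can already be produced with the existing `int.to_bytes` method, where the width and endianness are explicit arguments. This is only an illustration of the intended output, not the proposed format syntax; note that `to_bytes` raises OverflowError rather than ValueError when the value does not fit.

```python
big = (256).to_bytes(2, "big")                    # b'\x01\x00'
little = (256).to_bytes(2, "little")              # b'\x00\x01'
signed = (-1).to_bytes(1, "big", signed=True)     # b'\xff'
unsigned = (255).to_bytes(1, "big")               # b'\xff'

overflow = None
try:
    (255).to_bytes(1, "big", signed=True)         # does not fit in a signed byte
except OverflowError as exc:
    overflow = exc
```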

Interestingly, I've just been checking what exception is raised for
some format types, and I got this:

>>> '{:c}'.format(-1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
OverflowError: %c arg not in range(0x110000)

Should the exception be OverflowError (probably yes), and should the
message say "%c"?



Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add bytes...)

2014-01-08 Thread Dan Stromberg
On Wed, Jan 8, 2014 at 2:04 PM, Kristján Valur Jónsson
 wrote:
>
> Believe it or not, sometimes you really don't care about encodings.
> Sometimes you just want to parse text files.  Python 3 forces you to think 
> about abstract concepts like encodings when all you want is to open that .txt 
> file on the drive and extract some phone numbers and merge in some email 
> addresses.  What encoding does the file have?  Do I care?  Must I care?

If computers had taken off in China before the USA, you'd probably be
wondering why some Chinese refuse to care about encodings, when the
rest of the world clearly needs them.

Yes, you really should care about encodings.  No, it's not quite as
simple as it once was for English speakers.  It was formerly simple
(for us) because we were effectively pressing everyone else to read
and write English.

If you want to keep things close to what you're used to, use latin-1
as your encoding.  It's still a choice, and not a great one for
user-facing text, but if you want to be simplistic about it, that's a
way to do it.

That said, there will be some text that isn't user-facing, EG in a
network protocol.  This is probably what all the fuss is about.  But
like I said, this can be done with latin-1.


Re: [Python-Dev] RFC: PEP 460: Add bytes % args and bytes.format(args) to Python 3.5

2014-01-08 Thread Stephen J. Turnbull
Antoine Pitrou writes:

 > However, interpolating a bytes object isn't out of place, and it is
 > what a minimal "formatting" primitive could do.

Something like this?

# VERY incomplete pseudo-code
class str:

    # new method
    # fmtstring has syntax of .format method's spec, maybe adding a 'B'
    # for "insert Blob of bytes" spec
    def format_for_wire(fmtstring, args, encoding='utf-8', errors='strict'):

        result = b''

        # gotta go to a meeting, exercise for reader :-(
        parts = zip_specs_and_args(fmtstring, args)

        for spec, arg in parts:
            if spec == 'B' and isinstance(arg, bytes):
                result += arg
            else:
                # built-in format() takes the value first, then the spec
                partial = format(arg, spec)
                result += partial.encode(encoding=encoding, errors=errors)

        return result

Maybe format_to_bytes is a more accurate name.

I have no idea how to do this for %-formatting though. :-(

And I have the sneaking suspicion that it *can't* be this easy. :-(

Can it? :-)
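
One way the idea could be made runnable today, as a rough sketch only. It is written as a free function with the hypothetical name `format_to_bytes`, uses the standard `string.Formatter.parse` to split the format string, supports only auto-numbered `{}` fields, and treats `B` as the made-up "insert bytes as-is" spec from the pseudo-code.

```python
from string import Formatter


def format_to_bytes(fmtstring, *args, encoding="utf-8", errors="strict"):
    """Hypothetical helper: encode formatted text, splice 'B' fields in raw."""
    result = b""
    arg_iter = iter(args)
    for literal, field, spec, conversion in Formatter().parse(fmtstring):
        # Literal text between fields is encoded like any other text.
        result += literal.encode(encoding, errors)
        if field is None:
            continue  # trailing literal text with no replacement field
        arg = next(arg_iter)  # only auto-numbered "{}" fields are handled
        if spec == "B":
            if not isinstance(arg, bytes):
                raise TypeError("'B' fields require a bytes argument")
            result += arg
        else:
            result += format(arg, spec).encode(encoding, errors)
    return result


message = format_to_bytes("Content-Length: {:d}\r\n\r\n{:B}", 3, b"\xff\x00\x01")
```

Conversions such as `!r`, named or numbered fields, and nested specs are ignored here, so this falls well short of a complete answer to the question.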


Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add bytes...)

2014-01-08 Thread Greg Ewing

Kristján Valur Jónsson wrote:

> all you want is to open that .txt
> file on the drive and extract some phone numbers and merge in some email
> addresses. What encoding does the file have? Do I care? Must I care?


To some extent, yes. If the encoding happens to be an
ascii-compatible one, such as latin-1 or utf-8, you can
probably extract the phone numbers without caring what
the rest of the bytes mean. But not if it's utf-16,
for example.

If you know that all the files on your system have an
ascii-compatible encoding, you can use the surrogateescape
error handler to avoid having to know about the exact
encoding. Granted, that makes it slightly more complicated
than it was in Python 2, but not much.
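
A small sketch of why ASCII compatibility matters, with a made-up phone number. The same byte-level pattern finds the number in UTF-8 and Latin-1 data but not in UTF-16, where every ASCII character is followed by a NUL byte.

```python
import re

phone = "555-1234"

utf8 = phone.encode("utf-8")
latin1 = phone.encode("latin-1")
utf16 = phone.encode("utf-16-le")    # b'5\x005\x005\x00-\x00...'

pattern = re.compile(rb"\d{3}-\d{4}")

found_utf8 = pattern.search(utf8) is not None
found_latin1 = pattern.search(latin1) is not None
found_utf16 = pattern.search(utf16) is not None
```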

--
Greg


Re: [Python-Dev] Python3 "complexity"

2014-01-08 Thread Stephen J. Turnbull
Kristján Valur Jónsson writes:

 > Still playing the devil's advocate:
 > I didn't used to must.  Why must I must now?  Did the universe just
 > shift when I fired up python3?

No.  Go look at the Economist's tag cloud and notice how big "China"
and "India" are most days.  The universe has been shifting for 3
decades now, you just noticed it when you fired up Python 3.

 > Things were demonstrably working just fine before without doing
 > so.

Who elected you General Secretary of the UN?  Things were, and are
still, demonstrably fucked up for the world at large.  Python 3 is a
big contribution to un-fucking the rest of us[1], thank you very much
to Guido and Company!

It's not obvious how to do things right for those of us who have to
deal with 8-10 different encodings daily *on our desktops*, and still
make things easy for those of you who rarely see ISO 8859/N for N !=
1, let alone monstrosities like GB18030 or Shift JIS.  That latter is
a shame, but we're working on it (and have been all along -- it's not
easy).

Footnotes: 
[1]  Or will be when my employer adopts it. 




Re: [Python-Dev] Python3 "complexity"

2014-01-08 Thread Stephen J. Turnbull
Ben Finney writes:

 > That's a much better analogy. The customer may not care, but the
 > question is essential and must be answered; if the supplier guesses what
 > the customer wants, they are doing the customer a disservice.

It is a much better analogy for me on my desktop, and for programmers
working for global enterprises, too.  It is not for Kristján, nor for
many other American, European, and yes, even Australian programmers.

You're making the same kind of mistake he is (although I personally
benefit from your mistake, and have suffered for decades from his :-).

Diff'rent folks, diff'rent strokes.  It would be nice if we could
serve both use cases *by default*.  We haven't found the way yet,
that's all.




Re: [Python-Dev] Python3 "complexity"

2014-01-08 Thread Lennart Regebro
On Thu, Jan 9, 2014 at 1:07 AM, Ben Finney  wrote:
> Kristján Valur Jónsson  writes:
>
>> Believe it or not, sometimes you really don't care about encodings.
>> Sometimes you just want to parse text files.
>
> Files don't contain text, they contain bytes. Bytes only become text
> when filtered through the correct encoding.

To be honest, you can define text as "A stream of bytes that are split
up in lines separated by a linefeed", and do some basic text
processing like that. Just very *basic*, but still. Replacing
characters. Extracting certain lines etc.

This is harder in Python 3, as bytes does not have all the
functionality strings has, like formatting. This can probably be fixed
in Python 3.5, if the relevant PEP gets finished.
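
As a rough illustration of that kind of basic processing, here is a sketch
that works purely on bytes. The sample data and the choice of "#" as a
comment marker are assumptions made up for the example.

```python
# Hypothetical input: linefeed-separated records in an unknown,
# ASCII-compatible encoding.
data = b"name: Bob\n# a comment\nphone: 555-1234\n"

# Split into lines on the linefeed byte.
lines = data.split(b"\n")

# Extract certain lines: drop empty lines and comment lines.
kept = [ln for ln in lines if ln and not ln.startswith(b"#")]

# Replace characters, then join the lines back together.
result = b"\n".join(ln.replace(b": ", b"=") for ln in kept)
```

Splitting, filtering, replacing, and joining all work on bytes; it is
formatting values into a bytes template that needs the workarounds.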

For the battery analogy, that's like saying:

"I want a battery."

"What kind?"

"It doesn't matter, as long as it's over 5V."

//Lennart


Re: [Python-Dev] Python3 "complexity"

2014-01-08 Thread Mark Lawrence

On 09/01/2014 06:50, Lennart Regebro wrote:
> On Thu, Jan 9, 2014 at 1:07 AM, Ben Finney  wrote:
>> Kristján Valur Jónsson  writes:
>>
>>> Believe it or not, sometimes you really don't care about encodings.
>>> Sometimes you just want to parse text files.
>>
>> Files don't contain text, they contain bytes. Bytes only become text
>> when filtered through the correct encoding.
>
> To be honest, you can define text as "A stream of bytes that are split
> up in lines separated by a linefeed", and do some basic text
> processing like that. Just very *basic*, but still. Replacing
> characters. Extracting certain lines etc.
>
> This is harder in Python 3, as bytes does not have all the
> functionality strings has, like formatting. This can probably be fixed
> in Python 3.5, if the relevant PEP gets finished.
>
> For the battery analogy, that's like saying:
>
> "I want a battery."
>
> "What kind?"
>
> "It doesn't matter, as long as it's over 5V."
>
> //Lennart



"That Python 3 battery you sold me blew up when I tried using it".

"We've been telling you for years that could happen".

"I didn't think you actually meant it".

--
My fellow Pythonistas, ask not what our language can do for you, ask 
what you can do for our language.


Mark Lawrence



Re: [Python-Dev] Python3 "complexity"

2014-01-08 Thread Nick Coghlan
On 9 January 2014 10:07, Ben Finney  wrote:
> Kristján Valur Jónsson  writes:
>
>> Believe it or not, sometimes you really don't care about encodings.
>> Sometimes you just want to parse text files.
>
> Files don't contain text, they contain bytes. Bytes only become text
> when filtered through the correct encoding.
>
> Python should not guess the encoding if it's unknown. Without the right
> encoding, you don't get text, you get partial or complete gibberish.
>
> So, if what you want is to parse text and not get gibberish, you need to
> *tell* Python what the encoding is. That's a brute fact of the world of
> text in computing.

Set the mode to "rb", process it as binary. Done.

See 
http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html
for details.
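
A minimal sketch of the binary-mode approach, assuming the file uses an
ASCII-compatible encoding. The temporary file, its contents, and the
phone-number pattern are made up for the example.

```python
import os
import re
import tempfile

# Hypothetical sample file. It happens to be UTF-8, but the code below
# never needs to know that.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write("Kristj\u00e1n: 555-1234\nBen: 555-9876\n".encode("utf-8"))
    path = tmp.name

# Open in binary mode and match with a bytes pattern; no decoding happens.
pattern = re.compile(br"\d{3}-\d{4}")
with open(path, "rb") as f:
    phones = [m.group() for line in f for m in pattern.finditer(line)]

os.remove(path)
```

The results are bytes objects, so any further handling of them stays in the
binary domain until something explicitly decodes them.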

Cheers,
Nick.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Python3 "complexity"

2014-01-08 Thread Nick Coghlan
On 9 January 2014 10:22, Kristján Valur Jónsson  wrote:
> Still playing the devil's advocate:
> I didn't used to must.  Why must I must now?  Did the universe just shift 
> when I fired up python3?
> Things were demonstrably working just fine before without doing so.

They were working fine for experienced POSIX users that had fully
internalised the idiosyncrasies of that platform and didn't need to
care about any other environment (like Windows or the JVM).

Cheers,
Nick.

>
> K
>
> 
> From: Python-Dev [python-dev-bounces+kristjan=ccpgames@python.org] on 
> behalf of Ben Finney [ben+pyt...@benfinney.id.au]
> Sent: Thursday, January 09, 2014 00:07
> To: python-dev@python.org
> Subject: Re: [Python-Dev] Python3 "complexity"
>
> Kristján Valur Jónsson  writes:
>
>> Python 3 forces you to think about abstract concepts like encodings
>> when all you want is to open that .txt file on the drive and extract
>> some phone numbers and merge in some email addresses.  What encoding
>> does the file have?  Do I care?  Must I care?
>
> Yes, you must.



-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia


Re: [Python-Dev] Python3 "complexity"

2014-01-08 Thread Ben Finney
Nick Coghlan  writes:

> On 9 January 2014 10:07, Ben Finney  wrote:
> > Kristján Valur Jónsson  writes:
> >
> >> Believe it or not, sometimes you really don't care about encodings.
> >> Sometimes you just want to parse text files.
> >
> > Files don't contain text, they contain bytes. Bytes only become text
> > when filtered through the correct encoding.
[…]

> Set the mode to "rb", process it as binary. Done.

Which entails abandoning the stated goal of “just want to parse text
files” :-)

-- 
 \ “All television is educational television. The question is: |
  `\   what is it teaching?” —Nicholas Johnson |
_o__)  |
Ben Finney



Re: [Python-Dev] Python3 "complexity" (was RFC: PEP 460: Add bytes...)

2014-01-08 Thread Nick Coghlan
On 9 January 2014 15:22, Greg Ewing  wrote:
> Kristján Valur Jónsson wrote:
>>
>> all you want is to open that .txt
>> file on the drive and extract some phone numbers and merge in some email
>> addresses. What encoding does the file have? Do I care? Must I care?
>
>
> To some extent, yes. If the encoding happens to be an
> ascii-compatible one, such as latin-1 or utf-8, you can
> probably extract the phone numbers without caring what
> the rest of the bytes mean. But not if it's utf-16,
> for example.
>
> If you know that all the files on your system have an
> ascii-compatible encoding, you can use the surrogateescape
> error handler to avoid having to know about the exact
> encoding. Granted, that makes it slightly more complicated
> than it was in Python 2, but not much.

There's also the fact that POSIX folks are used to "r" and "rb" being
the same thing.

Python 3 chose to make the default behaviour be to open files as text
files in the default system encoding. This created two significant
user visible changes:

- POSIX users could no longer ignore the difference between binary
mode and text mode when opening files (Windows users have always had
to care due to the line ending problem)

- POSIX users could no longer ignore locale configuration errors

We're aiming to resolve the most common locale configuration issue by
configuring surrogateescape on the standard streams when the OS claims
that the default encoding is ASCII, but ultimately, the long term fix is
for POSIX platforms to standardise on and consistently report UTF-8 as
the system encoding (as well as configuring ssh environments properly
by default).
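
To illustrate why surrogateescape helps in that situation, here is a small
sketch that simulates a stream the OS claims is ASCII while the data is
really UTF-8. It uses an in-memory io.TextIOWrapper rather than the real
standard streams, and the helper name read_as is made up for the example.

```python
import io

# UTF-8 data arriving on a stream that is (wrongly) configured as ASCII.
raw = "caf\u00e9\n".encode("utf-8")

def read_as(encoding, errors):
    """Read the sample bytes through a text layer with the given settings."""
    stream = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, errors=errors)
    return stream.read()

# With the strict handler, the misconfiguration surfaces as an exception.
try:
    read_as("ascii", "strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# With surrogateescape, the undecodable bytes are smuggled through as lone
# surrogates and can be written back out unchanged.
escaped = read_as("ascii", "surrogateescape")
restored = escaped.encode("ascii", errors="surrogateescape")
```

The text is still not decoded correctly in the second case, but the
original bytes survive a round trip instead of causing a failure.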

Python 2 is *very* much a POSIX-first language, with Windows, the JVM
and other non-POSIX environments as an afterthought. Python 3
intentionally offers more consistent cross-platform behaviour, which
means it no longer aligns as neatly with the sensibilities of
experienced users of POSIX systems.

Cheers,
Nick.

>
> --
> Greg
>



-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia