[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread Petr Viktorin

On 01. 11. 21 13:17, Petr Viktorin wrote:

Hello,
Today, an attack called "Trojan source" was revealed, where a malicious 
contributor can use Unicode features (bidirectional text and homoglyphs) 
to write code that, when shown in an editor, will look different from how a 
computer language parser will process it.

See https://trojansource.codes/, CVE-2021-42574 and CVE-2021-42694.

This is not a bug in Python.
As far as I know, the Python Security Response team reviewed the report 
and decided that it should be handled in code editors, diff viewers, 
repository frontends and similar software, rather than in the language.


I agree: in my opinion, the attack is similar to abusing any other 
"gotcha" where Python doesn't parse text as a non-expert human would. 
For example: `if a or b == 'yes'`, mutable default arguments, or a 
misleading typo.
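
The first two gotchas can be reproduced in a few lines. This is only an illustrative sketch; the function names `looks_like_yes_check` and `append_item` are made up for this example.

```python
def looks_like_yes_check(a, b):
    # Parsed as `a or (b == 'yes')`, not `(a == 'yes') or (b == 'yes')`.
    return bool(a or b == 'yes')


def append_item(item, bucket=[]):
    # The default list is created once, when the function is defined,
    # and is shared by every call that does not pass `bucket`.
    bucket.append(item)
    return bucket


print(looks_like_yes_check('no', 'no'))  # the non-empty string 'no' is truthy
first = append_item(1)
second = append_item(2)
print(first is second)                   # both calls returned the same list
```

In both cases the parser does exactly what the language reference says; the surprise is only on the reader's side.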


Nevertheless, I did do a bit of research about similar gotchas in 
Python, and I'd like to publish a summary as an informational PEP, 
pasted below.



Thanks for the comments, everyone! I've updated the document and sent it 
to https://github.com/python/peps/pull/2129
A rendered version is at 
https://github.com/encukou/peps/blob/pep-0672/pep-0672.rst




Toshio Kuratomi wrote:

  `Unicode`_ is a system for handling all kinds of written language.
It aims to allow any character from any human natural language (as
well as a few characters which are not from natural languages) to be
used. Python code may consist of almost all valid Unicode characters.


Thanks! That's a nice summary; I condensed it a bit more and used it.
(I'm not joining the conversation on glyphs, characters, codepoints and 
encodings -- that's much too technical for this document. Using the 
specific technical terms unfortunately doesn't help understanding, so I 
use the vague ones like "character" and "letter".)



Jim J. Jewett wrote:

"The East Asian symbol for *ten* looks like a plus sign, so ``十= 10`` is a complete 
Python statement."


Normally, an identifier must begin with a letter, and numbers can only be used in the 
second and subsequent positions.  (XID_CONTINUE instead of XID_START)  The fact that some 
characters with numeric values are considered letters (in this case, category Lo, Other 
Letters) is a different problem than just looking visually confusable with "+", 
and it should probably be listed on its own.


I'm not a native speaker, but as I understand it, "十" is closer to a 
single-letter word than a single-digit number. It translates better as 
"ten" than "10". (And it appears in "十四", "fourteen", just like "four" 
appears in "fourteen".)
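
The properties under discussion can be checked with the standard `unicodedata` module. A minimal sketch; the variable names are chosen only for this example.

```python
import unicodedata

ten = "\u5341"  # the character 十

print(unicodedata.category(ten))  # 'Lo': Other Letter
print(unicodedata.numeric(ten))   # 10.0: it also carries a numeric value
print(ten.isidentifier())         # True: it may start an identifier

namespace = {}
exec(ten + "= 10", namespace)     # the statement quoted above
print(namespace[ten])             # 10
```

So the character is both a letter (for identifier purposes) and a character with a numeric value, which is the distinction Jim points out.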



Patrick Schultz wrote:

- The Unicode consortium has a list of confusables, in case useful


Yup, and it's linked from the documents that describe how to use it. I 
link to those rather than just the list.

But thank you!


Terry Reedy wrote:

Bidirectional Text
------------------

Some scripts, such as Hebrew or Arabic, are written right-to-left.


[Suggested addition, subject to further revision.]

There are at least three levels of handling right-to-left characters: none, local 
(contiguous sequences are properly reversed), and extended (see below).  The handling 
depends on the display software and may depend on the quoting.  Tk and hence 
tkinter (and IDLE) text widgets do local handling.  Windows Notepad++ does local 
handling of unquoted code but extended handling of quoted text.  Windows 
Notepad currently does extended handling even without quotes.


I'd like to leave these details out of the document. The examples should 
render convincingly in browsers. The text should now describe the 
behavior even if you open it in an editor that does things differently, 
and acknowledge that such editors exist. (The behavior of specific 
editors/toolkits might well change in the future.)



For example, with ``encoding: unicode_escape``, characters like
quotes or braces can be hidden in an (f-)string, with many tools (syntax
highlighters, linters, etc.) considering them part of the string.
For example::


I don't see the connection between the text above and the example that follows.


# For writing Japanese, you don't need an editor that supports
# UTF-8 source encoding: unicode_escape sequences work just as well.

[etc]


Let me know if it's clear in the newest version, with this note:


Here, ``encoding: unicode_escape`` in the initial comment is an encoding
declaration. The ``unicode_escape`` encoding instructs Python to treat
``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
a comma (punctuator), etc.
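
The effect of the decoding step can be seen without writing a file, by decoding the same bytes two ways. This is a sketch under the assumption that applying `bytes.decode("unicode_escape")` by hand mirrors what the encoding declaration does to the source; the variable names are invented for the example.

```python
# The bytes contain literal backslash-u sequences, not quote characters.
raw = rb"greeting = 'hello\u0027 + \u0027 world'"

# Read as UTF-8, the line holds ONE string literal containing two escapes.
as_utf8 = {}
exec(raw.decode("utf-8"), as_utf8)
print(repr(as_utf8["greeting"]))

# Decoded with unicode_escape first, the escapes become real quotes,
# so the parser sees TWO string literals joined by a plus sign.
decoded = raw.decode("unicode_escape")
as_escape = {}
exec(decoded, as_escape)
print(decoded)
print(repr(as_escape["greeting"]))
```

A tool that highlights the file as UTF-8 shows the first reading, while Python executes the second.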



Steven D'Aprano wrote:

Before the age of computers, most mechanical typewriters lacked the keys 
for the digits ``0`` and ``1``


I'm not sure that "most" is justified here. One of the most popular 
typewriters in history, the Underwood #5 (from 1900 to 1920), lacked 
the 1 key but had a 0 distinct from O.


https://i1.wp.com/curiousasacathy.com/wp-content/uploads/2016/04/underwood-no-5-standard-typewriter-circa-1901.jpg

[Python-Dev] Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-02 Thread Petr Viktorin



On 01. 11. 21 18:32, Serhiy Storchaka wrote:

This is excellent!

01.11.21 14:17, Petr Viktorin wrote:

CPython treats the control character NUL (``\0``) as end of input,
but many editors simply skip it, possibly showing code that Python
will not run as a regular part of a file.


It is an implementation detail and we will get rid of it. It only
happens when you read the Python script from a file. If you import it as
a module or run with runpy, the NUL character is an error.
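
The "NUL is an error" side of this can be observed with `compile()`. A minimal sketch, assuming that calling `compile()` on a string is a fair stand-in for the import path; the helper name `rejects_nul` is made up here.

```python
def rejects_nul(source):
    """Return True if compile() refuses `source`."""
    try:
        compile(source, "<example>", "exec")
    except (ValueError, SyntaxError):
        # The exception type differs between Python versions.
        return True
    return False


print(rejects_nul("print('visible')\0print('hidden')"))
print(rejects_nul("print('visible')"))
```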


That brings us to possible changes in Python in this area, which is an 
interesting topic.


As for \0, can we ban all ASCII & C1 control characters except 
whitespace? I see no place for them in source code.



For homoglyphs/confusables, should there be a SyntaxWarning when an 
identifier looks like ASCII but isn't?


For right-to-left text: does anyone actually name identifiers in 
Hebrew/Arabic? AFAIK, we should allow a few non-printing 
"joiner"/"non-joiner" characters to make it possible to use all Arabic 
words. But it would be great to consult with users/teachers of the 
languages.
Should Python run the bidi algorithm when parsing and disallow reordered 
tokens? Maybe optionally?
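
Short of running the full bidi algorithm, a tool can at least report the explicit bidi control characters. The following is a sketch of such a detector, not an implementation of the algorithm itself; `find_bidi_controls` and `BIDI_CONTROL_CLASSES` are names invented for this example.

```python
import unicodedata

# Bidi classes of the explicit embedding, override and isolate controls.
BIDI_CONTROL_CLASSES = {
    "LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI",
}


def find_bidi_controls(source):
    """Return (line, column, character name) for each bidi control found."""
    hits = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for col, ch in enumerate(line):
            if unicodedata.bidirectional(ch) in BIDI_CONTROL_CLASSES:
                hits.append((lineno, col, unicodedata.name(ch, "UNKNOWN")))
    return hits


sample = "ok = 1\nname = 'abc\u202e'  # comment\n"
print(find_bidi_controls(sample))
```

Such a check does not decide whether a use is legitimate; it only makes the invisible characters visible to a reviewer.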

___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/TGB377QWGIDPUWMAJSZLT22ERGPNZ5FZ/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-02 Thread Serhiy Storchaka
02.11.21 16:16, Petr Viktorin wrote:
> As for \0, can we ban all ASCII & C1 control characters except
> whitespace? I see no place for them in source code.

All control characters except CR, LF, TAB and FF are banned outside
comments and string literals. I think it is worth banning them in
comments and string literals too. In string literals you can use
backslash-escape sequences, and comments should be human readable; there
is no reason to include control characters in them. There is a
precedent of emitting warnings for some superficial escapes in strings.
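
A check along these lines is easy to express outside the parser. This is a sketch of what a linter might do, not how CPython's tokenizer works; `find_control_characters` and `ALLOWED_CONTROLS` are hypothetical names.

```python
import unicodedata

ALLOWED_CONTROLS = {"\r", "\n", "\t", "\f"}


def find_control_characters(source):
    """Return (offset, code point) for each disallowed control character."""
    return [
        (offset, f"U+{ord(ch):04X}")
        for offset, ch in enumerate(source)
        # General category Cc covers both the ASCII (C0) and C1 controls.
        if unicodedata.category(ch) == "Cc" and ch not in ALLOWED_CONTROLS
    ]


print(find_control_characters("x = 1  # bell: \x07\n"))
```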


> For homoglyphs/confusables, should there be a SyntaxWarning when an
> identifier looks like ASCII but isn't?

It would virtually ban Cyrillic. There are a lot of Cyrillic letters
which look like Latin letters, and there are complete words written in
Cyrillic which by accident look like other words written in Latin.

It is a job for linters, which can have many options for configuring
acceptable scripts, use spelling dictionaries and dictionaries of
homoglyphs, etc.
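
As a rough idea of the starting point for such a linter, the standard `tokenize` module can list the identifiers that contain non-ASCII characters. This is only a sketch; `non_ascii_identifiers` is a made-up helper, and a real tool would add the configurable script and dictionary checks described above.

```python
import io
import tokenize


def non_ascii_identifiers(source):
    """Return the set of NAME tokens that contain non-ASCII characters."""
    found = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME and not tok.string.isascii():
            found.add(tok.string)
    return found


# The first name contains U+0441 (Cyrillic small letter es), the second is ASCII.
sample = "s\u0441ope = 1\nscope = 2\n"
print(non_ascii_identifiers(sample))
```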

Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/DN24FK3A2DSO4HBGEDGJXERSAUYK6VK6/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread Chris Angelico
On Wed, Nov 3, 2021 at 1:06 AM Petr Viktorin  wrote:
> Let me know if it's clear in the newest version, with this note:
>
> > Here, ``encoding: unicode_escape`` in the initial comment is an encoding
> > declaration. The ``unicode_escape`` encoding instructs Python to treat
> > ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
> > a comma (punctuator), etc.
>

Huh. Is that level of generality actually still needed? Can Python
deprecate all but a small handful of encodings?

ChrisA

Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/WA7P7YLY7N6CGF7N5G6DVG3PIA24BPS7/


[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-02 Thread Jim J. Jewett
Serhiy Storchaka wrote:
> 02.11.21 16:16, Petr Viktorin wrote:
> > As for \0, can we ban all ASCII & C1 control characters except
> > whitespace? I see no place for them in source code.

> All control characters except CR, LF, TAB and FF are banned outside
> comments and string literals. I think it is worth banning them in
> comments and string literals too. In string literals you can use
> backslash-escape sequences, and comments should be human readable; there
> is no reason to include control characters in them.

If escape sequences were also allowed in comments (or at least in strings 
within comments), this would make sense.  I don't like banning them otherwise, 
since odd characters are often a good reason to need a comment, but it is 
definitely a "mention, not use" situation.

> > For homoglyphs/confusables, should there be a SyntaxWarning when an
> > identifier looks like ASCII but isn't?

> It would virtually ban Cyrillic. There are a lot of Cyrillic letters
> which look like Latin letters, and there are complete words written in
> Cyrillic which by accident look like other words written in Latin.

At the time, we considered it, and we also considered a narrower restriction on 
using multiple scripts in the same identifier, or at least the same identifier 
portion (so it was OK if separated by _).
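
For illustration, the narrower restriction could look roughly like this. The standard library has no script property, so this sketch uses the first word of the Unicode character name as a stand-in for the script; that heuristic, and the names `scripts_in` and `has_mixed_script_portion`, are assumptions of the example rather than anything that was actually proposed.

```python
import unicodedata


def scripts_in(text):
    """Approximate the set of scripts used by the letters in `text`.

    Heuristic: the first word of the character name (LATIN, CYRILLIC,
    GREEK, CJK, ...) stands in for the script.
    """
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


def has_mixed_script_portion(identifier):
    """True if any underscore-separated portion mixes scripts."""
    return any(len(scripts_in(part)) > 1 for part in identifier.split("_"))


print(has_mixed_script_portion("s\u0441ope"))    # Latin plus Cyrillic U+0441
print(has_mixed_script_portion("pango_\u8a9e"))  # scripts separated by "_"
```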

Simplicity won, in part because of existing practice in EMACS scripting, 
particularly with some Asian languages.

> It is a job for linters, which can have many options for configuring
> acceptable scripts, use spelling dictionaries and dictionaries of
> homoglyphs, etc.

It might be time for the documentation to mention a specific 
linter/configuration that does this.  It also might be reasonable to do by 
default in IDLE or even the interactive shell.

-jJ

Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/BCZI6HCZJ34XABFFZETJMWFQWOUG4UB4/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread Marc-Andre Lemburg
On 01.11.2021 13:17, Petr Viktorin wrote:
>> PEP: 
>> Title: Unicode Security Considerations for Python
>> Author: Petr Viktorin 
>> Status: Active
>> Type: Informational
>> Content-Type: text/x-rst
>> Created: 01-Nov-2021
>> Post-History:

Thanks for writing this up. I'm not sure whether a PEP is the right place
for such documentation, though. Wouldn't it be more visible in the standard
Python documentation ?

-- 
Marc-Andre Lemburg
eGenix.com

Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/FSFG2B3LCWU5PQWX3WRIOJGNV2JFW4AU/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread David Mertz, Ph.D.
This is an amazing document, Petr. Really great work!

I think I agree with Marc-André that putting it in the actual Python
documentation would give it more visibility than in a PEP.

On Tue, Nov 2, 2021, 1:06 PM Marc-Andre Lemburg  wrote:

> On 01.11.2021 13:17, Petr Viktorin wrote:
> >> PEP: 
> >> Title: Unicode Security Considerations for Python
> >> Author: Petr Viktorin 
> >> Status: Active
> >> Type: Informational
> >> Content-Type: text/x-rst
> >> Created: 01-Nov-2021
> >> Post-History:
>
> Thanks for writing this up. I'm not sure whether a PEP is the right place
> for such documentation, though. Wouldn't it be more visible in the standard
> Python documentation ?

Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/6PHPDZRCYNA44NHSHXPBL7QMWXMHXWGU/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread Chris Angelico
On Wed, Nov 3, 2021 at 5:07 AM David Mertz, Ph.D.  wrote:
>
> This is an amazing document, Petr. Really great work!
>
> I think I agree with Marc-André that putting it in the actual Python 
> documentation would give it more visibility than in a PEP.
>

There are quite a few other PEPs that have similar sorts of advice,
like PEP 257 on docstrings, and several of the type hinting PEPs. IMO
it's fine.

ChrisA

Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/NICZBYG332C4WBFZVCHCTDTEP3NGEF7B/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread Terry Reedy

On 11/2/2021 1:02 PM, Marc-Andre Lemburg wrote:

On 01.11.2021 13:17, Petr Viktorin wrote:

PEP: 
Title: Unicode Security Considerations for Python
Author: Petr Viktorin 
Status: Active
Type: Informational
Content-Type: text/x-rst
Created: 01-Nov-2021
Post-History:


Thanks for writing this up. I'm not sure whether a PEP is the right place
for such documentation, though. Wouldn't it be more visible in the standard
Python documentation ?


There is already "Unicode HOW TO"  We could add "Unicode problems and 
pitfalls".



--
Terry Jan Reedy

Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/5KDNR5RIITKMIKGSZK2WCPEQDA6AJGQE/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread Steven D'Aprano
On Wed, Nov 03, 2021 at 03:03:54AM +1100, Chris Angelico wrote:
> On Wed, Nov 3, 2021 at 1:06 AM Petr Viktorin  wrote:
> > Let me know if it's clear in the newest version, with this note:
> >
> > > Here, ``encoding: unicode_escape`` in the initial comment is an encoding
> > > declaration. The ``unicode_escape`` encoding instructs Python to treat
> > > ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
> > > a comma (punctuator), etc.
> >
> 
> Huh. Is that level of generality actually still needed? Can Python
> deprecate all but a small handful of encodings?

To be clear, are you proposing to deprecate the encodings *completely* 
or just as the source code encoding?

Personally, I think that using obscure encodings as the source encoding 
is one of those "linters and code reviews should check it" issues. 

Besides, now that I've learned about this unicode_escape encoding, I 
think that's going to be *awesome* for winning obfuscated Python 
competitions! *wink*


-- 
Steve

Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/27IDDKAADVBAZSRZ2I5EO5SLXZIY6ANW/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread Chris Angelico
On Wed, Nov 3, 2021 at 11:09 AM Steven D'Aprano  wrote:
>
> On Wed, Nov 03, 2021 at 03:03:54AM +1100, Chris Angelico wrote:
> > On Wed, Nov 3, 2021 at 1:06 AM Petr Viktorin  wrote:
> > > Let me know if it's clear in the newest version, with this note:
> > >
> > > > Here, ``encoding: unicode_escape`` in the initial comment is an encoding
> > > > declaration. The ``unicode_escape`` encoding instructs Python to treat
> > > > > ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
> > > > > a comma (punctuator), etc.
> > >
> >
> > Huh. Is that level of generality actually still needed? Can Python
> > deprecate all but a small handful of encodings?
>
> To be clear, are you proposing to deprecate the encodings *completely*
> or just as the source code encoding?

Only source code encodings. Obviously we still need to be able to cope
with all manner of *data*, but Python source code shouldn't need to be
in bizarre, weird encodings.

(Honestly, I'd love to just require that Python source code be UTF-8,
but that would probably cause problems, so mandating that it be one of
a small set of encodings would be a safer option.)
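
Until something like that exists in the language, a reviewer's tool can apply the same policy. Below is a sketch using `tokenize.detect_encoding()`; the allow-list contents and the helper names are hypothetical, and which encodings to permit is a policy decision.

```python
import io
import tokenize

# Hypothetical allow-list, using the names detect_encoding() reports.
ALLOWED_SOURCE_ENCODINGS = {"utf-8", "utf-8-sig", "iso-8859-1"}


def declared_encoding(source_bytes):
    """Return the source encoding detected from a BOM or coding declaration."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


def source_encoding_allowed(source_bytes):
    return declared_encoding(source_bytes) in ALLOWED_SOURCE_ENCODINGS


print(source_encoding_allowed(b"# encoding: unicode_escape\nx = 1\n"))
print(source_encoding_allowed(b"x = 1\n"))
```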

> Personally, I think that using obscure encodings as the source encoding
> is one of those "linters and code reviews should check it" issues.
>
> Besides, now that I've learned about this unicode_escape encoding, I
> think that's going to be *awesome* for winning obfuscated Python
> competitions! *wink*

TBH, I'm not entirely sure how valid it is to talk about *security*
considerations when we're dealing with Python source code and variable
confusions, but that's a term that is well understood.

But to the extent that it is a security concern, it's not one that
linters can really cope with. I'm not sure how a linter would stop
someone from publishing code on PyPI that causes confusion by its
character encoding, for instance.

ChrisA

Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/HJ452KNBAFXI6WBQ6OUMHHZRRETPC7QL/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread Kyle Stanley
I'd suggest both: a briefer, easier-to-read write-up for the average user in
the docs, and more details/semantics in an informational PEP. Thanks for working on
this, Petr!

On Tue, Nov 2, 2021 at 2:07 PM David Mertz, Ph.D. 
wrote:

> This is an amazing document, Petr. Really great work!
>
> I think I agree with Marc-André that putting it in the actual Python
> documentation would give it more visibility than in a PEP.
>
> On Tue, Nov 2, 2021, 1:06 PM Marc-Andre Lemburg  wrote:
>
>> On 01.11.2021 13:17, Petr Viktorin wrote:
>> >> PEP: 
>> >> Title: Unicode Security Considerations for Python
>> >> Author: Petr Viktorin 
>> >> Status: Active
>> >> Type: Informational
>> >> Content-Type: text/x-rst
>> >> Created: 01-Nov-2021
>> >> Post-History:
>>
>> Thanks for writing this up. I'm not sure whether a PEP is the right place
>> for such documentation, though. Wouldn't it be more visible in the standard
>> Python documentation ?

Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/6OET4CKEZIA34PAXIJR7BUDKT2DPX2DG/


[Python-Dev] PEP 663:

2021-11-02 Thread Ethan Furman

See the latest changes, which are mostly a (hopefully) improved abstract, 
better tables, and some slight rewordings.

Feedback welcome!

---

PEP: 663
Title: Standardizing Enum str(), repr(), and format() behaviors
Version: $Revision$
Last-Modified: $Date$
Author: Ethan Furman 
Discussions-To: python-dev@python.org
Status: Draft
Type: Informational
Content-Type: text/x-rst
Created: 23-Feb-2013
Python-Version: 3.11
Post-History: 20-Jul-2021, 02-Nov-2021
Resolution:


Abstract
========

Update the ``repr()``, ``str()``, and ``format()`` of the various Enum types
to better match their intended purpose.  For example, ``IntEnum`` will have
its ``str()`` change to match its ``format()``, while a user-mixed int-enum
will have its ``format()`` match its ``str()``.  In all cases, an enum's
``str()`` and ``format()`` will be the same (unless the user overrides
``format()``).

Add a global enum decorator which changes the ``str()`` and ``repr()`` (and
``format()``) of the decorated enum to be a valid global reference: i.e.
``re.IGNORECASE`` instead of ``<RegexFlag.IGNORECASE: 2>``.


Motivation
==========

Having the ``str()`` of ``IntEnum`` and ``IntFlag`` not be the value causes
bugs and extra work when replacing existing constants.

Having the ``str()`` and ``format()`` of an enum member be different can be
confusing.

The addition of ``StrEnum`` with its requirement to have its ``str()`` be its
``value`` is inconsistent with the ``str()`` of the other provided Enums.

The iteration of ``Flag`` members, which directly affects their ``repr()``, is
inelegant at best, and buggy at worst.


Rationale
=========

Enums are becoming more common in the standard library; being able to recognize
enum members by their ``repr()``, and having that ``repr()`` be easy to parse, is
useful and can save time and effort in understanding and debugging code.

However, the enums with mixed-in data types (``IntEnum``, ``IntFlag``, and the new
``StrEnum``) need to be more backwards compatible with the constants they are
replacing -- specifically, ``str(replacement_enum_member) == str(original_constant)``
should be true (and the same for ``format()``).

IntEnum, IntFlag, and StrEnum should be as close to a drop-in replacement of
existing integer and string constants as is possible.  Towards that goal, the
``str()`` output of each should be its inherent value; e.g. if ``Color`` is an
``IntEnum``::

    >>> Color.RED
    <Color.RED: 1>
    >>> str(Color.RED)
    '1'
    >>> format(Color.RED)
    '1'

Note that ``format()`` already produces the correct output, only ``str()`` needs
updating.
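
The drop-in requirement can be checked directly. A minimal sketch, assuming an ``IntEnum`` named ``Color`` with ``RED = 1`` as in the example above; ``original_constant`` is a name invented here for the plain integer being replaced.

```python
from enum import IntEnum


class Color(IntEnum):
    RED = 1


original_constant = 1  # the plain integer that Color.RED replaces

print(repr(Color.RED))
print(format(Color.RED) == format(original_constant))
# On Python versions that predate this change, the next line prints False.
print(str(Color.RED) == str(original_constant))
```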

As much as possible, the ``str()``, ``repr()``, and ``format()`` of enum members
should be standardized across the standard library.  However, up to Python 3.10
several enums in the standard library have a custom ``str()`` and/or ``repr()``.

The ``repr()`` of Flag currently includes aliases, which it should not; fixing that
will, of course, already change its ``repr()`` in certain cases.


Specification
=============

There are three broad categories of enum usage:

- simple: ``Enum`` or ``Flag``
  a new enum class is created with no data type mixins

- drop-in replacement: ``IntEnum``, ``IntFlag``, ``StrEnum``
  a new enum class is created which also subclasses ``int`` or ``str`` and uses
  ``int.__str__`` or ``str.__str__``

- user-mixed enums and flags
  the user creates their own integer-, float-, str-, whatever-enums instead of
  using enum.IntEnum, etc.

There are also two styles:

- normal: the enumeration members remain in their classes and are accessed as
  ``classname.membername``, and the class name shows in their ``repr()`` and
  ``str()`` (where appropriate)

- global: the enumeration members are copied into their module's global
  namespace, and their module name shows in their ``repr()`` and ``str()``
  (where appropriate)

Some sample enums::

    # module: tools.py

    from enum import Enum, Flag

    class Hue(Enum):  # or IntEnum
        LIGHT = -1
        NORMAL = 0
        DARK = +1

    class Color(Flag):  # or IntFlag
        RED = 1
        GREEN = 2
        BLUE = 4

    class Grey(int, Enum):  # or (int, Flag)
        BLACK = 0
        WHITE = 1

Using the above enumerations, the following two tables show the old and new
output (blank cells indicate no change):

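+--------+------------------------+-------------+------------+---------------+
| style  | category               | enum repr() | enum str() | enum format() |
+--------+-------------+----------+-------------+------------+---------------+
| normal | simple      | 3.10     |             |            |               |
|        |             +----------+-------------+------------+---------------+
|        |             | new      |             |            |               |
|        +-------------+----------+-------------+------------+---------------+
|        | user mixed  | 3.10     |             |            | 1             |
+--------+-------------+----------+-------------+------------+---------------+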

[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread Jim J. Jewett
Chris Angelico wrote:
> I'm not sure how a linter would stop
> someone from publishing code on PyPI that causes confusion by its
> character encoding, for instance.

If it becomes important, the cheeseshop backend can run various validations 
(including a linter) on submissions, and include those results in the display 
template.

Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/NO6XRUPLOEAO2ZMUJEXXRNQMVFWZUGLT/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread Stephen J. Turnbull
Serhiy Storchaka writes:
 > This is excellent!
 > 
 > 01.11.21 14:17, Petr Viktorin wrote:
 > >> CPython treats the control character NUL (``\0``) as end of input,
 > >> but many editors simply skip it, possibly showing code that Python
 > >> will not run as a regular part of a file.
 > 
 > It is an implementation detail and we will get rid of it.

You can't, probably not for a decade, because people will be running
versions of Python released before you change it.  I hope this PEP
will address Python as it is as well as as it will be.

 > It only happens when you read the Python script from a file.

Which is one of the likely vectors for malware.  It might be worth
teaching virus checkers about this, for example.
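
A scanner would not need much to flag this. The sketch below only splits raw file contents at the first NUL byte so that anything after it can be reported; `split_at_nul` is a hypothetical helper, and whether the trailing part is malicious is left to the caller.

```python
def split_at_nul(data):
    """Split raw file bytes at the first NUL byte.

    Returns (before, after); `after` is None when there is no NUL byte.
    """
    before, sep, after = data.partition(b"\0")
    return (before, after if sep else None)


visible, trailing = split_at_nul(b"print('ok')\n\0import os\n")
print(visible)
print(trailing)
```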

Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/OUFJ47LYOHQ245BIKWVPCH4OCDB4CM7N/


[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-02 Thread Stephen J. Turnbull
Serhiy Storchaka writes:

 > All control characters except CR, LF, TAB and FF are banned outside
 > comments and string literals. I think it is worth banning them in
 > comments and string literals too.

+1

 > > For homoglyphs/confusables, should there be a SyntaxWarning when an
 > > identifier looks like ASCII but isn't?
 > 
 > It would virtually ban Cyrillic.

+1 (for the comment and for the implied -1 on SyntaxWarning, let's
keep the Cyrillic repertoire in Python!)

 > It is a job for linters,

+1

Aside from the reasons Serhiy presents, I'd rather not tie
this kind of rather ambiguous improvement in Unicode handling to the
release cycle.

It might be worth having a pep module/script in Python (perhaps
more likely, PyPI but maintained by whoever does the work to make
these improvements + Petr or somebody Petr trusts to do it), that
lints scripts specifically for confusables and other issues.

Steve

Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/Z62GMKAJLHZJD3YSEOJKKBWUZSBYEIVA/


[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)

2021-11-02 Thread Stephen J. Turnbull
Jim J. Jewett writes:

 > At the time, we considered it, and we also considered a narrower
 > restriction on using multiple scripts in the same identifier, or at
 > least the same identifier portion (so it was OK if separated by
 > _).

This would ban "παν語", aka "pango".  That's arguably a good idea
(IMO, 0.9 wink), but might make some GTK/GNOME folks sad.

 > Simplicity won, in part because of existing practice in EMACS
 > scripting, particularly with some Asian languages.

Interesting.  I maintained a couple of Emacs libraries (dictionaries
and input methods) for Japanese in XEmacs, and while hyphen-separated
mixtures of ASCII and Japanese are common, I don't recall ever seeing
an identifier with ASCII and Japanese glommed together without a
separator.  It was almost always of the form "English verb - Japanese
lexical component".  Or do you consider that "relatively complicated"?

 > It might be time for the documentation to mention a specific
 > linter/configuration that does this.  It also might be reasonable
 > to do by default in IDLE or even the interactive shell.

It would have to be easy to turn off, and perhaps the messages should
even provide instructions for doing so.  I would guess that for code
that uses it at all, it would be common.  So the warnings would likely
make those tools somewhere between really annoying and unusable.

___
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/FPO3EJISKDZUVMC3RMJJQZIKGCOG35CX/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread Stephen J. Turnbull
Chris Angelico writes:

 > Huh. Is that level of generality actually still needed? Can Python
 > deprecate all but a small handful of encodings?

I think that's pointless.  With few exceptions (GB18030; Big5, which has
a couple of code point pairs that encode the same very rare characters;
ISO 2022 extensions) you're not going to run into the confusables
problem, and AFAIK the only generic BIDI solution is Unicode (the ISO
8859 encodings of Hebrew and Arabic do not have direction markers).

What exactly are you thinking?

The only thing I'd like to see is to rearrange the codec aliases so
that the "common names" would denote the maximal repertoires in each
family (gb denotes gb18030, sjis denotes shift_jisx0213, etc.) as in
the WHATWG recommendations for web browsers.  But that's probably too
backward incompatible to fly.
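
A small sketch of what the aliases resolve to today, and why the repertoire
matters. The helper can_encode is hypothetical, and gb2312 is used here simply
as one narrower member of the GB family for comparison.

```python
import codecs


def can_encode(text, encoding):
    """Return True if text is representable in the given encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# The common alias currently resolves to the smaller repertoire.
for alias in ("sjis", "shift_jisx0213", "gb2312", "gb18030"):
    print(alias, "->", codecs.lookup(alias).name)

# GB18030 covers all of Unicode; GB2312 does not.
emoji = "\U0001F600"
print(can_encode(emoji, "gb18030"), can_encode(emoji, "gb2312"))
```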
___
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/W4RJJVAJN7FB24R52YSCU2Y3QZQE3BEL/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-02 Thread Chris Angelico
On Wed, Nov 3, 2021 at 5:12 PM Stephen J. Turnbull wrote:
>
> Chris Angelico writes:
>
>  > Huh. Is that level of generality actually still needed? Can Python
>  > deprecate all but a small handful of encodings?
>
> I think that's pointless.  With few exceptions (GB18030; Big5, which has
> a couple of code point pairs that encode the same very rare characters;
> ISO 2022 extensions) you're not going to run into the confusables
> problem, and AFAIK the only generic BIDI solution is Unicode (the ISO
> 8859 encodings of Hebrew and Arabic do not have direction markers).
>
> What exactly are you thinking?

You'll never eliminate confusables (even ASCII has some, depending on
font). But I was surprised to find that Python would let you use
unicode_escape for source code.



# coding: unicode_escape

x = '''

Code example:

\u0027\u0027\u0027 # format in monospaced on the web site

print("Did you think this would be executed?")

\u0027\u0027\u0027 # end monospaced

Surprise!
'''

print("There are %d lines in x." % len(x.split(chr(10))))



With some carefully-crafted comments, a lot of human readers will
ignore the magic tokens. It's not uncommon to put example code into
triple-quoted strings, and it's also not all that surprising when
simplified examples do things that you wouldn't normally want done
(like monkeypatching other modules), since it's just an example, after
all.

I don't have access to very many editors, but SciTE, VS Code, nano,
and the GitHub gist display all syntax-highlighted this as if it were
a single large string. Only IDLE showed it as code in between, and
that's because it actually decoded it using the declared character
coding, so the magic lines showed up with actual apostrophes.
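
The difference between what a reader sees and what the parser gets can be
reproduced without a coding cookie. This sketch decodes the same bytes two
ways and compares the top-level statements; statement_kinds is a hypothetical
helper, and decoding explicitly with the codec only simulates what the
declared source encoding would do to the file.

```python
import ast

raw = (
    b"x = '''\n"
    b"Code example:\n"
    b"\\u0027\\u0027\\u0027 # format in monospaced on the web site\n"
    b'print("Did you think this would be executed?")\n'
    b"\\u0027\\u0027\\u0027 # end monospaced\n"
    b"Surprise!\n"
    b"'''\n"
)


def statement_kinds(text):
    """Names of the top-level statement types in the given source text."""
    return [type(node).__name__ for node in ast.parse(text).body]


# What most editors display: the escapes stay inside one big string.
as_displayed = raw.decode("ascii")
# What the parser receives after the declared codec has been applied.
as_parsed = raw.decode("unicode_escape")

print(statement_kinds(as_displayed))  # a single assignment
print(statement_kinds(as_parsed))     # assignment, print call, stray string
```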

Maybe the phrase "a small handful" was a bit too hopeful, but would it
be possible to mandate (after, obviously, a deprecation period) that
source encodings be ASCII-compatible?
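
As a rough sketch of what such a check might look like: the function below
treats an encoding as ASCII-compatible if a set of ASCII probes decodes to
itself. The name looks_ascii_compatible and the probe list are assumptions
made for illustration, and sampling like this is a heuristic, not a proof.

```python
# Every single ASCII byte, plus a few sequences that escape-style or
# shift-style codecs are known to reinterpret.
PROBES = [bytes([b]) for b in range(128)] + [b"\\u0027", b"+AGE-", b"'''"]


def looks_ascii_compatible(encoding):
    """Heuristic: do the ASCII probes decode to the same text as in ASCII?"""
    for probe in PROBES:
        try:
            if probe.decode(encoding) != probe.decode("ascii"):
                return False
        except UnicodeError:
            # The codec cannot even decode plain ASCII bytes on their own.
            return False
    return True


for name in ("utf-8", "latin-1", "unicode_escape", "utf-16", "utf-7"):
    print(name, looks_ascii_compatible(name))
```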

ChrisA
___
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/QQM7HLRMVKBELRRYBJYGR356QVSOLKKZ/