[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python
On 01. 11. 21 13:17, Petr Viktorin wrote:
> Hello,
> Today, an attack called "Trojan source" was revealed, where a malicious
> contributor can use Unicode features (bidirectional text and homoglyphs)
> to write code that, when shown in an editor, will look different from how
> a computer language parser will process it.
> See https://trojansource.codes/, CVE-2021-42574 and CVE-2021-42694.
>
> This is not a bug in Python.
> As far as I know, the Python Security Response team reviewed the report
> and decided that it should be handled in code editors, diff viewers,
> repository frontends and similar software, rather than in the language.
>
> I agree: in my opinion, the attack is similar to abusing any other
> "gotcha" where Python doesn't parse text as a non-expert human would.
> For example: `if a or b == 'yes'`, mutable default arguments, or a
> misleading typo.
>
> Nevertheless, I did do a bit of research about similar gotchas in Python,
> and I'd like to publish a summary as an informational PEP, pasted below.

Thanks for the comments, everyone! I've updated the document and sent it to https://github.com/python/peps/pull/2129
A rendered version is at https://github.com/encukou/peps/blob/pep-0672/pep-0672.rst

Toshio Kuratomi wrote:
> `Unicode`_ is a system for handling all kinds of written language.
> It aims to allow any character from any human natural language (as
> well as a few characters which are not from natural languages) to be
> used. Python code may consist of almost all valid Unicode characters.

Thanks! That's a nice summary; I condensed it a bit more and used it.
(I'm not joining the conversation on glyphs, characters, codepoints and encodings -- that's much too technical for this document. Using the specific technical terms unfortunately doesn't help understanding, so I use the vague ones like "character" and "letter".)

Jim J. Jewett wrote:
> "The East Asian symbol for *ten* looks like a plus sign, so ``十= 10`` is
> a complete Python statement."
> Normally, an identifier must begin with a letter, and numbers can only be
> used in the second and subsequent positions. (XID_CONTINUE instead of
> XID_START) The fact that some characters with numeric values are
> considered letters (in this case, category Lo, Other Letters) is a
> different problem than just looking visually confusable with "+", and it
> should probably be listed on its own.

I'm not a native speaker, but as I understand it, "十" is closer to a single-letter word than a single-digit number. It translates better as "ten" than "10". (And it appears in "十四", "fourteen", just like "four" appears in "fourteen".)

Patrick Schultz wrote:
> - The Unicode consortium has a list of confusables, in case useful

Yup, and it's linked from the documents that describe how to use it. I link to those rather than just the list. But thank you!

Terry Reedy wrote:
>> Bidirectional Text
>> ------------------
>>
>> Some scripts, such as Hebrew or Arabic, are written right-to-left.
>
> [Suggested addition, subject to further revision.]
>
> There are at least three levels of handling r2l chars: none, local
> (contiguous sequences are properly reversed), and extended (see below).
> The handling depends on the display software and may depend on the
> quoting. Tk and hence tkinter (and IDLE) text widgets do local handling.
> Windows Notepad++ does local handling of unquoted code but extended
> handling of quoted text. Windows Notepad currently does extended
> handling even without quotes.

I'd like to leave these details out of the document. The examples should render convincingly in browsers. The text should now describe the behavior even if you open it in an editor that does things differently, and acknowledge that such editors exist. (The behavior of specific editors/toolkits might well change in the future.)

>> For example, with ``encoding: unicode_escape``, characters like quotes or
>> braces can be hidden in an (f-)string, with many tools (syntax
>> highlighters, linters, etc.) considering them part of the string.
>> For example::
>
> I don't see the connection between the text above and the example that
> follows.
>
>> # For writing Japanese, you don't need an editor that supports
>> # UTF-8 source encoding: unicode_escape sequences work just as well.
> [etc]

Let me know if it's clear in the newest version, with this note:

> Here, ``encoding: unicode_escape`` in the initial comment is an encoding
> declaration. The ``unicode_escape`` encoding instructs Python to treat
> ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
> a comma (punctuator), etc.

Steven D'Aprano wrote:
>> Before the age of computers, most mechanical typewriters lacked the keys
>> for the digits ``0`` and ``1``
>
> I'm not sure that "most" is justified here. One of the most popular
> typewriters in history, the Underwood #5 (from 1900 to 1920), lacked the 1
> key but had a 0 distinct from O.
>
> https://i1.wp.com/curiousasacathy.com/wp-content/uploads/2016/04/underwood-no-5-standard-typewriter-circa-1901.jpg
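
The point about "十" being a letter that also carries a numeric value can be checked from the interpreter. The following is only a small sketch using the standard `unicodedata` module; the variable names are made up for illustration.

```python
import unicodedata

ten = "十"  # U+5341, the CJK ideograph for "ten"

# The character is classified as a letter (category Lo, "Other Letter"),
# so it is allowed at the start of an identifier ...
category = unicodedata.category(ten)
is_identifier = ten.isidentifier()

# ... even though Unicode also assigns it a numeric value.
numeric_value = unicodedata.numeric(ten)

# An ASCII digit, by contrast, cannot start an identifier.
digit_is_identifier = "1".isidentifier()

# "十= 10" is therefore an ordinary assignment to a variable named 十.
namespace = {}
exec("十= 10", namespace)

print(category, is_identifier, numeric_value, digit_is_identifier, namespace["十"])
```

So the visual similarity to "+" and the "numeric letter" property are two separate effects, as Jim suggests.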
[Python-Dev] Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)
On 01. 11. 21 18:32, Serhiy Storchaka wrote:
> This is excellent!
>
> 01.11.21 14:17, Petr Viktorin wrote:
>> CPython treats the control character NUL (``\0``) as end of input,
>> but many editors simply skip it, possibly showing code that Python will
>> not run as a regular part of a file.
>
> It is an implementation detail and we will get rid of it. It only
> happens when you read the Python script from a file. If you import it as
> a module or run with runpy, the NUL character is an error.

That brings us to possible changes in Python in this area, which is an interesting topic.

As for \0, can we ban all ASCII & C1 control characters except whitespace? I see no place for them in source code.

For homoglyphs/confusables, should there be a SyntaxWarning when an identifier looks like ASCII but isn't?

For right-to-left text: does anyone actually name identifiers in Hebrew/Arabic? AFAIK, we should allow a few non-printing "joiner"/"non-joiner" characters to make it possible to use all Arabic words. But it would be great to consult with users/teachers of the languages.

Should Python run the bidi algorithm when parsing and disallow reordered tokens? Maybe optionally?
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/TGB377QWGIDPUWMAJSZLT22ERGPNZ5FZ/
Code of Conduct: http://python.org/psf/codeofconduct/
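
To make the control-character question concrete, here is a rough sketch of the kind of check a linter or review tool could run today. It is not how CPython's tokenizer works; the function name and the choice of "Unicode category Cc minus TAB, LF, FF, CR" are assumptions made for this illustration.

```python
import unicodedata

# Whitespace control characters that stay allowed: TAB, LF, FF, CR.
ALLOWED_CONTROLS = {"\t", "\n", "\x0c", "\r"}


def find_control_characters(text):
    """Return (line, column, code point) for each disallowed control character.

    "Control character" here means Unicode category Cc, which covers the
    ASCII (C0) range, DEL, and the C1 range.
    """
    hits = []
    # Split on LF only, so line numbers match what most tools report.
    for line_number, line in enumerate(text.split("\n"), start=1):
        for column, char in enumerate(line):
            if unicodedata.category(char) == "Cc" and char not in ALLOWED_CONTROLS:
                hits.append((line_number, column, f"U+{ord(char):04X}"))
    return hits


# A NUL hidden at the end of a comment and a C1 control inside a string.
sample = "x = 1\n# harmless comment\x00\nprint('hidden\x9b')\n"
print(find_control_characters(sample))
```

A check like this works on the decoded text, so it sees characters in comments and string literals as well as in code.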
[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)
02.11.21 16:16, Petr Viktorin wrote:
> As for \0, can we ban all ASCII & C1 control characters except
> whitespace? I see no place for them in source code.

All control characters except CR, LF, TAB and FF are banned outside comments and string literals. I think it is worth banning them in comments and string literals too. In string literals you can use backslash-escape sequences, and comments should be human readable; there is no reason to include control characters in them. There is a precedent of emitting warnings for some superficial escapes in strings.

> For homoglyphs/confusables, should there be a SyntaxWarning when an
> identifier looks like ASCII but isn't?

It would virtually ban Cyrillic. There are a lot of Cyrillic letters which look like Latin letters, and there are complete words written in Cyrillic which by accident look like other words written in Latin.

It is a job for linters, which can have many options for configuring acceptable scripts, use spelling dictionaries and dictionaries of homoglyphs, etc.
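
As an illustration of the "job for linters" point, the sketch below lists identifiers that contain non-ASCII characters together with the Unicode names of those characters, leaving the judgment to the reader or to configuration. The function name is hypothetical, and this is only one possible starting point, not an existing linter's behaviour.

```python
import io
import tokenize
import unicodedata


def non_ascii_identifiers(source):
    """Yield (identifier, [Unicode character names]) for non-ASCII identifiers."""
    seen = set()
    for token in tokenize.generate_tokens(io.StringIO(source).readline):
        if token.type != tokenize.NAME or token.string.isascii():
            continue
        if token.string in seen:
            continue
        seen.add(token.string)
        names = [
            unicodedata.name(char, "UNNAMED")
            for char in token.string
            if not char.isascii()
        ]
        yield token.string, names


# \u0441 is CYRILLIC SMALL LETTER ES, which looks like the Latin letter "c".
source = "\u0441 = 1\nc = 2\nprint(\u0441 + c)\n"
for identifier, names in non_ascii_identifiers(source):
    print(repr(identifier), names)
```

Reporting rather than warning keeps legitimate Cyrillic code usable while still making the two different variables visible.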
[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python
On Wed, Nov 3, 2021 at 1:06 AM Petr Viktorin wrote:
> Let me know if it's clear in the newest version, with this note:
>
> > Here, ``encoding: unicode_escape`` in the initial comment is an encoding
> > declaration. The ``unicode_escape`` encoding instructs Python to treat
> > ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
> > a comma (punctuator), etc.
>

Huh. Is that level of generality actually still needed? Can Python deprecate all but a small handful of encodings?

ChrisA
[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)
Serhiy Storchaka wrote:
> 02.11.21 16:16, Petr Viktorin wrote:
> > As for \0, can we ban all ASCII & C1 control characters except
> > whitespace? I see no place for them in source code.
> All control characters except CR, LF, TAB and FF are banned outside
> comments and string literals. I think it is worth banning them in
> comments and string literals too. In string literals you can use
> backslash-escape sequences, and comments should be human readable; there
> is no reason to include control characters in them.

If escape sequences were also allowed in comments (or at least in strings within comments), this would make sense. I don't like banning them otherwise, since odd characters are often a good reason to need a comment, but it is definitely a "mention, not use" situation.

> > For homoglyphs/confusables, should there be a SyntaxWarning when an
> > identifier looks like ASCII but isn't?
>
> It would virtually ban Cyrillic. There are a lot of Cyrillic letters
> which look like Latin letters, and there are complete words written in
> Cyrillic which by accident look like other words written in Latin.

At the time, we considered it, and we also considered a narrower restriction on using multiple scripts in the same identifier, or at least the same identifier portion (so it was OK if separated by _). Simplicity won, in part because of existing practice in EMACS scripting, particularly with some Asian languages.

> It is a job for linters, which can have many options for configuring
> acceptable scripts, use spelling dictionaries and dictionaries of
> homoglyphs, etc.

It might be time for the documentation to mention a specific linter/configuration that does this. It also might be reasonable to do by default in IDLE or even the interactive shell.
-jJ
[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python
On 01.11.2021 13:17, Petr Viktorin wrote:
>> PEP:
>> Title: Unicode Security Considerations for Python
>> Author: Petr Viktorin
>> Status: Active
>> Type: Informational
>> Content-Type: text/x-rst
>> Created: 01-Nov-2021
>> Post-History:

Thanks for writing this up. I'm not sure whether a PEP is the right place for such documentation, though. Wouldn't it be more visible in the standard Python documentation?

--
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Experts (#1, Nov 02 2021)
>>> Python Projects, Coaching and Support ...https://www.egenix.com/
>>> Python Product Development ...https://consulting.egenix.com/

::: We implement business ideas - efficiently in both time and costs :::

eGenix.com Software, Skills and Services GmbH Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
Registered at Amtsgericht Duesseldorf: HRB 46611
https://www.egenix.com/company/contact/
https://www.malemburg.com/
[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python
This is an amazing document, Petr. Really great work!

I think I agree with Marc-André that putting it in the actual Python documentation would give it more visibility than in a PEP.

On Tue, Nov 2, 2021, 1:06 PM Marc-Andre Lemburg wrote:
> On 01.11.2021 13:17, Petr Viktorin wrote:
> >> PEP:
> >> Title: Unicode Security Considerations for Python
> >> Author: Petr Viktorin
> >> Status: Active
> >> Type: Informational
> >> Content-Type: text/x-rst
> >> Created: 01-Nov-2021
> >> Post-History:
>
> Thanks for writing this up. I'm not sure whether a PEP is the right place
> for such documentation, though. Wouldn't it be more visible in the standard
> Python documentation?
[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python
On Wed, Nov 3, 2021 at 5:07 AM David Mertz, Ph.D. wrote:
>
> This is an amazing document, Petr. Really great work!
>
> I think I agree with Marc-André that putting it in the actual Python
> documentation would give it more visibility than in a PEP.
>

There are quite a few other PEPs that have similar sorts of advice, like PEP 257 on docstrings, and several of the type hinting PEPs. IMO it's fine.

ChrisA
[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python
On 11/2/2021 1:02 PM, Marc-Andre Lemburg wrote:
> On 01.11.2021 13:17, Petr Viktorin wrote:
>> PEP:
>> Title: Unicode Security Considerations for Python
>> Author: Petr Viktorin
>> Status: Active
>> Type: Informational
>> Content-Type: text/x-rst
>> Created: 01-Nov-2021
>> Post-History:
>
> Thanks for writing this up. I'm not sure whether a PEP is the right place
> for such documentation, though. Wouldn't it be more visible in the standard
> Python documentation?

There is already a "Unicode HOWTO". We could add "Unicode problems and pitfalls".

--
Terry Jan Reedy
[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python
On Wed, Nov 03, 2021 at 03:03:54AM +1100, Chris Angelico wrote:
> On Wed, Nov 3, 2021 at 1:06 AM Petr Viktorin wrote:
> > Let me know if it's clear in the newest version, with this note:
> >
> > > Here, ``encoding: unicode_escape`` in the initial comment is an encoding
> > > declaration. The ``unicode_escape`` encoding instructs Python to treat
> > > ``\u0027`` as a single quote (which can start/end a string), ``\u002c`` as
> > > a comma (punctuator), etc.
> >
>
> Huh. Is that level of generality actually still needed? Can Python
> deprecate all but a small handful of encodings?

To be clear, are you proposing to deprecate the encodings *completely* or just as the source code encoding?

Personally, I think that using obscure encodings as the source encoding is one of those "linters and code reviews should check it" issues.

Besides, now that I've learned about this unicode_escape encoding, I think that's going to be *awesome* for winning obfuscated Python competitions! *wink*

--
Steve
[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python
On Wed, Nov 3, 2021 at 11:09 AM Steven D'Aprano wrote:
>
> On Wed, Nov 03, 2021 at 03:03:54AM +1100, Chris Angelico wrote:
> > On Wed, Nov 3, 2021 at 1:06 AM Petr Viktorin wrote:
> > > Let me know if it's clear in the newest version, with this note:
> > >
> > > > Here, ``encoding: unicode_escape`` in the initial comment is an encoding
> > > > declaration. The ``unicode_escape`` encoding instructs Python to treat
> > > > ``\u0027`` as a single quote (which can start/end a string), ``\u002c``
> > > > as a comma (punctuator), etc.
> > >
> >
> > Huh. Is that level of generality actually still needed? Can Python
> > deprecate all but a small handful of encodings?
>
> To be clear, are you proposing to deprecate the encodings *completely*
> or just as the source code encoding?

Only source code encodings. Obviously we still need to be able to cope with all manner of *data*, but Python source code shouldn't need to be in bizarre, weird encodings.

(Honestly, I'd love to just require that Python source code be UTF-8, but that would probably cause problems, so mandating that it be one of a small set of encodings would be a safer option.)

> Personally, I think that using obscure encodings as the source encoding
> is one of those "linters and code reviews should check it" issues.
>
> Besides, now that I've learned about this unicode_escape encoding, I
> think that's going to be *awesome* for winning obfuscated Python
> competitions! *wink*

TBH, I'm not entirely sure how valid it is to talk about *security* considerations when we're dealing with Python source code and variable confusions, but that's a term that is well understood.

But to the extent that it is a security concern, it's not one that linters can really cope with. I'm not sure how a linter would stop someone from publishing code on PyPI that causes confusion by its character encoding, for instance.
ChrisA
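
For reference, a check for "one of a small set of encodings" can already be sketched outside the language with the standard `tokenize.detect_encoding` helper, which reads the BOM and the encoding declaration the same way the interpreter does. The allowlist below and both function names are purely hypothetical; which encodings would belong on such a list is exactly the open question in this thread.

```python
import codecs
import io
import tokenize

# Hypothetical allowlist, normalized through the codec registry so that
# aliases such as "latin-1" and "iso-8859-1" compare equal.
ALLOWED_SOURCE_ENCODINGS = {
    codecs.lookup(name).name
    for name in ("utf-8", "utf-8-sig", "ascii", "latin-1")
}


def declared_encoding(source_bytes):
    """Return the normalized name of the codec declared for this source."""
    # detect_encoding raises SyntaxError for an unknown or invalid declaration.
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return codecs.lookup(encoding).name


def uses_allowed_encoding(source_bytes):
    """True if the source's declared (or default) encoding is on the allowlist."""
    return declared_encoding(source_bytes) in ALLOWED_SOURCE_ENCODINGS


print(uses_allowed_encoding(b"x = 1\n"))
print(uses_allowed_encoding(b"# coding: unicode_escape\nx = 1\n"))
```

A package index or a pre-commit hook could run something like this on uploaded files, although that does not answer the question of whether the language itself should enforce it.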
[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python
I'd suggest both: a briefer, easier-to-read write-up for the average user in the docs, and more details/semantics in the informational PEP. Thanks for working on this, Petr!

On Tue, Nov 2, 2021 at 2:07 PM David Mertz, Ph.D. wrote:
> This is an amazing document, Petr. Really great work!
>
> I think I agree with Marc-André that putting it in the actual Python
> documentation would give it more visibility than in a PEP.
>
> On Tue, Nov 2, 2021, 1:06 PM Marc-Andre Lemburg wrote:
>
>> On 01.11.2021 13:17, Petr Viktorin wrote:
>> >> PEP:
>> >> Title: Unicode Security Considerations for Python
>> >> Author: Petr Viktorin
>> >> Status: Active
>> >> Type: Informational
>> >> Content-Type: text/x-rst
>> >> Created: 01-Nov-2021
>> >> Post-History:
>>
>> Thanks for writing this up. I'm not sure whether a PEP is the right place
>> for such documentation, though. Wouldn't it be more visible in the
>> standard Python documentation?
[Python-Dev] PEP 663:
See the latest changes, which are mostly a (hopefully) improved abstract, better tables, and some slight rewordings. Feedback welcome!

---

PEP: 663
Title: Standardizing Enum str(), repr(), and format() behaviors
Version: $Revision$
Last-Modified: $Date$
Author: Ethan Furman
Discussions-To: python-dev@python.org
Status: Draft
Type: Informational
Content-Type: text/x-rst
Created: 23-Feb-2013
Python-Version: 3.11
Post-History: 20-Jul-2021, 02-Nov-2021
Resolution:


Abstract
========

Update the ``repr()``, ``str()``, and ``format()`` of the various Enum types to better match their intended purpose. For example, ``IntEnum`` will have its ``str()`` change to match its ``format()``, while a user-mixed int-enum will have its ``format()`` match its ``str()``. In all cases, an enum's ``str()`` and ``format()`` will be the same (unless the user overrides ``format()``).

Add a global enum decorator which changes the ``str()`` and ``repr()`` (and ``format()``) of the decorated enum to be a valid global reference: i.e. ``re.IGNORECASE`` instead of ``<RegexFlag.IGNORECASE: 2>``.


Motivation
==========

Having the ``str()`` of ``IntEnum`` and ``IntFlag`` not be the value causes bugs and extra work when replacing existing constants.

Having the ``str()`` and ``format()`` of an enum member be different can be confusing.

The addition of ``StrEnum`` with its requirement to have its ``str()`` be its ``value`` is inconsistent with other provided Enum's ``str``.

The iteration of ``Flag`` members, which directly affects their ``repr()``, is inelegant at best, and buggy at worst.


Rationale
=========

Enums are becoming more common in the standard library; being able to recognize enum members by their ``repr()``, and having that ``repr()`` be easy to parse, is useful and can save time and effort in understanding and debugging code.
However, the enums with mixed-in data types (``IntEnum``, ``IntFlag``, and the new ``StrEnum``) need to be more backwards compatible with the constants they are replacing -- specifically, ``str(replacement_enum_member) == str(original_constant)`` should be true (and the same for ``format()``).

IntEnum, IntFlag, and StrEnum should be as close to a drop-in replacement of existing integer and string constants as is possible. Towards that goal, the ``str()`` output of each should be its inherent value; e.g. if ``Color`` is an ``IntEnum``::

    >>> Color.RED
    <Color.RED: 1>
    >>> str(Color.RED)
    '1'
    >>> format(Color.RED)
    '1'

Note that ``format()`` already produces the correct output, only ``str()`` needs updating.

As much as possible, the ``str()``, ``repr()``, and ``format()`` of enum members should be standardized across the standard library. However, up to Python 3.10 several enums in the standard library have a custom ``str()`` and/or ``repr()``.

The ``repr()`` of Flag currently includes aliases, which it should not; fixing that will, of course, already change its ``repr()`` in certain cases.


Specification
=============

There are three broad categories of enum usage:

- simple: ``Enum`` or ``Flag``
  a new enum class is created with no data type mixins

- drop-in replacement: ``IntEnum``, ``IntFlag``, ``StrEnum``
  a new enum class is created which also subclasses ``int`` or ``str`` and uses ``int.__str__`` or ``str.__str__``

- user-mixed enums and flags
  the user creates their own integer-, float-, str-, whatever-enums instead of using enum.IntEnum, etc.
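
For readers who want to try the ``Color`` example from the Rationale, here is a small sketch. The class body is assumed for illustration (the Rationale does not spell it out), and since ``str()`` is the behavior under discussion, its result depends on the Python version in use; only the ``repr()`` and ``format()`` results are stated.

```python
from enum import IntEnum


class Color(IntEnum):
    # Assumed members, chosen to match the '1' shown in the Rationale.
    RED = 1
    GREEN = 2


# repr() identifies the member; format() already yields the plain integer.
member_repr = repr(Color.RED)      # '<Color.RED: 1>'
member_format = format(Color.RED)  # '1'

# str() is the part this PEP proposes to change, so no expected value is
# given here: it differs between Python versions.
member_str = str(Color.RED)

print(member_repr, member_format, member_str)
```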
There are also two styles:

- normal: the enumeration members remain in their classes and are accessed as ``classname.membername``, and the class name shows in their ``repr()`` and ``str()`` (where appropriate)

- global: the enumeration members are copied into their module's global namespace, and their module name shows in their ``repr()`` and ``str()`` (where appropriate)

Some sample enums::

    # module: tools.py

    class Hue(Enum):  # or IntEnum
        LIGHT = -1
        NORMAL = 0
        DARK = +1

    class Color(Flag):  # or IntFlag
        RED = 1
        GREEN = 2
        BLUE = 4

    class Grey(int, Enum):  # or (int, Flag)
        BLACK = 0
        WHITE = 1

Using the above enumerations, the following two tables show the old and new output (blank cells indicate no change):

+--------+------------------------+-------------+------------+---------------+
| style  | category               | enum repr() | enum str() | enum format() |
+--------+-------------+----------+-------------+------------+---------------+
| normal | simple      | 3.10     |             |            |               |
|        |             +----------+-------------+------------+---------------+
|        |             | new      |             |            |               |
|        +-------------+----------+-------------+------------+---------------+
|        | user mixed  | 3.10     |             |            | 1             |
[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python
Chris Angelico wrote:
> I'm not sure how a linter would stop
> someone from publishing code on PyPI that causes confusion by its
> character encoding, for instance.

If it becomes important, the cheeseshop backend can run various validations (including a linter) on submissions, and include those results in the display template.
[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python
Serhiy Storchaka writes:

 > This is excellent!
 >
 > 01.11.21 14:17, Petr Viktorin wrote:
 > >> CPython treats the control character NUL (``\0``) as end of input,
 > >> but many editors simply skip it, possibly showing code that Python
 > >> will not run as a regular part of a file.
 >
 > It is an implementation detail and we will get rid of it.

You can't, probably not for a decade, because people will be running versions of Python released before you change it. I hope this PEP will address Python as it is now as well as how it will be.

 > It only happens when you read the Python script from a file.

Which is one of the likely vectors for malware. It might be worth teaching virus checkers about this, for example.
[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)
Serhiy Storchaka writes:

 > All control characters except CR, LF, TAB and FF are banned outside
 > comments and string literals. I think it is worth banning them in
 > comments and string literals too.

+1

 > > For homoglyphs/confusables, should there be a SyntaxWarning when an
 > > identifier looks like ASCII but isn't?
 >
 > It would virtually ban Cyrillic.

+1 (for the comment and for the implied -1 on SyntaxWarning, let's keep the Cyrillic repertoire in Python!)

 > It is a job for linters,

+1

Aside from the reasons Serhiy presents, I'd rather not tie this kind of rather ambiguous improvement in Unicode handling to the release cycle.

It might be worth having a pep module/script in Python (perhaps more likely, on PyPI but maintained by whoever does the work to make these improvements + Petr or somebody Petr trusts to do it), that lints scripts specifically for confusables and other issues.

Steve
[Python-Dev] Re: Preventing Unicode-related gotchas (Was: pre-PEP: Unicode Security Considerations for Python)
Jim J. Jewett writes:

 > At the time, we considered it, and we also considered a narrower
 > restriction on using multiple scripts in the same identifier, or at
 > least the same identifier portion (so it was OK if separated by
 > _).

This would ban "παν語", aka "pango". That's arguably a good idea (IMO, 0.9 wink), but might make some GTK/GNOME folks sad.

 > Simplicity won, in part because of existing practice in EMACS
 > scripting, particularly with some Asian languages.

Interesting. I maintained a couple of Emacs libraries (dictionaries and input methods) for Japanese in XEmacs, and while hyphen-separated mixtures of ASCII and Japanese are common, I don't recall ever seeing an identifier with ASCII and Japanese glommed together without a separator. It was almost always of the form "English verb - Japanese lexical component". Or do you consider that "relatively complicated"?

 > It might be time for the documentation to mention a specific
 > linter/configuration that does this. It also might be reasonable
 > to do by default in IDLE or even the interactive shell.

It would have to be easy to turn off, perhaps even provide instructions in the messages. I would guess that for code that uses it at all, it would be common. So the warnings would likely make those tools somewhere between really annoying and unusable.
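
To show what the narrower "one script per identifier portion" rule would flag, here is a rough sketch. Python's standard library does not expose the Unicode Script property, so this approximates a character's script by the first word of its Unicode name; that heuristic, and both function names, are assumptions for illustration only. Among other limitations, it would treat ordinary Japanese mixing kana and kanji as "mixed".

```python
import unicodedata


def rough_script(char):
    """Approximate a character's script by the first word of its Unicode name.

    This is only a heuristic (e.g. 'LATIN', 'GREEK', 'CYRILLIC', 'CJK'),
    not the real Unicode Script property.
    """
    return unicodedata.name(char, "UNKNOWN").split()[0]


def mixed_script_portions(identifier):
    """Return the underscore-separated portions that mix more than one script."""
    mixed = []
    for portion in identifier.split("_"):
        scripts = {rough_script(char) for char in portion if char.isalpha()}
        if len(scripts) > 1:
            mixed.append((portion, sorted(scripts)))
    return mixed


print(mixed_script_portions("παν語"))    # Greek and CJK in one portion
print(mixed_script_portions("παν_語"))   # separated by "_", so not flagged
```

The first call reports the "pango" spelling as mixed, while the underscore-separated form passes, matching the rule Jim describes.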
[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python
Chris Angelico writes:

 > Huh. Is that level of generality actually still needed? Can Python
 > deprecate all but a small handful of encodings?

I think that's pointless. With few exceptions (GB18030, Big5 has a couple of code point pairs that encode the same very rare characters, ISO 2022 extensions) you're not going to run into the confusables problem, and AFAIK the only generic BIDI solution is Unicode (the ISO 8859 encodings of Hebrew and Arabic do not have direction markers).

What exactly are you thinking?

The only thing I'd like to see is to rearrange the codec aliases so that the "common names" would denote the maximal repertoires in each family (gb denotes gb18030, sjis denotes shift_jisx0213, etc) as in the WhatWG recommendations for web browsers. But that's probably too backward incompatible to fly.
[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python
On Wed, Nov 3, 2021 at 5:12 PM Stephen J. Turnbull wrote:
>
> Chris Angelico writes:
>
> > Huh. Is that level of generality actually still needed? Can Python
> > deprecate all but a small handful of encodings?
>
> I think that's pointless. With few exceptions (GB18030, Big5 has a
> couple of code point pairs that encode the same very rare characters,
> ISO 2022 extensions) you're not going to run into the confusables
> problem, and AFAIK the only generic BIDI solution is Unicode (the ISO
> 8859 encodings of Hebrew and Arabic do not have direction markers).
>
> What exactly are you thinking?

You'll never eliminate confusables (even ASCII has some, depending on font). But I was surprised to find that Python would let you use unicode_escape for source code.

# coding: unicode_escape
x = '''
Code example:
\u0027\u0027\u0027 # format in monospaced on the web site
print("Did you think this would be executed?")
\u0027\u0027\u0027 # end monospaced
Surprise!
'''
print("There are %d lines in x." % len(x.split(chr(10))))

With some carefully-crafted comments, a lot of human readers will ignore the magic tokens. It's not uncommon to put example code into triple-quoted strings, and it's also not all that surprising when simplified examples do things that you wouldn't normally want done (like monkeypatching other modules), since it's just an example, after all.

I don't have access to very many editors, but SciTE, VS Code, nano, and the GitHub gist display all syntax-highlighted this as if it were a single large string. Only Idle showed it as code in between, and that's because it actually decoded it using the declared character coding, so the magic lines showed up with actual apostrophes.

Maybe the phrase "a small handful" was a bit too hopeful, but would it be possible to mandate (after, obviously, a deprecation period) that source encodings be ASCII-compatible?
ChrisA
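
What Idle did can be reproduced with a few lines: decode the bytes with the declared codec first, then look at the tokens. The sketch below reuses the example from this message as raw bytes; `SOURCE` and the other variable names are chosen here for illustration and are not part of any tool.

```python
import io
import tokenize

# The example from the message, kept as raw bytes so that the \u0027
# sequences stay literal backslash escapes, as they would be on disk.
SOURCE = rb"""# coding: unicode_escape
x = '''
Code example:
\u0027\u0027\u0027 # format in monospaced on the web site
print("Did you think this would be executed?")
\u0027\u0027\u0027 # end monospaced
Surprise!
'''
print("There are %d lines in x." % len(x.split(chr(10))))
"""

# Decode the way the declaration asks, then tokenize the resulting text.
decoded = SOURCE.decode("unicode_escape")
tokens = list(tokenize.generate_tokens(io.StringIO(decoded).readline))

names = [tok.string for tok in tokens if tok.type == tokenize.NAME]
string_count = sum(tok.type == tokenize.STRING for tok in tokens)

# After decoding, each \u0027\u0027\u0027 is a real triple quote, so the
# "example" print is an ordinary call rather than text inside a string.
print(decoded.count("'''"), names.count("print"), string_count)
```

A highlighter that skips the decoding step sees one long string and a single `print`, which is the mismatch described above.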