[Python-Dev] pre-PEP: Unicode Security Considerations for Python
Hello, Today, an attack called "Trojan source" was revealed, where a malicious contributor can use Unicode features (left-to-right text and homoglyphs) to write code that, when shown in an editor, will look different from how a computer language parser will process it. See https://trojansource.codes/, CVE-2021-42574 and CVE-2021-42694. This is not a bug in Python. As far as I know, the Python Security Response team reviewed the report and decided that it should be handled in code editors, diff viewers, repository frontends and similar software, rather than in the language. I agree: in my opinion, the attack is similar to abusing any other "gotcha" where Python doesn't parse text as a non-expert human would. For example: `if a or b == 'yes'`, mutable default arguments, or a misleading typo. Nevertheless, I did do a bit of research about similar gotchas in Python, and I'd like to publish a summary as an informational PEP, pasted below. PEP: Title: Unicode Security Considerations for Python Author: Petr Viktorin Status: Active Type: Informational Content-Type: text/x-rst Created: 01-Nov-2021 Post-History: Abstract This document explains possible ways to misuse Unicode to write Python programs that appear to do something else than they actually do. This document does not give any recommendations or solutions. Introduction Python code is written in `Unicode`_ – a system for encoding and handling all kinds of written language. While this allows programmers from all around the world to express themselves, it also allows writing code that is potentially confusing to readers. It is possible to misuse Python's Unicode-related features to write code that *appears* to do something else than what it does. Evildoers could take advantage of this to trick code reviewers into accepting malicious code. The possible issues generally can't be solved in Python itself without excessive restrictions of the language.
They should be solved in code editors and review tools (such as *diff* displays), by enforcing project-specific policies, and by raising awareness of individual programmers. This document purposefully does not give any solutions or recommendations: it is rather a list of things to keep in mind. This document is specific to Python. For general security considerations in Unicode text, see [tr36]_ and [tr39]_. Acknowledgement === Investigation for this document was prompted by [CVE-2021-42574], *Trojan Source Attacks* reported by Nicholas Boucher and Ross Anderson, which focuses on Bidirectional override characters in a variety of languages. Confusing Features == This section lists some Unicode-related features that can be surprising or misusable. ASCII-only Considerations - ASCII is a subset of Unicode. While issues with the ASCII character set are generally well understood, they're presented here to help better understanding of the non-ASCII cases. Confusables and Typos ' Some characters look alike. Before the age of computers, most mechanical typewriters lacked the keys for the digits ``0`` and ``1``: users typed ``O`` (capital o) and ``l`` (lowercase L) instead. Human readers could tell them apart by context only. In programming languages, however, the distinction between digits and letters is critical -- and most fonts designed for programmers make it easy to tell them apart. Similarly, the uppercase “I” and lowercase “l” can look similar in fonts designed for human languages, but programmers' fonts make them noticeably different. However, what is “noticeably” different always depends on the context. Humans tend to ignore details in longer identifiers: the variable name ``accessibi1ity_options`` can still look indistinguishable from ``accessibility_options``, while they are distinct for the compiler. The same can be said for plain typos: most humans will not notice the typo in ``responsbility_chain_delegate``.
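To make the point about near-identical identifiers concrete, here is a minimal sketch of how a review tool might flag them. The helper name ``near_duplicates`` and the 0.9 similarity threshold are assumptions chosen for this illustration; they are not part of the draft, and a real tool would likely need a more careful measure.

```python
import difflib
from itertools import combinations


def near_duplicates(names, threshold=0.9):
    """Return pairs of distinct names whose similarity ratio is >= threshold.

    Hypothetical helper: the threshold is an arbitrary value picked for
    this example, not a recommended setting.
    """
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        ratio = difflib.SequenceMatcher(None, a, b).ratio()
        if ratio >= threshold:
            pairs.append((a, b))
    return pairs


# The two long names differ only in "1" (digit one) versus "l" (lowercase L).
names = ["accessibi1ity_options", "accessibility_options", "timeout"]
suspicious = near_duplicates(names)
print(suspicious)
```

Python itself treats the two long names as unrelated variables; only a check of this kind (or a careful reader with a good font) would notice that they are probably meant to be the same.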
Control Characters '' Python generally considers all ``CR`` (``\r``), ``LF`` (``\n``), and ``CR-LF`` pairs (``\r\n``) as end-of-line characters. Most code editors do as well, but there are editors that display “non-native” line endings as unknown characters (or nothing at all), rather than ending the line, displaying this example::

    # Don't call this function:
    fire_the_missiles()

as a harmless comment like::

    # Don't call this function:⬛fire_the_missiles()

CPython treats the control character NUL (``\0``) as end of input, but many editors simply skip it, possibly showing code that Python will not run as a regular part of a file. Some characters can be used to hide/overwrite other characters when source is listed in common terminals:

* BS (``\b``, Backspace) moves the cursor back, so the character after it will overwrite the character before.
* CR (``\r``, carriage return) moves the cursor to the start of line, subsequent characters overwrite the start of the line.
[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python
Thanks for writing this Petr! A few comments below. On Mon, Nov 01, 2021 at 01:17:02PM +0100, Petr Viktorin wrote: > >ASCII-only Considerations > >- > > > >ASCII is a subset of Unicode > > > >While issues with the ASCII character set are generally well understood, > >the're presented here to help better understanding of the non-ASCII cases. You should mention that some very common typefaces (fonts) are more confusable than others. For instance, Arial (a common font on Windows systems) makes the two-letter combination 'rn' virtually indistinguishable from the single letter 'm'. > >Before the age of computers, most mechanical typewriters lacked the keys > >for the digits ``0`` and ``1`` I'm not sure that "most" is justified here. One of the most popular typewriters in history, the Underwood #5 (from 1900 to 1920), lacked the 1 key but had a 0 distinct from O. https://i1.wp.com/curiousasacathy.com/wp-content/uploads/2016/04/underwood-no-5-standard-typewriter-circa-1901.jpg The Oliver 5 (1894 – 1928) had both a 0 and a 1, as did the 1895 Ford Typewriter. As did possibly the best-selling typewriter in history, the IBM Selectric (introduced in 1961). http://www.technocrazed.com/the-interesting-history-of-evolution-of-typewriters-photo-gallery Perhaps you should say "many older mechanical typewriters"? > >Bidirectional Text > >-- The section on bidirectional text is interesting, because reading it in my email client mutt, all the examples are left to right. You might like to note that not all applications support bidirectional text. > >Unicode includes alorithms to *normalize* variants like these to a > >single form, and Python identifiers are normalized. Typo: "algorithms". This is a good and useful document, thank you again.
-- Steve ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/CHGK6LLBMVRQ6GGEMRWYJNRLUL7KUMVS/ Code of Conduct: http://python.org/psf/codeofconduct/
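The quoted remark that Python identifiers are normalized can be checked with a short sketch. It assumes, per the Python language reference rather than anything stated in this thread, that the normalization form used for identifiers is NFKC; the choice of double-struck letters is just an example.

```python
import unicodedata

# Spell "foo" with MATHEMATICAL DOUBLE-STRUCK SMALL letters
# (U+1D552 is the double-struck "a"; the small letters are contiguous).
fancy = "".join(chr(0x1D552 + ord(c) - ord("a")) for c in "foo")

# NFKC folds these compatibility characters to plain ASCII letters.
normalized = unicodedata.normalize("NFKC", fancy)

namespace = {}
exec(f"{fancy} = 42", namespace)  # assign using the non-ASCII spelling
value = namespace["foo"]          # look it up under the normalized name
print(repr(fancy), "->", repr(normalized), value)
```

Two spellings that look different on screen therefore name the same variable, which is the flip side of the confusables problem: there, two spellings that look the same name different variables.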
[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python
This is excellent! 01.11.21 14:17, Petr Viktorin wrote: >> CPython treats the control character NUL (``\0``) as end of input, >> but many editors simply skip it, possibly showing code that Python >> will not >> run as a regular part of a file. It is an implementation detail and we will get rid of it. It only happens when you read the Python script from a file. If you import it as a module or run with runpy, the NUL character is an error. >> Some characters can be used to hide/overwrite other characters when >> source is >> listed in common terminals: >> >> * BS (``\b``, Backspace) moves the cursor back, so the character after it >> will overwrite the character before. >> * CR (``\r``, carriage return) moves the cursor to the start of line, >> subsequent characters overwrite the start of the line. >> * DEL (``\x7F``) commonly initiates escape codes which allow arbitrary >> control of the terminal. ESC (``\x1B``) starts many control sequences. ``\x1A`` means the end of the text file on Windows. Some programs (for example "type") ignore the rest of the file. Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/CBI7ME3YUAVVH5B6LSC745GJSVUIZJHO/
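A rough sketch of how a review tool might surface the control characters discussed above. The helper ``find_control_chars`` and the informal labels are hypothetical names made up for this example; the set of characters is simply the ones named in this thread, not a complete list.

```python
# Control characters mentioned in this thread; the labels are informal.
SUSPICIOUS = {
    "\x00": "NUL",
    "\x08": "BS",
    "\r": "CR",
    "\x1a": "SUB (Ctrl-Z)",
    "\x1b": "ESC",
    "\x7f": "DEL",
}


def find_control_chars(text):
    """Yield (line_number, column, label) for each listed control character.

    Hypothetical helper. Lines are split on LF only, so that a bare CR is
    reported instead of being silently treated as a line ending.
    """
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SUSPICIOUS:
                yield lineno, col, SUSPICIOUS[ch]


# A comment followed by a bare CR, then a line that hides a digit with BS.
sample = "# Don't call this function:\rfire_the_missiles()\nx = 1\x08 2\n"
hits = list(find_control_chars(sample))
for hit in hits:
    print(hit)
```

Note that this naive version would report a CR on every line of a file with CR-LF line endings; a practical checker would need to treat that pairing as ordinary.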
[Python-Dev] Re: Python multithreading without the GIL
Hi Skip, I think the performance difference is because of different versions of NumPy. Python 3.9 installs NumPy 1.21.3 by default for "pip install numpy". I've only built and packaged NumPy 1.19.4 for "nogil" Python. There are substantial performance differences between the two NumPy builds for this matmul script. With NumPy 1.19.4, I get practically the same results for both Python 3.9.2 and "nogil" Python for "time python3 matmul.py 0 10". I'll update the version of NumPy for "nogil" Python if I have some time this week. Best, Sam On Sun, Oct 31, 2021 at 5:46 PM Skip Montanaro wrote: > > Remember that pystone is a terrible benchmark. > > I understand that. I was only using it as a spot check. I was surprised at > how much slower my (threaded or unthreaded) matrix multiply was on nogil vs > 3.9+. I went into it thinking I would see an improvement. The Performance > section of Sam's design document starts: > > As mentioned above, the no-GIL proof-of-concept interpreter is about 10% > faster than CPython 3.9 (and 3.10) on the pyperformance benchmark suite. > > > so it didn't occur to me that I'd be looking at a slowdown, much less by > as much as I'm seeing. > > Maybe I've somehow stumbled on some instruction mix for which the nogil VM > is much worse than the stock VM. For now, I prefer to think I'm just doing > something stupid. It certainly wouldn't be the first time. > > Skip > > P.S. I suppose I should have cc'd Sam when I first replied to this > thread, but I'm doing so now. I figured my mistake would reveal itself > early on. Sam, here's my first post about my little "project."
> https://mail.python.org/archives/list/python-dev@python.org/message/WBLU6PZ2RDPEMG3ZYBWSAXUGXCJNFG4A/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/W23EPICXG3RVOMMCVSM3FVOEN2U3LNM3/
[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python
This is an excellent enumeration of some of the concerns! One minor comment about the introductory material: On Mon, Nov 1, 2021 at 5:21 AM Petr Viktorin wrote: > > > > Introduction > > > > > > Python code is written in `Unicode`_ – a system for encoding and > > handling all kinds of written language. Unicode specifies the mapping of glyphs to code points. Then a second mapping from code points to sequences of bytes is what is actually recorded by the computer. The second mapping is what programmers using Python will commonly think of as the encoding while the majority of what you're writing about has more to do with the first mapping. I'd try to word this in a way that doesn't lead a reader to conflate those two mappings. Maybe something like this? `Unicode`_ is a system for handling all kinds of written language. It aims to allow any character from any human natural language (as well as a few characters which are not from natural languages) to be used. Python code may consist of almost all valid Unicode characters. > > While this allows programmers from all around the world to express > > themselves, > > it also allows writing code that is potentially confusing to readers. > > -Toshio Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/Q2T3GKC6R6UH5O7RZJJNREG3XQDDZ6N4/
[Python-Dev] Re: Python multithreading without the GIL
> I think the performance difference is because of different versions of > NumPy. > Good reason to leave numpy completely out of it. Unless you want to test nogil’s performance effects on numpy code, which is an interesting exercise in itself. Also, sorry I didn’t look at your code before, but you really want to keep the generation of large random arrays out of your benchmark if you can. I suspect that’s what’s changed in numpy versions. In any case, do time the random number generation… -CHB > Python 3.9 installs NumPy 1.21.3 by default for "pip install numpy". I've > only built and packaged NumPy 1.19.4 for "nogil" Python. There are > substantial performance differences between the two NumPy builds for this > matmul script. > > With NumPy 1.19.4, I get practically the same results for both Python > 3.9.2 and "nogil" Python for "time python3 matmul.py 0 10". > > I'll update the version of NumPy for "nogil" Python if I have some time > this week. > > Best, > Sam > > On Sun, Oct 31, 2021 at 5:46 PM Skip Montanaro > wrote: > >> > Remember that pystone is a terrible benchmark. >> >> I understand that. I was only using it as a spot check. I was surprised >> at how much slower my (threaded or unthreaded) matrix multiply was on nogil >> vs 3.9+. I went into it thinking I would see an improvement. The >> Performance section of Sam's design document starts: >> >> As mentioned above, the no-GIL proof-of-concept interpreter is about 10% >> faster than CPython 3.9 (and 3.10) on the pyperformance benchmark suite. >> >> >> so it didn't occur to me that I'd be looking at a slowdown, much less by >> as much as I'm seeing. >> >> Maybe I've somehow stumbled on some instruction mix for which the nogil >> VM is much worse than the stock VM. For now, I prefer to think I'm just >> doing something stupid. It certainly wouldn't be the first time. >> >> Skip >> >> P.S. I suppose I should have cc'd Sam when I first replied to this >> thread, but I'm doing so now.
I figured my mistake would reveal itself >> early on. Sam, here's my first post about my little "project." >> https://mail.python.org/archives/list/python-dev@python.org/message/WBLU6PZ2RDPEMG3ZYBWSAXUGXCJNFG4A/ >> >> >> -- Christopher Barker, PhD (Chris) Python Language Consulting - Teaching - Scientific Software Development - Desktop GUI and Web Development - wxPython, numpy, scipy, Cython Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/VLSAMFORVMEIQVH3UH6LOK3OA3GL7C6J/
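The advice above, to keep random array generation out of the measured region and to time it separately, might look like the following sketch. The script ``matmul.py`` is not shown anywhere in this thread, so everything here (``make_matrix``, ``matmul``, the matrix sizes, the use of plain lists instead of numpy) is an assumption made purely for illustration.

```python
import random
import time


def make_matrix(rows, cols, rng):
    """Build a rows x cols list-of-lists of random floats (setup work)."""
    return [[rng.random() for _ in range(cols)] for _ in range(rows)]


def matmul(a, b):
    """Naive pure-Python matrix product of two list-of-lists matrices."""
    b_cols = list(zip(*b))  # transpose b once so each column is a tuple
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols]
            for row in a]


rng = random.Random(0)  # fixed seed so repeated runs use the same inputs

t0 = time.perf_counter()
a = make_matrix(40, 30, rng)
b = make_matrix(30, 20, rng)
t1 = time.perf_counter()  # setup ends here
c = matmul(a, b)
t2 = time.perf_counter()  # multiplication ends here

setup_seconds = t1 - t0
multiply_seconds = t2 - t1
print(f"setup: {setup_seconds:.6f}s  multiply: {multiply_seconds:.6f}s")
```

Reporting the two intervals separately makes it visible when a difference between interpreters or library versions comes from input generation rather than from the operation being benchmarked. For serious measurements a dedicated tool such as pyperf, mentioned later in the thread, is the better choice.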
[Python-Dev] Re: Python multithreading without the GIL
Sam> I think the performance difference is because of different versions of NumPy. Thanks all for the help/input/advice. It never occurred to me that two relatively recent versions of numpy would differ so much for the simple tasks in my script (array creation & transform). I confirmed this by removing 1.21.3 and installing 1.19.4 in my 3.9 build. I also got a little bit familiar with pyperf, and as a "stretch" goal completely removed random numbers and numpy from my script. (Took me a couple tries to get my array init and transposition correct. Let's just say that it's been a while. Numpy *was* a nice crutch...) With no trace of numpy left I now get identical results for single-threaded matrix multiply (a size==1, b size==2):

3.9: matmul: Mean +- std dev: 102 ms +- 1 ms
nogil: matmul: Mean +- std dev: 103 ms +- 2 ms

and a nice speedup for multi-threaded (a size==3, b size==6, nthreads=3):

3.9: matmul_t: Mean +- std dev: 290 ms +- 13 ms
nogil: matmul_t: Mean +- std dev: 102 ms +- 3 ms

Sam> I'll update the version of NumPy for "nogil" Python if I have some time this week. I think it would be sufficient to alert users to the 1.19/1.21 performance differences and recommend they force install 1.19 in non-nogil builds for testing purposes. Hopefully adding a simple note to your README will take less time than porting your changes to numpy 1.21 and adjusting your build configs/scripts. Skip Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/5RXRTNNCYBCILMVATHODFGAZ5ZEQXRZI/
[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python
"The East Asian symbol for *ten* looks like a plus sign, so ``十= 10`` is a complete Python statement." Normally, an identifier must begin with a letter, and numbers can only be used in the second and subsequent positions. (XID_CONTINUE instead of XID_START) The fact that some characters with numeric values are considered letters (in this case, category Lo, Other Letters) is a different problem than just looking visually confusable with "+", and it should probably be listed on its own. -jJ ___ Python-Dev mailing list -- python-dev@python.org To unsubscribe send an email to python-dev-le...@python.org https://mail.python.org/mailman3/lists/python-dev.python.org/ Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/RV7RU7DGWFIBEGFKNYDP63ZRJNP5Y4YU/ Code of Conduct: http://python.org/psf/codeofconduct/
[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python
On 11/1/2021 8:17 AM, Petr Viktorin wrote: Nevertheless, I did do a bit of research about similar gotchas in Python, and I'd like to publish a summary as an informational PEP, pasted below. Very helpful. Bidirectional Text -- Some scripts, such as Hebrew or Arabic, are written right-to-left. [Suggested addition, subject to further revision.] There are at least three levels of handling r2l chars: none, local (contiguous sequences are properly reversed), and extended (see below). The handling depends on the display software and may depend on the quoting. Tk and hence tkinter (and IDLE) text widgets do local handling. Windows Notepad++ does local handling of unquoted code but extended handling of quoted text. Windows Notepad currently does extended handling even without quotes. In extended handling, phrases ... Phrases in such scripts interact with nearby text in ways that can be surprising to people who aren't familiar with these writing systems and their computer representation. The exact process is complicated, and explained in Unicode® Standard Annex #9, "Unicode Bidirectional Algorithm". Some surprising examples include: * In the statement ``ערך = 23``, the variable ``ערך`` is set to the integer 23. In local handling, one sees = 23`. In extended handling, one sees 23 = . (Notepad++ sees backticks as quotes.) Source Encoding --- The encoding of Python source files is given by a specific regex on the first two lines of a file, as per `Encoding declarations`_. This mechanism is very liberal in what it accepts, and thus easy to obfuscate. This can be misused in combination with Python-specific special-purpose encodings (see `Text Encodings`_). Are `Encoding declarations`_ and `Text Encodings`_ supposed to link to something? For example, with ``encoding: unicode_escape``, characters like quotes or braces can be hidden in an (f-)string, with many tools (syntax highlighters, linters, etc.) considering them part of the string.
For example:: I don't see the connection between the text above and the example that follows. # For writing Japanese, you don't need an editor that supports # UTF-8 source encoding: unicode_escape sequences work just as well. [etc] -- Terry Jan Reedy Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/34JROXNUHEUDC4TOWUAM74KIGIRRHHG4/
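How liberal the encoding declaration is can be seen with ``tokenize.detect_encoding`` from the standard library, which looks for a declaration in the first two lines of a source file. The sketch below is only illustrative; ``declared_encoding`` and the sample comment lines are made up for this example, and ``latin-1`` is used in place of ``unicode_escape`` simply to keep the demonstration harmless.

```python
import io
import tokenize


def declared_encoding(source_bytes):
    """Return the encoding name that tokenize detects for the source bytes."""
    readline = io.BytesIO(source_bytes).readline
    encoding, _lines = tokenize.detect_encoding(readline)
    return encoding


# The conventional, clearly visible form of the declaration.
conventional = declared_encoding(b"# -*- coding: latin-1 -*-\nx = 1\n")

# The same declaration buried in what reads like ordinary prose:
# the text "coding:" inside the word "encoding:" is enough to match.
buried = declared_encoding(
    b"# Notes on text encoding: latin-1 is common here\nx = 1\n"
)

# With no declaration at all, the default applies.
none_declared = declared_encoding(b"x = 1\n")

print(conventional, buried, none_declared)
```

This may help with the connection asked about above: in the draft's example, the comment that appears to be prose about "source encoding: unicode_escape" is itself the declaration that switches the file's encoding.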
[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python
On Mon, Nov 01, 2021 at 11:41:06AM -0700, Toshio Kuratomi wrote: > Unicode specifies the mapping of glyphs to code points. Then a second > mapping from code points to sequences of bytes is what is actually > recorded by the computer. The second mapping is what programmers > using Python will commonly think of as the encoding while the majority > of what you're writing about has more to do with the first mapping. I don't think that is correct. According to the Unicode consortium -- and I hope that they would know *wink* -- Unicode is the universal character encoding. In other words: "Unicode provides a unique number for every character" https://www.unicode.org/standard/WhatIsUnicode.html Not glyphs. ("Character" in natural language is a bit of a fuzzy concept, so I think that Unicode here is referring to what their glossary calls an abstract character.) The usual meaning of glyph is for the graphical images used by fonts (typefaces) for display. Sense 2 in the Unicode glossary here: https://www.unicode.org/glossary/#glyph I'm not really sure what they mean by sense 1, unless they mean a representative glyph, which is intended to stand in as an example of the entire range of glyphs. Unicode does not specify what the glyphs for code points are, although it does provide representative samples. See, for example, their comment on emoji: "The Unicode Consortium provides character code charts that show a representative glyph" http://www.unicode.org/faq/emoji_dingbats.html Their code point charts likewise show representative glyphs for other letters and symbols, not authoritative. And of course, many abstract characters do not have glyphs at all, e.g. invisible joiners, control characters, variation selectors, noncharacters, etc. The mapping from bytes to code points and abstract characters is also part of Unicode. 
The UTF encodings are part of Unicode: https://www.unicode.org/faq/utf_bom.html#gen2 The "U" in UTF literally stands for Unicode :-) -- Steve Message archived at https://mail.python.org/archives/list/python-dev@python.org/message/I7ZRNIHSQ7UL4NSKOXFRYBYHQEXGNBPA/
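The two mappings under discussion, from character to code point and from code point to bytes, can be seen side by side in a small sketch. The choice of character is arbitrary and the variable names are made up for this example.

```python
ch = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

code_point = ord(ch)  # the number Unicode assigns to the character

# The same code point serialized by three of the UTF encodings.
utf8_bytes = ch.encode("utf-8")
utf16_bytes = ch.encode("utf-16-le")
utf32_bytes = ch.encode("utf-32-le")

# Decoding each byte sequence with the matching UTF recovers the character.
round_trips = [
    utf8_bytes.decode("utf-8"),
    utf16_bytes.decode("utf-16-le"),
    utf32_bytes.decode("utf-32-le"),
]

print(hex(code_point), utf8_bytes, utf16_bytes, utf32_bytes)
```

The code point stays the same throughout; only the byte representation changes with the encoding. How the character is drawn on screen (the glyph) does not appear anywhere in this code, since that is decided by the font.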