[Python-Dev] pre-PEP: Unicode Security Considerations for Python

2021-11-01 Thread Petr Viktorin

Hello,
Today, an attack called "Trojan source" was revealed, where a malicious 
contributor can use Unicode features (bidirectional text and homoglyphs) 
to write code that, when shown in an editor, will look different from how 
a computer language parser will process it.

See https://trojansource.codes/, CVE-2021-42574 and CVE-2021-42694.

This is not a bug in Python.
As far as I know, the Python Security Response team reviewed the report 
and decided that it should be handled in code editors, diff viewers, 
repository frontends and similar software, rather than in the language.


I agree: in my opinion, the attack is similar to abusing any other 
"gotcha" where Python doesn't parse text as a non-expert human would. 
For example: `if a or b == 'yes'`, mutable default arguments, or a 
misleading typo.
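
As a rough illustration of the first two gotchas named above, here is a minimal sketch. The function names (`wants_yes`, `append_item`) are made up for this example and do not come from the report or from Python itself.

```python
def wants_yes(a, b):
    # Parsed as `a or (b == 'yes')`, not `(a == 'yes') or (b == 'yes')`.
    return bool(a or b == 'yes')


def append_item(item, bucket=[]):
    # The default list is created once, when the function is defined,
    # and is then shared by every call that omits `bucket`.
    bucket.append(item)
    return bucket


# Any non-empty string is truthy, so the comparison is never reached.
print(wants_yes('no', 'no'))

# Both calls return the very same list object.
first = append_item('a')
second = append_item('b')
print(first is second)
```

A reader who skims `a or b == 'yes'` as "either one equals 'yes'" is misled in the same way a reviewer can be misled by the Unicode tricks described in the draft.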


Nevertheless, I did do a bit of research about similar gotchas in 
Python, and I'd like to publish a summary as an informational PEP, 
pasted below.





PEP: 
Title: Unicode Security Considerations for Python
Author: Petr Viktorin 
Status: Active
Type: Informational
Content-Type: text/x-rst
Created: 01-Nov-2021
Post-History:

Abstract
========

This document explains possible ways to misuse Unicode to write Python
programs that appear to do something else than they actually do.

This document does not give any recommendations and solutions.


Introduction
============

Python code is written in `Unicode`_ – a system for encoding and
handling all kinds of written language.
While this allows programmers from all around the world to express themselves,
it also allows writing code that is potentially confusing to readers.

It is possible to misuse Python's Unicode-related features to write code that
*appears* to do something else than what it does.
Evildoers could take advantage of this to trick code reviewers into
accepting malicious code.

The possible issues generally can't be solved in Python itself without
excessive restrictions of the language.
They should be solved in code editors and review tools
(such as *diff* displays), by enforcing project-specific policies,
and by raising awareness of individual programmers.

This document purposefully does not give any solutions
or recommendations: it is rather a list of things to keep in mind.

This document is specific to Python.
For general security considerations in Unicode text, see [tr36]_ and [tr39]_.


Acknowledgement
===============

Investigation for this document was prompted by [CVE-2021-42574],
*Trojan Source Attacks* reported by Nicholas Boucher and Ross Anderson,
which focuses on Bidirectional override characters in a variety of languages.


Confusing Features
==================

This section lists some Unicode-related features that can be surprising
or misusable.


ASCII-only Considerations
-------------------------

ASCII is a subset of Unicode.

While issues with the ASCII character set are generally well understood,
they're presented here to help better understand the non-ASCII cases.

Confusables and Typos
'''''''''''''''''''''

Some characters look alike.
Before the age of computers, most mechanical typewriters lacked the keys for
the digits ``0`` and ``1``: users typed ``O`` (capital o) and ``l``
(lowercase L) instead. Human readers could tell them apart by context only.
In programming languages, however, the distinction between digits and letters is
critical -- and most fonts designed for programmers make it easy to tell them
apart.

Similarly, the uppercase “I” and lowercase “l” can look similar in fonts
designed for human languages, but programmers' fonts make them noticeably
different.

However, what is “noticeably” different always depends on the context.
Humans tend to ignore details in longer identifiers: the variable name
``accessibi1ity_options`` can still look indistinguishable from
``accessibility_options``, while they are distinct for the compiler.

The same can be said for plain typos: most humans will not notice the typo in
``responsbility_chain_delegate``.
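
A minimal sketch of how the two look-alike names from the text compare as strings. The helper `differing_positions` is a hypothetical name introduced only for this illustration, and it assumes both names have the same length.

```python
def differing_positions(a: str, b: str):
    """Return (index, char_in_a, char_in_b) wherever two equal-length names differ."""
    return [(i, x, y) for i, (x, y) in enumerate(zip(a, b)) if x != y]


lookalike = "accessibi1ity_options"   # contains the digit one
intended = "accessibility_options"    # contains the lowercase letter L

# The names are different strings, hence different identifiers.
print(lookalike == intended)
# A single position differs: the digit '1' versus the letter 'l'.
print(differing_positions(lookalike, intended))
```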

Control Characters
''''''''''''''''''

Python generally treats ``CR`` (``\r``), ``LF`` (``\n``), and ``CR-LF``
pairs (``\r\n``) as end-of-line sequences.
Most code editors do as well, but there are editors that display “non-native”
line endings as unknown characters (or nothing at all), rather than ending
the line, displaying this example::

    # Don't call this function:
    fire_the_missiles()

as a harmless comment like::

    # Don't call this function:⬛fire_the_missiles()
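
The line-ending behaviour can be sketched as follows. This is an illustration only: instead of a real ``fire_the_missiles()`` function, the stand-in code appends a marker to a list, and the helper names ``run_with_bare_cr`` and ``has_bare_cr`` are assumptions made for this example.

```python
import re


def run_with_bare_cr():
    """Execute source whose lines are separated only by bare CR characters."""
    source = (
        "calls = []\r"
        "# Don't call this function:\r"
        "calls.append('fire_the_missiles')\r"
    )
    namespace = {}
    # compile() accepts CR, LF and CR-LF as line endings in source strings.
    exec(compile(source, "<cr-example>", "exec"), namespace)
    return namespace["calls"]


def has_bare_cr(text: str) -> bool:
    """Report whether the text contains a CR that is not followed by LF."""
    return re.search(r"\r(?!\n)", text) is not None


print(run_with_bare_cr())
print(has_bare_cr("# comment\rcode()"))
```

A check like ``has_bare_cr`` is one possible way for a review tool to flag such files. It is not something the draft itself proposes.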

CPython treats the control character NUL (``\0``) as end of input,
but many editors simply skip it, possibly showing code that Python will not
run as a regular part of a file.

Some characters can be used to hide/overwrite other characters when source is
listed in common terminals:

* BS (``\b``, Backspace) moves the cursor back, so the character after it
  will overwrite the character before.
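* CR (``\r``, carriage return) moves the cursor to the start of line,
  subsequent characters overwrite the start of the line.
* DEL (``\x7F``) commonly initiates escape codes which allow arbitrary
  control of the terminal.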

[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-01 Thread Steven D'Aprano
Thanks for writing this Petr!

A few comments below.

On Mon, Nov 01, 2021 at 01:17:02PM +0100, Petr Viktorin wrote:

> >ASCII-only Considerations
> >-------------------------
> >
> >ASCII is a subset of Unicode
> >
> >While issues with the ASCII character set are generally well understood,
> >the're presented here to help better understanding of the non-ASCII cases.

You should mention that some very common typefaces (fonts) are more 
confusable than others. For instance, Arial (a common font on Windows 
systems) makes the two letter combination 'rn' virtually 
indistinguishable from the single letter 'm'.


> >Before the age of computers, most mechanical typewriters lacked the keys 
> >for the digits ``0`` and ``1``

I'm not sure that "most" is justified here. One of the most popular 
typewriters in history, the Underwood #5 (from 1900 to 1920), lacked 
the 1 key but had a 0 distinct from O.

https://i1.wp.com/curiousasacathy.com/wp-content/uploads/2016/04/underwood-no-5-standard-typewriter-circa-1901.jpg

The Oliver 5 (1894 – 1928) had both a 0 and a 1, as did the 1895 Ford 
Typewriter. As did possibly the best selling typewriter in history, the 
IBM Selectric (introduced in 1961).

http://www.technocrazed.com/the-interesting-history-of-evolution-of-typewriters-photo-gallery

Perhaps you should say "many older mechanical typewriters"?


> >Bidirectional Text
> >------------------

The section on bidirectional text is interesting, because reading it in 
my email client mutt, all the examples are left to right.

You might like to note that not all applications support bidirectional 
text.


> >Unicode includes alorithms to *normalize* variants like these to a 
> >single form, and Python identifiers are normalized.

Typo: "algorithms".
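
For readers who want to see the normalization in action, here is a small sketch. It assumes NFKC is the form applied to identifiers (as specified in PEP 3131) and uses the fullwidth letter U+FF58 purely as an illustrative input.

```python
import unicodedata

# U+FF58 FULLWIDTH LATIN SMALL LETTER X, written as an escape so the
# example does not depend on how the character is displayed.
fullwidth_x = "\uff58"
normalized = unicodedata.normalize("NFKC", fullwidth_x)

# Assign to the fullwidth name and inspect which key the namespace holds.
namespace = {}
exec(f"{fullwidth_x} = 1", namespace)

print(normalized == "x")
print("x" in namespace, fullwidth_x in namespace)
```

The variable ends up stored under the plain ASCII name, so two spellings that look different in source can refer to the same identifier.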



This is a good and useful document, thank you again.


-- 
Steve
___
Python-Dev mailing list -- python-dev@python.org
To unsubscribe send an email to python-dev-le...@python.org
https://mail.python.org/mailman3/lists/python-dev.python.org/
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/CHGK6LLBMVRQ6GGEMRWYJNRLUL7KUMVS/
Code of Conduct: http://python.org/psf/codeofconduct/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-01 Thread Serhiy Storchaka
This is excellent!

01.11.21 14:17, Petr Viktorin пише:
>> CPython treats the control character NUL (``\0``) as end of input,
>> but many editors simply skip it, possibly showing code that Python
>> will not run as a regular part of a file.

It is an implementation detail and we will get rid of it. It only
happens when you read the Python script from a file. If you import it as
a module or run with runpy, the NUL character is an error.

>> Some characters can be used to hide/overwrite other characters when
>> source is listed in common terminals:
>>
>> * BS (``\b``, Backspace) moves the cursor back, so the character after it
>>   will overwrite the character before.
>> * CR (``\r``, carriage return) moves the cursor to the start of line,
>>   subsequent characters overwrite the start of the line.
>> * DEL (``\x7F``) commonly initiates escape codes which allow arbitrary
>>   control of the terminal.

ESC (``\x1B``) starts many control sequences.

``\x1A`` means the end of the text file on Windows. Some programs (for
example "type") ignore the rest of the file.
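
A minimal sketch of a scanner for the control characters discussed in this message and in the quoted list. The table `SUSPICIOUS` and the function `find_control_characters` are hypothetical names for illustration. Which characters a project should actually flag is a policy decision and is not settled here.

```python
SUSPICIOUS = {
    "\x00": "NUL",
    "\x08": "BS",
    "\x1a": "SUB (Ctrl-Z)",
    "\x1b": "ESC",
    "\x7f": "DEL",
}


def find_control_characters(text: str):
    """Yield (line number, column, name) for each suspicious control character."""
    for lineno, line in enumerate(text.split("\n"), start=1):
        for col, ch in enumerate(line):
            if ch in SUSPICIOUS:
                yield lineno, col, SUSPICIOUS[ch]
            elif ch == "\r" and col != len(line) - 1:
                # A CR that is not the first half of a CR-LF line ending.
                yield lineno, col, "CR"


sample = "x = 1\r\ny = 'a\x08b'\nprint('\x1b[2J')\n"
for hit in find_control_characters(sample):
    print(hit)
```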

Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/CBI7ME3YUAVVH5B6LSC745GJSVUIZJHO/


[Python-Dev] Re: Python multithreading without the GIL

2021-11-01 Thread Sam Gross
Hi Skip,

I think the performance difference is because of different versions of
NumPy. Python 3.9 installs NumPy 1.21.3 by default for "pip install numpy".
I've only built and packaged NumPy 1.19.4 for "nogil" Python. There are
substantial performance differences between the two NumPy builds for this
matmul script.

With NumPy 1.19.4, I get practically the same results for both Python 3.9.2
and "nogil" Python for "time python3 matmul.py 0 10".

I'll update the version of NumPy for "nogil" Python if I have some time
this week.

Best,
Sam

On Sun, Oct 31, 2021 at 5:46 PM Skip Montanaro 
wrote:

> > Remember that pystone is a terrible benchmark.
>
> I understand that. I was only using it as a spot check. I was surprised at
> how much slower my (threaded or unthreaded) matrix multiply was on nogil vs
> 3.9+. I went into it thinking I would see an improvement. The Performance
> section of Sam's design document starts:
>
> As mentioned above, the no-GIL proof-of-concept interpreter is about 10%
> faster than CPython 3.9 (and 3.10) on the pyperformance benchmark suite.
>
>
> so it didn't occur to me that I'd be looking at a slowdown, much less by
> as much as I'm seeing.
>
> Maybe I've somehow stumbled on some instruction mix for which the nogil VM
> is much worse than the stock VM. For now, I prefer to think I'm just doing
> something stupid. It certainly wouldn't be the first time.
>
> Skip
>
> P.S. I suppose I should have cc'd Sam when I first replied to this
> thread, but I'm doing so now. I figured my mistake would reveal itself
> early on. Sam, here's my first post about my little "project."
> https://mail.python.org/archives/list/python-dev@python.org/message/WBLU6PZ2RDPEMG3ZYBWSAXUGXCJNFG4A/
>
>
>
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/W23EPICXG3RVOMMCVSM3FVOEN2U3LNM3/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-01 Thread Toshio Kuratomi
This is an excellent enumeration of some of the concerns!

One minor comment about the introductory material:

On Mon, Nov 1, 2021 at 5:21 AM Petr Viktorin  wrote:

> >
> > Introduction
> > ============
> >
> > Python code is written in `Unicode`_ – a system for encoding and
> > handling all kinds of written language.

Unicode specifies the mapping of glyphs to code points.  Then a second
mapping from code points to sequences of bytes is what is actually
recorded by the computer.  The second mapping is what programmers
using Python will commonly think of as the encoding while the majority
of what you're writing about has more to do with the first mapping.
I'd try to word this in a way that doesn't lead a reader to conflate
those two mappings.

Maybe something like this?

  `Unicode`_ is a system for handling all kinds of written language.
It aims to allow any character from any human natural language (as
well as a few characters which are not from natural languages) to be
used. Python code may consist of almost all valid Unicode characters.

> > While this allows programmers from all around the world to express 
> > themselves,
> > it also allows writing code that is potentially confusing to readers.
> >

-Toshio
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/Q2T3GKC6R6UH5O7RZJJNREG3XQDDZ6N4/


[Python-Dev] Re: Python multithreading without the GIL

2021-11-01 Thread Christopher Barker
> I think the performance difference is because of different versions of
> NumPy.
>

Good reason to leave numpy completely out of it. Unless you want to test
 nogil’s performance effects on numpy code — an interesting exercise in
itself.

Also — sorry I didn’t look at your code before, but you really want to keep
the generation of large random arrays out of your benchmark if you can. I
suspect that’s what’s changed in numpy versions.

In any case, do time the random number generation…
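
One way to follow that advice, sketched with the standard library only. The original matmul.py is not shown in this thread, so the function names, matrix size and seeds below are all assumptions. The point is only that setup and the measured operation are timed separately.

```python
import random
import time


def make_matrix(n, seed):
    """Build an n-by-n matrix of random floats (setup, not part of the benchmark)."""
    rng = random.Random(seed)
    return [[rng.random() for _ in range(n)] for _ in range(n)]


def matmul(a, b):
    """Plain-Python matrix multiply on lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def timed(func, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


# Time the random-data generation on its own ...
(a, b), setup_seconds = timed(lambda: (make_matrix(40, 1), make_matrix(40, 2)))
# ... and the multiplication on its own.
product, matmul_seconds = timed(matmul, a, b)
print(f"setup: {setup_seconds:.4f}s, matmul: {matmul_seconds:.4f}s")
```

For real measurements a tool such as pyperf, mentioned later in the thread, gives more reliable numbers than a single `perf_counter` interval.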

-CHB



--
Christopher Barker, PhD (Chris)

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/VLSAMFORVMEIQVH3UH6LOK3OA3GL7C6J/


[Python-Dev] Re: Python multithreading without the GIL

2021-11-01 Thread Skip Montanaro
Sam> I think the performance difference is because of different
versions of NumPy.

Thanks all for the help/input/advice. It never occurred to me that two
relatively recent versions of numpy would differ so much for the
simple tasks in my script (array creation & transform). I confirmed
this by removing 1.21.3 and installing 1.19.4 in my 3.9 build.

I also got a little bit familiar with pyperf, and as a "stretch" goal
completely removed random numbers and numpy from my script. (Took me a
couple tries to get my array init and transposition correct. Let's
just say that it's been awhile. Numpy *was* a nice crutch...) With no
trace of numpy left I now get identical results for single-threaded
matrix multiply (a size==1, b size==2):

3.9: matmul: Mean +- std dev: 102 ms +- 1 ms
nogil: matmul: Mean +- std dev: 103 ms +- 2 ms

and a nice speedup for multi-threaded (a size==3, b size=6, nthreads=3):

3.9: matmul_t: Mean +- std dev: 290 ms +- 13 ms
nogil: matmul_t: Mean +- std dev: 102 ms +- 3 ms
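
The script itself is not included in this message. The following is only a guess at the general shape of a numpy-free, multi-threaded matrix multiply, in which each thread computes a contiguous block of output rows. All names here are hypothetical.

```python
import threading


def matmul_rows(a, b_cols, out, start, stop):
    """Compute output rows start..stop-1 and store them in `out`."""
    for i in range(start, stop):
        out[i] = [sum(x * y for x, y in zip(a[i], col)) for col in b_cols]


def matmul_threaded(a, b, nthreads=3):
    """Multiply two list-of-lists matrices, splitting the rows of `a` across threads."""
    b_cols = list(zip(*b))
    out = [None] * len(a)
    # Ceiling division, with a floor of 1 so an empty matrix does not break range().
    chunk = max(1, -(-len(a) // nthreads))
    threads = [
        threading.Thread(
            target=matmul_rows,
            args=(a, b_cols, out, start, min(start + chunk, len(a))),
        )
        for start in range(0, len(a), chunk)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return out


print(matmul_threaded([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

With a GIL, threads running pure-Python arithmetic like this take turns rather than run in parallel. That is consistent with the speedup reported above appearing only on the nogil build.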

Sam> I'll update the version of NumPy for "nogil" Python if I have
some time this week.

I think it would be sufficient to alert users to the 1.19/1.21
performance differences and recommend they force install 1.19 in
non-nogil builds for testing purposes. Hopefully adding a simple note
to your README will take less time than porting your changes to numpy
1.21 and adjusting your build configs/scripts.

Skip
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/5RXRTNNCYBCILMVATHODFGAZ5ZEQXRZI/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-01 Thread Jim J. Jewett
"The East Asian symbol for *ten* looks like a plus sign, so ``十= 10`` is a 
complete Python statement."

Normally, an identifier must begin with a letter, and numbers can only be used 
in the second and subsequent positions.  (XID_CONTINUE instead of XID_START)  
The fact that some characters with numeric values are considered letters (in 
this case, category Lo, Other Letters) is a different problem than just looking 
visually confusable with "+", and it should probably be listed on its own.
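
The distinction can be checked with the standard `unicodedata` module. This is a small sketch, with the characters written as escapes so the example does not depend on font or display. U+0661 (ARABIC-INDIC DIGIT ONE) is included only as a contrasting decimal digit and is not taken from the draft.

```python
import unicodedata

ten = "\u5341"        # the CJK ideograph quoted above
digit_one = "\u0661"  # a decimal digit, for comparison

facts = {
    "category": unicodedata.category(ten),
    "numeric": unicodedata.numeric(ten),
    "isidentifier": ten.isidentifier(),
    "isdigit": ten.isdigit(),
}
print(facts)

# The quoted statement really is an assignment to a one-character name.
namespace = {}
exec(f"{ten}= 10", namespace)
print(namespace[ten])

# A true decimal digit cannot start an identifier, but may continue one.
print(unicodedata.category(digit_one), digit_one.isidentifier(), ("a" + digit_one).isidentifier())
```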

-jJ
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/RV7RU7DGWFIBEGFKNYDP63ZRJNP5Y4YU/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-01 Thread Terry Reedy

On 11/1/2021 8:17 AM, Petr Viktorin wrote:

> Nevertheless, I did do a bit of research about similar gotchas in 
> Python, and I'd like to publish a summary as an informational PEP, 
> pasted below.


Very helpful.


> Bidirectional Text
> ------------------
>
> Some scripts, such as Hebrew or Arabic, are written right-to-left.


[Suggested addition, subject to further revision.]

There are at least three levels of handling right-to-left characters: none, 
local (contiguous sequences are properly reversed), and extended (see 
below). The handling depends on the display software and may depend on the 
quoting.  Tk and hence tkinter (and IDLE) text widgets do local handling. 
Windows Notepad++ does local handling of unquoted code but extended 
handling of quoted text.  Windows Notepad currently does extended 
handling even without quotes.


In extended handling, phrases ...


> Phrases in such scripts interact with nearby text in ways that can be
> surprising to people who aren't familiar with these writing systems
> and their computer representation.
>
> The exact process is complicated, and explained in Unicode® Standard
> Annex #9, "Unicode Bidirectional Algorithm".
>
> Some surprising examples include:
>
> * In the statement ``ערך = 23``, the variable ``ערך`` is set to the
>   integer 23.


In local handling, one sees  = 23`.  In extended handling,
one sees 23 = .  (Notepad++ sees backticks as quotes.)
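
Whatever a given display does, the stored (logical) order of the characters can be inspected directly. Below is a minimal sketch using `unicodedata.bidirectional`. The helper `bidi_report` and the set `BIDI_CONTROL_CLASSES` are names invented for this illustration.

```python
import unicodedata

# Bidi classes of the explicit embedding, override and isolate controls.
BIDI_CONTROL_CLASSES = {"LRE", "RLE", "LRO", "RLO", "PDF", "LRI", "RLI", "FSI", "PDI"}


def bidi_report(line: str):
    """Return (index, code point, bidi class) for right-to-left or bidi-control characters."""
    report = []
    for i, ch in enumerate(line):
        cls = unicodedata.bidirectional(ch)
        if cls in {"R", "AL"} or cls in BIDI_CONTROL_CLASSES:
            report.append((i, f"U+{ord(ch):04X}", cls))
    return report


# The statement from the quoted example, spelled with escapes so that the
# logical order is unambiguous however this message is rendered.
statement = "\u05e2\u05e8\u05da = 23"
print(bidi_report(statement))

# In logical order the name comes first, so this is an ordinary assignment.
namespace = {}
exec(statement, namespace)
print(namespace["\u05e2\u05e8\u05da"])
```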



> Source Encoding
> ---------------
>
> The encoding of Python source files is given by a specific regex on
> the first two lines of a file, as per `Encoding declarations`_.
> This mechanism is very liberal in what it accepts, and thus easy to
> obfuscate.
>
> This can be misused in combination with Python-specific special-purpose
> encodings (see `Text Encodings`_).



Are `Encoding declarations`_ and `Text Encodings`_ supposed to link to 
something?




> For example, with ``encoding: unicode_escape``, characters like
> quotes or braces can be hidden in an (f-)string, with many tools (syntax
> highlighters, linters, etc.) considering them part of the string.
> For example::


I don't see the connection between the text above and the example that 
follows.



>     # For writing Japanese, you don't need an editor that supports
>     # UTF-8 source encoding: unicode_escape sequences work just as well.

[etc]
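
A possible way to make the connection concrete: the sketch below is not the example from the draft (which is elided here). It first shows how liberally a declaration is recognized, using `tokenize.detect_encoding`, and then how decoding with ``unicode_escape`` turns an escape inside what looks like one string literal into a real quote. The helper name `declared_encoding` and the sample bytes are assumptions for illustration.

```python
import io
import tokenize


def declared_encoding(source_bytes: bytes) -> str:
    """Return the source encoding that the tokenize module detects."""
    encoding, _ = tokenize.detect_encoding(io.BytesIO(source_bytes).readline)
    return encoding


# The declaration only has to match a loose pattern inside a comment.
print(declared_encoding(b"# vim: set fileencoding=latin-1 :\nx = 1\n"))
print(declared_encoding(b"# encoding: unicode_escape\nx = 1\n"))

# To a tool that ignores the encoding, this line is one string assignment.
raw = rb'name = "a\u0022; hidden = 1; b = \u0022c"'
decoded = raw.decode("unicode_escape")
print(decoded)

# After decoding, the escapes have become quotes and `hidden` is real code.
namespace = {}
exec(decoded, namespace)
print(namespace["hidden"])
```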


--
Terry Jan Reedy
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/34JROXNUHEUDC4TOWUAM74KIGIRRHHG4/


[Python-Dev] Re: pre-PEP: Unicode Security Considerations for Python

2021-11-01 Thread Steven D'Aprano
On Mon, Nov 01, 2021 at 11:41:06AM -0700, Toshio Kuratomi wrote:

> Unicode specifies the mapping of glyphs to code points.  Then a second
> mapping from code points to sequences of bytes is what is actually
> recorded by the computer.  The second mapping is what programmers
> using Python will commonly think of as the encoding while the majority
> of what you're writing about has more to do with the first mapping.

I don't think that is correct.

According to the Unicode consortium -- and I hope that they would know 
*wink* -- Unicode is the universal character encoding. In other words:

"Unicode provides a unique number for every character"

https://www.unicode.org/standard/WhatIsUnicode.html

Not glyphs.

("Character" in natural language is a bit of a fuzzy concept, so I think 
that Unicode here is referring to what their glossary calls an abstract 
character.)

The usual meaning of glyph is for the graphical images used 
by fonts (typefaces) for display. Sense 2 in the Unicode glossary here:

https://www.unicode.org/glossary/#glyph

I'm not really sure what they mean by sense 1, unless they mean a 
representative glyph, which is intended to stand in as an example of the 
entire range of glyphs.

Unicode does not specify what the glyphs for code points are, although 
it does provide representative samples. See, for example, their comment 
on emoji:

"The Unicode Consortium provides character code charts that show a 
representative glyph"

http://www.unicode.org/faq/emoji_dingbats.html

Their code point charts likewise show representative glyphs for other 
letters and symbols, not authoritative. And of course, many abstract 
characters do not have glyphs at all, e.g. invisible joiners, control 
characters, variation selectors, noncharacters, etc.

The mapping from bytes to code points and abstract characters is also 
part of Unicode. The UTF encodings are part of Unicode:

https://www.unicode.org/faq/utf_bom.html#gen2

The "U" in UTF literally stands for Unicode :-)
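
To illustrate the two layers in Python terms, here is a small sketch. The helper name `describe` and the choice of U+00E9 are made up for this example. One abstract character has one code point, while the bytes depend on which UTF is used.

```python
def describe(ch: str):
    """Show one character's code point and its bytes in three UTF encodings."""
    return {
        "code_point": f"U+{ord(ch):04X}",
        "utf-8": ch.encode("utf-8").hex(" "),
        "utf-16-le": ch.encode("utf-16-le").hex(" "),
        "utf-32-le": ch.encode("utf-32-le").hex(" "),
    }


# U+00E9 LATIN SMALL LETTER E WITH ACUTE
for key, value in describe("\u00e9").items():
    print(f"{key}: {value}")
```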


-- 
Steve
Message archived at 
https://mail.python.org/archives/list/python-dev@python.org/message/I7ZRNIHSQ7UL4NSKOXFRYBYHQEXGNBPA/