Source: python3.10
Version: 3.10.4-3
Severity: wishlist
Tags: patch
User: reproducible-bui...@lists.alioth.debian.org
Usertags: randomness
X-Debbugs-Cc: reproducible-b...@lists.alioth.debian.org

Hi,

if a package contains python code with a variable named _m, then after
installing that package the pyc file resulting from that code is
unreproducible because of some randomness. Minimal reproducer:

    export SOURCE_DATE_EPOCH="$(date +%s)"
    for i in `seq 1 10`; do
        mmdebstrap --quiet --variant=apt --include=python3.10 \
            --customize-hook='echo _m > "$1"/tmp/decoder.py' \
            --customize-hook='chroot "$1" python3.10 -m py_compile /tmp/decoder.py' \
            --customize-hook='cat "$1"/tmp/__pycache__/decoder.cpython-310.pyc | md5sum' \
            unstable /dev/null 2>&1
    done | sort | uniq -c

The above will print something like:

    6 4662176a6024d5eec15033097cd7e588  -
    4 aeb00bedc784e7cca3eb42cf50e92f8d  -

If you run the loop more often, you can see that about 2/3 of the time the
pyc file has one hash and about 1/3 of the time the other. So there are two
distinct possible contents that the pyc file generated from the same python
script just containing "_m" can have. Below you can find a diff between the
hexdumps of these two possible pyc versions. I have no idea why this happens.

But why does it matter? Since #1004558 got fixed, a Priority:standard chroot
is now mostly bit-by-bit identical. Only "mostly" because there is one
remaining difference:

    /usr/lib/python3.10/json/__pycache__/decoder.cpython-310.pyc

But why does that pyc file differ (randomly) while all the others remain
stable? Even if it sounds ridiculous, I tracked it down to the use of the
variable _m in /usr/lib/python3.10/json/decoder.py.

Also, the problem only shows when compiling all pyc files in a fresh chroot.
Given the same chroot with all pyc files already generated, the pyc file
generated from the minimal test case (just a python script containing the
variable name "_m" as above) will remain stable.

So the following will *not* reproduce the problem:

    echo _m > test.py
    for i in `seq 1 100`; do
        rm -rf __pycache__
        python3.10 -m py_compile test.py
        md5sum __pycache__/test.cpython-310.pyc
    done

It needs to be done in a fresh chroot. Since the pyc contents also rely on
the modification time of the python scripts involved, maybe the reason for
this behaviour is unreproducible mtimes after unpacking the packages? This
is why I'm filing it here. This might as well be some sort of packaging
problem.

For the minimal test case (a python script just containing the variable name
"_m"), the pyc file is very tiny and the diffoscope output will display the
whole file via the diff context:

@@ -1,8 +1,8 @@
 00000000: 6f0d 0d0a 0300 0000 5371 fe33 17b6 dd59  o.......Sq.3...Y
 00000010: e300 0000 0000 0000 0000 0000 0000 0000  ................
 00000020: 0001 0000 0040 0000 0073 0800 0000 6500  .....@...s....e.
-00000030: 0100 6400 5300 2901 4e29 01da 025f 6da9  ..d.S.).N)..._m.
-00000040: 0072 0200 0000 7202 0000 00fa 0f2f 746d  .r....r....../tm
+00000030: 0100 6400 5300 2901 4e29 015a 025f 6da9  ..d.S.).N).Z._m.
+00000040: 0072 0100 0000 7201 0000 00fa 0f2f 746d  .r....r....../tm
 00000050: 702f 6465 636f 6465 722e 7079 da08 3c6d  p/decoder.py..<m
 00000060: 6f64 756c 653e 0100 0000 7302 0000 0008  odule>....s.....
 00000070: 00                                       .

I'm not familiar with the pyc format so I cannot tell what the bits that
differ mean, but maybe somebody who is can figure this out given the hexdump
difference from above.

But it's crazy that a simple choice of variable name triggers randomness in
the pyc files, right? So to further test this theory, I patched the
python3.10 source package like this:

--- a/Lib/json/decoder.py
+++ b/Lib/json/decoder.py
@@ -67,7 +67,7 @@ def _decode_uXXXX(s, pos):
     raise JSONDecodeError(msg, s, pos)
 
 def py_scanstring(s, end, strict=True,
-        _b=BACKSLASH, _m=STRINGCHUNK.match):
+        _b=BACKSLASH, m=STRINGCHUNK.match):
     """Scan the string s for a JSON string. End is the index of the
     character in s after the quote that started the JSON string.
     Unescapes all valid JSON string escape sequences and raises ValueError
@@ -80,7 +80,7 @@ def py_scanstring(s, end, strict=True,
     _append = chunks.append
     begin = end - 1
     while 1:
-        chunk = _m(s, end)
+        chunk = m(s, end)
         if chunk is None:
             raise JSONDecodeError("Unterminated string starting at", s, begin)
         end = chunk.end()

This solves the problem of random unreproducibility. All pyc files in a
priority:standard chroot are now reproducible even when running the
reproducer from the top of this mail 100 times. This is why I'm tagging this
bug with "patch". I know this is just a workaround but maybe it can be
applied until the underlying problem is identified?

With the above patch, a priority:standard chroot is now finally always
bit-by-bit reproducible. I know that I also claimed that this was the case
for the patch I submitted in #1004558, but since the pyc contents change
randomly, it is very possible that I just did two tests which happened to
produce identical output, called it a day, and thus never encountered the
randomly occurring difference of decoder.cpython-310.pyc. Due to the random
nature of the pyc file contents, it's completely possible to run the
reproducer 10 times and always get the same result, with only the 11th run
showing the difference.

But what is so special about variables named _m? Following a hunch I
searched the python codebase and found another use of the name _m in
Lib/types.py. Choosing _m there seemed arbitrary, so I tried what happens if
the method name is changed from _m to something else:

--- a/Lib/types.py
+++ b/Lib/types.py
@@ -37,8 +37,8 @@ _ag = _ag()
 AsyncGeneratorType = type(_ag)
 
 class _C:
-    def _m(self): pass
-MethodType = type(_C()._m)
+    def _abc(self): pass
+MethodType = type(_C()._abc)
 
 BuiltinFunctionType = type(len)
 BuiltinMethodType = type([].append)     # Same as BuiltinFunctionType

And this *also* fixes the reproducibility issue!

So now there exists a second workaround patch, and it seems that somehow
private names from Lib/types.py have an influence on the pyc files generated
from code that uses the same names in a completely different context?

So yes, this is a bug that probably needs to be properly fixed elsewhere,
but until then, please consider applying either of the above temporary
workarounds so that a priority:standard chroot can become reproducible again
for our next stable release.

Thanks!

cheers, josch
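
PS: a rough attempt at decoding the bytes that differ in the hexdump above,
in case it helps whoever picks this up. This is only a sketch under
assumptions: the header layout is taken from PEP 552, and the type codes
('Z' for a short interned ASCII string, ')' for a small tuple, 'r' for a
back-reference, 0x80 as the "add to reference table" flag) are how I read
Python/marshal.c. The names walk, variant_a and variant_b are made up for
this example, and the byte strings are copied by hand from the hexdump.

```python
import struct

FLAG_REF = 0x80  # assumption from Python/marshal.c: high bit of a type byte

# pyc header (PEP 552): magic, flags, then 8 bytes whose meaning depends on flags.
header = bytes.fromhex("6f0d0d0a" "03000000" "5371fe3317b6dd59")
magic, flags = struct.unpack_from("<4sI", header)
hash_based = bool(flags & 0x1)    # bit 0: hash-based pyc, so no mtime is stored
check_source = bool(flags & 0x2)  # bit 1: the source hash is checked on import

# Offsets 0x3b..0x4a of the two variants, copied from the hexdump.
variant_a = bytes.fromhex("da025f6da900" "7202000000" "7202000000")
variant_b = bytes.fromhex("5a025f6da900" "7201000000" "7201000000")


def walk(data):
    """Decode only the few marshal record types that occur in this excerpt."""
    pos, out = 0, []
    while pos < len(data):
        code = chr(data[pos] & 0x7F)
        has_ref = bool(data[pos] & FLAG_REF)
        pos += 1
        if code == "Z":    # short interned ASCII string: 1-byte length + bytes
            n = data[pos]
            out.append(("str", data[pos + 1:pos + 1 + n].decode("ascii"), has_ref))
            pos += 1 + n
        elif code == ")":  # small tuple: 1-byte item count (0 in this excerpt)
            out.append(("tuple", data[pos], has_ref))
            pos += 1
        elif code == "r":  # back-reference: 4-byte little-endian table index
            (index,) = struct.unpack_from("<i", data, pos)
            out.append(("ref", index, False))
            pos += 4
        else:
            raise ValueError(f"unexpected type byte {data[pos - 1]:#x}")
    return out


records_a = walk(variant_a)
records_b = walk(variant_b)
print("hash-based:", hash_based, "check-source:", check_source)
print(records_a)
print(records_b)
```

If this reading is right, two things follow. First, the header flags say the
file is a hash-based pyc (which is what py_compile defaults to when
SOURCE_DATE_EPOCH is set), so the 8 bytes after the flags are a source hash
rather than mtime and size, and mtimes may not be involved after all.
Second, the only real difference is whether the string "_m" carries the
reference flag; without it, the empty tuple moves from table index 2 to
index 1, which accounts for the two changed back-references. As far as I
understand marshal, that flag is only set when the object has more than one
reference at dump time, which would fit an interned "_m" being shared with
Lib/types.py, but I have not verified this.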