[issue46384] Request: make lzma._(encode|decode)_filter_properties public
New submission from Hiroshi Miura:

py7zr, a third-party project that uses the lzma module to compress and decompress 7-zip archives, relies on lzma._(encode|decode)_filter_properties. These functions were public at first but became private in Python 3.4, in commit a425c3d5a264c556d31bdd88097c79246b533ea3.

The commit message gives this reason:

> These functions were originally added to support LZMA compression in the
> zipfile module, and are not of interest for the majority of users.

This is a request to make these functions public.

ref: py7zr: https://github.com/miurahr/py7zr

--
components: Library (Lib)
messages: 410615
nosy: miurahr
priority: normal
severity: normal
status: open
title: Request: make lzma._(encode|decode)_filter_properties public
type: enhancement
versions: Python 3.11

___
Python tracker <https://bugs.python.org/issue46384>
___
___
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
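For readers unfamiliar with the two functions, the following is a minimal sketch of how a caller such as py7zr might use them. The filter parameters are illustrative assumptions, not values taken from py7zr, and since both functions are underscore-prefixed their behavior is not a documented guarantee.

```python
import lzma

# An LZMA1 filter specification in the form the lzma module accepts.
# The concrete values here are illustrative only.
lzma1_filter = {
    "id": lzma.FILTER_LZMA1,
    "dict_size": 1 << 16,  # 64 KiB dictionary
    "lc": 3,
    "lp": 0,
    "pb": 2,
}

# Private API: serialize the filter options into the compact byte form
# that container formats store next to the compressed payload.
props = lzma._encode_filter_properties(lzma1_filter)

# Private API: turn the stored bytes back into a filter specification
# that can be passed in a `filters=[...]` list.
decoded = lzma._decode_filter_properties(lzma.FILTER_LZMA1, props)
```

A container reader would typically call only the decode half, using property bytes read from the archive header.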
[issue41210] LZMADecompressor.decompress(FORMAT_RAW) truncate output when input is paticular LZMA+BCJ data
New submission from Hiroshi Miura:

When decompressing a particular archive, the result is truncated: the last word (4 bytes) is missing. The attached test data has an uncompressed size of 12800 bytes and was compressed with the LZMA1+BCJ algorithm into 11327 bytes. The data is the payload of a 7zip archive.

Here is pytest code to reproduce it.

:: code-block::

    import lzma

    # testdata_path is a pathlib.Path pointing at the directory that
    # holds the attached lzmabcj.bin.
    def test_lzma_raw_decompressor_lzmabcj():
        filters = []
        filters.append({'id': lzma.FILTER_X86})
        filters.append(lzma._decode_filter_properties(lzma.FILTER_LZMA1, b']\x00\x00\x01\x00'))
        decompressor = lzma.LZMADecompressor(format=lzma.FORMAT_RAW, filters=filters)
        with testdata_path.joinpath('lzmabcj.bin').open('rb') as infile:
            out = decompressor.decompress(infile.read(11327))
        assert len(out) == 12800

The test fails because len(out) is 12796 bytes: the last 4 bytes, which should be b'\x00\x00\x00\x00', are missing.

When I specify a single LZMA1 filter for decompression, I get data of the expected length, 12800 bytes. (*1)
When I create test data with LZMA2+BCJ and decompress it, I get the expected data.
When I specify a single LZMA2 filter against the LZMA2+BCJ payload, the result is exactly the same as the (*1) data.
This suggests that the LZMA1/LZMA2 --> BCJ pipeline is the suspect.

After investigating and understanding that _lzmamodule.c is a thin wrapper around liblzma, I found that the problem can be reproduced in liblzma itself.
I've reported it to the upstream xz-devel ML with test code:
https://www.mail-archive.com/xz-devel@tukaani.org/msg00370.html

--
components: Extension Modules
files: lzmabcj.bin
messages: 373008
nosy: miurahr
priority: normal
severity: normal
status: open
title: LZMADecompressor.decompress(FORMAT_RAW) truncate output when input is paticular LZMA+BCJ data
versions: Python 3.6, Python 3.7, Python 3.8, Python 3.9
Added file: https://bugs.python.org/file49296/lzmabcj.bin

___
Python tracker <https://bugs.python.org/issue41210>
___
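For comparison, here is a minimal self-contained sketch of the LZMA2+BCJ case that the report describes as working. It does not use the attached lzmabcj.bin; the input is stand-in data and the preset is an assumption chosen only for illustration.

```python
import lzma

# Filter chain: the x86 BCJ filter first, the LZMA2 compression filter last.
filters = [
    {"id": lzma.FILTER_X86},
    {"id": lzma.FILTER_LZMA2, "preset": 6},
]

# 12800 bytes of stand-in data (not the attached test file).
original = bytes(range(256)) * 50

# FORMAT_RAW writes no container header, so the same filter chain must be
# supplied again on the decompression side.
compressed = lzma.compress(original, format=lzma.FORMAT_RAW, filters=filters)

decompressor = lzma.LZMADecompressor(format=lzma.FORMAT_RAW, filters=filters)
restored = decompressor.decompress(compressed)
```

Because the stream here is produced by liblzma itself, it carries an end marker, which is the difference from the 7zip payload in the failing test.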
[issue41210] LZMADecompressor.decompress(FORMAT_RAW) truncate output when input is paticular LZMA+BCJ data
Hiroshi Miura added the comment:

>Compression filters:
>FILTER_LZMA1 (for use with FORMAT_ALONE)
>FILTER_LZMA2 (for use with FORMAT_XZ and FORMAT_RAW)

I looked into the past discussion in BPO-6715, where the lzma module was proposed.
https://bugs.python.org/issue6715

The only comment there about FORMAT_ALONE and LZMA1 is this one:
https://bugs.python.org/issue6715#msg92174

> .lzma is actually not a format. It is just the raw output of the LZMA1
> coder. XZ instead is a container format for the LZMA2 coder, which probably means LZMA+some metadata.

It says that FORMAT_ALONE decodes .lzma archives, which use LZMA1 as the coder, and FORMAT_XZ decodes .xz archives, which use LZMA2 as the coder. There is no discussion of FORMAT_RAW.

This indicates that the relation runs in the opposite direction from what the documentation states: the format determines the filter.
FORMAT_ALONE should be used with LZMA1.
FORMAT_XZ should be used with LZMA2.
FORMAT_RAW actually has no restriction to LZMA1 or LZMA2.

Here is another discussion about lzma_raw_encoder and LZMA1, in which Lasse, a xz/liblzma maintainer, suggests that lzma_raw_encoder is usable for LZMA1.
https://sourceforge.net/p/lzmautils/discussion/708858/thread/cd04b6ace0/#6050

I think we need to fix the documentation.
[issue41210] LZMADecompressor.decompress(FORMAT_RAW) truncate output when input is paticular LZMA+BCJ data
Hiroshi Miura added the comment:

I think FORMAT_RAW is only tested with LZMA2 in Lib/test/test_lzma.py. Since there is no test for LZMA1, the documentation states that FORMAT_RAW is for LZMA2.

I'd like to add tests for LZMA1 and change the wording in the documentation.

--
keywords: +patch
Added file: https://bugs.python.org/file49300/0001-lzma-support-LZMA1-with-FORMAT_RAW.patch
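As an illustration of the kind of test meant here, the following is a hedged sketch of an LZMA1 round trip through FORMAT_RAW. It is not the content of the attached patch; the test name, data and preset are assumptions.

```python
import lzma


def test_raw_roundtrip_lzma1():
    # A filter chain consisting of LZMA1 only, used with FORMAT_RAW.
    filters = [{"id": lzma.FILTER_LZMA1, "preset": 6}]
    data = b"FORMAT_RAW with an LZMA1 filter chain " * 100

    compressed = lzma.compress(data, format=lzma.FORMAT_RAW, filters=filters)
    restored = lzma.decompress(compressed, format=lzma.FORMAT_RAW, filters=filters)

    assert restored == data
```

Note that this exercises streams written by liblzma, which include an end marker; it does not cover end-markerless LZMA1 as found in .7z files.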
[issue41210] LZMADecompressor.decompress(FORMAT_RAW) truncate output when input is paticular LZMA+BCJ data
Change by Hiroshi Miura:

Added file: https://bugs.python.org/file49301/0001-lzma-support-LZMA1-with-FORMAT_RAW.patch
[issue41210] LZMADecompressor.decompress(FORMAT_RAW) truncate output when input is paticular LZMA+BCJ data
Hiroshi Miura added the comment:

Thank you for the information about a similar problem.

This problem was observed and reported on the 7-zip library project:
https://github.com/miurahr/py7zr/issues/178

py7zr depends heavily on the lzma FORMAT_RAW interface. Fortunately, the 7-zip container format has a size database, so the library can tell whether the output is complete or not.

In the reported case, the library (the caller) ends up in a state where all input data has been sent to the decompressor, but the decompressor produces no further output.

I'd like to wait for the upstream reaction.
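The size check mentioned above can be sketched as follows. This is not py7zr's code: the helper name `decompress_checked` is hypothetical, and the expected size is assumed to come from the container's metadata.

```python
import lzma


def decompress_checked(decompressor, payload, expected_size):
    """Decompress payload and verify the output length.

    expected_size is assumed to come from the container's size database.
    """
    out = decompressor.decompress(payload)
    if len(out) != expected_size:
        raise ValueError(
            f"expected {expected_size} bytes, got {len(out)} bytes"
        )
    return out


# Stand-in payload produced with a raw LZMA2 filter chain.
filters = [{"id": lzma.FILTER_LZMA2, "preset": 6}]
data = b"payload with a known size " * 64
payload = lzma.compress(data, format=lzma.FORMAT_RAW, filters=filters)

decompressor = lzma.LZMADecompressor(format=lzma.FORMAT_RAW, filters=filters)
checked = decompress_checked(decompressor, payload, len(data))
```

With a check like this, a truncated result surfaces as an error instead of silently short output.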
[issue41210] LZMADecompressor.decompress(FORMAT_RAW) truncate output when input is paticular LZMA+BCJ data
Hiroshi Miura added the comment:

Here is a BCJ-only CFFI test project.
https://github.com/miurahr/bcj-cffi

It imports two bcj_x86 C sources: one from liblzma (src/xz_bcj_x86.c), which is what Python's lzma module binds to, and the other from the xz-embedded project for the Linux kernel (src/xz_simple_bcj.c).

We can observe that

1. it has an interface which overwrites the buffer
2. it returns a correct buffer (digest assertion) in both cases
3. it returns a size 4 bytes smaller than expected.

For 3, this is because the return value of BCJ is defined like this:

```
size -= 4;
for (i = 0; i < size; ++i) {...}
return i;
```

and the variable i is sometimes incremented by an extra 4 bytes when a target sequence is found and processed.

It may be natural that the size value returned from the BCJ filter is often 4 bytes smaller than the actual size.
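The loop shape above can be modeled in a few lines of Python. This is a deliberately simplified sketch and not the liblzma implementation: it omits the address conversion and the checks the real x86 filter applies before treating a byte as a call/jump opcode, and the function name is hypothetical.

```python
def bcj_x86_processed_size(buf: bytes) -> int:
    """Return how many bytes a simplified x86 BCJ pass reports as processed.

    Simplified model: every 0xE8/0xE9 byte is treated as an opcode followed
    by a 4-byte operand. The real filter is more selective and also rewrites
    the operand; neither is modeled here.
    """
    if len(buf) < 5:
        return 0
    limit = len(buf) - 4  # corresponds to `size -= 4`
    i = 0
    while i < limit:
        if buf[i] in (0xE8, 0xE9):
            i += 4  # skip the operand once a target sequence is handled
        i += 1      # the `++i` of the for loop
    return i


# A buffer of 12800 zero bytes contains no opcode, so the loop stops
# 4 bytes short of the end.
processed = bcj_x86_processed_size(bytes(12800))
```

For an input without any matching opcode the reported size is exactly 4 bytes less than the buffer length, which matches the 12796 of 12800 bytes seen in the original report.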
[issue41210] Docs: More description of reason about LZMA1 data handling with FORMAT_ALONE
Hiroshi Miura added the comment:

Lasse Collin gave me an explanation of the LZMA1 data format and a suggestion on how to implement this.

I'd like to change this issue into a documentation issue, to add more description of the limitations of FORMAT_ALONE and LZMA1.

The suggestion from Lasse is as follows:

> liblzma cannot be used to decode data from .7z files except in certain
> cases. This isn't a bug, it's a missing feature.
>
> The raw encoder and decoder APIs only support streams that contain an
> end of payload marker (EOPM) alias end of stream (EOS) marker. .7z
> files use LZMA1 without such an end marker. Instead, the end is handled
> by the decoder knowing the exact uncompressed size of the data.
>
> The API of liblzma supports LZMA1 without end marker via
> lzma_alone_decoder(). That API can be abused to properly decode raw
> LZMA1 with known uncompressed size by feeding the decoder a fake 13-byte
> header. Everything else in the public API requires some end marker.
>
> Decoding LZMA1 without BCJ or other extra filters from .7z with
> lzma_raw_decoder() kind of works but you will notice that it will never
> return LZMA_STREAM_END, only LZMA_OK. This is because it will never see
> an end marker. A minor downside is that it won't then do a small
> integrity check at the end either (one variable in the range decoder
> state will be 0 at the end of any valid LZMA1 stream);
> lzma_alone_decoder() does this check even when end marker is missing.
>
> If you use lzma_raw_decoder() for end-markerless LZMA1, make sure that
> you never give it more output space than the real uncompressed size. In
> rare cases this could result in extra output or an error since the
> decoder would try to decode more output using the input it has gotten
> so far. Overall I think the hack with lzma_alone_decoder() is a better
> way with the current API.
>
> BCJ filters process the input data in chunks of a few bytes long, thus
> they need to hold a few bytes of look-ahead buffer. With some filters
> like ARM the look-ahead is aligned and if the uncompressed size is a
> multiple of this alignment, lzma_raw_decoder() will give you all the
> data even when the LZMA1 layer doesn't have an end marker. The x86
> filter has one-byte alignment but needs to see five bytes from the
> future before producing output. When LZMA1 layer doesn't return
> LZMA_STREAM_END, the x86 filter doesn't know that the end was reached
> and cannot flush the last bytes out.
>
> Using liblzma to decode .7z works in these cases:
>
>   - LZMA1 using a fake 13-byte header with lzma_alone_decoder():
>
>       1 byte   LZMA properties byte that encodes lc/lp/pb
>       4 bytes  dictionary size as little endian uint32_t
>       8 bytes  uncompressed size as little endian uint64_t;
>                UINT64_MAX means unknown and then (and only then)
>                EOPM must be present

--
title: LZMADecompressor.decompress(FORMAT_RAW) truncate output when input is paticular LZMA+BCJ data -> Docs: More description of reason about LZMA1 data handling with FORMAT_ALONE
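The 13-byte header layout described in the quote can be sketched from Python as follows, using FORMAT_ALONE as the counterpart of lzma_alone_decoder(). The helper names are hypothetical, and the sketch assumes the caller already holds the 5 property bytes (properties byte plus dictionary size) in the form stored by the container. The demonstration only exercises the unknown-size case on a stream written by liblzma; decoding real end-markerless .7z payloads with a known size is the intended use but is not verified here.

```python
import lzma
import struct

UNKNOWN_SIZE = 0xFFFFFFFFFFFFFFFF  # UINT64_MAX: size unknown, EOPM required


def make_alone_header(props: bytes, uncompressed_size=None) -> bytes:
    """Build the fake 13-byte .lzma header.

    props: 1 properties byte (lc/lp/pb) + 4 bytes dictionary size (LE).
    uncompressed_size: exact size if known, or None for "unknown".
    """
    if len(props) != 5:
        raise ValueError("expected 5 bytes of LZMA1 properties")
    size = UNKNOWN_SIZE if uncompressed_size is None else uncompressed_size
    return props + struct.pack("<Q", size)  # 8-byte little-endian size


def decode_lzma1(props: bytes, payload: bytes, uncompressed_size=None) -> bytes:
    """Decode raw LZMA1 by prepending the fake header (hypothetical helper)."""
    decompressor = lzma.LZMADecompressor(format=lzma.FORMAT_ALONE)
    header = make_alone_header(props, uncompressed_size)
    return decompressor.decompress(header + payload)


# Demonstration on a stream written by liblzma: split off its own header,
# then rebuild an equivalent one.
data = b"LZMA1 payload for the header sketch " * 200
stream = lzma.compress(data, format=lzma.FORMAT_ALONE)
props, payload = stream[:5], stream[13:]
restored = decode_lzma1(props, payload)
```

For a .7z payload, the caller would pass the uncompressed size taken from the archive's size database instead of leaving it unknown.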
[issue41210] Docs: More description of reason about LZMA1 data handling with FORMAT_ALONE
Hiroshi Miura added the comment:

Here is a draft of the additional text:

Usage of :const:`FILTER_LZMA1` with :const:`FORMAT_RAW` is not recommended, because it may produce wrong output under a certain condition: decompressing a combination of :const:`FILTER_LZMA1` and BCJ filters in :const:`FORMAT_RAW`. This is because the LZMA1 format sometimes lacks an End of Stream (EOS) marker, so the BCJ filters cannot be flushed.

I've tried to write it without describing the liblzma implementation, relying only on the nature of the API and the file format specification.
[issue41210] Docs: More description(warning) about LZMA1 + BCJ with FORMAT_RAW
Change by Hiroshi Miura:

--
title: Docs: More description of reason about LZMA1 data handling with FORMAT_ALONE -> Docs: More description(warning) about LZMA1 + BCJ with FORMAT_RAW
[issue42057] pytest case which catch exceptions become segfault
New submission from Hiroshi Miura:

I've observed that pytest crashes with a segmentation fault on Python 3.9.0 with the attached test case.

I've tested the case with several Python versions:

python 3.9.0a2 - good
python 3.9.0a3, 3.9.0-final - bad
python 3.10.0a1 - good

- OS: Mint Linux 20, Linux kernel 5.8.14
- Attachments:
  * test_main.py - pytest test case to reproduce
  * py_stacktrace.txt - pytest result which ends in a segmentation fault
  * gdb_backtrace.txt - gdb backtrace

So I've bisected, and the result is as follows:

9af0e47b1705457bb6b327c197f2ec5737a1d8f6 is the first bad commit
commit 9af0e47b1705457bb6b327c197f2ec5737a1d8f6
Author: Mark Shannon
Date: Tue Jan 14 10:12:45 2020 +

    bpo-39156: Break up COMPARE_OP into four logically distinct opcodes. (GH-17754)

    Break up COMPARE_OP into four logically distinct opcodes:
    * COMPARE_OP for rich comparisons
    * IS_OP for 'is' and 'is not' tests
    * CONTAINS_OP for 'in' and 'is not' tests
    * JUMP_IF_NOT_EXC_MATCH for checking exceptions in 'try-except' statements.

 Doc/library/dis.rst                              |   21 +
 Include/opcode.h                                 |    8 +-
 Lib/importlib/_bootstrap_external.py             |    3 +-
 Lib/opcode.py                                    |    7 +-
 Lib/test/test_dis.py                             |  141 +-
 Lib/test/test_peepholer.py                       |   12 +-
 Lib/test/test_positional_only_arg.py             |    6 +-
 .../2019-12-30-10-53-59.bpo-39156.veT-CB.rst     |    9 +
 PC/launcher.c                                    |    3 +-
 Python/ceval.c                                   |  137 +-
 Python/compile.c                                 |   71 +-
 Python/importlib.h                               | 2922 +++--
 Python/importlib_external.h                      | 4560 ++--
 Python/importlib_zipimport.h                     | 1831
 Python/opcode_targets.h                          |    6 +-
 Python/peephole.c                                |    6 +-
 Tools/scripts/generate_opcode_h.py               |    5 -
 17 files changed, 4901 insertions(+), 4847 deletions(-)
 create mode 100644 Misc/NEWS.d/next/Core and Builtins/2019-12-30-10-53-59.bpo-39156.veT-CB.rst

--
components: Interpreter Core
files: test_main.py
messages: 378796
nosy: miurahr
priority: normal
severity: normal
status: open
title: pytest case which catch exceptions become segfault
type: crash
versions: Python 3.9
Added file: https://bugs.python.org/file49521/test_main.py

___
Python tracker <https://bugs.python.org/issue42057>
___
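To see the opcode split named in the bisected commit, the bytecode of a few small functions can be inspected with the dis module. This is only an illustrative sketch: the function names are hypothetical, it assumes Python 3.9 or later, and opcode names are an implementation detail that differs between versions (the opcode used for matching exceptions in particular has changed since 3.9, so it is not shown).

```python
import dis


def opnames(func):
    """Return the set of opcode names used by func's bytecode."""
    return {instruction.opname for instruction in dis.get_instructions(func)}


def identity_test(a, b):
    return a is b      # compiled to IS_OP since the split


def membership_test(a, b):
    return a in b      # compiled to CONTAINS_OP since the split


def rich_comparison(a, b):
    return a < b       # still compiled to COMPARE_OP


identity_ops = opnames(identity_test)
membership_ops = opnames(membership_test)
comparison_ops = opnames(rich_comparison)
```

Before the commit, all three functions would have used COMPARE_OP with different arguments.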
[issue42057] pytest case which catch exceptions become segfault
Change by Hiroshi Miura:

Added file: https://bugs.python.org/file49522/py_stacktrace.txt
[issue42057] pytest case which catch exceptions become segfault
Change by Hiroshi Miura:

Added file: https://bugs.python.org/file49523/gdb_backtrace.txt
[issue42057] pytest case which catch exceptions become segfault
Hiroshi Miura added the comment:

Here is the result of running pytest on Python 3.9.0 under gdb.

test session starts
platform linux -- Python 3.9.0, pytest-4.6.9, py-1.8.1, pluggy-0.13.0
rootdir: /home/miurahr/Projects/cpython
collected 2 items

test_main.py .
Program received signal SIGSEGV, Segmentation fault.
PyException_GetContext (self=self@entry=) at ../Objects/exceptions.c:351
warning: Source file is more recent than executable.
351 Py_XINCREF(context);
[issue42057] pytest case which catch exceptions become segfault
Hiroshi Miura added the comment:

FYI: The following commit fixes the issue in the 3.10 development branch.

6e8128f02e ("bpo-41323: Perform 'peephole' optimizations directly on the CFG. (GH-21517)", 2020-07-30)

--
nosy: +Mark.Shannon
[issue42057] pytest case which catch exceptions become segfault
Hiroshi Miura added the comment:

The test code does not always reproduce the issue. Please try it several times.

It seems to happen when multiple threads execute the same function, which raises an exception, and both callers try to catch the exception at the same time.
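The situation described can be sketched as follows. The attached test_main.py is not reproduced in this thread, so this is a hypothetical shape based only on the description above: all names are assumptions, and on unaffected Python versions the code simply runs to completion.

```python
import threading


def always_fails():
    # The shared function that every thread calls and that always raises.
    raise ValueError("expected failure")


def worker(results, index):
    # Each caller catches the exception raised by the shared function.
    try:
        always_fails()
    except ValueError as exc:
        results[index] = str(exc)


def run_once(thread_count=8):
    """Start several threads that raise and catch at roughly the same time."""
    results = [None] * thread_count
    threads = [
        threading.Thread(target=worker, args=(results, i))
        for i in range(thread_count)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


results = run_once()
```

Since the crash is reported as intermittent, a reproduction attempt would call `run_once()` repeatedly on an affected interpreter.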