[issue46384] Request: make lzma._(encode|decode)_filter_properties public

2022-01-14 Thread Hiroshi Miura


New submission from Hiroshi Miura :

py7zr, a third-party project that uses the lzma module to compress and
decompress 7-zip archives, relies on lzma._(encode|decode)_filter_properties.

These functions were public at first but became private in py3.4, in commit
a425c3d5a264c556d31bdd88097c79246b533ea3

The reason is given in the commit message:
> These functions were originally added to support LZMA compression in the 
> zipfile module, and are not of interest for the majority of users.

This is a request to make these functions public again.

ref: py7zr: https://github.com/miurahr/py7zr
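
For context, here is a minimal sketch of how the two helpers can round-trip the 5-byte LZMA1 property blob that a 7-zip coder record stores. It uses the private names as they exist in current CPython. The property bytes are only an example value, and the variable names are illustrative.

```python
import lzma

# Example 5-byte LZMA1 property blob:
# 1 byte encoding lc/lp/pb, then the dictionary size as little-endian uint32.
props = b']\x00\x00\x01\x00'

# Private helper: property bytes -> filter spec dict usable in a filter chain.
spec = lzma._decode_filter_properties(lzma.FILTER_LZMA1, props)
print(spec['lc'], spec['lp'], spec['pb'], spec['dict_size'])

# Private helper: filter spec dict -> property bytes.
encoded = lzma._encode_filter_properties(spec)
print(encoded == props)
```

A library such as py7zr needs both directions: decoding when reading an archive header, encoding when writing one.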

--
components: Library (Lib)
messages: 410615
nosy: miurahr
priority: normal
severity: normal
status: open
title: Request: make  lzma._(encode|decode)_filter_properties public
type: enhancement
versions: Python 3.11

___
Python tracker 
<https://bugs.python.org/issue46384>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue41210] LZMADecompressor.decompress(FORMAT_RAW) truncates output when input is particular LZMA+BCJ data

2020-07-04 Thread Hiroshi Miura


New submission from Hiroshi Miura :

When decompressing a particular archive, the result is truncated by its last
word (4 bytes).
The attached test data is 12800 bytes uncompressed and was compressed with the
LZMA1+BCJ algorithm into 11327 bytes.
The data is the payload of a 7-zip archive.

Here is a pytest case that reproduces it.


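:: code-block::

    import lzma
    import pathlib

    # Assumption: the attached lzmabcj.bin is stored in a "data" directory
    # next to this test file.
    testdata_path = pathlib.Path(__file__).parent.joinpath('data')

    def test_lzma_raw_decompressor_lzmabcj():
        # Filter chain of the 7-zip payload: x86 BCJ followed by LZMA1.
        filters = []
        filters.append({'id': lzma.FILTER_X86})
        filters.append(lzma._decode_filter_properties(lzma.FILTER_LZMA1,
                                                      b']\x00\x00\x01\x00'))
        decompressor = lzma.LZMADecompressor(format=lzma.FORMAT_RAW,
                                             filters=filters)
        with testdata_path.joinpath('lzmabcj.bin').open('rb') as infile:
            out = decompressor.decompress(infile.read(11327))
        assert len(out) == 12800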


The test fails because len(out) is 12796 bytes: the last 4 bytes, which
should be b'\x00\x00\x00\x00', are missing.
When I specify a single LZMA1 filter instead, I get the expected length of
data, 12800 bytes. (*1)

When I create test data with LZMA2+BCJ and decompress it, I get the expected
data.
When I specify a single LZMA2 filter for the LZMA2+BCJ payload, the result is
exactly the same as the (*1) data.
This suggests that the LZMA1/LZMA2 --> BCJ pipeline is suspect.


After investigating, and given that _lzmamodule.c is a thin wrapper around
liblzma, I found that the problem can be reproduced with liblzma alone.
I've reported it upstream to the xz-devel mailing list, with test code:
https://www.mail-archive.com/xz-devel@tukaani.org/msg00370.html

--
components: Extension Modules
files: lzmabcj.bin
messages: 373008
nosy: miurahr
priority: normal
severity: normal
status: open
title: LZMADecompressor.decompress(FORMAT_RAW) truncates output when input is
particular LZMA+BCJ data
versions: Python 3.6, Python 3.7, Python 3.8, Python 3.9
Added file: https://bugs.python.org/file49296/lzmabcj.bin

___
Python tracker 
<https://bugs.python.org/issue41210>
___



[issue41210] LZMADecompressor.decompress(FORMAT_RAW) truncates output when input is particular LZMA+BCJ data

2020-07-06 Thread Hiroshi Miura


Hiroshi Miura  added the comment:

>Compression filters:
>FILTER_LZMA1 (for use with FORMAT_ALONE)
>FILTER_LZMA2 (for use with FORMAT_XZ and FORMAT_RAW)

I looked into the past discussion in BPO-6715, where the lzma module was
proposed.
https://bugs.python.org/issue6715

The only comment there about FORMAT_ALONE and LZMA1 is this one:
https://bugs.python.org/issue6715#msg92174

> .lzma is actually not a format. It is just the raw output of the LZMA1
> coder. XZ instead is a container format for the LZMA2 coder, which
> probably means LZMA+some metadata.

It says that FORMAT_ALONE decodes .lzma archives, which use LZMA1 as the
coder, and that FORMAT_XZ decodes .xz archives, which use LZMA2 as the coder.
There is no discussion of FORMAT_RAW.

This indicates that the relation runs the other way around from what the
documentation says: the format determines the filter.
FORMAT_ALONE should be used with LZMA1.
FORMAT_XZ should be used with LZMA2.

FORMAT_RAW actually has no restriction to either LZMA1 or LZMA2.

Here is another discussion, about lzma_raw_encoder and LZMA1.
Lasse, the xz/liblzma maintainer, suggests that lzma_raw_encoder is usable
with LZMA1.
https://sourceforge.net/p/lzmautils/discussion/708858/thread/cd04b6ace0/#6050


I think we need to fix the documentation.
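
To illustrate the point about FORMAT_RAW, here is a minimal round-trip sketch with FILTER_LZMA1 as the only filter. The payload and variable names are made up for the example. It only shows that the module accepts this combination for data it compressed itself.

```python
import lzma

data = b'example payload ' * 256

# A raw stream carries no header, so the same filter chain must be supplied
# explicitly to both the compressor and the decompressor.
filters = [{'id': lzma.FILTER_LZMA1}]

compressed = lzma.compress(data, format=lzma.FORMAT_RAW, filters=filters)

decompressor = lzma.LZMADecompressor(format=lzma.FORMAT_RAW, filters=filters)
restored = decompressor.decompress(compressed)

print(restored == data)
```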

--




[issue41210] LZMADecompressor.decompress(FORMAT_RAW) truncates output when input is particular LZMA+BCJ data

2020-07-06 Thread Hiroshi Miura


Hiroshi Miura  added the comment:

I think FORMAT_RAW is only tested with LZMA2 in Lib/test/test_lzma.py. Since
there is no test for LZMA1, the documentation says FORMAT_RAW is for LZMA2.

I'd like to add tests for LZMA1 and change the wording of the documentation.

--
keywords: +patch
Added file: 
https://bugs.python.org/file49300/0001-lzma-support-LZMA1-with-FORMAT_RAW.patch




[issue41210] LZMADecompressor.decompress(FORMAT_RAW) truncates output when input is particular LZMA+BCJ data

2020-07-07 Thread Hiroshi Miura


Change by Hiroshi Miura :


Added file: 
https://bugs.python.org/file49301/0001-lzma-support-LZMA1-with-FORMAT_RAW.patch




[issue41210] LZMADecompressor.decompress(FORMAT_RAW) truncates output when input is particular LZMA+BCJ data

2020-07-07 Thread Hiroshi Miura


Hiroshi Miura  added the comment:

Thank you for the information about the similar problem.

This problem was observed and reported in the py7zr 7-zip library project:
https://github.com/miurahr/py7zr/issues/178
py7zr depends heavily on the lzma FORMAT_RAW interface.

Fortunately, the 7-zip container format has a size database, so the library
can tell whether the output is complete or not.

In the reported case, the library/caller ends up in a state where all input
data has been sent into the decompressor, but the decompressor does not
output anything.
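
A caller can use the size recorded in the container to detect this situation. Here is a minimal sketch. The helper name `decompress_raw_checked` and the `expected_size` parameter are hypothetical, and the test payload is generated with `lzma.compress`, so it is not the problematic 7-zip data.

```python
import lzma


def decompress_raw_checked(payload, filters, expected_size):
    """Decompress a raw stream and verify the output length.

    expected_size is assumed to come from the container's metadata.
    """
    decompressor = lzma.LZMADecompressor(format=lzma.FORMAT_RAW,
                                         filters=filters)
    out = decompressor.decompress(payload)
    if len(out) != expected_size:
        raise ValueError(
            f'expected {expected_size} bytes, got {len(out)} '
            f'(end marker seen: {decompressor.eof})')
    return out


# Usage with a BCJ + LZMA1 chain on data compressed by this module.
filters = [{'id': lzma.FILTER_X86}, {'id': lzma.FILTER_LZMA1}]
data = bytes(range(256)) * 50
payload = lzma.compress(data, format=lzma.FORMAT_RAW, filters=filters)
out = decompress_raw_checked(payload, filters, len(data))
print(out == data)
```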

I'd like to wait for a reaction from upstream.

--




[issue41210] LZMADecompressor.decompress(FORMAT_RAW) truncates output when input is particular LZMA+BCJ data

2020-07-11 Thread Hiroshi Miura


Hiroshi Miura  added the comment:

Here is a BCJ-only CFFI test project:
https://github.com/miurahr/bcj-cffi

It imports two bcj_x86 C sources: one from liblzma (src/xz_bcj_x86.c), which
is what Python's lzma module binds to, and the other from the XZ Embedded
project for the Linux kernel (src/xz_simple_bcj.c).

We can observe that:

1. it has an interface that overwrites the buffer in place
2. it returns a correct buffer (checked by a digest assertion) in both cases
3. it returns a size 4 bytes smaller than expected.

As for 3, this is because the return value of the BCJ filter is defined like
this:

```
size -= 4;
for (i = 0; i < size; ++i) {...}
return i;
```
and the variable i is sometimes advanced by 4 extra bytes when a target
sequence is found and processed.

So it may be natural that the size returned from the BCJ filter is often 4
bytes smaller than the actual size.
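
A toy model of that loop, as a sketch only: it copies the loop bounds of the snippet above and skips the address conversion entirely, so it is not the liblzma code. The function name is made up.

```python
def scan_like_x86_bcj(buffer: bytes) -> int:
    """Return how many bytes a simplified x86-BCJ-style scan reports as done."""
    limit = len(buffer) - 4                  # size -= 4;
    i = 0
    while i < limit:                         # for (i = 0; i < size; ++i)
        if (buffer[i] & 0xFE) == 0xE8:       # CALL (0xE8) or JMP (0xE9) opcode
            i += 4                           # skip the 4-byte operand
        i += 1
    return i                                 # return i;


# 12800 zero bytes contain no CALL/JMP opcode, so the scan stops at size - 4.
print(scan_like_x86_bcj(bytes(12800)))  # 12796
```

With no opcode near the end of the buffer, the reported count is exactly 4 bytes short, which matches the 12796 of 12800 bytes seen in the original report.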

--




[issue41210] Docs: More description of reason about LZMA1 data handling with FORMAT_ALONE

2020-07-12 Thread Hiroshi Miura


Hiroshi Miura  added the comment:

Lasse Collin gave me an explanation of the LZMA1 data format and a suggestion
on how to implement this.

I'd like to turn this issue into a documentation issue, to add more
description of the limitations of FORMAT_ALONE and LZMA1.

Lasse's suggestion is as follows:

> liblzma cannot be used to decode data from .7z files except in certain
> cases. This isn't a bug, it's a missing feature.
>
> The raw encoder and decoder APIs only support streams that contain an
> end of payload marker (EOPM) alias end of stream (EOS) marker. .7z
> files use LZMA1 without such an end marker. Instead, the end is handled
> by the decoder knowing the exact uncompressed size of the data.
>
> The API of liblzma supports LZMA1 without end marker via
> lzma_alone_decoder(). That API can be abused to properly decode raw
> LZMA1 with known uncompressed size by feeding the decoder a fake 13-byte
> header. Everything else in the public API requires some end marker.
>
> Decoding LZMA1 without BCJ or other extra filters from .7z with
> lzma_raw_decoder() kind of works but you will notice that it will never
> return LZMA_STREAM_END, only LZMA_OK. This is because it will never see
> an end marker. A minor downside is that it won't then do a small
> integrity check at the end either (one variable in the range decoder
> state will be 0 at the end of any valid LZMA1 stream);
> lzma_alone_decoder() does this check even when end marker is missing.
>
> If you use lzma_raw_decoder() for end-markerless LZMA1, make sure that
> you never give it more output space than the real uncompressed size. In
> rare cases this could result in extra output or an error since the
> decoder would try to decode more output using the input it has gotten
> so far. Overall I think the hack with lzma_alone_decoder() is a better
> way with the current API.
>
> BCJ filters process the input data in chunks of a few bytes long, thus
> they need to hold a few bytes of look-ahead buffer. With some filters
> like ARM the look-ahead is aligned and if the uncompressed size is a
> multiple of this alignment, lzma_raw_decoder() will give you all the
> data even when the LZMA1 layer doesn't have an end marker. The x86
> filter has one-byte alignment but needs to see five bytes from the
> future before producing output. When LZMA1 layer doesn't return
> LZMA_STREAM_END, the x86 filter doesn't know that the end was reached
> and cannot flush the last bytes out.
>
> Using liblzma to decode .7z works in these cases:
>
> - LZMA1 using a fake 13-byte header with lzma_alone_decoder():
>
> 1 byte LZMA properties byte that encodes lc/lp/pb
> 4 bytes dictionary size as little endian uint32_t
> 8 bytes uncompressed size as little endian uint64_t;
> UINT64_MAX means unknown and then (and only then)
> EOPM must be present
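
The 13-byte header layout above can be sketched in Python as follows. The helper names `fake_alone_header` and `decode_lzma1` are hypothetical. The usage at the end only exercises the "unknown size" form, because streams written by `lzma.compress` carry an end marker; the known-size path for real .7z payloads is not tested here.

```python
import lzma
import struct

UNKNOWN_SIZE = 0xFFFFFFFFFFFFFFFF  # UINT64_MAX: size unknown, EOPM required


def fake_alone_header(props: bytes, uncompressed_size: int = UNKNOWN_SIZE) -> bytes:
    """Build the 13-byte header: 5 property bytes + little-endian uint64 size.

    props is the 1-byte lc/lp/pb value followed by the 4-byte dictionary size.
    """
    if len(props) != 5:
        raise ValueError('LZMA1 properties must be exactly 5 bytes')
    return props + struct.pack('<Q', uncompressed_size)


def decode_lzma1(props: bytes, payload: bytes,
                 uncompressed_size: int = UNKNOWN_SIZE) -> bytes:
    """Decode headerless LZMA1 data through the FORMAT_ALONE decoder."""
    decompressor = lzma.LZMADecompressor(format=lzma.FORMAT_ALONE)
    header = fake_alone_header(props, uncompressed_size)
    return decompressor.decompress(header + payload)


# Usage: split a real .lzma stream into properties and headerless payload,
# then decode it again through a rebuilt header.
data = b'hello lzma ' * 100
alone = lzma.compress(data, format=lzma.FORMAT_ALONE)
props, payload = alone[:5], alone[13:]
print(decode_lzma1(props, payload) == data)
```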

--
title: LZMADecompressor.decompress(FORMAT_RAW) truncates output when input is
particular LZMA+BCJ data -> Docs: More description of reason about LZMA1 data
handling with FORMAT_ALONE




[issue41210] Docs: More description of reason about LZMA1 data handling with FORMAT_ALONE

2020-08-02 Thread Hiroshi Miura


Hiroshi Miura  added the comment:

Here is a draft of the additional text:


Using :const:`FILTER_LZMA1` with :const:`FORMAT_RAW` is not recommended,
because decompressing a combination of :const:`FILTER_LZMA1` and BCJ filters
in :const:`FORMAT_RAW` may produce wrong output under certain conditions.
This is because LZMA1 data sometimes lacks an End of Stream (EOS) marker,
in which case the BCJ filters cannot be flushed.


I've tried to write it without describing the liblzma implementation, relying
only on the nature of the API and the file format specification.

--




[issue41210] Docs: More description(warning) about LZMA1 + BCJ with FORMAT_RAW

2020-08-02 Thread Hiroshi Miura


Change by Hiroshi Miura :


--
title: Docs: More description of reason about LZMA1 data handling with 
FORMAT_ALONE -> Docs: More description(warning) about LZMA1 + BCJ with 
FORMAT_RAW




[issue42057] pytest case which catch exceptions become segfault

2020-10-16 Thread Hiroshi Miura


New submission from Hiroshi Miura :

I've observed that pytest crashes with a segmentation fault on Python 3.9.0
with the attached case.

I've tested the case with several Python versions:
  python 3.9.0a2 - good
  python 3.9.0a3, 3.9.0-final - bad
  python 3.10.0a1 - good

- OS: Mint Linux 20, Linux kernel 5.8.14
- Attachments:
* test_main.py  - pytest test case to reproduce
* py_stacktrace.txt - pytest output showing the segmentation fault
* gdb_backtrace.txt - gdb backtrace

So I bisected, and the result is as follows:

9af0e47b1705457bb6b327c197f2ec5737a1d8f6 is the first bad commit
commit 9af0e47b1705457bb6b327c197f2ec5737a1d8f6
Author: Mark Shannon 
Date:   Tue Jan 14 10:12:45 2020 +

bpo-39156: Break up COMPARE_OP into four logically distinct opcodes. 
(GH-17754)

Break up COMPARE_OP into four logically distinct opcodes:
* COMPARE_OP for rich comparisons
* IS_OP for 'is' and 'is not' tests
* CONTAINS_OP for 'in' and 'is not' tests
* JUMP_IF_NOT_EXC_MATCH for checking exceptions in 'try-except' statements.

 Doc/library/dis.rst|   21 +
 Include/opcode.h   |8 +-
 Lib/importlib/_bootstrap_external.py   |3 +-
 Lib/opcode.py  |7 +-
 Lib/test/test_dis.py   |  141 +-
 Lib/test/test_peepholer.py |   12 +-
 Lib/test/test_positional_only_arg.py   |6 +-
 .../2019-12-30-10-53-59.bpo-39156.veT-CB.rst   |9 +
 PC/launcher.c  |3 +-
 Python/ceval.c |  137 +-
 Python/compile.c   |   71 +-
 Python/importlib.h | 2922 +++--
 Python/importlib_external.h| 4560 ++--
 Python/importlib_zipimport.h   | 1831 
 Python/opcode_targets.h|6 +-
 Python/peephole.c  |6 +-
 Tools/scripts/generate_opcode_h.py |5 -
 17 files changed, 4901 insertions(+), 4847 deletions(-)
 create mode 100644 Misc/NEWS.d/next/Core and 
Builtins/2019-12-30-10-53-59.bpo-39156.veT-CB.rst
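
To see the opcode split from this commit, here is a small sketch using the `dis` module. It assumes an interpreter that includes the commit (3.9 or later). The helper name `opnames` is made up, and the full instruction list differs between versions, so only the comparison-related opcode names are of interest.

```python
import dis


def opnames(expression: str) -> set:
    """Return the set of opcode names used to evaluate an expression."""
    code = compile(expression, '<example>', 'eval')
    return {instruction.opname for instruction in dis.get_instructions(code)}


# Rich comparison, identity test, and membership test each get their own opcode.
for expression in ('a < b', 'a is b', 'a in b'):
    print(expression, sorted(opnames(expression)))
```

JUMP_IF_NOT_EXC_MATCH is not shown because later releases changed how exception matching is compiled.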

--
components: Interpreter Core
files: test_main.py
messages: 378796
nosy: miurahr
priority: normal
severity: normal
status: open
title: pytest case which catch  exceptions become segfault
type: crash
versions: Python 3.9
Added file: https://bugs.python.org/file49521/test_main.py

___
Python tracker 
<https://bugs.python.org/issue42057>
___



[issue42057] pytest case which catch exceptions become segfault

2020-10-16 Thread Hiroshi Miura


Change by Hiroshi Miura :


Added file: https://bugs.python.org/file49522/py_stacktrace.txt




[issue42057] pytest case which catch exceptions become segfault

2020-10-16 Thread Hiroshi Miura


Change by Hiroshi Miura :


Added file: https://bugs.python.org/file49523/gdb_backtrace.txt




[issue42057] pytest case which catch exceptions become segfault

2020-10-16 Thread Hiroshi Miura


Hiroshi Miura  added the comment:

Here is the result of running pytest on Python 3.9.0 under gdb.

 test session starts 

platform linux -- Python 3.9.0, pytest-4.6.9, py-1.8.1, pluggy-0.13.0
rootdir: /home/miurahr/Projects/cpython
collected 2 items

test_main.py .
Program received signal SIGSEGV, Segmentation fault.
PyException_GetContext (self=self@entry=) 
at ../Objects/exceptions.c:351
warning: Source file is more recent than executable.
351 Py_XINCREF(context);

--




[issue42057] pytest case which catch exceptions become segfault

2020-10-16 Thread Hiroshi Miura


Hiroshi Miura  added the comment:

FYI: the following commit fixes the issue in the 3.10 development branch.

6e8128f02e ("bpo-41323: Perform 'peephole' optimizations directly on the CFG. 
(GH-21517)", 2020-07-30)

--
nosy: +Mark.Shannon




[issue42057] pytest case which catch exceptions become segfault

2020-10-16 Thread Hiroshi Miura


Hiroshi Miura  added the comment:

The test code does not always reproduce the issue. Please try it several times.

It seems to happen when multiple threads execute the same function, which
raises an exception, and both callers try to catch the exception at the same
time.
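
A sketch of the scenario described above. This is not the attached test_main.py; all names are made up, and it may well not trigger the crash. On an unaffected interpreter it simply prints the per-thread counts.

```python
import threading


def always_fails():
    """The shared function that every thread calls and that always raises."""
    raise ValueError('expected failure')


def worker(results, index, rounds=1000):
    """Call the failing function repeatedly and count the caught exceptions."""
    caught = 0
    for _ in range(rounds):
        try:
            always_fails()
        except ValueError:
            caught += 1
    results[index] = caught


results = [0] * 4
threads = [threading.Thread(target=worker, args=(results, i))
           for i in range(len(results))]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
print(results)
```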

--
