New submission from Valentin Samir :
When tarfile opens a tar archive containing a sparse file whose actual data is
larger than 0o77777777777 bytes (~8GB), it fails to list the members that
follow this file. As a result, it is impossible to access or extract files
located after such a file inside the archive.
A tar file exhibiting the issue is available at https://genua.fr/sample.tar.xz
Uncompressed, the file is ~16GB.
It contains two files:
* disk.img, a 50GB sparse file containing ~16GB of data
* README.txt, a simple text file containing "This last file is not properly
listed"
disk.img was generated using the following Python script:
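GB = 1024**3
# Largest value an 11-digit octal ustar size field can hold (8 GiB - 1)
USTAR_MAX_SIZE = 0o77777777777
buf = b"\xFF" * 1024**2  # 1 MiB of non-zero data
with open('disk.img', 'wb') as f:
    # Leave a 10 GB hole at the start of the file
    f.seek(10 * GB)
    written = 0
    # Write real data until the ustar size limit is exceeded
    while written < USTAR_MAX_SIZE:
        written += f.write(buf)
        f.flush()
        print(written / USTAR_MAX_SIZE * 100, '%')
    # Extend the file to 50 GB, leaving a second hole before the last byte
    f.seek(50 * GB - 1)
    f.write(b'\0')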
sample.tar was generated using GNU tar 1.30 on Debian 10 with the following
command:
tar --format pax -cvSf sample.tar disk.img README.txt
The following script exposes the issue:
import tarfile
t = tarfile.open('sample.tar')
print('members', t.getmembers())
print('offset', t.offset)
Its output is:
members []
offset 17179806208
members should also list README.txt.
I think I have found the root cause of the bug: because the file is bigger than
0o77777777777 bytes, its size cannot be stored in the tar ustar header, so a
"size" pax extended header is generated. This header contains the full size of
the file block in the tar.
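To illustrate the limit mentioned above, here is a small sketch. It assumes the
classic ustar layout, where the size field is 12 bytes wide and holds 11 octal
digits plus a terminator. The names SIZE_FIELD_WIDTH, USTAR_MAX_SIZE and
fits_in_ustar_size_field are made up for this example and are not part of
tarfile.

```python
# Sketch: why a size above 0o77777777777 does not fit in a ustar header.
SIZE_FIELD_WIDTH = 12                # bytes reserved for the size field
OCTAL_DIGITS = SIZE_FIELD_WIDTH - 1  # one byte is used by the terminator

# Largest value that 11 octal digits can represent: 8 GiB - 1.
USTAR_MAX_SIZE = 8 ** OCTAL_DIGITS - 1


def fits_in_ustar_size_field(size):
    """Return True if size can be written as at most 11 octal digits."""
    return 0 <= size <= USTAR_MAX_SIZE


print(oct(USTAR_MAX_SIZE))
print(fits_in_ustar_size_field(8 * 1024**3))  # exactly 8 GiB is one byte too many
```

Any member whose stored size exceeds this value needs the "size" pax extended
header instead.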
As the file is sparse and stored in sparse format 1.0, the file block contains
first a sparse mapping, then the file data. So the size of this block is the
size of the mapping plus the size of the data.
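For readers unfamiliar with the layout, here is a rough sketch of how such a
mapping can be read. It assumes the GNU sparse 1.0 format as I understand it:
newline-delimited decimal numbers (an entry count, then offset/size pairs),
padded with NULs to a 512-byte block boundary. read_sparse_map is a
hypothetical helper, not tarfile's own code, and the sample values are invented.

```python
import io

BLOCKSIZE = 512


def read_sparse_map(fileobj):
    """Read a sparse 1.0 map from the start of a member's data block.

    Returns (entries, map_size), where entries is a list of
    (offset, numbytes) pairs and map_size is the number of bytes the
    map occupies, always a multiple of BLOCKSIZE.
    """
    buf = b""
    blocks_read = 0
    numbers = []
    wanted = 1  # at first, only the entry count is expected
    while len(numbers) < wanted:
        # Pull in whole blocks until a complete number is available.
        while b"\n" not in buf:
            chunk = fileobj.read(BLOCKSIZE)
            if len(chunk) < BLOCKSIZE:
                raise ValueError("truncated sparse map")
            buf += chunk
            blocks_read += 1
        field, buf = buf.split(b"\n", 1)
        numbers.append(int(field))
        if len(numbers) == 1:
            # The count is followed by one offset/size pair per entry.
            wanted = 1 + 2 * numbers[0]
    entries = list(zip(numbers[1::2], numbers[2::2]))
    return entries, blocks_read * BLOCKSIZE


# Invented map with two data regions, padded to one block.
example = b"2\n0\n1024\n4096\n512\n".ljust(BLOCKSIZE, b"\0")
entries, map_size = read_sparse_map(io.BytesIO(example))
print(entries, map_size)
```

The file data starts map_size bytes after the beginning of the block, which is
why the stored block size is larger than the amount of real data.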
Because the file is sparse, a GNU.sparse.realsize header is also added
containing the full expanded file size (here 50GB).
Here
https://github.com/python/cpython/blob/4dee92b0ad9f4e3ea2f5253340801bb92dc7/Lib/tarfile.py#L1350
tarfile sets the tarinfo size to GNU.sparse.realsize (50GB). Then, in this
block
https://github.com/python/cpython/blob/4dee92b0ad9f4e3ea2f5253340801bb92dc7/Lib/tarfile.py#L1297
the file offset is moved forward by GNU.sparse.realsize (50GB) instead of by
pax_headers["size"]. Moreover, the move starts from next.offset_data, which is
set at https://github.com/python/cpython/blob/master/Lib/tarfile.py#L1338 to
point just after the sparse mapping.
The move forward in the sparse file should be made from next.offset + BLOCKSIZE.
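A minimal sketch of the computation I have in mind is below. It is not a patch
against tarfile: next_header_offset is a hypothetical helper, and the numbers
in the usage lines are invented. It assumes the stored size (pax_headers["size"])
covers both the sparse mapping and the data, and that members are padded to
512-byte blocks.

```python
BLOCKSIZE = 512


def next_header_offset(header_offset, stored_size):
    """Return the offset of the header that follows a member.

    header_offset is where the member's own header block starts, and
    stored_size is the size of its block in the archive (sparse
    mapping plus data), not the expanded file size.
    """
    # The member's content starts right after its one-block header.
    content_start = header_offset + BLOCKSIZE
    # Content is padded with NULs up to a whole number of blocks.
    blocks, remainder = divmod(stored_size, BLOCKSIZE)
    if remainder:
        blocks += 1
    return content_start + blocks * BLOCKSIZE


# Invented example: header at 1024, a one-block map plus 1000 bytes of data.
print(next_header_offset(1024, 512 + 1000))
```

Using the expanded size (GNU.sparse.realsize) as stored_size, or starting from
the offset after the mapping, would land past the next header.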
--
components: Library (Lib)
messages: 362275
nosy: Nit
priority: normal
severity: normal
status: open
title: tarfile: GNU sparse 1.0 pax tar header offset not properly computed
type: behavior
versions: Python 2.7, Python 3.5, Python 3.6, Python 3.7, Python 3.8, Python 3.9
___
Python tracker
<https://bugs.python.org/issue39688>
___