[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive

2020-10-28 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: This is a duplicate of issue28494. I concur with Gregory. What we can do is improve the documentation. -- resolution: -> duplicate; stage: -> resolved; status: open -> closed; superseder: -> is_zipfile false positives

[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive

2020-10-27 Thread Gregory P. Smith
Gregory P. Smith added the comment: The first four bytes of the file do not identify a zip file. A zip file is identified by the end-of-central-directory record at the end of the file, whose entries you must then read to determine where the start of the archive may be, which is often not at position zero.
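A minimal sketch of the point made in this comment, assuming an in-memory archive with arbitrary bytes prepended to it (the `JUNK-PREFIX` bytes and the variable names are illustrative, not taken from the thread). The archive no longer begins with the local file header signature, yet `zipfile` still recognises and reads it because it locates the archive from the end of the data:

```python
import io
import zipfile

# Build a small, valid zip archive in memory.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hello")

# Prepend arbitrary bytes, as a self-extracting stub would (illustrative only).
payload = b"JUNK-PREFIX" + buf.getvalue()

# The first four bytes are no longer the local file header signature.
first4 = payload[:4]

# is_zipfile() searches for the end-of-central-directory record instead.
detected = zipfile.is_zipfile(io.BytesIO(payload))

# The archive can still be opened and listed despite the prefix.
with zipfile.ZipFile(io.BytesIO(payload)) as zf:
    names = zf.namelist()

print(first4, detected, names)
```

The same mechanism is what makes false positives possible: any file whose tail happens to contain bytes resembling that record can pass the check.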

[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive

2020-10-27 Thread STINNER Victor
STINNER Victor added the comment:
> functions whose point is to be fast and not consume an amount of memory
> determined by the input data
I proposed to read the first 4 bytes of the file. It's a fixed length.
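The fixed-length check proposed here could look roughly like the sketch below. The helper name `starts_with_zip_magic` is hypothetical and is not part of `zipfile`; it only compares the first four bytes against the local file header signature `PK\x03\x04`. As other comments in the thread point out, such a check would reject valid archives that do not start at position zero, and it also rejects an empty archive, which consists of the end record alone.

```python
import gzip
import io
import zipfile

ZIP_LOCAL_HEADER_MAGIC = b"PK\x03\x04"


def starts_with_zip_magic(fp):
    """Hypothetical helper: True if fp starts with the local file header magic.

    Reads exactly four bytes and restores the stream position afterwards.
    """
    pos = fp.tell()
    try:
        return fp.read(4) == ZIP_LOCAL_HEADER_MAGIC
    finally:
        fp.seek(pos)


# A gzip stream starts with b"\x1f\x8b", so it fails the check.
gz_stream = io.BytesIO(gzip.compress(b"{}"))

# An ordinary, non-empty zip archive passes it.
zip_stream = io.BytesIO()
with zipfile.ZipFile(zip_stream, "w") as zf:
    zf.writestr("data.json", "{}")
zip_stream.seek(0)

# An empty archive holds only the end record, so it fails as well.
empty_zip = io.BytesIO()
zipfile.ZipFile(empty_zip, "w").close()
empty_zip.seek(0)

print(starts_with_zip_magic(gz_stream))
print(starts_with_zip_magic(zip_stream))
print(starts_with_zip_magic(empty_zip))
```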

[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive

2020-10-27 Thread Brian Kohan
Brian Kohan added the comment: I concur with Gregory. It seems that the action here is to just make it apparent in the docs the very real possibility of false positives. In my experience processing data from the wild, I see a pretty high rate of about 1/1000. I'm sure the probability is a fu
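For callers who cannot tolerate false positives, one possible workaround is to treat `is_zipfile()` as a cheap pre-filter and then actually open the archive. This is only a sketch under that assumption: the helper name `looks_like_usable_zip` is hypothetical, and `testzip()` reads every member, so the check is no longer cheap.

```python
import gzip
import io
import zipfile


def looks_like_usable_zip(fp):
    """Hypothetical stricter check: pre-filter, then open and verify members."""
    fp.seek(0)
    if not zipfile.is_zipfile(fp):
        return False
    fp.seek(0)
    try:
        with zipfile.ZipFile(fp) as zf:
            # testzip() returns the name of the first corrupt member, or None.
            return zf.testzip() is None
    except zipfile.BadZipFile:
        return False


# A real archive passes both stages.
good = io.BytesIO()
with zipfile.ZipFile(good, "w") as zf:
    zf.writestr("data.json", "{}")

# A small gzip stream is rejected by the pre-filter.
gz = io.BytesIO(gzip.compress(b"{}", mtime=0))

print(looks_like_usable_zip(good), looks_like_usable_zip(gz))
```

Other exceptions (for example from unsupported compression methods) are deliberately not caught here; a real caller would need to decide how to treat them.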

[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive

2020-10-27 Thread Gregory P. Smith
Gregory P. Smith added the comment: ZipFile.open() is not the code for opening a zip file. :) That's the code for opening a file embedded within an already constructed mode='r' archive, as done by the ZipFile.__init__() constructor. By the time you've gotten to the open() method, you've loaded

[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive

2020-10-26 Thread Serhiy Storchaka
Change by Serhiy Storchaka: -- nosy: +serhiy.storchaka

[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive

2020-10-26 Thread Serhiy Storchaka
Change by Serhiy Storchaka: -- assignee: -> serhiy.storchaka

[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive

2020-10-26 Thread STINNER Victor
STINNER Victor added the comment: ZipFile.open() checks the first 4 bytes:
    # Skip the file header:
    fheader = zef_file.read(sizeFileHeader)
    if len(fheader) != sizeFileHeader:
        raise BadZipFile("Truncated file header")
    fheader = stru

[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive

2020-10-26 Thread Alex Roussel
Alex Roussel added the comment: OK understood. Thanks for the explanation, I wasn't aware of those and will take a look.

[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive

2020-10-24 Thread Gregory P. Smith
Gregory P. Smith added the comment: For what it's worth: false positives are always going to be possible in any such "magic" check as is_zipfile is. We don't check the start of the file because zip files are defined by their end of file central directory which contains length information to

[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive

2020-10-24 Thread Irit Katriel
Change by Irit Katriel: -- nosy: +gregory.p.smith, vstinner

[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive

2020-10-23 Thread Alex Roussel
Alex Roussel added the comment: The impression I got from reading https://bugs.python.org/issue28494 was that this was fixed in Python 3.7? Or perhaps, as you say, it's just a matter of stumbling across the rare files that generate false positives.

[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive

2020-10-22 Thread Brian Kohan
Brian Kohan added the comment: Hi all, I'm experiencing the same issue. I took a look at the is_zipfile code. It seems like it's not checking the start of the file for the magic numbers, but looking deeper in. I presume because the magic numbers at the start are considered unreliable for some

[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive

2020-10-22 Thread Alex Roussel
Alex Roussel added the comment: Hi Irit, Thank you for the response, I'm afraid I'm not allowed to upload the file myself, however the file in question is '2020-10-18-1602979256-http_get_7549.json.gz', which is available at this link https://opendata.rapid7.com/sonar.http/?page=1. It becom

[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive

2020-10-21 Thread Irit Katriel
Irit Katriel added the comment: Are you able to attach a file with which you see this problem? Have you tried with newer Python versions? -- nosy: +iritkatriel

[issue42096] zipfile.is_zipfile incorrectly identifying a gzipped file as a zip archive

2020-10-20 Thread Alex Roussel
New submission from Alex Roussel : Hello, I've come across an issue that seems similar to the false positives problem outlined in this ticket (https://bugs.python.org/issue28494), however this issue relates to a single gzipped json file which is incorrectly identified as a .zip archive becau