[issue38861] zipfile: Corrupts filenames containing non-UTF8 characters

2019-11-27 Thread Serhiy Storchaka
Serhiy Storchaka added the comment: The standard requires interpreting the filename encoding as cp437 or UTF-8. For practical reasons, though, it would be handy to allow specifying another encoding (not necessarily equal to the local filesystem encoding). This is issue28080. But I left this issue
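
As a rough illustration of the "other encoding" case discussed above: when the UTF-8 flag (general purpose bit 11) is unset, zipfile decodes the stored name bytes as cp437, so the original bytes can usually be recovered by encoding the name back to cp437 and decoding with the encoding the archive's creator actually used. The helper name `redecode_name` and the choice of latin-1 below are hypothetical, used only for this sketch; the caller has to know or guess the real encoding. (Python 3.11 and later also accept a `metadata_encoding` argument to `zipfile.ZipFile` when reading, which may make this workaround unnecessary.)

```python
UTF8_FLAG = 0x800  # general purpose bit 11: name is stored as UTF-8


def redecode_name(name: str, flag_bits: int, encoding: str) -> str:
    """Re-interpret a name that zipfile decoded as cp437.

    `name` and `flag_bits` would typically come from a ZipInfo object
    (its `filename` and `flag_bits` attributes). `encoding` is the
    caller's assumption about what the archive's creator really used.
    """
    if flag_bits & UTF8_FLAG:
        # The name was flagged as UTF-8, so there is nothing to undo.
        return name
    # Python's cp437 codec maps every byte value, so this round trip
    # recovers the raw bytes that were stored in the archive.
    return name.encode("cp437").decode(encoding)


# Simulate a latin-1 name that was read back through cp437.
stored_bytes = "café.txt".encode("latin-1")
as_seen_by_zipfile = stored_bytes.decode("cp437")
print(redecode_name(as_seen_by_zipfile, 0, "latin-1"))
```

This only helps for reading; it says nothing about how names are written, and it cannot detect the right encoding on its own.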

[issue38861] zipfile: Corrupts filenames containing non-UTF8 characters

2019-11-25 Thread John Goerzen
John Goerzen added the comment: Hi Jon, I've read your article in the gist, the ZIP spec, and the article you linked to. As the article you linked to (https://marcosc.com/2008/12/zip-files-and-encoding-i-hate-you/) states, "Implementers just encode file names however they want (usually byt

[issue38861] zipfile: Corrupts filenames containing non-UTF8 characters

2019-11-25 Thread Jon Nalley
Jon Nalley added the comment: Please see a detailed explanation of the behavior here: https://gist.github.com/jnalley/cec21bca2d865758bc5e23654df28bd5

[issue38861] zipfile: Corrupts filenames containing non-UTF8 characters

2019-11-24 Thread John Goerzen
John Goerzen added the comment: I can tell you that the zip(1) on Unix systems has never done re-encoding to cp437; on a system that uses latin-1 (or any other latin-* for that matter) the filenames in the ZIP will be encoded in latin-1. Furthermore, this doesn't explain the corruption that

[issue38861] zipfile: Corrupts filenames containing non-UTF8 characters

2019-11-24 Thread Jon Nalley
Jon Nalley added the comment: I think the Python implementation is adhering to the zip specification. From the specification v6.3.6 (Revised: April 26, 2019): If general purpose bit 11 is unset, the file name and comment SHOULD conform to the original ZIP character encoding. If general pur
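
A minimal sketch of the rule being quoted, assuming the "original ZIP character encoding" is cp437 and that a set bit 11 means UTF-8: the reader picks the codec from the entry's general purpose flags. The function name `decode_zip_name` is made up for this example and is not part of zipfile's public API.

```python
UTF8_FLAG = 0x800  # general purpose bit 11 (1 << 11)


def decode_zip_name(raw: bytes, flag_bits: int) -> str:
    """Decode a raw file name from a ZIP header according to bit 11."""
    if flag_bits & UTF8_FLAG:
        # Bit 11 set: the name is stored as UTF-8.
        return raw.decode("utf-8")
    # Bit 11 unset: fall back to the original ZIP encoding (cp437).
    return raw.decode("cp437")


# The same two bytes give different names depending on the flag.
raw_name = "é".encode("utf-8")
print(decode_zip_name(raw_name, UTF8_FLAG))
print(decode_zip_name(raw_name, 0))
```

An archive written on a latin-1 system without bit 11 set would therefore be read as cp437 under this rule, which is the mismatch debated in this thread.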

[issue38861] zipfile: Corrupts filenames containing non-UTF8 characters

2019-11-19 Thread John Goerzen
New submission from John Goerzen: The zipfile.py standard library component contains a number of pieces of questionable handling of non-UTF8 filenames. As the ZIP file format predated Unicode by a significant number of years, this is actually fairly common with older code. Here is a very s