[ 
https://issues.apache.org/jira/browse/VFS-637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058831#comment-16058831
 ] 

Guido Schnepp commented on VFS-637:
-----------------------------------

Not much, because both encodings differ substantially from IBM437:

java.nio.charset.StandardCharsets.US_ASCII is a 7-bit charset only, so all 
characters from 0x80 upwards are missing. That is the most interesting part of 
codepage 437 for non-US users; it is where the German umlauts are found, for 
example.

The extended-ASCII code pages (all of them) and the ANSI code pages (most of 
them) are 8-bit charsets that are based on US-ASCII but also define code points 
0x80-0xFF, each in a different way! As an example, take the German umlauts Ä 
(Latin capital letter A with diaeresis) and ä (Latin small letter a with 
diaeresis):
IBM437:     Ä on 0x8E,   ä on 0x84   ( https://de.wikipedia.org/wiki/Codepage_437 )
ISO-8859-1: Ä on 0xC4,   ä on 0xE4   ( https://de.wikipedia.org/wiki/ISO_8859-1 )
Unicode:    Ä on U+00C4, ä on U+00E4 ( https://de.wikipedia.org/wiki/Umlaut )
US-ASCII:   Ä nonexistent, ä nonexistent
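
The values above can be checked with a few lines of Java. This is only a minimal sketch: the class name UmlautBytes and the helper byteValue are made up for illustration, and it assumes the runtime ships the IBM437 charset (a full JDK normally does, a stripped-down runtime may not).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class UmlautBytes {

    /**
     * Returns the single byte value (0-255) that encodes ch in the given
     * charset, or -1 if the charset cannot represent ch as one byte.
     */
    public static int byteValue(char ch, Charset cs) {
        // canEncode() avoids the silent '?' replacement done by getBytes()
        if (!cs.newEncoder().canEncode(ch)) {
            return -1;
        }
        byte[] bytes = String.valueOf(ch).getBytes(cs);
        return bytes.length == 1 ? (bytes[0] & 0xFF) : -1;
    }

    public static void main(String[] args) {
        // Assumption: IBM437 is available; otherwise forName() throws
        // UnsupportedCharsetException.
        Charset ibm437 = Charset.forName("IBM437");
        char[] umlauts = {'\u00C4', '\u00E4'}; // Ä and ä
        for (char ch : umlauts) {
            System.out.printf("U+%04X: IBM437=%d ISO-8859-1=%d US-ASCII=%d%n",
                    (int) ch,
                    byteValue(ch, ibm437),
                    byteValue(ch, StandardCharsets.ISO_8859_1),
                    byteValue(ch, StandardCharsets.US_ASCII));
        }
    }
}
```

The output is in decimal; -1 for US-ASCII means the character has no mapping there at all.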

PKWare is only a vendor of ZIP-aware products, but with pkzip it was one of the 
first, long before ZIP support was added natively to operating systems or the 
Java runtime. Personally, I trust their expertise here. BUT: I have also read on 
other pages that when the ZIP format was developed, the encoding was 
+not defined+ (or not considered) in any way. The language encoding bit was 
added later. So in the long term, some way is needed for the user to set the 
desired encoding, because you may also come across ZIP files with native Russian 
characters inside, which are not supported by IBM437 at all.
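
To illustrate what such a user-selectable encoding could look like at the java.util.zip level, here is a rough sketch. The class LegacyZipNames and its method listEntryNames are hypothetical names, not part of commons-vfs2; the only API relied on is the two-argument ZipFile(File, Charset) constructor available since Java 7. How the charset would be configured in VFS itself is left open here.

```java
import java.io.File;
import java.io.IOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

public class LegacyZipNames {

    /**
     * Lists the entry names of a ZIP file. Names of entries that do not
     * have the language encoding (UTF-8) flag set are decoded with the
     * charset chosen by the caller.
     */
    public static List<String> listEntryNames(File file, Charset legacyCharset)
            throws IOException {
        try (ZipFile zip = new ZipFile(file, legacyCharset)) {
            List<String> names = new ArrayList<>();
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
            return names;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 1) {
            System.err.println("usage: LegacyZipNames <zipfile> [charset]");
            return;
        }
        // Assumption: IBM437 as fallback, following PKWare Appendix D;
        // the user may pass any charset name the runtime supports.
        Charset charset = Charset.forName(args.length > 1 ? args[1] : "IBM437");
        for (String name : listEntryNames(new File(args[0]), charset)) {
            System.out.println(name);
        }
    }
}
```

Passing the charset as a parameter rather than hard-coding IBM437 is what would cover the Russian case as well, provided the user knows which code page the archive was created with.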

Another interesting source is the javadoc of the newer Oracle Java zip package: 
https://docs.oracle.com/javase/7/docs/api/java/util/zip/package-summary.html . 
They, in turn, refer to the PKWare Appendix D already mentioned in the ticket above.

> Zip files with legacy encoding and special characters let VFS crash
> -------------------------------------------------------------------
>
>                 Key: VFS-637
>                 URL: https://issues.apache.org/jira/browse/VFS-637
>             Project: Commons VFS
>          Issue Type: Bug
>         Environment: Windows 10 64 Bit, Java 8
>            Reporter: Guido Schnepp
>              Labels: easyfix
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Oracle has reworked the ZipFile object with Java 7. Since then the default 
> constructor used by commons-vfs2 2.1 is more restrictive than with Java 6. 
> The ZipFile constructor has got a second parameter (Charset) now for 
> specification of the legacy charset to be used explicitly if the ZipFile 
> doesn't state its UTF-8 compliance internally. This affects all ZIP files 
> using a legacy charset for filename encoding instead of UTF-8, as is 
> common today. This could be a ZIP file containing German umlauts or Russian 
> characters in the names of the archived files, for example.
> To support this new parameter with (more or less) default values, the class 
> org.apache.commons.vfs2.provider.zip.ZipFileSystem has to be extended by a 
> default charset parameter, getter or setter (as you like) to forward this 
> setting to the java.util.zip.ZipFile constructor.
> Quick workaround for me was to create a new OwnZipFileProvider referring to 
> the even new OwnZipFileSystem (extending ZipFileSystem) with the following 
> modified function. Change has been highlighted:
> {{    protected ZipFile createZipFile(final File file) throws 
> FileSystemException {
>               try {
>                       return new ZipFile(file{color:red}*, 
> Charset.forName("IBM437")*{color});
>               } catch (final IOException ioe) {
>                       throw new 
> FileSystemException("vfs.provider.zip/open-zip-file.error", file, ioe);
>               }
>       }
> }}
> Presetting to charset 437 as legacy default charset seems to be a good 
> workaround as stated in appendix D here: 
> https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT :
> "D.1 The ZIP format has historically supported only the original IBM PC 
> character encoding set, commonly referred to as IBM Code Page 437.  This 
> limits storing file name characters to only those within the original MS-DOS 
> range of values and does not properly support file names in other character 
> encodings, or  languages. [...]"



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
