[
https://issues.apache.org/jira/browse/VFS-637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16058831#comment-16058831
]
Guido Schnepp commented on VFS-637:
-----------------------------------
Not much, because both encodings differ substantially from IBM437:
java.nio.charset.StandardCharsets.US_ASCII is a 7-bit charset only, so all
characters at 0x80 and above are missing. That range is the most interesting
part of codepage 437 for non-US users; it is where you find the German
umlauts, for example.
Extended ASCII (any) and ANSI (most) charsets are 8-bit charsets based on
US-ASCII, but each defines code points 0x80-0xFF differently. Take the German
umlauts Ä (Latin capital letter A with diaeresis) and ä (Latin small letter a
with diaeresis) as a sample:
IBM437: Ä at 0x8E, ä at 0x84 ( https://de.wikipedia.org/wiki/Codepage_437 )
ISO-8859-1: Ä at 0xC4, ä at 0xE4 ( https://de.wikipedia.org/wiki/ISO_8859-1 )
UNICODE: Ä at U+00C4, ä at U+00E4 ( https://de.wikipedia.org/wiki/Umlaut )
US-ASCII: Ä not present, ä not present
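The byte values listed above can be checked with a few lines of Java. This is only a sketch: the class name UmlautEncodingDemo and the helper hex are made up for illustration, and it assumes the runtime ships the IBM437 charset (it belongs to the JDK's extended charsets, so a stripped-down runtime may not have it).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of commons-vfs2.
class UmlautEncodingDemo {

    /** Encodes text with the given charset and returns the bytes as space-separated hex. */
    static String hex(String text, Charset charset) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(charset)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file independent of its own encoding.
        String umlauts = "\u00C4\u00E4"; // capital and small a with diaeresis

        // Assumption: IBM437 is available; Charset.forName throws otherwise.
        Charset ibm437 = Charset.forName("IBM437");

        System.out.println("IBM437:     " + hex(umlauts, ibm437));
        System.out.println("ISO-8859-1: " + hex(umlauts, StandardCharsets.ISO_8859_1));
        // US-ASCII has no mapping for these characters at all.
        System.out.println("US-ASCII can encode: "
                + StandardCharsets.US_ASCII.newEncoder().canEncode(umlauts));
    }
}
```

Note that String.getBytes silently substitutes a replacement byte for unmappable characters, which is why the US-ASCII case is checked with canEncode instead of being printed as hex.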
PKWare is only one vendor of ZIP-aware products, but with pkzip it was one of
the first, long before ZIP support was added natively to operating systems or
the Java runtime. I personally trust their expertise here. BUT: I've also read
on other pages that when the ZIP format was developed, encoding was
+not defined+ (or not considered) in any way. The language encoding bit was
added later. So in the long term, some way for the user to set the desired
encoding is required, because you may also come across ZIP files with native
Russian characters in file names, which IBM437 does not support at all.
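As a rough sketch of what "set the desired encoding as user" could look like at the java.util.zip level: the caller passes the legacy charset, and ZipFile uses it only for entries that do not have the language encoding bit set. The class LegacyZipNames and both method names are hypothetical and are not commons-vfs2 API; writeSample exists only to produce a test archive.

```java
import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class for illustration only.
class LegacyZipNames {

    /**
     * Lists all entry names of a ZIP file. Names are decoded with legacyCharset
     * unless an entry is flagged as UTF-8 via the language encoding bit.
     */
    static List<String> entryNames(File file, Charset legacyCharset) {
        try (ZipFile zip = new ZipFile(file, legacyCharset)) {
            List<String> names = new ArrayList<>();
            for (ZipEntry entry : Collections.list(zip.entries())) {
                names.add(entry.getName());
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes a temporary one-entry archive whose file name is encoded with the given charset. */
    static File writeSample(String entryName, Charset charset) {
        try {
            File file = File.createTempFile("legacy", ".zip");
            file.deleteOnExit();
            try (ZipOutputStream out = new ZipOutputStream(new FileOutputStream(file), charset)) {
                out.putNextEntry(new ZipEntry(entryName));
                out.closeEntry();
            }
            return file;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A caller with Russian file names would pass a Cyrillic charset of their choice instead of IBM437; which one is correct depends on the tool that created the archive, since the file itself does not say.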
Another interesting source is the Oracle Java zip package Javadoc:
https://docs.oracle.com/javase/7/docs/api/java/util/zip/package-summary.html .
It, in turn, refers to the PKWare Appendix D already mentioned in the ticket above.
> Zip files with legacy encoding and special characters let VFS crash
> -------------------------------------------------------------------
>
> Key: VFS-637
> URL: https://issues.apache.org/jira/browse/VFS-637
> Project: Commons VFS
> Issue Type: Bug
> Environment: Windows 10 64 Bit, Java 8
> Reporter: Guido Schnepp
> Labels: easyfix
> Original Estimate: 24h
> Remaining Estimate: 24h
>
> Oracle reworked the ZipFile class in Java 7. Since then, the default
> constructor used by commons-vfs2 2.1 is more restrictive than it was in Java 6.
> The ZipFile constructor now has a second parameter (Charset) to specify
> explicitly the legacy charset to be used if the ZIP file doesn't declare
> UTF-8 compliance internally. This affects all ZIP files that use a legacy
> charset for filename encoding instead of UTF-8, which is common today.
> This could be a ZIP file whose archived file names contain German umlauts
> or Russian characters, for example.
> To support this new parameter with (more or less) default values, the class
> org.apache.commons.vfs2.provider.zip.ZipFileSystem has to be extended with a
> default charset parameter, getter or setter (as you like) that forwards this
> setting to the java.util.zip.ZipFile constructor.
> A quick workaround for me was to create a new OwnZipFileProvider referring to
> an equally new OwnZipFileSystem (extending ZipFileSystem) with the following
> modified function. The change is highlighted:
> {{
> protected ZipFile createZipFile(final File file) throws FileSystemException {
>     try {
>         return new ZipFile(file{color:red}*, Charset.forName("IBM437")*{color});
>     } catch (final IOException ioe) {
>         throw new FileSystemException("vfs.provider.zip/open-zip-file.error", file, ioe);
>     }
> }
> }}
> Presetting charset 437 as the legacy default charset seems to be a good
> workaround, as stated in Appendix D here:
> https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT :
> "D.1 The ZIP format has historically supported only the original IBM PC
> character encoding set, commonly referred to as IBM Code Page 437. This
> limits storing file name characters to only those within the original MS-DOS
> range of values and does not properly support file names in other character
> encodings, or languages. [...]"