Bug#779207: patch submission for improving code pages support in unzip

Ivan Sorokin Tue, 28 May 2024 00:42:21 -0700

I slightly modified the patch for unzip:
1) Found a sample ANSI archive, which requires a separate code branch, so added 
it (sample archive attached)
2) Fixed the -I and -O options, which were broken; they now mean the same thing 
for all types of archives, both are left for compatibility, you can specify 
either one. That is, they work similarly to -mcp in 7zip now


  
>Воскресенье, 26 мая 2024, 17:48 +02:00 от Ivan Sorokin <un...@mail.ru>:
> 
>Dear colleagues,
> 
>I am writing to bring to your attention an issue with the current upstream 
>version of unzip that has not been updated for many years. In the modern 
>environment, where the vast majority of systems use UTF-8, unzip exhibits 
>several problems that need addressing:
> 
>1) unzip is unable to correctly extract files containing the bit 11 in the 
>General Purpose flag. This bit indicates that the file names are encoded in 
>UTF-8. However, unzip attempts to re-encode them as if they are in OEM 
>codepage, leading to incorrect file names.
>2) By default, unzip does not display UTF-8 encoding correctly on Unix systems.
>3) It is necessary to determine the OEM codepage correctly based on the system 
>locale, rather than using a single codepage for all archives.
>4) The assumption that archives for which the legacy codepage cannot be 
>determined are encoded in ISO 8859-1 is incorrect. In reality, most archivers 
>used the user's system codepage, which could be any codepage. It is reasonable 
>not to alter the encoding in this case, ensuring that the archive can be 
>opened at least on the same system where it was created. Additionally, options 
>-O and -I have been added to specify the encoding manually.
> 
>I have prepared a patch (based on a similar patch from Ubuntu, with 
>significant enhancements) that addresses these issues. A significant 
>difference from the Ubuntu patch is that my code is capable of selecting the 
>OEM codepage based on the system locale, instead of assuming the 
>Russian/Cyrillic CP866 codepage for all archives when the system is set to 
>UTF-8.
> 
>I hope you will find this patch useful.
> 
>Best regards,
>Ivan Sorokin
> 
>

From: Giovanni Scafora <giovanni.archlinux.org>
Subject: unzip files encoded with non-latin, non-unicode file names
Last-Update: 2024-02-22

* Updated 2015-02-11 by Marc Deslauriers <marc.deslauri...@canonical.com>
  to fix buffer overflow in charset_to_intern()
* Updated 2023-06-15 by Dominik Viererbe <dominik.viere...@canonical.com>
  to add documentation for `-I` and `-O` options (LP: #138307) and fixed
  garbled output when `zipinfo` or `unzip -Z` is called without arguments
  (LP: #1429939)
* Updated 2024-05-26 by Ivan Sorokin <un...@mail.ru>
  to fix several problems in codepages support:
  1) Fixed bit 11 of General purpose flag support on systems with UTF-8
  system charset
  2) Fixed OEM code page being always assumed Russian/Cyrillic CP866
  on any UTF-8 system
  3) Added proper OEM code page detection based on system locale setting
  4) Removed translation from ISO 8859-1 to local charset;
  assumption that any non-unicode archive uses it is for sure wrong
  as it can be any charset used on archive creator's local system;
  also do not treat PKZIP for UNIX 2.51 archives
  as having ISO 8859-1 charset for the same reasons
  5) Enabled UTF-8 output by default on Unix systems
  6) Fixed support for some legacy ANSI archives
  7) -I and -O for manually setting charset, they are just the same now,
  for all types of archives with 1-byte charset,
  both left for compatibility

--- a/INSTALL
+++ b/INSTALL
@@ -480,7 +480,7 @@ To compile UnZip, UnZipSFX and/or fUnZip (detailed instructions):
       NO_WORKING_ISPRINT
         The symbol HAVE_WORKING_ISPRINT enables enhanced non-printable chars
         filtering for filenames in the fnfilter() function.  On some systems
-        (Unix, VMS, some Win32 compilers), this setting is enabled by default.
+        (VMS, some Win32 compilers), this setting is enabled by default.
         In cases where isprint() flags printable extended characters as
         unprintable, defining NO_WORKING_ISPRINT allows to disable the enhanced
         filtering capability in fnfilter().  (The ASCII control codes 0x01 to
--- a/fileio.c
+++ b/fileio.c
@@ -2144,9 +2144,15 @@ int do_string(__G__ length, option)   /* return PK-type error code */
                 /* translate the text coded in the entry's host-dependent
                    "extended ASCII" charset into the compiler's (system's)
                    internal text code page */
-                Ext_ASCII_TO_Native((char *)G.outbuf, G.pInfo->hostnum,
-                                    G.pInfo->hostver, G.pInfo->HasUxAtt,
-                                    FALSE);
+#if (defined(UNICODE_SUPPORT) && defined(UTF8_MAYBE_NATIVE))
+                if (!G.pInfo->GPFIsUTF8 || !G.native_is_utf8) {
+#endif
+                        Ext_ASCII_TO_Native((char *)G.outbuf, G.pInfo->hostnum,
+                                            G.pInfo->hostver, G.pInfo->HasUxAtt,
+                                            FALSE);
+#if (defined(UNICODE_SUPPORT) && defined(UTF8_MAYBE_NATIVE))
+                }
+#endif
 #ifdef WINDLL
                 /* translate to ANSI (RTL internal codepage may be OEM) */
                 INTERN_TO_ISO((char *)G.outbuf, (char *)G.outbuf);
@@ -2258,8 +2264,14 @@ int do_string(__G__ length, option)   /* return PK-type error code */
 
         /* translate the Zip entry filename coded in host-dependent "extended
            ASCII" into the compiler's (system's) internal text code page */
-        Ext_ASCII_TO_Native(G.filename, G.pInfo->hostnum, G.pInfo->hostver,
-                            G.pInfo->HasUxAtt, (option == DS_FN_L));
+#if (defined(UNICODE_SUPPORT) && defined(UTF8_MAYBE_NATIVE))
+        if (!G.pInfo->GPFIsUTF8 || !G.native_is_utf8) {
+#endif
+            Ext_ASCII_TO_Native(G.filename, G.pInfo->hostnum, G.pInfo->hostver,
+                                G.pInfo->HasUxAtt, (option == DS_FN_L));
+#if (defined(UNICODE_SUPPORT) && defined(UTF8_MAYBE_NATIVE))
+        }
+#endif
 
         if (G.pInfo->lcflag)      /* replace with lowercase filename */
             STRLOWER(G.filename, G.filename);
--- a/man/unzip.1
+++ b/man/unzip.1
@@ -325,6 +325,8 @@ extension, it is replaced by the info from the extra field.)
 [MacOS only] ignore filenames stored in MacOS extra fields. Instead, the
 most compatible filename stored in the generic part of the entry's header
 is used.
+.IP \fB\-I\fP\ \fICHARSET\fP
+[UNIX only] Specify a character encoding for UNIX and other archives.
 .TP
 .B \-j
 junk paths.  The archive's directory structure is not recreated; all files
@@ -386,6 +388,8 @@ of \fIzip\fP(1), which stores filenotes as comments.
 overwrite existing files without prompting.  This is a dangerous option, so
 use it with care.  (It is often used with \fB\-f\fP, however, and is the only
 way to overwrite directory EAs under OS/2.)
+.IP \fB\-O\fP\ \fICHARSET\fP
+[UNIX only] Specify a character encoding for DOS, Windows and OS/2 archives.
 .IP \fB\-P\fP\ \fIpassword\fP
 use \fIpassword\fP to decrypt encrypted zipfile entries (if any).  \fBTHIS IS
 INSECURE!\fP  Many multi-user operating systems provide ways for any user to
--- a/man/zipinfo.1
+++ b/man/zipinfo.1
@@ -174,6 +174,10 @@ back to the behaviour of previous versions.
 .TP
 .B \-z
 include the archive comment (if any) in the listing.
+.IP \fB\-I\fP\ \fICHARSET\fP
+[UNIX only] Specify a character encoding for UNIX and other archives.
+.IP \fB\-O\fP\ \fICHARSET\fP
+[UNIX only] Specify a character encoding for DOS, Windows and OS/2 archives.
 .PD
 .\" =========================================================================
 .SH "DETAILED DESCRIPTION"
--- a/unix/unix.c
+++ b/unix/unix.c
@@ -30,6 +30,10 @@
 #define UNZIP_INTERNAL
 #include "unzip.h"
 
+#include <iconv.h>
+#include <langinfo.h>
+#include <stdbool.h>
+
 #ifdef SCO_XENIX
 #  define SYSNDIR
 #else  /* SCO Unix, AIX, DNIX, TI SysV, Coherent 4.x, ... */
@@ -1874,3 +1878,207 @@ static void qlfix(__G__ ef_ptr, ef_len)
     }
 }
 #endif /* QLZIP */
+
+
+typedef struct {
+    char *local_charset;
+    char *archive_charset;
+} CHARSET_MAP;
+
+/* A mapping of local <-> archive charsets used by default to convert filenames
+ * of DOS/Windows Zip archives. Currently very basic. */
+static CHARSET_MAP dos_charset_map[] = {
+    { "ANSI_X3.4-1968", "CP850" },
+    { "ISO-8859-1", "CP850" },
+    { "CP1252", "CP850" },
+    { "KOI8-R", "CP866" },
+    { "KOI8-U", "CP866" },
+    { "ISO-8859-5", "CP866" }
+};
+
+char OEM_CP[MAX_CP_NAME] = "";
+char ISO_CP[MAX_CP_NAME] = "";
+char ANSI_CP[MAX_CP_NAME] = "";
+int forcedCP = 0;
+
+/* Try to guess the default value of OEM_CP based on the current locale.
+ * ISO_CP is left alone for now. */
+void init_conversion_charsets()
+{
+    const char *local_charset;
+    int i;
+
+    /* Make a guess only if OEM_CP not already set. */
+    if(*OEM_CP == '\0') {
+        local_charset = nl_langinfo(CODESET);
+        for(i = 0; i < sizeof(dos_charset_map)/sizeof(CHARSET_MAP); i++)
+            if(!strcasecmp(local_charset, dos_charset_map[i].local_charset)) {
+                    strncpy(OEM_CP, dos_charset_map[i].archive_charset,
+                        sizeof(OEM_CP));
+                break;
+            }
+    }
+
+    // Still not detected? Try to detect by system locale
+    if(*OEM_CP == '\0') {
+
+      const char *lcToOemTable[] = {
+        "af_ZA", "CP850", "ar_SA", "CP720", "ar_LB", "CP720", "ar_EG", "CP720",
+        "ar_DZ", "CP720", "ar_BH", "CP720", "ar_IQ", "CP720", "ar_JO", "CP720",
+        "ar_KW", "CP720", "ar_LY", "CP720", "ar_MA", "CP720", "ar_OM", "CP720",
+        "ar_QA", "CP720", "ar_SY", "CP720", "ar_TN", "CP720", "ar_AE", "CP720",
+        "ar_YE", "CP720", "ast_ES", "CP850", "az_AZ", "CP866", "az_AZ", "CP857",
+        "be_BY", "CP866", "bg_BG", "CP866", "br_FR", "CP850", "ca_ES", "CP850",
+        "zh_CN", "CP936", "zh_TW", "CP950", "kw_GB", "CP850", "cs_CZ", "CP852",
+        "cy_GB", "CP850", "da_DK", "CP850", "de_AT", "CP850", "de_LI", "CP850",
+        "de_LU", "CP850", "de_CH", "CP850", "de_DE", "CP850", "el_GR", "CP737",
+        "en_AU", "CP850", "en_CA", "CP850", "en_GB", "CP850", "en_IE", "CP850",
+        "en_JM", "CP850", "en_BZ", "CP850", "en_PH", "CP437", "en_ZA", "CP437",
+        "en_TT", "CP850", "en_US", "CP437", "en_ZW", "CP437", "en_NZ", "CP850",
+        "es_PA", "CP850", "es_BO", "CP850", "es_CR", "CP850", "es_DO", "CP850",
+        "es_SV", "CP850", "es_EC", "CP850", "es_GT", "CP850", "es_HN", "CP850",
+        "es_NI", "CP850", "es_CL", "CP850", "es_MX", "CP850", "es_ES", "CP850",
+        "es_CO", "CP850", "es_ES", "CP850", "es_PE", "CP850", "es_AR", "CP850",
+        "es_PR", "CP850", "es_VE", "CP850", "es_UY", "CP850", "es_PY", "CP850",
+        "et_EE", "CP775", "eu_ES", "CP850", "fa_IR", "CP720", "fi_FI", "CP850",
+        "fo_FO", "CP850", "fr_FR", "CP850", "fr_BE", "CP850", "fr_CA", "CP850",
+        "fr_LU", "CP850", "fr_MC", "CP850", "fr_CH", "CP850", "ga_IE", "CP437",
+        "gd_GB", "CP850", "gv_IM", "CP850", "gl_ES", "CP850", "he_IL", "CP862",
+        "hr_HR", "CP852", "hu_HU", "CP852", "id_ID", "CP850", "is_IS", "CP850",
+        "it_IT", "CP850", "it_CH", "CP850", "iv_IV", "CP437", "ja_JP", "CP932",
+        "kk_KZ", "CP866", "ko_KR", "CP949", "ky_KG", "CP866", "lt_LT", "CP775",
+        "lv_LV", "CP775", "mk_MK", "CP866", "mn_MN", "CP866", "ms_BN", "CP850",
+        "ms_MY", "CP850", "nl_BE", "CP850", "nl_NL", "CP850", "nl_SR", "CP850",
+        "nn_NO", "CP850", "nb_NO", "CP850", "pl_PL", "CP852", "pt_BR", "CP850",
+        "pt_PT", "CP850", "rm_CH", "CP850", "ro_RO", "CP852", "ru_RU", "CP866",
+        "sk_SK", "CP852", "sl_SI", "CP852", "sq_AL", "CP852", "sr_RS", "CP855",
+        "sr_RS", "CP852", "sv_SE", "CP850", "sv_FI", "CP850", "sw_KE", "CP437",
+        "th_TH", "CP874", "tr_TR", "CP857", "tt_RU", "CP866", "uk_UA", "CP866",
+        "ur_PK", "CP720", "uz_UZ", "CP866", "uz_UZ", "CP857", "vi_VN", "CP1258",
+        "wa_BE", "CP850", "zh_HK", "CP950", "zh_SG", "CP936"};
+
+	  const char *lcToAnsiTable[] = {
+	    "af_ZA", "CP1252", "ar_SA", "CP1256", "ar_LB", "CP1256", "ar_EG", "CP1256",
+	    "ar_DZ", "CP1256", "ar_BH", "CP1256", "ar_IQ", "CP1256", "ar_JO", "CP1256",
+	    "ar_KW", "CP1256", "ar_LY", "CP1256", "ar_MA", "CP1256", "ar_OM", "CP1256",
+	    "ar_QA", "CP1256", "ar_SY", "CP1256", "ar_TN", "CP1256", "ar_AE", "CP1256",
+	    "ar_YE", "CP1256","ast_ES", "CP1252", "az_AZ", "CP1251", "az_AZ", "CP1254",
+	    "be_BY", "CP1251", "bg_BG", "CP1251", "br_FR", "CP1252", "ca_ES", "CP1252",
+	    "zh_CN", "CP936",  "zh_TW", "CP950",  "kw_GB", "CP1252", "cs_CZ", "CP1250",
+	    "cy_GB", "CP1252", "da_DK", "CP1252", "de_AT", "CP1252", "de_LI", "CP1252",
+	    "de_LU", "CP1252", "de_CH", "CP1252", "de_DE", "CP1252", "el_GR", "CP1253",
+	    "en_AU", "CP1252", "en_CA", "CP1252", "en_GB", "CP1252", "en_IE", "CP1252",
+	    "en_JM", "CP1252", "en_BZ", "CP1252", "en_PH", "CP1252", "en_ZA", "CP1252",
+	    "en_TT", "CP1252", "en_US", "CP1252", "en_ZW", "CP1252", "en_NZ", "CP1252",
+	    "es_PA", "CP1252", "es_BO", "CP1252", "es_CR", "CP1252", "es_DO", "CP1252",
+	    "es_SV", "CP1252", "es_EC", "CP1252", "es_GT", "CP1252", "es_HN", "CP1252",
+	    "es_NI", "CP1252", "es_CL", "CP1252", "es_MX", "CP1252", "es_ES", "CP1252",
+	    "es_CO", "CP1252", "es_ES", "CP1252", "es_PE", "CP1252", "es_AR", "CP1252",
+	    "es_PR", "CP1252", "es_VE", "CP1252", "es_UY", "CP1252", "es_PY", "CP1252",
+	    "et_EE", "CP1257", "eu_ES", "CP1252", "fa_IR", "CP1256", "fi_FI", "CP1252",
+	    "fo_FO", "CP1252", "fr_FR", "CP1252", "fr_BE", "CP1252", "fr_CA", "CP1252",
+	    "fr_LU", "CP1252", "fr_MC", "CP1252", "fr_CH", "CP1252", "ga_IE", "CP1252",
+	    "gd_GB", "CP1252", "gv_IM", "CP1252", "gl_ES", "CP1252", "he_IL", "CP1255",
+	    "hr_HR", "CP1250", "hu_HU", "CP1250", "id_ID", "CP1252", "is_IS", "CP1252",
+	    "it_IT", "CP1252", "it_CH", "CP1252", "iv_IV", "CP1252", "ja_JP", "CP932",
+	    "kk_KZ", "CP1251", "ko_KR", "CP949", "ky_KG", "CP1251", "lt_LT", "CP1257",
+	    "lv_LV", "CP1257", "mk_MK", "CP1251", "mn_MN", "CP1251", "ms_BN", "CP1252",
+	    "ms_MY", "CP1252", "nl_BE", "CP1252", "nl_NL", "CP1252", "nl_SR", "CP1252",
+	    "nn_NO", "CP1252", "nb_NO", "CP1252", "pl_PL", "CP1250", "pt_BR", "CP1252",
+	    "pt_PT", "CP1252", "rm_CH", "CP1252", "ro_RO", "CP1250", "ru_RU", "CP1251",
+	    "sk_SK", "CP1250", "sl_SI", "CP1250", "sq_AL", "CP1250", "sr_RS", "CP1251",
+	    "sr_RS", "CP1250", "sv_SE", "CP1252", "sv_FI", "CP1252", "sw_KE", "CP1252",
+	    "th_TH", "CP874", "tr_TR", "CP1254", "tt_RU", "CP1251", "uk_UA", "CP1251",
+	    "ur_PK", "CP1256", "uz_UZ", "CP1251", "uz_UZ", "CP1254", "vi_VN", "CP1258",
+	    "wa_BE", "CP1252", "zh_HK", "CP950", "zh_SG", "CP936"};
+
+      int tableLen = sizeof(lcToOemTable) / sizeof(lcToOemTable[0]);
+      int lcLen = 0, i;
+
+      // Detect required code page name from current locale
+      char *lc = setlocale(LC_CTYPE, "");
+
+      if (lc && lc[0]) {
+        // Compare up to the dot, if it exists, e.g. en_US.UTF-8
+        for (lcLen = 0; lc[lcLen] != '.' && lc[lcLen] != ':' && lc[lcLen] != '\0'; ++lcLen);
+
+        for (i = 0; i < tableLen; i += 2)
+
+          if (strncmp(lc, (lcToOemTable[i]), lcLen) == 0) {
+
+            strncpy(OEM_CP, lcToOemTable[i + 1],
+              sizeof(OEM_CP));
+
+            strncpy(ANSI_CP, lcToAnsiTable[i + 1],
+              sizeof(ANSI_CP));
+
+            break;
+          }
+      }
+    }
+}
+
+/* Convert a string from one encoding to the current locale using iconv().
+ * Be as non-intrusive as possible. If error is encountered during covertion
+ * just leave the string intact. */
+static void charset_to_intern(char *string, char *from_charset)
+{
+    iconv_t cd;
+    char *s,*d, *buf;
+    size_t slen, dlen, buflen;
+    const char *local_charset;
+
+    if(*from_charset == '\0')
+        return;
+
+    buf = NULL;
+    local_charset = nl_langinfo(CODESET);
+
+    if((cd = iconv_open(local_charset, from_charset)) == (iconv_t)-1)
+        return;
+
+    slen = strlen(string);
+    s = string;
+
+    /*  Make sure OUTBUFSIZ + 1 never ends up smaller than FILNAMSIZ
+     *  as this function also gets called with G.outbuf in fileio.c
+     */
+    buflen = FILNAMSIZ;
+    if (OUTBUFSIZ + 1 < FILNAMSIZ)
+    {
+        buflen = OUTBUFSIZ + 1;
+    }
+
+    d = buf = malloc(buflen);
+    if(!d)
+        goto cleanup;
+
+    bzero(buf,buflen);
+    dlen = buflen - 1;
+
+    if(iconv(cd, &s, &slen, &d, &dlen) == (size_t)-1)
+        goto cleanup;
+    strncpy(string, buf, buflen);
+
+    cleanup:
+    free(buf);
+    iconv_close(cd);
+}
+
+/* Convert a string from OEM_CP to the current locale charset. */
+inline void oem_intern(char *string)
+{
+    charset_to_intern(string, OEM_CP);
+}
+
+/* Convert a string from ISO_CP to the current locale charset. */
+inline void iso_intern(char *string)
+{
+    charset_to_intern(string, ISO_CP);
+}
+
+/* Convert a string from ANSI_CP to the current locale charset. */
+inline void ansi_intern(char *string)
+{
+    charset_to_intern(string, ANSI_CP);
+}
--- a/unix/unxcfg.h
+++ b/unix/unxcfg.h
@@ -174,8 +174,8 @@ typedef struct stat z_stat;
 #endif
 #ifndef NO_SETLOCALE
 # if (!defined(NO_WORKING_ISPRINT) && !defined(HAVE_WORKING_ISPRINT))
-   /* enable "enhanced" unprintable chars detection in fnfilter() */
-#  define HAVE_WORKING_ISPRINT
+   /* disable "enhanced" unprintable chars detection in fnfilter() */
+#  define NO_WORKING_ISPRINT
 # endif
 #endif
 
@@ -228,4 +228,36 @@ typedef struct stat z_stat;
 /* wild_dir, dirname, wildname, matchname[], dirnamelen, have_dirname, */
 /*    and notfirstcall are used by do_wild().                          */
 
+
+#define MAX_CP_NAME 25
+
+#ifdef SETLOCALE
+#  undef SETLOCALE
+#endif
+#define SETLOCALE(category, locale) setlocale(category, locale)
+#include <locale.h>
+
+#ifdef _ISO_INTERN
+#  undef _ISO_INTERN
+#endif
+#define _ISO_INTERN(str1) iso_intern(str1)
+
+#ifdef _OEM_INTERN
+#  undef _OEM_INTERN
+#endif
+#ifndef IZ_OEM2ISO_ARRAY
+#  define IZ_OEM2ISO_ARRAY
+#endif
+#define _OEM_INTERN(str1) oem_intern(str1)
+
+#ifdef _ANSI_INTERN
+#  undef _ANSI_INTERN
+#endif
+#define _ANSI_INTERN(str1) ansi_intern(str1)
+
+void iso_intern(char *);
+void oem_intern(char *);
+void ansi_intern(char *);
+void init_conversion_charsets(void);
+
 #endif /* !__unxcfg_h */
--- a/unzip.c
+++ b/unzip.c
@@ -327,11 +327,21 @@ static ZCONST char Far ZipInfoUsageLine2[] = "\nmain\
   -2  just filenames but allow -h/-t/-z  -l  long Unix \"ls -l\" format\n\
                                          -v  verbose, multi-page format\n";
 
+#ifndef UNIX
 static ZCONST char Far ZipInfoUsageLine3[] = "miscellaneous options:\n\
   -h  print header line       -t  print totals for listed files or for all\n\
-  -z  print zipfile comment   -T  print file times in sortable decimal format\
-\n  -C  be case-insensitive   %s\
+  -z  print zipfile comment   -T  print file times in sortable decimal format\n\
+  -C  be case-insensitive     %s\
   -x  exclude filenames that follow from listing\n";
+#else /* UNIX */
+static ZCONST char Far ZipInfoUsageLine3[] = "miscellaneous options:\n\
+  -h  print header line       -t  print totals for listed files or for all\n\
+  -z  print zipfile comment   -T  print file times in sortable decimal format\n\
+  -C  be case-insensitive   %s\
+  -x  exclude filenames that follow from listing\n\
+  -O CHARSET  specify a character encoding for DOS, Windows and OS/2 archives\n\
+  -I CHARSET  specify a character encoding for UNIX and other archives\n";
+#endif /* !UNIX */
 #ifdef MORE
    static ZCONST char Far ZipInfoUsageLine4[] =
      "  -M  page output through built-in \"more\"\n";
@@ -664,6 +674,17 @@ modifiers:\n\
   -U  use escapes for all non-ASCII Unicode  -UU ignore any Unicode fields\n\
   -C  match filenames case-insensitively     -L  make (some) names \
 lowercase\n %-42s  -V  retain VMS version numbers\n%s";
+#elif (defined UNIX)
+static ZCONST char Far UnzipUsageLine4[] = "\
+modifiers:\n\
+  -n  never overwrite existing files         -q  quiet mode (-qq => quieter)\n\
+  -o  overwrite files WITHOUT prompting      -a  auto-convert any text files\n\
+  -j  junk paths (do not make directories)   -aa treat ALL files as text\n\
+  -U  use escapes for all non-ASCII Unicode  -UU ignore any Unicode fields\n\
+  -C  match filenames case-insensitively     -L  make (some) names \
+lowercase\n %-42s  -V  retain VMS version numbers\n%s\
+  -O CHARSET  specify a character encoding for DOS, Windows and OS/2 archives\n\
+  -I CHARSET  specify a character encoding for UNIX and other archives\n\n";
 #else /* !VMS */
 static ZCONST char Far UnzipUsageLine4[] = "\
 modifiers:\n\
@@ -802,6 +823,10 @@ int unzip(__G__ argc, argv)
 #endif /* UNICODE_SUPPORT */
 
 
+#ifdef UNIX
+    init_conversion_charsets();
+#endif
+
 #if (defined(__IBMC__) && defined(__DEBUG_ALLOC__))
     extern void DebugMalloc(void);
 
@@ -1335,6 +1360,11 @@ int uz_opts(__G__ pargc, pargv)
     argc = *pargc;
     argv = *pargv;
 
+#ifdef UNIX
+    extern char OEM_CP[MAX_CP_NAME];
+    extern int forcedCP;
+#endif
+
     while (++argv, (--argc > 0 && *argv != NULL && **argv == '-')) {
         s = *argv + 1;
         while ((c = *s++) != 0) {    /* "!= 0":  prevent Turbo C warning */
@@ -1516,6 +1546,37 @@ int uz_opts(__G__ pargc, pargv)
                     }
                     break;
 #endif  /* MACOS */
+#ifdef UNIX
+                case ('I'):
+                    if (negative) {
+                        Info(slide, 0x401, ((char *)slide,
+                          "error:  encodings can't be negated"));
+                        return(PK_PARAM);
+                    } else {
+                        if(*s) { /* Handle the -Icharset case */
+                            /* Assume that charsets can't start with a dash to spot arguments misuse */
+                            if(*s == '-') {
+                                Info(slide, 0x401, ((char *)slide,
+                                  "error:  a valid character encoding should follow the -I argument"));
+                                return(PK_PARAM);
+                            }
+                            strncpy(OEM_CP, s, sizeof(OEM_CP));
+                            forcedCP = 1;
+                        } else { /* -I charset */
+                            ++argv;
+                            if(!(--argc > 0 && *argv != NULL && **argv != '-')) {
+                                Info(slide, 0x401, ((char *)slide,
+                                  "error:  a valid character encoding should follow the -I argument"));
+                                return(PK_PARAM);
+                            }
+                            s = *argv;
+                            strncpy(OEM_CP, s, sizeof(OEM_CP));
+                            forcedCP = 1;
+                        }
+                        while(*(++s)); /* No params straight after charset name */
+                    }
+                    break;
+#endif /* ?UNIX */
                 case ('j'):    /* junk pathnames/directory structure */
                     if (negative)
                         uO.jflag = FALSE, negative = 0;
@@ -1591,6 +1652,37 @@ int uz_opts(__G__ pargc, pargv)
                     } else
                         ++uO.overwrite_all;
                     break;
+#ifdef UNIX
+                case ('O'):
+                    if (negative) {
+                        Info(slide, 0x401, ((char *)slide,
+                          "error:  encodings can't be negated"));
+                        return(PK_PARAM);
+                    } else {
+                        if(*s) { /* Handle the -Ocharset case */
+                            /* Assume that charsets can't start with a dash to spot arguments misuse */
+                            if(*s == '-') {
+                                Info(slide, 0x401, ((char *)slide,
+                                  "error:  a valid character encoding should follow the -I argument"));
+                                return(PK_PARAM);
+                            }
+                            strncpy(OEM_CP, s, sizeof(OEM_CP));
+                            forcedCP = 1;
+                        } else { /* -O charset */
+                            ++argv;
+                            if(!(--argc > 0 && *argv != NULL && **argv != '-')) {
+                                Info(slide, 0x401, ((char *)slide,
+                                  "error:  a valid character encoding should follow the -O argument"));
+                                return(PK_PARAM);
+                            }
+                            s = *argv;
+                            strncpy(OEM_CP, s, sizeof(OEM_CP));
+                            forcedCP = 1;
+                        }
+                        while(*(++s)); /* No params straight after charset name */
+                    }
+                    break;
+#endif /* ?UNIX */
                 case ('p'):    /* pipes:  extract to stdout, no messages */
                     if (negative) {
                         uO.cflag = FALSE;
@@ -2162,6 +2254,7 @@ static void help_extended(__G)
   "         ACORN_FTYPE_NFS] Translate filetype and append to name.",
   "  -i   [MacOS] Ignore filenames in MacOS extra field.  Instead, use name in",
   "         standard header.",
+  "  -I CHARSET  [UNIX] Specify a character encoding for UNIX and other archives.",
   "  -j   Junk paths and deposit all files in extraction directory.",
   "  -J   [BeOS] Junk file attributes.  [MacOS] Ignore MacOS specific info.",
   "  -K   [AtheOS, BeOS, Unix] Restore SUID/SGID/Tacky file attributes.",
@@ -2172,6 +2265,8 @@ static void help_extended(__G)
   "  -N   [Amiga] Extract file comments as Amiga filenotes.",
   "  -o   Overwrite existing files without prompting.  Useful with -f.  Use with",
   "         care.",
+  "  -O CHARSET  [UNIX] Specify a character encoding for DOS, Windows",
+  "                and OS/2 archives.",
   "  -P p Use password p to decrypt files.  THIS IS INSECURE!  Some OS show",
   "         command line to other users.",
   "  -q   Perform operations quietly.  The more q (as in -qq) the quieter.",
@@ -2264,6 +2359,9 @@ static void help_extended(__G)
   "        representing the Unicode character number of the character in hex.",
   "  -UU [UNICODE]  Disable use of any UTF-8 path information.",
   "  -z  Include archive comment if any in listing.",
+  "  -O CHARSET  [UNIX] Specify a character encoding for DOS, Windows",
+  "                and OS/2 archives.",
+  "  -I CHARSET  [UNIX] Specify a character encoding for UNIX and other archives.",
   "",
   "",
   "funzip stream extractor:",
--- a/unzpriv.h
+++ b/unzpriv.h
@@ -3012,16 +3012,42 @@ char    *GetLoadPath     OF((__GPRO));                              /* local */
  *
  * All other ports are assumed to code zip entry filenames in ISO 8859-1.
  */
+
+// 2024-05-25 Removed "|| (isuxatt)": actually we know nothing
+// about local system's codepage of PKZIP 2.51 UNIX users.
+// Also removed "_ISO_INTERN((string)); \":
+// Windows ANSI is not always 1252, also standard defines default
+// charset as CP437, not ISO 8859-1. But in fact most of packers
+// just used local system's charset, so without any charset translation
+// we will at least make such archives processed correctly
+// on the same system - Ivan Sorokin <un...@mail.ru>
+
+// Defined only on Unix for now
+#ifndef _ANSI_INTERN
+#define _ANSI_INTERN(str1) {}
+#endif
+#ifndef UNIX
+int forcedCP = 0;
+#else
+extern int forcedCP;
+#endif
+
 #ifndef Ext_ASCII_TO_Native
 #  define Ext_ASCII_TO_Native(string, hostnum, hostver, isuxatt, islochdr) \
-    if (((hostnum) == FS_FAT_ && \
-         !(((islochdr) || (isuxatt)) && \
+    if ((forcedCP) == 1) { \
+        _OEM_INTERN((string)); \
+    } else if ((hostnum) == FS_NTFS_ && (hostver) >= 20) { \
+        _ANSI_INTERN((string)); \
+    } else if (((hostnum) == FS_FAT_ && \
+         !((islochdr) && \
            ((hostver) == 25 || (hostver) == 26 || (hostver) == 40))) || \
         (hostnum) == FS_HPFS_ || \
-        ((hostnum) == FS_NTFS_ && (hostver) == 50)) { \
+        (hostnum) == FS_NTFS_) { \
         _OEM_INTERN((string)); \
-    } else { \
-        _ISO_INTERN((string)); \
+    } else if (((hostnum) == FS_FAT_ && \
+         ((islochdr) && \
+          ((hostver) == 25 || (hostver) == 26 || (hostver) == 40)))) { \
+        _ANSI_INTERN((string)); \
     }
 #endif
 
--- a/zipinfo.c
+++ b/zipinfo.c
@@ -457,6 +457,10 @@ int zi_opts(__G__ pargc, pargv)
     int    tflag_slm=TRUE, tflag_2v=FALSE;
     int    explicit_h=FALSE, explicit_t=FALSE;
 
+#ifdef UNIX
+    extern char OEM_CP[MAX_CP_NAME];
+    extern int forcedCP;
+#endif
 
 #ifdef MACOS
     uO.lflag = LFLAG;         /* reset default on each call */
@@ -501,6 +505,37 @@ int zi_opts(__G__ pargc, pargv)
                             uO.lflag = 0;
                     }
                     break;
+#ifdef UNIX
+                case ('I'):
+                    if (negative) {
+                        Info(slide, 0x401, ((char *)slide,
+                          "error:  encodings can't be negated"));
+                        return(PK_PARAM);
+                    } else {
+                        if(*s) { /* Handle the -Icharset case */
+                            /* Assume that charsets can't start with a dash to spot arguments misuse */
+                            if(*s == '-') {
+                                Info(slide, 0x401, ((char *)slide,
+                                  "error:  a valid character encoding should follow the -I argument"));
+                                return(PK_PARAM);
+                            }
+                            strncpy(OEM_CP, s, sizeof(OEM_CP));
+                            forcedCP = 1;
+                        } else { /* -I charset */
+                            ++argv;
+                            if(!(--argc > 0 && *argv != NULL && **argv != '-')) {
+                                Info(slide, 0x401, ((char *)slide,
+                                  "error:  a valid character encoding should follow the -I argument"));
+                                return(PK_PARAM);
+                            }
+                            s = *argv;
+                            strncpy(OEM_CP, s, sizeof(OEM_CP));
+                            forcedCP = 1;
+                        }
+                        while(*(++s)); /* No params straight after charset name */
+                    }
+                    break;
+#endif /* ?UNIX */
                 case 'l':      /* longer form of "ls -l" type listing */
                     if (negative)
                         uO.lflag = -2, negative = 0;
@@ -521,6 +556,37 @@ int zi_opts(__G__ pargc, pargv)
                         G.M_flag = TRUE;
                     break;
 #endif
+#ifdef UNIX
+                case ('O'):
+                    if (negative) {
+                        Info(slide, 0x401, ((char *)slide,
+                          "error:  encodings can't be negated"));
+                        return(PK_PARAM);
+                    } else {
+                        if(*s) { /* Handle the -Ocharset case */
+                            /* Assume that charsets can't start with a dash to spot arguments misuse */
+                            if(*s == '-') {
+                                Info(slide, 0x401, ((char *)slide,
+                                  "error:  a valid character encoding should follow the -I argument"));
+                                return(PK_PARAM);
+                            }
+                            strncpy(OEM_CP, s, sizeof(OEM_CP));
+                            forcedCP = 1;
+                        } else { /* -O charset */
+                            ++argv;
+                            if(!(--argc > 0 && *argv != NULL && **argv != '-')) {
+                                Info(slide, 0x401, ((char *)slide,
+                                  "error:  a valid character encoding should follow the -O argument"));
+                                return(PK_PARAM);
+                            }
+                            s = *argv;
+                            strncpy(OEM_CP, s, sizeof(OEM_CP));
+                            forcedCP = 1;
+                        }
+                        while(*(++s)); /* No params straight after charset name */
+                    }
+                    break;
+#endif /* ?UNIX */
                 case 's':      /* default:  shorter "ls -l" type listing */
                     if (negative)
                         uO.lflag = -2, negative = 0;

<<attachment: _________________________.htm.zip>>

Bug#779207: patch submission for improving code pages support in unzip

Reply via email to