GNU Bash 4.4 Test Discrepancy on OpenVMS

Eric W. Robertson Fri, 07 Oct 2016 13:43:58 -0700

While building and testing GNU Bash 4.4 on OpenVMS, the GNU Bash test script reported the following difference between the output produced by Bash on OpenVMS and the reference output for the test sub-script tests/exp8.sub (lines 28-31):


unset array
declare -A array
array=( [$'x\001y\177z']=$'a\242b\002c' )
echo ${array[@]@A}

Currently, the reference result expected for ALL platform implementations for the above sequence of Bash test commands is embodied in tests/exp.right (line 236):


declare -A array=([$'x\001y\177z']=$'a\242b\002c' )

On OpenVMS, the following output is generated instead:

declare -A array=([$'x\001y\177z']=$'a¢b\002c' )

After studying the applicable sections of the relevant ISO and POSIX standards and inspecting Bash's execution within the OpenVMS Debugger, I have come to the conclusion that this difference arises out of an implementation dependent difference with respect to the locale dependent characteristics of characters in the C/POSIX locale. The relevant ISO and POSIX standards explicitly DO NOT specify any particular requirements of the C/POSIX locale regarding locale dependent characteristics for character codes outside of the Portable Character Set (PCS). Therefore, any programmed behavior relying on locale dependent characteristics is subject to implementation differences with respect to character codes in the context of the C/POSIX locale lying outside of PCS.

Using the OpenVMS Debugger, it became apparent that the expansion of the shell variable "array" ultimately results in a call to the function ansic_quote() (located within source module lib/sh/strtrans.c). The relevant excerpt from this function is:


  for (s = str; c = *s; s++)
    {
      b = l = 1;        /* 1 == add backslash; 0 == no backslash */
      clen = 1;

      switch (c)
        {
        case ESC: c = 'E'; break;
#ifdef __STDC__
        case '\a': c = 'a'; break;
        case '\v': c = 'v'; break;
#else
        case 0x07: c = 'a'; break;
        case 0x0b: c = 'v'; break;
#endif

        case '\b': c = 'b'; break;
        case '\f': c = 'f'; break;
        case '\n': c = 'n'; break;
        case '\r': c = 'r'; break;
        case '\t': c = 't'; break;
        case '\\':
        case '\'':
          break;
        default:
#if defined (HANDLE_MULTIBYTE)
          b = is_basic (c);
          /* XXX - clen comparison to 0 is dicey */
          if ((b == 0 && ((clen = mbrtowc (&wc, s, MB_CUR_MAX, 0)) < 0 || MB_INVALIDCH (clen) || iswprint (wc) == 0)) ||
              (b == 1 && ISPRINT (c) == 0))
#else
          if (ISPRINT (c) == 0)
#endif
            {
              *r++ = '\\';
              *r++ = TOCHAR ((c >> 6) & 07);
              *r++ = TOCHAR ((c >> 3) & 07);
              *r++ = TOCHAR (c & 07);
              continue;
            }
          l = 0;
          break;
        }
      if (b == 0 && clen == 0)
        break;

      if (l)
        *r++ = '\\';

      if (clen == 1)
        *r++ = c;
      else
        {
          for (b = 0; b < (int)clen; b++)
            *r++ = (unsigned char)s[b];
          s += clen - 1;    /* -1 because of the increment above */
        }
    }

In the case of the Bash build for OpenVMS, the macro HANDLE_MULTIBYTE is defined by the Bash configure script. That being the case, it is apparent from the above code excerpt that the decision to quote or not to quote a particular character code in the expanded string is determined by the results of the functions is_basic(), mbrtowc(), iswprint(), and isprint() (indirectly through macro expansion of the ISPRINT() function macro). The is_basic() function seems to be coded in such a way that it will return homogeneous results across platform implementations. However, the results of all of the remaining functions are locale dependent. Therefore, for character codes outside of PCS, the ANSI C quoting of the expanded string is ultimately implementation dependent.
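To make the dependency concrete, here is a minimal sketch of the simpler, non-multibyte form of that test (the `ISPRINT (c) == 0` branch). The helper name quote_byte() is hypothetical and is not part of Bash; it only assumes the standard isprint() function and reproduces the three-digit octal escape from the excerpt.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper, not part of Bash: write the quoted form of one byte
   into out (room for at least 5 chars) and return the number of characters
   written, not counting the terminating NUL. */
static size_t
quote_byte (unsigned char c, char *out)
{
  size_t n = 0;

  assert (out != NULL);
  if (isprint (c) == 0)
    {
      /* The current locale says "not printable": emit \ooo in octal. */
      out[n++] = '\\';
      out[n++] = (char) ('0' + ((c >> 6) & 07));
      out[n++] = (char) ('0' + ((c >> 3) & 07));
      out[n++] = (char) ('0' + (c & 07));
    }
  else
    out[n++] = (char) c;        /* "printable": copy the byte unchanged */
  out[n] = '\0';
  return n;
}
```

For a PCS control code such as 0x01 this yields \001 everywhere. For octal 242 it yields either \242 or the raw byte, depending on what the C library's isprint() reports for that value in the current locale, which corresponds to the two outputs shown above.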

Since the octal character code 242 that is used in defining the value for the "array" shell variable is clearly outside of PCS, the result of expanding the shell variable value in this case cannot be guaranteed to be homogeneous for all platform implementations. But that is currently the way both the test script and the reference results are posed.
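With HANDLE_MULTIBYTE defined, and assuming is_basic() returns 0 for octal 242, the decision for that byte goes through mbrtowc() and iswprint(). The following probe is a rough sketch of that path for a single byte in the "C" locale. The function name is hypothetical, and unlike the excerpt it passes an explicit mbstate_t and a length of 1.

```c
#include <assert.h>
#include <locale.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical probe, not part of Bash: returns 1 if the byte converts to a
   printable wide character in the "C" locale (so it would be copied through
   unquoted), 0 if conversion fails or the result is not printable (so it
   would be octal-escaped).  Note that this switches the process to the "C"
   locale and does not restore the previous one. */
static int
passes_multibyte_check (unsigned char byte)
{
  char s[2];
  wchar_t wc = 0;
  mbstate_t state;
  size_t clen;

  s[0] = (char) byte;
  s[1] = '\0';
  memset (&state, 0, sizeof state);
  setlocale (LC_ALL, "C");

  clen = mbrtowc (&wc, s, 1, &state);
  if (clen == (size_t) -1 || clen == (size_t) -2)
    return 0;                   /* invalid or incomplete sequence */
  assert (clen <= 1);           /* only one byte was offered */
  return iswprint ((wint_t) wc) != 0;
}
```

Calling passes_multibyte_check (0xA2) on different platforms is one way to see which side of the discrepancy a given C library falls on. The standards do not fix the answer for this value.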

This naturally prompts a couple of questions: Is this in fact a bug? Further, if it is a bug, precisely where is the bug? Given what I know at the moment, my own answer to these questions is that if it is a bug, the bug is in the test script and its corresponding reference results, which are not posed to handle platform implementation differences that the applicable standards explicitly permit in the context of the C/POSIX locale and character codes outside of PCS. However, I cannot be entirely certain of this conclusion because the exp8.sub script does not contain explicit commentary on what the precise motivation is behind the above sequence of Bash test commands and what particular significance (if any) the octal character code 242 is supposed to have relative to the goal of this particular sequence of Bash test commands. So, I will leave it to the Bash experts to make a final, authoritative determination with respect to this Bash test discrepancy.

While investigating this test discrepancy with Bash 4.4 on OpenVMS, I came across another potential source code bug relating to the expansion of the ISPRINT() function macro. The expansion of the ISPRINT() function macro is, in turn, partially dependent on the expansion of the IN_CTYPE_DOMAIN() function macro. In the source code module include/chartypes.h, the function macro IN_CTYPE_DOMAIN() does not seem to be correctly defined for platforms not providing the isascii() function. Given the normative definition of the isascii() function in "The Open Group Base Specifications Issue 7 (IEEE Std 1003.1-2008) 2016 Edition", the current definition of the IN_CTYPE_DOMAIN() function macro (as the literal constant expression 1) is unlikely to result in any close approximation of correct behavior for most platforms not implementing the isascii() function. Instead, I believe the IN_CTYPE_DOMAIN() function macro would be better defined as follows:


#if STDC_HEADERS || (!defined (isascii) && !HAVE_ISASCII)
#  define IN_CTYPE_DOMAIN(c) ((c & (((int)-1)<<7)) == 0)
#else
#  define IN_CTYPE_DOMAIN(c) isascii(c)
#endif

For platforms that do not implement the isascii() function, the above definition for the IN_CTYPE_DOMAIN() function macro is more likely to produce correct behavior than its current definition in the Bash 4.4 release.
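One caveat on the expression itself: left-shifting the negative value (int)-1 is undefined behavior in ISO C, and the macro argument is not parenthesized. A possible variant that tests the same range (0 through 127) without the shift is sketched below. The names IN_ASCII_RANGE and ascii_isprint are hypothetical and chosen only for this illustration.

```c
#include <assert.h>
#include <ctype.h>

/* Hypothetical name: nonzero only for values 0..127, which is the range
   POSIX isascii() accepts.  No shift of a negative value is involved. */
#define IN_ASCII_RANGE(c) ((((int) (c)) & ~0x7f) == 0)

/* Guarded isprint(): values outside 0..127 are reported as not printable
   and are never handed to <ctype.h>. */
static int
ascii_isprint (int c)
{
  if (!IN_ASCII_RANGE (c))
    return 0;
  assert (c >= 0 && c <= 127);  /* within the domain isprint() requires */
  return isprint (c) != 0;
}
```

With a guard of this kind, octal 242 would always be treated as non-printable on the isprint() path, regardless of what the platform's C locale says about it.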

As always, any additional wisdom and/or feedback that can be provided regarding the above is greatly appreciated.


Thanks,

Eric
