While building and testing GNU Bash 4.4 on OpenVMS, the Bash test suite
reported the following difference between the output produced by Bash on
OpenVMS and the reference output for the test sub-script tests/exp8.sub
(lines 28 - 31):
unset array
declare -A array
array=( [$'x\001y\177z']=$'a\242b\002c' )
echo ${array[@]@A}
Currently, the reference result expected for ALL platform
implementations for the above sequence of Bash test commands is embodied
in tests/exp.right (line 236):
declare -A array=([$'x\001y\177z']=$'a\242b\002c' )
On OpenVMS, the following output is generated instead:
declare -A array=([$'x\001y\177z']=$'a¢b\002c' )
After studying the applicable sections of the relevant ISO and POSIX
standards and inspecting Bash's execution within the OpenVMS Debugger, I
have come to the conclusion that this difference arises from an
implementation-dependent difference in the locale-dependent
characteristics of characters in the C/POSIX locale. The relevant ISO
and POSIX standards explicitly DO NOT specify any particular
requirements of the C/POSIX locale regarding locale-dependent
characteristics for character codes outside of the Portable Character
Set (PCS). Therefore, any programmed behavior relying on
locale-dependent characteristics of character codes outside of PCS is
subject to implementation differences, even in the C/POSIX locale.
Using the OpenVMS Debugger, it became apparent that the expansion of the
shell variable "array" ultimately results in a call to the function
ansic_quote() (located within source module lib/sh/strtrans.c). The
relevant excerpt from this function is:
for (s = str; c = *s; s++)
  {
    b = l = 1; /* 1 == add backslash; 0 == no backslash */
    clen = 1;
    switch (c)
      {
      case ESC: c = 'E'; break;
#ifdef __STDC__
      case '\a': c = 'a'; break;
      case '\v': c = 'v'; break;
#else
      case 0x07: c = 'a'; break;
      case 0x0b: c = 'v'; break;
#endif
      case '\b': c = 'b'; break;
      case '\f': c = 'f'; break;
      case '\n': c = 'n'; break;
      case '\r': c = 'r'; break;
      case '\t': c = 't'; break;
      case '\\':
      case '\'':
        break;
      default:
#if defined (HANDLE_MULTIBYTE)
        b = is_basic (c);
        /* XXX - clen comparison to 0 is dicey */
        if ((b == 0 && ((clen = mbrtowc (&wc, s, MB_CUR_MAX, 0)) < 0 ||
                        MB_INVALIDCH (clen) || iswprint (wc) == 0)) ||
            (b == 1 && ISPRINT (c) == 0))
#else
        if (ISPRINT (c) == 0)
#endif
          {
            *r++ = '\\';
            *r++ = TOCHAR ((c >> 6) & 07);
            *r++ = TOCHAR ((c >> 3) & 07);
            *r++ = TOCHAR (c & 07);
            continue;
          }
        l = 0;
        break;
      }
    if (b == 0 && clen == 0)
      break;
    if (l)
      *r++ = '\\';
    if (clen == 1)
      *r++ = c;
    else
      {
        for (b = 0; b < (int)clen; b++)
          *r++ = (unsigned char)s[b];
        s += clen - 1; /* -1 because of the increment above */
      }
  }
In the case of the Bash build for OpenVMS, the macro HANDLE_MULTIBYTE is
defined by the Bash configure script. That being the case, it is
apparent from the above code excerpt that the decision to quote or not
to quote a particular character code in the expanded string is
determined by the results of the functions is_basic(), mbrtowc(),
iswprint(), and isprint() (the last indirectly, through expansion of the
ISPRINT() function macro). The is_basic() function seems to be coded in
such a way that it will return homogeneous results across platform
implementations. However, the results of all the remaining functions are
locale dependent. Therefore, for character codes outside of PCS, the
ANSI C quoting of the expanded string is ultimately implementation
dependent.
Since the octal character code 242 that is used in defining the value
for the "array" shell variable is clearly outside of PCS, the result of
expanding the shell variable value in this case cannot be guaranteed to
be homogeneous for all platform implementations. Yet that is how both
the test script and the reference results are currently posed.
This naturally prompts a couple of questions: Is this in fact a bug?
Further, if it is a bug, precisely where is the bug? Given what I know
at the moment, my own answer to these questions is that if it is a bug,
the bug is in the test script and its corresponding reference results,
which are not posed to handle platform implementation differences that
the applicable standards explicitly permit for character codes outside
of PCS in the C/POSIX locale. However, I cannot be entirely
certain of this conclusion because the exp8.sub script does not contain
explicit commentary on what the precise motivation is behind the above
sequence of Bash test commands and what particular significance (if any)
the octal character code 242 is supposed to have relative to the goal of
this particular sequence of Bash test commands. So, I will leave it to
the Bash experts to make a final, authoritative determination with
respect to this Bash test discrepancy.
While investigating this test discrepancy with Bash 4.4 on OpenVMS, I
came across another potential source code bug relating to the expansion
of the ISPRINT() function macro. The expansion of the ISPRINT() function
macro is, in turn, partially dependent on the expansion of the
IN_CTYPE_DOMAIN() function macro. In the source code module
include/chartypes.h, the function macro IN_CTYPE_DOMAIN() does not seem
to be correctly defined for platforms not providing the isascii()
function. Given the normative definition of the isascii() function in
"The Open Group Base Specifications Issue 7 (IEEE Std 1003.1-2008) 2016
Edition", the current definition of the IN_CTYPE_DOMAIN() function macro
(as the literal constant expression 1) is unlikely to result in any
close approximation of correct behavior for most platforms not
implementing the isascii() function. Instead, I believe the
IN_CTYPE_DOMAIN() function macro would be better defined as follows:
#if STDC_HEADERS || (!defined (isascii) && !HAVE_ISASCII)
# define IN_CTYPE_DOMAIN(c) (((c) & ~0x7f) == 0)
#else
# define IN_CTYPE_DOMAIN(c) isascii(c)
#endif
For platforms that do not implement the isascii() function the above
definition for the IN_CTYPE_DOMAIN() function macro is more likely to
produce correct behavior than its current definition in the Bash 4.4
release.
As always, any additional wisdom and/or feedback that can be provided
regarding the above is greatly appreciated.
Thanks,
Eric