On 10/17/24 6:44 PM, Greg Wooledge wrote:
> This issue came up on the Libera #bash IRC channel today:
>
> Between bash 4.4 and 5.0, the definition of "IFS whitespace" has apparently
> been expanded:

POSIX defines whitespace as a character in the current locale's `space'
character class, or a byte for which isspace() returns true. The word
splitting section references this definition, but leaves it up to the
application whether or not characters besides space/tab/newline are
considered IFS whitespace when they appear in $IFS.

At the time (the previous edition of the standard), POSIX defined whitespace
as "In the POSIX locale, white space consists of one or more <blank>
(<space> and <tab> characters), <newline>, <carriage-return>, <form-feed>,
and <vertical-tab> characters." The word splitting section wasn't quite
as rigorous as the current version's, but it referenced this definition.

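However, the conformance suite tests for this: it expects those additional
whitespace characters to be treated as IFS whitespace when they appear in
$IFS.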

Before bash-5.0, Oracle contacted me about the results of their running
bash-4.4 through the conformance suite (they were considering shipping
the next version of Solaris with bash as the POSIX shell and wanted it
to pass the tests). Now, I had not run bash through this test suite
myself -- that came later -- so I took them at their word.

There were a couple of `read' tests for exactly this, including making
sure that leading and trailing whitespace got stripped if the
(non-space/tab/newline) characters were in $IFS.
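
As a rough illustration of what such a `read' test looks at (this is a
sketch, not the conformance suite's actual test; count_ff_fields is a
made-up helper name), one can count the fields `read' produces when form
feed is the only character in $IFS. Going by the description in this
message, the expectation is 4 fields where form feed is an ordinary
delimiter and 2 where it is IFS whitespace; the result may also depend on
locale and shell mode, so treat the mapping to versions as an assumption.

```bash
#!/usr/bin/env bash
# Sketch only: count_ff_fields is a hypothetical helper, not part of bash.
# It splits a line that has leading, doubled, and trailing form feeds,
# using form feed as the only IFS character.
count_ff_fields() {
    local input=$'\fa\f\fb\f'
    local -a fields
    IFS=$'\f' read -r -a fields <<< "$input"
    printf '%s\n' "${#fields[@]}"
}

case $(count_ff_fields) in
    2) echo "form feed treated as IFS whitespace (stripped and collapsed)" ;;
    4) echo "form feed treated as a non-whitespace delimiter (empty fields kept)" ;;
    *) echo "unexpected field count" ;;
esac
```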

So I changed it -- the test suite, something that companies have to pay
to take and want to pass, was supposed to reflect the normative text --
and shipped bash-5.0. Oracle was happy, this was a minor change that
affected few people, and then Oracle canceled Solaris 12 and decided to
stick with Solaris 11 forever.


> In bash 4.4 and earlier, IFS whitespace is always space + tab + newline.
> But in 5.0 and later, it's "whatever the locale's isspace() allows",

Yep, as POSIX specifies.
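
A quick way to see which of the POSIX-locale whitespace characters a given
shell treats as IFS whitespace is to check whether a doubled delimiter
collapses during word splitting. This is only a sketch:
is_ifs_whitespace is a hypothetical helper, it assumes the probed
character is not `x' or `y', and the results for carriage return, form
feed, and vertical tab will vary with the bash version and possibly the
locale.

```bash
#!/usr/bin/env bash
# Sketch only: is_ifs_whitespace is a hypothetical helper, not a builtin.
# Returns 0 if a run of two copies of the given character splits "x..y"
# into exactly two fields, which is how IFS whitespace behaves.
# A non-whitespace delimiter yields an empty middle field (three fields).
# The body runs in a subshell so IFS and the positional parameters of the
# caller are left alone.
is_ifs_whitespace() (
    IFS=$1
    s="x$1$1y"
    set -- $s
    [ "$#" -eq 2 ]
)

for name in space tab newline cr ff vt; do
    case $name in
        space)   ch=' '   ;;
        tab)     ch=$'\t' ;;
        newline) ch=$'\n' ;;
        cr)      ch=$'\r' ;;
        ff)      ch=$'\f' ;;
        vt)      ch=$'\v' ;;
    esac
    if is_ifs_whitespace "$ch"; then
        echo "$name: IFS whitespace"
    else
        echo "$name: non-whitespace delimiter"
    fi
done
```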

> along with some kind of 0x00 to 0x7f range check (thanks emanuele6).

The comment in locale_setblanks explains this: some systems, like macOS,
return true from isspace() for characters between 0x80 and 0xff even
though they introduce multibyte characters (every locale besides "C"
in macOS uses UTF-8 encoding). Grisha reported it:

https://lists.gnu.org/archive/html/bug-bash/2023-05/msg00132.html


> Now, that's not necessarily bad, but the man page still says:

Yes, the man page and info file need to evolve in the same manner as
the standard: define whitespace and then reference it as needed.

--
``The lyf so short, the craft so long to lerne.'' - Chaucer
                 ``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, UTech, CWRU    c...@case.edu    http://tiswww.cwru.edu/~chet/
