Hi Philipp,

Philipp Buehler wrote on Wed, May 06, 2020 at 04:03:41PM +0200:
> On 06.05.2020 15:54, Ingo Schwarze wrote:

>> Your misunderstanding is that file names consist of characters.
>> They do not.  They consist of bytes, and to match two bytes,
>> you need two question marks.

> One can hold for the OP; the ksh(1) manpage talks about
> "characters" in 'File name patterns' throughout.
> 
> Just two bytes ;-)
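
To make the "two question marks" point concrete, here is a minimal
sketch.  It does not use ksh(1) at all; it assumes that Python's
fnmatch module, applied to byte strings, is a fair stand-in for
byte-oriented pattern matching.  The helper name glob_bytes is made
up for this illustration.

```python
import fnmatch

# "ä" is one character to a human, but two bytes (0xC3 0xA4) in UTF-8.
name = "ä".encode("utf-8")


def glob_bytes(pattern: bytes, candidate: bytes) -> bool:
    """Match a byte-string pattern against a byte-string name.

    With bytes arguments, fnmatch treats "?" as "exactly one byte".
    This is an analogy to byte-wise globbing, not ksh(1) itself.
    """
    return fnmatch.fnmatchcase(candidate, pattern)


print(len(name))                 # 2
print(glob_bytes(b"?", name))    # False: one "?" covers only one byte
print(glob_bytes(b"??", name))   # True: two bytes need two "?"
```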

I guess that is because ksh(1) - both the program and the manual
page - predate the idea of multi-byte characters.  The ksh(1) manual
page uses the term "character" throughout when talking about bytes,
not just when talking about globbing.  That becomes clear at various
places, for example:

  [words] which are sequences of characters, are delimited by
  unquoted whitespace characters (space, tab, and newline) or ...
   --> obviously, non-ASCII whitespace is not considered here

  A parameter name is either one of the special single punctuation
  or digit character parameters described below ...
   --> obviously, non-ASCII digits are not considered here

  PS1 [...]   \nnn   The octal character nnn.
   --> obviously, the shell assumes there are at most 512 characters

Even more clearly, the subsection "File name patterns" says:

  alnum   cntrl   lower   space
  [..]
  These match characters using the macros specified in isalnum(3),
  isalpha(3), and so on.
   --> which explicitly says that "character" refers to single-byte
       characters

This is also fairly explicit:

  vi-show8  Prefix characters with the eighth bit set with "M-".
            If this option is not set, characters in the range 128-160
            are printed as is, which may cause problems.

  string > string  Strings compare greater than based on the
                   ASCII value of their characters.

Admittedly, there is a very small number of cases where our
ksh(1) actually does handle UTF-8 multi-byte characters:

     backward-char: [n] ^B, ^X^D
             Moves the cursor backward n characters.

     delete-char-backward: [n] ERASE, ^?, ^H
             Deletes n characters before the cursor.

     delete-char-forward: [n] Delete
             Deletes n characters after the cursor.

     forward-char: [n] ^F, ^XC
             Moves the cursor forward n characters.

There are also cases where it might make sense to handle UTF-8,
but currently characters are just bytes, for example:

     transpose-chars: ^T
             If at the end of line, or if the gmacs option is set, this
             exchanges the two previous characters; otherwise, it exchanges
             the previous and current characters and moves the cursor one
             character to the right.

I admit those few cases where UTF-8 is handled in a best-effort
manner aren't explained in the manual.  They only affect command
line use, not the shell programming language.
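
As a rough illustration of what moving backward by one character
rather than one byte involves, here is a minimal sketch.  It is not
the code in our ksh(1); the function name prev_char_start is
hypothetical, and the sketch assumes the line buffer holds valid
UTF-8.

```python
def prev_char_start(buf: bytes, pos: int) -> int:
    """Return the byte index where the character before pos starts.

    UTF-8 continuation bytes look like 10xxxxxx, so stepping back one
    character means skipping those until a start byte is reached.
    """
    if pos <= 0:
        return 0
    pos -= 1
    while pos > 0 and (buf[pos] & 0xC0) == 0x80:
        pos -= 1
    return pos


line = "naïve".encode("utf-8")   # 6 bytes: n a 0xC3 0xAF v e

# Cursor sits on "v" (byte index 4).  One character back is "ï",
# which starts at byte index 2, not at index 3.
print(prev_char_start(line, 4))  # 2
print(prev_char_start(line, 2))  # 1
```

A purely byte-wise editor would move from index 4 to index 3 and
land in the middle of the two-byte sequence.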

Also, the ksh(1) manual is far from alone in tacitly assuming that
characters are single-byte characters.  Consider manual pages like
cat(1), col(1), dd(1), diff(1), dig(1), expr(1), hexdump(1), join(1),
jot(1), patch(1), chdir(2), printf(3), strchr(3), strlcpy(3), etc.

When utilities specifically support multibyte characters, the
respective manual pages usually say so; consider colrm(1), column(1),
cut(1), fmt(1), fold(1), ls(1), mandoc(1), mbtowc(3), wcslen(3),
wprintf(3), etc.

It is unfortunate that the term "character" was first defined as "char",
large bodies of documentation were written, and then it was later
redefined to sometimes mean "wide character" and sometimes "multibyte
character" (which are two different concepts).
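
The three meanings can be told apart by counting.  The following is
only an analogy under stated assumptions: a Python bytes object
stands in for an array of char, a Python str stands in for a wide
character string, and the encoding is assumed to be UTF-8.

```python
text = "ä€"                      # two characters as a human reads them
encoded = text.encode("utf-8")   # the multibyte representation

# Counting bytes, roughly what strlen(3) would report.
byte_count = len(encoded)

# Counting code points, roughly what wcslen(3) would report.
wide_count = len(text)

# Each multibyte character occupies a different number of bytes.
bytes_per_char = [len(c.encode("utf-8")) for c in text]

print(byte_count)       # 5
print(wide_count)       # 2
print(bytes_per_char)   # [2, 3]
```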

I don't have a good solution.  Sometimes, it is possible to explicitly
use the terms "single-byte character", "wide character", and "multi-byte
character", but i'm not convinced it would be a good idea to dig through
all our manual pages and consistently use these three terms everywhere.

It might not become too much of a digression in a very simple page
like strlen(3), but i'm not so sure about a page that is already
long and complicated, like ksh(1).

Yours,
  Ingo
