When using terminal software on non-OpenBSD systems to connect to my OpenBSD
IRC machine, I noticed that the local terminal sometimes disagrees with the
remote tmux and application (in this case, irssi) about the character width of
some lines, causing various kinds of breakage. Those lines happened to contain
soft hyphens (U+00AD), which behave as follows across a few different operating
systems:

OpenBSD-CURRENT:        iswprint(SHY) = 1 iswcntrl(SHY) = 1 wcwidth(SHY) = 0
NetBSD 9.1:             iswprint(SHY) = 1 iswcntrl(SHY) = 0 wcwidth(SHY) = 1
FreeBSD 12.2:           iswprint(SHY) = 0 iswcntrl(SHY) = 1 wcwidth(SHY) = -1
glibc (Debian sid):     iswprint(SHY) = 1 iswcntrl(SHY) = 0 wcwidth(SHY) = 1
musl (Alpine 3.13.3):   iswprint(SHY) = 1 iswcntrl(SHY) = 0 wcwidth(SHY) = 1

On Windows, PowerShell, PuTTY and MinTTY (shipped with the default install of
git from git-scm.com as part of MSYS2) render the soft hyphen as a visible
character with a width of 1 column.

The OpenBSD wcwidth(SHY) of 0 is what the problem comes down to (FreeBSD's
return values are also strange, but this is an OpenBSD list). There is a lot of
background discussion about whether Unicode intends the SHY to be printable, and
whether it should have a width of 0 or 1, in e.g. [0] and [1], but for better or
worse, it seems most other systems decided that SHY has a width of 1 and should
be a visible character (at least in terminal contexts).

Therefore, in the interest of interoperability, I propose the following diff to
special-case SHY into having a width of 1. I don't intend to go down the rabbit
hole of a discussion regarding what the 'correct' width is, but the discrepancy
with other systems causes real problems, and I think those other systems made
their decisions years ago (see e.g. [0] for glibc).

The diff below is only for gen_ctype_utf8.pl; I am not including the resulting
en_US.UTF-8.src diff, because it seems there is a Unicode 12.1.0 to 13.0.0
update that happens on regeneration of that file, and that is orthogonal to this
change (essentially: [2], which has not been committed yet).

[0]: https://sourceware.org/bugzilla/show_bug.cgi?id=22073
[1]: https://jkorpela.fi/shy.html
[2]: https://marc.info/?l=openbsd-tech&m=161534047428793&w=2

diff --git a/share/locale/ctype/gen_ctype_utf8.pl b/share/locale/ctype/gen_ctype_utf8.pl
index e23472efb2c..c593dc628ee 100755
--- a/share/locale/ctype/gen_ctype_utf8.pl
+++ b/share/locale/ctype/gen_ctype_utf8.pl
@@ -404,6 +404,9 @@ sub codepoint_columns
 
        # Several fonts provide glyphs in this range
        return 1 if $code >= 0xe000 and $code <= 0xf8ff;
+       # Soft hyphen (SHY) is in category Cf, which implies width 0, but since
+       # it is width 1 in nearly every other environment, set it here.
+       return 1 if $code == 0x00ad;
 
        return 0 if $charinfo->{category} eq 'Mn';
        return 0 if $charinfo->{category} eq 'Me';

-- 
Lauri Tirkkonen | lotheac @ IRCnet
