Hi Alex,

At 2022-08-25T21:08:17+0200, Alejandro Colomar wrote:
> The following code (found in regex.7) wants to represent an 'o' with a
> '^' on top of it (ô).  Is that code correct?

The exhibit, extracted, is this:

    \o'o\(ha'

This means "overstrike the glyphs for 'o' and the 'ha' special
character". The 'ha' special character is described variously as a
caret, circumflex accent, or "hat", and it corresponds to 0x5E in the
ASCII/ISO 8859/Unicode character sets.

The answer to your question is...

Yes and no.

> It's working on the PDF (although it's ugly), but not on the terminal.
> It was changed by a commit that changed ^ by \(ha for compatibility,
> but I'm not sure if that's correct in this specific case.

That change is not really material to the underlying problem. It's even
arguably wrong, since the troff semantics of '^' are that it's a
combining character (John Gardner, are you listening?).

Let me go through the surface problem and then I'll get to the deeper,
worse ones.

In many circumstances there won't be a difference between ^ and \(ha
anyway; if the output device doesn't have distinct glyphs for a "full"
or "spacing" circumflex and a "combining circumflex accent", then
there's no difference here. Furthermore, the fervent street preachers of
the man page "I type ASCII and by God what I should get is ASCII"
religion will hack up man.local or even patch the formatter to force a
"full, spacing" circumflex to be output when "^" occurs in man page
sources.

So, when you overstrike these two characters on a device that supports
overstriking, you'll probably get an ugly oversized "^" that intersects
the "o" below it.

That gets us to problem layer two: devices that support overstriking.

Video terminals and their emulators don't. It's been decades since
paper terminals like the Teletype Model 37, where overstriking was
marvelously simple (just print, backspace, and print), were in common
use. When you tell a video terminal to do that, you get a destructive
backspace and replacement of the first glyph with the second. So, in
the above case, you'll get '^' instead of (UTF-8) 'ô', which will please
almost no one.

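For what it's worth, the difference between the two kinds of terminal
can be modelled in a few lines of Python. This is only a sketch of the
"print, backspace, print" behaviour described above, assuming a
terminal is nothing more than a row of character cells. The function
name `strike` and the `overstrikes` flag are made up for illustration;
they don't correspond to anything in groff or in any terminal driver.

```python
def strike(stream: str, overstrikes: bool) -> list[str]:
    """Model a one-line terminal as a row of character cells.

    overstrikes=True  models a paper terminal: ink accumulates in a cell.
    overstrikes=False models a video terminal: a new glyph replaces
                      whatever the cell held (destructive backspace).
    Both behaviours are simplifying assumptions, not a real driver.
    """
    cells: list[str] = []
    col = 0
    for ch in stream:
        if ch == "\b":
            # Backspace only moves the print position left.
            col = max(col - 1, 0)
            continue
        while len(cells) <= col:
            cells.append("")
        if overstrikes:
            cells[col] += ch      # second glyph lands on top of the first
        else:
            cells[col] = ch       # second glyph wipes out the first
        col += 1
    return cells


print(strike("o\b^", overstrikes=True))   # ['o^']  both glyphs in one cell
print(strike("o\b^", overstrikes=False))  # ['^']   the 'o' is lost
print(strike("^\bo", overstrikes=False))  # ['o']   accent first: 'o' survives
```

The last call shows why the order of the two glyphs matters on a video
terminal: whichever one is printed last is the one you see.
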
Long story short #1: it's a bad idea to ever use the '\o' escape
sequence, not just in man pages, but in any document destined for a
video terminal (emulator).

The problem gets worse because of the quasi-negotiated portable subset
of man(7) that makes the language smaller, makes formatters and
processors of man(7) documents easier to write, and keeps Ingo Schwarze
and me from fighting more than we already do.

The problem is worse because every avenue I can think of for
circumventing it is foreclosed by portability considerations.

In ordinary *roff documents, you might do something like the following.

    .ds ^o \o'\(hao'\"
    foo "\fI[[=o=]]\fP", "\fI[[=\*(^o=]]\fP",

This defines a *roff string called '^o' to manage this troublesome
character. Note that I've switched the order of the accent and the base
character so that a destructively backspacing terminal will render the
'o' preferentially. This limps along better for general purpose use
like spelling people's names, but it just moves the lump in the carpet
if you're really trying to illustrate the combined character (UTF-8
again) 'ô'.

However, string definitions are not portable man(7).

One might try to use the real deal only on typesetters and abstract the
character away on terminal devices.

    .if n foo "\fI[[=o=]]\fP", "\fI[[=<o with circumflex accent>=]]\fP",
    .if t foo "\fI[[=o=]]\fP", "\fI[[=\o'o\(ha'=]]\fP",

But there are _two_ things wrong with this. (A) conditional expressions
are even worse for man(7) portability than string definitions (because
you need a more powerful interpreter) and (B) some nroff devices can
render the (UTF-8) 'ô' glyph just fine, and it doesn't help anyone to
throw away that advantage.

But it gets even worse. Even if we had a great mandoc/*roff portability
summit and admitted enough functionality to get either of the foregoing
solutions into our officially blessed portable man(7) subset, we'd
_still_ have a problem.
And that is that the available glyph repertoire is not known until
formatting time, and depends on the output device.

Not only is the repertoire of special characters device dependent, but
accented letters in particular were kicked away from the concern of the
formatter per se by Kernighan's device-independent rewrite of troff
circa 1980. In the 1976 version of CSTR #54 you'll find a fascinating
list of all available special characters and their renderings by the
Graphic Systems C/A/T phototypesetters. When it came time to give troff
device independence, people clearly realized that it was utterly up to
the device what glyphs were going to be available.

And it gets worse yet! Back in 1980 people must have figured that video
terminals would never get large font repertoires, and in the event they
did, they'd become effective typesetter emulators and be able to do
things like constructively overstrike an "o" with, in Unicode parlance,
a "modifier letter circumflex accent".

So, troff people merrily carried on building accented glyphs with tricks
like the one you showed. And video terminals didn't need that crap
because all they had was ASCII and nobody was going to do serious
formatting work on them anyway. (This despite the fact that DEC was
already clearly increasing its glyph repertoire by the time they put out
the VT220 (1983)--but by then relations between the Bell Labs CSRC and
DEC had long since soured for reasons I've seen alluded to but never
spelled out. Even the VT100 (1978) had a handful of non-ASCII
characters corresponding to the "special" repertoire of CSTR #54.[1])

Grrr...I'm just going to put another gigantic rant into a footnote.[2]

The bottom line is that there is no portable solution.

The quilt project had a similar issue in its man page. Here's what I
proposed to them, as part of a patch set that got merged after 4 years.

commit 34a4f3a5c9de82be774e8a50e22ebfef54ac6f5d
Author:     G. Branden Robinson <g.branden.robin...@gmail.com>
AuthorDate: Wed Aug 3 21:24:45 2022 +0200
Commit:     Jean Delvare <jdelv...@suse.de>
CommitDate: Wed Aug 3 21:24:45 2022 +0200

    Man page: render Andreas Gruenbacher's name with a u-umlaut

diff --git a/doc/quilt.1.in b/doc/quilt.1.in
index df21752..6905bb4 100644
--- a/doc/quilt.1.in
+++ b/doc/quilt.1.in
@@ -474,10 +474,11 @@ QUILT_COLORS='diff_hdr=35;44'
 .EE
 .
 .SH AUTHORS
+.fchar \\[:u] ue
 .I Quilt
 started as a series of scripts written by Andrew Morton
 .RI ( patch\\-scripts ).
-Based on Andrew's ideas, Andreas Gruenbacher completely rewrote the
+Based on Andrew's ideas, Andreas Gr\\[:u]nbacher completely rewrote the
 scripts, with the help of several other contributors (see the file
 .I AUTHORS
 in the distribution).

(The doubled backslashes are because they preprocess the page source
with a backslash-eating tool.)

But neither the `fchar` request nor the special character _name_ ':u'
is portable. Nor will '^o' be.

Why did my patch take 4 years to get merged? Don't blame Jean Delvare.
I got involved with fixing up quilt's documentation because I ran into a
bug with its "graph" command when working with some patches I wanted to
apply to Bash. Hacking on quilt's man page prompted some questions
about groff, so I pushed quilt onto the stack to spend the couple of
weeks it would take to learn what I needed about the man(7) language in
depth and detail.

One thing led to another. I'll probably never get back to Bash.

Regards,
Branden

[1] https://en.wikipedia.org/wiki/VT100
    https://en.wikipedia.org/wiki/VT220
    https://en.wikipedia.org/wiki/DEC_Special_Graphics

[2] On top of that, part of the reason everything around terminal
handling on Unix, and Linux, sucks is (I surmise) because the Bell Labs
CSRC leapfrogged from noisy paper terminals to the Blit device which
they were seemingly convinced was the future.[3] They backed that horse
with every dollar available, neglecting "glass TTYs" as hard as they
could.
And the C suite at AT&T promptly proved, with the Blit/DMD 5620
just as with the 3B20 and the AT&T UNIX PC, that, freed by divestiture
and given a license to print money in the computer business, it was
thoroughly incapable of doing so.

I don't blame the Bell Labs CSRC engineers for this state of affairs
(though there might be some blame to assign); I'd be surprised if it
weren't the case that corporate management promised that if the CSRC ate
its own dog food with the Blit that everyone else in America would be
doing it too, and your beautiful and wildly successful Unix creation
will be running in every home and business. And if you don't play
along, your budget will be slashed to ribbons because we're a serious
computer business now and your department has to be a profit center.

Oh, the joke was on everyone. The Blit wasn't the future but something
that looked a lot like it was, and the center of command line life moved
from a paper terminal to an emulator for a glass TTY manufactured by DEC
running in a window system from MIT.

As far as I know, the last person to seriously try to resolve the idiocy
surrounding Unix terminal handling was Dennis Ritchie, with
"streams".[4] The Unix Support Group, the commercial Unix guys behind
System III and so on, got hold of it and turned it into "STREAMS", of
which Ritchie himself was not a fan.[5] On top of that, every BSD and
ARPAnet weenie on the planet swore up and down that Berkeley sockets
were totally better in every way. (Well, to be fair, license-wise, they
were. I don't feel equipped to judge their comparative technical
merits. W. Richard Stevens and Douglas Comer, however, are.)

I confess I'm a little surprised that no Linux kernel hacker has yet
proven arrogant enough to believe that they can succeed where even
Ritchie failed. Personally, I'll bet on Lennart Poettering gobbling it
into systemd and making it 2% faster, thoroughly incomprehensible, and
utterly unmaintainable by anyone except IBM/Red Hat staff.
I mean, that's the whole purpose of systemd anyway.

[3] https://en.wikipedia.org/wiki/Blit_(computer_terminal)
[4] https://cseweb.ucsd.edu/classes/fa01/cse221/papers/ritchie-stream-io-belllabs84.pdf
[5] "[Streams] means something different when shouted."