Hi James, [I'm reordering one sentence in your reply.]
At 2022-06-04T15:23:36-0400, James K. Lowden wrote:
> To insert \& at the start of a line does not affect how the input is
> parsed.

Yes, it does.

$ gdb ./build/troff
GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./build/troff...
##(gdb) break input.cpp:1879
Breakpoint 1 at 0x4df2e: file ../src/roff/troff/input.cpp, line 1879.
##(gdb) break input.cpp:2833
Breakpoint 2 at 0x56857: file ../src/roff/troff/input.cpp, line 2833.
##(gdb) set args -F font -F build/font -R
##(gdb) run
Starting program: /home/branden/src/GIT/groff/build/troff -F font -F build/font -R
.br

Breakpoint 2, process_input_stack () at ../src/roff/troff/input.cpp:2833
2833 tok.next();
##(gdb) cont
Continuing.
x T ps
x res 72000 1 1
x init
p1
\&.br

Breakpoint 1, token::next (this=0x555555606620 <tok>) at ../src/roff/troff/input.cpp:1879
1879 type = TOKEN_DUMMY;
##(gdb) cont
Continuing.
[I typed Control+D here.]
x font 5 TR
f5
s10000
V12000
H72000
md
DFd
t.br
n12000 0
x trailer
V792000
x stop
[Inferior 1 (process 49612) exited normally]

We didn't trip breakpoint 1 with an input line of '.br', but we did with
'\&.br'.  The execution trace was different.  And this is not a trivial
statement--the execution trace was different _while parsing input_.
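For readers who would rather not fire up gdb, the distinction can be
sketched with a toy classifier.  This is an illustration only, written
in Python with a hypothetical function name (classify_line) and made-up
token labels; it is not how GNU troff is implemented, and it ignores the
no-break control character, every other escape sequence, and most of
what token::next really does.

```python
# Toy model only.  GNU troff's real logic lives in token::next() in
# src/roff/troff/input.cpp and is far more involved than this.

def classify_line(line, control_char="."):
    """Return a list of (kind, text) pairs for one input line.

    Hypothetical helper: the names and token kinds are assumptions made
    for illustration, not identifiers from the groff sources.
    """
    if line.startswith(control_char):
        # Control line: the rest of the line names a request or macro.
        return [("REQUEST", line[1:].strip())]
    tokens = []
    i = 0
    while i < len(line):
        if line.startswith("\\&", i):
            # The escape sequence is consumed as its own token
            # (compare TOKEN_DUMMY in the gdb session above).
            tokens.append(("DUMMY", ""))
            i += 2
        else:
            tokens.append(("CHAR", line[i]))
            i += 1
    return tokens


print(classify_line(".br"))
print(classify_line("\\&.br"))
```

In this sketch the first line is read as a request, while the second is
a text line that begins with a dummy token followed by ordinary
characters, so the two inputs take different paths through the parser.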
(I set breakpoint 2 to show that "nothing was up my sleeve".)

> [\&] does not "affect how input is parsed". It's parsed like all
> other input

This is like saying that all states in a finite state machine (FSM) are
equivalent.  It's just false (for nontrivial FSMs, and a state machine
fit to parse the roff language is far from trivial).

> -- indeed, exactly like \| and \~.

This is also demonstrably false.  Each of these is turned into a
different token and affects the FSM differently.  If you go to the
function 'token::next' in the GNU troff sources you can see how the
various input sequences are handled internally.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp#n1734

> Its only distinction from them is on output.

Sometimes not even that.  On nroff devices any of '\&', '\|', '\~', and
'\^' will serve to prevent kerning between adjacent glyphs--of course,
kerning isn't done on nroff devices anyway.

'\&' is translated to a "dummy token" (enum literal 'TOKEN_DUMMY') which
becomes a "dummy node".

> *Any* character before a leading dot prevents the dot from
> being interpreted as a request. The salient difference is that \&
> introduces nothing into the output stream. Hence, "zero width".

If we apply that reasoning backwards from the output, we can infer a
potentially infinite number of "zero-width space" escape sequences in
the input between nearly all input characters on text lines.  But that's
not helpful.  \& in no way tells the output device to find a "zero-width
space glyph", like Unicode U+200B, and stick it on the output.  And that
is what an increasing number of people who have grown up in the Unicode
era will expect.

> To me, the term "non-printing input break" verges on nonsense because
> it suggests there might be such a thing as "printing input".

"Non-printing" modifies the phrase "input break".  I'll grant that, in
English, an ambiguous parse is possible.
I guess I could change the docs to say "non-printing break {in,of}
input", though I would lament the extra 3 ens or so of space to utter
it.  But would you regard it as an improvement?

> There is not: input is processed and rendered as output. Input is no
> more printed than it is written to the keyboard.

This objection arises from a misinterpretation, as noted above.

> I humbly suggest on this point we return to status quo ante. A "zero
> width space" is perfectly clear terminology.

No.  I give you U+200B.

> The fact that \& is used occasionally to prevent non-requests from
> being interpreted as requests is incidental, easily explained and
> understood.

That's not all it is used for; see also kerning adjustment prevention,
suppression of end-of-sentence detection, and (I think) other
applications.  This is what makes it a bit of a magical thing in troff.

> Does anyone remember being confused by it?

Ask countless man page authors on the Linux man-pages list and
elsewhere.

Now, my above rejection of a reflexive reversion notwithstanding--if
someone wants to read nothing but CSTR#54, they should by all means read
nothing but CSTR#54--I feel all squicky inside any time I have to refer
to "tokens" and "nodes" in our user-facing documentation, at least
outside of discussing diversions or the production of "grout" (groff's
device-independent output), where doing so is inescapable.

Dave and I were recently talking about how the hack of letting the \|
and \^ space widths be encoded in the "charset" sections of font
description files as pseudo-special character definitions might have
been a bad idea, particularly since (A) Unicode supports numerous other
space widths and we don't have dedicated escape sequences for them and
(B) I think this feature is virtually unused.  To the extent that people
are aware of it, I think it encourages people to think muddily about how
GNU troff--and troffs in general, as I understand it--deal with space.
Spaces of any width, even zero or negative, do not become glyphs.  We do
our users a disservice if we encourage them to think that they do.  Any
further revision of '\&' documentation must assiduously avoid leading
the reader into that deep pit of misconception.

Regards,
Branden