Re: Zero Width Space (was Re: How to print a literal '.' as the first character in a line?)

G. Branden Robinson Sat, 04 Jun 2022 13:53:44 -0700

Hi James,

[I'm reordering one sentence in your reply.]

At 2022-06-04T15:23:36-0400, James K. Lowden wrote:
> To insert \& at the start of a line does not affect how the input is
> parsed.

Yes, it does.

$ gdb ./build/troff
GNU gdb (Debian 10.1-1.7) 10.1.90.20210103-git
Copyright (C) 2021 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from ./build/troff...
##(gdb) break input.cpp:1879
Breakpoint 1 at 0x4df2e: file ../src/roff/troff/input.cpp, line 1879.
##(gdb) break input.cpp:2833
Breakpoint 2 at 0x56857: file ../src/roff/troff/input.cpp, line 2833.
##(gdb) set args -F font -F build/font -R
##(gdb) run
Starting program: /home/branden/src/GIT/groff/build/troff -F font -F build/font -R
.br

Breakpoint 2, process_input_stack () at ../src/roff/troff/input.cpp:2833
2833                tok.next();
##(gdb) cont
Continuing.
x T ps
x res 72000 1 1
x init
p1
\&.br

Breakpoint 1, token::next (this=0x555555606620 <tok>) at ../src/roff/troff/input.cpp:1879
1879            type = TOKEN_DUMMY;
##(gdb) cont
Continuing.
[I typed Control+D here.]
x font 5 TR
f5
s10000
V12000
H72000
md
DFd
t.br
n12000 0
x trailer
V792000
x stop
[Inferior 1 (process 49612) exited normally]

We didn't trip breakpoint 1 with an input line of '.br', but we did
with '\&.br'.  The execution trace was different.  And this is not a
trivial statement--the execution trace was different _while parsing
input_.

(I set breakpoint 2 to show that "nothing was up my sleeve".)
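
For readers without a debugger handy, here is a toy sketch of the same
distinction.  It is written in Python, not GNU troff's C++, and it is
only a model: the function names and the token labels 'DUMMY' and 'CHAR'
are made up for illustration (with 'DUMMY' standing in for TOKEN_DUMMY),
and the real logic in input.cpp handles far more cases.

```python
# Toy model only: the real logic lives in token::next and
# process_input_stack in src/roff/troff/input.cpp.

# Default control character and no-break control character.
CONTROL_CHARS = (".", "'")


def classify_line(line):
    """Return 'control' if the line begins with a control character."""
    return "control" if line[:1] in CONTROL_CHARS else "text"


def tokenize(line):
    """Turn a text line into toy tokens; the escape \\& becomes DUMMY."""
    tokens = []
    i = 0
    while i < len(line):
        if line.startswith("\\&", i):
            tokens.append(("DUMMY", ""))  # hypothetical stand-in for TOKEN_DUMMY
            i += 2
        else:
            tokens.append(("CHAR", line[i]))
            i += 1
    return tokens


def rendered_text(tokens):
    """Only ordinary characters contribute text to the output."""
    return "".join(text for kind, text in tokens if kind == "CHAR")
```

In this model classify_line(".br") yields 'control' and
classify_line("\\&.br") yields 'text'.  The token list for the second
line starts with a DUMMY token, yet rendered_text() of it is just '.br',
which is consistent with the 't.br' line in the output above.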

> [\&] does not "affect how input is parsed".  It's parsed like all
> other input

This is like saying that all states in a finite state machine (FSM) are
equivalent.  It's just false (for nontrivial FSMs, and a state machine
fit to parse the roff language is far from trivial).

> -- indeed, exactly like \| and \~.

This is also demonstrably false.  Each of these is turned into a
different token and affects the FSM differently.

If you go to the function 'token::next' in the GNU troff sources you can
see how the various input sequences are handled internally.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.cpp#n1734
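
As a rough illustration of "different tokens", and emphatically not a
transcription of the code at that URL, here is a small Python table.
Every token name in it is hypothetical except that 'DUMMY' stands in for
TOKEN_DUMMY.  The widths assume the conventional typesetter defaults of
one-sixth em for \| and one-twelfth em for \^; a font description may
say otherwise, and nroff devices differ.

```python
from fractions import Fraction

# Hypothetical token names; only DUMMY corresponds to a name cited in
# this message (TOKEN_DUMMY).  Widths are in ems and are assumed
# defaults; None means the width is settled later, during adjustment.
ESCAPE_TOKENS = {
    "\\&": ("DUMMY", Fraction(0)),
    "\\|": ("THIN_SPACE", Fraction(1, 6)),
    "\\^": ("HAIR_SPACE", Fraction(1, 12)),
    "\\~": ("UNBREAKABLE_SPACE", None),
}


def token_for(escape):
    """Look up the toy token produced by an escape sequence."""
    return ESCAPE_TOKENS[escape]
```

The point of the table is only that the four escape sequences map to
four different kinds of token, so whatever consumes the tokens can, and
does, treat them differently.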

> Its only distinction from them is on output.

Sometimes not even that.  On nroff devices, '\&', '\|', '\~', and '\^'
all serve to prevent kerning between adjacent glyphs--of course,
kerning isn't done on nroff devices anyway.

'\&' is translated to a "dummy token" (enum literal 'TOKEN_DUMMY')
which becomes a "dummy node".

> *Any* character before a leading dot prevents the dot from
> being interpreted as a request.  The salient difference is that \&
> introduces nothing into the output stream.  Hence, "zero width".

If we apply that reasoning backwards from the output, we can infer a
potentially infinite number of "zero-width space" escape sequences in
the input between nearly all input characters on text lines.  But that's
not helpful.

\& in no way tells the output device to find a "zero-width space
glyph", like Unicode U+200B, and stick it on the output.  And that is
what an increasing number of people who have grown up in the Unicode era
will expect.

> To me, the term "non-printing input break" verges on nonsense because
> it suggests there might be such a thing as "printing input".

"Non-printing" modifies the phrase "input break".

I'll grant that, in English, an ambiguous parse is possible.

I guess I could change the docs to say "non-printing break {in,of}
input", though I would lament the extra 3 ens or so of space to utter
it.  But would you regard it as an improvement?

> There is not: input is processed and rendered as output.  Input is no
> more printed than it is written to the keyboard.  

This objection arises from a misinterpretation, as noted above.

> I humbly suggest on this point we return to status quo ante.  A "zero
> width space" is perfectly clear terminology.

No.  I give you U+200B.

> The fact that \& is used occasionally to prevent non-requests from
> being interpreted as requests is incidental, easily explained and
> understood.

That's not all it is used for; see also kerning adjustment prevention,
suppression of end-of-sentence detection, and (I think) other
applications.  This is what makes it a bit of a magical thing in troff.
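
To illustrate just one of these, end-of-sentence suppression, here is a
toy Python check.  It is a simplification under stated assumptions: the
function name is invented, the set of "transparent" trailing characters
is abbreviated, and GNU troff's real detection works on tokens rather
than on raw strings.

```python
END_OF_SENTENCE = ".?!"
# Simplified set of characters that may follow the punctuation without
# hiding it; the real list is longer.
TRANSPARENT = "\"')]*"


def ends_sentence(line):
    """Toy check: does this input line end a sentence?"""
    if line.endswith("\\&"):
        return False  # the dummy token sits between the period and end of line
    stripped = line.rstrip(TRANSPARENT)
    return bool(stripped) and stripped[-1] in END_OF_SENTENCE
```

Under this model a line ending in 'etc.' is taken as the end of a
sentence, while the same line ending in 'etc.\&' is not, which is the
idiom man page authors use after abbreviations.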

> Does anyone remember being confused by it?

Ask countless man page authors on the Linux man-pages list and
elsewhere.

Now, my above rejection of a reflexive reversion notwithstanding--if
someone wants to read nothing but CSTR#54, they should by all means read
nothing but CSTR#54--I feel all squicky inside any time I have to refer
to "tokens" and "nodes" in our user-facing documentation, at least
outside of discussing diversions or the production of "grout"
(groff's device-independent output), where doing so is inescapable.

Dave and I were recently talking about how the hack of letting the \|
and \^ space widths be encoded in the "charset" sections of font
description files as pseudo-special character definitions might have
been a bad idea, particularly since (A) Unicode supports numerous other
space widths and we don't have dedicated escape sequences for them and
(B) I think this feature is virtually unused.  To the extent that people
are aware of it, I think it encourages people to think muddily about how
GNU troff--and troffs in general, as I understand it--deals with space.
Spaces of any width, even zero or negative, do not become glyphs.  We do
our users a disservice if we encourage them to think that they do.

Any further revision of '\&' documentation must assiduously avoid
leading the reader into that deep pit of misconception.

Regards,
Branden
