Hi Oliver,

At 2025-05-12T12:32:18+0200, Oliver Corff via GNU roff typesetting
system discussion wrote:
> for the first time, I am experimenting with the html output features
> of groff.
> 
> When attempting to compile the attached document (which is compiled
> without problem when using any other -T option) by saying
> 
> $ groff -k -Thtml TA_html.ms > test.html
> 
> The generated html file test.html displays a lot of garbage.
> 
> I'm afraid I am missing some basic information here.
> 
> I even managed to crash groff (core dump) with longer input files to
> be typeset with the ms macro set.

I have some follow-up findings on this problem, which as I noted is an
old one (10+ years) and is proving to be a real devil to track down.

Let me offer first, possibly unhelpfully, that you apparently picked the
single worst macro package to start your experiments with.

The reason is that I see this bug manifest _only_ with the ms macro
package, and not any of the other full-service ones we supply.

(I didn't check mom(7) because I don't know how to write a minimal
document in it, and I suspect Peter has already checked mom's results
with grohtml(1), at least up to the point where the pre/post-processing
and "devtagging" machinery frustrated progress.)

I'm attaching a set of closely similar documents for the ms, me, mm,
man, and mdoc macros.  The only one that shows this word-space
destruction defect is ms.  The problem appears to be happening inside
the formatter, since grout output clearly shows the issue when diffing
output to the "html" and "utf8" devices, respectively.

After, that is, a lot of noisy preamble used only for HTML output, which
I'm not sure is a helpful feature (or if it is, why it's restricted to
this output device); maybe it's debugging scaffolding for
grohtml-related changes to the formatter that were never taken out.

$ diff -U0 MS-HTML MS-UTF8 
--- MS-HTML     2025-05-14 12:46:48.252239912 -0500
+++ MS-UTF8     2025-05-14 12:46:51.556223272 -0500
@@ -1 +1 @@
-x T html
+x T utf8
@@ -4,12 +3,0 @@
-x F /home/branden/src/GIT/groff/build/../tmac/troffrc
-x F composite.tmac
-x F fallbacks.tmac
-x F html.tmac
-x F www.tmac
-x F devtag.tmac
-x F en.tmac
-x F latin1.tmac
-x F pspic.tmac
-x F pdfpic.tmac
-x F /home/branden/src/GIT/groff/build/../tmac/troffrc-end
-x F html-end.tmac
@@ -17,23 +4,0 @@
-V40
-H240
-DFd
-h1560
-n40 0
-x F -
-x F s.tmac
-x F devtag.tmac
-x F refer-ms.tmac
-x F refer.tmac
-x F de.tmac
-x F trans.tmac
-x F latin1.tmac
-V40
-H0
-x X devtag:.fi 1
-x X devtag:.rj 0
-x X devtag:.in 0
-x X devtag:.ll 24
-x X devtag:.po 0
-x X devtag:.ta  L 120
-x X devtag:.ce 0
-x X devtag:.br
@@ -43 +8 @@
-V40
+V280
@@ -45,0 +11 @@
+DFd
@@ -47,3 +13 @@
-n0 0
-V40
-H0
+wh24
@@ -51,4 +15,4 @@
-n0 0
-V2147483480
-H24
-n0 0
+n40 0
+V2560
+H1560
+n40 0
@@ -56 +20 @@
-V2147483600
+V2640

The heart of this issue is changes like this:

 x font 1 R
 f1
 s10
-V40
+V280
 H0
 md
+DFd
 tbaz
-n0 0
-V40
-H0
+wh24
 tqux
-n0 0
-V2147483480
-H24
-n0 0
+n40 0

Here the "utf8" output contains a horizontal motion flagged as a word
space, "wh24", that is missing from the "html" output.  We also have
these suspiciously useless 'n0 0' commands in the "html" output.
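
A quick way to check for the missing word spaces, as a rough sketch only:
count the lines that begin with 'w'.  This assumes one grout command per
line, as in the excerpts above.  The helper name count_word_spaces is
made up for illustration, and the two excerpts are trimmed by hand from
the diff above rather than produced by running groff.

```shell
#!/bin/sh
# Count word-space ('w') commands in a grout stream read from stdin.
# Assumption: one command per line, so a word space shows up as a line
# starting with 'w' (for example 'wh24').
count_word_spaces() {
    # grep -c prints 0 and exits nonzero when nothing matches;
    # '|| true' keeps that from being treated as a failure.
    grep -c '^w' || true
}

# Hand-trimmed excerpts from the diff above (illustrative only).
html_excerpt='tbaz
n0 0
V40
H0
tqux'

utf8_excerpt='tbaz
wh24
tqux'

html_count=$(printf '%s\n' "$html_excerpt" | count_word_spaces)
utf8_count=$(printf '%s\n' "$utf8_excerpt" | count_word_spaces)

echo "html: $html_count word-space command(s)"
echo "utf8: $utf8_count word-space command(s)"
```

Pointed at complete grout files instead of these excerpts, a count of
zero for "html" next to a nonzero count for "utf8" would match the
symptom described here.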

groff_out(5):
     n b a  Indicate a break.  No action is performed; the command is
            present to make the output more easily parsed.  The integers
            b and a describe the vertical space amounts before and after
            the break, respectively.  GNU troff issues this command but
            groff’s output driver library ignores it.  See v and V.

But the weirdest part is that, despite these indications that we have a
problem in the formatter itself, no other macro package causes the
problem, even when formatting very similar input.

$ diff -U0 MS-HTML MM-HTML
--- MS-HTML     2025-05-14 12:46:48.252239912 -0500
+++ MM-HTML     2025-05-14 12:47:16.224099732 -0500
@@ -23 +23 @@
-x F s.tmac
+x F m.tmac
@@ -25 +25 @@
-x F refer-ms.tmac
+x F refer-mm.tmac
@@ -30,2 +30,2 @@
-V40
-H0
+V80
+H168
@@ -35,3 +35,3 @@
-x X devtag:.ll 24
-x X devtag:.po 0
-x X devtag:.ta  L 120
+x X devtag:.ll 1440
+x X devtag:.po 168
+x X devtag:.ta  L 120 L 240 L 360 L 480 L 600 L 720 L 840 L 960 L 1080 L 1200 L 1320 L 1440
@@ -39 +38,0 @@
-x X devtag:.br
@@ -43,2 +42,2 @@
-V40
-H0
+V80
+H168
@@ -47,3 +46 @@
-n0 0
-V40
-H0
+wh24
@@ -51,4 +48 @@
-n0 0
-V2147483480
-H24
-n0 0
+n40 0

The mm package is having no problem getting the formatter to put 'wh24'
commands on the output, and also causes it to produce 'n' commands that
wouldn't be nilpotent even if they weren't documentary.

Another clue is that the `pline` request I stuck between "baz" and "qux"
in my input produced a populated node list to the standard error stream
in every case except the buggy one.  Something odd is going on inside
the formatter; it's not like it (normally) waits until it's seen a word
space to populate the pending output line.  Observe:

$ printf 'ab\\c\n.pline\n' | ~/groff-HEAD/bin/groff -a
<beginning of page>
[{"type": "line_start_node", "diversion level": 0, "is_special_node": false},
{"type": "glyph_node", "diversion level": 0, "is_special_node": false, "character": "a"},
{"type": "glyph_node", "diversion level": 0, "is_special_node": false, "character": "b"},
{"type": "transparent_dummy_node", "diversion level": 0, "is_special_node": false}]
ab

My advice for the time being is to select _any_ other full-service macro
package with which to pursue your experiments with grohtml.  That feels
pretty lame to say, I admit.

Regards,
Branden

Attachment: oliver-html-device-kills-interword-space.man
Description: Unix manual page

Attachment: oliver-html-device-kills-interword-space.mdoc
Description: application/troff-mdoc

Attachment: oliver-html-device-kills-interword-space.me
Description: Troff ME-macros document

Attachment: oliver-html-device-kills-interword-space.mm
Description: application/troff-mm

Attachment: oliver-html-device-kills-interword-space.ms
Description: Troff MS-macros document

.mso de.tmac
.sp
baz
.pline
qux

Attachment: signature.asc
Description: PGP signature
