Hi Oliver,

At 2025-05-12T12:32:18+0200, Oliver Corff via GNU roff typesetting system
discussion wrote:
> for the first time, I am experimenting with the html output features
> of groff.
>
> When attempting to compile the attached document (which is compiled
> without problem when using any other -T option) by saying
>
> $ groff -k -Thtml TA_html.ms > test.html
>
> The generated html file test.html displays a lot of garbage.
>
> I'm afraid I am missing some basic information here.
>
> I even managed to crash groff (core dump) with longer input files to
> be typeset with the ms macro set.

I have some follow-up findings on this problem, which as I noted is an
old one (10+ years) and is proving to be a real devil to track down.

Let me offer first, possibly unhelpfully, that you picked apparently
the single worst macro package to start your experiments with.  The
reason is that I see this bug manifest _only_ with the ms macro
package, and not any of the other full-service ones we supply.  (I
didn't check mom(7) because I don't know how to write a minimal
document in it, and I suspect Peter has already checked mom's results
with grohtml(1), at least up to the point where the pre/post-processing
and "devtagging" machinery frustrated progress.)

I'm attaching a set of closely similar documents for the ms, me, mm,
man, and mdoc macros.  The only one that shows this word-space
destruction defect is ms.

The problem appears to be happening inside the formatter, since grout
output clearly shows the issue when diffing output to the "html" and
"utf8" devices, respectively.  After, that is, a lot of noisy preamble
used only for HTML output, which I'm not sure is a helpful feature (or
if it is, why it's restricted to this output device); maybe it's
debugging scaffolding for grohtml-related changes to the formatter that
were never taken out.

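A rough way to check grout for the missing word spaces without reading a
whole diff is to count its commands.  The Python sketch below is only an
illustration: the function name count_grout_commands is hypothetical, the
two excerpts are trimmed from the grout shown in this message, and it
assumes one command per line, with 't' typesetting a word and 'w'
(as in 'wh24') marking a word space.

```python
def count_grout_commands(grout: str) -> dict:
    """Count typeset words ('t...') and word spaces ('w...') in grout text.

    Assumption: each grout command sits on its own line, as in the
    excerpts below.  This is a sketch, not a full grout parser.
    """
    words = 0
    word_spaces = 0
    for line in grout.splitlines():
        if line.startswith("t"):
            words += 1          # e.g. "tbaz" typesets the word "baz"
        elif line.startswith("w"):
            word_spaces += 1    # e.g. "wh24": word space, then move right 24
    return {"words": words, "word_spaces": word_spaces}


# Excerpts trimmed from the "html" and "utf8" grout in this message.
HTML_EXCERPT = "tbaz\nn0 0\nV40\nH0\ntqux\nn0 0\n"
UTF8_EXCERPT = "tbaz\nwh24\ntqux\nn40 0\n"

if __name__ == "__main__":
    print("html:", count_grout_commands(HTML_EXCERPT))
    print("utf8:", count_grout_commands(UTF8_EXCERPT))
```

Two words with no word space between them is the symptom to look for.
Grout for a real document can be captured with groff's -Z option, which
suppresses postprocessing.
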
$ diff -U0 MS-HTML MS-UTF8
--- MS-HTML 2025-05-14 12:46:48.252239912 -0500
+++ MS-UTF8 2025-05-14 12:46:51.556223272 -0500
@@ -1 +1 @@
-x T html
+x T utf8
@@ -4,12 +3,0 @@
-x F /home/branden/src/GIT/groff/build/../tmac/troffrc
-x F composite.tmac
-x F fallbacks.tmac
-x F html.tmac
-x F www.tmac
-x F devtag.tmac
-x F en.tmac
-x F latin1.tmac
-x F pspic.tmac
-x F pdfpic.tmac
-x F /home/branden/src/GIT/groff/build/../tmac/troffrc-end
-x F html-end.tmac
@@ -17,23 +4,0 @@
-V40
-H240
-DFd
-h1560
-n40 0
-x F -
-x F s.tmac
-x F devtag.tmac
-x F refer-ms.tmac
-x F refer.tmac
-x F de.tmac
-x F trans.tmac
-x F latin1.tmac
-V40
-H0
-x X devtag:.fi 1
-x X devtag:.rj 0
-x X devtag:.in 0
-x X devtag:.ll 24
-x X devtag:.po 0
-x X devtag:.ta L 120
-x X devtag:.ce 0
-x X devtag:.br
@@ -43 +8 @@
-V40
+V280
@@ -45,0 +11 @@
+DFd
@@ -47,3 +13 @@
-n0 0
-V40
-H0
+wh24
@@ -51,4 +15,4 @@
-n0 0
-V2147483480
-H24
-n0 0
+n40 0
+V2560
+H1560
+n40 0
@@ -56 +20 @@
-V2147483600
+V2640

The heart of this issue is changes like this:

 x font 1 R
 f1
 s10
-V40
+V280
 H0
 md
+DFd
 tbaz
-n0 0
-V40
-H0
+wh24
 tqux
-n0 0
-V2147483480
-H24
-n0 0
+n40 0

Here we can see in the "utf8" output a horizontal motion flagged as a
word space, "wh24", that is missing from the "html" output.  We also
have these suspiciously useless 'n0 0' commands in the "html" output.

groff_out(5):
     n b a
            Indicate a break.  No action is performed; the command is
            present to make the output more easily parsed.  The
            integers b and a describe the vertical space amounts
            before and after the break, respectively.  GNU troff
            issues this command but groff’s output driver library
            ignores it.  See v and V.

But the weirdest part is that, despite these indications that we have a
problem in the formatter itself, no other macro package causes the
problem, even when formatting very similar output.

$ diff -U0 MS-HTML MM-HTML
--- MS-HTML 2025-05-14 12:46:48.252239912 -0500
+++ MM-HTML 2025-05-14 12:47:16.224099732 -0500
@@ -23 +23 @@
-x F s.tmac
+x F m.tmac
@@ -25 +25 @@
-x F refer-ms.tmac
+x F refer-mm.tmac
@@ -30,2 +30,2 @@
-V40
-H0
+V80
+H168
@@ -35,3 +35,3 @@
-x X devtag:.ll 24
-x X devtag:.po 0
-x X devtag:.ta L 120
+x X devtag:.ll 1440
+x X devtag:.po 168
+x X devtag:.ta L 120 L 240 L 360 L 480 L 600 L 720 L 840 L 960 L 1080 L 1200 L 1320 L 1440
@@ -39 +38,0 @@
-x X devtag:.br
@@ -43,2 +42,2 @@
-V40
-H0
+V80
+H168
@@ -47,3 +46 @@
-n0 0
-V40
-H0
+wh24
@@ -51,4 +48 @@
-n0 0
-V2147483480
-H24
-n0 0
+n40 0

The mm package is having no problem getting the formatter to put 'wh24'
commands on the output, and also causes it to produce 'n' commands that
wouldn't be nilpotent even if they weren't documentary.

Another clue is that the `pline` request I stuck between "baz" and
"qux" in my input produced a populated node list to the standard error
stream in every case except the buggy one.

Something odd is going on inside the formatter; it's not like it
(normally) waits until it's seen a word space to populate the pending
output line.  Observe:

$ printf 'ab\\c\n.pline\n' | ~/groff-HEAD/bin/groff -a
<beginning of page>
[{"type": "line_start_node", "diversion level": 0, "is_special_node": false},
 {"type": "glyph_node", "diversion level": 0, "is_special_node": false, "character": "a"},
 {"type": "glyph_node", "diversion level": 0, "is_special_node": false, "character": "b"},
 {"type": "transparent_dummy_node", "diversion level": 0, "is_special_node": false}]
ab

My advice for the time being is to select _any_ other full-service
macro package with which to pursue your experiments with grohtml.  That
feels pretty lame to say, I admit.

Regards,
Branden

oliver-html-device-kills-interword-space.man
Description: Unix manual page
oliver-html-device-kills-interword-space.mdoc
Description: application/troff-mdoc
oliver-html-device-kills-interword-space.me
Description: Troff ME-macros document
oliver-html-device-kills-interword-space.mm
Description: application/troff-mm
oliver-html-device-kills-interword-space.ms
Description: Troff MS-macros document
.mso de.tmac
.sp
baz
.pline
qux