Re: [Groff] Building a troff parser

Ingo Schwarze Fri, 27 Feb 2015 03:59:05 -0800

Hi,

Eric Andrew Lewis wrote on Thu, 26 Feb 2015 07:49:18 -0500:


>> I'm interested in building a troff parser to extract information
>> from manpages (e.g. what do the flags mean when we say `rm -rf *`?).
>>
>> I'm curious, would the marked up source be the format to parse?

Which format?  One thing making this a complex task is that there
are so many languages involved, for example:

 - mdoc macro language - called mdoc(7) below
 - man macro language - man(7)
 - low-level roff requests - roff(7)
 - tbl table description language - tbl(7)
 - eqn equation description language - eqn(7)
 - pic picture description language - pic(7)
 - ms macro language - ms(7)
 - me macro language - me(7)
 - mm macro language - mm(7)

For which specific purpose do you want to build this tool, and
which set of manual pages do you want to process with it?  The
difficulty very much depends on the answers to these questions.

If you want to avoid handling all of low-level roff(7) - see
below for why - you have to handle the output of various man(7)
code generators specially, in particular:

 - pod2man(1) output from perlpod(1) input documents
 - DocBook output
 - ...
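
As a rough illustration of the first step, deciding what kind of
page you are looking at, here is a minimal Python sketch.  It
relies on mdoc(7) pages starting with .Dd and man(7) pages with
.TH.  The generator fingerprints ("Pod::Man" and "DocBook"
appearing in a leading comment) are an assumption about what
these generators typically write, and classify_page is a
hypothetical name, not part of any existing tool.

```python
def classify_page(source):
    """Guess the macro language and, if recognizable, the generator."""
    language = "unknown"
    generator = None
    for line in source.splitlines():
        if line.startswith('.\\"'):
            # roff comment line: look for generator fingerprints
            # (the exact wording is an assumption, see above).
            if "Pod::Man" in line:
                generator = "pod2man"
            elif "DocBook" in line:
                generator = "DocBook"
            continue
        words = line.split()
        if not words:
            continue
        # Generated pages may carry a long preamble of low-level
        # requests, so keep scanning until .Dd or .TH shows up.
        if words[0] == ".Dd":
            language = "mdoc"
            break
        if words[0] == ".TH":
            language = "man"
            break
    return language, generator
```

This only picks a starting point; it says nothing about tbl(7),
eqn(7) or pic(7) blocks embedded further down in the page.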


Ralph Corderoy wrote on Thu, 26 Feb 2015 13:00:58 +0000:

> That's a hard problem.  You may want to look at Eric Raymond's
> doclifter.  http://www.catb.org/esr/doclifter/

If you want to handle any and all features of low-level roff(7),
it is indeed very hard.  Even doclifter handles only part of
that.

Getting back to the above list of languages, doclifter is the
best tool available for: man(7) pic(7) ms(7) me(7) mm(7)
It is clearly the wrong tool for mdoc(7), see below.


Doug McIlroy wrote on Thu, 26 Feb 2015 15:46:56 -0500:

> The syntax of troff and of the man-pages macros is in
>
>       man 7 groff

Ironically, the full documentation of the roff syntax is not
written in roff, but in texinfo format:

  http://www.gnu.org/software/groff/manual/html_node/

Also see the Heirloom troff manual,

  http://n-t-roff.github.io/heirloom/doctools/troff.pdf

That's an update of the one the OP cited,

  http://cm.bell-labs.com/sys/doc/troff.pdf

>       man 7 groff_man
>       man 7 groff_mdoc

An alternative, intentionally compatible definition of these
two languages is provided at:

  http://mdocml.bsd.lv/man/mdoc.7.html
  http://mdocml.bsd.lv/man/man.7.html

In cases of doubt, comparing both may help understanding.

> The markup, however, is not faithfully used.  In groff -man,
> you'll find boldface specified by .B , \fB, and perhaps .ft B
> or .ft 3.

Indeed, and .BR, .RB, .BI, .IB, .SH, .SS, and maybe even .SY.
The main problem with man(7) is that it's not a semantic, but
a presentational language in the first place.
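
To make that concrete, here is a small Python sketch that reports
which of these spellings a single input line uses to ask for bold.
It covers only the macros and requests named above plus the \fB and
\f3 escapes, and bold_spelling is a hypothetical helper, not taken
from any existing tool.  It does not track font state across lines
and will misfire on an escaped backslash followed by "fB".

```python
import re

# Macros that set some or all of their arguments in bold,
# taken from the list in the text above.
BOLD_MACROS = {"B", "BR", "RB", "BI", "IB", "SH", "SS", "SY"}

# Inline font escapes selecting bold: \fB or \f3.
BOLD_ESCAPE = re.compile(r"\\f(?:B|3)")

# Control line: control character, optional blanks, name, arguments.
CONTROL_LINE = re.compile(r"^[.']\s*(\S+)\s*(.*)$")


def bold_spelling(line):
    """Return a label for the way `line` requests bold, or None."""
    m = CONTROL_LINE.match(line)
    if m:
        name, args = m.group(1), m.group(2).split()
        if name in BOLD_MACROS:
            return "macro ." + name
        if name == "ft" and args and args[0] in ("B", "3"):
            return "request .ft " + args[0]
    if BOLD_ESCAPE.search(line):
        return "escape"
    return None
```

Even with all spellings recognized, the result is still only
"this text is bold", not "this text is a flag" or "this text is
a command name".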

> And you'll find .I used for names of parameters
> as well as for names of man pages (though parse context will
> usually resolve the ambiguity.  groff -mdoc  tries for more
> precision than man, but I suspect is sloppily used because
> there are so many details to learn.

That suspicion seems natural, but having seen lots and lots of
both mdoc(7) and man(7) documentation, i don't share this
suspicion.

The largest single body of mdoc(7) documentation descends from
the BSD system documentation of the Berkeley Computer Systems
Research Group.  It is still used in OpenBSD, FreeBSD, NetBSD,
Dragonfly, and Minix 3.  In this body, mdoc(7) is not sloppily
used, since the author of the mdoc(7) language is also the
original author of these documents: Cynthia Livingston.

Of course, the original body of AT&T Version 7 UNIX documentation
written in man(7) was also very clean, but none of that remains
in use in any major current system.

Besides, in practice, there is a certain correlation between code
quality and documentation quality.  People who focus on clean,
small, and secure code often value clean and concise documentation
as well - and sometimes favour mdoc(7) over man(7).  People prone
to overengineering, bloat and sloppy work tend to produce either
no documentation at all or bulky, incomplete, poorly formatted
documentation.  They seem to prefer man(7) over mdoc(7) and often
use low-quality code generators, in particular DocBook.

So in practice, you find:
 - A large amount of clean man(7) documentation.
 - A very large amount of sloppy man(7) documentation.
 - A large amount of clean mdoc(7) documentation.
 - Some, but very little sloppy mdoc(7) documentation.
which means that the average mdoc(7) document is much *less*
sloppily written than the average man(7).  Besides, since the
bulk of mdoc(7) documentation is BSD documentation, it tends
to be very actively maintained.  Even though already reasonable,
quality of markup consistency is actively being worked on at
least in OpenBSD.


Kristaps Dzonsons wrote on Thu, 26 Feb 2015 15:13:56 +0100:

> If the pages are in mdoc(7) (which you indicated), just use
> libmandoc(3) (http://mdocml.bsd.lv) to parse the file and
> extract flags (`Fl') in the SYNOPSIS and correlate them to
> their explanation in the DESCRIPTION's `Bl -tag' list.
> Not difficult at all.

Indeed, but depending on what exactly you want to do,
still a considerable amount of work, even if you want to
handle mdoc(7) only.
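
To give an idea of the correlation step alone, here is a toy
Python sketch.  It is line-based and deliberately naive: it is
not libmandoc(3), it ignores nested lists, macro lines inside
item bodies, and items naming several flags, and the sample page
is made up for illustration.  flag_descriptions is a hypothetical
name.

```python
def flag_descriptions(mdoc_source):
    """Map each flag introduced by `.It Fl x` in DESCRIPTION to its text."""
    flags = {}
    section = None
    current = None  # flag whose list item body is being collected
    for line in mdoc_source.splitlines():
        words = line.split()
        if not words:
            continue
        if words[0] == ".Sh":
            section = " ".join(words[1:])
            current = None
        elif section != "DESCRIPTION":
            continue
        elif words[0] == ".It" and len(words) > 2 and words[1] == "Fl":
            current = "-" + words[2]
            flags[current] = []
        elif words[0] in (".It", ".El"):
            # Some other kind of item, or the end of the list.
            current = None
        elif current is not None and not line.startswith("."):
            # Plain text line belonging to the current item.
            flags[current].append(line)
    return {flag: " ".join(text) for flag, text in flags.items()}


# Made-up sample page, not the real rm(1) manual.
SAMPLE = """\
.Sh SYNOPSIS
.Nm rm
.Op Fl f
.Op Fl r
.Ar
.Sh DESCRIPTION
.Bl -tag -width Ds
.It Fl f
Do not prompt for confirmation.
.It Fl r
Remove directories recursively.
.El
"""

descriptions = flag_descriptions(SAMPLE)
```

Answering the `rm -rf *` question then amounts to splitting
"-rf" into "-r" and "-f" and looking each one up in the
resulting dictionary.  Everything this sketch ignores is where
the considerable amount of work comes in.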

Thinking about it, multiple possible GSoC topics come to mind
in this area, but it doesn't look like you are a student, right?
So it may not help you if i were to set up a mentoring proposal,
or would it?


Steffen Nurpmeso wrote on Fri, 27 Feb 2015 11:31:36 +0100:

> For the mdocmx(7) project i have written a simple mdoc(7)
> parser in awk(1), the entire thing 18966 bytes [...]

That's clearly bad advice.  There are still missing parts
in mandoc, but the mdoc(7) parser is among the parts that are
most stable and best understood.  Rewriting *that* over and over
again is not going to solve a problem.  Besides, an mdoc(7)
parser written in awk(1) already exists, written in 1991
by Henry Spencer:

  http://manpages.bsd.lv/history/spencer_22_10_2011.txt
  http://manpages.bsd.lv/history.html#x1991_awf

Yours,
  Ingo
