Hi,

> Eric Andrew Lewis wrote on Thu, 26 Feb 2015 07:49:18 -0500:
>> I'm interested in building a troff parser to extract information
>> from manpages (e.g. what do the flags mean when we say `rm -rf *`?).
>>
>> I'm curious, would the marked up source be the format to parse?

Which format?

One thing making this a complex task is that there are so many
languages involved, for example:

 - mdoc macro language - called mdoc(7) below
 - man macro language - man(7)
 - low-level roff requests - roff(7)
 - tbl table description language - tbl(7)
 - eqn equation description language - eqn(7)
 - pic picture description language - pic(7)
 - ms macro language - ms(7)
 - me macro language - me(7)
 - mm macro language - mm(7)

For which specific purpose do you want to build this tool, and which
set of manual pages do you want to process with it? The difficulty
very much depends on the answers to these questions.

If you want to avoid handling all of low-level roff(7) - see below
for why - you have to handle the output of various man(7) code
generators specially, in particular:

 - pod2man(1) output from perlpod(1) input documents
 - DocBook output
 - ...

Ralph Corderoy wrote on Thu, 26 Feb 2015 13:00:58 +0000:

> That's a hard problem. You may want to look at Eric Raymond's
> doclifter. http://www.catb.org/esr/doclifter/

If you want to handle any and all features of low-level roff(7),
it is indeed very hard. Even doclifter handles only part of that.

Getting back to the above list of languages, doclifter is the
best tool available for:

  man(7) pic(7) ms(7) me(7) mm(7)

It is clearly the wrong tool for mdoc(7), see below.

Doug McIlroy wrote on Thu, 26 Feb 2015 15:46:56 -0500:

> The syntax of troff and of the man-pages macros is in
>
>   man 7 groff

Ironically, the full documentation of the roff syntax is itself
written not in roff, but in texinfo format:

  http://www.gnu.org/software/groff/manual/html_node/

Also see the Heirloom troff manual,

  http://n-t-roff.github.io/heirloom/doctools/troff.pdf

That's an update of the one the OP cited,

  http://cm.bell-labs.com/sys/doc/troff.pdf

>   man 7 groff_man
>   man 7 groff_mdoc

An alternative, intentionally compatible definition of these two
languages is provided at:

  http://mdocml.bsd.lv/man/mdoc.7.html
  http://mdocml.bsd.lv/man/man.7.html

In cases of doubt, comparing both may help understanding.

> The markup, however, is not faithfully used. In groff -man,
> you'll find boldface specified by .B , \fB, and perhaps .ft B
> or .ft 3.

Indeed, and .BR, .RB, .BI, .IB, .SH, .SS, and maybe even .SY.
The main problem with man(7) is that it's not a semantic, but
a presentational language in the first place.

> And you'll find .I used for names of parameters
> as well as for names of man pages (though parse context will
> usually resolve the ambiguity. groff -mdoc tries for more
> precision than man, but I suspect is sloppily used because
> there are so many details to learn.

That suspicion seems natural, but having seen lots and lots of
both mdoc(7) and man(7) documentation, i don't share it.

The largest single body of mdoc(7) documentation descends from the
BSD system documentation of the Berkeley Computer Systems Research
Group. It is still used in OpenBSD, FreeBSD, NetBSD, DragonFly,
and Minix 3. In this body, mdoc(7) is not sloppily used, since the
author of the mdoc(7) language is also the original author of these
documents: Cynthia Livingston.

Of course, the original body of AT&T Version 7 UNIX documentation
written in man(7) was also very clean, but none of that remains in
use in any major current system.
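
To make the point about presentational markup in man(7) concrete,
here is a minimal Python sketch. It is not a roff parser and does not
reflect how doclifter or mandoc work; the function name bold_spans
and the sample input are made up for illustration. It only recognizes
three of the spellings mentioned above (.B, \fB...\fR, and .ft B or
.ft 3) and shows that all of them have to be mapped to the same thing
before any meaning can even be guessed.

```python
import re

def bold_spans(source):
    """Collect text set in bold by three common man(7) spellings.

    Hypothetical helper for illustration only: it ignores quoting,
    macro definitions, and every request other than .B and .ft.
    """
    spans = []
    in_ft_bold = False  # True between ".ft B" / ".ft 3" and the next ".ft"
    for line in source.splitlines():
        if line.startswith(".ft"):
            # ".ft B" and ".ft 3" both select bold; anything else ends it
            in_ft_bold = line[3:].strip() in ("B", "3")
            continue
        if line.startswith(".B "):
            # the .B macro: its arguments are set in bold
            spans.append(line[3:].strip())
            continue
        if line.startswith("."):
            # any other request or macro line is skipped by this toy
            continue
        if in_ft_bold:
            # a plain text line while the bold font is selected
            spans.append(line.strip())
            continue
        # inline font escapes: \fB ... followed by \fR, \fP or \f1
        spans.extend(re.findall(r"\\fB(.*?)\\f[RP1]", line))
    return spans

sample = "\n".join([
    ".B rm",
    r"Use \fBrm\fR with care.",
    ".ft B",
    "rm",
    ".ft",
])
print(bold_spans(sample))
```

Even this toy needs three separate code paths for one visual effect,
and it still ignores .BR, .RB, .BI, .IB, quoted arguments, and
user-defined macros.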

Besides, in practice, there is a certain correlation between code
quality and documentation quality. People who focus on clean, small,
and secure code often value clean and concise documentation as well -
and sometimes favour mdoc(7) over man(7). People prone to
overengineering, bloat and sloppy work tend to produce either no
documentation at all or bulky, incomplete, poorly formatted
documentation. They seem to prefer man(7) over mdoc(7) and often
use low-quality code generators, in particular DocBook.

So in practice, you find:

 - A large amount of clean man(7) documentation.
 - A very large amount of sloppy man(7) documentation.
 - A large amount of clean mdoc(7) documentation.
 - Some, but very little sloppy mdoc(7) documentation.

which means that the average mdoc(7) document is much *less* sloppily
written than the average man(7) document.

Besides, since the bulk of mdoc(7) documentation is BSD documentation,
it tends to be very actively maintained. Even though it is already
reasonable, markup consistency is actively being worked on, at least
in OpenBSD.

Kristaps Dzonsons wrote on Thu, 26 Feb 2015 15:13:56 +0100:

> If the pages are in mdoc(7) (which you indicated), just use
> libmandoc(3) (http://mdocml.bsd.lv) to parse the file and
> extract flags (`Fl') in the SYNOPSIS and correlate them to
> their explanation in the DESCRIPTION's `Bl -tag' list.
> Not difficult at all.

Indeed, but depending on what exactly you want to do, it is still
a considerable amount of work, even if you want to handle mdoc(7)
only.

Thinking about it, multiple possible GSoC topics come to mind in
this area, but it doesn't look like you are a student, right?
So it may not help you if i were to set up a mentoring proposal,
or would it?

Steffen Nurpmeso wrote on Fri, 27 Feb 2015 11:31:36 +0100:

> For the mdocmx(7) project i have written a simple mdoc(7)
> parser in awk(1), the entire thing 18966 bytes [...]

That's clearly bad advice.
There are still missing parts in mandoc, but the mdoc(7) parser is
among the parts that are most stable and best understood. Rewriting
*that* over and over again is not going to solve a problem.

Besides, an mdoc(7) parser in awk(1) already exists, written in 1991
by Henry Spencer:

  http://manpages.bsd.lv/history/spencer_22_10_2011.txt
  http://manpages.bsd.lv/history.html#x1991_awf

Yours,
  Ingo