On Wed, Jul 01, 2009 at 11:04:03PM +0500, Stepan Golosunov wrote:
> On 30.06.2009 at 12:06:44 +0100, Colin Watson wrote:
> > w3mman should set MAN_KEEP_FORMATTING=1 in the environment to instruct
> > man not to invoke col to strip away formatting characters, which it
> > normally does by default when writing to a pipe. I added this feature to
> > man-db with the express intention that it should be used by programs
> > like pinfo and w3mman that invoke man and can do something with its
> > formatted output. Patch attached.
>
> Actually, w3mman in lenny shows underlined characters *unless* called
> with MAN_KEEP_FORMATTING=1 (they just aren't underlined).
Assuming that you're referring to the same test case
(LC_ALL=ru_RU.UTF-8 w3mman cp), this appears to be a separate bug;
w3mman2html.cgi is failing to deal with the sequence "_" BACKSPACE
<UTF-8 character>, presumably stripping off the first byte of the UTF-8
character and attempting to underline that. I imagine it has the same
trouble with bold (<UTF-8 character> BACKSPACE <same UTF-8 character>).
This should be straightforward enough to fix if you have the patience to
dig through the relevant regular expressions. :-) It clearly ought to be
fixed.

> But it hides non-ascii section headings when called *without*
> MAN_KEEP_FORMATTING=1.
> And this seems to be because man in this case produces something
> bogus.
>
> This is the first section heading (ИМЯ), generated by
> "MAN_KEEP_FORMATTING=1 man cp|hd" (in ru_RU.UTF-8 locale):
> d0 98 08 d0 98 d0 9c 08 d0 9c d0 af 08 d0 af 0a
>
> But "man cp|hd" generates invalid utf-8:
> d0 d0 98 d0 d0 9c d0 d0 af 0a
>
> It's supposed to be as in "echo ИМЯ|hd":
> d0 98 d0 9c d0 af 0a

Sure, I'm entirely familiar with that symptom, which is actually a col
bug, namely #319952. The point of MAN_KEEP_FORMATTING=1 is to skip the
call to col, thus as a side-effect dodging that bug. For a program that
handles the formatting typically emitted by groff, it is unambiguously
correct to set MAN_KEEP_FORMATTING=1 to skip the col invocation. It
hadn't occurred to me that w3mman would handle UTF-8 characters wrongly
in this mode, but that should be easy enough to fix.

--
Colin Watson [[email protected]]
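
For illustration only, here is a minimal Python sketch of handling the
overstrike sequences discussed above ("_" BACKSPACE <character> for
underline, <character> BACKSPACE <same character> for bold) without
splitting multi-byte UTF-8 characters. It is not the code in
w3mman2html.cgi (which is Perl); the function name overstrike_to_html
and the choice of <u>/<b> tags are assumptions made for this example.
The idea is simply to decode the bytes to characters first, so that the
pattern matches whole characters rather than single bytes.

```python
import html
import re

# One overstruck cell in decoded text:
#   "_" BACKSPACE X   -> underlined X   (group 1)
#   X BACKSPACE X     -> bold X         (group 2, repeated via backreference)
# X is any single code point except backspace. Because the input is decoded
# before matching, X is always a whole character, never one byte of it.
OVERSTRIKE = re.compile(r"_\x08([^\x08])|([^\x08])\x08\2", re.DOTALL)


def overstrike_to_html(raw: bytes) -> str:
    """Convert groff-style overstrike output (UTF-8 bytes) to simple HTML."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError on invalid input
    out = []
    pos = 0
    for m in OVERSTRIKE.finditer(text):
        # Plain text between overstruck cells is escaped and copied through.
        out.append(html.escape(text[pos:m.start()]))
        if m.group(1) is not None:
            out.append("<u>" + html.escape(m.group(1)) + "</u>")
        else:
            out.append("<b>" + html.escape(m.group(2)) + "</b>")
        pos = m.end()
    out.append(html.escape(text[pos:]))
    return "".join(out)


if __name__ == "__main__":
    # The section heading bytes quoted above (MAN_KEEP_FORMATTING=1 case).
    heading = bytes.fromhex("d0 98 08 d0 98 d0 9c 08 d0 9c d0 af 08 d0 af 0a")
    print(overstrike_to_html(heading), end="")
```

Fed the heading bytes from the MAN_KEEP_FORMATTING=1 dump, this yields
each of the three Cyrillic letters wrapped in its own <b> element; a
real converter would probably merge adjacent runs. The col-mangled
bytes from the plain "man cp|hd" dump are not valid UTF-8, so the
decode step rejects them rather than producing garbage.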

