Hi Aaron,

Aaron Davies wrote on Mon, Nov 30, 2015 at 08:00:06PM -0500:

> $ locale charmap
> ANSI_X3.4-1968

Heh.  I didn't see that name for ASCII before and had to look it up
to learn what it means.  :)

The GNU nroff(1) script never heard about that name for ASCII either,
so it falls back to LC_ALL, and since that is "C", it falls back
further to $LESSCHARSET.  Is $LESSCHARSET defined on your system,
and if so, what is its value?

> $ echo '\(bu' | groff -Tascii | hexdump -C
> 00000000 2b 08 6f 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a |+.o.............|
[...]
> $ echo '\(bu' | groff -mtty -Tascii | hexdump -C
> 00000000 2b 08 6f 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a |+.o.............|
[...]

Both are correct.

> $ echo '\(bu' | groff -Tutf8 | hexdump -C
> 00000000 c2 b7 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a |................|
[...]
> $ echo '\(bu' | groff -mtty -Tutf8 | hexdump -C
> 00000000 c2 b7 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a 0a |................|
[...]

  0xc2 - 0xc0 = 0x02; 0x0200 >> 2 = 0x80
  0xb7 - 0x80 = 0x37; 0x37 + 0x80 = 0xb7

So that is U+00B7, MIDDLE DOT.  Perhaps not the best possible choice,
but not wrong either, and certainly valid UTF-8.

> i think i can reconstruct the final rendering command now:
>
> /usr/bin/iconv -f utf-8 -t utf-8 $TMPFILE|/usr/bin/groff -mtty-char \
> -P-c -mandoc -Tutf8|/usr/bin/iconv -f utf-8 -t ANSI_X3.4-1968//translit

That is ugly.  Someone is trying to sell you hardware?
Trying to burn as many processor cycles as possible,
and then some more?  ;-)

Apart from the obvious contortion, it doesn't even look right.
If the input really contained non-ASCII characters, the first
iconv stage would be insufficient; the pipeline would need
preconv(1) before groff.

I suggest fixing your system to simply do this instead:

  /usr/bin/groff -P-c -mandoc -Tascii

That's all that is needed.  Even the -mtty-char is redundant;
your troffrc includes tty.tmac anyway.

> and the first two stages:
>
> $ /usr/bin/iconv -f utf-8 -t utf-8 $TMPFILE | \
> /usr/bin/groff -mtty-char -P-c -mandoc -Tutf8|hexdump -C
[...]
> 000000a0 20 20 20 c2 b7 20 20 20 2d 08 2d 2d 08 2d 62 08 |   ..   -.--.-b.|
[...]
>
> which renders as what i presume is supposed to be a unicode bullet
> (though it's a bit hard to tell in any of my terminal fonts):

MIDDLE DOT, not BULLET.

> though i can't figure out how 0xc2b7 corresponds to codepoint 2022

It doesn't; it's U+00B7, \(md, see the fallback in tty.tmac.

> the final stage, where iconv is supposed to convert it to ascii,
> is where the bullet turns into a question mark:

Yeah, so yours is an iconv(1) problem, not a groff(1) problem.

In this case, iconv(1) looks like a solution in search of a problem,
and like a broken solution, too.  :-/

Just get rid of the pointless iconv(1), and you are fine.

> so it looks like the real problem is that the groff stage in the
> middle of the pipeline isn't generating a unicode output that iconv
> understands how to transliterate

No, the groff output is valid; I don't know why iconv(1) fails to
handle it.

> incidentally, the manpage where i originally discovered this issue
> has a similar problem with \', but as it's using that to represent
> an apostrophe, which is wrong, that's not really your problem

True, that's an acute accent.

> (the odd thing is that groff is rendering that as 0xc2b4, which
> AFAICT *is* correct UTF-8, so maybe my iconv just doesn't have a
> transliteration rule for it?)

Yes, 0xc2b4 = U+00B4 = ACUTE ACCENT, that is fine.
Indeed, you need to debug iconv, not groff.

Yours,
  Ingo
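
P.S.  The byte arithmetic above can be double-checked mechanically.
Here is a minimal sketch in plain sh; decode2 is a made-up helper
name for illustration, not something shipped with groff or iconv,
and it only handles two-byte UTF-8 sequences (110xxxxx 10yyyyyy):

```shell
# Hypothetical helper (an assumption for illustration only):
# decode a two-byte UTF-8 sequence into its Unicode code point.
decode2() {
    lead=$(( $1 ))    # lead byte, e.g. 0xc2
    cont=$(( $2 ))    # continuation byte, e.g. 0xb7
    # Keep the low 5 bits of the lead byte and the low 6 bits of
    # the continuation byte, then join them into one number.
    printf 'U+%04X\n' $(( ((lead & 0x1f) << 6) | (cont & 0x3f) ))
}

decode2 0xc2 0xb7    # the bytes groff -Tutf8 emitted for \(bu
decode2 0xc2 0xb4    # the bytes groff -Tutf8 emitted for \'
```

That should print U+00B7 and U+00B4, matching MIDDLE DOT and ACUTE
ACCENT as discussed above.  A real U+2022 BULLET would be a three-byte
sequence (e2 80 a2), which this sketch deliberately does not handle.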