Re: [Groff] bullets render as question marks

Ingo Schwarze Mon, 30 Nov 2015 18:01:41 -0800

Hi Aaron,

Aaron Davies wrote on Mon, Nov 30, 2015 at 08:00:06PM -0500:


> $ locale charmap
> ANSI_X3.4-1968

Heh.  I hadn't seen that name for ASCII before and had to look it
up to learn what it means.  :)

The GNU nroff(1) script has never heard of that name for ASCII
either, so it falls back to LC_ALL, and since that is "C",
it falls back further to $LESSCHARSET.  Is $LESSCHARSET defined
on your system, and if so, what is its value?
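
To make that fallback order concrete, here is a minimal sketch.  It is
not the real nroff(1) source: pick_device is a made-up helper, and the
patterns and device names it matches are my assumptions.  It only
mirrors the order described above, which is charmap first, then the
locale name, then $LESSCHARSET, then plain ascii.

```sh
# Hypothetical helper, NOT the actual nroff(1) script: choose a groff
# output device from three inputs, tried in this order.
#   $1 = output of `locale charmap`
#   $2 = effective locale name (LC_ALL, else LC_CTYPE, else LANG)
#   $3 = value of $LESSCHARSET
pick_device() {
  case "$1" in
    UTF-8) echo utf8; return ;;
    ISO-8859-1|ISO-8859-15) echo latin1; return ;;
  esac
  case "$2" in
    *.UTF-8) echo utf8; return ;;
    *.ISO-8859-1|*.ISO-8859-15) echo latin1; return ;;
  esac
  case "$3" in
    utf-8) echo utf8 ;;
    latin1) echo latin1 ;;
    *) echo ascii ;;    # nothing recognized: plain ASCII
  esac
}

pick_device ANSI_X3.4-1968 C ""       # prints: ascii
pick_device ANSI_X3.4-1968 C utf-8    # prints: utf8
```

With an unrecognized charmap and LC_ALL=C, only $LESSCHARSET can still
change the outcome, which is why its value matters here.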

> $ echo '\(bu' | groff -Tascii | hexdump -C
> 00000000  2b 08 6f 0a 0a 0a 0a 0a  0a 0a 0a 0a 0a 0a 0a 0a  |+.o.............|
[...]
> $ echo '\(bu' | groff -mtty -Tascii | hexdump -C
> 00000000  2b 08 6f 0a 0a 0a 0a 0a  0a 0a 0a 0a 0a 0a 0a 0a  |+.o.............|
[...]

Both are correct: 2b 08 6f is "+", backspace, "o", an overstruck
approximation of a bullet.

> $ echo '\(bu' | groff -Tutf8 | hexdump -C
> 00000000  c2 b7 0a 0a 0a 0a 0a 0a  0a 0a 0a 0a 0a 0a 0a 0a  |................|
[...]
> $ echo '\(bu' | groff -mtty -Tutf8 | hexdump -C
> 00000000  c2 b7 0a 0a 0a 0a 0a 0a  0a 0a 0a 0a 0a 0a 0a 0a  |................|
[...]

 0xc2 & 0x1f = 0x02; 0x02 << 6 = 0x80   (payload of the lead byte)
 0xb7 & 0x3f = 0x37; 0x80 | 0x37 = 0xb7 (plus the continuation byte)

So that is U+00B7, MIDDLE DOT.
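
The same decoding can be checked with plain shell arithmetic.  This is
a minimal sketch, assuming a POSIX shell; utf8_2byte is a made-up
helper name, and it does no validation of its input.

```sh
# Decode a two-byte UTF-8 sequence given as two hex bytes (no 0x prefix).
# Hypothetical helper; it does not check that the bytes are well-formed.
utf8_2byte() {
  lead=$(( 0x$1 & 0x1f ))   # drop the 110xxxxx prefix, keep 5 payload bits
  cont=$(( 0x$2 & 0x3f ))   # drop the 10xxxxxx prefix, keep 6 payload bits
  printf 'U+%04X\n' $(( (lead << 6) | cont ))
}

utf8_2byte c2 b7   # prints: U+00B7
utf8_2byte c2 b4   # prints: U+00B4
```

A real BULLET, U+2022, needs three bytes in UTF-8 (e2 80 a2), so a
two-byte sequence starting with c2 cannot be one.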

Perhaps not the best possible choice, but not wrong either, and
certainly valid UTF-8.

> i think i can reconstruct the final rendering command now:
> 
> /usr/bin/iconv -f utf-8 -t utf-8 $TMPFILE|/usr/bin/groff -mtty-char \
> -P-c -mandoc -Tutf8|/usr/bin/iconv -f utf-8 -t ANSI_X3.4-1968//translit

That is ugly.  Someone is trying to sell you hardware?  Trying to
burn as many processor cycles as possible, and then some more?  ;-)

Apart from the obvious contortion, it doesn't even look right.  The
first iconv stage converts UTF-8 to UTF-8, which leaves valid input
unchanged.  If the input really contained non-ASCII characters, that
stage would be insufficient; the pipeline would need preconv(1) before
groff.

I suggest fixing your system to simply do this instead:

  /usr/bin/groff -P-c -mandoc -Tascii

That's all that is needed.  Even the -mtty-char is redundant because
your troffrc includes tty.tmac anyway.
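
To convince yourself that no trailing iconv stage is needed, you can
count the non-ASCII bytes in the output.  This is a minimal sketch;
count_non_ascii is a made-up helper name, and the two printf lines feed
it the byte sequences from your hexdumps above rather than real groff
output.

```sh
# Count bytes outside the 7-bit ASCII range on standard input.
# Hypothetical helper built from tr(1) and wc(1) only.
count_non_ascii() {
  LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '
}

printf '+\bo\n'     | count_non_ascii   # -Tascii bullet; prints: 0
printf '\302\267\n' | count_non_ascii   # -Tutf8 middle dot; prints: 2
```

Piping the output of the groff command above into count_non_ascii
should likewise report 0.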

> and the first two stages:
> 
> $ /usr/bin/iconv -f utf-8 -t utf-8 $TMPFILE | \
> /usr/bin/groff -mtty-char -P-c -mandoc -Tutf8|hexdump -C
[...]
> 000000a0  20 20 20 c2 b7 20 20 20  2d 08 2d 2d 08 2d 62 08  |   ..   -.--.-b.|
[...]
> 
> which renders as what i presume is supposed to be a unicode bullet
> (though it's a bit hard to tell in any of my terminal fonts):

MIDDLE DOT, not BULLET.

> though i can't figure out how 0xc2b7 corresponds to codepoint 2022

It doesn't; it is U+00B7, \(md.  See the fallback in tty.tmac.

> the final stage, where iconv is supposed to convert it to ascii,
> is where the bullet turns into a question mark:

Yeah, so yours is an iconv(1) problem, not a groff(1) problem.
In this case, iconv(1) looks like a solution in search of a problem,
and like a broken solution, too.  :-/

Just get rid of the pointless iconv(1), and you are fine.

> so it looks like the real problem is that the groff stage in the
> middle of the pipeline isn't generating a unicode output that iconv
> understands how to transliterate

No, the groff output is valid; I don't know why iconv(1) fails to
handle it.

> incidentally, the manpage where i originally discovered this issue
> has a similar problem with \', but as it's using that to represent
> an apostrophe, which is wrong, that's not really your problem

True, that's an acute accent.

> (the odd thing is that groff is rendering that as 0xc2b4, which
> AFAICT *is* correct UTF-8, so maybe my iconv just doesn't have a
> transliteration rule for it?)

Yes, 0xc2b4 = U+00B4 = ACUTE ACCENT, that is fine.

Indeed, you need to debug iconv, not groff.

Yours,
  Ingo
