Package: unicode
Version: 2.8-1.1
Severity: normal

unicode(1) makes some effort to make its help text (from the
--help option) conform to the charset nominated by the
environmentally-selected locale, but it gets this wrong.  I see two
distinct failure modes, and I'm not clear whether this is one bug or two.
For the purposes of my test cases here, I'll be specifying a locale only
through the LC_ALL environment variable, in order to avoid the trouble
that I described in Bug#1061103 regarding unicode(1) having faulty logic
for discerning the locale and its charset.

When it perceives a UTF-8 locale, unicode(1) successfully emits help text,
encoded in UTF-8:

$ env - LC_ALL=de_DE.utf8 unicode --help |egrep 'I/O|EU' | LC_ALL=C od -tc
0000000                                                                
0000020                                   I   /   O       c   h   a   r
0000040   a   c   t   e   r       s   e   t   ,       I       a   m    
0000060   g   u   e   s   s   i   n   g       U   T   F   -   8  \n    
0000100                                                                
0000120                               D   i   s   p   l   a   y       A
0000140   S   C   I   I       t   a   b   l   e       (   E   U 342 200
0000160 223   U   K       T   r   a   d   e       a   n   d       C   o
0000200   o   p   e   r   a   t   i   o   n  \n
0000212

When it perceives an ASCII locale, it again emits help text, but the
text doesn't conform to the ASCII encoding.  Instead the en dash is still
emitted as UTF-8 (octal bytes 342 200 223):

$ env - LC_ALL=C unicode --help |egrep 'I/O|EU' | LC_ALL=C od -tc
0000000                                                                
0000020                                   I   /   O       c   h   a   r
0000040   a   c   t   e   r       s   e   t   ,       I       a   m    
0000060   g   u   e   s   s   i   n   g       A   N   S   I   _   X   3
0000100   .   4   -   1   9   6   8  \n                                
0000120                                                                
0000140   D   i   s   p   l   a   y       A   S   C   I   I       t   a
0000160   b   l   e       (   E   U 342 200 223   U   K       T   r   a
0000200   d   e       a   n   d       C   o   o   p   e   r   a   t   i
0000220   o   n  \n
0000223

When it perceives a Latin-1 locale, it fails to emit any help text,
apparently due to that en dash not being encodable in Latin-1:

$ env - LC_ALL=de_DE.iso88591 unicode --help |egrep 'I/O|EU' | LC_ALL=C od -tc
Traceback (most recent call last):
  File "/usr/bin/unicode", line 1014, in <module>
    main()
  File "/usr/bin/unicode", line 941, in main
    (options, arguments) = parser.parse_args()
  File "/usr/lib/python3.9/optparse.py", line 1387, in parse_args
    stop = self._process_args(largs, rargs, values)
  File "/usr/lib/python3.9/optparse.py", line 1427, in _process_args
    self._process_long_opt(rargs, values)
  File "/usr/lib/python3.9/optparse.py", line 1501, in _process_long_opt
    option.process(opt, value, values, self)
  File "/usr/lib/python3.9/optparse.py", line 784, in process
    return self.take_action(
  File "/usr/lib/python3.9/optparse.py", line 807, in take_action
    parser.print_help()
  File "/usr/lib/python3.9/optparse.py", line 1647, in print_help
    file.write(self.format_help())
UnicodeEncodeError: 'latin-1' codec can't encode character '\u2013' in position 1661: ordinal not in range(256)
0000000
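
The encoding side of these three outcomes can be reproduced outside
unicode(1) with a few lines of Python.  This is only an illustrative
sketch: the sample string is copied from the help text shown above, and
the helper name try_encode is hypothetical, not something taken from the
unicode source.

```python
# Illustrative only: try_encode is a hypothetical helper, not part of unicode(1).
SAMPLE = "EU\u2013UK Trade and Cooperation"  # contains U+2013 EN DASH


def try_encode(text, charset):
    """Return the encoded bytes, or None if charset cannot represent text."""
    try:
        return text.encode(charset)
    except UnicodeEncodeError:
        return None


for charset in ("utf-8", "ascii", "latin-1"):
    print(charset, try_encode(SAMPLE, charset))
```

UTF-8 yields the three bytes seen in the od output (342 200 223 in octal),
while a strict ASCII or Latin-1 encode cannot represent the character at all.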

unicode(1) ought to successfully emit help text in any of these locales,
and the help text ought to always conform to the environmentally-selected
locale.  For ASCII and Latin-1 locales this implies that it can't use
that en dash, and must substitute an ASCII "-".  I have no strong opinion
about whether it should use the en dash in a UTF-8 locale.
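
As a rough sketch of the substitution I have in mind (not a patch against
the actual unicode source), something along these lines would do.  The
function name conform_to_charset and the ASCII_FALLBACKS table are
assumptions made up for illustration.

```python
# Hypothetical sketch: the helper name and fallback table are assumptions,
# not code from unicode(1).
ASCII_FALLBACKS = {"\u2013": "-"}  # EN DASH -> ASCII HYPHEN-MINUS


def conform_to_charset(text, charset):
    """Return text unchanged if charset can encode it, else a degraded copy."""
    try:
        text.encode(charset)
        return text
    except UnicodeEncodeError:
        for src, dst in ASCII_FALLBACKS.items():
            text = text.replace(src, dst)
        # Anything still unencodable becomes "?" instead of raising.
        return text.encode(charset, errors="replace").decode(charset)


for charset in ("utf-8", "ascii", "latin-1"):
    print(charset, conform_to_charset("EU\u2013UK Trade and Cooperation", charset))
```

The sketch takes the charset name as a parameter; how unicode(1) should
determine that name is the separate matter covered in Bug#1061103.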

-zefram
