Bug#388689: tr: fails to replace umlauts

Bob Proulx Thu, 21 Sep 2006 22:54:41 -0700

martin f krafft wrote:
> Package: coreutils
> File: /usr/bin/tr
> 
> Correct me if I am wrong, but this is a bug to me. The result should
> be 'ü', not 'Ü'.


No argument there.

> piper:~> echo Ü | xxd -ps
> c39c0a
> piper:~> echo Ü | LC_CTYPE=de_CH.UTF-8 sed -e 'y/[:upper:]/[:lower:]/' | xxd -ps
> c39c0a

I don't believe the character classes are expanded within sed's
transliterate command.  You would need to specify them explicitly.
That would be suitable as a wishlist request upstream for sed, not coreutils.
Until then I believe with sed you must use explicit lists such as this:

  echo abcuABCUÜ | LC_CTYPE=de_CH.UTF-8 sed 'y/ABCDEFGHIJKLMNOPQRSTUÜVWXYZ/abcdefghijklmnopqrstuüvwxyz/'
  abcuabcuü
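
If you need this more than once, the explicit list can be wrapped in a
small shell function.  This is only a sketch: the name lower_de is made
up for illustration, the list covers just ASCII plus the three German
umlauts, and it assumes both the script and the input are UTF-8 and
that sed runs in whatever locale the caller has set.

```sh
# lower_de: hypothetical helper name, not part of sed or coreutils.
# Lowercases ASCII letters plus Ä, Ö, Ü by listing every source/target
# pair explicitly, since y/// does not expand [:upper:] or [:lower:].
# Assumes the script text and the input are both UTF-8 encoded.
lower_de() {
  sed 'y/ABCDEFGHIJKLMNOPQRSTUVWXYZÄÖÜ/abcdefghijklmnopqrstuvwxyzäöü/'
}

# Reads stdin, writes the lowercased text to stdout.
echo 'abcABC ÄÖÜ' | lower_de
```

Any other accented letters would have to be added to both lists by
hand, which is exactly why this is a stopgap and not a fix.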

But since the report is filed against File: /usr/bin/tr (coreutils), I
assume you meant tr, and that your intended example was probably
something like this instead:

  echo abcABC Ü | LC_ALL=de_CH.UTF-8 tr '[:upper:]' '[:lower:]'
  abcabc Ü

  echo abcABC Ü | LC_ALL=de_CH.UTF-8 tr -d '[:upper:]'
  abc Ü

  echo abcuABCU Ü | LC_ALL=de_CH.UTF-8 tr -d '[=U=]'
  abcuABC Ü

It is a known deficiency in coreutils in general that the utilities
are not multibyte aware.  The following can be found in the upstream
source package TODO file.

  Adapt tools like wc, tr, fmt, etc. (most of the textutils) to be
    multibyte aware.  The problem is that I want to avoid duplicating
    significant blocks of logic, yet I also want to incur only minimal
    (preferably `no') cost when operating in single-byte mode.
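
To make the single-byte versus multibyte distinction concrete, here is
a rough sketch that prints the UTF-8 bytes of Ü and ü.  Each is two
bytes, so a tool that looks at its input one byte at a time never sees
either of them as a single character.  The helper name utf8_hex is made
up for this example; od, tr and wc are used only on plain bytes here.

```sh
# utf8_hex: hypothetical helper that prints the bytes of its argument
# as one contiguous lowercase hex string.
utf8_hex() {
  printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# Octal escapes spell out the UTF-8 encodings, so this does not depend
# on how the script file itself is encoded.
upper_u=$(printf '\303\234')   # Ü, U+00DC
lower_u=$(printf '\303\274')   # ü, U+00FC

utf8_hex "$upper_u"; echo               # bytes of Ü
utf8_hex "$lower_u"; echo               # bytes of ü
printf '%s' "$upper_u" | wc -c          # number of bytes in Ü
```

The first value is the same c39c that shows up in the xxd output quoted
earlier in this message, without the trailing newline byte.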

Some vendors have hacked in patches to make the utilities multibyte
aware but none of those patches have been considered clean enough to
incorporate into the upstream source yet.  Debian's maintainer has
stated that he does not want to diverge from upstream this radically.
The patches are very messy.  The best course of action would be to get
this resolved upstream with the functionality properly integrated.

Bob

-- 
Bob Proulx <[EMAIL PROTECTED]>
http://www.proulx.com/~bob/
