Package: sed
Version: 4.1.5-1
Severity: normal

*** Please type your report below this line ***

First of all, a note: I see no statement about UTF-8 or
i18n/l10n readiness of this sed version in the man or info pages.
Nevertheless, it appears to support both.

Commands like:
sed 's/c/\U&/g'
sed 's/c/\U&\E/g'
sed 's/c/\u&/g'
sed 's/c/\u&\E/g'
occasionally produce broken output when the 'c' character is a
character (or sequence of characters) from a non-ASCII national
alphabet (Russian in my case).

Examples:

# First, I create the simplest possible test file.
# The literal is the Russian analog of the English "da d".

# $ echo -n "da d" > sed.in  # English
$ echo -n "да д" > sed.in

# There are seven octets in the dump: a pair for each Russian
#   UTF-8 char, plus the fifth octet, which is an ASCII space.

$ hexdump -C sed.in
00000000  d0 b4 d0 b0 20 d0 b4                      \
|.... ..|
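
The same test file can also be built without typing any Russian text, which rules out the terminal or editor encoding as a factor. This is only a sketch: the `hex` helper is a hypothetical name introduced here, and the octal escapes are simply the seven octets shown in the dump above.

```shell
# Write the seven octets of "да д" explicitly as octal escapes
# (d0 b4, d0 b0, space, d0 b4), independent of the terminal encoding.
printf '\320\264\320\260 \320\264' > sed.in

# Hypothetical helper: dump a stream as space-separated hex octets
# on a single line, which is easier to compare than hexdump -C output.
hex() { od -An -v -tx1 | xargs; }

hex < sed.in    # prints: d0 b4 d0 b0 20 d0 b4
```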

# Now sed is run with a command whose English analog is:
#   sed 's/d/\u&/g' < sed.in > sed.out

# $ sed 's/d/\u&/g' < sed.in > sed.out ;  echo $?  # English
$ sed 's/д/\u&/g' < sed.in > sed.out ;  echo $?
0

# Now the dump shows: the first lowercase letter (it matches)
#   has been correctly converted to uppercase; the second letter
#   does not match and stays as is; then the space; then something
#   crazy (7 octets) in the place where the third letter (the 4th
#   char in the literal) was...

$ hexdump -C sed.out
00000000  d0 94 d0 b0 20 fc 88 81  99 b5 82 b4        \
|.... .......|

# And a slightly different result for \U:

# $ sed 's/d/\U&/g' < sed.in > sed.out  # English
$ sed 's/д/\U&/g' < sed.in > sed.out

# Now the last character is uppercased as it should be,
#   but something strange (6 octets) is inserted before it.

$ hexdump -C sed.out
00000000  d0 94 d0 b0 20 fc 88 81  99 b5 82 d0 94      \
|.... ........|
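
The six unexpected octets fc 88 81 99 b5 82 have the shape of a single six-octet sequence from the older UTF-8 definition (RFC 2279), a form that RFC 3629 later disallowed. As a rough sketch, the helper below decodes such a sequence to show what value it carries. The name `decode6` is hypothetical and is not part of sed or any other tool.

```shell
# Decode one six-octet sequence of the older UTF-8 scheme:
# lead octet 1111110x, then five continuation octets 10xxxxxx.
# Arguments: the six octets as hex digits, without a 0x prefix.
decode6() {
    value=$(( 0x$1 & 0x01 ))    # the single payload bit of the lead octet
    shift
    for octet in "$@"; do
        # each continuation octet contributes its low six bits
        value=$(( (value << 6) | (0x$octet & 0x3f) ))
    done
    printf '0x%08X\n' "$value"
}

decode6 fc 88 81 99 b5 82    # prints 0x08059D42
```

The decoded value is far above U+10FFFF, the highest Unicode code point, so these octets do not correspond to any character in the input. In the \u dump the same six octets appear, followed by b4, the second octet of the original lowercase letter. That would be consistent with a garbage wide-character value being encoded, but this report does not establish the cause.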


It seems that sed's s///g 'dislikes' input that mixes one-octet
and two-octet chars, such as Russian chars with spaces or newlines,
or a mix of Russian and ASCII chars. A long line of Russian
chars without spaces seems to be upper/lower-cased correctly,
as expected.
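
The observation above could be checked more systematically with a small script along these lines. It is only a sketch: `hex` and `check` are hypothetical helper names, the expected octets assume a correct conversion of the lowercase letter (d0 b4) to its uppercase form (d0 94), and it should be run under a UTF-8 locale such as the ru_RU.UTF-8 one listed below.

```shell
# Hypothetical helper: dump a stream as space-separated hex octets.
hex() { od -An -v -tx1 | xargs; }

# Feed one input to sed and compare the resulting octets with the
# octets expected from a correct \U conversion.
# Arguments: label, input as a printf format, expected hex octets.
check() {
    actual=$(printf "$2" | sed 's/д/\U&/g' | hex)
    if [ "$actual" = "$3" ]; then
        echo "$1: ok"
    else
        echo "$1: got [$actual], expected [$3]"
    fi
}

check 'Russian only'      '\320\264\320\264\320\264'  'd0 94 d0 94 d0 94'
check 'Russian and space' '\320\264\320\260 \320\264' 'd0 94 d0 b0 20 d0 94'
check 'Russian and ASCII' '\320\264x\320\264'         'd0 94 78 d0 94'
```

Each line of output names the case and either reports "ok" or shows the octets sed actually produced, so mixed and unmixed inputs can be compared side by side.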



-- System Information:
Debian Release: 4.0
  APT prefers stable
  APT policy: (500, 'stable')
Architecture: i386 (i686)
Shell:  /bin/sh linked to /bin/bash
Kernel: Linux 2.6.18.test.001
Locale: LANG=ru_RU.UTF-8, LC_CTYPE=ru_RU.UTF-8 (charmap=UTF-8)

Versions of packages sed depends on:
ii  libc6                  2.3.6.ds1-13etch7 GNU C Library: Shared libraries

sed recommends no packages.

-- no debconf information
