> Sed silently ignores invalid multibyte sequences in the input (or does
> whatever it does; there is no information about it): no halt, no message,
> no nonzero exit code.
This is unfortunate but expected: "." does not match an invalid sequence.
See the fast path for UTF-8 in check_node_accept_bytes (lib/regexec.c),
where returning 0 means that "." does not match:
    unsigned char c = re_string_byte_at (input, str_idx), d;
    if (BE (c < 0xc2, 1))
      return 0;

    if (str_idx + 2 > input->len)
      return 0;

    d = re_string_byte_at (input, str_idx + 1);
    if (c < 0xe0)
      return (d < 0x80 || d > 0xbf) ? 0 : 2;
    else if (c < 0xf0)
      {
        char_len = 3;
        if (c == 0xe0 && d < 0xa0)
          return 0;
      }
    else if (c < 0xf8)
      {
        char_len = 4;
        if (c == 0xf0 && d < 0x90)
          return 0;
      }
    else if (c < 0xfc)
      {
        char_len = 5;
        if (c == 0xf8 && d < 0x88)
          return 0;
      }
    else if (c < 0xfe)
      {
        char_len = 6;
        if (c == 0xfc && d < 0x84)
          return 0;
      }
    else
      return 0;

    if (str_idx + char_len > input->len)
      return 0;

    for (i = 1; i < char_len; ++i)
      {
        d = re_string_byte_at (input, str_idx + i);
        if (d < 0x80 || d > 0xbf)
          return 0;
      }
    return char_len;
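
To make the consequence concrete, here is a minimal standalone sketch of the same checks. It is not the glibc code: mb_seq_len and first_bad_offset are names made up for illustration, and a plain byte buffer stands in for the re_string_t input. Like the excerpt, mb_seq_len returns 0 for anything that is not an acceptable multibyte sequence, including plain ASCII bytes, which I assume are matched by a separate single-byte path.

```c
#include <stddef.h>

/* Hypothetical helper modeled on the excerpt above: return the length in
   bytes of the multibyte sequence starting at buf[idx], or 0 if the bytes
   there are not accepted (ASCII bytes also yield 0, as in the excerpt).  */
size_t
mb_seq_len (const unsigned char *buf, size_t len, size_t idx)
{
  unsigned char c, d;
  size_t char_len, i;

  if (idx + 2 > len)            /* need a lead byte plus one more byte */
    return 0;
  c = buf[idx];
  if (c < 0xc2)                 /* ASCII, stray continuation, or 0xc0/0xc1 */
    return 0;
  d = buf[idx + 1];
  if (c < 0xe0)
    return (d < 0x80 || d > 0xbf) ? 0 : 2;
  else if (c < 0xf0)
    {
      char_len = 3;
      if (c == 0xe0 && d < 0xa0)        /* overlong 3-byte form */
        return 0;
    }
  else if (c < 0xf8)
    {
      char_len = 4;
      if (c == 0xf0 && d < 0x90)        /* overlong 4-byte form */
        return 0;
    }
  else if (c < 0xfc)
    {
      char_len = 5;
      if (c == 0xf8 && d < 0x88)
        return 0;
    }
  else if (c < 0xfe)
    {
      char_len = 6;
      if (c == 0xfc && d < 0x84)
        return 0;
    }
  else                          /* 0xfe and 0xff never start a sequence */
    return 0;

  if (idx + char_len > len)     /* sequence is truncated */
    return 0;
  for (i = 1; i < char_len; ++i)
    {
      d = buf[idx + i];
      if (d < 0x80 || d > 0xbf) /* not a continuation byte */
        return 0;
    }
  return char_len;
}

/* Hypothetical scanner: return the offset of the first byte that is
   neither ASCII nor the start of an accepted sequence, or len if the
   whole buffer is acceptable.  */
size_t
first_bad_offset (const unsigned char *buf, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      size_t n;
      if (buf[i] < 0x80)
        {
          i++;
          continue;
        }
      n = mb_seq_len (buf, len, i);
      if (n == 0)
        return i;
      i += n;
    }
  return len;
}
```

For the three bytes 'a', 0xff, 'z', first_bad_offset returns 1: that is the position where "." would stop matching, so a pattern such as ".*" cannot consume the rest of the line.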
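If your input can contain invalid multibyte sequences, run sed with LANG=C
so that every byte is treated as a single character (LC_ALL or LC_CTYPE,
if set, take precedence over LANG).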
Do you think it would then be worthwhile to add a `z' command that zaps the
current buffer regardless of whether it contains invalid multibyte sequences?
Paolo