read fails on null-byte: v4.1.7 FreeBSD 8.0 (amd64)

2011-11-23 Thread Matthew Story
Bash Version: GNU bash, version 4.1.7(1)-release (amd64-portbld-freebsd8.0)
OS: FreeBSD 8.0
Hardware: amd64
Environment: jail
Description: read terminates reading all records at first null-byte ( chr(0) )
in a stream. Null-bytes are valid ASCII characters and should not cause read to
stop reading a line. This behavior is not reproducible using Bourne shell.

Steps To Reproduce:

[bash ~]$ printf 'foo\0bar\n' | while read line; do echo "$line"; done
foo
[bash ~]$ # verify that printf is yielding the expected output
[bash ~]$ printf 'foo\0bar\n' | od -a
0000000    f   o   o nul   b   a   r  nl
0000010
[bash ~]$ # verify that it is not just echo with awk
[bash ~]$ printf 'foo\0bar\n' | while read line; do awk -v line="$line" 'BEGIN { print line; }'; done | od -a
0000000    f   o   o  nl
0000004
[bash ~]$ # same awk test with a subshell and no read -- note that null-byte is removed, but feed does not end
[bash ~]$ awk -v line=`printf 'foo\0bar\n'` 'BEGIN { print line; }' | od -a
0000000    f   o   o   b   a   r  nl
0000007
[bash ~]$ # behavior with multiple lines
[bash ~]$ printf 'foo\0bar\nbaz\n' | while read line; do echo "$line"; done
foo
baz
[bash ~]$ # behavior with read -r
[bash ~]$ printf 'foo\0bar\nbaz\n' | while read -r line; do echo "$line"; done
foo
baz

Behavior in bourne shell:

$ # note that the null-byte is dropped, but the line is read
$ printf 'foo\0bar\nbaz\n' | while read line; do echo "$line"; done | od -a
0000000    f   o   o   b   a   r  nl   b   a   z  nl
0000013
$ # test with awk instead of echo
$ printf 'foo\0bar\nbaz\n' | while read line; do awk -v line="$line" 'BEGIN { print line; }'; done | od -a
0000000    f   o   o   b   a   r  nl   b   a   z  nl
0000013
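
The transcripts above can be condensed into a small probe that reports how whichever shell runs it treats a NUL inside a line. This is only a sketch: the function name `nul_read_behavior` and its three labels are made up for illustration, and the answer depends on the shell and its version.

```shell
# Hypothetical probe (not part of the original report): classify how the
# running shell's `read` treats a NUL byte in the middle of a line.
nul_read_behavior() {
    printf 'foo\0bar\n' | {
        IFS= read -r line
        case $line in
            foo)    echo "truncated at NUL" ;;   # as in the bash 4.1.7 transcript
            foobar) echo "NUL discarded" ;;      # as in the second transcript
            *)      echo "other" ;;              # e.g. the NUL is kept in the value
        esac
    }
}

nul_read_behavior
```

Running the same file under several shells gives a quick side-by-side comparison without reading od output by eye.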


Re: read fails on null-byte: v4.1.7 FreeBSD 8.0 (amd64)

2011-11-23 Thread Matthew Story
On Nov 23, 2011, at 4:47 PM, Chet Ramey wrote:

> On 11/23/11 9:03 AM, Matthew Story wrote:
>> Bash Version: GNU bash, version 4.1.7(1)-release (amd64-portbld-freebsd8.0)
>> OS: FreeBSD 8.0
>> Hardware: amd64
>> Environment: jail
>> Description: read terminates reading all records at first null-byte ( chr(0)
>> ) in a stream. Null-bytes are valid ASCII characters and should not cause
>> read to stop reading a line. This behavior is not reproducible using Bourne
>> shell.
> 
> Bash doesn't stop reading at the NUL; it reads it and the rest of the line
> up to a newline.  Since bash treats the line read as a C string, the NUL
> terminates the value assigned to `foo'.  

it seems to terminate all assignment of the current line at the first `\0', not 
merely the value assigned to `foo':

[bash ~]$ printf '%s\0 %s\n%s\n' foo bar baz | while read foo bar; do echo "$foo" "$bar"; done | od -a
0000000    f   o   o  sp  nl   b   a   z  sp  nl
0000012

it's clear from this example (and from what you've said above) that read does 
not halt entirely here as it processes the next line correctly, and ` bar' is 
not left in the buffer after first read.  What I find confusing about this 
comment and Greg's comment (copied in below) ...

> [Greg Wooledge ]
> What happens here is bash reads until the newline, but only the "foo"
> part is visible in the variable, because the NUL effectively ends the
> string.  

... is that if the line were fully assigned along the lines of my understanding 
of read, foo should be `foo', up to the null-byte which effectively terminates 
the string, and then bar should be `bar'.
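
One reading that does fit the od output above is that the whole line is cut at the first NUL before it is split into fields, so nothing is left for `bar'. The sketch below only emulates that order of operations in plain sh; it is not bash's code, and the function name and the \002 sentinel (assumed absent from the input) are hypothetical.

```shell
# Emulation only, not bash's implementation: cut each line at its first NUL,
# then split what is left into two fields, as `read foo bar` would.
truncate_then_split() {
    sentinel=$(printf '\002')   # stand-in for NUL; assumed absent from the input
    tr '\000' '\002' | while IFS= read -r line; do
        line=${line%%"$sentinel"*}   # keep only the text before the first NUL
        printf '%s\n' "$line" | {
            read -r foo bar
            printf '[%s] [%s]\n' "$foo" "$bar"
        }
    done
}

printf '%s\0 %s\n%s\n' foo bar baz | truncate_then_split
```

This prints `[foo] []` and `[baz] []`: an empty second field on both lines, which is what the `foo sp nl baz sp nl` dump shows.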



> Bash doesn't drop NULs like the
> FreeBSD (not the Bourne) shell.

FreeBSD sh indeed, apologies for the misstatement.

> 
> Chet
> -- 
> ``The lyf so short, the craft so long to lerne.'' - Chaucer
> ``Ars longa, vita brevis'' - Hippocrates
> Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/
> 



Re: read fails on null-byte: v4.1.7 FreeBSD 8.0 (amd64)

2011-11-23 Thread Matthew Story

>> Bash doesn't drop NULs like the
>> FreeBSD (not the Bourne) shell.

one last note on bash dropping NULs:

[bash ~]$ foo=`printf 'foo\0bar'`
[bash ~]$ echo $foo |od -a
0000000    f   o   o   b   a   r  nl
0000007

> 
> FreeBSD sh indeed, apologies for the misstatement.
> 



Re: read fails on null-byte: v4.1.7 FreeBSD 8.0 (amd64)

2011-11-23 Thread Matthew Story

On Nov 23, 2011, at 7:09 PM, Chet Ramey wrote:

> On 11/23/11 6:54 PM, Matthew Story wrote:
>> On Nov 23, 2011, at 4:47 PM, Chet Ramey wrote:
>> 
>>> On 11/23/11 9:03 AM, Matthew Story wrote:
>>>> [... snip]
> 
> Yes, sorry.  That's what the "bash treats the line read as a C string"
> was intended to imply.  Since the line read is a C string, the NUL
> terminates it and what remains is assigned to the named variables.  I
> should have used `line' in my explanation instead of `foo'.

I understand that the underlying implementation of the bash builtins is `C', 
and I understand that `C' strings are NUL terminated.  It seems unreasonable to 
me to expect understanding of this implementation detail when using bash to 
read streams into variables via the `read' builtin.  Furthermore, neither the 
man page nor the gnu website documents this behavior of bash:

read
  read [-ers] [-a aname] [-d delim] [-i text] [-n nchars] [-N nchars]
       [-p prompt] [-t timeout] [-u fd] [name ...]
One line is read from the standard input, or from the file descriptor fd
supplied as an argument to the -u option, and the first word is assigned to the
first name, the second word to the second name, and so on, with leftover words
and their intervening separators assigned to the last name. If there are fewer
words read from the input stream than names, the remaining names are assigned
empty values. The characters in the value of the IFS variable are used to split
the line into words. The backslash character ‘\’ may be used to remove any
special meaning for the next character read and for line continuation. If no
names are supplied, the line read is assigned to the variable REPLY. The return
code is zero, unless end-of-file is encountered, read times out (in which case
the return code is greater than 128), or an invalid file descriptor is supplied
as the argument to -u.

I personally do not read "One line" as meaning "One string of characters 
terminated either by a null byte or a new-line", I read it as "One string of 
characters terminated by a new-line".  But "One string of characters terminated 
either by a null byte or a new line" is not the actual functionality.  The 
actual functionality is:

"One line is read from the standard input, or from the file descriptor fd 
supplied as an argument to the -u option, then read byte-wise up to the first 
contained NUL, or end of string, ..."

Furthermore, I do not see the use-case for this behavior ... I simply cannot 
fathom a case of I/O redirection in shell where I would choose to inject a NUL 
byte to coerce this sort of behavior from the read builtin, and can't imagine 
that anyone is relying on this `C string' feature of read currently in bash, 
especially considering that it is not consistent with NUL handling in other 
assignments in bash:

[matt@matt0 ~]$ foo=`printf 'foo\0bar'`; echo "$foo" | od -a
0000000    f   o   o   b   a   r  nl
0000007
[bash ~]$ foo=$(printf 'foo\0bar'); echo "$foo" | od -a
0000000    f   o   o   b   a   r  nl
0000007

which strip NUL.

I see one of three possible resolutions here:

1. NUL bytes do not terminate variable assignment from `read', behavior of 
echo/variable assignments persists as is
2. NUL bytes are stripped by read on assignment, and this functionality is 
documented as expected.
3. the existing functionality of the system is documented in the man-page and 
on gnu.org as expected

I would prefer the first, and would be happy to attempt a patch, 
if that's useful.
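
In the meantime, the effect of resolution 2 can be approximated outside the shell by deleting NUL bytes before `read' sees them. A minimal sketch, assuming a POSIX tr; the function name is hypothetical:

```shell
# Sketch of resolution 2 done outside the shell: delete NUL bytes with tr
# before `read` sees them, so every line is assigned in full.
read_lines_without_nul() {
    tr -d '\000' | while IFS= read -r line; do
        printf '%s\n' "$line"
    done
}

printf 'foo\0bar\nbaz\n' | read_lines_without_nul
```

With the input above this prints `foobar` and `baz`. The obvious cost is that the NULs are gone for good, so it does not help anyone who needs them preserved.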

cheers,
-matt


Additional Notes:

The only occurrence of the pattern `NUL' in the FreeBSD man-page for bash is:

   Pattern Matching

   Any character that appears in a pattern, other than the special pattern
   characters described below, matches itself.  The NUL character may  not
   occur  in  a pattern.  A backslash escapes the following character; the
   escaping backslash is discarded when  matching.   The  special  pattern
   characters must be quoted if they are to be matched literally.

All other references in the man-page are to the null string (empty string), not 
to an explicit NUL byte (i.e. ASCII 0). The same is true of the gnu.org 
documentation.




Re: read fails on null-byte: v4.1.7 FreeBSD 8.0 (amd64)

2011-11-28 Thread Matthew Story
Attached is a patch to discard null-bytes during read. It preserves the 
functionality Greg demonstrated (not sure if this is desirable ...) wherein a 
delim of '' (e.g. -d '') will split on null byte.

With patch, read functions this way:

bash-4.2$ printf 'foo\0bar\n' | while read line; do echo "$line"; done
foobar
bash-4.2$ printf 'foo\0bar\0' | while read -d '' line; do echo "$line"; done
foo
bar

I find this behavior incongruent with what I expect from setting things like 
IFS to the empty string (e.g. delim is every character), but it seems like it 
is already in use.  I have a patch that makes read terminate the input line 
after every character for -d '', and after a null-byte for -d '\0'. If you are 
interested in that functionality, I'll send that patch for your consideration 
as well.
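
For reference, the existing -d '' usage that the patch leaves alone looks roughly like the sketch below. It needs bash, since -d is a bash extension to read, and the function name is hypothetical.

```shell
# Requires bash: `-d ''` makes read stop at each NUL instead of each newline.
# IFS= and -r keep leading blanks and backslashes in the record intact.
print_nul_records() {
    while IFS= read -r -d '' record; do
        printf '<%s>\n' "$record"
    done
}

printf 'foo bar\0baz\0' | print_nul_records
```

Under bash this prints `<foo bar>` and `<baz>`. A trailing record with no terminating NUL would be dropped by this loop, because read returns non-zero at end-of-file.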


git am patch for read builtin



0001-Strip-null-bytes-from-read-when-DELIM-is-not.patch.gz
Description: GNU Zip compressed data


git am patch for man page and texi



0002-Update-documentation-both-man-and-info-to-reflect-re.patch.gz
Description: GNU Zip compressed data


I have patches for the generated documentation, but they are quite large, if 
you want them I'm happy to send them along as well.

cheers,
-matt



On Nov 24, 2011, at 12:08 AM, Chet Ramey wrote:

> On 11/23/11 9:44 PM, Matthew Story wrote:
>> 
>> On Nov 23, 2011, at 7:09 PM, Chet Ramey wrote:
>> 
>>> On 11/23/11 6:54 PM, Matthew Story wrote:
>>>> On Nov 23, 2011, at 4:47 PM, Chet Ramey wrote:
>>>> 
>>>>> On 11/23/11 9:03 AM, Matthew Story wrote:
>>>>>> [... snip]
>>> 
>>> Yes, sorry.  That's what the "bash treats the line read as a C string"
>>> was intended to imply.  Since the line read is a C string, the NUL
>>> terminates it and what remains is assigned to the named variables.  I
>>> should have used `line' in my explanation instead of `foo'.
>> 
>> I understand that the underlying implementation of the bash builtins is
>> `C', and I understand that `C' strings are NUL terminated.  It seems
>> unreasonable to me to expect understanding of this implementation detail
>> when using bash to read streams into variables via the `read' builtin.
> 
> I took a look around at Posix and some other shells.  Posix passes on
> the issue completely: the input to read may not contain NUL bytes at all.
> The Bourne shell, from v7 to SVR4.2, uses NUL as a line terminator.  Other
> shells, including ksh93, ash and pdksh derivatives like dash and mksh,
> discard NUL bytes in read.  zsh doesn't discard NULs and handles them
> pretty well, putting them into a variable's value.
> 
> The discard behavior seems fairly standard, and I will look at putting it
> into the next version of bash.
> 
> Chet
> 



Re: read fails on null-byte: v4.1.7 FreeBSD 8.0 (amd64)

2011-11-29 Thread Matthew Story

On Nov 29, 2011, at 9:39 AM, Chet Ramey wrote:

> On 11/29/11 8:29 AM, Greg Wooledge wrote:
> 
>> [...snip]
> 
> It's possible to have both.  You can handle matching a NUL delimiter and
> skip NUL bytes in the input if the delimiter isn't NUL.

This is exactly the behavior that my patch provides, along with documentation 
that -d '' terminates the input line on '\0'.

>  That allows the
> bash behavior and compatibility with other shells that don't provide `-d'.
> 
> There is already `read -n 1' to read only a single character from the input
> stream.  I don't see much value in translating '\0' to 0 for `-d'.

Fair point.




Re: read fails on null-byte: v4.1.7 FreeBSD 8.0 (amd64)

2011-11-29 Thread Matthew Story
Re-sending in text form instead of attached gzip ... as this seems to be the 
prevailing style on list ... this patch discards null-bytes in the read 
builtin, and documents both the existing -d '' behavior and the expected 
discard behavior without it, in both the man page and the info manual.

(run with patch -p1)

commit df4bdef6d6066beeac57cf36f54cff7bde8f5ea3
Author: Matthew Story 
Date:   Mon Nov 28 22:51:59 2011 -0500

Update documentation (both man and info) to reflect read NUL character
behavior, and -d ''. 

Signed-off-by: Matthew Story 

diff --git a/doc/bash.1 b/doc/bash.1
index 0ba4f8e..ef3f174 100644
--- a/doc/bash.1
+++ b/doc/bash.1
@@ -8284,6 +8284,8 @@ The characters in
 are used to split the line into words.
 The backslash character (\fB\e\fP) may be used to remove any special
 meaning for the next character read and for line continuation.
+Unless '' (an empty string) is supplied as an argument to the 
+\fB\-d\fP option, NUL characters are stripped from input.  
 Options, if supplied, have the following meanings:
 .RS 
 .PD 0
@@ -8299,7 +8301,7 @@ Other \fIname\fP arguments are ignored.
 .TP 
 .B \-d \fIdelim\fP
 The first character of \fIdelim\fP is used to terminate the input line,
-rather than newline.
+rather than newline. If '' (an empty string) is supplied as \fIdelim\fP, NUL characters are used to terminate the input line.
 .TP
 .B \-e
 If the standard input
diff --git a/doc/bashref.texi b/doc/bashref.texi
index b4fd8d3..36f61d3 100644
--- a/doc/bashref.texi
+++ b/doc/bashref.texi
@@ -3890,6 +3890,8 @@ variable @env{REPLY}.
 The return code is zero, unless end-of-file is encountered, @code{read}
 times out (in which case the return code is greater than 128), or an
 invalid file descriptor is supplied as the argument to @option{-u}.
+Unless '' (an empty string) is supplied as an argument to 
+@option{-d}, NUL characters are stripped from input.

 Options, if supplied, have the following meanings:

@@ -3902,7 +3904,8 @@ Other @var{name} arguments are ignored.

 @item -d @var{delim}
 The first character of @var{delim} is used to terminate the input line,
-rather than newline.
+rather than newline. If '' (an empty string) is supplied as @var{delim},
+NUL characters are used to terminate the input line.

 @item -e
 Readline (@pxref{Command Line Editing}) is used to obtain the line.

commit 7b9ce5228c6483729d7823c853c9eadcc7721fb3
Author: Matthew Story 
Date:   Mon Nov 28 22:50:33 2011 -0500

Strip null-bytes from read when DELIM is not `'.

Preserves what seems to be used behavior in read -d '' (split on
NUL character)

Signed-off-by: Matthew Story 

diff --git a/builtins/read.def b/builtins/read.def
index 1b87faa..47074bb 100644
--- a/builtins/read.def
+++ b/builtins/read.def
@@ -563,6 +563,8 @@ read_builtin (list)
  saw_escape++;
  input_string[i++] = CTLESC;
}
+  if (c == '\0')  /* drop literal NUL if delim is not '' */
+   continue;

 add_char:
   input_string[i++] = c;