read fails on null-byte: v4.1.7 FreeBSD 8.0 (amd64)
Bash Version: GNU bash, version 4.1.7(1)-release (amd64-portbld-freebsd8.0)
OS: FreeBSD 8.0
Hardware: amd64
Environment: jail

Description: read terminates reading all records at the first null-byte ( chr(0) ) in a stream. Null-bytes are valid ASCII characters and should not cause read to stop reading a line. This behavior is not reproducible using bourne shell.

Steps To Reproduce:

[bash ~]$ printf 'foo\0bar\n' | while read line; do echo "$line"; done
foo
[bash ~]$ # verify that printf is yielding the expected output
[bash ~]$ printf 'foo\0bar\n' | od -a
0000000   f   o   o nul   b   a   r  nl
0000010
[bash ~]$ # verify that it is not just echo with awk
[bash ~]$ printf 'foo\0bar\n' | while read line; do awk -v line="$line" 'BEGIN { print line; }'; done | od -a
0000000   f   o   o  nl
0000004
[bash ~]$ # same awk test with a subshell and no read -- note that null-byte is removed, but feed does not end
[bash ~]$ awk -v line=`printf 'foo\0bar\n'` 'BEGIN { print line; }' | od -a
0000000   f   o   o   b   a   r  nl
0000007
[bash ~]$ # behavior with multiple lines
[bash ~]$ printf 'foo\0bar\nbaz\n' | while read line; do echo "$line"; done
foo
baz
[bash ~]$ # behavior with read -r
[bash ~]$ printf 'foo\0bar\nbaz\n' | while read -r line; do echo "$line"; done
foo
baz

Behavior in bourne shell:

$ # note that the null-byte is dropped, but the line is read
$ printf 'foo\0bar\nbaz\n' | while read line; do echo "$line"; done | od -a
0000000   f   o   o   b   a   r  nl   b   a   z  nl
0000013
$ # test with awk instead of echo
$ printf 'foo\0bar\nbaz\n' | while read line; do awk -v line="$line" 'BEGIN { print line; }'; done | od -a
0000000   f   o   o   b   a   r  nl   b   a   z  nl
0000013
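A possible workaround for the behavior reported above, sketched under the assumption that dropping NUL bytes (as the second shell in the report does) is acceptable: filter the stream through tr before read ever sees it. The helper name strip_nul_lines is hypothetical and not part of bash.

```shell
# Hypothetical helper: delete NUL bytes from stdin, then read line by line.
strip_nul_lines() {
    tr -d '\000' | while IFS= read -r line; do
        printf '%s\n' "$line"
    done
}

printf 'foo\0bar\nbaz\n' | strip_nul_lines
# prints:
# foobar
# baz
```

Because the NUL bytes are gone before read runs, the result should not depend on how a particular shell's read treats them.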
Re: read fails on null-byte: v4.1.7 FreeBSD 8.0 (amd64)
On Nov 23, 2011, at 4:47 PM, Chet Ramey wrote:

> On 11/23/11 9:03 AM, Matthew Story wrote:
>> Bash Version: GNU bash, version 4.1.7(1)-release (amd64-portbld-freebsd8.0)
>> OS: FreeBSD 8.0
>> Hardware: amd64
>> Environment: jail
>> Description: read terminates reading all records at first null-byte ( chr(0)
>> ) in a stream, null-bytes are valid ascii characters and should not cause
>> read to stop reading
>> a line this behavior is not reproducible using bourne shell.
>
> Bash doesn't stop reading at the NUL; it reads it and the rest of the line
> up to a newline.  Since bash treats the line read as a C string, the NUL
> terminates the value assigned to `foo'.

It seems to terminate all assignment of the current line at the first `\0', not merely the value assigned to `foo':

[bash ~]$ printf '%s\0 %s\n%s\n' foo bar baz | while read foo bar; do echo "$foo" "$bar"; done | od -a
0000000   f   o   o  sp  nl   b   a   z  sp  nl
0000012

It's clear from this example (and from what you've said above) that read does not halt entirely here, as it processes the next line correctly, and ` bar' is not left in the buffer after the first read. What I find confusing about this comment and Greg's comment (copied in below) ...

> [Greg Wooledge ]
> What happens here is bash reads until the newline, but only the "foo"
> part is visible in the variable, because the NUL effective ends the
> string.

... is that if the line were fully assigned along the lines of my understanding of read, foo should be `foo', up to the null-byte which effectively terminates the string, and then bar should be `bar'.

> Bash doesn't drop NULs like the
> FreeBSD (not the Bourne) shell.

FreeBSD sh indeed, apologies for the misstatement.

>
> Chet
> --
> ``The lyf so short, the craft so long to lerne.'' - Chaucer
> ``Ars longa, vita brevis'' - Hippocrates
> Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/
Re: read fails on null-byte: v4.1.7 FreeBSD 8.0 (amd64)
>> Bash doesn't drop NULs like the
>> FreeBSD (not the Bourne) shell.

One last note on bash dropping NULs:

[bash ~]$ foo=`printf 'foo\0bar'`
[bash ~]$ echo $foo | od -a
0000000   f   o   o   b   a   r  nl
0000007

> FreeBSD sh indeed, apologies for the misstatement.
Re: read fails on null-byte: v4.1.7 FreeBSD 8.0 (amd64)
On Nov 23, 2011, at 7:09 PM, Chet Ramey wrote:

> On 11/23/11 6:54 PM, Matthew Story wrote:
>> On Nov 23, 2011, at 4:47 PM, Chet Ramey wrote:
>>
>>> On 11/23/11 9:03 AM, Matthew Story wrote:
>>>> [... snip]
>
> Yes, sorry.  That's what the "bash treats the line read as a C string"
> was intended to imply.  Since the line read is a C string, the NUL
> terminates it and what remains is assigned to the named variables.  I
> should have used `line' in my explanation instead of `foo'.

I understand that the underlying implementation of the bash builtins is `C', and I understand that `C' strings are NUL terminated. It seems unreasonable to me to expect understanding of this implementation detail when using bash to read streams into variables via the `read' builtin. Furthermore, neither the man-page nor the gnu website document this behavior of bash:

read
    read [-ers] [-a aname] [-d delim] [-i text] [-n nchars] [-N nchars] [-p prompt] [-t timeout] [-u fd] [name ...]

    One line is read from the standard input, or from the file descriptor fd supplied as an argument to the -u option, and the first word is assigned to the first name, the second word to the second name, and so on, with leftover words and their intervening separators assigned to the last name. If there are fewer words read from the input stream than names, the remaining names are assigned empty values. The characters in the value of the IFS variable are used to split the line into words. The backslash character ‘\’ may be used to remove any special meaning for the next character read and for line continuation. If no names are supplied, the line read is assigned to the variable REPLY. The return code is zero, unless end-of-file is encountered, read times out (in which case the return code is greater than 128), or an invalid file descriptor is supplied as the argument to -u.

I personally do not read "One line" as meaning "One string of characters terminated either by a null byte or a new-line"; I read it as "One string of characters terminated by a new-line". But "One string of characters terminated either by a null byte or a new line" is not the actual functionality. The actual functionality is:

"One line is read from the standard input, or from the file descriptor fd supplied as an argument to the -u option, then read byte-wise up to the first contained NUL, or end of string, ..."

Furthermore, I do not see the use-case for this behavior ... I simply cannot fathom a case of I/O redirection in shell where I would choose to inject a NUL byte to coerce this sort of behavior from the read builtin, and can't imagine that anyone is relying on this `C string' feature of read currently in bash, especially considering that it is not consistent with NUL handling in other assignments in bash, which strip NUL:

[matt@matt0 ~]$ foo=`printf 'foo\0bar'`; echo "$foo" | od -a
0000000   f   o   o   b   a   r  nl
0000007
[bash ~]$ foo=$(printf 'foo\0bar'); echo "$foo" | od -a
0000000   f   o   o   b   a   r  nl
0000007

I see one of three possible resolutions here:

1. NUL bytes do not terminate variable assignment from `read', behavior of echo/variable assignments persists as is
2. NUL bytes are stripped by read on assignment, and this functionality is documented as expected.
3. the existing functionality of the system is documented in the man-page and on gnu.org as expected

I would prefer the first, and would be happy to attempt to provide a patch, if that's useful.

cheers,
-matt

Additional Notes:

The only occurrence of the pattern `NUL' in the FreeBSD man-page for bash is:

    Pattern Matching
    Any character that appears in a pattern, other than the special pattern characters described below, matches itself. The NUL character may not occur in a pattern. A backslash escapes the following character; the escaping backslash is discarded when matching. The special pattern characters must be quoted if they are to be matched literally.

All other references in the man-page are to the null string (empty string), not to an explicit NUL byte (e.g. ascii 0). The same is true of the gnu.org documentation.
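The word-assignment rule in the read documentation quoted in this message can be illustrated with a short sketch. It deliberately uses NUL-free input, so it does not depend on the behavior under discussion; the function and variable names are arbitrary and only for illustration.

```shell
# More words than names: the last name receives the leftover words
# together with their intervening separators.
demo_leftover() {
    printf 'one two three four\n' | {
        read -r first second rest
        printf 'first=%s second=%s rest=%s\n' "$first" "$second" "$rest"
    }
}

# Fewer words than names: the remaining names are assigned empty values.
demo_missing() {
    printf 'one\n' | {
        read -r first second
        printf 'first=%s second=[%s]\n' "$first" "$second"
    }
}

demo_leftover   # prints: first=one second=two rest=three four
demo_missing    # prints: first=one second=[]
```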
Re: read fails on null-byte: v4.1.7 FreeBSD 8.0 (amd64)
Attached is a patch to discard null-bytes during read. It preserves the functionality Greg demonstrated (not sure if this is desirable ...) wherein a delim of '' (e.g. -d '') will split on null byte. With the patch, read functions this way:

bash-4.2$ printf 'foo\0bar\n' | while read line; do echo "$line"; done
foobar
bash-4.2$ printf 'foo\0bar\0' | while read -d '' line; do echo "$line"; done
foo
bar

I find this behavior incongruent with what I expect from setting things like IFS to the empty string (e.g. delim is every character), but it seems like it is already in use. I also have a patch that makes read terminate the input line after every character for -d '', and after a null-byte for -d '\0'. If you are interested in that functionality, I'll send that patch for your consideration as well.

Attachments:
0001-Strip-null-bytes-from-read-when-DELIM-is-not.patch.gz (git am patch for read builtin; GNU Zip compressed data)
0002-Update-documentation-both-man-and-info-to-reflect-re.patch.gz (git am patch for man page and texi; GNU Zip compressed data)

I have patches for the generated documentation, but they are quite large. If you want them, I'm happy to send them along as well.

cheers,
-matt

On Nov 24, 2011, at 12:08 AM, Chet Ramey wrote:

> On 11/23/11 9:44 PM, Matthew Story wrote:
>>
>> On Nov 23, 2011, at 7:09 PM, Chet Ramey wrote:
>>
>>> On 11/23/11 6:54 PM, Matthew Story wrote:
>>>> On Nov 23, 2011, at 4:47 PM, Chet Ramey wrote:
>>>>
>>>>> On 11/23/11 9:03 AM, Matthew Story wrote:
>>>>>> [... snip]
>>>
>>> Yes, sorry.  That's what the "bash treats the line read as a C string"
>>> was intended to imply.  Since the line read is a C string, the NUL
>>> terminates it and what remains is assigned to the named variables.  I
>>> should have used `line' in my explanation instead of `foo'.
>>
>> I understand that the underlying implementation of the bash builtins is
>> `C', and I understand that `C' strings are NUL terminated.  It seems
>> unreasonable to me to expect understanding of this implementation detail
>> when using bash to read streams into variables via the `read' builtin.
>
> I took a look around at Posix and some other shells.  Posix passes on
> the issue completely: the input to read may not contain NUL bytes at all.
> The Bourne shell, from v7 to SVR4.2, uses NUL as a line terminator.  Other
> shells, including ksh93, ash and pdksh derivatives like dash and mksh,
> discard NUL bytes in read.  zsh doesn't discard NULs and handles them
> pretty well, putting them into a variable's value.
>
> The discard behavior seems fairly standard, and I will look at putting it
> into the next version of bash.
>
> Chet
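The differences between shells surveyed in the quoted reply can be probed with a small sketch like the one below. It makes no claim about what any particular shell prints; it only reports how many characters end up in the variable when the input line is `foo', a NUL byte, then `bar'. The function name probe_read_nul and the category labels are hypothetical.

```shell
# Hypothetical probe: feed 'foo<NUL>bar\n' to read and print the length
# of the resulting value.
probe_read_nul() {
    printf 'foo\0bar\n' | {
        IFS= read -r line
        printf '%s\n' "${#line}"
    }
}

case "$(probe_read_nul)" in
    3) echo "value ends at the NUL (truncated, or NUL treated as a terminator)" ;;
    6) echo "NUL discarded" ;;
    7) echo "NUL kept in the value" ;;
    *) echo "unexpected result" ;;
esac
```

A length of 3 cannot by itself distinguish truncation from NUL acting as a line terminator; a second read in the same group would be needed to tell those two apart.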
Re: read fails on null-byte: v4.1.7 FreeBSD 8.0 (amd64)
On Nov 29, 2011, at 9:39 AM, Chet Ramey wrote:

> On 11/29/11 8:29 AM, Greg Wooledge wrote:
>
>> [...snip]
>
> It's possible to have both.  You can handle matching a NUL delimiter and
> skip NUL bytes in the input if the delimiter isn't NUL.

This is exactly the behavior that my patch provides, along with documentation that -d '' terminates the input line on '\0'.

> That allows the
> bash behavior and compatibility with other shells that don't provide `-d'.
>
> There is already `read -n 1' to read only a single character from the input
> stream.  I don't see much value in translating '\0' to 0 for `-d'.

Fair point.
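For reference, the `-d ''' usage being preserved in this exchange looks roughly like the sketch below. It assumes bash, since `-d' is not available in every shell, and the helper name print_nul_records is hypothetical.

```shell
# Assumes bash: read -d '' uses NUL as the record delimiter.
# Hypothetical helper: print each NUL-terminated record in brackets.
print_nul_records() {
    while IFS= read -r -d '' record; do
        printf '[%s]\n' "$record"
    done
}

printf 'foo\0bar baz\0' | print_nul_records
# prints:
# [foo]
# [bar baz]
```

IFS= and -r keep leading whitespace and backslashes in each record intact, which matters when the records are file names.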
Re: read fails on null-byte: v4.1.7 FreeBSD 8.0 (amd64)
Re-sending in text form instead of attached gzip ... as this seems to be the prevailing style on list ...

This patch discards null-bytes in the read builtin, and documents both the existing -d '' behavior and the expected discard behavior without it, for both the man and info pages. (run with patch -p1)

commit df4bdef6d6066beeac57cf36f54cff7bde8f5ea3
Author: Matthew Story
Date:   Mon Nov 28 22:51:59 2011 -0500

    Update documentation (both man and info) to reflect read NUL character behavior, and -d ''.

    Signed-off-by: Matthew Story

diff --git a/doc/bash.1 b/doc/bash.1
index 0ba4f8e..ef3f174 100644
--- a/doc/bash.1
+++ b/doc/bash.1
@@ -8284,6 +8284,8 @@ The characters in
 are used to split the line into words.
 The backslash character (\fB\e\fP) may be used to remove any special meaning
 for the next character read and for line continuation.
+Unless '' (an empty string) is supplied as an argument to the
+\fB\-d\fP option, NUL characters are stripped from input.
 Options, if supplied, have the following meanings:
 .RS
 .PD 0
@@ -8299,7 +8301,7 @@ Other \fIname\fP arguments are ignored.
 .TP
 .B \-d \fIdelim\fP
 The first character of \fIdelim\fP is used to terminate the input line,
-rather than newline.
+rather than newline. If '' (an empty string) is supplied as \fIdelim\fP, NUL characters are used to terminate the input line.
 .TP
 .B \-e
 If the standard input
diff --git a/doc/bashref.texi b/doc/bashref.texi
index b4fd8d3..36f61d3 100644
--- a/doc/bashref.texi
+++ b/doc/bashref.texi
@@ -3890,6 +3890,8 @@ variable @env{REPLY}.
 The return code is zero, unless end-of-file is encountered, @code{read}
 times out (in which case the return code is greater than 128), or an
 invalid file descriptor is supplied as the argument to @option{-u}.
+Unless '' (an empty string) is supplied as an argument to
+@option{-d}, NUL characters are stripped from input.
 
 Options, if supplied, have the following meanings:
 
@@ -3902,7 +3904,8 @@ Other @var{name} arguments are ignored.
 
 @item -d @var{delim}
 The first character of @var{delim} is used to terminate the input line,
-rather than newline.
+rather than newline. If '' (an empty string) is supplied as @var{delim},
+NUL characters are used to terminate the input line.
 
 @item -e
 Readline (@pxref{Command Line Editing}) is used to obtain the line.

commit 7b9ce5228c6483729d7823c853c9eadcc7721fb3
Author: Matthew Story
Date:   Mon Nov 28 22:50:33 2011 -0500

    Strip null-bytes from read when DELIM is not `'.
    Preserves what seems to be used behavior in read -d '' (split on NUL character)

    Signed-off-by: Matthew Story

diff --git a/builtins/read.def b/builtins/read.def
index 1b87faa..47074bb 100644
--- a/builtins/read.def
+++ b/builtins/read.def
@@ -563,6 +563,8 @@ read_builtin (list)
           saw_escape++;
           input_string[i++] = CTLESC;
         }
+      if (c == '\0') /* drop literal NUL if delim is not '' */
+        continue;
 add_char:
       input_string[i++] = c;