Re: Bash scripting and large files: input with the read builtin from a redirection gives unexpected result with files larger than 2GB.
On 3/2/12 6:47 AM, Jean-François Gagné wrote:
> uname output: Linux 2.6.32-5-amd64 #1 SMP Tue Jun 14 09:42:28 UTC
> 2011 x86_64 GNU/Linux
> Machine Type: x86_64-pc-linux-gnu
>
> Bash Version: 4.1
> Patch Level: 5
> Release Status: release
>
> Description:
> When reading data with the 'read' builtin from a redirection, read has
> unexpected behavior after reading 2G of data.
>
> Repeat-By:
>
> yes "0123456789abcdefghijklmnopqrs" | head -n 1 > file
> while read line; do file=${line:0:10}; echo $file; done < file | uniq -c
>
> results in
>
> 71582790 0123456789
> 1 mnopqrs
> 3 0123456789
> 1 mnopqrs
> 3 0123456789
> 1 mnopqrs
> 3 0123456789
> 1 mnopqrs
> 3 0123456789
> ...
>
> So the problem happens after reading 71.582.790 x30 = 2.147.483.700 bytes of
> data, just a little over 2^31.

Compile and run the attached program. If it prints out `4', which it does
on all of the Debian systems I've tried, file offsets are limited to 32
bits, and accessing files greater than 2 GB is going to be unreliable.

Chet
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/

#include <sys/types.h>
#include <stdio.h>
#include <stdlib.h>

/* Print the size of off_t in bytes: 4 means 32-bit file offsets. */
int
main(int c, char **v)
{
  printf("%d\n", (int)sizeof(off_t));
  exit(0);
}
Re: Bash scripting and large files: input with the read builtin from a redirection gives unexpected result with files larger than 2GB.
Hi Chet,

Chet Ramey wrote:
> Compile and run the attached program. If it prints out `4', which it does
> on all of the Debian systems I've tried, file offsets are limited to 32
> bits, and accessing files greater than 2 GB is going to be unreliable.

off_t is typedef'd to off64_t if you compile with
-D_FILE_OFFSET_BITS=64. The AC_SYS_LARGEFILE autoconf macro is
supposed to cause bash to be built with -D_FILE_OFFSET_BITS=64 (and
seems to work that way here, based on the result of
"eu-readelf -s /bin/bash | grep open"). Jean-François is using a
64-bit architecture, where _FILE_OFFSET_BITS=64 is the default, anyway.

So I fear the problem is somewhere else. (Maybe an "int" is used as
an offset somewhere.)

Hope that helps,
Jonathan
Re: Bash scripting and large files: input with the read builtin from a redirection gives unexpected result with files larger than 2GB.
Chet Ramey wrote:
> Jean-François Gagné wrote:
> > uname output: Linux 2.6.32-5-amd64 #1 SMP Tue Jun 14 09:42:28 UTC
> > 2011 x86_64 GNU/Linux
> > Machine Type: x86_64-pc-linux-gnu
>
> Compile and run the attached program. If it prints out `4', which it does
> on all of the Debian systems I've tried, file offsets are limited to 32
> bits, and accessing files greater than 2 GB is going to be unreliable.

Apparently all of the Debian systems you have tried are 32-bit
systems. On the reporter's 64-bit amd64 system it will print out 8.

Additionally, the bash configure script includes the AC_SYS_LARGEFILE
macro, which tests whether the system can use large files and, if it
can, defines _FILE_OFFSET_BITS=64. In that case the size of off_t will
be 8 bytes too. If you compile the test program with
-D_FILE_OFFSET_BITS=64 the result will also be 8 even on 32-bit
systems. By default the 32-bit bash will be large file aware on all
systems that support large files and will have been compiled with
_FILE_OFFSET_BITS=64.

I just looked in the config.log from a build of bash and it included
these lines:

  configure:4710: checking for special C compiler options needed for large files
  configure:4805: result: no
  configure:4811: checking for _FILE_OFFSET_BITS value needed for large files
  ...
  configure:4922: result: 64
  ...
  ac_cv_sys_file_offset_bits=64

Bob
Re: Bash scripting and large files: input with the read builtin from a redirection gives unexpected result with files larger than 2GB.
Bob Proulx writes:

> Chet Ramey wrote:
>> Jean-François Gagné wrote:
>> > uname output: Linux 2.6.32-5-amd64 #1 SMP Tue Jun 14 09:42:28 UTC
>> > 2011 x86_64 GNU/Linux
>> > Machine Type: x86_64-pc-linux-gnu
>>
>> Compile and run the attached program. If it prints out `4', which it does
>> on all of the Debian systems I've tried, file offsets are limited to 32
>> bits, and accessing files greater than 2 GB is going to be unreliable.
>
> Apparently all of the Debian systems you have tried are 32-bit
> systems. On the reporter's 64-bit amd64 system it will print out 8.

But it won't help if you don't use it.

diff --git a/lib/sh/zread.c b/lib/sh/zread.c
index 0fd1199..3731a41 100644
--- a/lib/sh/zread.c
+++ b/lib/sh/zread.c
@@ -161,7 +161,7 @@ zsyncfd (fd)
      int fd;
 {
   off_t off;
-  int r;
+  off_t r;
 
   off = lused - lind;
   r = 0;

Andreas.

--
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."
Re: Bash scripting and large files: input with the read builtin from a redirection gives unexpected result with files larger than 2GB.
On 3/4/12 3:51 PM, Andreas Schwab wrote:
> Bob Proulx writes:
>
>> Chet Ramey wrote:
>>> Jean-François Gagné wrote:
>>>> uname output: Linux 2.6.32-5-amd64 #1 SMP Tue Jun 14 09:42:28 UTC
>>>> 2011 x86_64 GNU/Linux
>>>> Machine Type: x86_64-pc-linux-gnu
>>>
>>> Compile and run the attached program. If it prints out `4', which it does
>>> on all of the Debian systems I've tried, file offsets are limited to 32
>>> bits, and accessing files greater than 2 GB is going to be unreliable.
>>
>> Apparently all of the Debian systems you have tried are 32-bit
>> systems. On the reporter's 64-bit amd64 system it will print out 8.
>
> But it won't help if you don't use it.
>
> diff --git a/lib/sh/zread.c b/lib/sh/zread.c
> index 0fd1199..3731a41 100644
> --- a/lib/sh/zread.c
> +++ b/lib/sh/zread.c
> @@ -161,7 +161,7 @@ zsyncfd (fd)
>       int fd;
>  {
>    off_t off;
> -  int r;
> +  off_t r;
>
>    off = lused - lind;
>    r = 0;

That's true, and I made this change some months ago. The question is
whether or not it makes a real difference, since the only use of that
variable is to check whether the return value from lseek is -1. I
suppose under the right set of circumstances it could.

Chet
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/
Re: Bash scripting and large files: input with the read builtin from a redirection gives unexpected result with files larger than 2GB.
On 3/4/12 4:32 PM, Chet Ramey wrote:
> That's true, and I made this change some months ago. The question is
> whether or not it makes a real difference, since the only use of that
> variable is to check whether the return value from lseek is -1. I
> suppose under the right set of circumstances it could.

Sorry, I forgot to attach the patch.

Chet
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/

*** ../bash-4.2-patched/lib/sh/zread.c	Mon Mar  2 08:54:45 2009
--- lib/sh/zread.c	Thu Jul 28 18:16:53 2011
***************
*** 161,166 ****
       int fd;
  {
!   off_t off;
!   int r;
  
    off = lused - lind;
--- 161,165 ----
       int fd;
  {
!   off_t off, r;
  
    off = lused - lind;
***************
*** 169,173 ****
      r = lseek (fd, -off, SEEK_CUR);
  
!   if (r >= 0)
      lused = lind = 0;
  }
--- 168,172 ----
      r = lseek (fd, -off, SEEK_CUR);
  
!   if (r != -1)
      lused = lind = 0;
  }
Re: Bash scripting and large files: input with the read builtin from a redirection gives unexpected result with files larger than 2GB.
Chet Ramey writes:

> That's true, and I made this change some months ago. The question is
> whether or not it makes a real difference,

Of course it does, int != off_t.

Andreas.

--
Andreas Schwab, sch...@linux-m68k.org
GPG Key fingerprint = 58CA 54C7 6D53 942B 1756  01D3 44D5 214B 8276 4ED5
"And now for something completely different."
Re: Bash scripting and large files: input with the read builtin from a redirection gives unexpected result with files larger than 2GB.
On 3/2/12 6:47 AM, Jean-François Gagné wrote:
> Description:
> When reading data with the 'read' builtin from a redirection, read has
> unexpected behavior after reading 2G of data.
>
> Repeat-By:
>
> yes "0123456789abcdefghijklmnopqrs" | head -n 1 > file
> while read line; do file=${line:0:10}; echo $file; done < file | uniq -c
>
> results in
>
> 71582790 0123456789
> 1 mnopqrs
> 3 0123456789
> 1 mnopqrs
> 3 0123456789
> 1 mnopqrs
> 3 0123456789
> 1 mnopqrs
> 3 0123456789
> ...
>
> So the problem happens after reading 71.582.790 x30 = 2.147.483.700 bytes of
> data, just a little over 2^31.
>
> but the following:
>
> cat file | while read line; do file=${line:0:10}; echo $file; done | uniq -c
>
> works fine:
>
> 1 0123456789

This works fine with the patch I posted.

Chet
--
``The lyf so short, the craft so long to lerne.'' - Chaucer
``Ars longa, vita brevis'' - Hippocrates
Chet Ramey, ITS, CWRU    c...@case.edu    http://cnswww.cns.cwru.edu/~chet/