On Thu, Apr 28, 2022 at 07:26:19PM +0400, Alexey via Bug reports for the GNU Bourne Again SHell wrote:
> Hello.
>
> I promised you more examples, and here they are:
> A very common case is building a list of files for further processing:
> declare -a FILES
> #1
> FILES=(); time readarray -t FILES <<<"$(find "$d" -xdev -maxdepth 5 -type f)"
> #2
> # <<< acts as a tmp file (because the result is bigger than 64K)
> FILES=(); time while read -r f; do FILES+=("$f"); done <<<"$(find / -xdev -maxdepth 5 -type f)"
> #3
> FILES=(); time while read -r f; do FILES+=("$f"); done < <(find / -xdev -maxdepth 5 -type f)
>
> From these examples we can see that:
> - example #1 is approximately 2 times faster than example #2, and 4 times faster than example #3.
> - to be more honest, the first example should be followed by at least an empty loop: for f in "${FILES[@]}"; do :; done
>   after such a modification, example #2 becomes comparable with example #1

I have a few comments about these examples.

First, when it comes to performance, it should be expected that the main time bottleneck will be the find command, which has to search through the file system, not the shell reading its results. Your first two examples both run the find command first, wait for it to finish, and then dump its results into a temp file for the shell loop to read. The total time required will be the time spent doing the find, plus the time spent populating the array in the shell. Your third example is the only one which runs the two processes simultaneously: the shell loop populates the array while the find command is still running. In a typical case, I would expect them both to terminate at about the same moment.

(In your benchmarking, you ran the find command multiple times, which would have allowed the kernel to read a bunch of file system metadata into memory. That artificially lowers the time used by find in your testing, compared to how such a script would behave in real life.)

Second, none of your examples work with arbitrary filenames, which may contain newline characters. The solution to that is to use find -print0 and to read the NUL-delimited stream in the shell. In your first two examples, this is not possible: the command substitution will discard the NUL bytes from the stream (with or without a warning, depending on the bash version). Your third example can easily be extended to support NULs, which makes it the best choice in terms of correctness.

Finally, I'm a little bit surprised that you omitted the obvious fourth example, readarray < <(find). You've already observed that readarray is faster than a while read loop (comparing #1 to #2), so why are you intentionally crippling the process substitution variant (#3) by forcing it to use the slower loop? In bash 4.4 or newer, readarray can also take a -d '' option to read a NUL-delimited stream. So, unless you're supporting older bashes, the best choice in terms of correctness *and* speed should be:

files=()
time readarray -d '' files < <(find / -xdev -maxdepth 5 -type f -print0)

(Obviously this depends on GNU/BSD find with its -print0 option, but since you're using -maxdepth, which is *also* nonstandard, -print0 should be available on your platform.)

> Also there is a problem that we can't use `mapfile -t <<<"$()"' as
> equivalent to `mapfile -t < <()', because here-string appends a newline,
> so MAPFILE will have one empty element instead of no elements in case of
> an empty subshell result.

But the command substitution *removed* the trailing newline first. In the case where no filenames contain newlines, you're removing one newline and adding one newline, so the stream remains unchanged. Demonstration:

unicorn:~$ mapfile -t f < <(find .profile .bashrc -print); echo "${#f[@]}"
2
unicorn:~$ mapfile -t f <<<"$(find .profile .bashrc -print)"; echo "${#f[@]}"
2

Of course, they stop being equivalent when you switch to -print0.

> Bash could do 4096b read() to some internal buffer related to the
> file descriptor and have an emulated lseek() within that buffer.

That would only fix the case where the rest of the input is supposed to be processed by bash. It would *not* fix the common case where bash reads a little bit of the stream, and then executes another program to read the remainder of the stream. For that second program, some of its data would already have been consumed.
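
To make that concrete, here is a minimal sketch of the case I mean (the header/cat split is only an illustration I made up, not something from your script):

# Illustrative sketch only: bash consumes the first line of a pipe,
# then an external program must read everything that follows it.
printf 'header\nline1\nline2\n' |
{
    IFS= read -r header   # bash itself reads only up to the first newline
    cat                   # cat should receive line1 and line2, nothing less
}

Today this works because, on a non-seekable fd, the read builtin consumes one byte at a time, so cat inherits the pipe positioned right after "header". If bash instead slurped 4096 bytes into a private buffer, those bytes would already be gone from the pipe, and an emulated lseek() inside bash's own buffer couldn't hand them back to cat.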