On Sat, Sep 18, 2010 at 09:16:46PM -0500, Peng Yu wrote:
> Hi,
>
> stat --printf "%y %n\n" `find . -type f -print`
Chris and Pierre already helped with this specific example.  I'd like
to address the more general case.

In the original design of the Unix shell, in many ways and places, it's
quite apparent that the designers never really intended to handle
filenames that contain whitespace.  Things like your stat `find . -print`
example look like they ought to work, but they don't -- precisely
because the shell's word-splitting operates on whitespace, while
filenames are ALLOWED to contain whitespace.  There is an obstacle
here, and there is NO WAY to overcome it.  The only solution is an
entirely different approach -- thus the alternatives such as

  find . -exec stat {} +

which Chris and Pierre have already provided.

To clarify the problem: when you write `...` or $(...) you produce a
single string which is the entire output all shoved together
("serialized" is the fancy word for it).  The shell takes this single
string and tries to break it back apart into meaningful chunks (word
splitting).  But with serialized filenames, there is no way to tell
where one filename ends and the next begins.  If you see the string
"foo bar", you can't tell whether that's one filename with a space in
the middle, or two filenames with a space between them.  Likewise,
newlines are allowed in filenames: if you see the string "foo\nbar"
where \n is a newline, you can't tell whether it's one filename or two.

The only character that is NOT allowed in a Unix filename is NUL
(ASCII 0).  So if you have a serialized stream of filenames
"foo\0bar\0", you know that there are two filenames, and the NUL (\0)
bytes tell you where they end.  That's wonderful if you're reading from
a stream or a file.  But it doesn't help you with command substitution
(`...` or $(...)), because you can't work with NUL bytes in the shell.
The command substitution goes into a C string in memory, and when you
try to read back the contents of that C string, you stop at the first
NUL, because that's what NUL means in a C string -- "end of string".

Bash and ksh actually handle this differently, but neither one will do
what your example was trying to do.  In bash, the NUL bytes are
stripped away entirely:

  arc3:/tmp/foo$ touch foo bar
  arc3:/tmp/foo$ echo "$(find . -print0)"
  ../foo./bar
  arc3:/tmp/foo$ echo "$(find . -print0)" | hd
  00000000  2e 2e 2f 66 6f 6f 2e 2f  62 61 72 0a              |../foo./bar.|
  0000000c

In ksh, the NUL bytes are retained, and thus you get the behavior I
described above (stopping at the first one):

  arc3:/tmp/foo$ ksh -c 'echo "$(find . -print0)"'
  .

Thus, $(find ...) is never going to be useful, in either shell, under
any circumstances.  It simply cannot produce correct output when
operating on arbitrary real-world filenames.  If you want to work with
find, you must throw away command substitution entirely.  This is
regrettable, because it would be extremely convenient to do something
like vi `find . -name '*.c'`, but you simply can't.

So, what does that leave you?

 * You can use -exec, or
 * You can read the output of find ... -print0 as a stream.

You've already seen one example of -exec.  When using -exec, the find
command (which is external, not part of the shell) is told to execute
yet another external command for each file that it matches (or for
clumps of matched filenames, when using the newer + terminator).
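For instance, assuming the GNU stat your original command used, the
-exec version of it would look something like this:

  # Batches of matched filenames go straight from find into stat's
  # argv; the shell never word-splits them.
  find . -type f -exec stat --printf '%y %n\n' {} +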
The disadvantage of -exec is that if you wanted to do something within
your shell (putting the filenames into an array, incrementing a counter
variable, etc.), you can't -- you're already two processes removed from
your shell.  Likewise, you can't -exec a shell function that you wrote.
You would have to use a separate script, or write out the shell code in
quotes and call -exec sh -c '....' (there's a sketch of that at the end
of this message).

If you want to work on filenames recursively within your script, you
will almost always end up using the following idiom, because all the
alternatives are ruled out one way or another:

  while IFS= read -r -d '' filename; do
    ...
  done < <(find ... -print0)

This uses two bash features (process substitution, and read -d ''), so
it's extremely non-portable.  The obvious pipeline alternative
(find ... -print0 | while read ...) is ruled out because the while read
occurs in a subshell, and thus any variables set by the subshell are
lost after the loop; there's a short demonstration of that at the end
of this message.  The read -d '' is a special trick that tells bash's
read command to stop at each NUL byte instead of each newline.  The
output of find is never put into a string in memory (as it is with
command substitution), so the problems we had when trying to work with
NULs in a command substitution don't apply here.

For example, if we wanted to do vi `find . -name '*.c'` but actually
have it WORK in the general case, we end up needing this monstrosity:

  unset array
  while IFS= read -r -d '' f; do array+=("$f"); done \
    < <(find . -name '*.c' -print0)
  vi "${array[@]}"

... which uses three bash extensions and one BSD/GNU extension.  To the
best of my knowledge, the task is completely impossible in strict POSIX.
(You can work around the -print0 by using -exec printf '%s\0' {} +, but
then there's no way to read the NUL-delimited stream, and no arrays to
put the filenames into, since you cannot set positional parameters
individually.)
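For completeness, the -exec sh -c form mentioned above looks roughly
like this.  The loop body is purely illustrative -- the grep is a
stand-in for whatever per-file shell code you actually need:

  # A child sh receives the filenames in "$@", so they are never
  # word-split.  The lone 'sh' after the quoted code becomes its $0.
  find . -name '*.c' -exec sh -c '
    for f do
      grep -l TODO "$f"
    done
  ' sh {} +

Note that this still runs in a child shell: it cannot set variables or
arrays in your own script, which is the limitation described above.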
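And here is the promised demonstration of why the pipeline form loses
your variables (the counter is just an example):

  count=0
  find . -type f -print0 |
  while IFS= read -r -d '' f; do
    count=$((count+1))
  done
  echo "$count"    # prints 0: count was incremented in a subshell

The < <(find ...) process substitution form keeps the loop in the
current shell, which is precisely why the idiom above uses it.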