why does errexit exist in its current utterly useless form?

matei . david Fri, 14 Dec 2012 15:10:12 -0800

I recently worked on a project involving many bash scripts, and I've been
trying to use errexit to stop various parts of a script as soon as anything
returns a non-zero return code. As it turns out, this is an utterly useless
endeavour. In asking this question on this forum, I hope somebody out there
who understands bash, POSIX, and why decisions were made to arrive at the
current situation can help me.


To recapitulate, errexit is turned on by "set -e" or "set -o errexit". This is 
what TFM says about it:

"Exit immediately if a pipeline (see Pipelines), which may consist of a single 
simple command (see Simple Commands), a subshell command enclosed in 
parentheses (see Command Grouping), or one of the commands executed as part of 
a command list enclosed by braces (see Command Grouping) returns a non-zero 
status. The shell does not exit if the command that fails is part of the 
command list immediately following a while or until keyword, part of the test 
in an if statement, part of any command executed in a && or || list except the 
command following the final && or ||, any command in a pipeline but the last, 
or if the command’s return status is being inverted with !. A trap on ERR, if 
set, is executed before the shell exits. This option applies to the shell 
environment and each subshell environment separately (see Command Execution 
Environment), and may cause subshells to exit before executing all the commands 
in the subshell."
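
To make the exemptions in that paragraph concrete, here is a minimal sketch that walks through each exempt context under "set -e". The counter name "survived" is purely illustrative and not part of the manual; the script reaches the last line only because none of the failing commands triggers an exit.

```shell
#!/usr/bin/env bash
set -e
survived=0   # illustrative counter, incremented after each exempt failure

false || true                # failing command in a || list, not the final one
survived=$((survived + 1))

if false; then :; fi         # failing command is the test of an if statement
survived=$((survived + 1))

while false; do :; done      # failing command follows the while keyword
survived=$((survived + 1))

! true                       # status inverted with !, so the result is non-zero
survived=$((survived + 1))

false | true                 # failing command is not the last in the pipeline
survived=$((survived + 1))

echo "exempt contexts survived: $survived"
```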

Let's leave pipelines aside, because they add more complexity to an already
messy problem. So we're talking just simple commands.

My initial gripe about errexit (and its man page description) is that the 
following doesn't behave as a newbie would expect it to:

set -e
f() {
  false
  echo "NO!!"
}
f || { echo "f failed" >&2; exit 1; }

Above, "false" usually stands for some complicated command, or part of a 
sequence of many commands, and "echo NO!!" stands for a statement releasing a 
lock, for instance. The newbie assumes that the lock won't be released unless 
executing f goes well. Moreover, the newbie likes many error messages, hence 
the extra message in the main script.

Running it, you get:

NO!!

First of all, f is called as the LHS of ||, so we don't want the entire shell
to crash if f returns non-zero. That much, a not entirely dumb newbie can
understand. But, lo and behold, NO!! gets printed. Do you see this explained in
TFM? Because I don't.

Question 1: Is this a bug in the manual itself?
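
For contrast, a minimal sketch that runs the same f twice in child shells: once as a plain statement and once as the LHS of ||. The variable names (script, plain, guarded) are illustrative, and the sketch assumes a bash executable is on PATH. Child shells are used so that neither case can terminate the script doing the comparison.

```shell
#!/usr/bin/env bash
# Same f as above, kept in a string so it can be fed to child shells.
script='set -e
f() {
  false
  echo "NO!!"
}
'

# Case 1: f is a plain statement, so errexit stops the child at "false".
plain=$(bash -c "${script}f" 2>&1) || true

# Case 2: f is the LHS of ||, so the failure of "false" is ignored inside f.
guarded=$(bash -c "${script}f || echo 'f failed'" 2>&1) || true

printf 'plain:   [%s]\n' "$plain"
printf 'guarded: [%s]\n' "$guarded"
```

With bash this should print an empty pair of brackets for the plain call and [NO!!] for the guarded one. "f failed" never appears in the second case, because f returns the status of its last command, the echo.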

As the hours of debugging pass by, the newbie learns about shells and subshells
and subprocesses and what not. Also, apparently, one can see the current
shell settings with $- or $SHELLOPTS. So the newbie changes f to:

f() {
  echo $-
  false
  echo "NO!!"
}

You get:

ehB
NO!!

This is now getting confusing: errexit seems to be active as bash executes f, 
but still it doesn't stop.

Question 2: Is this a bug in how $- is maintained by bash?

Next, the newbie thinks, oh, I'll just set errexit again inside f. How about:

f() {
  set -e
  echo $-
  false
  echo "NO!!"
}

You get:

ehB
NO!!

At this point, the newbie thinks, perhaps errexit isn't working after all.

Question 3: Under the current design (which I consider flawed), why doesn't 
bash at least print a warning that errexit is not active, when the user tries 
to set it?

As even more hours pass by, the newbie learns things about various other
shells, POSIX mode, standards, etc. Useful things, but arguably useless for the
task at hand. So, here is what I, the newbie, have gathered so far...

One can work around this by using && to connect all statements in f, or using 
"|| return 1" after each of them. This is ok if f is 2 lines, not if it's 200.
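
A minimal sketch of both workarounds, applied to the f from the first example. The second function, g, is a hypothetical twin of f introduced only to show the && variant.

```shell
#!/usr/bin/env bash
set -e

# Variant 1: bail out explicitly after each command that may fail.
f() {
  false || return 1
  echo "NO!!"
}

# Variant 2: chain the commands with && so the first failure ends the chain.
g() {
  false &&
  echo "NO!!"
}

f || echo "f failed" >&2
g || echo "g failed" >&2
```

Neither function prints NO!!, and both failure messages appear on stderr. The cost is exactly the one noted above: every statement in the function has to be touched.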

I also learned one can actually write a tiny function which tests if the ERR
signal is active, and if it is not, executes the invoking function (f) in a
different shell, passing the entire environment, including function defs, with
typeset. This is really awkward, but possible. However, it only works for
functions, not for command lists run in a subshell, as in:

( false; echo "NO!!" ) || { echo "failed" >&2; exit 1; }
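
The function-based workaround mentioned above might look roughly like the following. This is a simplified sketch, not the exact function I came across: run_strict is a hypothetical name, it always re-executes instead of first testing whether errexit is in effect, it copies only the one named function rather than the entire environment, and it assumes a bash executable is on PATH.

```shell
#!/usr/bin/env bash
set -e

f() {
  false
  echo "NO!!"
}

# run_strict (hypothetical helper): re-run the named function in a fresh bash
# process where it is called as a plain statement, so errexit applies to it.
run_strict() {
  local defs
  defs=$(typeset -f "$1") || return 1   # source text of the function
  bash -c "set -e
$defs
\"\$@\"" run_strict "$@"
}

run_strict f || echo "f failed" >&2
```

Here the child shell stops at "false", so NO!! is not printed and the caller reports "f failed". Anything else f depends on (other functions, unexported variables) would have to be passed along as well, which is where the awkwardness comes from.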

The common suggestion I see on various forums is: don't use errexit. I now
understand this from a user perspective, and that's why I call errexit "utterly
useless" in the subject. But, if I may ask, why is bash in this position?

Question 4: Back to the original f, why did bash (or even POSIX) decide to 
protect the "false" statement? Protecting f is clearly necessary, for otherwise 
|| would be useless. But why the "false"?

Question 4a (perhaps the same): TFM says: "the shell does not exit if the 
command that fails is part of the command list immediately following a while". 
Why protect commands in such a list other than the last?

And independent of the question(s) 4, the last one:

Question 5: Even assuming bash/POSIX decides to protect lists of commands where 
only the last is tested, why does bash entirely disable the errexit option?  
Playing around with it, it seems to me that bash completely disables the ERR 
signal itself, rather than changing the default action/trap for it. So why not 
change the default action to "ignore", but allow the user as in the 3rd example 
above to re-enable errexit explicitly? This way, it seems like there already is 
all the machinery available to detect errors, but instead of it being turned 
off, bash chooses to make it completely unavailable.

Note, if this is an issue about f changing the outer shell settings, why not
allow this in subshells? I would gladly enclose all of f in parentheses, as in:

f() {
(
  false
  echo "NO!!"
)
}

If only it made a difference, but currently it doesn't.
