I recently worked on a project involving many bash scripts, and I've been trying to use errexit to stop various parts of a script as soon as anything returns a non-zero return code. As it turns out, this is an utterly useless endeavour. I'm asking here in the hope that somebody who understands bash, POSIX, and the decisions that led to the current situation can help me.
To recapitulate, errexit is turned on by "set -e" or "set -o errexit". This is what TFM says about it:

"Exit immediately if a pipeline (see Pipelines), which may consist of a single simple command (see Simple Commands), a subshell command enclosed in parentheses (see Command Grouping), or one of the commands executed as part of a command list enclosed by braces (see Command Grouping) returns a non-zero status. The shell does not exit if the command that fails is part of the command list immediately following a while or until keyword, part of the test in an if statement, part of any command executed in a && or || list except the command following the final && or ||, any command in a pipeline but the last, or if the command’s return status is being inverted with !. A trap on ERR, if set, is executed before the shell exits. This option applies to the shell environment and each subshell environment separately (see Command Execution Environment), and may cause subshells to exit before executing all the commands in the subshell."

Let's leave pipelines aside, because they add more complexity to an already messy problem. So we're talking just simple commands.

My initial gripe about errexit (and its man page description) is that the following doesn't behave as a newbie would expect it to:

    set -e
    f() {
        false
        echo "NO!!"
    }
    f || { echo "f failed" >&2; exit 1; }

Above, "false" usually stands for some complicated command, or part of a sequence of many commands, and "echo NO!!" stands for a statement releasing a lock, for instance. The newbie assumes that the lock won't be released unless executing f goes well. Moreover, the newbie likes many error messages, hence the extra message in the main script.

Running it, you get:

    NO!!

First of all, f is called as the LHS of ||, so we don't want the entire shell to crash if f returns non-zero. That much, a not entirely dumb newbie can understand. But, lo and behold, NO!! gets printed. Do you see this explained in TFM? Because I don't.
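
For anyone who wants to reproduce the contrast, here is a minimal sketch that runs the same f twice in a fresh bash, once as a plain command and once on the left-hand side of ||. The variable names (prelude, plain, plain_rc, guarded) are made up for this illustration, and it assumes a bash binary on PATH.

```shell
#!/usr/bin/env bash
# Shared script prefix: errexit on, plus the f from the example above.
prelude='set -e
f() {
    false
    echo "NO!!"
}
'

# Case 1: f called as a plain command. errexit applies inside f, so the
# child shell stops at "false" and prints nothing.
plain_rc=0
plain=$(bash -c "${prelude}f; echo 'after f'") || plain_rc=$?

# Case 2: f called on the left-hand side of ||. errexit is not applied
# while f runs, so "false" is skipped over and the echo is reached.
guarded=$(bash -c "${prelude}f || echo 'f failed'")

printf 'plain call: output=[%s] status=%s\n' "$plain" "$plain_rc"
printf 'f || ...  : output=[%s]\n' "$guarded"
# plain call: output=[] status=1
# f || ...  : output=[NO!!]
```

Note that in the second case "f failed" is never printed either: f's return status is that of its last command, the echo, which succeeds.
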
Question 1: Is this omission a bug in the manual itself?

As the hours of debugging pass by, the newbie learns about shells and subshells and subprocesses and whatnot. Also, apparently, that one can see the current shell settings with $- or $SHELLOPTS. So the newbie changes f to:

    f() {
        echo $-
        false
        echo "NO!!"
    }

You get:

    ehB
    NO!!

This is now getting confusing: errexit seems to be active as bash executes f, but still it doesn't stop.

Question 2: Is this a bug in how $- is maintained by bash?

Next, the newbie thinks, oh, I'll just set errexit again inside f. How about:

    f() {
        set -e
        echo $-
        false
        echo "NO!!"
    }

You get:

    ehB
    NO!!

At this point, the newbie thinks, perhaps errexit isn't working after all.

Question 3: Under the current design (which I consider flawed), why doesn't bash at least print a warning that errexit is not active when the user tries to set it?

As even more hours pass by, the newbie learns things about various other shells, POSIX mode, standards, etc. Useful things, but arguably useless for the task at hand.

So, from what I, the newbie, have gathered so far: one can work around this by using && to connect all statements in f, or by using "|| return 1" after each of them. This is OK if f is 2 lines, not if it's 200. I also learned one can actually write a tiny function which tests if the ERR signal is active and, if it is not, executes the invoking function (f) in a different shell, passing the entire environment, including function definitions, with typeset. This is really awkward, but possible. However, it only works for functions, not for command lists run in a subshell, as in:

    ( false; echo "NO!!" ) || { echo "failed" >&2; exit 1; }

The common suggestion I see on various forums is: don't use errexit. I now understand this from a user perspective, and that's why I call errexit "utterly useless" in the subject. But, if I may ask, why is bash in this position?

Question 4: Back to the original f, why did bash (or even POSIX) decide to protect the "false" statement?
Protecting f is clearly necessary, for otherwise || would be useless. But why the "false"?

Question 4a (perhaps the same): TFM says: "the shell does not exit if the command that fails is part of the command list immediately following a while". Why protect commands in such a list other than the last?

And, independent of question 4, the last one:

Question 5: Even assuming bash/POSIX decides to protect lists of commands where only the last is tested, why does bash entirely disable the errexit option? Playing around with it, it seems to me that bash completely disables the ERR signal itself, rather than changing the default action/trap for it. So why not change the default action to "ignore", but allow the user, as in the 3rd example above, to re-enable errexit explicitly? As it stands, all the machinery to detect errors seems to be there already, but instead of merely turning it off, bash chooses to make it completely unavailable.

Note, if this is an issue about f changing the outer shell's settings, why not allow this in subshells? I would gladly enclose all of f in parentheses, as in:

    f() {
        (
            false
            echo "NO!!"
        )
    }

If only it made a difference, but currently it doesn't.
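
To make the workarounds concrete, here is a rough sketch of two of them side by side. The first is the "|| return 1" approach mentioned above. The second is an assumption on my part rather than something from the discussion so far: run the unmodified body in a subshell that is a plain command, not part of an || list, and inspect its exit status afterwards. The names f_chained, f_plain, and the out_*/rc_* variables are hypothetical.

```shell
#!/usr/bin/env bash
set -e

# Workaround 1: every step propagates its own failure explicitly.
f_chained() {
    false || return 1    # stands in for the complicated command
    echo "NO!!"          # stands in for releasing the lock
}

# Unmodified body, as in the original f.
f_plain() {
    false
    echo "NO!!"
}

# 1) Explicit returns work even on the left-hand side of ||.
rc_chained=0
out_chained=$(f_chained) || rc_chained=$?

# 2) Assumed alternative: call the body from a subshell that is a plain
#    command, so errexit is applied inside it, then look at the status.
set +e                           # keep the failing subshell from ending this script
out_plain=$( set -e; f_plain )   # stops at "false", never reaches the echo
rc_plain=$?
set -e

printf 'chained: output=[%s] status=%s\n' "$out_chained" "$rc_chained"
printf 'plain:   output=[%s] status=%s\n' "$out_plain" "$rc_plain"
# chained: output=[] status=1
# plain:   output=[] status=1
```

The second pattern only helps as long as the subshell itself is not placed on the left of || or inside an if test; as soon as it is, the behaviour complained about above comes back.
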