Hey. In parallel mode it can be very useful for client programs to know in what sequence xargs executed them. For example, to generate output files that are easily recombined so that output order exactly matches input order, one might like to do something like this:
    seq 1 9 | xargs -n2 -P2 sh -c 'echo $$ > out.$XARGS_SEQ'
    cat out.*

In serial (-P1) mode a client program could always maintain this state in the filesystem (e.g. by appending to a shared output file). In parallel mode it makes more sense for xargs itself to maintain and export this information than for clients to implement more complex machinery to reconstruct it. For example, a high-resolution time stamp read by clients at start-up approximately works, but is subject to a race condition: before the first child gets scheduled to read the time, xargs and the OS may have started the next child. In that rare case the child for later input reads an earlier time than the child for earlier input. There are contortions clients may use to ensure reliable sequencing, but it seems simpler overall for xargs to provide a reliable number.

An environment variable is one simple and natural method for exporting that number to client programs. E.g., GNU parallel has $PARALLEL_SEQ (as well as a {#} replacement string substituted into the command). Of course, one does not always need or want the complexity of GNU parallel.

The attached patch file should apply cleanly to the current findutils in version control. It implements and documents this new feature.

Besides whether maintainers find this feature useful enough to add (which I hope they do), there are four possible issues or points of disagreement: the name of the variable, zero padding, the environment overhead adjustment, and the range of the counter. The name is trivial. I motivate the other implementation choices below.

Filename generation by shells is usually lexicographically sorted. Padding makes "cat out.*" generally just "do the right thing" when the text matching "*" comes from $XARGS_SEQ values, as in my initial example. Padding is a non-critical choice, as users can either remove 0-padding by using, e.g., `echo $XARGS_SEQ | sed 's/\<0*//'`, or add it using, e.g., `printf %09d $XARGS_SEQ`.
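
The sorting point can be tried without a patched xargs. In the sketch below the loop stands in for xargs by generating the sequence values itself, so the file names and the job numbers 1, 2 and 10 are illustrative assumptions, not output of the patch. It also assumes `mktemp -d` is available.

```shell
#!/bin/sh
# Sketch only: the loop stands in for a patched xargs by generating the
# sequence numbers itself; the file names are illustrative.
LC_ALL=C; export LC_ALL          # pin collation so glob order is byte-wise
dir=$(mktemp -d) || exit 1

for n in 1 2 10; do
    padded=$(printf '%09d' "$n")       # same width as the patch's "%09u"
    echo "$n" > "$dir/padded.$padded"
    echo "$n" > "$dir/plain.$n"
done

# Globs expand in sorted (lexicographic) order, so only the padded names
# come back in numeric order.
padded_order=$(cat "$dir"/padded.* | paste -s -d ' ' -)
plain_order=$(cat "$dir"/plain.* | paste -s -d ' ' -)
echo "padded: $padded_order"    # padded: 1 2 10
echo "plain:  $plain_order"     # plain:  1 10 2

# Strip the padding before doing arithmetic: a leading 0 means octal in
# $(( )), so 000000010 would otherwise be read as 8.
stripped=$(printf '%s\n' 000000010 | sed 's/^0*//')
next=$((stripped + 1))
echo "next: $next"              # next: 11

rm -rf "$dir"
```

The last step is a reason a client might want the unpadded form: shell arithmetic treats a leading zero as an octal prefix, so padded values should be stripped before being used as numbers.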
(Numbers start at 1, so a lone 0 never occurs.) Personally, I think it easiest for xargs to pad and for clients to remove the 0s in the (I believe) rare cases when padding is undesired. So, my patch pads. One could also let users specify what they want with a new command-line option, but simply settling on one way or the other is probably adequate.

Padding to a fixed-size variable also simplifies adjusting the POSIX-compliance 2048 byte headroom calculation, which I also tried to update. It's a bit open to interpretation whether any adjustment is even needed, since setting the variable at all is not in POSIX and $XARGS_SEQ might conceptually "count" as a client program variable. However, my patch takes the more conservative interpretation, since the intent of the limit seems to be to let client programs assume they can use at least 2048 bytes more environment data over what xargs uses. So, the patch bumps the headroom by the amount used by XARGS_SEQ. Because we just use a flat 20 + 8 (20 bytes for the "XARGS_SEQ=" string with 9 digits and its terminating NUL, plus 8 for the environment pointer), we bump by slightly more than we use on CPU architectures with pointers smaller than 8 bytes. That is easy enough to fine-tune if adjustment precision is the only deal breaker, though.

One might also want to provide the variable only in parallel mode. That makes the adjustment slightly more work, and there are uses for XARGS_SEQ in serial mode (granted, those uses may have simple enough workarounds, unlike true parallel operation). It's probably simpler to explain to users that it is just always provided.

The fourth objection I can foresee at the moment is the use of only a 4 byte counter. At 100,000 jobs per second, 4 GigaJobs would overflow the counter in just under 12 hours. On the other hand, 4 GigaJobs is an awful lot of jobs. Combining 4 billion outputs (i.e. actually using the full range of the counter) is unlikely to be the way this feature is applied. On yet another hand, xargs has always largely been about avoiding system limit problems.
In any case, it's probably easy to change to 8 bytes if desired.
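
The overflow estimate can be checked with plain shell arithmetic. This is only a back-of-the-envelope sketch: the 100,000 jobs per second rate is the hypothetical figure from the text, and a shell with 64-bit arithmetic is assumed. It also works out when the `%09u` format used in the patch stops producing fixed-width values, which happens well before a 4 byte counter wraps.

```shell
#!/bin/sh
# Back-of-the-envelope check; assumes the shell does 64-bit arithmetic.
rate=100000                    # jobs per second, the figure used above

# A 4 byte unsigned counter has 2^32 distinct values.
wrap_secs=$((4294967296 / rate))
wrap_h=$((wrap_secs / 3600))
wrap_m=$((wrap_secs % 3600 / 60))
echo "4 byte counter wraps after about ${wrap_h}h ${wrap_m}m"

# "%09u" yields 9 digits up to 999999999; job 1000000000 needs a 10th.
wide_secs=$((1000000000 / rate))
wide_h=$((wide_secs / 3600))
wide_m=$((wide_secs % 3600 / 60))
echo "9 digit width is exceeded after about ${wide_h}h ${wide_m}m"
```

At that rate the counter wraps after roughly 11h 55m, while the 9 digit width is exceeded after roughly 2h 46m. Lexicographic sorting only matches numeric order while all names have equal width, so moving to an 8 byte counter would presumably also call for a wider pad and a matching change to the headroom figure.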
diff --git a/xargs/xargs.1 b/xargs/xargs.1
index ebcb1e0..b216c61 100644
--- a/xargs/xargs.1
+++ b/xargs/xargs.1
@@ -65,7 +65,10 @@ than there were items in the input. This will normally have significant
 performance benefits. Some commands can usefully be executed in
 parallel too; see the
 .B \-P
-option.
+option. To help commands form easily sequenced output file names, GNU
+.B xargs
+exports to commands an environment variable XARGS_SEQ containing the
+invocation number (padded with leading zeros).
 .P
 Because Unix filenames can contain blanks and newlines, this default
 behaviour is often problematic; filenames containing blanks
diff --git a/xargs/xargs.c b/xargs/xargs.c
index 5fd93cd..b634461 100644
--- a/xargs/xargs.c
+++ b/xargs/xargs.c
@@ -369,7 +369,7 @@ main (int argc, char **argv)
   int (*read_args) (void) = read_line;
   void (*act_on_init_result)(void) = noop;
   enum BC_INIT_STATUS bcstatus;
-  enum { XARGS_POSIX_HEADROOM = 2048u };
+  enum { XARGS_POSIX_HEADROOM = 2076u }; /* 2048 +28 for $XARGS_SEQ usage */
   struct sigaction sigact;
 
   if (argv[0])
@@ -1155,6 +1155,7 @@ prep_child_for_exec (void)
 static int
 xargs_do_exec (struct buildcmd_control *ctl, void *usercontext, int argc, char **argv)
 {
+  static unsigned int xargs_seq_num;
   pid_t child;
   int fd[2];
   int buf;
@@ -1164,6 +1165,7 @@ xargs_do_exec (struct buildcmd_control *ctl, void *usercontext, int argc, char *
   (void) argc;
   (void) usercontext;
 
+  xargs_seq_num++;      /* count command invocations for user */
   if (!query_before_executing || print_args (true))
     {
       if (proc_max)
@@ -1203,6 +1205,12 @@ xargs_do_exec (struct buildcmd_control *ctl, void *usercontext, int argc, char *
 
         case 0:         /* Child. */
           {
+            char env_var[24];   /* For <=9 digit numbers max is 10+10 bytes */
+            sprintf(env_var, "XARGS_SEQ=%09u", xargs_seq_num);
+            putenv(env_var);    /* export XARGS_SEQ; ENOMEM failure is
+                                 * conceivable, but neither failure nor falling
+                                 * back makes much sense in that situation. */
+
             close (fd[0]);
             child_error = EXIT_SUCCESS;
 