XARGS_SEQ code & docs patch

C Blake Sun, 28 Oct 2012 13:34:08 -0700

Hey.  So, in parallel mode it can be very useful for client programs
to know in what sequence xargs executed them.  For example, to generate
output files that are easily recombined so that output order is
guaranteed to match input order exactly, one might do something like
this:


    seq 1 9 | xargs -n2 -P2 sh -c 'echo "$@" > out.$XARGS_SEQ' sh
    cat out.*

In serial/-P1 mode a client program could always maintain this state
in the filesystem (e.g. sharing the output file with appending, etc.).
In parallel mode it makes more sense for xargs itself to maintain and
export this information than for clients to implement more complex
machinery to re-construct it.
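
A minimal sketch of that serial-mode workaround, which already works
with an unpatched xargs.  The "counter" file name, the out.* names and
the temporary directory are illustrative choices, not part of the
patch, and the approach is only safe with one job at a time:

```shell
#!/bin/sh
# Serial-mode workaround: keep the sequence number in a file.
# Only safe with one job at a time; "counter" and out.* are
# illustrative names, not part of the patch.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

seq 1 9 | xargs -n2 -P1 sh -c '
  n=$(cat counter 2>/dev/null || echo 0)   # last number used, 0 if none yet
  n=$((n + 1))
  echo "$n" > counter
  echo "$@" > "$(printf "out.%09d" "$n")"  # zero-padded like XARGS_SEQ
' sh

combined=$(cat out.*)   # the glob expands in sorted order
echo "$combined"
```

With -P2 two children could read the same counter value before either
writes it back, which is the extra machinery (locking or similar) that
clients would otherwise need.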

For example, a high-resolution time stamp read by clients at start-up
approximately works, but is subject to a race condition -- before the
first child gets scheduled to read the time, xargs and the OS may have
started the next child.  That rare case would result in the child for
later input reading an earlier time than the child for earlier input.
There are convolutions clients may use to ensure reliable sequencing,
but it seems simpler overall for xargs to provide a reliable number.

An environment variable is one simple and natural method for exporting
said number to client programs.  E.g., GNU parallel has a $PARALLEL_SEQ
(as well as a {#} replacement string that is substituted into the
command).  Of course, one does not always need or want the complexity
of GNU parallel.

The attached patch file should apply cleanly to the current findutils
in version control.  It implements and documents this new feature.

Besides whether maintainers find this feature useful enough to add
(which I hope they do), there are five possible issues or points
of possible disagreement: the name of the variable, zero padding,
the environment overhead adjustment, whether to provide the variable
only in parallel mode, and the range of the counter.
The name is trivial.  I motivate other implementation choices below.

Filename generation by shells is usually lexicographically sorted.
Padding makes "cat out.*" generally just "do the right thing" when
text matching "*" comes from $XARGS_SEQ values as in my initial example.
Padding is a non-critical choice as users can either remove 0-padding
by using, e.g., `echo $XARGS_SEQ | sed 's/\<0*//'`, or add it using,
e.g., `printf %09d $XARGS_SEQ`.  (Numbers start at 1.  So, lone 0s do
not occur.)  Personally, I think it easiest for xargs to pad and for
clients to remove the 0s in the (I believe) rare cases when padding is
undesired.  So, my patch pads.  One could also allow users to specify
what they want with a new command-line option, but simply settling on
one way or the other is probably adequate.
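
To make the sorting point concrete, here is a small sketch that runs
without the patch.  The plain.* and pad.* file names and the temporary
directory are made up for illustration, and seq_padded stands in for a
value xargs would export as $XARGS_SEQ:

```shell
#!/bin/sh
# Why zero padding matters for glob order, and how to convert.
workdir=$(mktemp -d) && cd "$workdir" || exit 1

# Unpadded names: lexicographic glob order puts 10 before 2.
for n in 1 2 10; do : > "plain.$n"; done
plain_order=$(echo plain.*)

# Padded names: glob order matches numeric order.
for n in 1 2 10; do : > "$(printf 'pad.%09d' "$n")"; done
pad_order=$(echo pad.*)

# Strip padding with parameter expansion instead of sed.
seq_padded=000000010              # stand-in for $XARGS_SEQ
zeros=${seq_padded%%[!0]*}        # the leading run of zeros
seq_plain=${seq_padded#"$zeros"}  # 10

# printf %d reads a leading 0 as octal, so strip before re-padding.
repadded=$(printf '%09d' "$seq_plain")

echo "$plain_order"
echo "$pad_order"
echo "$seq_plain $repadded"
```

The octal caveat is one more reason to strip the zeros before doing any
arithmetic on the value.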

Padding the variable to a fixed size also simplifies adjusting the
2048-byte headroom calculation for POSIX compliance, which I also tried
to update.
It's a bit open to interpretation whether any adjustment is even needed
since setting the variable at all is not in POSIX and $XARGS_SEQ might
conceptually "count" as a client program variable.  However, my patch
assumes one wants a more conservative interpretation since the intent
of the limit seems to be to let client programs assume they can use at
least 2048 bytes more environment data over what xargs uses.  So, the
patch bumps the headroom by the amount used by XARGS_SEQ.  Because we
just use a flat 20 + 8 (20 bytes for the string, 8 for its pointer), we
bump by slightly more than we use on CPU architectures with pointers
smaller than 8 bytes.  That is easy enough to fine-tune if adjustment
precision is the only deal breaker, though.
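
The arithmetic behind the 20 + 8 can be checked with a short sketch.
It assumes the 8 is one 8-byte environ pointer, as on typical 64-bit
systems, and the variable names are mine:

```shell
#!/bin/sh
# Reproduce the 20 + 8 headroom bump from the patch.
entry=$(printf 'XARGS_SEQ=%09u' 1)   # same format string the patch uses
str_bytes=$(( ${#entry} + 1 ))       # 19 characters plus the trailing NUL
ptr_bytes=8                          # assumed: one 8-byte environ pointer
bump=$(( str_bytes + ptr_bytes ))
headroom=$(( 2048 + bump ))
echo "$str_bytes + $ptr_bytes = $bump, headroom = $headroom"
```

On a system with 4-byte pointers the same sum would be 20 + 4, which is
the small over-adjustment mentioned above.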

One might also want to provide the variable only in parallel mode.
That would make the headroom adjustment slightly more work, and there
are uses for XARGS_SEQ in serial mode (granted, those uses may have
simple enough workarounds, unlike true parallel operation).  It's
probably simpler to explain to users that it is just always provided.

The last objection I can foresee at the moment might be using only a
4 byte counter.  At 100,000 jobs per second, 4 GigaJobs might overflow
the counter in just 12 hours.  On the other hand, 4 GigaJobs is an awful
lot of jobs.  Combining 4 billion outputs (i.e. actually using the full
range of the counter) is unlikely to be the way this feature is applied.
On yet another hand, xargs has always largely been about avoiding system
limit problems.  In any case, it's probably easy to change to 8 bytes
if desired.
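
A quick sketch of that estimate, using integer shell arithmetic (so the
hours figure is rounded down) and assuming the shell has 64-bit
arithmetic.  It also shows a side effect of the %09u width that comes
straight from how printf field widths work:

```shell
#!/bin/sh
# How long a 32-bit counter lasts at the rate quoted above.
jobs_per_sec=100000
range=4294967296                       # 2^32 distinct counter values
seconds=$(( range / jobs_per_sec ))    # integer division
hours=$(( seconds / 3600 ))
echo "$seconds s, about $hours h"

# The %09u width holds values up to 999999999; larger ones print 10 digits.
wide=$(printf '%09u' 1000000000)
echo "${#wide}"
```

So past 999,999,999 jobs the names grow to 10 digits and no longer sort
lexicographically after the 9-digit ones, well before the counter
itself wraps.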

diff --git a/xargs/xargs.1 b/xargs/xargs.1
index ebcb1e0..b216c61 100644
--- a/xargs/xargs.1
+++ b/xargs/xargs.1
@@ -65,7 +65,10 @@ than there were items in the input.  This will normally have
 significant performance benefits.  Some commands can usefully be
 executed in parallel too; see the
 .B \-P
-option.
+option.  To help commands form easily sequenced output file names, GNU
+.B xargs
+exports to commands an environment variable XARGS_SEQ containing the
+invocation number (padded with leading zeros).
 .P
 Because Unix filenames can contain blanks and newlines, this default
 behaviour is often problematic; filenames containing blanks
diff --git a/xargs/xargs.c b/xargs/xargs.c
index 5fd93cd..b634461 100644
--- a/xargs/xargs.c
+++ b/xargs/xargs.c
@@ -369,7 +369,7 @@ main (int argc, char **argv)
   int (*read_args) (void) = read_line;
   void (*act_on_init_result)(void) = noop;
   enum BC_INIT_STATUS bcstatus;
-  enum { XARGS_POSIX_HEADROOM = 2048u };
+  enum { XARGS_POSIX_HEADROOM = 2076u };  /* 2048 +28 for $XARGS_SEQ usage */
   struct sigaction sigact;
 
   if (argv[0])
@@ -1155,6 +1155,7 @@ prep_child_for_exec (void)
 static int
 xargs_do_exec (struct buildcmd_control *ctl, void *usercontext, int argc, char **argv)
 {
+  static unsigned int xargs_seq_num;
   pid_t child;
   int fd[2];
   int buf;
@@ -1164,6 +1165,7 @@ xargs_do_exec (struct buildcmd_control *ctl, void *usercontext, int argc, char *
   (void) argc;
   (void) usercontext;
 
+  xargs_seq_num++;              /* count command invocations for user */
   if (!query_before_executing || print_args (true))
     {
       if (proc_max)
@@ -1203,6 +1205,12 @@ xargs_do_exec (struct buildcmd_control *ctl, void *usercontext, int argc, char *
 
 	case 0:		/* Child.  */
 	  {
+            char env_var[24];   /* For <=9 digit numbers max is 10+10 bytes */
+            sprintf(env_var, "XARGS_SEQ=%09u", xargs_seq_num);
+            putenv(env_var);    /* export XARGS_SEQ; ENOMEM failure is
+                                 * conceivable, but neither failure nor falling
+                                 * back makes much sense in that situation. */
+
 	    close (fd[0]);
 	    child_error = EXIT_SUCCESS;
