Bug#638256: race condition causes many syslog messages: File a[[:xdigit:]]{13} is in wrong format - aborting

Martin Dorey Wed, 17 Aug 2011 17:15:25 -0700

Package: at
Version: 3.1.10.2
Severity: normal
Tags: patch

In my experience, when a file system fills up, there is often still room in
the inode table even though there is none left for file contents, so
unintentionally empty files become common.  If /var/spool/cron/atjobs ends up
containing a zero-length job file for a job due over an hour ago, then atd
sometimes goes nuts, logging an unbounded number of messages like:


2011-08-09T16:11:12-07:00 merc55rm atd[4194]: File a00afd014dc579 is in wrong format - aborting

Even when the original reason for the full file system is removed, atd can go
on to fill it again with gigabytes of syslog entries.

Google finds many similar complaints over the years:

http://goo.gl/rvdP6
http://ubuntuforums.org/archive/index.php/t-1575261.html
https://bugzilla.redhat.com/show_bug.cgi?id=718422
http://forums.fedoraforum.org/showthread.php?t=252412

http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=101919 may have been the cause
of an apparently corrupted at job.  What I'm focusing on here is the manifold
repetition of the error message, assuming a corrupted at job.

On one particular machine, I was able to reproduce this, most of the time, with
this test:

sudo su -
cd /var/spool/cron/atjobs
/etc/init.d/atd stop
touch a0000000000000
chmod +x a0000000000000
grep -w atd /var/log/syslog | wc -l
/etc/init.d/atd start
sleep 1
grep -w atd /var/log/syslog | wc -l
sleep 1
grep -w atd /var/log/syslog | wc -l

The number of atd messages in syslog rises for a while, then stabilizes.

I wasn't able to reliably reproduce the issue on a couple of other machines,
including the one from which I'm reporting the bug, unless I started atd from
within valgrind:

valgrind --trace-children=yes -q /etc/init.d/atd start

An strace extract, available on request, from the system where I could easily
reproduce the problem, showed that atd stabilizes once the child that logs
the "aborting" message exits before the main atd process goes to sleep.
If the parent wins the race and is already asleep when the child exits, then
the sleep is interrupted by SIGCHLD.  The parent then calls run_loop again,
long before next_invocation, and the cycle repeats.

My suggested fix goes back to sleep if we're woken before next_invocation,
rather than calling run_loop again straight away.
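
To show the idea in isolation, here is a minimal standalone sketch in C.  It
is not the atd source: sleep_until, on_sigchld and install_sigchld_handler are
names made up for this example, and term_signal is only a stand-in for atd's
flag of the same name.  It assumes, as on Linux, that sleep() returns early
when a handled signal such as SIGCHLD arrives.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/* Stand-in for atd's termination flag (assumption: set by a signal handler). */
static volatile sig_atomic_t term_signal = 0;

/* No-op handler: its only purpose is to make SIGCHLD interrupt sleep(). */
static void on_sigchld(int sig)
{
    (void)sig;
}

/* Install the handler without SA_RESTART. Returns 0 on success, -1 on error. */
static int install_sigchld_handler(void)
{
    struct sigaction sa;

    sa.sa_handler = on_sigchld;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    return sigaction(SIGCHLD, &sa, NULL);
}

/*
 * Sleep until the wall clock reaches `deadline`, going back to sleep for
 * the remaining time whenever a signal wakes us early.  Stops early only
 * if term_signal is set.  Returns how many times sleep() was called.
 */
static int sleep_until(time_t deadline)
{
    int calls = 0;
    time_t now = time(NULL);

    while (!term_signal && deadline > now) {
        sleep((unsigned int)(deadline - now));
        calls++;
        now = time(NULL);   /* re-read the clock; we may have woken early */
    }
    return calls;
}
```

With the handler installed, a child that exits while the parent is inside
sleep_until(time(NULL) + 3) wakes the parent early, but the loop re-reads the
clock and sleeps for the remainder instead of returning to the caller.  A
single "if (deadline > now) sleep(...)" would return as soon as the child
exited, which is the behaviour that lets the cycle described above repeat.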

-- System Information:
Debian Release: 5.0.8
  APT prefers oldstable
  APT policy: (500, 'oldstable')
Architecture: amd64 (x86_64)

Kernel: Linux 2.6.26-2-amd64 (SMP w/4 CPU cores)
Locale: LANG=en_US.UTF-8, LC_CTYPE=en_US.UTF-8 (charmap=UTF-8) (ignored: LC_ALL set to en_US.UTF-8)
Shell: /bin/sh linked to /bin/bash

Versions of packages at depends on:
ii  exim4                     4.69-9+lenny4  metapackage to ease Exim MTA (v4) 
ii  exim4-daemon-heavy [mail- 4.69-9+lenny4  Exim MTA (v4) daemon with extended
ii  libc6                     2.7-18lenny7   GNU C Library: Shared libraries
ii  libpam0g                  1.0.1-5+lenny1 Pluggable Authentication Modules l
ii  lsb-base                  3.2-20         Linux Standard Base 3.2 init scrip

at recommends no packages.

at suggests no packages.

-- no debconf information

--- /tmp/at-3.1.10.2/atd.c.orig 2011-08-16 19:14:41.000000000 -0700
+++ /tmp/at-3.1.10.2/atd.c      2011-08-16 19:15:54.000000000 -0700
@@ -792,11 +792,12 @@
 
     daemon_setup();
 
+    now = time(NULL);
     do {
-       now = time(NULL);
        next_invocation = run_loop();
-       if (next_invocation > now) {
+       while (!term_signal && next_invocation > now) {
            sleep(next_invocation - now);
+           now = time(NULL);
        }
     } while (!term_signal);
     daemon_cleanup();
